
Two Weeks Later: 22 New Models and a Triple 200/200 in MIR 2026

From February 5 to 20, 2026, we added 22 new models to the benchmark. In just 15 days we went from 99.5% to 100%: Gemini 3.1 Pro Preview arrives with 200/200, Qwen3.5 397B A17B breaks the open-weights ceiling in the global ranking, and MedGemma leaves an uncomfortable lesson about what "health specialization" really means.

Technical storytelling with new charts about the perfect tie, the time-based tiebreaker, and what happens to a benchmark once it hits the ceiling.

MedBench Team · February 20, 2026 · 10 min read
MIR 2026 · Benchmark · Gemini 3.1 · Qwen3.5 · Claude Opus 4.6 · Update

On February 5, 2026, we published "199 out of 200: AI Only Fails Once in MIR 2026". At the time, 199/200 looked like a reasonable ceiling: it was already better than any historical human score, and the exam (200 valid questions) does not leave much room.

Fifteen days later, that ceiling no longer exists.

Between February 5, 2026 and February 20, 2026, we incorporated 22 new models into the benchmark, and all 22 are already evaluated in MIR 2026 and in the global cumulative ranking.

The picture changes for two reasons:

  1. Performance reaches 200/200 (a perfect score).
  2. Once there is a perfect score, the problem stops being "who gets more right" and becomes "how do you compare those who tie."

1. Two Weeks in One Picture

Models added after February 5, 2026, grouped into Custom (ALMA/MIRI), Frontier, Specialized, and Long tail. Right-side label: MIR 2026 position.

This chart is the best summary. A fortnight with 22 additions can look like a routine release note, but in a benchmark "with a ceiling" (200 questions) it is something else: a push that changes what the ranking means.

What matters is not only that there are "more models," but that several land directly in the top tier: Gemini 3.1 Pro Preview arrives at 200/200 in MIR 2026, and Qwen3.5 397B A17B enters the global top 15.

This post is the story of that fortnight: what we saw, what we learned, and, above all, why the ranking changes nature once it runs out of headroom.


2. The Perfect Tie and the New Time-Based Tiebreaker

Today, the top of MIR 2026 looks like this:

  1. ALMA: 200/200
  2. MIRI: 200/200
  3. Gemini 3.1 Pro Preview: 200/200

The difference is the tiebreaker. When multiple models hit 200/200, we order them by time-to-ceiling (sync timestamp): first the one that got there earlier, then those that reached it later.

That prevents an obvious bias: a model released weeks later has a technological advantage over one evaluated earlier. If you do not penalize that delay, the ranking rewards "showing up late."

In this update, that time-based ordering places Gemini 3.1 Pro Preview behind ALMA and MIRI, even though it also reaches 100%.
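
For readers who want the rule made concrete, here is a minimal sketch of that ordering. The scores match the published table, but the sync timestamps below are placeholders, not the benchmark's internal data.

```python
from datetime import date

# Illustrative records: the 200/200 and 197/200 scores appear in the ranking,
# but these sync dates are placeholders, not the benchmark's real timestamps.
results = [
    {"model": "Gemini 3.1 Pro Preview", "score": 200, "synced": date(2026, 2, 20)},
    {"model": "ALMA", "score": 200, "synced": date(2026, 1, 10)},
    {"model": "MIRI", "score": 200, "synced": date(2026, 1, 25)},
    {"model": "Claude Opus 4.6", "score": 197, "synced": date(2026, 2, 12)},
]

# Primary key: score, descending. Tiebreaker: sync timestamp, ascending,
# so whoever reached the ceiling first stays ahead of later arrivals.
ranking = sorted(results, key=lambda r: (-r["score"], r["synced"]))

for position, r in enumerate(ranking, start=1):
    print(f"{position}. {r['model']}: {r['score']}/200")
```

With any timestamps that respect the publication order, this produces the podium above: ALMA, MIRI, then Gemini 3.1 Pro Preview, with non-perfect scores falling below the tie.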

We will not go deep into ALMA/MIRI here because they have their own analysis in "ALMA and MIRI: Agentic RAG", but it was important to keep them in view as the real reference for today's ceiling.


3. Gemini: The 3.1 Pro Jump and the Flash vs Pro Paradox

Three-year stack (MIR 2024, MIR 2025, MIR 2026) comparing the cumulative global ranking across Gemini 3 Flash, Gemini 3 Pro, and Gemini 3.1 Pro.

If we look first at the global cumulative ranking (the sum of MIR 2024, 2025, and 2026), the chart above compares the three Gemini models.

The read is more interesting than it looks. In "global cumulative" you are not rewarding a snapshot, but a trajectory: consistency across three exams. And there, for now, Flash still leads.

Now: in MIR 2026, the central fact of this fortnight is that Gemini 3.1 Pro Preview arrives with 200/200. That is: a new model lands that, by definition, cannot "improve" further on this exam.

Operational paradox: in MIR 2026, Flash keeps a better accuracy/cost ratio than Pro, and MedGemma remains far behind despite being health-specific.
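
To make the "accuracy/cost ratio" explicit, here is a hedged sketch of the computation. The score and cost figures below are placeholders chosen only to illustrate the shape of the comparison; they are not the benchmark's published numbers for Flash or Pro.

```python
# Correct answers per euro for one full MIR 2026 run.
# All figures below are placeholders, not published benchmark data.
def answers_per_euro(correct: int, run_cost_eur: float) -> float:
    return correct / run_cost_eur

placeholder_runs = {
    "Gemini 3 Flash": (198, 0.40),          # (correct answers, run cost in €)
    "Gemini 3.1 Pro Preview": (200, 3.50),
}

for name, (correct, cost) in placeholder_runs.items():
    print(f"{name}: {answers_per_euro(correct, cost):.0f} correct answers per €")
```

The point of the ratio is that a model can miss a couple of questions and still dominate on efficiency, which is exactly the Flash vs Pro pattern the paradox describes.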

There are two stories at once:

  1. The ceiling story: 3.1 Pro hits 200/200. Once you reach the maximum, the ranking no longer has resolution to distinguish "small improvements." That is why the time tiebreaker becomes necessary.
  2. The efficiency story: Flash wins the Flash vs Pro duel again in this benchmark, at a fraction of the cost. And that is not an accident: Gemini 3 Flash was explicitly launched as a model meant to push the "efficient frontier" (quality per latency/cost), not as a "smaller" version that resigns itself to losing.[1]

And there is an additional layer: Google positions 3.1 Pro as a reasoning jump for longer, harder tasks (including coding/agentic work). Part of that bet has already landed, even as a preview, inside developer tooling like GitHub Copilot.[2]

Also, the time gap is short: in Google's public records, Gemini 3 Pro Preview appears in November 2025, and Gemini 3.1 Pro is announced on February 19, 2026.[9]


4. Qwen3.5 397B A17B: A Hierarchy Shift in Open-Data

Top open-data models (Qwen, Meta, DeepSeek, Z.ai) in the global ranking. Qwen3.5 397B A17B leads this block at position #15.

If we exclude custom models (ALMA/MIRI) and look at the open-data/open-weights block, the most important move of this fortnight is Qwen3.5 397B A17B taking the lead of that block at #15 in the global ranking.

This jump is not cosmetic. It is a signal that the Qwen3.5 family is pushing a new phase for open-weights: not only "very good per euro," but capable of competing in the top tier of cumulative accuracy. Historically, that was cathedral territory: a top tier reserved for closed, proprietary models.

All 303 models evaluated on MIR 2026 by launch date. Each dot is a model: red = proprietary, blue = open weights, green = open source (OSI); the shaded band marks the update window (Feb 5-20, 2026). More recent models tend to achieve higher net scores, but proprietary models maintain the upper edge.

The vertical band (from Feb 5 to Feb 20, 2026) is the fortnight of this post. It shows what matters: it is not "one model that rises"; it is a whole band of additions that drops, all at once, into a zone where there used to be few points.
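
As an aside on how a chart like this is put together, here is a minimal matplotlib sketch, assuming each model is reduced to a (launch date, net score, license) record. The three records are made-up examples; the real chart uses all 303 evaluated models.

```python
from datetime import date
import matplotlib.pyplot as plt

# Made-up example records; the real chart plots all 303 evaluated models.
records = [
    (date(2025, 11, 18), 182.0, "proprietary"),
    (date(2026, 2, 16), 189.5, "open-weights"),
    (date(2026, 2, 19), 195.0, "proprietary"),
]

colors = {"proprietary": "red", "open-weights": "blue", "open-source": "green"}
for launch, net_score, license_type in records:
    plt.scatter(launch, net_score, color=colors[license_type])

# Shade the update window discussed in this post (Feb 5-20, 2026).
plt.axvspan(date(2026, 2, 5), date(2026, 2, 20), color="gray", alpha=0.15)

plt.xlabel("Launch date")
plt.ylabel("MIR 2026 net score")
plt.show()
```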

And it is not a single isolated model. Qwen3.5 Plus also lands strong (#52 globally), while older Qwen models keep populating the open top tier.[3]

Two context notes, without requiring deep background:

  • Qwen3.5 is presented as an agentic-first family and, in its largest model, publishes details like long context (262k tokens) and a default "thinking mode," a pattern we are now seeing repeat across multiple frontier families.[3]
  • Strategically, the release fits the broader move by Chinese labs toward open-weights as "platform": opening weights to accelerate ecosystem, while keeping training as the durable competitive advantage.[4]

5. MedGemma: The Case That Forces Honesty

There is a recurring temptation in medical AI: to assume that "vertical" equals "better." That is why the model with the strongest narrative pull was MedGemma.

Current results: MedGemma lands at 172/200 in MIR 2026.

It is not a bad absolute score: 172/200 is still respectable. But it is clearly low for what the name suggests in a MIR benchmark.

And here is the uncomfortable lesson: declared specialization is not measured specialization. A model can be trained for biomedical domains and still perform worse on a MIR-like exam, because MIR is not "just medicine." It is medicine in Spanish, in MCQ format, with exam-style traps, and with a very specific topic distribution.

External context: MedGemma was presented as a health-oriented model family, built on Gemma and trained/evaluated on specific medical tasks (text and, depending on the variant, multimodal). That strategic move matters: "opening" a medical model that can run locally is an important step for research and for sensitive deployments.[5]

But the benchmark is an unforgiving judge: in this first competitive snapshot of MIR, MedGemma sits far from the SOTA frontier.


6. Claude Opus 4.6: Global Improvement, MIR 2026 Stagnation

Global score comparison in the Opus family: 4.6 slightly improves over 4.5 and widens the gap vs 4.1.

If you have been following the public conversation these weeks, it is easy to think that "coding models" are the new universal SOTA. The problem is that MIR does not reward the same thing as SWE-bench.

The addition of Claude Opus 4.6 leaves a nuanced conclusion:

  • In the global ranking, the score rises slightly: Opus 4.1 (556.333 net) → Opus 4.5 (568 net) → Opus 4.6 (570.667 net).
  • In global position, Opus 4.6 climbs to #27, versus #33 (4.5) and #57 (4.1).
  • In MIR 2026, Opus 4.6 is #20 (197/200), tied in correct answers with Opus 4.5.
  • In MIR 2026 cost, Opus 4.6 is slightly above 4.5 (4.888935 € vs 4.620485 €); see the quick calculation after this list.
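
To keep the cost point honest, here is a quick arithmetic pass over the figures just quoted. Only the numbers themselves come from the ranking; the script and the dictionary layout are our own bookkeeping.

```python
# Figures quoted above: global net scores and MIR 2026 run data for the Opus family.
opus = {
    "Opus 4.1": {"global_net": 556.333},
    "Opus 4.5": {"global_net": 568.0, "mir26_correct": 197, "mir26_cost_eur": 4.620485},
    "Opus 4.6": {"global_net": 570.667, "mir26_correct": 197, "mir26_cost_eur": 4.888935},
}

# Global net-score deltas between consecutive releases.
print("4.1 -> 4.5:", round(opus["Opus 4.5"]["global_net"] - opus["Opus 4.1"]["global_net"], 3))
print("4.5 -> 4.6:", round(opus["Opus 4.6"]["global_net"] - opus["Opus 4.5"]["global_net"], 3))

# MIR 2026 euros per correct answer: same 197/200, slightly higher cost for 4.6.
for name in ("Opus 4.5", "Opus 4.6"):
    run = opus[name]
    print(name, "-> € per correct answer:", round(run["mir26_cost_eur"] / run["mir26_correct"], 4))
```

The takeaway is visible in the deltas: most of the global improvement happened between 4.1 and 4.5 (+11.667 net), while 4.6 adds +2.667 net and pays slightly more per correct answer on MIR 2026.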

This fits what we see in the market: Opus 4.6 is positioned for complex coding and agentic tasks, not for medical MCQ exams.[6] If you want the full argument, we develop it in "The Swiss Army Knife and the Scalpel".

And here comes the critical point: GPT-5.3-Codex is still missing from the benchmark because it is not available via a public API under comparable conditions. OpenAI presents it as the tip of the spear in coding, but its own launch communication puts access in products and leaves API access as "pending."[7] In the public API changelog, the available model is gpt-5.2-codex, not 5.3.[8]

The criticism is simple: without comparable API access, there is no fair comparison. And without fair comparison, there is no evidence, only marketing.


7. What We Learned in Just Two Weeks

If I had to summarize this fortnight for different profiles (clinical, technical, product), I would keep six learnings:

  1. The benchmark is no longer in "small increments" mode; it is in "frontier jumps" mode, week by week.
  2. Once you hit 100%, the ranking needs new rules: the time-based tiebreaker stops being optional.
  3. The efficiency vs size paradox (Flash vs Pro) does not disappear; it coexists with the 3.1 Pro jump.
  4. Qwen3.5 enters a zone that few open-weights had entered before: a real top-15 global position.
  5. A health model is not "better" by label: specialization must be measured in the exact environment.
  6. The bottleneck for evaluating the "code wars" remains the same: homogeneous API access.

The underlying conclusion does not change, but now it is more forceful: the benchmark's 2026 evolution is happening in weeks, not quarters. And that forces every "update" to be treated like a mini era shift.

If the curve keeps this slope, the next cut may move the podium again.


Notes and References

  1. Official and external context on Gemini Flash as an efficiency strategy (not just a 'small model'): Google Developers Blog (Gemini 3 Flash, Dec 17, 2025) developers.googleblog.com and technical launch coverage techcrunch.com.
  2. Gemini 3.1 Pro Preview (Feb 19, 2026) and its arrival in developer tooling: 9to5Google 9to5google.com and GitHub Copilot changelog github.blog.
  3. Qwen3.5 397B A17B official model card (architecture, capabilities, positioning). huggingface.co/Qwen/Qwen3.5-397B-A17B.
  4. Context on Qwen3.5 launch and its agentic/open-weights focus: Economic Times (Feb 16, 2026) economictimes.indiatimes.com.
  5. MedGemma: official model card (Google Developers) developers.google.com and Hugging Face card (example) huggingface.co/google/medgemma-27b-text-it.
  6. Anthropic: Claude Opus 4.6 announcement and documentation anthropic.com and product page anthropic.com/claude/opus.
  7. OpenAI: GPT-5.3-Codex launch and note about availability/API (Feb 5, 2026) openai.com.
  8. OpenAI API changelog (Jan 14, 2026): availability of gpt-5.2-codex in API and no reference to 5.3 in the public changelog. platform.openai.com.
  9. Google Gemini API changelog (public reference for catalog/dates): ai.google.dev.