199 out of 200: AI Only Fails Once in MIR 2026


Final results of the largest medical AI benchmark in Spanish. Three models tie with 199 correct answers out of 200 valid questions. A 'Flash' model leads for the third consecutive year. Exhaustive analysis of 290 models evaluated with data on cost, speed, tokens, and accuracy.

MedBench Team · February 5, 2026 · 18 min read
MIR 2026 · Benchmark · Gemini Flash · GPT-5 · Definitive Results

On January 24, 2026, more than 12,000 candidates faced the most controversial MIR exam of the last decade. But while the medical community debated annulments, scoring scales, and administrative chaos, at Medical Benchmark we were executing something unprecedented: 290 artificial intelligence models answering the exam's 210 questions in real time, before anyone knew the correct answers.

The final results are, simply put, devastating.

Three AI models correctly answered 199 of the 200 valid questions on the MIR 2026. A single mistake. 99.5% accuracy. No human being in MIR history has ever achieved a comparable score.[1]


1. The Impossible Podium: Three-Way Tie at 199/200

For the first time in the three-year history of MedBench, three AI models have achieved exactly the same net score: 198.67 net (199 correct, 1 wrong, 0 blank).

| | Gemini 3 Flash | o3 | GPT-5 |
|---|---|---|---|
| Provider | Google | OpenAI | OpenAI |
| Profile | Cheapest | Balanced | Most reasoning |
| Score | 199/200 | 199/200 | 199/200 |
| Net score | 198.67 | 198.67 | 198.67 |
| Total cost | 0.33 € | 1.86 € | 1.97 € |
| Time/question | 4.2 s | 7.3 s | 18 s |
| Total tokens | 210K | 311K | 420K |
| Reasoning tokens | 0 | 71K | 135K |
| Avg. confidence | 100% | 100% | 100% |
| Only miss | Q118 | Q157 | Q77 |
| Specialty of the miss | Dermatology | Pharmacology | Internal Medicine |

The three co-winners represent two tech giants with radically different philosophies:

  • Google Gemini 3 Flash Preview: a model designed to be fast and economical. Total cost of the complete exam: 0.33 € (thirty-three euro cents). Average time per question: 4.2 seconds. No explicit reasoning tokens: although the model allows a configurable reasoning token budget, in this benchmark we deliberately ran it with 0 reasoning tokens.
  • OpenAI o3: OpenAI's advanced reasoning model. Cost: 1.86 €. Generates 71,000 internal reasoning tokens before answering. Time: 7.3 seconds per question.
  • OpenAI GPT-5: OpenAI's flagship. Cost: 1.97 €. The most reasoning-intensive, with 135,000 dedicated tokens, but also the slowest of the three: 18 seconds per question.

How is the tie broken?

At MedBench, when there's a tie in net score, the tiebreaker criterion is the total exam cost (lower cost wins). This criterion reflects a crucial practical reality: if two models have identical accuracy, the one that achieves it more efficiently is objectively superior from a clinical deployment perspective.

With this criterion, Gemini 3 Flash Preview is the official winner of MIR 2026, with a cost 5.7 times lower than o3 and 6 times lower than GPT-5.
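In code, the tiebreak amounts to a two-key sort. Below is a minimal sketch using only the podium figures from this article (illustrative data, not our actual ranking pipeline):

```python
# Tiebreak sketch: sort by net score (descending), then by total exam cost (ascending).
podium = [
    {"model": "Gemini 3 Flash", "net": 198.67, "cost_eur": 0.33},
    {"model": "o3",             "net": 198.67, "cost_eur": 1.86},
    {"model": "GPT-5",          "net": 198.67, "cost_eur": 1.97},
]

ranked = sorted(podium, key=lambda m: (-m["net"], m["cost_eur"]))
print(ranked[0]["model"])  # -> Gemini 3 Flash
```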


2. The Complete Ranking: The Top 15


Top 15 AI models in MIR 2026 by net score (final results)

The concentration of scores at the high end is extraordinary. The top 10 models move in a range of just 1.33 net points (from 198.67 to 197.33). This reflects both the quality of current models and the relative "ease" of the MIR 2026 for AI systems, a phenomenon we analyzed in depth in our previous article about the perfect storm of MIR 2026.

Key ranking data:

  • 3 models with 199/200 (99.5% accuracy)
  • 9 models with 198/200 (99.0%)
  • 8 models with 197/200 (98.5%)
  • All Top 20 exceed 98% accuracy (196/200 or more)
  • 58 models exceed 95% accuracy
  • 119 models exceed 90%

To put this in context: the best known human result on the MIR 2025 was 174 correct and 25 errors (87% accuracy, 165.67 net).[2] This year's three winners have 99.5%.


3. David vs. Goliath: The Flash Paradox

This is perhaps the most counterintuitive and fascinating conclusion of the entire benchmark: a "Flash" model — designed for speed and low cost, not for maximum intelligence — has been the best or tied for first place in Spain's most demanding medical exam for three consecutive years.

*Sonar Deep Research has web access, enabling it to look up published exam answers online

Gemini Flash's track record:

| Exam | Flash position | Net | Cost | Official winner | Note |
|---|---|---|---|---|---|
| MIR 2024 | #2 (tie in net with #3-#5) | 193.33 | 0.32 € | Sonar Deep Research (193.67) | Sonar has web access |
| MIR 2025 | #1 | 190.67 | 0.34 € | Gemini 3 Flash | Undisputed winner |
| MIR 2026 | #1 (tie with o3 and GPT-5) | 198.67 | 0.33 € | Gemini 3 Flash (by cost) | Three-way tie |

The MIR 2024 case deserves special mention. The nominal winner was Perplexity Sonar Deep Research with 193.67 net versus Flash's 193.33. However, Sonar Deep Research is a model with real-time web search access. Since MIR answers are published on multiple academy and medical forum websites a few days after the exam[3], it cannot be ruled out that Sonar directly consulted these sources. If we exclude models with web access, Gemini Flash has effectively been the best model for three consecutive years.

Why does a "light" model outperform the most expensive ones?

This result defies the intuition that "bigger = better". There are several complementary hypotheses:

  1. Architectural efficiency over raw size. Google has invested heavily in distillation optimizations and token efficiency.[4] Gemini 3 Flash generates more concise and direct responses: in independent tests, it completed tasks with 26% fewer tokens than equivalent Pro models.[5]

  2. MIR as a test of factual knowledge, not deep reasoning. Most MIR 2026 questions required direct recognition of clinical patterns, not complex chains of reasoning. A model that "knows" the answer directly doesn't need to "think" 135,000 tokens to reach it.

  3. Fewer reasoning tokens = fewer opportunities for error. Models with extensive reasoning chains (chain-of-thought) can "convince themselves" of incorrect answers through elaborate but erroneous internal reasoning. Flash, with 0 reasoning tokens, simply responds to what it "knows".

  4. The "smarter, not bigger" paradigm. As Barclays notes in its AI outlook report for 2026[6], the industry is shifting from pure parameter scaling toward intelligent optimization. Gemini 3 Flash is the perfect example of this trend.

The underlying reflection: If a model that costs 0.33 € per exam can correctly answer 199 of 200 questions, what real added value do models that cost 100 or 660 times more provide when they get the same or even fewer correct?


4. Anatomy of the Single Mistake

Each of the three winners failed exactly one different question. No mistake is repeated among them, suggesting these are stochastic errors, not systematic knowledge gaps:

| Model | Failed question | Answered | Correct | Specialty |
|---|---|---|---|---|
| Gemini 3 Flash | Question 118 | C | B | Dermatology |
| o3 | Question 157 | C | D | Pharmacology |
| GPT-5 | Question 77 | C | A | Internal Medicine |

Curiously, all three models answered "C" on their single failed question. Anecdotes aside, what matters is that if we combined the answers of the three models with a majority-vote system, the result would be a perfect 200/200: every question that one of them misses, the other two get right.

This opens a fascinating reflection on ensemble systems in medical AI: a committee of three complementary models could achieve perfect accuracy on this exam.
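As a minimal sketch of that idea, here is the majority vote applied to the only three questions any co-winner missed. The wrong answers come from the table above; the two correct votes per question follow from the fact that the other two models answered those questions correctly:

```python
from collections import Counter

# Majority-vote committee of the three co-winners on the three missed questions.
# The correct options (B, D, A) are taken from the table above; the single "C"
# in each row is the one model's miss.
answers = {
    "Q118": {"Gemini 3 Flash": "C", "o3": "B", "GPT-5": "B"},   # key: B
    "Q157": {"Gemini 3 Flash": "D", "o3": "C", "GPT-5": "D"},   # key: D
    "Q77":  {"Gemini 3 Flash": "A", "o3": "A", "GPT-5": "C"},   # key: A
}

def majority_vote(votes: dict) -> str:
    """Return the most voted option letter."""
    return Counter(votes.values()).most_common(1)[0][0]

for question, votes in answers.items():
    print(question, "->", majority_vote(votes))
# Each single miss is outvoted 2 to 1, so the committee scores 200/200.
```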


5. The Plot Twist: The Provisional Answer Key and ChatGPT's Shadow

Before the final results were published (with 7 annulled questions), the Ministry's provisional answer key included only 4 annulments (questions 13, 50, 64, and 161). With that answer key, the ranking was significantly different.

The three additional questions that were annulled in the final answer key were 139 (lupus and anemia), 142 (thyroiditis), and 208 (cirrhosis). The impact of these annulments was asymmetric:

| Correct answers among the 3 annulled questions | Net impact | Implication |
|---|---|---|
| 0/3 | +1.00 net | Maximum benefit. The penalties for failing those questions disappear. Example: Gemini 3 Flash. |
| 1/3 | -0.33 net | Slight negative impact. The model loses 1 correct answer but eliminates 2 penalties. Example: o3. |
| 2/3 | -1.67 net | Moderate impact. The model loses 2 correct answers and only eliminates 1 penalty. Example: GPT-5. |
| 3/3 | -3.00 net | Maximum harm. The model loses 3 correct answers with no compensation. Example: o1. |

Impact of the 3 additional annulments (Q139, Q142, Q208) on net score by prior correct answers
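The arithmetic behind the table is simple. A minimal sketch, assuming the standard MIR net-score rule in which each wrong answer cancels one third of a correct one:

```python
def net_score(correct: int, wrong: int) -> float:
    """MIR-style net score: each wrong answer cancels one third of a correct answer."""
    return correct - wrong / 3

# The three co-winners: 199 correct, 1 wrong on the 200 valid questions.
print(round(net_score(199, 1), 2))  # 198.67

# Impact of annulling 3 questions, depending on how many of them a model had right
# (an annulled question is simply removed from both counts).
for had_right in (0, 1, 2, 3):
    had_wrong = 3 - had_right
    delta = -had_right + had_wrong / 3  # lose the corrects, recover the penalties
    print(f"{had_right}/3 correct -> {delta:+.2f} net")
# 0/3 -> +1.00, 1/3 -> -0.33, 2/3 -> -1.67, 3/3 -> -3.00
```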

Who was leading with the provisional answer key?

With only 4 annulled questions, GPT-5 and o1 co-led with an approximate net score of 193.33, having correctly answered the 3 questions that would later be annulled. Gemini 3 Flash, which missed all three, sat much further down the ranking.

The annulment of these three questions caused the largest ranking movement in the benchmark: Gemini Flash rose 9 positions (from #11 to #2), while o1 fell 7 positions (from co-leader to #8).

The uncomfortable hypothesis

There's a detail we cannot ignore. Among the candidate community and in specialized forums, rumors have circulated — rumors we must stress are unconfirmed and unverified — about the possibility that some MIR 2026 questions were drafted, wholly or in part, with the assistance of generative AI tools such as ChatGPT.[7]

If these rumors were true (and we reiterate that we have no evidence confirming this), it would explain an observable pattern in our data: models from the GPT/OpenAI family obtained especially high performance on the provisional answer key, precisely on questions that were later annulled for containing ambiguities or errors. An AI model would tend to "get right" questions generated by a similar AI, as they would share writing biases and formulation patterns.

Editorial note: This hypothesis is speculative and is not intended to be a categorical statement or a description of reality. The annulment of questions is a routine process in the MIR that can be due to multiple legitimate factors, including clinical ambiguity, updating of medical guidelines, and drafting errors.


6. No Possible Contamination: Blinded Methodology

A crucial aspect of our benchmark that confers maximum credibility is the timing of evaluations:

  • MIR exam date: January 24, 2026
  • Execution date for all models: January 25, 2026
  • Publication of provisional answer key: January 26, 2026

All evaluations were executed BEFORE the correct answers were published. No model could have been trained, fine-tuned, or contaminated with the MIR 2026 answers, because they simply didn't exist when the evaluations were run.

This makes MedBench one of the few medical AI benchmarks in the world where data contamination is physically impossible.[8] The models responded with their pre-existing medical knowledge, exactly like a human candidate.

Additionally, all models received the same system prompt, without clues about the exam year or additional information that could bias the answers.
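For transparency, this is roughly what the blinded evaluation loop looks like. It is a simplified sketch: `query_model` and `SYSTEM_PROMPT` are illustrative placeholders standing in for the provider SDKs and the exact prompt we use, and scoring is deliberately absent because no answer key existed at execution time:

```python
SYSTEM_PROMPT = (
    "You are answering a multiple-choice medical exam. "
    "Reply with a single option letter and a confidence between 0 and 100."
)

def run_blind_evaluation(model_id: str, questions: list, query_model) -> list:
    """Send every question with the same system prompt; no answer key is involved."""
    results = []
    for q in questions:
        reply = query_model(
            model=model_id,
            system=SYSTEM_PROMPT,
            user=q["statement"] + "\n" + "\n".join(q["options"]),
        )
        results.append({"id": q["id"], "answer": reply})
    return results

# Scoring against the official key happens days later, once the Ministry publishes it.
```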


7. Deep Metrics Analysis

Beyond net score, MedBench records detailed metrics for each model on each question: cost, tokens, response time, and confidence. These data reveal fascinating patterns.

7.1. Cost: From 0.33 € to 217 €

Total cost per full exam (210 questions). Gemini 3 Flash leads at 0.33 € vs 217 € for o1-pro, with equal or higher accuracy

The cost dispersion is enormous:

  • Gemini 3 Flash: 0.33 € per complete exam (210 questions). That is, 0.0016 € per question.
  • o1-pro: 217 € per exam, 1.08 € per question. And it gets a worse result (98.5% vs 99.5%).
  • o3 Deep Research: 167.82 €. It needs 3.6 minutes per question and consumes 6.6 million tokens.

Gemini Flash's cost-benefit ratio is, objectively, unbeatable. Obtaining the maximum score for 0.33 € makes any higher spending on models with equal or inferior performance inefficient.

7.2. Response Speed


Average time per question for Top 15 models. o3 Deep Research needs 218 seconds per question (3.6 minutes), while GPT-5.1 Chat answers in 3.2 seconds

Speed matters in real clinical contexts. A diagnostic support system that takes 3 minutes to respond has very different utility from one that responds in 3 seconds.

The fastest models in the Top 15:

  1. GPT-5.1 Chat: 3.2 seconds/question
  2. GPT-5 Codex: 3.9 seconds/question
  3. Gemini 3 Flash: 4.2 seconds/question

The slowest, by a wide margin, is o3 Deep Research: 218 seconds per question (3.6 minutes).

7.3. Tokens: Does Thinking More Help?

Token breakdown by type. o3 Deep Research consumes 6.6M tokens per exam (off scale). Gemini 3 Flash: 210K total tokens with no explicit reasoning

One of the most interesting questions revealed by our data: do reasoning tokens improve the result?

For Gemini 3 Flash, the 0 value reflects a methodological choice on our side: even though the model supports a reasoning budget, we intentionally evaluated it with no reasoning tokens.

| Model | Reasoning tokens | Accuracy | Net |
|---|---|---|---|
| Gemini 3 Flash | 0 | 99.5% | 198.67 |
| o3 | 71K | 99.5% | 198.67 |
| GPT-5 | 135K | 99.5% | 198.67 |
| GPT-5.1 Chat | 6K | 99.0% | 197.33 |
| o1 | 146K | 99.0% | 197.33 |
| o3 Deep Research | 1,741K | 99.0% | 197.33 |

The answer is clear: no, at least not on this exam. The model with 0 reasoning tokens gets the same result as the model with 135,000, and a better result than the model with 1.7 million. This suggests that the MIR 2026 is primarily an exam of pattern recognition and factual knowledge, where "deep thinking" doesn't add marginal value.
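A quick sanity check on the table above, computed with no external libraries (the figures are copied verbatim from the table; this is an illustration, not a formal analysis):

```python
# Pearson correlation between reasoning tokens and net score for the six models above.
data = [
    ("Gemini 3 Flash", 0, 198.67),
    ("o3", 71_000, 198.67),
    ("GPT-5", 135_000, 198.67),
    ("GPT-5.1 Chat", 6_000, 197.33),
    ("o1", 146_000, 197.33),
    ("o3 Deep Research", 1_741_000, 197.33),
]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

tokens = [t for _, t, _ in data]
nets = [s for _, _, s in data]
print(round(pearson(tokens, nets), 2))  # negative: more reasoning tokens, no higher net score
```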

7.4. Confidence: All Confident, All Correct

The average confidence reported by the Top 10 models is consistently close to 100%. This indicates that modern models not only get it right, but know they're getting it right. Confidence calibration is a crucial indicator for clinical applications: a model that says "I'm 100% sure" and gets it right 99.5% of the time is extraordinarily reliable.
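As a back-of-the-envelope illustration of what calibration means here (a toy computation on the figures quoted in this article, not the metric MedBench actually reports):

```python
# Gap between self-reported confidence and observed accuracy for the co-winners.
reported_confidence = 1.00       # average self-reported confidence
observed_accuracy = 199 / 200    # 0.995

gap = reported_confidence - observed_accuracy
print(f"calibration gap: {gap:.3f}")  # 0.005: marginally overconfident, essentially reliable
```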


8. AI vs. Humans: The Gap Widens

Comparison between the best AI score and the best known human result per exam session. MIR 2026: human result pending official publication

The historical evolution is unequivocal:

  • MIR 2024: The best AI surpassed the best human by 7 net points (193.67 vs 186.67). AI led by 3.7%.
  • MIR 2025: The gap jumped to 25 net points (190.67 vs 165.67). AI led by 15.1%.
  • MIR 2026: With 198.67 net and the human result still pending official publication[9], we project an even larger gap.

Even in the hypothetical case that the best human on the MIR 2026 matched the historical human record of 190 correct answers (MIR 2024), their net score would depend on the number of errors. Assuming optimal performance of 190 correct and 10 errors (186.67 net), the gap with AI would be 12 net points.

The question is no longer whether AI is better than humans on the MIR. The question is how much better.


9. Historical Evolution: Three Years of Benchmarking


AI accuracy evolution vs. best human in MIR (2024-2026). MIR 2026: human result pending official publication

The MIR 2025, considered the most difficult of the three years analyzed (long statements, "testament" questions, high cognitive load), caused a temporary drop in the accuracy of all models. However, the general trend is clear:

| Metric | MIR 2024 | MIR 2025 | MIR 2026 |
|---|---|---|---|
| Best accuracy | 97.5% | 96.5% | 99.5% |
| Top 5 average | 97.5% | 96.0% | 99.3% |
| Top 10 average | 97.5% | 95.8% | 99.2% |
| Models >95% | 18 | 11 | 58 |
| Models >90% | 68 | 52 | 119 |
| Models evaluated | 291 | 290 | 290 |

The MIR 2026 leap is explained by the convergence of two factors: the continuous improvement of models (especially the GPT-5.x and Gemini 3 generation) and the lower relative difficulty of the exam.


10. The Power Map: Who Dominates the Benchmark?

Provider distribution in the MIR 2026 benchmark Top 20

OpenAI numerically dominates the Top 20 with 11 models, reflecting its strategy of proliferating variants (GPT-5, GPT-5.1, GPT-5.2, Chat, Codex, Pro, Image versions, etc.).

Google places 6 models with an opposite strategy: fewer variants but more differentiated (Flash vs Pro, different versions of Gemini 2.5 and 3).

Anthropic places 3 models in the Top 20 (Claude Opus 4.5 at #14, Claude Opus 4.6 at #15 and Claude Opus 4.1 at #18), confirming its position as the third relevant player.

However, quality over quantity favors Google: with 6 models in the Top 20, it places #1 (Gemini Flash) and four models in the top 15. OpenAI needs 11 models to dominate numerically.


11. Final Reflections: What Does All This Mean?

For the medical community

The MIR 2026 marks a turning point. An AI system that gets 99.5% of an exam designed to select the country's best doctors right is not a technological curiosity: it's a paradigm shift.

This doesn't mean AI will replace doctors. The MIR evaluates theoretical knowledge in test format, not clinical skills like empathy, patient communication, physical examination, or decision-making under extreme uncertainty. But it does demonstrate that AI can be an extraordinary ally as a diagnostic support system and as a training tool.

For the AI community

The victory of a Flash model over frontier models that cost up to 660 times more forces a rethinking of fundamental assumptions:

  • Brute parameter scaling has diminishing returns in well-defined factual knowledge domains.
  • Architectural efficiency matters more than size in many real contexts.
  • Current medical benchmarks may be reaching their ceiling as a measure of AI capability. When 3 models approach 100%, the exam stops discriminating.

For the future of MedBench

Given results so close to perfection, our benchmark must evolve. We're working on:

  • Multimodal evaluations with clinical images and imaging tests
  • Reasoning quality metrics, not just the final answer
  • Complex clinical case benchmarks that require information integration across multiple steps
  • Evaluation of hallucinations and calibrated confidence in contexts of uncertainty

At Medical Benchmark we will continue to document and analyze the evolution of artificial intelligence in medicine with rigor, transparency, and independence. All data is available on our rankings platform.

Notes and References

  1. The best known human result in recent MIR history is 190 correct answers and 10 errors (MIR 2024), equivalent to 186.67 net points. The AI's 199 correct answers surpass this record by 12 net points.
  2. Best human result data for MIR 2025 obtained from official publications of the Ministry of Health.
  3. MIR prep academies publish their provisional corrections hours after the exam. Models with web access like Sonar Deep Research could access these answers during the evaluation.
  4. Google Blog: Gemini 3 Flash: frontier intelligence built for speed (December 2025)
  5. Engadget: Google's Gemini 3 Flash model outperforms GPT-5.2 in some benchmarks (December 2025)
  6. Barclays Private Bank: AI in 2026: Smarter, not bigger
  7. Rumors circulated on social media and MIR candidate forums. There is no confirmed public evidence that the Ministry of Health used generative AI tools in the preparation of MIR 2026 exam questions.
  8. Luengo Vera, Ferro Picon, et al.: Evaluating LLMs on the Spanish MIR Exam: A Comparative Analysis 2024/2025 (arXiv, 2025)
  9. According to the official call, the Ministry of Health has until February 24, 2026 to publish the definitive results with ranking numbers. Given the context of administrative incidents in this edition, it is possible that the deadline will be exhausted.