
199 out of 200: AI Only Fails Once in MIR 2026
Final results of the largest medical AI benchmark in Spanish. Three models tie with 199 correct answers out of 200 valid questions. A 'Flash' model leads for the third consecutive year. Exhaustive analysis of 290 models evaluated with data on cost, speed, tokens, and accuracy.
On January 24, 2026, more than 12,000 candidates faced the most controversial MIR exam of the last decade. But while the medical community debated annulments, scoring scales, and administrative chaos, at Medical Benchmark we were executing something unprecedented: 290 artificial intelligence models answering the exam's 210 questions in real time, before anyone knew the correct answers.
The final results are, quite simply, overwhelming.
Three AI models correctly answered 199 of the 200 valid questions on the MIR 2026. A single mistake. 99.5% accuracy. No human being in MIR history has ever achieved a comparable score.[1]
1. The Impossible Podium: Three-Way Tie at 199/200
For the first time in the three-year history of MedBench, three AI models have achieved exactly the same net score: 198.67 net (199 correct, 1 wrong, 0 blank).
- Gemini 3 Flash (Google)
- o3 (OpenAI)
- GPT-5 (OpenAI)

The three co-winners represent two tech giants with radically different philosophies:
- Google Gemini 3 Flash Preview: A model designed to be fast and economical. Total cost of the complete exam: 0.33 € (thirty-three euro cents). Average time per question: 4.2 seconds. No explicit reasoning tokens: although the model allows a configurable reasoning token budget, in this benchmark we deliberately ran it with 0 reasoning tokens.
- OpenAI o3: OpenAI's advanced reasoning model. Cost: 1.86 €. Generates 71,000 internal reasoning tokens before answering. Time: 7.3 seconds per question.
- OpenAI GPT-5: OpenAI's flagship. Cost: 1.97 €. The most reasoning-intensive, with 135,000 dedicated tokens. But also the slowest of the three: 18 seconds per question.
How to break the tie?
At MedBench, when there's a tie in net score, the tiebreaker criterion is the total exam cost (lower cost wins). This criterion reflects a crucial practical reality: if two models have identical accuracy, the one that achieves it more efficiently is objectively superior from a clinical deployment perspective.
With this criterion, Gemini 3 Flash Preview is the official winner of MIR 2026, with a cost 5.7 times lower than o3 and 6 times lower than GPT-5.
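For readers who want to reproduce the arithmetic, here is a minimal sketch of the scoring and tiebreaking logic (illustrative code, not MedBench's actual pipeline): the MIR net score subtracts one third of a point per wrong answer, and ties are broken by total exam cost.

```python
# Minimal sketch of MedBench-style scoring and tiebreaking (illustrative names,
# not the benchmark's actual code). Each wrong answer costs 1/3 of a point;
# blank answers neither add nor subtract.

def net_score(correct: int, wrong: int) -> float:
    return correct - wrong / 3

def rank(models: list[dict]) -> list[dict]:
    # Higher net score first; on equal net scores, lower total exam cost wins.
    return sorted(models, key=lambda m: (-net_score(m["correct"], m["wrong"]), m["cost_eur"]))

podium = [
    {"name": "Gemini 3 Flash", "correct": 199, "wrong": 1, "cost_eur": 0.33},
    {"name": "o3", "correct": 199, "wrong": 1, "cost_eur": 1.86},
    {"name": "GPT-5", "correct": 199, "wrong": 1, "cost_eur": 1.97},
]

for m in rank(podium):
    print(f'{m["name"]}: {net_score(m["correct"], m["wrong"]):.2f} net, {m["cost_eur"]:.2f} €')
# Gemini 3 Flash: 198.67 net, 0.33 €
# o3: 198.67 net, 1.86 €
# GPT-5: 198.67 net, 1.97 €
```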
2. The Complete Ranking: The Top 15
Top 15 AI models in MIR 2026 by net score (final results)
The concentration of scores at the high end is extraordinary. The top 10 models move in a range of just 1.33 net points (from 198.67 to 197.33). This reflects both the quality of current models and the relative "ease" of the MIR 2026 for AI systems, a phenomenon we analyzed in depth in our previous article about the perfect storm of MIR 2026.
Key ranking data:
- 3 models with 199/200 (99.5% accuracy)
- 9 models with 198/200 (99.0%)
- 8 models with 197/200 (98.5%)
- All Top 20 models reach at least 98% accuracy (196/200 or more)
- 58 models exceed 95% accuracy
- 119 models exceed 90%
To put this in context: the best known human result on the MIR 2025 was 174 correct and 25 errors (87% accuracy, 165.67 net).[2] This year's three winners have 99.5%.
3. David vs. Goliath: The Flash Paradox
This is perhaps the most counterintuitive and fascinating conclusion of the entire benchmark: a "Flash" model — designed for speed and low cost, not for maximum intelligence — has been the best or tied for first place in Spain's most demanding medical exam for three consecutive years.
*Sonar Deep Research has web access, enabling it to look up published exam answers online
Gemini Flash's track record:
| Exam | Flash Position | Net | Cost | Official Winner | Note |
|---|---|---|---|---|---|
| MIR 2024 | #2 (tie in net with #3-#5) | 193.33 | 0.32 € | Sonar Deep Research (193.67) | Sonar has web access |
| MIR 2025 | #1 | 190.67 | 0.34 € | Gemini 3 Flash | Undisputed winner |
| MIR 2026 | #1 (tie with o3 and GPT-5) | 198.67 | 0.33 € | Gemini 3 Flash (by cost) | Three-way tie |
The MIR 2024 case deserves special mention. The nominal winner was Perplexity Sonar Deep Research with 193.67 net versus Flash's 193.33. However, Sonar Deep Research is a model with real-time web search access. Since MIR answers are published on multiple academy and medical forum websites a few days after the exam[3], it cannot be ruled out that Sonar directly consulted these sources. If we exclude models with web access, Gemini Flash has effectively been the best model for three consecutive years.
Why does a "light" model outperform the most expensive ones?
This result defies the intuition that "bigger = better". There are several complementary hypotheses:
- Architectural efficiency over raw size. Google has invested heavily in distillation optimizations and token efficiency.[4] Gemini 3 Flash generates more concise and direct responses: in independent tests, it completed tasks with 26% fewer tokens than equivalent Pro models.[5]
- MIR as a test of factual knowledge, not deep reasoning. Most MIR 2026 questions required direct recognition of clinical patterns, not complex chains of reasoning. A model that "knows" the answer directly doesn't need to "think" for 135,000 tokens to reach it.
- Fewer reasoning tokens = fewer opportunities for error. Models with extensive reasoning chains (chain-of-thought) can "convince themselves" of incorrect answers through elaborate but erroneous internal reasoning. Flash, with 0 reasoning tokens, simply responds with what it "knows".
- The "smarter, not bigger" paradigm. As Barclays notes in its AI outlook report for 2026[6], the industry is shifting from pure parameter scaling toward intelligent optimization. Gemini 3 Flash is the perfect example of this trend.
The underlying reflection: If a model that costs 0.33 € per exam can correctly answer 199 of 200 questions, what real added value do models that cost 100 or 660 times more provide when they get the same or even fewer correct?
4. Anatomy of the Single Mistake
Each of the three winners failed exactly one different question. No mistake is repeated among them, suggesting these are stochastic errors, not systematic knowledge gaps:
| Model | Failed Question | Model's Answer | Correct Answer | Specialty |
|---|---|---|---|---|
| Gemini 3 Flash | Question 118 | C | B | Dermatology |
| o3 | Question 157 | C | D | Pharmacology |
| GPT-5 | Question 77 | C | A | Internal Medicine |
Curiously, all three models answered "C" on their single failed question. Beyond the anecdote, what's relevant is that if we combined the answers of the three models using a majority voting system, the result would be a perfect 200/200: each question that one fails, the other two get right.
This opens a fascinating reflection on ensemble systems in medical AI: a committee of three complementary models could achieve perfect accuracy on this exam.
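As an illustration only (the committee is hypothetical, and the non-failing models' answers below are inferred from the fact that each winner missed exactly one question), a simple majority vote over the three winners resolves all three errors:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer among the committee members."""
    return Counter(answers).most_common(1)[0][0]

# The three single mistakes described above. On each question, the other two
# models answer correctly, so the 2-vs-1 majority recovers the right option.
cases = {
    "Q77":  {"Gemini 3 Flash": "A", "o3": "A", "GPT-5": "C"},  # correct: A
    "Q118": {"Gemini 3 Flash": "C", "o3": "B", "GPT-5": "B"},  # correct: B
    "Q157": {"Gemini 3 Flash": "D", "o3": "C", "GPT-5": "D"},  # correct: D
}

for q, votes in cases.items():
    print(q, "->", majority_vote(list(votes.values())))
# Q77 -> A, Q118 -> B, Q157 -> D
```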
5. The Plot Twist: The Provisional Answer Key and ChatGPT's Shadow
Before the final results were published (with 7 annulled questions), the Ministry's provisional answer key only contemplated 4 annulments (questions 13, 50, 64, and 161). With that answer key, the ranking was significantly different.
The three additional questions that were annulled in the final answer key were 139 (lupus and anemia), 142 (thyroiditis), and 208 (cirrhosis). The impact of these annulments was asymmetric:
| Scenario | Net change | Implication |
|---|---|---|
| Models with 0/3 correct on annulled questions | +1.00 net | Maximum benefit. Penalties for failing those questions disappear. Example: Gemini 3 Flash. |
| Models with 1/3 correct | -0.33 net | Slight negative impact. They lose 1 correct answer but eliminate 2 penalties. Example: o3. |
| Models with 2/3 correct | -1.67 net | Moderate impact. They lose 2 correct answers and only eliminate 1 penalty. Example: GPT-5. |
| Models with 3/3 correct | -3.00 net | Maximum harm. They lose 3 correct answers with no compensation. Example: o1. |
Impact of the 3 additional annulments (Q139, Q142, Q208) on net score by prior correct answers
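The arithmetic behind the table is simple: annulling a question removes one full point if the model had answered it correctly and refunds a third of a point if it had answered it incorrectly. A minimal sketch, assuming the model answered all three annulled questions (no blanks among them):

```python
# Net-score shift caused by annulling questions, assuming all of them were answered.
# Removing a correct answer costs 1 point; removing a wrong answer refunds 1/3.

def annulment_delta(correct_among_annulled: int, n_annulled: int = 3) -> float:
    wrong_among_annulled = n_annulled - correct_among_annulled
    return -correct_among_annulled + wrong_among_annulled / 3

for c in range(4):
    print(f"{c}/3 correct -> {annulment_delta(c):+.2f} net")
# 0/3 correct -> +1.00 net
# 1/3 correct -> -0.33 net
# 2/3 correct -> -1.67 net
# 3/3 correct -> -3.00 net
```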
Who was leading with the provisional answer key?
With only 4 annulled questions, GPT-5 and o1 co-led with an approximate net score of 193.33: o1 had answered all three of the later-annulled questions correctly, and GPT-5 two of them. Gemini 3 Flash, which failed all three, occupied a more distant position.
The annulment of these three questions caused the largest ranking movement in the benchmark: Gemini Flash rose 9 positions (from #11 to #2), while o1 fell 7 positions (from co-leader to #8).
The uncomfortable hypothesis
There's a detail we cannot ignore. Among the candidate community and in specialized forums, rumors have circulated (which we must expressly qualify as unconfirmed and unverified) about the possibility that some MIR 2026 questions could have been drafted, totally or partially, with the assistance of generative AI tools like ChatGPT.[7]
If these rumors were true (and we reiterate that we have no evidence confirming this), it would explain an observable pattern in our data: models from the GPT/OpenAI family obtained especially high performance on the provisional answer key, precisely on questions that were later annulled for containing ambiguities or errors. An AI model would tend to "get right" questions generated by a similar AI, as they would share writing biases and formulation patterns.
Editorial note: This hypothesis is speculative and should not be read as a categorical statement or a description of reality. The annulment of questions is a routine process in the MIR that can be due to multiple legitimate factors, including clinical ambiguity, updates to medical guidelines, and drafting errors.
6. No Possible Contamination: Blinded Methodology
A crucial aspect of our benchmark that confers maximum credibility is the timing of evaluations:
- MIR exam date: January 24, 2026
- Execution date for all models: January 25, 2026
- Publication of provisional answer key: January 26, 2026
All evaluations were executed BEFORE the correct answers were published. No model could have been trained, fine-tuned, or contaminated with the MIR 2026 answers, because they simply didn't exist when the evaluations were run.
This makes MedBench one of the few medical AI benchmarks in the world where data contamination is physically impossible.[8] The models responded with their pre-existing medical knowledge, exactly like a human candidate.
Additionally, all models received the same system prompt, without clues about the exam year or additional information that could bias the answers.
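To make the setup concrete, this is roughly the shape of the blinded run loop. Function and field names are hypothetical, and `ask_model` is just a placeholder for each provider's SDK call; the point is that no answer key exists anywhere in the loop at execution time, so nothing can leak it.

```python
import time

SYSTEM_PROMPT = (
    "You are sitting a multiple-choice medical exam. "
    "Answer with a single option letter and a confidence between 0 and 100."
)

def ask_model(model_id: str, system: str, question: str) -> dict:
    """Placeholder for the real provider API call; returns answer, confidence, tokens, cost."""
    raise NotImplementedError

def run_exam(model_id: str, questions: list[str]) -> list[dict]:
    """Send every question to one model with the same system prompt; record raw metrics only."""
    records = []
    for i, question in enumerate(questions, start=1):
        t0 = time.monotonic()
        reply = ask_model(model_id, SYSTEM_PROMPT, question)
        records.append({
            "question": i,
            "answer": reply["answer"],
            "confidence": reply["confidence"],
            "tokens": reply["tokens"],
            "cost_eur": reply["cost_eur"],
            "seconds": time.monotonic() - t0,
        })
    return records  # Scoring happens later, once the official answer key is published.
```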
7. Deep Metrics Analysis
Beyond net score, MedBench records detailed metrics for each model on each question: cost, tokens, response time, and confidence. These data reveal fascinating patterns.
7.1. Cost: From 0.33 € to 217 €
Total cost per full exam (210 questions). Gemini 3 Flash leads at 0.33 € vs 217 € for o1-pro, with equal or higher accuracy
The cost dispersion is brutal:
- Gemini 3 Flash: 0.33 € per complete exam (210 questions). That is, 0.0016 € per question.
- o1-pro: 217 € per exam, 1.08 € per question. And it gets a worse result (98.5% vs 99.5%).
- o3 Deep Research: 167.82 €. It needs 3.6 minutes per question and consumes 6.6 million tokens.
Gemini Flash's cost-benefit ratio is, objectively, unbeatable. Obtaining the maximum score for 0.33 € makes any higher spending on models with equal or inferior performance inefficient.
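One way to make that claim concrete is to divide correct answers by cost, using the figures quoted above (o1-pro's 197/200 is inferred from its 98.5% accuracy):

```python
# Correct answers obtained per euro spent, from the figures in this section.
models = {
    "Gemini 3 Flash": (199, 0.33),
    "o3": (199, 1.86),
    "GPT-5": (199, 1.97),
    "o1-pro": (197, 217.00),
}

for name, (correct, cost) in models.items():
    print(f"{name}: {correct / cost:.2f} correct answers per euro")
# Gemini 3 Flash: 603.03, o3: 106.99, GPT-5: 101.02, o1-pro: 0.91
```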
7.2. Response Speed
Average time per question for Top 15 models. o3 Deep Research needs 218 seconds per question (3.6 minutes), while GPT-5.1 Chat answers in 3.2 seconds
Speed matters in real clinical contexts. A diagnostic support system that takes 3 minutes to respond has very different utility from one that responds in 3 seconds.
The fastest models in the Top 15:
- GPT-5.1 Chat: 3.2 seconds/question
- GPT-5 Codex: 3.9 seconds/question
- Gemini 3 Flash: 4.2 seconds/question
The slowest:
- o3 Deep Research: 218 seconds/question (3 min 38 sec)
- GPT-5.2 Pro: 31.8 seconds/question
- Gemini 2.5 Pro Preview 05-06: 24.2 seconds/question
7.3. Tokens: Does Thinking More Help?
Token breakdown by type. o3 Deep Research consumes 6.6M tokens per exam (off scale). Gemini 3 Flash: 210K total tokens with no explicit reasoning
One of the most interesting questions revealed by our data: do reasoning tokens improve the result?
For Gemini 3 Flash, the 0 value reflects a methodological choice on our side: even though the model supports a reasoning budget, we intentionally evaluated it with no reasoning tokens.
| Model | Reasoning Tokens | Accuracy | Net |
|---|---|---|---|
| Gemini 3 Flash | 0 | 99.5% | 198.67 |
| o3 | 71K | 99.5% | 198.67 |
| GPT-5 | 135K | 99.5% | 198.67 |
| GPT-5.1 Chat | 6K | 99.0% | 197.33 |
| o1 | 146K | 99.0% | 197.33 |
| o3 Deep Research | 1,741K | 99.0% | 197.33 |
The answer is clear: no, at least not on this exam. The model with 0 reasoning tokens gets the same result as the model with 135,000, and a better result than the model with 1.7 million. This suggests that the MIR 2026 is primarily an exam of pattern recognition and factual knowledge, where "deep thinking" doesn't add marginal value.
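A quick sanity check on the six models in the table points the same way: the correlation between reasoning tokens and net score is not positive. This is purely illustrative arithmetic over a tiny sample (Python 3.10+ for `statistics.correlation`):

```python
from statistics import correlation  # Pearson's r, available since Python 3.10

# (reasoning tokens per exam, net score) for the six models in the table above.
runs = {
    "Gemini 3 Flash": (0, 198.67),
    "o3": (71_000, 198.67),
    "GPT-5": (135_000, 198.67),
    "GPT-5.1 Chat": (6_000, 197.33),
    "o1": (146_000, 197.33),
    "o3 Deep Research": (1_741_000, 197.33),
}

tokens = [t for t, _ in runs.values()]
nets = [n for _, n in runs.values()]
print(f"correlation(reasoning tokens, net score) = {correlation(tokens, nets):.2f}")
# ≈ -0.45: on this sample, more reasoning tokens do not buy a higher net score.
```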
7.4. Confidence: All Confident, All Correct
The average confidence reported by the Top 10 models is consistently close to 100%. This indicates that modern models not only get it right, but know they're getting it right. Confidence calibration is a crucial indicator for clinical applications: a model that says "I'm 100% sure" and gets it right 99.5% of the time is extraordinarily reliable.
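A minimal way to quantify calibration, using illustrative data shapes rather than MedBench's actual logs, is to compare the mean stated confidence with the empirical accuracy; a gap of zero means perfect calibration.

```python
def calibration_gap(confidences: list[float], correct: list[bool]) -> float:
    """Mean stated confidence minus empirical accuracy (0.0 = perfectly calibrated)."""
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_conf - accuracy

# Example: 200 answers, all reported at 100% confidence, 199 of them correct.
confidences = [1.0] * 200
correct = [True] * 199 + [False]
print(f"calibration gap: {calibration_gap(confidences, correct):+.3f}")  # +0.005 (slightly overconfident)
```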
8. AI vs. Humans: The Gap Widens
Comparison between the best AI score and the best known human result per exam session. MIR 2026: human result pending official publication
The historical evolution is unequivocal:
- MIR 2024: The best AI surpassed the best human by 7 net points (193.67 vs 186.67). AI led by 3.7%.
- MIR 2025: The gap jumped to 25 net points (190.67 vs 165.67). AI led by 15.1%.
- MIR 2026: With 198.67 net and the human result still pending official publication[9], we project an even larger gap.
Even in the hypothetical case that the best human on the MIR 2026 matched the historical human record of 190 correct answers (MIR 2024), their net score would depend on the number of errors. Assuming optimal performance of 190 correct and 10 errors (186.67 net), the gap with AI would be 12 net points.
The question is no longer whether AI is better than humans on the MIR. The question is how much better.
9. Historical Evolution: Three Years of Benchmarking
AI accuracy evolution vs. best human in MIR (2024-2026). MIR 2026: human result pending official publication
The MIR 2025, considered the most difficult of the three years analyzed (long statements, "testament" questions, high cognitive load), caused a temporary drop in the accuracy of all models. However, the general trend is clear:
| Metric | MIR 2024 | MIR 2025 | MIR 2026 |
|---|---|---|---|
| Best accuracy | 97.5% | 96.5% | 99.5% |
| Top 5 average | 97.5% | 96.0% | 99.3% |
| Top 10 average | 97.5% | 95.8% | 99.2% |
| Models >95% | 18 | 11 | 58 |
| Models >90% | 68 | 52 | 119 |
| Models evaluated | 291 | 290 | 290 |
The MIR 2026 leap is explained by the convergence of two factors: the continuous improvement of models (especially the GPT-5.x and Gemini 3 generation) and the lower relative difficulty of the exam.
10. The Power Map: Who Dominates the Benchmark?
Provider distribution in the MIR 2026 benchmark Top 20
OpenAI numerically dominates the Top 20 with 11 models, reflecting its strategy of proliferating variants (GPT-5, GPT-5.1, GPT-5.2, Chat, Codex, Pro, Image versions, etc.).
Google places 6 models with an opposite strategy: fewer variants but more differentiated (Flash vs Pro, different versions of Gemini 2.5 and 3).
Anthropic places 3 models in the Top 20 (Claude Opus 4.5 at #14, Claude Opus 4.6 at #15 and Claude Opus 4.1 at #18), confirming its position as the third relevant player.
However, quality over quantity favors Google: with 6 models in the Top 20, it places #1 (Gemini Flash) and four models in the top 15. OpenAI needs 11 models to dominate numerically.
11. Final Reflections: What Does All This Mean?
For the medical community
The MIR 2026 marks a turning point. An AI system that gets 99.5% of an exam designed to select the country's best doctors right is not a technological curiosity: it's a paradigm shift.
This doesn't mean AI will replace doctors. The MIR evaluates theoretical knowledge in test format, not clinical skills like empathy, patient communication, physical examination, or decision-making under extreme uncertainty. But it does demonstrate that AI can be an extraordinary ally as a diagnostic support system and as a training tool.
For the AI community
The victory of a Flash model over frontier models that cost up to 660 times more forces a rethinking of fundamental assumptions:
- Brute parameter scaling has diminishing returns in well-defined factual knowledge domains.
- Architectural efficiency matters more than size in many real contexts.
- Current medical benchmarks may be reaching their ceiling as a measure of AI capability. When 3 models approach 100%, the exam stops discriminating.
For the future of MedBench
Given results so close to perfection, our benchmark must evolve. We're working on:
- Multimodal evaluations with clinical images and imaging tests
- Reasoning quality metrics, not just the final answer
- Complex clinical case benchmarks that require information integration across multiple steps
- Evaluation of hallucinations and calibrated confidence in contexts of uncertainty
At Medical Benchmark we will continue to document and analyze the evolution of artificial intelligence in medicine with rigor, transparency, and independence. All data is available on our rankings platform.
Notes and References
- The best known human result in recent MIR history is 190 correct answers and 10 errors (MIR 2024), equivalent to 186.67 net points. The AI's 199 correct answers surpass this record by 12 net points.
- Best human result data for MIR 2025 obtained from official publications of the Ministry of Health.
- MIR prep academies publish their provisional corrections hours after the exam. Models with web access like Sonar Deep Research could access these answers during the evaluation.
- Google Blog: Gemini 3 Flash: frontier intelligence built for speed (December 2025)
- Engadget: Google's Gemini 3 Flash model outperforms GPT-5.2 in some benchmarks (December 2025)
- Barclays Private Bank: AI in 2026: Smarter, not bigger
- Rumors circulated on social media and MIR candidate forums. There is no confirmed public evidence that the Ministry of Health used generative AI tools in the preparation of MIR 2026 exam questions.
- Luengo Vera, Ferro Picon, et al.: Evaluating LLMs on the Spanish MIR Exam: A Comparative Analysis 2024/2025 (arXiv, 2025)
- According to the official call, the Ministry of Health has until February 24, 2026 to publish the definitive results with ranking numbers. Given the context of administrative incidents in this edition, it is possible that the deadline will be exhausted.