The Swiss Army Knife and the Scalpel: Why the Best Coding Models Fail the MIR


Claude Opus 4.6 and GPT-5.2-Codex are two of the most advanced AI coding models ever built. But on the MIR 2026, a Flash model costing just 0.34 euros per exam humiliates them. An analysis of the agentic paradox, with data from 290 models.

MedBench Team · February 6, 2026 · 15 min read
MIR 2026 · Agentic Models · Claude Opus 4.6 · GPT-5.2-Codex · Gemini Flash

On February 5, 2026, artificial intelligence had a day that only comes once a decade. At 10:00 AM Pacific Time, Anthropic published a blog post with a headline that seemed pulled from science fiction: "Claude Opus 4.6: the model that coordinates teams of AI agents to solve problems no single model could tackle".[1] Forty minutes later, OpenAI fired back: "Introducing GPT-5.3-Codex, the first model that partially built itself".[2]

The tech press headlines were predictable: "The AI agent wars," "The model that codes like a team of 10 engineers," "The singularity has a name." On Terminal-Bench 2.0 — the reference benchmark for agentic programming tasks — Claude Opus 4.6 set an absolute record with 65.4%, shattering the previous high of 57.2% held by its predecessor, Opus 4.5.[3] On SWE-Bench Pro, GPT-5.3-Codex also set a new high.[4]

But here at Medical Benchmark, the data tells a very different story.

While the world celebrated the arrival of the most advanced coding models in history, we already had the results of 290 models evaluated on the MIR 2026. And the verdict is uncomfortable: the best agentic coding models are mediocre at medicine. A "Flash" model costing 34 cents crushes them all.

And as for GPT-5.3-Codex, OpenAI's brand-new release: we were unable to evaluate it. It is only available through ChatGPT (app, CLI, and IDE extensions). It has no public API.[5] At MedBench we evaluate models through the OpenRouter API, so GPT-5.3-Codex is, for now, the great absentee from our ranking.


1. The Code Gladiators

Before showing the data, it helps to understand what these models are and why they matter. The three protagonists of this story share a common trait: they are designed to be code agents — AI systems that don't just answer questions, but autonomously execute complex programming tasks, coordinating tools, reading files, running tests, and debugging errors.

Claude Opus 4.6 (Anthropic)

Anthropic's flagship. Launched on February 5, 2026. One-million-token context window. Ability to coordinate teams of specialized agents ("agent teams"). Record on Terminal-Bench 2.0 with 65.4%. Designed for adaptive reasoning — it can decide how much to "think" before responding.[1]

Claude Opus 4.5 (Anthropic)

The previous flagship. For months it was the most advanced coding model on the market. 57.2% on Terminal-Bench. It remains extraordinarily capable, but Opus 4.6 surpasses it on every programming metric.

GPT-5.2-Codex (OpenAI)

Launched in December 2025 as "OpenAI's most advanced agentic coding model." Optimized for long contexts, reliable tool calling, and multi-step tasks. Top 3 on SWE-Bench Verified.[6]

GPT-5.3-Codex (OpenAI) — The Great Absentee

Launched on the same day as Opus 4.6. According to OpenAI, it is the first model whose training used early versions of itself for debugging and evaluation. Records on SWE-Bench Pro and other coding benchmarks. But it is only available via ChatGPT — it has no API endpoint, making its evaluation on MedBench impossible.[5]

What all these models have in common: they are optimized for multi-step tasks, tool use, and agent coordination. They are digital Swiss Army knives: they can cut, screw, open cans, and file. The question is: can they also operate?


2. The MIR Verdict

Comparison of agentic/code models vs. generalist models on the MIR 2026. Agentic models (orange) perform worse than generalists (blue) despite being more expensive.

The numbers need no interpretation. They speak for themselves:

Model | Type | Position | Correct | Cost
Gemini 3 Flash | Generalist | #1 | 199/200 | 0.34 €
o3 | Reasoning | #2 | 199/200 | 1.94 €
GPT-5 | Reasoning | #3 | 199/200 | 2.05 €
GPT-5.1 Chat | Generalist | #4 | 198/200 | 0.65 €
Claude Opus 4.5 | Agentic | #13 | 197/200 | 4.62 €
Claude Opus 4.6 | Agentic | #15 | 197/200 | 4.89 €
GPT-5.2-Codex | Agentic | #26 | 195/200 | 1.67 €

The devastating data point: Claude Opus 4.6 costs 14 times more than Gemini Flash and gets 2 fewer questions right. GPT-5.2-Codex gets 4 fewer than a model that costs 5 times less. Between Flash (#1) and Opus 4.6 (#15) sit 13 other models, most of them generalists with no special optimization for code.


3. Coding Is Not Diagnosing

Ranking in coding benchmarks (Terminal-Bench/SWE-Bench) vs. MIR 2026 ranking. The inversion is clear: the best at coding (short orange bar) are mediocre in medicine (long blue bar) and vice versa.

The chart above reveals an almost perfect inversion: the models that dominate programming benchmarks are relegated in the MIR, and vice versa.

  • Claude Opus 4.6: #1 on Terminal-Bench → #15 on MIR
  • GPT-5.2-Codex: Top 3 on SWE-Bench → #26 on MIR
  • Gemini 3 Flash: Doesn't compete on coding benchmarks → #1 on MIR
  • GPT-5.1 Chat: OpenAI's "basic" model → #4 on MIR

Why does this inversion happen? The answer lies in the nature of the MIR. The medical exam is fundamentally a test of factual knowledge and clinical pattern recognition. The majority of its 200 questions require the model to identify a clinical presentation, recall a protocol, or recognize a diagnostic association. It does not require coordinating tools, writing code, or executing multi-step tasks.

A model optimized for agentic programming has dedicated a significant portion of its training to learning how to use terminals, debug code, and coordinate agents. That training does not help — and potentially hurts — when the task is simply to answer "what is the first-line treatment for community-acquired pneumonia?"


4. The Opus 4.6 Case: Born Yesterday, Already Diagnosed

Evolution of Claude Opus on the MIR 2026. Opus 4.6 improves in coding (Terminal-Bench) but does not surpass Opus 4.5 in medicine: same accuracy, higher cost and worse ranking.

The evolution of the Claude Opus family on MIR 2026 is particularly revealing:

Model | MIR Ranking | Correct | Cost | Time/question | Terminal-Bench
Opus 4 | #44 | 192/200 | 10.46 € | 28 s | 42%
Opus 4.1 | #20 | 196/200 | 11.10 € | 30 s | 52%
Opus 4.5 | #13 | 197/200 | 4.62 € | 13.4 s | 57%
Opus 4.6 | #15 | 197/200 | 4.89 € | 14.1 s | 65%

Each new version of Opus is objectively better at programming: Opus 4 → 4.1 → 4.5 → 4.6 shows a steady progression on Terminal-Bench (42% → 52% → 57% → 65%). But in medicine, Opus 4.6 not only fails to improve on 4.5, it actually ranks lower (position #15 vs. #13).

How is that possible? Opus 4.6 gets the same 197 questions right as Opus 4.5, but costs 0.27 € more per exam (4.89 € vs. 4.62 €). At MedBench, when accuracy is tied, the cheaper model wins — and Opus 4.6 loses that tiebreaker.
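For readers curious how that tiebreaker works mechanically, here is a minimal sketch of the kind of sort key involved. The helper and data structure are our own illustration (using the two tied Opus results from the table above), not MedBench's actual ranking code.

```python
# Illustrative tiebreaker: rank by correct answers (descending), then by cost (ascending).
models = [
    {"name": "Claude Opus 4.5", "correct": 197, "cost_eur": 4.62},
    {"name": "Claude Opus 4.6", "correct": 197, "cost_eur": 4.89},
]

def ranking_key(model: dict) -> tuple:
    # More correct answers first; when tied, the cheaper exam wins.
    return (-model["correct"], model["cost_eur"])

for position, model in enumerate(sorted(models, key=ranking_key), start=1):
    print(position, model["name"])
# 1 Claude Opus 4.5
# 2 Claude Opus 4.6   <- same accuracy, higher cost, lower rank
```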

The paradox is clear: Opus 4.6's greater agentic optimization provides zero benefit on a multiple-choice medical exam. Its one-million-token context window, its ability to coordinate agent teams, its adaptive reasoning — none of this helps when the task is choosing between A, B, C, or D on a cardiology question. It's like bringing a full surgical team to apply a band-aid.


5. The Fall of GPT-5.2-Codex: From Runner-Up to 26th Place

Evolution of the three OpenAI Codex models on the MIR (2024–2026). Bars show correct answers; labels show ranking. GPT-5.2-Codex (the most agentic) performs worse than its smaller siblings on the MIR 2026.

The story of GPT-5.2-Codex across three MIR exam cycles is a drama in three acts:

Exam Cycle | Position | Correct | Accuracy
MIR 2024 | #9 | 194/200 | 97.0%
MIR 2025 | #2 | 192/200 | 96.0%
MIR 2026 | #26 | 195/200 | 97.5%

Read that again: on MIR 2026, GPT-5.2-Codex got more questions right than ever (195 vs. 194 in 2024) and yet dropped 24 positions compared to 2025. How can you fall while getting more right?

Because everyone else improved far more. In 2025, 192 correct answers put you on the podium. In 2026, with 50 models exceeding 95% accuracy, 195 correct answers leave you in the pack.
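A toy calculation makes the point concrete. The function below is ours, and the score distributions are invented for illustration; only the 192 and 195 figures come from the table above.

```python
def mir_position(correct: int, field: list[int]) -> int:
    """Position = 1 + number of models in the field with strictly more correct answers."""
    return 1 + sum(1 for other in field if other > correct)

# Hypothetical fields: invented scores, roughly shaped like the two cohorts described above.
field_2025 = [196] + [191] * 10 + [185] * 60
field_2026 = [199] * 3 + [198] * 5 + [197] * 9 + [196] * 8 + [195] * 10 + [194] * 15 + [185] * 40

print(mir_position(192, field_2025))  # 2  -> podium in a weaker field
print(mir_position(195, field_2026))  # 26 -> mid-pack once ~50 models clear 95% accuracy
```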

And here is the most telling pattern: the "less agentic" versions of Codex models perform better on the MIR.

The more a Codex model is optimized for agentic coding capabilities, the worse it performs on medical knowledge. The pattern is consistent and troubling.


6. GPT-5.3-Codex: The Great Absentee

Launched on the same February 5 alongside Claude Opus 4.6, GPT-5.3-Codex is, according to OpenAI, the most advanced model ever created for programming. Its credentials are impressive: new records on SWE-Bench Pro, self-debugging capability, and the curious distinction of being "the first model that partially built itself."[2]

However, GPT-5.3-Codex does not appear in our ranking. The reason is simple: OpenAI has released it exclusively through ChatGPT — the desktop app, CLI, and IDE extensions. It has no public API endpoint.[5]

At MedBench, all models are evaluated through the OpenRouter API under controlled and identical conditions: same prompt, same temperature, same response format. Evaluating a model through a chat interface would introduce uncontrollable variables (system prompt, formatting, interface limitations) that would invalidate the comparison.
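As a rough sketch of what "controlled and identical conditions" means in practice, a single question can be sent through OpenRouter's OpenAI-compatible chat endpoint like this. The prompt wording, model ID, and helper function are illustrative assumptions, not our actual evaluation harness.

```python
import os
import requests

# OpenRouter exposes an OpenAI-compatible chat completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def ask_mcq(model_id: str, question: str, options: dict[str, str]) -> str:
    """Send one MIR-style multiple-choice question with a fixed prompt, temperature and format."""
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}) {text}" for letter, text in options.items())
        + "\nAnswer with a single letter."
    )
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model_id,  # e.g. "google/gemini-3-flash" -- illustrative slug, not necessarily the real one
            "temperature": 0,   # identical sampling settings for every model
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"].strip()
```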

When GPT-5.3-Codex gets API access — OpenAI has said "soon" — we will evaluate it immediately. But for now, it is the elephant in the room: probably the most powerful agentic model in the world, and we cannot measure it.

The question hanging in the air: if even GPT-5 Codex (a less advanced model) only manages #5 on the MIR, would GPT-5.3-Codex truly be capable of beating Gemini Flash? The data suggests not — but without measuring it, that remains speculation.


7. Why Does This Happen? The Science of the Trade-Off

Top 40 MIR 2026 models: total exam cost vs. accuracy. Agentic models (orange, bordered) do not reach the upper-left zone (cheap and accurate), which is dominated by Flash and generalist models. Real data from MedBench.

The scatter plot visually confirms what the individual data points already suggested: there is a negative correlation between agentic capability and medical accuracy. The models most optimized for code (right side) tend to perform worse on the MIR (lower area).

Why? Four complementary hypotheses explain it:

7.1. The Specialization Trade-Off

Training an LLM is a near-zero-sum game. The RLHF and fine-tuning cycles dedicated to improving tool calling, code execution, and agent coordination are cycles that are not spent consolidating factual medical knowledge.

The analogy is direct: a surgeon who spends years specializing in hand microsurgery does not thereby become a better neurosurgeon. In fact, they may lose generalist competencies through disuse. Agentic models are the digital equivalent: extraordinarily good at their specialty (code), but not necessarily better — and sometimes worse — outside it.

7.2. The Overthinking Curse

Recent research on "overthinking" in chain-of-thought reasoning suggests that thinking more is not always thinking better.[7] Agentic models are optimized to reason in many steps, decompose complex problems, and iterate on solutions. But on direct multiple-choice questions, this capability can be counterproductive.

An illustrative data point: Claude Opus 4.6 with 0 reasoning tokens gets 197/200 right. o3 Deep Research with 1.7 million reasoning tokens gets 198/200. One extra correct answer in exchange for 1.7 million tokens of "thinking". The marginal return of deep reasoning on multiple-choice medical questions is practically zero.

7.3. Tool Optimization Contaminates Knowledge

Training for tool calling (using tools, APIs, terminals) modifies the model's probability distribution in subtle but significant ways. A Codex model has been extensively trained to generate code, not to recall pharmacology. The model's internal representations are reorganized to prioritize syntactic patterns, APIs, and execution flows — at the potential expense of clinical patterns, therapeutic protocols, and diagnostic associations.

The MIR does not require tools. There are no files to read, tests to run, or agents to coordinate. It only requires memory and pattern recognition — precisely the capabilities that agentic training can erode.

7.4. The "Swiss Army Knife" Effect

A Swiss Army knife is an extraordinary tool for camping. It can cut bread, open cans, pull corks, and tighten screws. But no one would operate on a patient with one. For surgery, you need a scalpel: a simple, specialized tool that is extraordinarily precise at its single function.

Agentic models are digital Swiss Army knives: they can do many things well, but sacrifice depth for breadth. A Flash model that simply answers the question without overthinking — a scalpel — is more efficient for a multiple-choice exam than a model designed to coordinate teams of agents.

Parameter | MIR 2026 trend | Implication
Specialization Trade-Off | Strong | RLHF for code displaces medical knowledge. More agenticity → less factual accuracy.
Overthinking Curse | Moderate | Multi-step reasoning is counterproductive on direct MCQs. 1.7M tokens → +1 correct answer vs. 0 tokens.
Tool Calling Contamination | Probable | Training for code generation reorganizes internal representations, eroding clinical patterns.
Swiss Army Knife Effect | Clear | Breadth of capabilities sacrifices depth in specific domains. Flash > Opus on medical MCQs.

Summary of the four hypotheses on the agentic trade-off. The evidence suggests they are complementary, not mutually exclusive.


8. The Price of Complexity

Cost per correct answer on the MIR 2026. o1-pro costs 641x more per correct answer than Gemini Flash, with lower accuracy.

If the agentic models are not more accurate in medicine, are they at least efficient? The data says no. The cost per correct answer reveals the magnitude of the waste:

Model | Cost/correct | vs. Flash | Correct
Gemini 3 Flash | 0.0017 € | 1x | 199/200
GPT-5.1 Chat | 0.0033 € | 1.9x | 198/200
GPT-5.2-Codex | 0.0086 € | 5x | 195/200
Claude Opus 4.6 | 0.0248 € | 14.6x | 197/200
o1 | 0.112 € | 65.9x | 198/200
o3 Deep Research | 0.883 € | 519x | 198/200
o1-pro | 1.09 € | 641x | 197/200

The question is unavoidable: in a healthcare system with a limited budget, would you pay 14 times more per correct answer for 2 fewer correct answers? Or 641 times more for a model that still gets fewer questions right than Flash?
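Those ratios are simple arithmetic on the figures already shown. The helper below is our own illustration; the exam costs and scores come from the tables above.

```python
# Deriving the cost-per-correct figures from total exam cost and correct answers.
def cost_per_correct(total_cost_eur: float, correct: int) -> float:
    return total_cost_eur / correct

flash = cost_per_correct(0.34, 199)     # ≈ 0.0017 €
opus_46 = cost_per_correct(4.89, 197)   # ≈ 0.0248 €
codex_52 = cost_per_correct(1.67, 195)  # ≈ 0.0086 €

print(f"Opus 4.6 vs Flash:      {opus_46 / flash:.1f}x")   # ≈ 14.5x (the table's 14.6x presumably uses unrounded costs)
print(f"GPT-5.2-Codex vs Flash: {codex_52 / flash:.1f}x")  # ≈ 5.0x
```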

For a hospital looking to deploy AI as a diagnostic support tool, these numbers are decisive. If the goal is to maximize accuracy per euro spent, Gemini Flash is the optimal choice by an absurd margin. Agentic models have legitimate uses in complex medical settings (record integration, multi-step differential diagnosis), but for quick pattern-matching queries, they are an expensive solution to a cheap problem.


9. What This Means for Medical AI

The main lesson from this data is deceptively simple: you don't need the "best" AI model for medicine. You need the most appropriate one.

Agentic systems like Claude Opus 4.6 and GPT-5.2-Codex have their rightful place. If you need a system that reviews a 500-page clinical record, correlates lab results with symptoms, queries drug interaction databases, and generates a structured report — an agentic model is exactly what you need. That is its operating room.

But if you need to quickly determine whether a patient presenting with precordial pain, ST elevation, and elevated troponins is having a heart attack — there you need a scalpel, not a Swiss Army knife. And Gemini Flash, with its direct response in 4 seconds for 0.17 cents, is an extraordinarily sharp scalpel.

The importance of evaluating models in the specific domain of application cannot be overstated. Assuming the #1 model in programming will also be #1 in medicine is a mistake that, with MedBench's data on the table, no longer has any excuse. Every domain has its own rules and its own champions.


10. Conclusions: The Right Tool for the Job

The Swiss Army knife — Claude Opus 4.6, GPT-5.2-Codex — is an extraordinary tool. It can code like a team of engineers, coordinate agents, debug code, and automate complex workflows. On its home turf, it has no rival.

The scalpel — Gemini 3 Flash — does one thing: answer questions with devastating precision, at dizzying speed, for a negligible cost. On MIR 2026, where the task is exactly that, it needs nothing more.

Agentic models will revolutionize programming, automation, and probably dozens of industries. But medicine has its own rules. And on Spain's most important medical exam, a model costing 34 cents has once again proven that more expensive, bigger, and more complex does not always mean better.

The next time someone tells you that the world's best AI model will solve every problem, remember: it depends on the problem. A surgeon doesn't need a Swiss Army knife. They need a scalpel.

Explore the complete MIR 2026 rankings and compare all 290 evaluated models at MedBench Rankings.


Notes and References

  1. Anthropic Blog: Introducing Claude Opus 4.6. February 5, 2026.
  2. OpenAI Blog: GPT-5.3-Codex: The Most Advanced Coding Agent. February 5, 2026.
  3. Terminal-Bench 2.0 Leaderboard. Claude Opus 4.6 reached 65.4%, surpassing the previous record of 57.2% held by Opus 4.5. terminal-bench.com.
  4. SWE-Bench Pro Leaderboard. GPT-5.3-Codex sets a new high in autonomous resolution of real GitHub issues.
  5. GPT-5.3-Codex is only available through ChatGPT (app, CLI, and IDE extensions). OpenAI has indicated that API access will be available 'soon.' Without an API, it cannot be evaluated on MedBench under controlled conditions.
  6. OpenAI: GPT-5.2-Codex. Launched in December 2025.
  7. Research on 'overthinking' in chain-of-thought reasoning models shows diminishing returns with excessive chain-of-thought length on direct-answer tasks. See also: MedBench: 199 out of 200 for analysis of reasoning tokens vs. accuracy.
  8. The complete MIR 2026 results with 290 models are available at MedBench Rankings. Detailed methodology at our methodology section.