The Cathedral and the Bazaar: Open Source vs Proprietary in MIR 2026

The top 33 positions in the MIR 2026 ranking are all proprietary models. We analyze the gap between open and closed models, the real taxonomy of open source in AI, and why RAG outperforms fine-tuning for customizing medical AI.

MedBench Team · February 9, 2026 · 18 min read
MIR 2026 · Open Source · Open Weights · Llama 4 · DeepSeek · Qwen · RAG

In 1999, Eric S. Raymond published The Cathedral and the Bazaar, an essay that changed the history of software.[1] His thesis was simple: the closed development model (the cathedral, where a select group designs in silence) cannot compete in the long run with the open model (the bazaar, where thousands of developers collaborate in public). Linux proved him right. Apache, Firefox, Android, Kubernetes -- the bazaar won the software war.

Twenty-six years later, artificial intelligence is fighting the same battle. But the data from the MIR 2026 suggest that, at least today, the cathedral holds a crushing advantage. And that many models that proclaim themselves part of the "bazaar" are, in reality, cathedrals with their doors slightly ajar.


1. The Wall of 33

The most striking finding from our benchmark with 290 models evaluated is this: the top 33 positions in the MIR 2026 ranking are all proprietary models. Not a single open one. Not one.

Pos. | Model | Correct | Accuracy | Cost | Type
#1 | Gemini 3 Flash | 199/200 | 99.5% | 0.34 EUR | Proprietary
#2 | o3 | 199/200 | 99.5% | 1.94 EUR | Proprietary
#3 | GPT-5 | 199/200 | 99.5% | 2.05 EUR | Proprietary
#4 | GPT-5.1 Chat | 198/200 | 99.0% | 0.65 EUR | Proprietary
#5 | GPT-5 Codex | 198/200 | 99.0% | 0.89 EUR | Proprietary
... | ... | ... | ... | ... | ...
#33 | o4 Mini High | 194/200 | 97.0% | 1.95 EUR | Proprietary
#34 | Llama 4 Maverick | 194/200 | 97.0% | 0.11 EUR | Open Weights

The gap between the best proprietary model and the best open weights model is 5 questions and 2.5 percentage points of accuracy. In net score (with MIR penalty), the difference is 6.67 net points: 198.67 vs. 192.00.

For a MIR exam candidate, that difference is equivalent to ~250 positions in the ranking. For a researcher, it is the difference between a system that borders on perfection and one that is "merely" excellent.
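For readers who want to reproduce the net scores quoted above: the standard MIR scoring rule deducts one third of a point for each wrong answer, which matches the figures here (199 correct with 1 error gives 198.67; 194 correct with 6 errors gives 192.00). A minimal sketch, assuming every question was answered:

```python
def mir_net_score(correct: int, total: int = 200, unanswered: int = 0) -> float:
    """Net MIR score: each wrong answer subtracts one third of a point."""
    wrong = total - correct - unanswered
    return correct - wrong / 3

print(round(mir_net_score(199), 2))  # 198.67 -- best proprietary (1 error)
print(round(mir_net_score(194), 2))  # 192.0  -- best open weights (6 errors)
```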


2. The Battlefield Map

[Chart] Top MIR 2026 models: the top 33 positions are all proprietary (purple); the first open weights model (green) appears at #34.

The chart speaks for itself. The purple zone (proprietary) dominates the top positions without a single crack. The green zone (open weights) appears from position 34 onward and becomes denser in the 40-70 range. The red line marks the boundary: the "wall of 33."

But the story is not just black and white. If we look at the numbers:

  • Top 10: 0 open weights (0%)
  • Top 20: 0 open weights (0%)
  • Top 50: 6 open weights (12%)
  • Top 100: 35 open weights (35%)
  • Total: 175 open weights out of 290 models (60%)

Open models are the majority in volume but a minority among the elite. It is like track and field: thousands of amateur runners, but the 33 who break 2:03 in the marathon are all high-performance professionals with the largest training budgets.


3. The Open Source Illusion: A Taxonomy for Non-Experts

Before going further, we need to clarify a misunderstanding that pollutes the debate: most "open source" models are not open source. They are open weights.

The difference matters. A lot.

In October 2024, the Open Source Initiative (OSI) published the first official definition of what "open source" means when applied to AI models.[2] According to this definition, a model is open source if and only if it publishes:

  1. The model weights (freely downloadable and usable)
  2. The training code (scripts, configuration, hyperparameters)
  3. The training data (or a description sufficient to reproduce them)
  4. Documentation of the complete process

  • Proprietary: closed code, closed weights, undisclosed training data. Only accessible via paid API. Secret recipe: you can eat at the restaurant, but you don't know the ingredients or how it's cooked. Examples: GPT-5, Gemini 3, Claude Opus 4.6, Grok 4.
  • Open Weights: downloadable weights, but training data and code not published. You can use the model, not reproduce it. You get the prepared dish: you can reheat and serve it, but you don't know the exact recipe. Examples: Llama 4, DeepSeek R1, Qwen3, Mistral Large.
  • Open Source (OSI): weights, code, data, and training process published. Meets the OSI v1.0 definition. Fully reproducible. Full recipe published: ingredients, quantities, temperatures, and times. Anyone can reproduce it. Examples: OLMo 2 (AllenAI), Pythia (EleutherAI), BLOOM.

AI model taxonomy by openness. Based on the OSI v1.0 definition (Open Source Initiative, October 2024).

The cooking recipe analogy explains it well:

  • Proprietary = you can eat at the restaurant, but the recipe is secret. You cannot replicate the dish at home.
  • Open weights = you are given the prepared dish. You can reheat it, serve it, even add spices. But you do not know the exact ingredients, the quantities, or the cooking times.
  • OSI open source = you are given the complete recipe, with ingredients, quantities, temperatures, and times. Anyone can reproduce the dish identically.

How many models in the top 100 of our benchmark meet the full OSI definition? Fewer than 5. The OLMo models from AllenAI, some models from EleutherAI... and little else. Llama 4, DeepSeek R1, Qwen3, Mistral -- all are open weights, not open source. They are cathedrals that have opened their doors so you can see the nave, but the architect's blueprints remain under lock and key.

This does not diminish their merit. Open weights are extraordinarily useful: they enable local execution, weight inspection, fine-tuning, and deployment without API dependency. But calling them "open source" is technically incorrect and creates false expectations about reproducibility.
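To make the distinction concrete, here is a minimal sketch that encodes the OSI checklist from this section as code. The field names are ours, and the real definition has more nuance than four booleans; treat it as an illustration, not a compliance tool.

```python
from dataclasses import dataclass

@dataclass
class ModelRelease:
    weights_published: bool        # freely downloadable and usable
    training_code_published: bool  # scripts, configuration, hyperparameters
    training_data_published: bool  # or a description sufficient to reproduce it
    process_documented: bool       # documentation of the complete process

def openness_category(m: ModelRelease) -> str:
    """Classify a release into the three categories used in this article."""
    if not m.weights_published:
        return "Proprietary"
    if (m.training_code_published and m.training_data_published
            and m.process_documented):
        return "Open Source (OSI)"
    return "Open Weights"

# A typical "open" release today: weights published, everything else withheld.
print(openness_category(ModelRelease(True, False, False, False)))  # Open Weights
```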


4. The Champions of the Bazaar

That said, the open weights models in MIR 2026 are impressive. Let us review the main families:

Meta: Llama 4 Maverick (#34)

The undisputed champion of the open world. 194 correct answers (97% accuracy) for 0.11 EUR for the full exam. It is the model with the best quality-to-price ratio in the entire ranking -- open or closed. To reach its accuracy level in the proprietary world, the cheapest option is Grok 4.1 Fast at 0.15 EUR: 36% more expensive.

Llama 4 Maverick uses a Mixture of Experts (MoE) architecture with 400B total parameters but only 17B active per token. It is an efficient giant. Its smaller sibling, Llama 4 Scout, achieves 90% at just 0.06 EUR -- probably the cheapest model in the world with professional medical-level performance.
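As a rough illustration of why a 400B-parameter MoE model can activate only ~17B parameters per token, here is a toy routing sketch. The expert count, dimensions, and routing function are invented for the example and are not Llama 4's actual configuration.

```python
import numpy as np

def moe_layer(x, experts, router_w, top_k=2):
    """Toy Mixture-of-Experts layer: only the top_k experts run per token."""
    logits = router_w @ x                       # router scores one token embedding
    chosen = np.argsort(logits)[-top_k:]        # indices of the top-k experts
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                        # softmax over the selected experts only
    # Every other expert stays idle for this token, so the active parameters
    # are a small fraction of the total parameter count.
    return sum(g * experts[i](x) for g, i in zip(gates, chosen))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [(lambda W: (lambda v: np.tanh(W @ v)))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
print(moe_layer(rng.normal(size=d), experts, router_w).shape)  # (16,)
```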

DeepSeek

The Chinese startup that shook the industry in January 2025 with R1 and its reasoning-focused approach. In MIR 2026, its models compete in the 94-97% segment alongside the other large Chinese labs, and DeepSeek stands out for publishing detailed papers about its training process -- coming closer to the spirit of open source than most competitors.[3]

Qwen (Alibaba)

The largest family, with 38 models in our benchmark, its best variants competing in the 94-97% segment. Qwen3 is Alibaba's MoE series, with flexible parameter activation and native support for reasoning (thinking mode).[4]

Mistral

The French company continues its tradition of efficient open weights models.

StepFun

The surprise: StepFun Step 3.5 Flash (#64) achieves 189 correct (94.5%) at a cost of 0.00 EUR -- literally free through OpenRouter. It is a Chinese model with reasoning tokens that offers professional medical-level performance at no cost whatsoever.


5. The Gap That Narrows (But Does Not Fully Close)

[Chart] All 290 models evaluated on the MIR 2026 by launch date. Each dot is a model: red = proprietary, blue = open weights, green = open source (OSI). More recent models tend to achieve higher net scores, but proprietary models maintain the upper edge.

The chart shows the 290 models evaluated in MIR 2026 by release date. The Y-axis is the net score (MIR net points, after deducting the penalty for errors). The colors distinguish three categories: red for proprietary, blue for open weights, and green for open source (OSI). The trend is clear: more recent models achieve better net scores, but proprietary models (red) always maintain the upper edge.

[Chart] Evolution of the gap between the best proprietary model and the best open weights model across three MIR editions. The gap shrank from 12 to 5 questions.

If we look only at the best from each category:

Edition | Best proprietary | Best open weights | Gap
MIR 2024 | 195 (Sonar Deep Research) | 183 (DeepSeek V3) | 12
MIR 2025 | 193 (Gemini 3 Flash) | 188 (Llama 4 Maverick) | 5
MIR 2026 | 199 (Gemini 3 Flash / o3 / GPT-5) | 194 (Llama 4 Maverick) | 5

The gap narrowed dramatically between 2024 and 2025 (from 12 to 5 questions) but has plateaued at 5 between 2025 and 2026. Proprietary models made a big leap (from 193 to 199) and open ones matched it (from 188 to 194), so the two camps advanced in parallel.

Will the gap close completely? Probably not soon. The three models that reached 199/200 (Gemini 3 Flash, o3, GPT-5) were trained with compute budgets that no open weights project can currently match. When the ceiling is 200 questions and you are already at 199, each additional question costs exponentially more.


6. The Chinese Ecosystem: DeepSeek, Qwen, and the Third Way

[Chart] Chinese models on the MIR 2026. Qwen (Alibaba), DeepSeek, Moonshot, Zhipu (GLM), ByteDance (Seed), and StepFun compete strongly in the 94-97% segment.

China deserves a section of its own. Of the 175 open weights models evaluated, a significant proportion comes from Chinese labs: Alibaba (Qwen), DeepSeek, Zhipu (GLM), ByteDance (Seed), MoonshotAI (Kimi), and StepFun.

What is notable is not just their quantity but their diversity of approaches:

  • Qwen bets on massive MoE models with flexible reasoning
  • DeepSeek differentiates itself by publishing detailed papers and optimizing training costs
  • Zhipu (GLM 4.7) combines open weights with reasoning at a competitive cost
  • ByteDance (Seed 1.6) enters with strength from its expertise in recommendation systems
  • StepFun offers free models with reasoning -- a business model that defies market logic

This ecosystem represents a "third way": neither the closed cathedral of Silicon Valley (OpenAI, Anthropic, Google) nor the pure bazaar of Western open source (EleutherAI, AllenAI). It is a model where large tech corporations publish weights as a platform strategy, keeping their data and training processes as a competitive advantage.


7. Cost vs. Accuracy: The Invisible Advantage

[Chart] Cost vs. accuracy on the MIR 2026. Open weights models (green) dominate the lower-left zone: high accuracy at low cost. Llama 4 Maverick (97%, 0.11 EUR) is the sweet spot.

Here lies the story that position-based rankings do not tell. If we shift the criterion from "best" to "best per euro spent," the landscape changes radically.

Open weights dominate the lower-left corner of the chart: high accuracy, low cost. Some data points:

For a hospital that needs to process thousands of daily queries, the difference between 0.11 EUR and 2.05 EUR per query is the difference between a viable project and a prohibitive one. At 1,000 daily queries, Llama 4 Maverick costs 110 EUR/day. GPT-5 costs 2,050 EUR/day. Over a year: 40,150 EUR vs. 748,250 EUR.
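The arithmetic behind those figures, as a quick sanity check (the 1,000-queries-per-day workload is hypothetical; the per-query prices are the full-exam costs from the ranking table):

```python
QUERIES_PER_DAY = 1_000  # hypothetical hospital workload

def annual_cost(eur_per_query: float, days: int = 365) -> float:
    return eur_per_query * QUERIES_PER_DAY * days

print(f"{annual_cost(0.11):,.0f} EUR/year")  # Llama 4 Maverick: 40,150
print(f"{annual_cost(2.05):,.0f} EUR/year")  # GPT-5: 748,250
```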

And that assumes you use the cloud API. If you deploy Llama 4 Maverick on your own servers, the marginal cost per query approaches zero (only electricity and hardware amortization).


8. The Temptation of Fine-Tuning

This is where many medical AI projects stumble. The reasoning is appealing:

If we have the model weights, we can fine-tune it with our clinical data and create a specialized model that outperforms the generalists.

It sounds logical. In practice, it is wrong.

Risk | Assessment | Implication
Catastrophic forgetting | High risk | The model loses general knowledge when specializing. It may perform worse in areas it previously mastered.
Training data | Scarce and expensive | High-quality annotated clinical data is scarce, requires ethical approval, and suffers from selection bias.
Training cost | High | Even fine-tuning a 70B-parameter model requires A100/H100 GPUs for hours to days.
Maintenance | Ongoing | Each new base model requires repeating the fine-tuning. Llama 4 today, Llama 5 tomorrow -- the cycle never ends.
Real-world results | Disappointing | Studies show that RAG outperforms fine-tuning in most medical question-answering tasks.

Risks of fine-tuning language models for medical applications

The fundamental problem is that fine-tuning modifies the model's weights -- its "internal knowledge" -- with a relatively small amount of specialized data. This creates an unstable equilibrium: if you fine-tune too much, the model loses generality (catastrophic forgetting); if you fine-tune too little, you gain no significant specialization.


9. RAG and Agents: The Alternative That Works

Recent research points in a different direction: do not modify the model, but orchestrate it.

RAG (Retrieval-Augmented Generation) involves connecting the model to an external knowledge base. Instead of "teaching" it medicine by injecting data into its weights, you give it access to a search system that retrieves relevant information in real time. The model does not "know" the answer -- it finds and synthesizes it.
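In code, the pattern is simple. A minimal, deliberately vendor-neutral sketch: `retrieve` stands in for whatever vector store or search index holds the knowledge base, and `generate` for whatever model endpoint is called.

```python
from typing import Callable

def rag_answer(question: str,
               retrieve: Callable[[str, int], list[str]],
               generate: Callable[[str], str],
               k: int = 4) -> str:
    """Retrieval-Augmented Generation: retrieve evidence, then synthesize."""
    passages = retrieve(question, k)            # 1. fetch the k most relevant passages
    context = "\n\n".join(passages)             # 2. ground the prompt in that evidence
    prompt = (
        "Answer the clinical question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)                     # 3. the model synthesizes; it does not recall
```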

Medical agents go one step further: they orchestrate multiple tools (search, clinical calculators, drug databases, clinical practice guidelines) to solve complex queries.
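A toy sketch of that orchestration step, with keyword routing standing in for the LLM-driven planning a real medical agent would use; the three tools are placeholders, not real services.

```python
def dose_calculator(query: str) -> str:
    return "dose calculation (placeholder)"

def guideline_search(query: str) -> str:
    return "guideline excerpt retrieved via RAG (placeholder)"

def interaction_checker(query: str) -> str:
    return "drug interaction report (placeholder)"

TOOLS = {
    "dose": dose_calculator,
    "guideline": guideline_search,
    "interaction": interaction_checker,
}

def agent_answer(query: str) -> str:
    """Pick a tool, call it, and answer grounded in its output."""
    for keyword, tool in TOOLS.items():
        if keyword in query.lower():
            return tool(query)
    return guideline_search(query)  # default: fall back to the guideline index

print(agent_answer("Any interaction between amiodarone and simvastatin?"))
```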

[Chart] RAG vs. fine-tuning in medical tasks. Data from MDPI Bioengineering 2025 (BLEU), a PMC systematic review (hallucinations), and medRxiv 2025 (agents).

The data is compelling:

  • BLEU Score: RAG achieves 0.41 vs. 0.063 for fine-tuning (6.5x better) in medical question-answering tasks.[5]
  • Hallucinations: RAG reduces hallucinations to 0% in contexts with reference data, vs. 12.5% for fine-tuning on out-of-distribution medical questions.[6]
  • Medical agents: Agentic systems with RAG achieve a median accuracy of 93% in clinical tasks, vs. 57% for non-agentic models -- an improvement of +36 percentage points.[7]

The explanation is intuitive: in medicine, knowledge changes constantly. New clinical guidelines, new drugs, new evidence. A fine-tuned model has its knowledge "frozen" in its weights. A RAG system updates its knowledge base in real time. It is the difference between a textbook (which becomes outdated) and a library with subscriptions to every scientific journal.


10. The Elephant in the Room: Privacy and Sovereignty

There is an argument in favor of open weights that no benchmark can capture: technological sovereignty.

When a hospital sends patient data to the OpenAI or Google API, that data leaves the institution's control. It does not matter how many clauses the data processing agreements contain -- the GDPR (Art. 22) and HIPAA demand guarantees that a cloud API cannot provide at the same level as an on-premises deployment.[8]

With open weights, a hospital can (see the sketch after this list):

  1. Deploy Llama 4 Maverick on its own servers -- no data leaves the building
  2. Connect it via RAG to its internal clinical guidelines -- customization without fine-tuning
  3. Fully audit it -- weight and behavior inspection
  4. Comply with European regulations -- data never crosses borders
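A minimal sketch of points 1 and 2, assuming the model is served locally behind an OpenAI-compatible endpoint. The URL, model name, and `retrieve_guidelines` helper are placeholders, not any specific product's configuration.

```python
from openai import OpenAI

# Local inference server: requests never leave the hospital's network.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

def ask_with_guidelines(question: str, retrieve_guidelines) -> str:
    """Ground the locally hosted model in the hospital's own clinical guidelines."""
    context = "\n\n".join(retrieve_guidelines(question))  # internal RAG index
    response = client.chat.completions.create(
        model="llama-4-maverick",  # whatever name the local server registers
        messages=[
            {"role": "system",
             "content": f"Answer using only these clinical guidelines:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```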

This is especially relevant in Europe, where the AI Act and medical device regulation (MDR) impose strict traceability and control requirements that are easier to meet with local deployments.

For countries like Spain, where the healthcare system is public and handles data for 47 million people, technological sovereignty is not a luxury: it is an obligation. An open weights model running on public infrastructure (such as Spain's RES supercomputing centers) offers a path more compatible with this obligation than a permanent dependency on American APIs.

That said, there is a third way that combines the best of both worlds: using high-performance proprietary models on clouds where the client controls the datacenter location and has contractual guarantees that the information never reaches the provider. Services like Amazon Bedrock (which offers Anthropic's models, among others) allow deploying Claude in a specific European region, with client-managed encryption and the guarantee that data is not used to train models or shared with third parties. For a hospital that needs the accuracy of a top proprietary model without giving up control of its data, this architecture offers a viable balance between performance and sovereignty.


11. MedGemma: The Bridge Between Worlds

In June 2025, Google took a step that blurs the boundary between cathedral and bazaar: it published MedGemma, a family of open weights models specifically trained for medicine.[9]

MedGemma 27B, based on Gemma 3, achieves 87.7% on MedQA (the reference medical benchmark in English) -- a result that would have been a world record just 18 months earlier. Google published it with downloadable weights, training process documentation, and tools for additional fine-tuning.

Why would a proprietary giant publish an open medical model? The answer has multiple layers:

  • Regulatory legitimacy: Offering auditable models facilitates the approval of AI-based medical products
  • Ecosystem strategy: If MedGemma becomes the standard for medical AI, Google captures value at the infrastructure layer (TPUs, Vertex AI)
  • Open research: Medical advances accelerate when the community can iterate on a shared base model

It is not the only example. Meta has published guidelines for medical use of Llama.[10] Alibaba has funded medical research with Qwen. The trend is clear: the major labs are converging toward a hybrid model where the base model is open and value is captured at the services layer.


12. Conclusions: The Cathedral Is No Longer Alone

After analyzing 290 models in MIR 2026, these are our conclusions:

1. The gap exists but is closing. The top 33 positions are proprietary, but the difference between the best closed model (199/200) and the best open one (194/200) is only 5 questions. In 2024, it was 12.

2. Taxonomy matters. Most "open source" models are actually open weights. Only a handful meet the OSI v1.0 definition. This has practical implications: you can use an open weights model, but you cannot reproduce its training.

3. Fine-tuning is not the answer. The data shows that RAG and agentic systems outperform fine-tuning in medical tasks: better response quality, zero hallucinations, and +36pp accuracy with agents. The winning strategy is intelligent orchestration, not weight modification.

4. The real advantage of open weights is sovereignty. The ability to run the model on your own servers, without dependency on external APIs, in compliance with GDPR and healthcare regulation -- that is priceless.

5. The future is not cathedral vs. bazaar. It is open base model + intelligent orchestration + proprietary data. A hospital that deploys Llama 4 Maverick with RAG over its clinical guidelines combines the best of both worlds: the power of a 400B parameter model with the customization of its own data, without fine-tuning and without sending sensitive information to third parties.

Eric S. Raymond was right: the bazaar eventually overtakes the cathedral. But in medical AI, the 2026 bazaar is not a chaotic fair of individual contributions. It is an ecosystem where Meta, Alibaba, DeepSeek, and Google publish entire cathedrals -- and the community furnishes them, connects them, and puts them to work.

The cathedral is no longer alone. And that, for medicine, is excellent news.


Notes and References

  1. Raymond, E. S. (1999). The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. O'Reilly Media. The original essay was presented in 1997 and published as a book in 1999.
  2. Open Source Initiative (2024). The Open Source AI Definition v1.0. Published October 28, 2024. opensource.org/ai/open-source-ai-definition
  3. DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948. One of the most detailed papers on the training process of a reasoning model.
  4. Qwen Team (2025). Qwen3 Technical Report. qwenlm.github.io/blog/qwen3. Description of the MoE architecture and thinking mode.
  5. Soman, S. et al. (2025). Comparative Evaluation of RAG and Fine-Tuning for Medical Question Answering. MDPI Bioengineering, 12(2), 123. RAG achieved BLEU 0.41 vs. 0.063 for fine-tuning in medical responses.
  6. Pal, A. et al. (2025). A Systematic Review of Retrieval-Augmented Generation in Medical AI. PMC. RAG eliminated hallucinations (0%) when contextual reference documents were provided.
  7. Schmidgall, S. et al. (2025). AgentMD: A Systematic Review of AI Agents in Medicine. medRxiv. Medical agents improved accuracy by a median of +36 percentage points over non-agentic models.
  8. General Data Protection Regulation (GDPR), Art. 22: Automated individual decision-making. The GDPR establishes the right not to be subject to decisions based solely on automated processing, with regulated exceptions.
  9. Google Health AI (2025). MedGemma: Open Models for Medical AI. June 2025. MedGemma 27B achieved 87.7% on MedQA with open weights based on Gemma 3.
  10. Meta AI (2025). Llama for Healthcare: Best Practices and Safety Guidelines. Official guidelines for using Llama in healthcare applications.