Documentation
What is MedicalBenchmark?
MedicalBenchmark is an independent evaluation platform that measures AI model performance on official medical exams in Spain, primarily the MIR.
Our mission is to provide objective, reproducible, and openly accessible data so that researchers, healthcare professionals, and developers can understand the real capabilities of AI in medicine.
Independent evaluation
No affiliation with any AI provider. We evaluate all models using the same standardized protocol.
Official exams
We use real MIR questions published by the Spanish Ministry of Health.
280+ AI models
The most comprehensive database of medical AI evaluations in Spanish, including proprietary and open-source models.
Open data
All results, responses, and metrics are publicly available to foster open research.
The MIR exam
The MIR (Médico Interno Residente) is Spain's national exam for accessing specialized medical training. It is a standardized, public, and highly competitive test.
Each MIR edition consists of 200 valid questions plus 10 reserve questions (210 total). Each question has 4 answer options, of which only one is correct.
Scoring system
Correct answer
+3 points
Incorrect answer
-1 point
Blank answer
0 points
Net score formula
The net score (Netas) represents the effective number of correct answers after discounting the penalty for incorrect responses. It is the official MIR metric:

Netas = Correct - (Incorrect / 3)
Some questions may be annulled after the exam is published. Annulled questions do not count towards the net score and are excluded from evaluation.
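As a worked example, the two scoring formulas above reduce to a few lines of Python. This is an illustrative sketch, not the platform's actual scoring code:

```python
def mir_scores(correct: int, incorrect: int) -> tuple[float, float]:
    """Compute the official MIR metrics from answer counts.

    Annulled questions must be excluded beforehand, so
    correct + incorrect + blank == number of valid questions.
    """
    netas = correct - incorrect / 3  # net score: each error costs 1/3 of a correct answer
    score = 3 * netas                # final score: equivalent to 3*correct - incorrect
    return netas, score

# Example: 150 correct, 40 incorrect, 10 blank (200 valid questions)
netas, score = mir_scores(150, 40)
print(round(netas, 2))  # 136.67
print(round(score, 2))  # 410.0
```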
How models are evaluated
All models are evaluated under a standardized zero-shot protocol, meaning they receive no prior examples or specific training for the exam.
In zero-shot evaluation, the model receives each question in isolation, with no prior (few-shot) examples and no MIR-specific instructions.
Prompt preparation
Each question is formatted with a standardized prompt that includes the statement, answer options, and a clear instruction to select a single option.
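The exact prompts are published alongside each evaluation; the template below is only a plausible sketch of what such a standardized prompt might look like (the wording and field names are assumptions, not the platform's real prompt):

```python
# Hypothetical prompt template; the real wording is published with each evaluation.
PROMPT_TEMPLATE = """Question: {statement}

A) {a}
B) {b}
C) {c}
D) {d}

Reply with a single letter: A, B, C, or D."""

def build_prompt(question: dict) -> str:
    """Format one MIR question (statement + 4 options) into the prompt."""
    a, b, c, d = question["options"]
    return PROMPT_TEMPLATE.format(statement=question["statement"], a=a, b=b, c=c, d=d)
```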
Sending to the model
The question is sent to the model's API without additional context, prior examples, or specialized system prompts.
Answer extraction
The model's response is analyzed to extract the selected option (A, B, C, or D) using multiple parsing methods.
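A simplified sketch of such multi-stage parsing (the platform's actual extraction logic may differ):

```python
import re

def extract_answer(response: str) -> str | None:
    """Extract the selected option (A-D) from a free-text model response.

    Tries an explicit "answer: X" pattern first, then falls back to the
    first standalone capital letter A-D. Returns None if nothing matches.
    """
    match = re.search(r"(?:answer|respuesta)\s*[:\-]?\s*([ABCD])\b",
                      response, re.IGNORECASE)
    if match is None:
        # Uppercase only in the fallback, to avoid matching the article "a".
        match = re.search(r"\b([ABCD])\b", response)
    return match.group(1).upper() if match else None

print(extract_answer("Answer: B, because..."))  # B
print(extract_answer("I would choose C."))      # C
```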
Metrics calculation
The response, tokens used, response time, and cost are all recorded, and the score is calculated according to the official MIR scoring system.
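Putting the previous steps together, a single evaluation record might be produced as sketched below. It assumes an OpenAI-style chat completions client and prices expressed in USD per million tokens; none of these names come from the platform itself, and it reuses the build_prompt() and extract_answer() sketches above:

```python
import time

def evaluate_question(client, model: str, question: dict,
                      price_in: float, price_out: float) -> dict:
    """Send one question and record answer, tokens, latency, and cost.

    Assumes an OpenAI-style client (chat.completions.create) and per-token
    prices in USD per million tokens. Illustrative only.
    """
    start = time.monotonic()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(question)}],
    )
    latency = time.monotonic() - start
    answer = extract_answer(resp.choices[0].message.content)
    return {
        "question_id": question["id"],
        "answer": answer,
        "is_correct": answer == question["correct_option"],
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
        "latency_s": round(latency, 2),
        "cost_usd": (resp.usage.prompt_tokens * price_in
                     + resp.usage.completion_tokens * price_out) / 1_000_000,
    }
```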
Results publication
Results are published on the platform with full transparency: every individual response is verifiable.
Understanding the results
Each evaluated model has a complete profile with multiple metrics. Here's how to interpret each one.
Accuracy
Percentage of questions answered correctly out of all valid questions. It's the most intuitive metric: 80% means the model got 8 out of 10 questions right.
Net score (Netas)
Official MIR score that accounts for the penalty for incorrect answers. It reflects real performance better than raw accuracy.
Score
Final score, calculated as 3 × Netas. This is the metric officially used to rank MIR candidates.
Discriminatory questions
Questions on which the highest-performing (Frontier) models disagree about the correct answer. Especially useful for analyzing the boundaries of AI knowledge.
Tokens
Amount of text processed (input) and generated (output) by the model, measured in tokens. Directly impacts cost.
Cost
Estimated cost in USD for evaluating the model on the entire exam, based on each API's public pricing.
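To tie these profile metrics together, a hypothetical aggregation over per-question records (like those from the evaluation sketch above) could look like this; field names are assumptions carried over from that sketch:

```python
def profile_metrics(records: list[dict]) -> dict:
    """Aggregate per-question records into the profile metrics above.

    Assumes each record has the fields from the evaluation sketch plus an
    optional "annulled" flag; annulled questions are excluded first.
    """
    valid = [r for r in records if not r.get("annulled", False)]
    correct = sum(r["is_correct"] for r in valid)
    answered = sum(r["answer"] is not None for r in valid)
    incorrect = answered - correct
    netas = correct - incorrect / 3
    return {
        "accuracy": correct / len(valid),  # correct out of all valid questions
        "netas": round(netas, 2),          # official net score
        "score": round(3 * netas, 2),      # 3 x Netas, the ranking metric
        "cost_usd": round(sum(r["cost_usd"] for r in valid), 4),
    }
```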
Medical specialties
MIR questions cover over 30 medical specialties. Each question is classified by specialty, allowing analysis of model performance by area of knowledge.
You can filter results by specialty on each model's detail page.
Question types
Each MIR question is classified by the type of clinical reasoning it requires. The 14 types reflect the competencies assessed in medical training.
Question type breakdown is available on each model's profile.
Data integrity
A benchmark's reliability depends on data integrity. We take specific measures to ensure fair and uncontaminated evaluations.
MIR 2026 is our virgin benchmark: no model was trained on these questions, as they were published after the models' training data cutoffs.
No contamination
The most recent exams were not available during model training, eliminating the risk of memorization.
Fair comparison
All models receive exactly the same prompt, under the same conditions, with no advantages for any provider.
Reproducibility
We publish the exact prompts, responses, and configurations so any researcher can reproduce our results.
How to use the platform
MedicalBenchmark offers multiple ways to explore and analyze medical AI evaluation data.
Explore rankings
Check the complete model rankings by exam. Filter by model type, sort by different metrics, and compare results.
View exam questions
Explore MIR questions and see how each model answered. Identify error patterns and questions particularly difficult for AI.
Compare models
Access each model's detailed profile to see performance broken down by specialty, question type, and efficiency metrics.
Access the data
Download complete datasets for research or request API access to integrate data into your own analysis tools.
Glossary
Definitions of key terms used on the platform.
- Accuracy
- Percentage of correct answers out of all valid exam questions.
- Net score (Netas)
- Official MIR metric. Calculated as: Correct - (Incorrect / 3). Reflects real performance by penalizing incorrect answers.
- Score
- Final MIR score, calculated as 3 × Netas. The metric used to rank candidates.
- Zero-shot
- Evaluation method where the model receives no prior examples or specific training for the task. The question is presented directly.
- MIR
- Médico Interno Residente. Spain's national exam for accessing specialized medical training.
- Prompt
- Input text sent to the AI model. In our case, it includes the MIR question formatted with its answer options.
- Token
- Minimum unit of text processed by language models. Approximately equivalent to 3/4 of a word in English.
- Frontier (model)
- State-of-the-art AI models with the highest performance. Includes models like GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, etc.
- Multimodal
- A model's ability to process both text and images. Relevant for MIR questions that include clinical images.
- Discriminatory questions
- Questions where the highest-performing AI models (Frontier) disagree on the correct answer.
- Virgin benchmark
- An exam whose questions did not exist during the training of evaluated models, guaranteeing zero data contamination.
- Open Source
- Models whose code and weights are publicly available for free download and use.
- API
- Application programming interface that allows external tools to access MedicalBenchmark data programmatically.
Frequently asked questions
Answers to the most common questions about MedicalBenchmark.
Are the results reliable?
Yes. Each evaluation follows a standardized and reproducible protocol. We publish all individual responses so any researcher can verify the results. Additionally, our data has been validated in peer-reviewed scientific publications.
How often are the rankings updated?
Rankings are continuously updated as we evaluate new models or new versions are released. Each MIR edition is added when the Ministry of Health officially publishes the questions and answers.
Why do you use the MIR and not other exams?
The MIR is Spain's most important medical exam, with questions designed by experts and statistically validated. It is public, standardized, and covers the full spectrum of medicine. Furthermore, because it is in Spanish, it allows models to be evaluated in a language other than English.
Which models are included?
We evaluate over 280 models, including proprietary models (GPT-4, Claude, Gemini, etc.) and open-source ones (LLaMA, Mistral, Qwen, etc.). Anyone can propose a model for evaluation.
Can I download the data?
Yes. We offer complete datasets on the Datasets page, including questions, each model's responses, and detailed metrics. For programmatic access, we also have an API available.
How is it different from other medical benchmarks?
MedicalBenchmark stands out by using real official exams (not synthetic), evaluating in Spanish, including the official MIR scoring system with penalties, and offering a virgin benchmark with uncontaminated exams.
How can I contribute or collaborate?
You can propose models for evaluation, report errors, suggest improvements, or collaborate on research. Visit our contact page for more information.
How much does MedicalBenchmark cost?
The platform is completely free. All data, rankings, and analyses are openly available. We believe that transparency in medical AI evaluation benefits the entire community.
Ready to explore?
Check out the AI model rankings on MIR exams and discover how they perform in medicine.