MedicalBenchmark

Documentation

Everything you need to know about how we evaluate AI models on official medical exams in Spain. A guide for researchers and healthcare professionals.

What is MedicalBenchmark?

MedicalBenchmark is an independent evaluation platform that measures AI model performance on official medical exams in Spain, primarily the MIR.

Our mission is to provide objective, reproducible, and openly accessible data so that researchers, healthcare professionals, and developers can understand the real capabilities of AI in medicine.

Independent evaluation

No affiliation with any AI provider. We evaluate all models using the same standardized protocol.

Official exams

We use real MIR questions published by the Spanish Ministry of Health.

280+ AI models

The most comprehensive database of medical AI evaluations in Spanish, including proprietary and open-source models.

Open data

All results, responses, and metrics are publicly available to foster open research.

The MIR exam

The MIR (Médico Interno Residente) is Spain's national exam for accessing specialized medical training. It is a standardized, public, and highly competitive test.

Each MIR edition consists of 200 valid questions plus 10 reserve questions (210 total). Each question has 4 answer options, of which only one is correct.

Scoring system

Correct answer: +3 points
Incorrect answer: -1 point
Blank answer: 0 points

Net score formula

Net = Correct - (Incorrect / 3)

The net score represents the effective number of correct answers, discounting the penalty for incorrect responses. It is the official MIR metric.

Some questions may be annulled after the exam is published. Annulled questions do not count towards the net score and are excluded from evaluation.
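
For instance, a model with 150 correct, 30 incorrect, and 20 blank answers out of 200 valid questions earns Net = 150 - 30/3 = 140 and Score = 3 × 140 = 420. The same arithmetic as a minimal sketch (the function name and example counts are illustrative, not taken from the platform):

```python
def mir_score(correct: int, incorrect: int) -> tuple[float, float]:
    """Official MIR arithmetic: blank and annulled questions never enter it."""
    net = correct - incorrect / 3  # Net = Correct - (Incorrect / 3)
    return net, 3 * net            # Score = 3 x Net

net, score = mir_score(correct=150, incorrect=30)  # 20 blanks contribute nothing
print(net, score)  # 140.0 420.0
```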

How models are evaluated

All models are evaluated under a standardized zero-shot protocol: each question is presented in isolation, with no prior examples (few-shot), no MIR-specific training, and no specialized instructions. (A code sketch of the full pipeline follows step 5 below.)

1. Prompt preparation

Each question is formatted with a standardized prompt that includes the statement, answer options, and a clear instruction to select a single option.

2. Sending to the model

The question is sent to the model's API without additional context, prior examples, or specialized system prompts.

3. Answer extraction

The model's response is analyzed to extract the selected option (A, B, C, or D) using multiple parsing methods.

4. Metrics calculation

The model's answer, tokens used, response time, and cost are recorded, and the score is calculated according to the official MIR system.

5. Results publication

Results are published on the platform with full transparency: every individual response is verifiable.
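
The sketch below strings these steps together, assuming a hypothetical call_model(prompt) function that returns the model's raw text; the prompt wording and parsing patterns are illustrative, while the exact prompts and configurations published on the platform remain the reference:

```python
import re

# Illustrative zero-shot prompt; the platform's exact wording is published separately.
PROMPT_TEMPLATE = (
    "Answer the following MIR question by selecting exactly one option.\n"
    "Reply with a single letter: A, B, C, or D.\n\n"
    "{statement}\n\n"
    "A) {a}\nB) {b}\nC) {c}\nD) {d}"
)

def build_prompt(question: dict) -> str:
    """Step 1: format the question with the standardized prompt."""
    a, b, c, d = question["options"]
    return PROMPT_TEMPLATE.format(statement=question["statement"], a=a, b=b, c=c, d=d)

def extract_answer(text: str) -> str | None:
    """Step 3: try several parsing patterns to recover the chosen option."""
    patterns = (
        r"^\s*([ABCD])\b",                      # bare letter at the start
        r"answer\s*(?:is)?\s*:?\s*([ABCD])\b",  # "the answer is C"
        r"\b([ABCD])\)",                        # "C) ..." anywhere in the text
    )
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(1).upper()
    return None  # unparseable responses are treated like blank answers

def evaluate(questions: list[dict], call_model) -> float:
    """Steps 2 and 4: query the model with no extra context, then score."""
    correct = incorrect = 0
    for q in questions:
        answer = extract_answer(call_model(build_prompt(q)))
        if answer is None:
            continue                      # blank: 0 points
        if answer == q["correct_option"]:
            correct += 1
        else:
            incorrect += 1
    return correct - incorrect / 3        # Net = Correct - (Incorrect / 3)
```

Step 5 then amounts to publishing every prompt, raw response, and parsed answer alongside the resulting score.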

Understanding the results

Each evaluated model has a complete profile with multiple metrics. Here's how to interpret each one.

Accuracy

Percentage of questions answered correctly out of all valid questions. It's the most intuitive metric: 80% means the model got 8 out of 10 questions right.

Net score (Netas)

Official MIR metric that applies the penalty for incorrect answers. It reflects real performance better than raw accuracy.

Score

Final score calculated as 3 × Net. This is the metric officially used to rank MIR candidates.

Discriminatory questions

Questions where Frontier models (the highest-performing ones) disagree on the correct answer. Especially useful for analyzing the boundaries of AI knowledge; a minimal detection sketch follows.
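
A minimal sketch of that notion, assuming each Frontier model's parsed answer is available (model names and letters are illustrative):

```python
def is_discriminatory(frontier_answers: dict[str, str]) -> bool:
    """A question is discriminatory when Frontier models do not all agree."""
    return len(set(frontier_answers.values())) > 1

# Two models say B, one says D: the question discriminates between them.
print(is_discriminatory({"model-a": "B", "model-b": "B", "model-c": "D"}))  # True
```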

Tokens

Amount of text processed (input) and generated (output) by the model, measured in tokens. Directly impacts cost.

Cost

Estimated cost in USD for evaluating the model on the entire exam, based on each API's public pricing.
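
As a toy cost calculation (the per-million-token prices and token counts below are placeholders, not any provider's actual pricing):

```python
PRICE_IN_PER_M = 2.50    # hypothetical USD per million input tokens
PRICE_OUT_PER_M = 10.00  # hypothetical USD per million output tokens

def exam_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of evaluating one model on a full exam."""
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1_000_000

print(f"${exam_cost(420_000, 95_000):.2f}")  # $2.00 for these illustrative counts
```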

Medical specialties

MIR questions cover over 30 medical specialties. Each question is classified by specialty, allowing analysis of model performance by area of knowledge.

Allergology, Anesthesiology and Resuscitation, Cardiology, Palliative Care, Dermatology, Endocrinology and Nutrition, Infectious Diseases, Epidemiology, Statistics, Pharmacology, Gastroenterology, Genetics, Geriatrics, Gynecology and Obstetrics, Hematology, Immunology, Legal Medicine and Bioethics, Nephrology, Pulmonology, Neurology, Ophthalmology, Medical Oncology, ENT, Pediatrics, Health Planning and Management, Psychiatry, Radiology-Emergency, Rheumatology, Traumatology, Urology.

You can filter results by specialty on each model's detail page.

Question types

Each MIR question is classified by the type of clinical reasoning it requires. The 14 types reflect the competencies assessed in medical training.

Diagnosis, Treatment, Tests, Interpretation, Pathophysiology, Risk, Prevention, Prognosis, Epidemiology, Biostatistics, Ethics, Legal, Pharmacology, Anatomy.

Question type breakdown is available on each model's profile.

Data integrity

A benchmark's reliability depends on data integrity. We take specific measures to ensure fair and uncontaminated evaluations.

MIR 2026 is our virgin benchmark: its questions were published after the evaluated models' training data cutoffs, so no model could have been trained on them.

No contamination

The most recent exams were not available during model training, eliminating the risk of memorization.

Fair comparison

All models receive exactly the same prompt, under the same conditions, with no advantages for any provider.

Reproducibility

We publish the exact prompts, responses, and configurations so any researcher can reproduce our results.

How to use the platform

MedicalBenchmark offers multiple ways to explore and analyze medical AI evaluation data.

Explore rankings

Check the complete model rankings by exam. Filter by model type, sort by different metrics, and compare results.

View exam questions

Explore MIR questions and see how each model answered. Identify error patterns and questions particularly difficult for AI.

Compare models

Access each model's detailed profile to see performance broken down by specialty, question type, and efficiency metrics.

Access the data

Download complete datasets for research or request API access to integrate data into your own analysis tools.
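
As a hedged illustration of what programmatic access could look like (the base URL, endpoint, parameters, and field names here are all hypothetical; the real ones are documented when API access is granted):

```python
import requests

# Hypothetical endpoint and fields, for illustration only.
resp = requests.get(
    "https://medicalbenchmark.example/api/v1/rankings",
    params={"exam": "mir-2026", "sort": "net_score"},
    timeout=30,
)
resp.raise_for_status()
for row in resp.json()["results"]:
    print(row["model"], row["net_score"])
```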

Glossary

Definitions of key terms used on the platform.

Accuracy
Percentage of correct answers out of all valid exam questions.
Net score (Netas)
Official MIR metric. Calculated as: Correct - (Incorrect / 3). Reflects real performance by penalizing incorrect answers.
Score
Final MIR score, calculated as 3 × Net. The metric used to rank candidates.
Zero-shot
Evaluation method where the model receives no prior examples or specific training for the task. The question is presented directly.
MIR
Médico Interno Residente. Spain's national exam for accessing specialized medical training.
Prompt
Input text sent to the AI model. In our case, it includes the MIR question formatted with its answer options.
Token
Minimum unit of text processed by language models. Approximately equivalent to 3/4 of a word in English.
Frontier (model)
State-of-the-art AI models with the highest performance. Includes models like GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, etc.
Multimodal
A model's ability to process both text and images. Relevant for MIR questions that include clinical images.
Discriminatory questions
Questions where the highest-performing AI models (Frontier) disagree on the correct answer.
Virgin benchmark
An exam whose questions did not exist during the training of evaluated models, guaranteeing zero data contamination.
Open Source
Models whose code and weights are publicly available for free download and use.
API
Programming interface that allows programmatic access to MedicalBenchmark data.

Frequently asked questions

Answers to the most common questions about MedicalBenchmark.

Are the results reliable?

Yes. Each evaluation follows a standardized and reproducible protocol. We publish all individual responses so any researcher can verify the results. Additionally, our data has been validated in peer-reviewed scientific publications.

How often are the rankings updated?

Rankings are continuously updated as we evaluate new models or new versions are released. Each MIR edition is added when the Ministry of Health officially publishes the questions and answers.

Why do you use the MIR and not other exams?

The MIR is Spain's most important medical exam, with questions designed by experts and statistically validated. It is public, standardized, and covers the full spectrum of medicine. Furthermore, because it is in Spanish, it lets us evaluate models in a language other than English.

Which models are included?

We evaluate over 280 models, including proprietary models (GPT-4, Claude, Gemini, etc.) and open-source ones (LLaMA, Mistral, Qwen, etc.). Anyone can propose a model for evaluation.

Can I download the data?

Yes. We offer complete datasets on the Datasets page, including questions, each model's responses, and detailed metrics. For programmatic access, we also have an API available.

How is it different from other medical benchmarks?

MedicalBenchmark stands out by using real official exams (not synthetic), evaluating in Spanish, including the official MIR scoring system with penalties, and offering a virgin benchmark with uncontaminated exams.

How can I contribute or collaborate?

You can propose models for evaluation, report errors, suggest improvements, or collaborate on research. Visit our contact page for more information.

How much does MedicalBenchmark cost?

The platform is completely free. All data, rankings, and analyses are openly available. We believe that transparency in medical AI evaluation benefits the entire community.

Ready to explore?

Check out the AI model rankings on MIR exams and discover how they perform in medicine.