Evaluation Methodology
What is the MIR Exam?
The MIR (Medico Interno Residente) is the national exam that medical graduates must pass to access specialized healthcare training in Spain. It is the gold standard for evaluating professional-level medical knowledge.
200 official questions
Plus 10 reserve questions in case any are annulled
4 options per question
One correct answer, three distractors
Unified national exam
Identical for all candidates throughout Spain
Expert-crafted
Commission of specialists from the Ministry of Health
MIR 2026: A Virgin Benchmark
The MIR 2026 exam represents a unique opportunity in AI model evaluation: it was published AFTER the training cutoff date of all evaluated models.
This means no model could have seen these questions during training, guaranteeing a real zero-shot evaluation.
No training contamination
MIR 2026 questions did not exist when the models were trained
Real zero-shot evaluation
Models respond without having ever seen the questions before
Fair comparison between models
All models start from the same initial conditions
Official Scoring System
We use the official MIR exam scoring system, designed to penalize incorrect answers and discourage random guessing.
Correct answer
+3 points
Incorrect answer
-1 point
Blank answer
0 points
Net Score = Correct - (Wrong / 3)
The 'net score' formula balances the risk of answering incorrectly. For every 3 wrong answers, the equivalent of 1 correct answer is lost.
Score = 3 × Net Score
Officially annulled questions are not counted in the scoring.
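The scoring table maps directly onto a small helper. This is a minimal sketch, not the project's actual code (the function name is hypothetical), and it assumes annulled questions have already been excluded from the counts:

```python
def mir_score(correct: int, wrong: int, blank: int = 0) -> float:
    """Official MIR scoring: +3 per correct, -1 per wrong, 0 per blank.

    Hypothetical helper; annulled questions must be removed from the
    counts before calling it.
    """
    net = correct - wrong / 3   # Net Score = Correct - (Wrong / 3)
    return 3 * net              # Score = 3 × Net Score (equivalently 3*correct - wrong)
```

For example, a candidate with 120 correct, 60 wrong, and 20 blank answers nets 120 - 20 = 100, for a final score of 300; the blanks contribute nothing.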
Evaluation Protocol
We follow a standardized protocol to ensure reproducibility and comparability of results.
Prompt Preparation
Each question is contextualized with a specific prompt that places the model in the role of a Spanish resident physician taking the MIR exam.
Question Submission
The question is sent in structured XML format, including the statement, the answer options, and any images.
Response Processing
The model generates its response with complete clinical reasoning and selects an option.
Standardized Extraction
An automated system extracts the chosen option from the response text, handling different formats.
Score Calculation
The official MIR scoring system is applied and all metrics are recorded.
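The five steps above can be sketched as a single pipeline. This is an illustrative outline, not the project's actual code: `build_prompt`, `ask_model`, and `extract_choice` are hypothetical stand-ins for the prompt-building, submission, and extraction stages.

```python
from typing import Callable, Optional

def run_protocol(
    question: dict,
    build_prompt: Callable[[dict], str],
    ask_model: Callable[[str], str],
    extract_choice: Callable[[str], Optional[str]],
) -> dict:
    """Run one question through the five protocol steps.

    The three callables are hypothetical stand-ins for the real stages.
    """
    prompt = build_prompt(question)    # 1. prompt preparation
    response = ask_model(prompt)       # 2-3. submission and model response
    choice = extract_choice(response)  # 4. standardized extraction
    if choice is None:
        points = 0                     # no extractable option: treated as blank
    else:
        points = 3 if choice == question["answer"] else -1  # 5. official scoring
    return {"choice": choice, "points": points, "response": response}
```

Injecting each stage as a callable keeps the orchestration testable independently of any model API.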
Prompt Design
The prompt is designed to contextualize the model in the Spanish healthcare system and the specific situation of the MIR exam.
You are a Spanish resident physician taking the MIR exam.
Analyze the following question and provide your answer.
<question>
{statement}
</question>
<options>
A) {option_a}
B) {option_b}
C) {option_c}
D) {option_d}
</options>
Reason your answer and at the end clearly indicate your choice
with the format: "My answer is: [letter]"

Design rationale:
- Spanish context: explicit reference to the Spanish healthcare system
- Defined role: the model acts as a resident physician taking the exam
- Clear instructions: response format specified to facilitate extraction
- No additional hints: the model only receives the question information
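Assembling this prompt is simple string templating. A hedged sketch of how it might be done (the function and parameter names are assumptions, not the project's actual code):

```python
def build_prompt(statement: str, options: dict) -> str:
    """Fill the prompt template shown above.

    `options` maps letters ("A".."D") to option text; names are hypothetical.
    """
    opts = "\n".join(f"{letter}) {text}" for letter, text in sorted(options.items()))
    return (
        "You are a Spanish resident physician taking the MIR exam.\n"
        "Analyze the following question and provide your answer.\n"
        f"<question>\n{statement}\n</question>\n"
        f"<options>\n{opts}\n</options>\n"
        "Reason your answer and at the end clearly indicate your choice\n"
        'with the format: "My answer is: [letter]"'
    )
```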
Response Extraction
We use a robust extraction system to identify the option chosen by each model, regardless of variations in response format.
Secondary parsing model
A specialized model analyzes the response and extracts the chosen option
Search patterns
Regular expressions search for key phrases like 'My answer is:', 'The correct option is:', etc.
Retry system
If extraction fails, the model is asked to clarify its response
Confidence level
Extraction confidence is recorded for each response
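The pattern-matching layer might look like the sketch below. The patterns and confidence labels are illustrative assumptions; per the description above, the real system additionally falls back to a secondary parsing model and a retry request when no pattern matches.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns mirroring the key phrases mentioned above.
ANSWER_PATTERNS = [
    r"My answer is:?\s*\[?([ABCD])\]?",
    r"The correct option is:?\s*\[?([ABCD])\]?",
    r"\b([ABCD])\)\s*$",  # last resort: a trailing "B)" on its own line
]

def extract_choice(response: str) -> Tuple[Optional[str], str]:
    """Return (letter, confidence); (None, "none") means the retry
    path would be triggered. Confidence labels are assumptions."""
    for conf, pattern in zip(("high", "high", "low"), ANSWER_PATTERNS):
        m = re.search(pattern, response, flags=re.IGNORECASE | re.MULTILINE)
        if m:
            return m.group(1).upper(), conf
    return None, "none"
```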
Multimodal Support
The MIR exam includes questions with medical images (X-rays, ECGs, histological sections, etc.). Our system automatically detects and manages these questions.
Automatic detection
The system identifies which models have vision capability
Image submission
Medical images are sent along with the question text
Text-only models
For models without vision, the prompt indicates that the question includes an image that cannot be shown
Separate metrics
Specific metrics are recorded for questions with and without images
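The dispatch between vision and text-only models can be sketched as follows; the field names and the fallback wording are assumptions, not the project's actual code:

```python
def prepare_payload(question: dict, model_has_vision: bool) -> dict:
    """Attach images for vision models; flag the missing image otherwise.

    Hypothetical sketch: `question` uses assumed keys "statement" and "image".
    """
    payload = {"text": question["statement"], "images": []}
    if question.get("image"):
        if model_has_vision:
            # Vision model: the image is sent alongside the question text
            payload["images"].append(question["image"])
        else:
            # Text-only model: note that an image exists but is unavailable
            payload["text"] += (
                "\n[Note: this question includes a medical image "
                "that is not available to you.]"
            )
    return payload
```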
Captured Metrics
We record multiple metrics for each response, allowing detailed analysis of each model's performance.
Response time
Total latency from submission to complete response (ms)
Input tokens
Number of tokens in the prompt sent to the model
Output tokens
Number of tokens generated in the response
Reasoning tokens
Tokens used in the reasoning process (if applicable)
Cost per query
Estimated cost in USD based on API pricing
Confidence level
Model's confidence in its response (if available)
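The metrics above fit naturally into a per-response record. A sketch using a Python dataclass (all field names are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryMetrics:
    """Per-response record of the captured metrics (names hypothetical)."""
    latency_ms: float                        # response time, submission to completion
    input_tokens: int                        # tokens in the prompt
    output_tokens: int                       # tokens generated in the response
    reasoning_tokens: Optional[int] = None   # reasoning models only
    cost_usd: Optional[float] = None         # estimated from API pricing
    confidence: Optional[float] = None       # model-reported, if available

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens + (self.reasoning_tokens or 0)
```

Keeping optional fields as `None` rather than zero distinguishes "not applicable" (e.g. a non-reasoning model) from a genuine zero count.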
Transparency and Reproducibility
We are committed to full transparency in our methodology. Any researcher can verify and reproduce our results.
Documented methodology
All details of the evaluation process are publicly documented
Public input data
MIR questions are public Ministry documents
Verifiable responses
Model responses are stored for subsequent verification
Open source
Evaluation code will be available for inspection and reproduction
Explore the Results
Check the detailed performance of each model on MIR 2026 questions.