MedicalBenchmark

Scientific rigor and transparency in medical AI evaluation

Evaluation Methodology

Our methodology ensures a fair, reproducible, and scientifically rigorous evaluation of artificial intelligence models in the medical field. We use Spain's official MIR exam as a standardized reference.

What is the MIR Exam?

The MIR (Médico Interno Residente) is the national exam that medical graduates must pass to access specialized healthcare training in Spain. It is the gold standard for evaluating professional-level medical knowledge.

200 official questions

Plus 10 reserve questions in case any are annulled

4 options per question

One correct answer, three distractors

Unified national exam

Identical for all candidates throughout Spain

Expert-crafted

Commission of specialists from the Ministry of Health

MIR 2026: A Virgin Benchmark

The MIR 2026 exam represents a unique opportunity in AI model evaluation: it was published AFTER the training cutoff date of all evaluated models.

This means no model could have seen these questions during training, guaranteeing a real zero-shot evaluation.

No training contamination

MIR 2026 questions did not exist when the models were trained

Real zero-shot evaluation

Models respond without having ever seen the questions before

Fair comparison between models

All models start from the same initial conditions

Official Scoring System

We use the official MIR exam scoring system, designed to penalize incorrect answers and discourage random guessing.

Correct answer

+3 points

Incorrect answer

-1 point

Blank answer

0 points

Net Score = Correct - (Wrong / 3)

The net score formula balances the risk of answering incorrectly: for every 3 wrong answers, the equivalent of 1 correct answer is lost.

Score = 3 × Net Score

Officially annulled questions are not counted in the scoring.
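The scoring rules above can be sketched as a small helper (the function name and signature are mine, not from the benchmark's code; annulled questions simply never enter the tallies):

```python
def mir_score(correct: int, wrong: int) -> float:
    """Apply the official MIR scoring: +3 per correct answer, -1 per
    wrong answer, 0 for blanks. Annulled questions are excluded from
    the counts before this function is called."""
    net = correct - wrong / 3   # Net Score = Correct - (Wrong / 3)
    return 3 * net              # Score = 3 × Net Score

# Example: 150 correct, 30 wrong, 20 blank out of 200 questions
print(mir_score(150, 30))  # 420.0, i.e. 3*150 - 30
```

Note that multiplying the net score by 3 is equivalent to scoring +3 per correct and -1 per wrong directly.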

Evaluation Protocol

We follow a standardized protocol to ensure reproducibility and comparability of results.

1

Prompt Preparation

Each question is contextualized with a specific prompt that places the model in the role of a Spanish resident physician taking the MIR exam.

2

Question Submission

The question is sent in structured XML format, including the statement, answer options, and images if any.

3

Response Processing

The model generates its response with complete clinical reasoning and selects an option.

4

Standardized Extraction

An automated system extracts the chosen option from the response text, handling different formats.

5

Score Calculation

The official MIR scoring system is applied and all metrics are recorded.
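Step 2 of the protocol sends each question in structured XML. A minimal sketch of how such a payload might be built — the `<question>` and `<options>` tags follow the prompt template shown below; the root element, `id` attribute, and `<image>` element are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

def build_question_xml(statement, options, image_path=None):
    """Serialize one MIR question (statement, four options, optional
    image reference) into an XML payload."""
    root = ET.Element("mir_question")  # root tag name is an assumption
    ET.SubElement(root, "question").text = statement
    opts = ET.SubElement(root, "options")
    for letter, text in zip("ABCD", options):
        ET.SubElement(opts, "option", id=letter).text = text
    if image_path:
        # Questions with medical images carry a reference to the file
        ET.SubElement(root, "image", src=image_path)
    return ET.tostring(root, encoding="unicode")

xml = build_question_xml(
    "A 65-year-old patient presents with...",
    ["Option A", "Option B", "Option C", "Option D"],
)
```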

Prompt Design

The prompt is designed to contextualize the model in the Spanish healthcare system and the specific situation of the MIR exam.

Prompt Template
You are a Spanish resident physician taking the MIR exam.
Analyze the following question and provide your answer.
<question>
{statement}
</question>
<options>
A) {option_a}
B) {option_b}
C) {option_c}
D) {option_d}
</options>
Reason your answer and at the end clearly indicate your choice
with the format: "My answer is: [letter]"

Design rationale:

  • Spanish context: explicit reference to the Spanish healthcare system
  • Defined role: the model acts as a resident physician taking the exam
  • Clear instructions: response format specified to facilitate extraction
  • No additional hints: the model only receives the question information
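Filling the template is a plain string substitution. A sketch, assuming the `build_prompt` helper name (the template text itself is taken verbatim from above):

```python
PROMPT_TEMPLATE = """You are a Spanish resident physician taking the MIR exam.
Analyze the following question and provide your answer.
<question>
{statement}
</question>
<options>
A) {option_a}
B) {option_b}
C) {option_c}
D) {option_d}
</options>
Reason your answer and at the end clearly indicate your choice
with the format: "My answer is: [letter]"
"""

def build_prompt(statement, options):
    """Render the MIR prompt for one question.
    options: the four answer texts in order A-D."""
    return PROMPT_TEMPLATE.format(
        statement=statement,
        option_a=options[0], option_b=options[1],
        option_c=options[2], option_d=options[3],
    )
```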

Response Extraction

We use a robust extraction system to identify the option chosen by each model, regardless of variations in response format.

Secondary parsing model

A specialized model analyzes the response and extracts the chosen option

Search patterns

Regular expressions search for key phrases like 'My answer is:', 'The correct option is:', etc.

Retry system

If extraction fails, the model is asked to clarify its response

Confidence level

Extraction confidence is recorded for each response
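The pattern-based stage of extraction can be sketched with regular expressions. The key phrases come from the methodology above; the exact pattern list, priority order, and confidence values are illustrative assumptions (the secondary parsing model and retry path sit behind the `None` return):

```python
import re

# Looser patterns come later and receive lower confidence.
ANSWER_PATTERNS = [
    r"My answer is:?\s*\[?([ABCD])\]?",
    r"The correct option is:?\s*\[?([ABCD])\]?",
    r"\b(?:answer|option)\s*:?\s*([ABCD])\b",
]

def extract_answer(response: str):
    """Return (letter, confidence), or (None, 0.0) when no pattern
    matches, which would trigger the retry / secondary-parser path."""
    for i, pattern in enumerate(ANSWER_PATTERNS):
        matches = re.findall(pattern, response, flags=re.IGNORECASE)
        if matches:
            # Take the last match: models often restate their final choice.
            return matches[-1].upper(), 1.0 - 0.2 * i
    return None, 0.0
```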

Multimodal Support

The MIR exam includes questions with medical images (X-rays, ECGs, histological sections, etc.). Our system automatically detects and manages these questions.

Automatic detection

The system identifies which models have vision capability

Image submission

Medical images are sent along with the question text

Text-only models

For models without vision, the prompt notes that the question includes an image that cannot be provided

Separate metrics

Specific metrics are recorded for questions with and without images
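The routing logic above can be sketched as follows — `prepare_payload` is a hypothetical helper, and the fallback notice wording is an assumption, not the benchmark's exact text:

```python
def prepare_payload(prompt, image_path, model_has_vision):
    """Route a question to a model depending on whether it has an
    image and whether the model can see images."""
    if image_path is None:
        return {"text": prompt}
    if model_has_vision:
        # Vision-capable model: image travels with the question text.
        return {"text": prompt, "image": image_path}
    # Text-only model: state that the image is unavailable.
    notice = "[This question includes a medical image that is not available.]"
    return {"text": prompt + "\n" + notice}
```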

Captured Metrics

We record multiple metrics for each response, allowing detailed analysis of each model's performance.

Response time

Total latency from submission to complete response (ms)

Input tokens

Number of tokens in the prompt sent to the model

Output tokens

Number of tokens generated in the response

Reasoning tokens

Tokens used in the reasoning process (if applicable)

Cost per query

Estimated cost in USD based on API pricing

Confidence level

Model's confidence in its response (if available)
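The per-response record listed above maps naturally onto a small data structure. A sketch with illustrative field names (the benchmark's actual schema is not published here):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryMetrics:
    """One record per model response."""
    latency_ms: float                 # total latency, submission to full response
    input_tokens: int                 # tokens in the prompt sent to the model
    output_tokens: int                # tokens generated in the response
    reasoning_tokens: Optional[int]   # reasoning tokens, if the API reports them
    cost_usd: float                   # estimated cost from API pricing
    confidence: Optional[float]       # model-reported confidence, if available
```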

Transparency and Reproducibility

We are committed to full transparency in our methodology. Any researcher can verify and reproduce our results.

Documented methodology

All details of the evaluation process are publicly documented

Public input data

MIR questions are public Ministry documents

Verifiable responses

Model responses are stored for subsequent verification

Open source

Evaluation code will be available for inspection and reproduction

Explore the Results

Check the detailed performance of each model on MIR 2026 questions.