MedicalBenchmark

Scientific publications and academic transparency

Research Papers

Our work evaluating AI models in the medical field is backed by peer-reviewed scientific publications, ensuring transparency and reproducibility of our results.

Paper 2026

In preparation

We are working on a new research paper that will include the complete analysis of MIR 2026 results, with updated data and newly evaluated models.

Will include:

  • Complete MIR 2026 results analysis
  • Evaluation of state-of-the-art models
  • Year-over-year comparison 2024-2026
  • New multimodal evaluation metrics

Paper 2025

Published

Evaluating Large Language Models on the Spanish Medical Intern Resident (MIR) Examination 2024/2025: A Comparative Analysis of Clinical Reasoning and Knowledge Application

Authors

Carlos Luengo Vera, Ignacio Ferro Picón, M. Teresa del Val Núñez, José Andrés Gómez Gandía, Antonio de Lucas Ancillo, Víctor Ramos Arroyo, Carlos Milán Figueredo

This study presents a comprehensive comparative evaluation of 22 large language models (LLMs) on the Spanish MIR exams of 2024 and 2025.

Key study metrics

  • 22 models evaluated
  • 420 questions analyzed
  • 210 questions per cycle
  • 2 MIR cycles (2024 and 2025)

Key study findings

Study objective

Comprehensive comparative evaluation of general-purpose and medically specialized language models on the Spanish MIR exams.

  • 22 language models (LLMs) evaluated
  • Official Spanish MIR exams 2024 and 2025
  • Analysis of clinical reasoning capabilities
  • Comparison between generalist and specialized models

Methodology

A rigorous evaluation framework based on official MIR exam questions, scored with the standard MIR system; a minimal scoring sketch follows the list below.

  • 210 official multiple-choice questions per cycle
  • Standard MIR scoring system (+3/-1/0)
  • Zero-shot evaluation without prior examples
  • Multimodal processing of medical images
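
To make the scoring rule concrete, here is a minimal sketch (in Python) of how the standard MIR rule (+3 per correct answer, -1 per incorrect, 0 if blank) can be applied to a model's zero-shot answers. The Question structure and the example answers are illustrative assumptions, not the paper's actual evaluation harness.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Question:
    """One MIR multiple-choice item; image_path is set for image-based questions."""
    correct_option: str             # e.g. "B"
    image_path: Optional[str] = None

def mir_score(questions: list[Question], answers: list[Optional[str]]) -> int:
    """Standard MIR scoring: +3 per correct answer, -1 per incorrect, 0 if blank."""
    score = 0
    for question, answer in zip(questions, answers):
        if answer is None:          # model abstained or no option could be parsed
            continue
        score += 3 if answer == question.correct_option else -1
    return score

# Hypothetical 3-question run: one correct, one wrong, one blank -> 3 - 1 + 0 = 2
questions = [Question("A"), Question("B"), Question("C", image_path="ecg_12.png")]
print(mir_score(questions, ["A", "D", None]))  # 2
```

Because unanswered questions score 0 while wrong answers cost a point, a model's abstention behavior can matter almost as much as its raw accuracy under this rule.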

Models evaluated

A wide selection of models, including both generalist systems and those specialized in the medical domain.

  • OpenAI: GPT-4, GPT-4 Turbo, GPT-4o
  • Anthropic: Claude 3 (Opus, Sonnet, Haiku)
  • Google: Gemini Pro, Gemini Ultra
  • Specialized systems: Miri Pro

Study scope

A comprehensive evaluation covering multiple dimensions of medical knowledge and clinical skills; a sketch of the cycle-consistency check follows the list below.

  • Coverage of all MIR medical specialties
  • Questions with and without image support
  • Evaluation of diagnostic and therapeutic reasoning
  • Consistency analysis between exam cycles
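
As a hedged illustration of the consistency analysis between exam cycles, the sketch below compares each model's score across the 2024 and 2025 cycles. The model names and scores are invented placeholders, not figures from the paper.

```python
# Hypothetical per-model MIR scores for the two cycles (placeholder values).
scores_2024 = {"model_a": 540, "model_b": 498, "model_c": 455}
scores_2025 = {"model_a": 522, "model_b": 501, "model_c": 431}

# Per-model delta between cycles: a uniformly negative shift across models
# suggests harder (or less memorizable) 2025 questions rather than a
# regression in any single model.
deltas = {name: scores_2025[name] - scores_2024[name] for name in scores_2024}
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name}: {delta:+d}")

mean_delta = sum(deltas.values()) / len(deltas)
print(f"mean delta across models: {mean_delta:+.1f}")
```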

Key findings

Results revealing significant differences between the types of models evaluated.

  • Specialized models outperform generalists
  • Higher accuracy in complex clinical reasoning
  • Variability in medical image interpretation
  • Performance decrease between the 2024 and 2025 cycles

Conclusions

Important implications for the future of AI in medicine and medical education.

  • Potential of domain-specific fine-tuning
  • Critical importance of multimodal capabilities
  • Need for annually updated benchmarks
  • Potential applications in medical education

Notable finding

A slight decrease in performance was observed between the 2024 and 2025 cycles, attributed to changes in question design intended to reduce reliance on memorization.

Explore our results

Check the updated AI model rankings or propose your own model for evaluation.