Research Papers
Paper 2026
Upcoming · In preparation
We are preparing a new research paper with a complete analysis of the MIR 2026 results, based on updated data and newly evaluated models.
It will include:
- Complete MIR 2026 results analysis
- Evaluation of state-of-the-art models
- Year-over-year comparison 2024-2026
- New multimodal evaluation metrics
Paper 2025
Available · Published
Evaluating Large Language Models on the Spanish Medical Intern Resident (MIR) Examination 2024/2025: A Comparative Analysis of Clinical Reasoning and Knowledge Application
Authors
Carlos Luengo Vera, Ignacio Ferro Picón, M. Teresa del Val Núñez, José Andrés Gómez Gandía, Antonio de Lucas Ancillo, Víctor Ramos Arroyo, Carlos Milán Figueredo
This study presents a comprehensive comparative evaluation of 22 large language models (LLMs) on the Spanish MIR exams of 2024 and 2025.
Key results
Study objective
Comprehensive comparative evaluation of general-purpose and medically specialized language models on the Spanish MIR exams.
- 22 large language models (LLMs) evaluated
- Official Spanish MIR exams 2024 and 2025
- Analysis of clinical reasoning capabilities
- Comparison between generalist and specialized models
Methodology
Rigorous evaluation framework based on official MIR exam questions and the standard MIR scoring system.
- 210 official multiple-choice questions per cycle
- Standard MIR scoring system (+3 correct, -1 incorrect, 0 unanswered); see the scoring sketch after this list
- Zero-shot evaluation without prior examples
- Multimodal processing of medical images
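As a minimal illustration of how this scoring rule works in practice, the Python sketch below applies the +3/-1/0 rule to a model's answers for a handful of questions. The function name, data layout, and example answers are illustrative assumptions, not code or data from the study.

```python
# Minimal sketch of the standard MIR scoring rule described above:
# +3 for a correct answer, -1 for an incorrect answer, 0 for a blank.
# The data layout and names below are illustrative assumptions only.

def score_mir_exam(answer_key: dict[int, str],
                   model_answers: dict[int, str | None]) -> int:
    """Return the raw MIR score for one exam cycle."""
    score = 0
    for question_id, correct_option in answer_key.items():
        given = model_answers.get(question_id)  # None = question left blank
        if given is None:
            continue                 # unanswered: 0 points
        if given == correct_option:
            score += 3               # correct answer: +3 points
        else:
            score -= 1               # incorrect answer: -1 point
    return score


# Hypothetical three-question excerpt (the real exams have 210 questions per cycle):
answer_key = {1: "B", 2: "D", 3: "A"}
model_answers = {1: "B", 2: "C", 3: None}
print(score_mir_exam(answer_key, model_answers))  # 3 - 1 + 0 = 2
```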
Models evaluated
Wide selection of models including both generalist systems and those specialized in the medical domain.
- OpenAI: GPT-4, GPT-4 Turbo, GPT-4o
- Anthropic: Claude 3 (Opus, Sonnet, Haiku)
- Google: Gemini Pro, Gemini Ultra
- Specialized systems: Miri Pro
Study scope
Comprehensive evaluation covering multiple dimensions of medical knowledge and clinical skills.
- Coverage of all MIR medical specialties
- Questions with and without image support
- Evaluation of diagnostic and therapeutic reasoning
- Consistency analysis between exam cycles
Key findings
Results revealing significant differences between the types of models evaluated.
- Specialized models outperform generalists
- Higher accuracy in complex clinical reasoning
- Variability in medical image interpretation
- Performance decrease between the 2024 and 2025 cycles
Conclusions
Important implications for the future of AI in medicine and medical education.
- Potential of domain-specific fine-tuning
- Critical importance of multimodal capabilities
- Need for annually updated benchmarks
- Potential applications in medical education
Notable finding
A slight decrease in performance was observed between the 2024 and 2025 cycles, attributed to changes in question design intended to reduce reliance on memorization.
Explore our results
Check the updated AI model rankings or propose your own model for evaluation.