For three years, Medical Benchmark has evaluated over 300 artificial intelligence models on the MIR exam, the entrance test for specialized medical training in Spain. We have documented how the best generalist models -- Gemini, GPT, Claude -- have been approaching the 100% ceiling, missing fewer and fewer questions, costing less and less money, responding faster and faster.
But they always missed something.
Today we present the results of two models that break that barrier. They are not generalist models. They are not available online. They cannot be tested with a public API. They are custom models, built in Spain with a radically different architecture: Agentic RAG with specialized experts.
MIRI, developed by BinPar for PROMIR (by Editorial Medica Panamericana), has answered 596 out of 600 MIR questions correctly, with only 4 errors over three years and a perfect score of 200/200 on MIR 2026. And it did so at a total cost of $2.38 -- 13 times less than ALMA and comparable to the most affordable standard models.
ALMA, developed by BinPar with content from Editorial Medica Panamericana and Spanish Clinical Guidelines, has answered all 600 questions from the last three MIR exams -- plus all reserve questions -- without a single error.[1] No AI model in the history of MedBench, and to our knowledge, no model on any medical benchmark in the world, has ever achieved a perfect cumulative score over three years.
1. The Results: The 100% Wall
Let's start with the numbers. No embellishments, no hyperbole. Just data.
ALMA's Data
| Exam Session | Correct | Errors | Net Score | Accuracy | Cost | Time/question | Confidence | Reasoning Tokens |
|---|---|---|---|---|---|---|---|---|
| MIR 2024 | 200/200 | 0 | 200.00 | 100.0% | $9.99 | 54.7s | 99.9% | 71K |
| MIR 2025 | 200/200 | 0 | 200.00 | 100.0% | $11.02 | 50.8s | 99.8% | 78K |
| MIR 2026 | 200/200 | 0 | 200.00 | 100.0% | $10.56 | 54.3s | 99.8% | 66K |
| Cumulative | 600/600 | 0 | 600.00 | 100.0% | $31.57 | ~53.3s (avg) | 99.8% (avg) | 215K |
MIRI's Data
| Exam Session | Correct | Errors | Net Score | Accuracy | Cost | Time/question | Confidence |
|---|---|---|---|---|---|---|---|
| MIR 2024 | 198/200 | 2 | 197.33 | 99.0% | $0.78 | 14.2s | 99.9% |
| MIR 2025 | 198/200 | 2 | 197.33 | 99.0% | $0.82 | 15.3s | 99.8% |
| MIR 2026 | 200/200 | 0 | 200.00 | 100.0% | $0.78 | 11.9s | 100.0% |
| Cumulative | 596/600 | 4 | 594.66 | 99.3% | $2.38 | ~13.8s (avg) | 99.9% (avg) |
Now, let's put this in context with the best standard models in the benchmark.
ALMA and MIRI (custom models with Agentic RAG) versus the top 10 standard models on the MIR 2026 benchmark
In MIR 2026, both ALMA and MIRI score 200/200: a perfect score. No standard model has ever achieved 200/200 in any of the three exam sessions. The best standard result in 2026 is 199/200, shared by three models (Gemini 3 Flash, o3, and GPT-5).
The difference may seem minimal -- a single correct answer -- but that one-answer difference, repeated systematically year after year, separates the extraordinary from the perfect.
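The net scores in these tables are consistent with a penalty rule in which each wrong answer subtracts one third of a point (a reconstruction from the table values, not an official scoring specification). A minimal sketch:

```python
def net_score(correct: int, errors: int) -> float:
    """Net score where each wrong answer subtracts one third of a point."""
    return correct - errors / 3

# MIRI on MIR 2024: 198 correct, 2 errors
print(round(net_score(198, 2), 2))  # 197.33

# ALMA on any exam session: 200 correct, 0 errors
print(round(net_score(200, 0), 2))  # 200.0
```

Under this rule, MIRI's two errors per year in 2024 and 2025 each cost two thirds of a point, which is exactly the gap between its 197.33 and a perfect 200.00.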
Top 5 Standard Models in MIR 2026
| Model | Correct | Net Score | Cost |
|---|---|---|---|
| Gemini 3 Flash | 199/200 | 198.67 | $0.34 |
| o3 | 199/200 | 198.67 | $1.94 |
| GPT-5 | 199/200 | 198.67 | $2.05 |
| GPT-5.1 Chat | 198/200 | 197.33 | $0.65 |
| GPT-5 Codex | 198/200 | 197.33 | $0.89 |
2. The Three-Year Perspective
One exam could be luck. Two, coincidence. Three years of consistent results are a pattern.
Cumulative correct answers on MIR 2024, 2025, and 2026 (maximum: 600). Only models with results in all three years are shown.
What this chart shows is ALMA's absolute consistency: 200/200 in all three years, without exception. It not only answers all official questions correctly, but also all reserve questions (201-210) in each exam session. When official questions are annulled and reserves are used, ALMA has them all correct.
MIRI shows a fascinating progression: 198/200 in 2024, 198/200 in 2025, and finally 200/200 in 2026. The model has been improving until it reached perfection.
The best cumulative standard model, Gemini 3 Flash, reaches 590/600 -- an extraordinary result in absolute terms, but 10 correct answers behind ALMA.
Total errors on MIR 2024 + 2025 + 2026 (maximum possible: 600). Lower is better.
The accumulated errors visualization is perhaps the most eloquent. ALMA presents an empty bar: zero errors in three years. MIRI accumulates only 4. The best standard model, Gemini 3 Flash, accumulates 10. The other models in the standard top 5 exceed a dozen errors.
| Comparison | Difference | Implication |
|---|---|---|
| ALMA vs best standard | -10 errors | ALMA makes 0 errors compared to 10 by the best standard model (Gemini 3 Flash) over 3 years |
| MIRI vs best standard | -6 errors | MIRI makes only 4 errors compared to Flash's 10, at a cost only 2.3 times higher |
| MIRI vs ALMA | +4 errors | MIRI makes 4 more errors than ALMA, but its cost is 13.3 times lower ($2.38 vs $31.57) |
| ALMA: cost per error avoided | $3.06/error | Compared to Flash, ALMA costs $30.55 more but avoids 10 errors ($3.06 per error avoided) |
Comparison of accumulated errors over 3 years: custom models vs best standard model
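The cost-per-error-avoided figure follows directly from the cumulative numbers in the table (using decimal arithmetic to avoid floating-point rounding surprises):

```python
from decimal import Decimal, ROUND_HALF_UP

alma_cost, flash_cost = Decimal("31.57"), Decimal("1.02")  # cumulative 3-year cost (USD)
alma_errors, flash_errors = 0, 10                          # cumulative 3-year errors

extra_cost = alma_cost - flash_cost          # 30.55
errors_avoided = flash_errors - alma_errors  # 10
per_error = (extra_cost / errors_avoided).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(f"${per_error} per error avoided")     # $3.06 per error avoided
```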
3. Anatomy of MIRI's Failures
MIRI fails exactly 2 questions on MIR 2024, 2 on MIR 2025, and 0 on MIR 2026. Let's analyze each failure.
MIR 2024: Questions 9 and 13
On MIR 2024, MIRI fails questions 9 and 13. Both are among the first 25 questions of the exam, which are common across all versions (V0-V4).
MIR 2025: Questions 181 and 201
On MIR 2025, MIRI fails questions 181 and 201. Question 201 is a reserve question -- which means that, unlike ALMA which answers all reserves correctly, MIRI misses one.
MIR 2026: Perfection
On MIR 2026, MIRI does not fail any question. Neither the 200 official ones nor the 10 reserves. The model has evolved to achieve perfect performance.
Improvement Pattern
MIRI's evolution illustrates one of the fundamental advantages of the Agentic RAG architecture: the ability to continuously improve without retraining the base model. Each iteration of the corpus and expert configuration produces measurable incremental improvements.
MIRI's error progression: 2 errors in MIR 2024, 2 errors in MIR 2025, perfection in MIR 2026.

| Exam Session | MIRI Errors | ALMA Errors | MIRI Evolution |
|---|---|---|---|
| MIR 2024 | 2 | 0 | Baseline |
| MIR 2025 | 2 | 0 | Maintenance |
| MIR 2026 | 0 | 0 | Perfection |
4. ALMA: Anatomy of Perfection
ALMA is the model developed by BinPar with content from Editorial Medica Panamericana, the leading medical publisher in the Spanish-speaking world, and a selection of clinical guidelines. It is designed as a clinical reference tool for healthcare professionals: practicing physicians, specialists in training, and professionals who need to consult and validate up-to-date clinical knowledge within a healthcare organization or health service.
It is currently used by tens of thousands of professionals at CATSalut (the Catalan health service).
The corpus: clinical guidelines and recommendations
ALMA's fundamental advantage lies in both its architecture and its corpus. Editorial Medica Panamericana has one of the most comprehensive medical literature catalogs in Spanish, including:
- Content specifically designed for competitive exam preparation (including the MIR)
- Reference treatises across all medical specialties
- Clinical guidelines from major scientific societies
- Updated protocols based on the most recent scientific evidence
- Training material designed and reviewed by specialists
This corpus has been processed and optimized for consumption by language models, creating a specialized synthetic corpus that maximizes the density of relevant information per token.[2]
The orchestrator: Claude Sonnet 4.5 on Bedrock Aragon
ALMA's orchestrator model is Claude Sonnet 4.5 with extended reasoning, running on Amazon Bedrock in the Aragon datacenter (Spain). This choice is deliberate: it ensures that all inference data -- the medical questions, clinical contexts, and responses -- are processed within the European Union, with the strictest legal and privacy guarantees.[3]
Detailed Metrics
| Metric | MIR 2024 | MIR 2025 | MIR 2026 |
|---|---|---|---|
| Accuracy | 100.0% | 100.0% | 100.0% |
| Cost per exam | $9.99 | $11.02 | $10.56 |
| Cost per question | $0.048 | $0.052 | $0.050 |
| Time per question | 54.2s | 50.8s | 54.3s |
| Average confidence | 99.9% | 99.8% | 99.8% |
| Reasoning tokens | 71K | 78K | 66K |
The average cost of ~$10.50 per exam (approximately 10 EUR at current exchange rates) is significant compared to standard models like Gemini Flash ($0.34), but it must be put in context: ALMA does not fail any question. In three years. Including reserves. The cost of an error in a real clinical context can be infinitely greater than $10.
The average time of ~53 seconds per question reflects the iterative nature of the architecture: the orchestrator consults multiple experts (specialized virtual agents), evaluates their responses, can request clarifications, and synthesizes a final answer. Each question receives the equivalent of a "medical board" review by ~32 specialists.
600/600: Unprecedented
To understand the magnitude of this result, it is worth remembering that:
- No standard model among the ~290 evaluated has ever achieved 200/200 in a single exam session.
- The best cumulative standard is 590/600 (Gemini 3 Flash) -- 10 errors.
- ALMA not only answers all 200 official questions correctly, but also the 10 reserves from each year (210/210 x 3).
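Note 4 quantifies how demanding a clean three-year run is: even a hypothetical model with 99.5% per-question accuracy would rarely get through 600 consecutive questions unscathed:

```python
p_question = 0.995                 # hypothetical per-question accuracy
p_perfect_run = p_question ** 600  # probability of 600 consecutive correct answers
print(round(p_perfect_run, 3))     # 0.049 -- under a 5% chance, even at 99.5% accuracy
```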
5. MIRI: Precision for the General Public
MIRI is the model developed by BinPar for PROMIR, the MIR preparation platform by Editorial Medica Panamericana. If ALMA is designed for professionals working in a clinical environment, MIRI is designed for medical students, residents, MIR exam candidates, and independent professionals who need to resolve questions quickly and accurately.
Design Philosophy
MIRI's architecture follows the same principles as ALMA -- central orchestrator + specialized experts + knowledge corpus -- but with a different optimization profile:
- Priority on cost and speed, without sacrificing critical accuracy
- Fast response times (~13 seconds per question vs ~53 for ALMA)
- Optimized cost ($0.78-$0.82 per full exam)
The Value Proposition
Cumulative cost (3 exams) vs. cumulative accuracy (3 years). The custom models achieve higher accuracy at a competitive cost.
This chart reveals the strategic position of each model:
- ALMA (gold dot, upper right): maximum accuracy (100%), moderate cost ($31.57 cumulative). It is the "no compromise" option where accuracy is the only thing that matters.
- MIRI (teal dot, upper center): near-perfect accuracy (99.3%), minimal cost ($2.38 cumulative). It is the best value-for-money option on the market.
- Gemini 3 Flash (gray dot, lower left): excellent accuracy (98.3%), unbeatable cost ($1.02 cumulative). But 10 more errors than ALMA and 6 more than MIRI.
6. Architecture: The Agentic RAG
How is it possible for custom models to consistently outperform the best generalist models in the world? The answer lies in the architecture.
Agentic RAG architecture: an orchestrator (an advanced reasoning LLM) routes each question to ~32 experts grouped into clinical specialties, surgical specialties, basic and diagnostic sciences, and support and context, all backed by a specialized synthetic corpus optimized for consumption by LLMs rather than human reading. The orchestrator analyzes each question, selects the relevant experts, and synthesizes their responses over multiple iterations, reasoning in English.
Agentic RAG (Retrieval-Augmented Generation with agents) represents the most advanced evolution of traditional RAG systems.[5] While a standard RAG retrieves relevant documents and passes them to the model in a single step, Agentic RAG introduces a radically superior level of sophistication.
The Orchestrator
At the center of the architecture sits an advanced reasoning model that acts as a conductor. When it receives a medical question, the orchestrator does not simply search for information: it analyzes the question, identifies which specialties are relevant, and decides which experts to consult.
This process is iterative. If an expert's response is insufficient or contradicts another's, the orchestrator can:
- Reformulate the query and ask again
- Consult additional experts not initially considered
- Request deeper analysis on a specific aspect
- Cross-reference responses between multiple experts
This pattern of iterative, multi-agent consultation has been shown to consistently outperform direct LLM usage in both medicine and other specialized domains.[6]
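A minimal sketch of this iterative consultation loop (names and structure are illustrative; the real orchestrator is a reasoning LLM, not a rule engine, and `consult` stands in for a call to a specialized RAG expert):

```python
from dataclasses import dataclass

@dataclass
class ExpertAnswer:
    expert: str
    answer: str
    confident: bool

def consult(expert: str, query: str) -> ExpertAnswer:
    # Stand-in for a call to a specialized RAG expert (an LLM plus its corpus subset).
    return ExpertAnswer(expert, f"{expert} assessment of: {query}", confident=True)

def orchestrate(question: str, specialties: list[str], max_rounds: int = 3) -> list[ExpertAnswer]:
    """Consult the selected experts; re-query (e.g. reformulated) any non-confident ones."""
    answers: list[ExpertAnswer] = []
    pending = list(specialties)
    for _ in range(max_rounds):
        if not pending:
            break
        round_answers = [consult(s, question) for s in pending]
        answers.extend(round_answers)
        # Only experts whose answers were insufficient are queried again.
        pending = [a.expert for a in round_answers if not a.confident]
    return answers

answers = orchestrate("Patient with ST elevation on ECG...", ["cardiology", "emergency"])
print(len(answers))  # 2
```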
The ~32 Specialized Experts
Each expert is a RAG system specialized in a specific medical discipline (cardiology, pulmonology, pharmacology, etc.). It has access to a subset of the corpus optimized for its specialty and is configured to answer questions within its domain with maximum accuracy.
The key is intelligent subdelegation: the experts are not simply models with a different prompt. Each one has its own knowledge base, its own context, and can in turn delegate sub-queries to other experts when it detects that a question crosses boundaries between specialties.
This design aligns with recent research on multi-agent systems for medical diagnosis,[7] specialized agent orchestration[8] and agent graph optimization.[9]
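The subdelegation idea can be sketched with toy topic scopes (both experts and their scopes here are hypothetical; in the real system each expert is a RAG system over a corpus subset):

```python
# Toy scopes for two hypothetical experts.
EXPERT_SCOPE = {
    "cardiology": {"arrhythmia", "heart failure", "ecg"},
    "pharmacology": {"dosing", "drug interaction", "contraindication"},
}

def answer_with_subdelegation(expert: str, topics: set[str]) -> dict:
    """An expert keeps in-scope topics and delegates the rest to peers that cover them."""
    in_scope = topics & EXPERT_SCOPE[expert]
    remaining = topics - in_scope
    delegated = {
        peer: overlap
        for peer, scope in EXPERT_SCOPE.items()
        if peer != expert and (overlap := remaining & scope)
    }
    return {"expert": expert, "in_scope": in_scope, "delegated": delegated}

result = answer_with_subdelegation("cardiology", {"arrhythmia", "drug interaction"})
print(result["delegated"])  # {'pharmacology': {'drug interaction'}}
```

The cardiology expert handles the arrhythmia aspect itself and hands the drug-interaction aspect to pharmacology, mirroring how a question that crosses specialty boundaries is split.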
Multimodal Support
Both ALMA and MIRI process questions with clinical images (X-rays, electrocardiograms, dermatological photographs, etc.). The multimodal system allows experts to analyze images within their specialized context: a virtual cardiologist analyzes an ECG with the same level of detail it would dedicate to a textual report.
Synthetic Corpus Optimized for LLMs
A crucial innovation is the nature of the corpus. It is not about copying textbooks and passing them to the model. The corpus has been synthesized and reformatted specifically to maximize comprehension by language models.[10]
The original medical documents -- clinical guidelines, protocols, treatises -- are processed through a pipeline that:
- Extracts the clinically relevant information
- Eliminates redundancy and human-oriented formatting
- Restructures the information into formats that LLMs process more efficiently
- Enriches with cross-specialty relationships[11]
The result is a corpus that a human would find difficult to read, but that an LLM processes with maximum efficiency.
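The four pipeline stages can be sketched as follows (every stage here is a trivial stub; in the real pipeline each stage is LLM-driven, and the function names are illustrative):

```python
def extract_clinical_facts(doc: str) -> list[str]:
    # Stub: keep non-empty lines as candidate clinical statements.
    return [line.strip() for line in doc.splitlines() if line.strip()]

def drop_redundancy(facts: list[str]) -> list[str]:
    # Stub: remove exact duplicates while preserving order.
    return list(dict.fromkeys(facts))

def to_llm_format(facts: list[str]) -> list[str]:
    # Stub: normalize into terse machine-oriented records.
    return [f"FACT: {f}" for f in facts]

def enrich_cross_links(facts: list[str]) -> dict:
    # Stub: attach a (here empty) cross-specialty relation map.
    return {"facts": facts, "relations": []}

def process_document(raw_doc: str) -> dict:
    """Extract -> eliminate redundancy -> restructure -> enrich, as described above."""
    return enrich_cross_links(to_llm_format(drop_redundancy(extract_clinical_facts(raw_doc))))

corpus = process_document("Aspirin inhibits COX-1.\n\nAspirin inhibits COX-1.\nUse with caution in asthma.")
print(len(corpus["facts"]))  # 2
```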
Reasoning in English
Although MIR questions are in Spanish and answers are generated in Spanish, all internal reasoning and communication between the orchestrator and experts is conducted in English.[12]
This decision is based on a well-documented empirical reality: current LLMs, regardless of their multilingual support, have a richer and more efficient internal representation in English.[13] English tokens encode more semantic information per token, reasoning is more precise, and chains of thought produce fewer errors.
In practice, this means that ALMA and MIRI:
- Receive the question in Spanish
- Internally translate it to English for reasoning
- The experts reason and communicate in English (providing translation directives for medical terminology that requires it)
- The orchestrator synthesizes the final answer in English
- The answer is translated to Spanish for output
This pipeline adds a layer of complexity, but the benefit in accuracy more than compensates for the additional token cost.
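The five steps above can be sketched as a pipeline (all functions are stubs standing in for LLM calls; the names are illustrative, not the system's actual API):

```python
def translate(text: str, to: str) -> str:
    # Stub for the ES<->EN translation step (with medical-terminology directives).
    return f"[{to}] {text}"

def reason_in_english(expert: str, question_en: str) -> str:
    # Stub: each expert reasons over the English-language question.
    return f"{expert}: findings for {question_en}"

def synthesize(views: list[str]) -> str:
    # Stub: the orchestrator merges expert views into one English answer.
    return " | ".join(views)

def answer_question(question_es: str) -> str:
    """Spanish question -> English reasoning among experts -> English synthesis -> Spanish answer."""
    question_en = translate(question_es, to="en")
    views = [reason_in_english(e, question_en) for e in ("cardiology", "pharmacology")]
    return translate(synthesize(views), to="es")

print(answer_question("¿Cuál es el tratamiento de elección?")[:4])  # [es]
```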
Multilingual processing pipeline: the Spanish question is translated into English, the experts reason and the orchestrator synthesizes in English, and the answer is returned in Spanish.
7. Technical Innovations
Beyond the general architecture, ALMA and MIRI incorporate several technical innovations that contribute to their exceptional performance.
7.1. Synthetic Corpus for LLMs
Synthetic data generation for training and use with LLMs is a rapidly evolving field.[10] In the medical context, frameworks like MedSyn have demonstrated that synthetic data can significantly improve performance on clinical tasks.[11]
The fundamental difference between the ALMA/MIRI corpus and conventional synthetic data is the objective: it is not about generating data to train (fine-tune) a model, but rather creating a corpus optimized for retrieval and consultation (RAG). This allows updating knowledge without modifying the base model's weights.
Corpus processing pipeline (extract → eliminate → restructure → enrich): medical documents such as clinical guidelines and protocols are transformed into a synthetic corpus optimized for consumption by language models.
7.2. Incremental Updates with RLM
One of the critical challenges of any medical AI system is keeping knowledge up to date. Clinical guidelines change, new clinical trials are published, therapeutic protocols are updated.
ALMA and MIRI use an incremental update system based on Recursive Language Models (RLM).[14] Instead of rebuilding the entire corpus when there is an update, the system:
- Detects which corpus fragments have become outdated
- Generates new synthesized versions of the updated information
- Integrates the new fragments while maintaining coherence with the rest of the corpus
- Verifies that the update does not introduce contradictions
This process is monitored in real time and allows the corpus to be kept continuously up to date, without service interruptions.
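The detect-and-integrate steps can be sketched with content fingerprints (a simplified illustration; the actual system uses RLM-based synthesis and a coherence/contradiction check that the stub below omits):

```python
import hashlib

def fingerprint(text: str) -> str:
    # Content hash used to detect whether a fragment actually changed.
    return hashlib.sha256(text.encode()).hexdigest()

def incremental_update(corpus: dict[str, str], updates: dict[str, str]) -> dict[str, str]:
    """Replace only fragments whose content changed; leave the rest of the corpus untouched."""
    merged = dict(corpus)
    for frag_id, new_text in updates.items():
        if frag_id not in corpus or fingerprint(corpus[frag_id]) != fingerprint(new_text):
            merged[frag_id] = new_text  # integrate the re-synthesized fragment
    return merged

corpus = {"htn-01": "First-line: ACE inhibitors.", "dm-07": "Metformin first-line."}
updated = incremental_update(corpus, {"htn-01": "First-line: ACE inhibitors or ARBs."})
print(updated["dm-07"])  # Metformin first-line.
```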
7.3. Token Caching and Infinite Context
With ~32 experts and multiple consultation iterations, the number of tokens processed per question can be enormous. To keep costs under control and speed at acceptable levels, the system implements advanced token caching techniques.
KV-Cache optimization is fundamental to the efficiency of modern LLMs.[15] Techniques like SnapKV allow compressing the attention cache without significant performance loss.[16] Systems like LMCache take this optimization a step further, allowing cache sharing across multiple queries.[17]
ALMA and MIRI implement a technique we call memory tree with subdelegation: the orchestrator maintains a context tree where each branch corresponds to a consulted expert. When an expert needs to consult another, a new branch is created that inherits relevant context from the parent without duplicating tokens. This allows maintaining "conversations" between experts efficiently.
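A minimal sketch of such a context tree, assuming the key property described above: a child branch references its parent's context rather than copying it, so delegation does not duplicate tokens (class and method names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class ContextNode:
    """One branch of the memory tree: the conversation context of a single agent."""
    agent: str
    local_tokens: list[str] = field(default_factory=list)
    parent: "ContextNode | None" = None

    def delegate(self, agent: str) -> "ContextNode":
        # A sub-query inherits the parent chain by reference: no token duplication.
        return ContextNode(agent, parent=self)

    def visible_context(self) -> list[str]:
        # What this agent sees: everything up the chain, then its own tokens.
        inherited = self.parent.visible_context() if self.parent else []
        return inherited + self.local_tokens

root = ContextNode("orchestrator", ["<question>"])
cardio = root.delegate("cardiology")
cardio.local_tokens.append("<ecg findings>")
print(cardio.visible_context())  # ['<question>', '<ecg findings>']
```

The cardiology branch sees the question plus its own findings, while the orchestrator's context stays unchanged, which is what makes expert-to-expert "conversations" cheap.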
7.4. Reasoning in English
As mentioned in the architecture section, all internal reasoning is conducted in English. Recent research confirms that multilingual LLMs tend to "think" in English internally, regardless of the input language.[12] Other studies on multilingual reasoning corroborate that performance on complex reasoning tasks improves significantly when English is enforced as the internal processing language.[13]
From a token efficiency perspective, English offers greater semantic representativeness per token: the same medical idea expressed in English typically requires fewer tokens than in Spanish, which reduces costs and allows processing more context within the model's attention window.
8. Data Sovereignty: Bedrock in Aragon
In the context of an AI model that processes medical information -- potentially including patient clinical data in future deployments -- data sovereignty is not a technical detail: it is a fundamental legal and ethical requirement.
ALMA and Bedrock Aragon
ALMA's orchestrator model runs on Amazon Bedrock, specifically in the Aragon (Spain) datacenter. This configuration guarantees:
- Processing within the EU: all inference data is processed on servers located on Spanish territory, within the jurisdiction of the European Union.
- No Anthropic access to data: by running Claude through Bedrock, Amazon acts as a data processor under contract with the client. Anthropic, the developer of Claude, has no access to the queries, contexts, or generated responses. This is fundamentally different from using Anthropic's direct API.
- GDPR compliance: processing complies with the EU General Data Protection Regulation, including the principles of data minimization, purpose limitation, and processing security.
- AI Act compatibility: the architecture is designed to comply with the requirements of the European AI Regulation, which classifies medical AI systems as "high risk" and imposes specific obligations for transparency, documentation, and human oversight.[18]
The experts: specialized models with guarantees
The expert models -- smaller and more specialized than the orchestrator -- run with the same security guarantees. The separation between the orchestrator (which sees the complete question) and the experts (which receive fragmented and decontextualized queries) provides an additional layer of protection: no individual expert has access to the complete clinical context of a case.
Data sovereignty architecture (EU/Spain, Bedrock Aragon): the medical question, orchestrator, specialized experts, medical corpus, and response all remain within the EU; Anthropic has no access to inference data.
| Parameter | Status | Implication |
|---|---|---|
| Processing location | Spain (EU) | Amazon datacenter in Aragon. All data remains on Spanish territory. |
| Model provider access | No access | Anthropic does not access inference data when used through Bedrock. |
| GDPR compliance | Complete | Amazon as data processor, BinPar as data controller. |
| AI Act (high risk) | Designed | Architecture prepared for AI Act transparency and oversight requirements. |
Sovereignty and data protection guarantees in the ALMA architecture
Implications for the Healthcare Sector
The demonstration that it is possible to achieve perfect performance without sending medical data outside the EU has profound implications for AI adoption in the European healthcare sector. Historically, concerns about data sovereignty have been one of the main barriers to implementing medical AI systems in European hospitals and health centers.[19]
ALMA demonstrates that this dilemma between performance and privacy is a false dilemma: it is possible to have both.
9. Implications for Medical AI
The results of ALMA and MIRI reinforce and extend conclusions we already pointed to in previous articles, but with unprecedented force.
Agentic RAG > Fine-tuning
In our previous analysis on "The Cathedral and the Bazaar", we argued that customization through RAG offers fundamental advantages over fine-tuning for medical applications. ALMA and MIRI are the definitive empirical demonstration of this thesis.
Recent studies on AI agents in clinical medicine confirm that agentic systems consistently outperform base models, even when the latter have been specifically fine-tuned for the medical domain.[20] The reason is simple: a fine-tuned model modifies its weights statically, while an agentic RAG system can query dynamically updated information.
RAG vs. Fine-Tuning in medical tasks. Data from: MDPI Bioengineering 2025 (BLEU), PMC systematic review (hallucinations), medRxiv 2025 (agents).
Customization Without Modifying Weights
ALMA and MIRI use the same base models that are publicly available (Claude for ALMA, confidential model for MIRI). The performance difference does not come from modifications to the models, but from:
- The corpus -- what information is provided to them
- The architecture -- how the query is organized
- The experts -- how knowledge is specialized
- The iteration -- how many times the answer is refined
This means that ALMA/MIRI's advantage is reproducible by any organization that has access to quality medical corpora and the technical capacity to implement an agentic architecture.
The Future: Continuous Corpus Updates
Perhaps the most relevant long-term implication is that ALMA and MIRI can continuously improve without needing to retrain models. When a new clinical guideline is published, a therapeutic protocol is updated, or a new diagnostic association is discovered, it is enough to update the corpus. The system incorporates the new knowledge immediately.
This model of "knowledge as a service" -- where intelligence resides in the corpus and architecture, not in the model's weights -- could redefine how medical AI systems are developed and deployed in the next decade.
10. Conclusions
ALMA Demonstrates That Perfection Is Achievable
600 questions. Three years of exams designed to select the best doctors in Spain. Zero errors. ALMA demonstrates that, with the right architecture, the appropriate corpus, and the necessary investment, it is possible to build a medical AI system that does not fail. Not "almost never." Never.
MIRI Demonstrates That Excellence Is Accessible
596/600 at a cost of $2.38. MIRI demonstrates that near-perfect accuracy does not require astronomical budgets. A medical student can access a system that outperforms any standard model on the market for less than the cost of a coffee.
The Agentic Approach Surpasses Any Generalist Model
No generalist model -- not Gemini, not GPT-5, not Claude, not any of the ~290 evaluated -- has ever achieved 200/200 in a single exam session. ALMA achieves it in all three. MIRI achieves it in the most recent one. Specialization through experts, combined with an advanced reasoning orchestrator, produces results that the "one model for everything" approach cannot match.
Data Sovereignty Is Compatible With Maximum Performance
ALMA processes all its inference in Spain, without sending data outside the EU, without Anthropic accessing the queries. And it still achieves a perfect result. Privacy and performance are not conflicting objectives.
What Comes Next
These results open the door to real clinical deployments of medical AI systems based on Agentic RAG. Not as substitutes for clinical judgment, but as diagnostic support systems with demonstrated and verifiable reliability.
At Medical Benchmark, we will continue evaluating both standard and custom models, documenting the state of the art with the rigor and transparency that characterize our platform. All results are available on our rankings platform.
ALMA and MIRI have been evaluated under the same conditions as all other models in the benchmark: same prompt, same questions, same timing. The results are verifiable and reproducible. Although the evaluations were conducted after each exam took place, the models have no access to the internet or any information about the results or correct answers to the questions, so there is no possibility of data contamination.
Notes and References
- ALMA answers correctly not only the 200 official questions (valid after annulments), but also the 10 reserve questions (201-210) from each exam session. Total: 210/210 x 3 years = 630/630 including reserves, 600/600 considering only valid exam questions.
- Long, Y., et al. "LLMs Meet Synthetic Data Generation: A Survey". ACL 2024. Synthetic data generation for LLMs enables the creation of corpora optimized for retrieval and reasoning. Link
- Amazon Bedrock in the eu-south-2 region (Aragon, Spain). Anthropic does not access inference data in Bedrock deployments. AWS Bedrock data protection documentation
- Calculation: 0.995^600 ~ 0.049, meaning a model with 99.5% accuracy per question has approximately a 4.9% probability of answering 600 consecutive questions correctly. ALMA achieves this with 100% accuracy per question.
- Singh, A., et al. "Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG". arXiv:2501.09136, 2025. Link
- "MA-RAG: Multi-Agent Retrieval-Augmented Generation". arXiv:2505.20096, 2025. Multi-agent RAG systems outperform traditional RAG in accuracy and reasoning capability. Link
- Zuo, Y., et al. "KG4Diagnosis: A Hierarchical Multi-Agent LLM Framework with Knowledge Graph Enhancement for Medical Diagnosis". arXiv:2412.16833, 2024. Link
- Zhang, C., et al. "AgentOrchestra: Orchestrating Specialized Agents for Complex Tasks". arXiv:2506.12508, 2025. Link
- Zhuge, M., et al. "GPTSwarm: Language Agents as Optimizable Graphs". ICML 2024. Link
- Long, Y., et al. "LLMs Meet Synthetic Data Generation: A Survey". ACL 2024. Link
- Kumichev, A., et al. "MedSyn: LLM-based Synthetic Medical Text Generation Framework". arXiv:2408.02056, 2024. Link
- Schut, L., Gal, Y., Farquhar, S. "Do Multilingual LLMs Think In English?". ICML 2025. Multilingual models process internally in English even with inputs in other languages. Link
- "Multilingual Reasoning: A Survey of Challenges and Approaches". 2025. Reasoning in English produces better results than in other languages, even for tasks in those languages. Link
- Zhang, T., Kraska, T., Khattab, O. "Recursive Language Models". arXiv:2512.24601, 2025. Link
- Luohe, S., et al. "A Survey on KV-Cache Optimization for Large Language Models". arXiv:2407.18003, COLM 2024. Link
- Li, Y., et al. "SnapKV: LLM Knows What You are Looking for Before Generation". NeurIPS 2024. Link
- "LMCache: Efficient KV-Cache Management for Large Language Models". arXiv:2510.09665, 2025. Link
- Minssen, T., et al. "The EU AI Act and Its Implications for Medical Products". npj Digital Medicine, 2024. Link
- "The EU AI Act: Implications for Healthcare AI Systems". 2024. Medical AI systems are classified as high risk under the AI Act, requiring conformity assessments and human oversight.
- "AI Agents in Clinical Medicine: Promise and Challenges". PMC, 2025. AI agents outperform base models in clinical tasks by combining reasoning with access to specialized knowledge.