For three years, Medical Benchmark has evaluated over 300 artificial intelligence models on the MIR exam, the entrance test for specialized medical training in Spain. We have documented how the best generalist models -- Gemini, GPT, Claude -- have been approaching the 100% ceiling, missing fewer and fewer questions, costing less and less money, responding faster and faster.
But they always missed something.
Today we present the results of two models that break that barrier. They are not generalist models. They are not available online. They cannot be tested with a public API. They are custom models, built in Spain with a radically different architecture: Agentic RAG with specialized experts.
MIRI, developed by BinPar for PROMIR (by Editorial Medica Panamericana), has answered 596 out of 600 MIR questions correctly, with only 4 errors over three years and a perfect score of 200/200 on MIR 2026. And it did so at a total cost of $2.38 -- 13 times less than ALMA and comparable to the most affordable standard models.
ALMA, developed by BinPar with content from Editorial Medica Panamericana and Spanish Clinical Guidelines, has answered all 600 questions from the last three MIR exams -- plus all reserve questions -- without a single error.[1] No AI model in the history of MedBench, and to our knowledge, no model on any medical benchmark in the world, has ever achieved a perfect cumulative score over three years.
1. The Results: The 100% Wall
Let's start with the numbers. No embellishments, no hyperbole. Just data.
ALMA's Data
| Exam Session | Correct | Errors | Net Score | Accuracy | Cost | Time/question | Confidence | Reasoning Tokens |
|---|---|---|---|---|---|---|---|---|
| MIR 2024 | 200/200 | 0 | 200.00 | 100.0% | $9.99 | 54.7s | 99.9% | 71K |
| MIR 2025 | 200/200 | 0 | 200.00 | 100.0% | $11.02 | 50.8s | 99.8% | 78K |
| MIR 2026 | 200/200 | 0 | 200.00 | 100.0% | $10.56 | 54.3s | 99.8% | 66K |
| Cumulative | 600/600 | 0 | 600.00 | 100.0% | $31.57 | ~53.3s (avg) | 99.8% (avg) | 215K |
MIRI's Data
| Exam Session | Correct | Errors | Net Score | Accuracy | Cost | Time/question | Confidence |
|---|---|---|---|---|---|---|---|
| MIR 2024 | 198/200 | 2 | 197.33 | 99.0% | $0.78 | 14.2s | 99.9% |
| MIR 2025 | 198/200 | 2 | 197.33 | 99.0% | $0.82 | 15.3s | 99.8% |
| MIR 2026 | 200/200 | 0 | 200.00 | 100.0% | $0.78 | 11.9s | 100.0% |
| Cumulative | 596/600 | 4 | 594.66 | 99.3% | $2.38 | ~13.8s (avg) | 99.9% (avg) |
Now, let's put this in context with the best standard models in the benchmark.
ALMA and MIRI (custom models with Agentic RAG) versus the top 10 standard models on the MIR 2026 benchmark
In MIR 2026, both ALMA and MIRI score 200/200: a perfect score. No standard model has ever achieved 200/200 in any of the three exam sessions. The best standard result in 2026 is 199/200, shared by three models (Gemini 3 Flash, o3, and GPT-5).
The difference may seem minimal -- a single correct answer -- but that one-answer difference, repeated systematically year after year, separates the extraordinary from the perfect.
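The net scores in these tables are consistent with a penalty rule in which each wrong answer subtracts one third of a point (a reconstruction from the table values, not an official scoring specification). A minimal sketch:

```python
def net_score(correct: int, errors: int) -> float:
    """Net score where each wrong answer subtracts one third of a point."""
    return correct - errors / 3

# MIRI on MIR 2024: 198 correct, 2 errors
print(round(net_score(198, 2), 2))  # 197.33

# ALMA on any exam session: 200 correct, 0 errors
print(round(net_score(200, 0), 2))  # 200.0
```

Under this rule, MIRI's two errors per year in 2024 and 2025 each cost two thirds of a point, which is exactly the gap between its 197.33 and a perfect 200.00.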
Top 5 Standard Models in MIR 2026
| Model | Correct | Net Score | Cost |
|---|---|---|---|
| Gemini 3 Flash | 199/200 | 198.67 | $0.34 |
| o3 | 199/200 | 198.67 | $1.94 |
| GPT-5 | 199/200 | 198.67 | $2.05 |
| GPT-5.1 Chat | 198/200 | 197.33 | $0.65 |
| GPT-5 Codex | 198/200 | 197.33 | $0.89 |
2. The Three-Year Perspective
One exam could be luck. Two, coincidence. Three years of consistent results are a pattern.
Cumulative correct answers on MIR 2024, 2025, and 2026 (maximum: 600). Only models with results in all three years are shown.
What this chart shows is ALMA's absolute consistency: 200/200 in all three years, without exception. It not only answers all official questions correctly, but also all reserve questions (201-210) in each exam session. When official questions are annulled and reserves are used, ALMA has them all correct.
MIRI shows a fascinating progression: 198/200 in 2024, 198/200 in 2025, and finally 200/200 in 2026. The model has been improving until it reached perfection.
The best cumulative standard model, Gemini 3 Flash, reaches 590/600 -- an extraordinary result in absolute terms, but 10 correct answers behind ALMA.
Total errors on MIR 2024 + 2025 + 2026 (maximum possible: 600). Lower is better.
The accumulated errors visualization is perhaps the most eloquent. ALMA presents an empty bar: zero errors in three years. MIRI accumulates only 4. The best standard model, Gemini 3 Flash, accumulates 10. The other models in the standard top 5 exceed a dozen errors.
| Comparison | Difference | Implication |
|---|---|---|
| ALMA vs best standard | -10 errors | ALMA makes 0 errors compared to 10 by the best standard model (Gemini 3 Flash) over 3 years |
| MIRI vs best standard | -6 errors | MIRI makes only 4 errors compared to Flash's 10, at a cost only 2.3 times higher |
| MIRI vs ALMA | +4 errors | MIRI makes 4 more errors than ALMA, but its cost is 13.3 times lower ($2.38 vs $31.57) |
| ALMA: cost per error avoided | $3.06/error | Compared to Flash, ALMA costs $30.55 more but avoids 10 errors ($3.06 per error avoided) |
Comparison of accumulated errors over 3 years: custom models vs best standard model
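The cost-per-error-avoided figure follows directly from the cumulative numbers in the table (using decimal arithmetic to avoid floating-point rounding surprises):

```python
from decimal import Decimal, ROUND_HALF_UP

alma_cost, flash_cost = Decimal("31.57"), Decimal("1.02")  # cumulative 3-year cost (USD)
alma_errors, flash_errors = 0, 10                          # cumulative 3-year errors

extra_cost = alma_cost - flash_cost          # 30.55
errors_avoided = flash_errors - alma_errors  # 10
per_error = (extra_cost / errors_avoided).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(f"${per_error} per error avoided")     # $3.06 per error avoided
```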
3. Anatomy of MIRI's Failures
MIRI fails exactly 2 questions on MIR 2024, 2 on MIR 2025, and 0 on MIR 2026. Let's analyze each failure.
MIR 2024: Questions 9 and 13
On MIR 2024, MIRI fails questions 9 and 13. Both are among the first 25 questions of the exam, which are common across all versions (V0-V4).
MIR 2025: Questions 181 and 201
On MIR 2025, MIRI fails questions 181 and 201. Question 201 is a reserve question -- which means that, unlike ALMA which answers all reserves correctly, MIRI misses one.
MIR 2026: Perfection
On MIR 2026, MIRI does not fail any question. Neither the 200 official ones nor the 10 reserves. The model has evolved to achieve perfect performance.
Improvement Pattern
MIRI's evolution illustrates one of the fundamental advantages of the Agentic RAG architecture: the ability to continuously improve without retraining the base model. Each iteration of the corpus and expert configuration produces measurable incremental improvements.
MIRI's error progression: 2 errors in MIR 2024, 2 errors in MIR 2025, perfection in MIR 2026.

| Exam Session | MIRI Errors | ALMA Errors | MIRI Evolution |
|---|---|---|---|
| MIR 2024 | 2 | 0 | Baseline |
| MIR 2025 | 2 | 0 | Maintenance |
| MIR 2026 | 0 | 0 | Perfection |
4. ALMA: Anatomy of Perfection
ALMA is the model developed by BinPar with content from Editorial Medica Panamericana, the leading medical publisher in the Spanish-speaking world, and a selection of clinical guidelines. It is designed as a clinical reference tool for healthcare professionals: practicing physicians, specialists in training, and professionals who need to consult and validate up-to-date clinical knowledge within a healthcare organization or health service.
It is currently used by tens of thousands of professionals at CATSalut (the Catalan health service).
The corpus: clinical guidelines and recommendations
ALMA's fundamental advantage lies in both its architecture and its corpus. Editorial Medica Panamericana has one of the most comprehensive medical literature catalogs in Spanish, including:
- Content specifically designed for competitive exam preparation (including the MIR)
- Reference treatises across all medical specialties
- Clinical guidelines from major scientific societies
- Updated protocols based on the most recent scientific evidence
- Training material designed and reviewed by specialists
This corpus has been processed and optimized for consumption by language models, creating a specialized synthetic corpus that maximizes the density of relevant information per token.[2]
The orchestrator: Claude Sonnet 4.5 on Bedrock Aragon
ALMA's orchestrator model is Claude Sonnet 4.5 with extended reasoning, running on Amazon Bedrock in the Aragon datacenter (Spain). This choice is deliberate: it ensures that all inference data -- the medical questions, clinical contexts, and responses -- are processed within the European Union, with the strictest legal and privacy guarantees.[3]
Detailed Metrics
| Metric | MIR 2024 | MIR 2025 | MIR 2026 |
|---|---|---|---|
| Accuracy | 100.0% | 100.0% | 100.0% |
| Cost per exam | $9.99 | $11.02 | $10.56 |
| Cost per question | $0.048 | $0.052 | $0.050 |
| Time per question | 54.2s | 50.8s | 54.3s |
| Average confidence | 99.9% | 99.8% | 99.8% |
| Reasoning tokens | 71K | 78K | 66K |
The average cost of ~$10.50 per exam (approximately 10 EUR at current exchange rates) is significant compared to standard models like Gemini Flash ($0.34), but it must be put in context: ALMA does not fail any question. In three years. Including reserves. The cost of an error in a real clinical context can be infinitely greater than $10.
The average time of ~53 seconds per question reflects the iterative nature of the architecture: the orchestrator consults multiple experts (specialized virtual agents), evaluates their responses, can request clarifications, and synthesizes a final answer. Each question receives the equivalent of a "medical board" review by ~32 specialists.
600/600: Unprecedented
To understand the magnitude of this result, it is worth remembering that:
- No standard model among the ~290 evaluated has ever achieved 200/200 in a single exam session.
- The best cumulative standard is 590/600 (Gemini 3 Flash) -- 10 errors.
- ALMA not only answers all 200 official questions correctly, but also the 10 reserves from each year (210/210 x 3).
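Note 4 quantifies how demanding a clean three-year run is: even a hypothetical model with 99.5% per-question accuracy would rarely get through 600 consecutive questions unscathed:

```python
p_question = 0.995                 # hypothetical per-question accuracy
p_perfect_run = p_question ** 600  # probability of 600 consecutive correct answers
print(round(p_perfect_run, 3))     # 0.049 -- under a 5% chance, even at 99.5% accuracy
```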
5. MIRI: Precision for the General Public
MIRI is the model developed by BinPar for PROMIR, the MIR preparation platform by Editorial Medica Panamericana. If ALMA is designed for professionals working in a clinical environment, MIRI is designed for medical students, residents, MIR exam candidates, and independent professionals who need to resolve questions quickly and accurately.
Design Philosophy
MIRI's architecture follows the same principles as ALMA -- central orchestrator + specialized experts + knowledge corpus -- but with a different optimization profile:
- Priority on cost and speed, without sacrificing critical accuracy
- Fast response times (~13 seconds per question vs ~53 for ALMA)
- Optimized cost ($0.78-$0.82 per full exam)
The Value Proposition
Cumulative cost (3 exams) vs. cumulative accuracy (3 years). The custom models achieve higher accuracy at a competitive cost.
This chart reveals the strategic position of each model:
- ALMA (gold dot, upper right): maximum accuracy (100%), moderate cost ($31.57 cumulative). It is the "no compromise" option where accuracy is the only thing that matters.
- MIRI (teal dot, upper center): near-perfect accuracy (99.3%), minimal cost ($2.38 cumulative). It is the best value-for-money option on the market.
- Gemini 3 Flash (gray dot, lower left): excellent accuracy (98.3%), unbeatable cost ($1.02 cumulative). But 10 more errors than ALMA and 6 more than MIRI.
6. Architecture: The Agentic RAG
How is it possible for custom models to consistently outperform the best generalist models in the world? The answer lies in the architecture.
Agentic RAG architecture: an orchestrator (an advanced reasoning LLM) routes each question to ~32 experts grouped into clinical specialties, surgical specialties, basic and diagnostic sciences, and support and context, all backed by a specialized synthetic corpus optimized for consumption by LLMs rather than human reading. The orchestrator analyzes each question, selects the relevant experts, and synthesizes their responses over multiple iterations, reasoning in English.
Agentic RAG (Retrieval-Augmented Generation with agents) represents the most advanced evolution of traditional RAG systems.[5] While a standard RAG retrieves relevant documents and passes them to the model in a single step, Agentic RAG introduces a radically superior level of sophistication.
The Orchestrator
At the center of the architecture sits an advanced reasoning model that acts as a conductor. When it receives a medical question, the orchestrator does not simply search for information: it analyzes the question, identifies which specialties are relevant, and decides which experts to consult.
This process is iterative. If an expert's response is insufficient or contradicts another's, the orchestrator can:
- Reformulate the query and ask again
- Consult additional experts not initially considered
- Request deeper analysis on a specific aspect
- Cross-reference responses between multiple experts
This pattern of iterative, multi-agent consultation has been shown to consistently outperform direct LLM usage in both medicine and other specialized domains.[6]
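A minimal sketch of this iterative consultation loop (names and structure are illustrative; the real orchestrator is a reasoning LLM, not a rule engine, and `consult` stands in for a call to a specialized RAG expert):

```python
from dataclasses import dataclass

@dataclass
class ExpertAnswer:
    expert: str
    answer: str
    confident: bool

def consult(expert: str, query: str) -> ExpertAnswer:
    # Stand-in for a call to a specialized RAG expert (an LLM plus its corpus subset).
    return ExpertAnswer(expert, f"{expert} assessment of: {query}", confident=True)

def orchestrate(question: str, specialties: list[str], max_rounds: int = 3) -> list[ExpertAnswer]:
    """Consult the selected experts; re-query (e.g. reformulated) any non-confident ones."""
    answers: list[ExpertAnswer] = []
    pending = list(specialties)
    for _ in range(max_rounds):
        if not pending:
            break
        round_answers = [consult(s, question) for s in pending]
        answers.extend(round_answers)
        # Only experts whose answers were insufficient are queried again.
        pending = [a.expert for a in round_answers if not a.confident]
    return answers

answers = orchestrate("Patient with ST elevation on ECG...", ["cardiology", "emergency"])
print(len(answers))  # 2
```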
The ~32 Specialized Experts
Each expert is a RAG system specialized in a specific medical discipline (cardiology, pulmonology, pharmacology, etc.). It has access to a subset of the corpus optimized for its specialty and is configured to answer questions within its domain with maximum accuracy.
The key is intelligent subdelegation: the experts are not simply models with a different prompt. Each one has its own knowledge base, its own context, and can in turn delegate sub-queries to other experts when it detects that a question crosses boundaries between specialties.
This design aligns with recent research on multi-agent systems for medical diagnosis,[7] specialized agent orchestration[8] and agent graph optimization.[9]
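The subdelegation idea can be sketched with toy topic scopes (both experts and their scopes here are hypothetical; in the real system each expert is a RAG system over a corpus subset):

```python
# Toy scopes for two hypothetical experts.
EXPERT_SCOPE = {
    "cardiology": {"arrhythmia", "heart failure", "ecg"},
    "pharmacology": {"dosing", "drug interaction", "contraindication"},
}

def answer_with_subdelegation(expert: str, topics: set[str]) -> dict:
    """An expert keeps in-scope topics and delegates the rest to peers that cover them."""
    in_scope = topics & EXPERT_SCOPE[expert]
    remaining = topics - in_scope
    delegated = {
        peer: overlap
        for peer, scope in EXPERT_SCOPE.items()
        if peer != expert and (overlap := remaining & scope)
    }
    return {"expert": expert, "in_scope": in_scope, "delegated": delegated}

result = answer_with_subdelegation("cardiology", {"arrhythmia", "drug interaction"})
print(result["delegated"])  # {'pharmacology': {'drug interaction'}}
```

The cardiology expert handles the arrhythmia aspect itself and hands the drug-interaction aspect to pharmacology, mirroring how a question that crosses specialty boundaries is split.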
Multimodal Support
Both ALMA and MIRI process questions with clinical images (X-rays, electrocardiograms, dermatological photographs, etc.). The multimodal system allows experts to analyze images within their specialized context: a virtual cardiologist analyzes an ECG with the same level of detail it would dedicate to a textual report.
Synthetic Corpus Optimized for LLMs
A crucial innovation is the nature of the corpus. It is not about copying textbooks and passing them to the model. The corpus has been synthesized and reformatted specifically to maximize comprehension by language models.[10]
The original medical documents -- clinical guidelines, protocols, treatises -- are processed through a pipeline that:
- Extracts the clinically relevant information
- Eliminates redundancy and human-oriented formatting
- Restructures the information into formats that LLMs process more efficiently
- Enriches with cross-specialty relationships[11]
The result is a corpus that a human would find difficult to read, but that an LLM processes with maximum efficiency.
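The four pipeline stages can be sketched as follows (every stage here is a trivial stub; in the real pipeline each stage is LLM-driven, and the function names are illustrative):

```python
def extract_clinical_facts(doc: str) -> list[str]:
    # Stub: keep non-empty lines as candidate clinical statements.
    return [line.strip() for line in doc.splitlines() if line.strip()]

def drop_redundancy(facts: list[str]) -> list[str]:
    # Stub: remove exact duplicates while preserving order.
    return list(dict.fromkeys(facts))

def to_llm_format(facts: list[str]) -> list[str]:
    # Stub: normalize into terse machine-oriented records.
    return [f"FACT: {f}" for f in facts]

def enrich_cross_links(facts: list[str]) -> dict:
    # Stub: attach a (here empty) cross-specialty relation map.
    return {"facts": facts, "relations": []}

def process_document(raw_doc: str) -> dict:
    """Extract -> eliminate redundancy -> restructure -> enrich, as described above."""
    return enrich_cross_links(to_llm_format(drop_redundancy(extract_clinical_facts(raw_doc))))

corpus = process_document("Aspirin inhibits COX-1.\n\nAspirin inhibits COX-1.\nUse with caution in asthma.")
print(len(corpus["facts"]))  # 2
```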
Reasoning in English
Although MIR questions are in Spanish and answers are generated in Spanish, all internal reasoning and communication between the orchestrator and experts is conducted in English.[12]
This decision is based on a well-documented empirical reality: current LLMs, regardless of their multilingual support, have a richer and more efficient internal representation in English.[13] English tokens encode more semantic information per token, reasoning is more precise, and chains of thought produce fewer errors.
In practice, this means that ALMA and MIRI:
- Receive the question in Spanish
- Internally translate it to English for reasoning
- The experts reason and communicate in English (providing translation directives for medical terminology that requires it)
- The orchestrator synthesizes the final answer in English
- The answer is translated to Spanish for output
This pipeline adds a layer of complexity, but the benefit in accuracy more than compensates for the additional token cost.
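The five steps above can be sketched as a pipeline (all functions are stubs standing in for LLM calls; the names are illustrative, not the system's actual API):

```python
def translate(text: str, to: str) -> str:
    # Stub for the ES<->EN translation step (with medical-terminology directives).
    return f"[{to}] {text}"

def reason_in_english(expert: str, question_en: str) -> str:
    # Stub: each expert reasons over the English-language question.
    return f"{expert}: findings for {question_en}"

def synthesize(views: list[str]) -> str:
    # Stub: the orchestrator merges expert views into one English answer.
    return " | ".join(views)

def answer_question(question_es: str) -> str:
    """Spanish question -> English reasoning among experts -> English synthesis -> Spanish answer."""
    question_en = translate(question_es, to="en")
    views = [reason_in_english(e, question_en) for e in ("cardiology", "pharmacology")]
    return translate(synthesize(views), to="es")

print(answer_question("¿Cuál es el tratamiento de elección?")[:4])  # [es]
```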
Multilingual processing pipeline: the Spanish question is translated into English, the experts reason and the orchestrator synthesizes in English, and the answer is returned in Spanish.
7. Technical Innovations
Beyond the general architecture, ALMA and MIRI incorporate several technical innovations that contribute to their exceptional performance.
7.1. Synthetic Corpus for LLMs
Synthetic data generation for training and use with LLMs is a rapidly evolving field.[10] In the medical context, frameworks like MedSyn have demonstrated that synthetic data can significantly improve performance on clinical tasks.[11]
The fundamental difference between the ALMA/MIRI corpus and conventional synthetic data is the objective: it is not about generating data to train (fine-tune) a model, but rather creating a corpus optimized for retrieval and consultation (RAG). This allows updating knowledge without modifying the base model's weights.
Corpus processing pipeline (extract → eliminate → restructure → enrich): medical documents such as clinical guidelines and protocols are transformed into a synthetic corpus optimized for consumption by language models.
7.2. Incremental Updates with RLM
One of the critical challenges of any medical AI system is keeping knowledge up to date. Clinical guidelines change, new clinical trials are published, therapeutic protocols are updated.
ALMA and MIRI use an incremental update system based on Recursive Language Models (RLM).[14] Instead of rebuilding the entire corpus when there is an update, the system:
- Detects which corpus fragments have become outdated
- Generates new synthesized versions of the updated information
- Integrates the new fragments while maintaining coherence with the rest of the corpus
- Verifies that the update does not introduce contradictions
This process is monitored in real time and allows the corpus to be kept continuously up to date, without service interruptions.
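The detect-and-integrate steps can be sketched with content fingerprints (a simplified illustration; the actual system uses RLM-based synthesis and a coherence/contradiction check that the stub below omits):

```python
import hashlib

def fingerprint(text: str) -> str:
    # Content hash used to detect whether a fragment actually changed.
    return hashlib.sha256(text.encode()).hexdigest()

def incremental_update(corpus: dict[str, str], updates: dict[str, str]) -> dict[str, str]:
    """Replace only fragments whose content changed; leave the rest of the corpus untouched."""
    merged = dict(corpus)
    for frag_id, new_text in updates.items():
        if frag_id not in corpus or fingerprint(corpus[frag_id]) != fingerprint(new_text):
            merged[frag_id] = new_text  # integrate the re-synthesized fragment
    return merged

corpus = {"htn-01": "First-line: ACE inhibitors.", "dm-07": "Metformin first-line."}
updated = incremental_update(corpus, {"htn-01": "First-line: ACE inhibitors or ARBs."})
print(updated["dm-07"])  # Metformin first-line.
```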
7.3. Token Caching and Infinite Context
With ~32 experts and multiple consultation iterations, the number of tokens processed per question can be enormous. To keep costs under control and speed at acceptable levels, the system implements advanced token caching techniques.
KV-Cache optimization is fundamental to the efficiency of modern LLMs.[15] Techniques like SnapKV allow compressing the attention cache without significant performance loss.[16] Systems like LMCache take this optimization a step further, allowing cache sharing across multiple queries.[17]
ALMA and MIRI implement a technique we call memory tree with subdelegation: the orchestrator maintains a context tree where each branch corresponds to a consulted expert. When an expert needs to consult another, a new branch is created that inherits relevant context from the parent without duplicating tokens. This allows maintaining "conversations" between experts efficiently.
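A minimal sketch of such a context tree, assuming the key property described above: a child branch references its parent's context rather than copying it, so delegation does not duplicate tokens (class and method names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class ContextNode:
    """One branch of the memory tree: the conversation context of a single agent."""
    agent: str
    local_tokens: list[str] = field(default_factory=list)
    parent: "ContextNode | None" = None

    def delegate(self, agent: str) -> "ContextNode":
        # A sub-query inherits the parent chain by reference: no token duplication.
        return ContextNode(agent, parent=self)

    def visible_context(self) -> list[str]:
        # What this agent sees: everything up the chain, then its own tokens.
        inherited = self.parent.visible_context() if self.parent else []
        return inherited + self.local_tokens

root = ContextNode("orchestrator", ["<question>"])
cardio = root.delegate("cardiology")
cardio.local_tokens.append("<ecg findings>")
print(cardio.visible_context())  # ['<question>', '<ecg findings>']
```

The cardiology branch sees the question plus its own findings, while the orchestrator's context stays unchanged, which is what makes expert-to-expert "conversations" cheap.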
7.4. Reasoning in English
As mentioned in the architecture section, all internal reasoning is conducted in English. Recent research confirms that multilingual LLMs tend to "think" in English internally, regardless of the input language.[12] Other studies on multilingual reasoning corroborate that performance on complex reasoning tasks improves significantly when English is enforced as the internal processing language.[13]
From a token efficiency perspective, English offers greater semantic representativeness per token: the same medical idea expressed in English typically requires fewer tokens than in Spanish, which reduces costs and allows processing more context within the model's attention window.
8. Data Sovereignty: Bedrock in Aragon
In the context of an AI model that processes medical information -- potentially including patient clinical data in future deployments -- data sovereignty is not a technical detail: it is a fundamental legal and ethical requirement.
ALMA and Bedrock Aragon
ALMA's orchestrator model runs on Amazon Bedrock, specifically in the Aragon (Spain) datacenter. This configuration guarantees:
- Processing within the EU: all inference data is processed on servers located on Spanish territory, within the jurisdiction of the European Union.
- No Anthropic access to data: by running Claude through Bedrock, Amazon acts as a data processor under contract with the client. Anthropic, the developer of Claude, has no access to the queries, contexts, or generated responses. This is fundamentally different from using Anthropic's direct API.
- GDPR compliance: processing complies with the EU General Data Protection Regulation, including the principles of data minimization, purpose limitation, and processing security.
- AI Act compatibility: the architecture is designed to comply with the requirements of the European AI Regulation, which classifies medical AI systems as "high risk" and imposes specific obligations for transparency, documentation, and human oversight.[18]
The experts: specialized models with guarantees
The expert models -- smaller and more specialized than the orchestrator -- run with the same security guarantees. The separation between the orchestrator (which sees the complete question) and the experts (which receive fragmented and decontextualized queries) provides an additional layer of protection: no individual expert has access to the complete clinical context of a case.
Data sovereignty architecture (EU/Spain, Bedrock Aragon): the medical question, orchestrator, specialized experts, medical corpus, and response all remain within the EU; Anthropic has no access to inference data.
| Parameter | Status | Implication |
|---|---|---|
| Processing location | Spain (EU) | Amazon datacenter in Aragon. All data remains on Spanish territory. |
| Model provider access | No access | Anthropic does not access inference data when used through Bedrock. |
| GDPR compliance | Complete | Amazon as data processor, BinPar as data controller. |
| AI Act (high risk) | Designed | Architecture prepared for AI Act transparency and oversight requirements. |
Sovereignty and data protection guarantees in the ALMA architecture
Implications for the Healthcare Sector
The demonstration that it is possible to achieve perfect performance without sending medical data outside the EU has profound implications for AI adoption in the European healthcare sector. Historically, concerns about data sovereignty have been one of the main barriers to implementing medical AI systems in European hospitals and health centers.[19]
ALMA demonstrates that this dilemma between performance and privacy is a false dilemma: it is possible to have both.
9. Implications for Medical AI
The results of ALMA and MIRI reinforce and extend conclusions we already pointed to in previous articles, but with unprecedented force.
Agentic RAG > Fine-tuning
In our previous analysis on "The Cathedral and the Bazaar", we argued that customization through RAG offers fundamental advantages over fine-tuning for medical applications. ALMA and MIRI are the definitive empirical demonstration of this thesis.
Recent studies on AI agents in clinical medicine confirm that agentic systems consistently outperform base models, even when the latter have been specifically fine-tuned for the medical domain.[20] The reason is simple: a fine-tuned model modifies its weights statically, while an agentic RAG system can query dynamically updated information.
RAG vs. Fine-Tuning in medical tasks. Data from: MDPI Bioengineering 2025 (BLEU), PMC systematic review (hallucinations), medRxiv 2025 (agents).
Customization Without Modifying Weights
ALMA and MIRI use the same base models that are publicly available (Claude for ALMA, confidential model for MIRI). The performance difference does not come from modifications to the models, but from:
- The corpus -- what information is provided to them
- The architecture -- how the query is organized
- The experts -- how knowledge is specialized
- The iteration -- how many times the answer is refined
This means that ALMA/MIRI's advantage is reproducible by any organization that has access to quality medical corpora and the technical capacity to implement an agentic architecture.
The Future: Continuous Corpus Updates
Perhaps the most relevant long-term implication is that ALMA and MIRI can continuously improve without needing to retrain models. When a new clinical guideline is published, a therapeutic protocol is updated, or a new diagnostic association is discovered, it is enough to update the corpus. The system incorporates the new knowledge immediately.
This model of "knowledge as a service" -- where intelligence resides in the corpus and architecture, not in the model's weights -- could redefine how medical AI systems are developed and deployed in the next decade.
10. Conclusions
ALMA Demonstrates That Perfection Is Achievable
600 questions. Three years of exams designed to select the best doctors in Spain. Zero errors. ALMA demonstrates that, with the right architecture, the appropriate corpus, and the necessary investment, it is possible to build a medical AI system that does not fail. Not "almost never." Never.
MIRI Demonstrates That Excellence Is Accessible
596/600 at a cost of $2.38. MIRI demonstrates that near-perfect accuracy does not require astronomical budgets. A medical student can access a system that outperforms any standard model on the market for less than the cost of a coffee.
The Agentic Approach Surpasses Any Generalist Model
No generalist model -- not Gemini, not GPT-5, not Claude, not any of the ~290 evaluated -- has ever achieved 200/200 in a single exam session. ALMA achieves it in all three. MIRI achieves it in the most recent one. Specialization through experts, combined with an advanced reasoning orchestrator, produces results that the "one model for everything" approach cannot match.
Data Sovereignty Is Compatible With Maximum Performance
ALMA processes all its inference in Spain, without sending data outside the EU, without Anthropic accessing the queries. And it still achieves a perfect result. Privacy and performance are not conflicting objectives.
What Comes Next
These results open the door to real clinical deployments of medical AI systems based on Agentic RAG. Not as substitutes for clinical judgment, but as diagnostic support systems with demonstrated and verifiable reliability.
At Medical Benchmark, we will continue evaluating both standard and custom models, documenting the state of the art with the rigor and transparency that characterize our platform. All results are available on our rankings platform.
ALMA and MIRI have been evaluated under the same conditions as all other models in the benchmark: same prompt, same questions, same timing. The results are verifiable and reproducible. Although the evaluations were conducted after each exam took place, the models have no access to the internet or any information about the results or correct answers to the questions, so there is no possibility of data contamination.
Notes and References
- ALMA answers correctly not only the 200 official questions (valid after annulments), but also the 10 reserve questions (201-210) from each exam session. Total: 210/210 x 3 years = 630/630 including reserves, 600/600 considering only valid exam questions.
- Long, Y., et al. "LLMs Meet Synthetic Data Generation: A Survey". ACL 2024. Synthetic data generation for LLMs enables the creation of corpora optimized for retrieval and reasoning. Link
- Amazon Bedrock in the eu-south-2 region (Aragon, Spain). Anthropic does not access inference data in Bedrock deployments. AWS Bedrock data protection documentation
- Calculation: 0.995^600 ~ 0.049, meaning a model with 99.5% accuracy per question has approximately a 4.9% probability of answering 600 consecutive questions correctly. ALMA achieves this with 100% accuracy per question.
- Singh, A., et al. "Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG". arXiv:2501.09136, 2025. Link
- "MA-RAG: Multi-Agent Retrieval-Augmented Generation". arXiv:2505.20096, 2025. Multi-agent RAG systems outperform traditional RAG in accuracy and reasoning capability. Link
- Zuo, Y., et al. "KG4Diagnosis: A Hierarchical Multi-Agent LLM Framework with Knowledge Graph Enhancement for Medical Diagnosis". arXiv:2412.16833, 2024. Link
- Zhang, C., et al. "AgentOrchestra: Orchestrating Specialized Agents for Complex Tasks". arXiv:2506.12508, 2025. Link
- Zhuge, M., et al. "GPTSwarm: Language Agents as Optimizable Graphs". ICML 2024. Link
- Long, Y., et al. "LLMs Meet Synthetic Data Generation: A Survey". ACL 2024. Link
- Kumichev, A., et al. "MedSyn: LLM-based Synthetic Medical Text Generation Framework". arXiv:2408.02056, 2024. Link
- Schut, L., Gal, Y., Farquhar, S. "Do Multilingual LLMs Think In English?". ICML 2025. Multilingual models process internally in English even with inputs in other languages. Link
- "Multilingual Reasoning: A Survey of Challenges and Approaches". 2025. Reasoning in English produces better results than in other languages, even for tasks in those languages. Link
- Zhang, T., Kraska, T., Khattab, O. "Recursive Language Models". arXiv:2512.24601, 2025. Link
- Luohe, S., et al. "A Survey on KV-Cache Optimization for Large Language Models". arXiv:2407.18003, COLM 2024. Link
- Li, Y., et al. "SnapKV: LLM Knows What You are Looking for Before Generation". NeurIPS 2024. Link
- "LMCache: Efficient KV-Cache Management for Large Language Models". arXiv:2510.09665, 2025. Link
- Minssen, T., et al. "The EU AI Act and Its Implications for Medical Products". npj Digital Medicine, 2024. Link
- "The EU AI Act: Implications for Healthcare AI Systems". 2024. Medical AI systems are classified as high risk under the AI Act, requiring conformity assessments and human oversight.
- "AI Agents in Clinical Medicine: Promise and Challenges". PMC, 2025. AI agents outperform base models in clinical tasks by combining reasoning with access to specialized knowledge.