Emerging Topics in Integrated ML Systems

The Agentic Shift:
Clinical Decision Support
in the Age of Autonomous AI

Multi-agent LLM systems are transforming oncology diagnosis. But performance on benchmarks is only half the story.

Tags: Agentic AI · Healthcare · Explainability & Trust · ELEC0139 · UCL EEE · 2026
~90%: clinical accuracy with agentic AI (Ferber et al., Nature Cancer 2025)
30.3%: accuracy of the base LLM alone (same study, GPT-4 without tools)
~30%: clinical scenarios with hallucinations (npj Digital Medicine 2026 benchmark)
The ~90% vs 30.3% contrast comes from Ferber et al.'s evaluation on 20 multimodal oncology cases — a rigorous proof of concept, but a small sample that demonstrates feasibility rather than establishing general clinical safety across diverse populations, hospital settings, or real-world workflows. These figures should be read as evidence of architectural potential, not deployment readiness.
01 The Domain

From Passive Analysis to Active Decision-Making: The Case for Agentic AI in Clinical Care

1.1 The Domain: Medicine's Data Problem

Modern healthcare generates data at a scale that exceeds human capacity to process it. A single oncology patient may generate large volumes of information across their care pathway: genomic sequencing results, histopathology slides, radiological images, electronic health records spanning years of clinical notes, and a continuously expanding body of treatment guidelines. The clinician tasked with integrating this information into a treatment decision does so under time pressure, with incomplete access to the latest evidence, and without computational tools capable of reasoning across all of it simultaneously.

Fig. 1 — The multimodal data challenge in oncology: four independent data silos, one clinician, one decision.
[Figure: four data silos — Genomics (EGFR · KRAS · TP53, whole-genome sequencing), Radiology (CT · PET · MRI, segmentation and staging), Pathology (H&E slides · IHC, histopathology analysis), and EHR (clinical notes, labs, treatment history) — all converging on a single clinical decision made under cognitive overload, with no shared representation and no cross-modal reasoning.]

Medical errors and diagnostic failures are a major source of preventable harm. One influential Johns Hopkins analysis estimated that more than 250,000 deaths per year in the United States may be associated with medical error, while more recent diagnostic-error research suggests that misdiagnosis contributes to hundreds of thousands of serious harms — including death and permanent disability — each year. In oncology specifically, where treatment decisions depend on the intersection of molecular pathology, imaging findings, and rapidly evolving guideline recommendations, the cognitive demands on clinicians have grown faster than any individual's capacity to meet them. The question is not whether artificial intelligence has a role in supporting these decisions. It is which kind of AI is adequate to the task.

1.2 The Challenge: Why Static AI Falls Short

The first generation of clinical AI addressed narrow, well-defined tasks. A convolutional neural network trained to detect diabetic retinopathy from fundus photographs. A gradient-boosted model predicting 30-day readmission risk from structured EHR data. A transformer fine-tuned to extract medication names from clinical notes. Each of these systems demonstrated impressive performance on its specific benchmark. None of them could reason across tasks, integrate heterogeneous data types, or adapt mid-inference when new information emerged.

The Integration Challenge
Real patients present with data distributed across imaging systems, genomic databases, pathology archives, and free-text clinical notes. These modalities do not share a common representation. A system that excels at reading a CT scan typically cannot connect that finding to a mutation report or a clinical history.
The Reliability Challenge
Liu et al. (2026) in npj Digital Medicine found that hallucinations persisted in approximately 30% of clinical scenarios, even after output filtering, few-shot prompting, and prompt engineering.
The Trust Challenge
Even when AI performs accurately, clinicians often decline to act on its recommendations. Abbas et al. (2025), in a meta-analysis of 62 studies, found that lack of transparency and interpretability remains one of the most persistent barriers to clinical adoption.

1.3 The Case for Agentic AI: A New Paradigm

The emergence of agentic AI represents a qualitatively different response to all three challenges. Where static models produce outputs, agentic systems execute plans. An agentic AI does not simply generate a response to a clinical query. It decomposes the query into sub-tasks, selects appropriate tools for each, executes them sequentially or in parallel, evaluates intermediate results, and iterates until it reaches a conclusion it can defend with verifiable citations.

Early empirical evidence suggests that this architecture can substantially improve performance in controlled oncology decision-making tasks. Ferber et al. (2025), published in Nature Cancer, developed and validated an autonomous AI agent for clinical decision-making in oncology at TU Dresden. Evaluated on 20 realistic multimodal patient cases, the agent reached correct clinical conclusions in roughly 90% of cases, compared to just 30.3% for the base language model operating alone. The near-tripling of performance was achieved not by scaling the underlying model, but by giving it the ability to plan, retrieve evidence, execute specialised tools, and revise its reasoning.

Key Finding

This architecture also addresses the reliability problem directly. By anchoring generation in retrieved evidence rather than parametric memory, the system produced citations that were accurately aligned with its assertions in 82.5% of cases — enabling clinicians to verify the basis of each recommendation rather than accepting it on trust.

The picture is not uniformly optimistic. Zhao et al. (2026), reviewing over 500 studies of AI agents in healthcare in npj Artificial Intelligence, concluded that technical innovation is outpacing the ethical governance frameworks needed to deploy these systems responsibly. Clinical validation of agentic systems remained predominantly limited to laboratory and simulation settings rather than prospective clinical trials with real patients. The gap between benchmark accuracy and deployable safety is not a technical detail. It is the central problem this blog examines. Part 2 examines how agentic architectures work in practice. Part 3 confronts the harder question: whether a system that reaches the right answer through a reasoning process that clinicians cannot follow, audit, or challenge is one that medicine should trust.

02 The Technology

Inside the Clinical Agent: How Agentic AI Reasons, Retrieves, and Acts

2.1 The Architecture Shift: From Generation to Agency

To understand what makes agentic AI different from the language models that preceded it, it helps to start with what a standard large language model cannot do. When a clinician queries a model with a complex oncology case, the model generates a response in a single forward pass through its parameters. It draws entirely on knowledge encoded during training, cannot access real-time information, cannot call external tools, and cannot revise its reasoning when a gap appears mid-inference.

Agentic AI breaks this single-pass constraint. The dominant architectural paradigm enabling this is ReAct, introduced by Yao et al. (2023), which interleaves reasoning traces with tool-use actions in a continuous loop: think, act, observe, revise, repeat. The annotated trace below shows what this looks like for a real oncology query.

You do not need to follow every clinical detail. The important point is the loop: the agent asks what is missing, retrieves evidence, checks constraints, and revises before answering.

Annotated ReAct Trace — EGFR+ NSCLC Treatment Query (illustrative example based on Yao et al., 2023, and Ferber et al., 2025, patterns)
Thought 1 The patient is a 58-year-old female with stage IIIB NSCLC, confirmed EGFR exon 19 deletion, no prior systemic therapy. I need to determine first-line treatment eligibility and retrieve current guideline recommendations before reasoning about drug selection.
Action 1 retrieve_guidelines(query="first-line EGFR-mutant NSCLC exon19 deletion treatment", source="ESMO_2024")
Observation 1 Retrieved 3 chunks [cosine sim > 0.87]: ESMO 2024 §4.1 — osimertinib recommended as preferred first-line agent (IA evidence); FLAURA2 data: PFS 25.5 mo vs 16.7 mo. Contraindication flag: QTc > 500 ms.
Thought 2 Guidelines support osimertinib. I must check for cardiac contraindications and CNS metastases before finalising. Calling the imaging analysis tool.
Action 2 imaging_tool(modality="CT_chest_abdomen_pelvis", task="staging_and_CNS_screening", model="MedSAM-v2")
Observation 2 Structured finding: {"stage_CT": "IIIB", "CNS_mets": false, "pleural": false}. Confidence: 0.93.
Thought 3 Imaging confirms stage IIIB, no CNS involvement, no cardiac contraindication. Sufficient evidence to synthesise a cited recommendation.
Final Output Osimertinib 80 mg/day (first-line, ESMO 2024 Grade IA). Consider combination per FLAURA2 data (PFS +8.8 mo); MDT discussion advised. Citations: [ESMO 2024 §4.1], [Wu et al., NEJM 2023].
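The loop in the trace above can be sketched in a few lines of Python. This is a minimal illustration, not the Ferber et al. implementation: the tool stubs, the fixed plan, and the trace format are all invented for clarity, and a real agent would let an LLM choose the next action at each step.

```python
# Minimal ReAct-style loop: the agent alternates Thought -> Action ->
# Observation over a plan of tool calls, accumulating evidence.
# Tool bodies are stubs standing in for real retrieval/vision calls.

def retrieve_guidelines(query: str) -> dict:
    # Stand-in for a RAG retrieval call (illustrative content).
    return {"text": "ESMO 2024: osimertinib preferred first-line (IA)", "sim": 0.91}

def imaging_tool(task: str) -> dict:
    # Stand-in for a vision tool returning a structured finding.
    return {"stage_CT": "IIIB", "CNS_mets": False, "confidence": 0.93}

TOOLS = {"retrieve_guidelines": retrieve_guidelines, "imaging_tool": imaging_tool}

def react_loop(plan, max_steps=5):
    trace, evidence = [], []
    for step, (tool_name, arg) in enumerate(plan[:max_steps], start=1):
        trace.append(f"Thought {step}: need output of {tool_name}")  # Thought
        observation = TOOLS[tool_name](arg)                          # Action
        trace.append(f"Observation {step}: {observation}")           # Observe
        evidence.append(observation)
    # In a real agent, an LLM would now synthesise a cited answer
    # from the evidence, or append further steps to the plan.
    return trace, evidence

plan = [("retrieve_guidelines", "first-line EGFR exon19 NSCLC"),
        ("imaging_tool", "staging_and_CNS_screening")]
trace, evidence = react_loop(plan)
```

The logged `trace` list is the audit trail discussed throughout this post: every Thought and Observation is recorded in order, which is what makes the reasoning inspectable after the fact.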

The trace makes visible what a static model elides entirely: the reasoning is structured, iterative, and auditable step by step. The performance gain is architectural rather than parametric — the same base model, when embedded in an agentic system, moves from 30.3% to roughly 90% accuracy not by scaling, but by placing reasoning inside a closed process of decomposition, retrieval, and self-revision. Each subproblem is narrower and easier to validate than the global clinical question, and RAG grounds the agent in external evidence rather than parametric memory, shifting from recall-based plausibility to source-based justification.

Fig. 2 — Clinical decision accuracy: agentic AI vs. base LLM
Ferber et al. (2025), Nature Cancer · same model, same 20-case oncology evaluation · agentic vs. no-tools baseline

2.2 The Pipeline in Practice: Ferber et al.'s Oncology Agent

Before examining the Ferber et al. system in detail, it is worth making explicit a distinction the benchmarking literature often blurs. Not all AI systems that involve language models, retrieval, or tool use are agentic. The term covers a wide spectrum, and the governance implications differ substantially across it.

Clinical AI System Types — A Taxonomy

Static clinical AI: produces a single prediction from a fixed input (e.g., a sepsis risk score from EHR fields). Key limitation: cannot adapt, retrieve, or reason across tasks; interpretable but inflexible.

RAG chatbot: retrieves documents before generating a response, which reduces some hallucination. Key limitation: still a single generation pass; no planning, no tool use, no self-correction.

Tool-using LLM: calls external tools (search, calculators, APIs) to augment generation. Key limitation: may lack iterative planning; tool outputs are not systematically verified.

Agentic clinical AI: plans, retrieves, uses specialist tools, checks intermediate results, and iterates until a verifiable conclusion is reached. Key limitation: more capable, but harder to explain, audit, and govern than any system above it.

The most rigorously validated clinical agentic system published to date is the autonomous oncology agent developed by Ferber and colleagues, reported in Nature Cancer in August 2025. Its architecture instantiates agentic design principles in a medically demanding context: personalised treatment decision-making where inputs span histopathology, radiology, genomics, and clinical guidelines simultaneously.

Fig. 3 — Ferber et al. Agentic Oncology Pipeline
Step 1 · Patient Data Input (multimodal): histopathology slides, radiology images, genomic reports, clinical notes and EHR history.
The system accepts heterogeneous inputs simultaneously: WSI pathology slides for biomarker detection, CT/MRI images for staging, mutation reports (KRAS, BRAF, MSI status), and longitudinal clinical documentation. This multimodal intake is the first departure from single-modality AI tools.

Step 2 · RAG Knowledge Layer (retrieval): dense vector embeddings retrieve relevant passages from OncoKB, PubMed, and clinical guideline corpora at inference time.
Rather than relying on parametric memory, the agent retrieves the exact source text of guidelines and returns it to the clinician alongside its recommendation. This citation mechanism is how the system addresses hallucination: 82.5% of 171 citations produced were accurately aligned with the agent's assertions. The retrieved text is also the basis for clinician verification.

Step 3 · Specialised Tool Suite (tool use): Vision Transformer (MSI/KRAS/BRAF detection), MedSAM (radiological segmentation), OncoKB API, PubMed search.
The tool layer is what separates this architecture from a RAG chatbot. A vision transformer trained on histopathology detects molecular biomarkers directly from slide images. MedSAM performs radiological segmentation. Real-time web tools ensure recommendations reflect current evidence, not training data months out of date. The agent selects tools dynamically based on each case's requirements.

Step 4 · ReAct Reasoning Loop (planning): the Think → Act → Observe cycle iterates until a defensible conclusion is reached, self-correcting on tool failure.
The ReAct loop is the cognitive engine. At each step the agent articulates what it is trying to determine (Thought), invokes a tool or retrieval operation (Action), and processes the result (Observation). When a tool call fails or returns ambiguous results, the agent revises its approach. This capacity for mid-inference correction is categorically absent from single-pass models. The full trajectory is logged, providing an audit trail.

Step 5 · Clinical Decision Output: a treatment recommendation with cited sources — ~90% accuracy on 20 multimodal oncology cases vs. 30.3% for GPT-4 alone.
The final output includes the recommendation, the reasoning trajectory, and the original source texts supporting each claim. This transparency mechanism is imperfect: the reasoning chain is lengthy and technically demanding. But it represents a structural improvement over systems that produce recommendations with no provenance at all. The verifiability of citations is the system's primary explainability feature.

What matters in this pipeline is not any single module in isolation, but the closed-loop interaction between decomposition, evidence retrieval, specialised tool invocation, and iterative verification. Patient data are first rendered into a sequence of tractable questions; those questions trigger targeted retrieval and expert tools; the outputs of those tools are then reintroduced into the reasoning loop, where they can confirm, refine, or overturn the agent's prior hypothesis. The system therefore behaves less like a chatbot with attachments than like a closed-loop decision system, in which each stage conditions the next and each intermediate result remains available for correction. Its clinical strength lies precisely in this feedback structure: evidence does not merely decorate the answer after the fact, but actively reshapes the decision while it is being formed.

Fig. 4 — ReAct agent loop: Thought-Action-Observation cycle in the Ferber et al. pipeline
[Flowchart: a patient case with multimodal inputs enters the ReAct loop — Thought (plan next step) → Action (call tool or retrieve) → Observation (process tool result), iterating. The loop draws on a tool suite (Vision Transformer for biomarkers, MedSAM for radiology, OncoKB/PubMed search, RAG knowledge base). When sufficient evidence is reached, the loop exits and emits a clinical recommendation with cited guideline sources.]

Under the Hood: RAG and Tool Integration

The architecture diagram above describes what the agent does at each step. It is equally important to understand how each step is implemented: the specific technical choices that make retrieval and tool use work reliably in a high-stakes clinical context.

Retrieval: Hybrid Dense + Sparse Search

A practical implementation of this kind typically chunks guidelines (such as ESMO, NCCN, and NICE documents) into overlapping passages of around 512 tokens to avoid boundary artefacts. Each chunk is embedded using a biomedical-domain encoder such as PubMedBERT or BioLORD, and stored in a vector index such as FAISS. At query time, a dense path based on embedding similarity and a sparse path (BM25) based on keyword overlap are run in parallel, with results merged and reranked. Only chunks exceeding a confidence threshold are passed into the context window. This hybrid approach addresses the known failure mode of dense-only retrieval, where semantically similar but factually divergent passages can surface without sufficient discriminative filtering.
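A toy version of the hybrid merge can be written in pure Python. Everything here is an illustrative stand-in: the bag-of-words "embedding" substitutes for PubMedBERT vectors, the keyword-overlap score substitutes for BM25, and the fusion constant is the conventional default for reciprocal rank fusion rather than a tuned value.

```python
import math
from collections import Counter

# Toy guideline chunks (invented text, not real clinical guidance).
CHUNKS = [
    "osimertinib is the preferred first-line agent for EGFR exon 19 deletion NSCLC",
    "adjuvant chemotherapy recommendations for resected stage II colon cancer",
    "QTc prolongation above 500 ms is a contraindication for osimertinib",
]

def embed(text: str) -> Counter:
    # Bag-of-words stand-in for a biomedical dense encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def sparse_score(query: str, chunk: str) -> int:
    # Keyword overlap as a stand-in for BM25.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def hybrid_retrieve(query: str, k: int = 2, rrf_k: int = 60) -> list:
    qv = embed(query)
    dense = sorted(range(len(CHUNKS)), key=lambda i: -cosine(qv, embed(CHUNKS[i])))
    sparse = sorted(range(len(CHUNKS)), key=lambda i: -sparse_score(query, CHUNKS[i]))
    # Reciprocal rank fusion: merge the two rankings by summed inverse rank.
    fused = {i: 0.0 for i in range(len(CHUNKS))}
    for ranking in (dense, sparse):
        for rank, i in enumerate(ranking):
            fused[i] += 1.0 / (rrf_k + rank + 1)
    top = sorted(fused, key=fused.get, reverse=True)[:k]
    return [CHUNKS[i] for i in top]
```

The structural point survives the simplification: two independent rankings are produced in parallel and merged, so a chunk must do reasonably well on both semantic similarity and keyword evidence to surface.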

Vision Tool Output: Structured Handoff

MedSAM and vision transformers do not return raw segmentation masks to the reasoning agent, since masks are not language-model-readable. Instead, a post-processing layer converts spatial outputs into a structured JSON finding object before it enters the ReAct context. For a CT scan, this might look like: {"lesion_site": "RUL", "diameter_mm": 32, "lymph_nodes": "N2", "pleural_effusion": false, "CNS_mets": false, "confidence": 0.93}. The agent treats this structured observation identically to a retrieved text chunk: it can cite it, query it further, or flag it for human review if the confidence score falls below a safety threshold. This serialisation step is architecturally significant. It is what allows heterogeneous modalities to be unified under a single reasoning loop without requiring the language model itself to process images.
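In code, the handoff gate between a vision tool and the reasoning loop might look like the following sketch. The required schema fields and the 0.85 confidence floor are assumptions made for illustration, not values reported by Ferber et al.

```python
# Gate between vision-tool output and the ReAct context: structurally
# invalid or low-confidence findings never enter the reasoning loop.
# REQUIRED_KEYS and CONFIDENCE_FLOOR are illustrative assumptions.

REQUIRED_KEYS = {"lesion_site", "diameter_mm", "lymph_nodes", "CNS_mets", "confidence"}
CONFIDENCE_FLOOR = 0.85  # below this, escalate to a human reader

def triage_finding(finding: dict) -> str:
    missing = REQUIRED_KEYS - finding.keys()
    if missing:
        return "reject: missing fields " + ", ".join(sorted(missing))
    if finding["confidence"] < CONFIDENCE_FLOOR:
        return "escalate: human review required"
    return "accept: pass to ReAct context"
```

A finding that passes this gate is treated exactly like a retrieved text chunk; one that fails never becomes an Observation, which is the safety property the serialisation step is meant to buy.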

Verifier: Grounded Claim Checking

Before the final recommendation is released, a dedicated verifier component, implemented as a separate LLM call with a constrained checking prompt, re-examines each factual claim in the draft output against the retrieved source chunks. It performs three checks: (1) attribution, confirming that each cited claim is actually present in the cited chunk; (2) contradiction, checking whether any retrieved evidence conflicts with the proposed recommendation; and (3) completeness, verifying that flagged contraindications in the retrieved guidelines have been addressed. Claims that fail any check are returned to the ReAct loop as a new observation, triggering an additional reasoning step rather than allowing an ungrounded recommendation to reach the clinician. In the Ferber et al. system, this verification step was associated with 82.5% of citations being accurately aligned with the agent's assertions, a substantially higher rate than retrieval-augmented generation without explicit grounding verification.
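The attribution check, the first of the three, can be approximated with a simple lexical-overlap heuristic. This is a deliberately crude stand-in: a production verifier would use an entailment model or a constrained LLM call, and the 0.5 overlap threshold is an invented parameter.

```python
# Attribution check sketch: is each cited claim lexically supported by
# the chunk it cites? Overlap ratio is an illustrative proxy for a
# proper entailment-style check.

def attribution_ok(claim: str, cited_chunk: str, min_overlap: float = 0.5) -> bool:
    claim_terms = set(claim.lower().split())
    chunk_terms = set(cited_chunk.lower().split())
    if not claim_terms:
        return False
    return len(claim_terms & chunk_terms) / len(claim_terms) >= min_overlap

def verify_draft(claims, sources):
    # Failing claims are returned to the ReAct loop as new observations
    # instead of reaching the clinician.
    return [c for c, s in zip(claims, sources) if not attribution_ok(c, s)]
```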

Critical Limitation — Quis custodiet ipsos custodes?

The verifier architecture raises an uncomfortable question: who monitors the monitor? If the verifier is itself a large language model, it is subject to the same failure modes it is meant to prevent. Chief among these is sycophancy — the documented tendency of LLMs to agree with or validate the outputs of other LLMs in a shared context, particularly when those outputs are expressed with high confidence. A verifier operating purely through generative inference may ratify a plausible-sounding but incorrect recommendation because its parametric priors align with the orchestrator's conclusion, rather than because the citation evidence genuinely supports it. This is not a hypothetical concern: sycophancy in LLM-to-LLM evaluation has been documented in the alignment literature (Sharma et al., 2023) and represents a structural risk in any multi-agent pipeline where one generative model is asked to grade another's output.

The practical implication for clinical deployment is that robust verification cannot rely on a generative LLM alone. Production-grade systems increasingly integrate rule-based checks and symbolic logic constraints — for example, a formal drug-interaction database query that returns a deterministic contraindication flag, or a structured eligibility checker against trial criteria expressed in machine-readable logic. These non-generative components are not susceptible to sycophancy because they do not produce outputs by predicting plausible text. They break the LLM self-reinforcement loop that would otherwise allow a confident hallucination to pass through verification unchallenged.
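A deterministic check of this kind is trivially small to express. The drug names and thresholds below are invented for illustration and are not clinical guidance; the point is structural — the flag comes from a lookup, not from predicted text.

```python
# Non-generative contraindication rules: deterministic lookups that a
# confident hallucination cannot talk its way past. Drug names and
# thresholds are illustrative only, not clinical guidance.

CONTRAINDICATION_RULES = {
    "osimertinib": lambda pt: pt.get("qtc_ms", 0) > 500,               # QTc prolongation
    "cisplatin": lambda pt: pt.get("creatinine_clearance", 999) < 60,  # renal function
}

def contraindicated(drug: str, patient: dict) -> bool:
    rule = CONTRAINDICATION_RULES.get(drug.lower())
    return bool(rule and rule(patient))
```

Because the rule table is symbolic, its output is identical no matter how persuasively an upstream agent has argued for the drug — which is exactly the property that breaks the self-reinforcement loop.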

As multimodal evidence accumulates, context windows become crowded, coordination burdens increase, and the management of parallel clinical subtasks becomes harder to sustain within one reasoning thread. The natural next step is therefore not simply a more capable solitary agent, but a system in which coordination itself becomes an architectural primitive.

2.3 Multi-Agent Collaboration: The Next Design Horizon

The single-agent system described by Ferber and colleagues represents a significant proof of concept but also a design ceiling. Complex clinical scenarios often exceed what any single agent can reason about within a single context window. The emerging direction in 2025 and 2026 is multi-agent collaboration, where distinct agents specialise in different components of a clinical reasoning task and coordinate through structured communication protocols.

The motivating analogy is the structure of clinical teams. A cancer patient's care involves a radiologist, a pathologist, a molecular oncologist, a pharmacist, and a multidisciplinary tumour board, each contributing specialised expertise. Multi-agent AI systems attempt to replicate this division of labour computationally. The most technically interesting design challenge, however, is not how to build specialised agents but how to decide which agents to invoke and when.

MDAgents — Adaptive Complexity Routing (NeurIPS 2024)

On receiving a clinical query, a complexity classifier, itself a prompted LLM call, evaluates three dimensions: (1) number of data modalities involved, (2) presence of conflicting evidence or contraindications, and (3) whether the query involves drug dosing, trial eligibility, or rare-disease reasoning. Based on this classification, the system routes to one of three collaboration regimes:

Tier 1 — Single Agent
Trigger: unambiguous factual query, single modality, established guideline exists
e.g. "What is the standard dose of pembrolizumab for PD-L1 ≥ 50% NSCLC?"
One LLM call with RAG retrieval. No inter-agent communication overhead. Response latency: < 4 s.
Tier 2 — Specialist Panel
Trigger: multimodal inputs, potential drug interactions, or two or more plausible treatment paths
e.g. "Stage IIB NSCLC, EGFR exon20 insertion, prior platinum therapy, new hepatic lesion on CT"
Two to four specialist agents (e.g., genomics, imaging, pharmacology) each produce a structured report; a synthesis agent integrates outputs under a shared context. Asynchronous execution where modalities are independent.
Tier 3 — Full MDT Consultation
Trigger: rare disease, conflicting guideline recommendations, trial eligibility assessment, or multi-comorbidity risk
e.g. "ALK+ NSCLC with concurrent RET fusion, prior lorlatinib progression, cardiac LVEF 45%, clinical trial query"
Full specialist ensemble with an explicit debate round: agents exchange structured disagreements before a moderator agent produces a reconciled recommendation with explicit uncertainty quantification. Highest latency; reserved for cases where a fast but wrong answer is worse than a slow but defensible one.
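The routing logic of the three tiers can be caricatured as a small decision function. MDAgents classifies with a prompted LLM; the hand-written boolean triggers here are an assumption made purely to show the control flow, with field names invented for the sketch.

```python
# Rule-based sketch of MDAgents-style complexity routing. The real system
# uses a prompted-LLM classifier; these boolean triggers mirror the tier
# descriptions in the text and are illustrative only.

def route(case: dict) -> int:
    # Tier 3 triggers: rare disease, trial eligibility, conflicting
    # guidelines, or multi-comorbidity risk.
    if (case.get("rare_disease") or case.get("trial_eligibility")
            or case.get("conflicting_guidelines")
            or case.get("comorbidities", 0) >= 2):
        return 3  # full MDT consultation with debate round
    # Tier 2 triggers: multimodal inputs, drug interactions, or
    # two or more plausible treatment paths.
    if (case.get("modalities", 1) >= 2 or case.get("drug_interaction_risk")
            or case.get("plausible_treatments", 1) >= 2):
        return 2  # specialist panel plus synthesis agent
    return 1      # single agent with RAG retrieval
```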

The Cost of Deliberation: Latency and Token Economics

Routing logic determines not only which agents to invoke but how long a clinical team will wait for a recommendation. This is among the most underappreciated engineering constraints in agentic clinical AI. A Tier 3 full MDT consultation — involving five specialist agents, a debate round, and multiple RAG retrieval cycles — may consume tens of thousands of tokens and take several minutes to complete. In an elective oncology MDT meeting, this latency is acceptable. In an emergency department managing a patient in septic shock, it is not.

Tier 1: < 4 s · ~2–4k tokens · routine factual query, single RAG call, one LLM pass
Tier 2: 15–45 s · ~10–25k tokens · specialist panel, parallel retrieval, synthesis pass
Tier 3: 2–8 min · ~50–150k tokens · full MDT debate, multi-round reasoning, uncertainty quantification

These figures are not fixed — they depend on model size, retrieval index scale, and parallelisation strategy — but they illustrate a structural tension that the benchmarking literature largely ignores. Liu et al. (2026) in npj Digital Medicine found that agentic systems required substantially greater computational resources than baseline LLMs, yet their benchmarks assessed accuracy in isolation from latency. A system that reaches ~90% accuracy in eight minutes may be transformative for treatment planning and counterproductive for triage.

This tension has a direct parallel in the course framing of integrated machine learning systems. An agent that is accurate in isolation but incompatible with the time constraints of the clinical environment in which it is embedded is not a well-integrated system. The engineering challenge is not simply to build a capable agent but to build one whose latency profile matches the decision latency of the clinical setting it serves. Tier 1 queries must remain under the threshold where a clinician would simply look up the answer independently. Tier 3 consultations must be reserved for decisions where a wrong fast answer is genuinely worse than a correct slow one — a determination that requires the system to know not just what it thinks, but how long it is allowed to think.
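One way to make that determination explicit is a latency-budget gate: the caller states how long the clinical setting can wait, and the system degrades to the most capable tier that fits. The worst-case latencies follow the illustrative ranges quoted earlier in this section; the gating mechanism itself is an assumption for illustration, not a published component.

```python
# Latency-budget gate: choose the most capable tier whose worst-case
# latency fits the decision window of the clinical setting. Figures
# follow the illustrative ranges in the text; the gate is a sketch.

TIER_WORST_CASE_S = {1: 4, 2: 45, 3: 480}  # seconds (Tier 3 worst case = 8 min)

def affordable_tier(preferred: int, budget_s: float) -> int:
    for tier in range(preferred, 0, -1):  # degrade gracefully
        if TIER_WORST_CASE_S[tier] <= budget_s:
            return tier
    return 0  # 0 = no AI consultation; defer directly to the clinician
```

An elective MDT meeting with a ten-minute window gets the full Tier 3 consultation; an emergency triage decision with a five-second window is routed to Tier 1 or to no consultation at all.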

The routing diagram shows the logic; the research literature traces how the field arrived there. The timeline below marks the key systems and benchmarks that established the current state of multi-agent clinical AI.

MDAgents (NeurIPS 2024)
Introduced adaptive complexity-routing across three collaboration tiers: single agent, specialist panel, and full MDT. Achieved best performance in 7 of 10 medical benchmarks. The key insight is that applying maximum coordination uniformly degrades performance through overhead and context dilution. Matching coordination intensity to case complexity is the decisive design choice.
Ferber et al. (Nature Cancer, Aug 2025)
First rigorous proof-of-concept for agentic clinical AI. Approximately 90% accuracy on multimodal cases. RAG + vision transformers + ReAct reasoning loop. Demonstrated that the accuracy gap between static and agentic models is architectural, not parametric.
Methods and Protocols (MDPI), Feb 2026
Systematic review of LangGraph, CrewAI, and MCP frameworks for clinical AI. The key finding: cross-verification agents, whose sole function is to check reasoning outputs before they reach the clinician, were the single architectural feature most strongly associated with reduced hallucination rates.
Liu et al. (npj Digital Medicine, 2026)
First systematic benchmarking study across multiple agentic architectures. Open-source systems (Llama-4-based) achieved only 39.1% accuracy; proprietary GPT-4-based systems substantially outperformed. Hallucinations persisted in ~30% of scenarios despite mitigation strategies.

The architecture underlying these systems has a consistent topology across all the frameworks the literature evaluates, whether LangGraph, CrewAI, or bespoke implementations: orchestrator, parallel specialists, verifier. The diagram below shows how information flows through a Tier 3 (full MDT) deployment. The case enters the orchestrator, fans out to five specialist agents executing in parallel, and is then consolidated by a verifier before a recommendation reaches the clinician.

Fig. 5 — Multi-agent collaborative architecture: orchestrator, specialist agents, and verifier (Tier 3 deployment)
[Diagram: a complex patient case enters the Orchestrator, which decomposes the task and assigns sub-tasks to five specialist agents executing in parallel — Diagnostic (differential dx), Literature (RAG / PubMed), Imaging (MedSAM / ViT), Risk (prognosis scores), and Drug interaction (safety check). A Verifier cross-checks their outputs and flags conflicts before the clinician makes the final decision.]
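The fan-out/consolidate pattern in the figure is straightforward to express with a thread pool. The five agent stubs, their findings, and the verifier's conflict rule below are placeholders for real LLM, vision, and retrieval calls — a sketch of the topology, not an implementation of any published system.

```python
# Tier 3 topology sketch: an orchestrator fans a case out to specialist
# agents in parallel; a verifier consolidates before anything reaches
# the clinician. Agent bodies are stub placeholders for real model calls.
from concurrent.futures import ThreadPoolExecutor

def diagnostic_agent(case):  return {"agent": "diagnostic", "conflict": False, "finding": "stage IIIB"}
def literature_agent(case):  return {"agent": "literature", "conflict": False, "finding": "ESMO IA: osimertinib"}
def imaging_agent(case):     return {"agent": "imaging", "conflict": False, "finding": "no CNS mets"}
def risk_agent(case):        return {"agent": "risk", "conflict": False, "finding": "QTc within range"}
def interaction_agent(case): return {"agent": "drug_interaction", "conflict": False, "finding": "none found"}

SPECIALISTS = [diagnostic_agent, literature_agent, imaging_agent,
               risk_agent, interaction_agent]

def verifier(reports):
    # Consolidation placeholder: any specialist-flagged conflict is
    # surfaced so it can block release of the recommendation.
    return {"reports": reports,
            "conflicts": [r for r in reports if r["conflict"]]}

def orchestrate(case):
    with ThreadPoolExecutor(max_workers=len(SPECIALISTS)) as pool:
        reports = list(pool.map(lambda agent: agent(case), SPECIALISTS))
    return verifier(reports)
```

The parallel `pool.map` call is what keeps Tier 3 latency bounded by the slowest specialist rather than the sum of all five.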

What these architectures collectively represent is a shift in how AI systems engage with clinical complexity. Rather than compressing all relevant knowledge into a single large model, they distribute reasoning across specialised components connected by explicit communication and verification protocols. The result is a system more capable than any of its individual parts, and whose reasoning process can, at least in principle, be traced through the interaction log between agents.

Yet the same features that make this architecture powerful are precisely what make it difficult to govern. A static model can be tested against a fixed input-output benchmark: given this patient record, does the model produce the correct risk score? An agentic system, by contrast, may retrieve different source documents, call different tools, and follow a different reasoning path each time it encounters a similar case, depending on what is in its retrieval index, what its tools return, and how its planner interprets intermediate results. This context-dependence makes evaluation, accountability, and clinical responsibility structurally harder than for any previous generation of clinical AI. Part 3 confronts these consequences directly.

What Could Go Wrong — A Compound Failure Scenario

Imagine the retrieval module selects an outdated guideline fragment — one published before a recent trial changed first-line recommendations for a specific mutation subtype. The imaging tool returns a high-confidence segmentation that misclassifies a borderline lymph node as negative. The language model, reasoning over these inputs, constructs a coherent treatment recommendation supported by citations and a plausible reasoning trace. The final output looks more trustworthy than a static model's prediction precisely because it contains steps, tools, and references. And yet every visible feature of trustworthiness — the trace, the citations, the tool outputs — has been built on compounding errors that are individually undetectable from the outside.

This is the central paradox of agentic clinical AI: the structural features that enable better performance also enable more convincing failure. A wrong answer from a static model looks like a wrong answer. A wrong answer from a well-designed agentic system can look like a well-reasoned clinical decision. Recognising and addressing this asymmetry is the work of Part 3.

03 The Ethics

The Trust Problem: Why Explainability Is Not Enough

The ethical challenges of agentic clinical AI are not three separate issues — trust, accountability, and fairness — sitting neatly alongside one another. They share a single structural root: the decision is produced through a multi-step process that is difficult to inspect from outside. Explainability fails because the reasoning unfolds across dozens of intermediate steps. Accountability fails because errors cannot be traced to a specific component without trace-level logging. Bias auditing fails because there is no single output to disaggregate — the bias is distributed across the pipeline. Part 3 develops each of these consequences in turn, but their common origin is worth naming first.

3.1 The Technical Root: Why Agentic AI Is Harder to Explain Than Any Model Before It

The explainability problem in clinical AI is not new. Physicians have been wary of black-box models for over a decade, and techniques such as SHAP and LIME were developed largely in response to that wariness. For a single-model classifier predicting sepsis risk from structured EHR data, these tools are genuinely useful: they can tell a clinician that elevated lactate, falling blood pressure, and patient age were the three features most responsible for a high-risk prediction.
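The idea behind these local attributions can be shown with a deliberately tiny perturbation sketch: replace one feature at a time with a baseline value and record how the risk score moves. The linear scorer, its weights, and the baseline values are all invented for illustration, and real SHAP values are computed very differently — this only conveys the intuition that attribution assumes a fixed, perturbable input.

```python
# One-feature-at-a-time perturbation on a toy linear "sepsis risk" scorer.
# Weights and baselines are invented; this is the intuition behind
# SHAP/LIME-style local attribution, not the actual algorithms.

WEIGHTS = {"lactate": 0.5, "systolic_bp": -0.02, "age": 0.01}
BASELINE = {"lactate": 1.0, "systolic_bp": 120, "age": 50}

def risk(features: dict) -> float:
    return sum(WEIGHTS[k] * features[k] for k in WEIGHTS)

def local_attribution(patient: dict) -> dict:
    # Contribution of each feature = score change when that feature is
    # reset to its baseline value, holding the others fixed.
    full = risk(patient)
    return {k: full - risk(dict(patient, **{k: BASELINE[k]})) for k in WEIGHTS}
```

Note what the method requires: a fixed input vector whose entries can be independently perturbed. That requirement is precisely what an agentic system's dynamically constructed context violates.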

Agentic systems make this problem structurally harder. A clinical agent does not produce a single prediction from a fixed input. It produces a conclusion at the end of a reasoning trajectory that may span dozens of steps, multiple tool calls, and hundreds of retrieved text passages. A post-hoc explanation tool applied to the final output cannot reconstruct this trajectory. It can only characterise the relationship between the final answer and the inputs visible to the explanation model, which is a fraction of what actually shaped the agent's conclusion.

"Future AI systems may need to provide medical professionals with explanations of AI predictions and decisions. While current XAI methods match these requirements in principle, they are too inflexible and not sufficiently geared toward clinicians' needs to fulfill this role."

Räz, Pahud De Mortanges & Reyes — Frontiers in Radiology, 2025
Where XAI Methods Fail in Agentic Systems — Four Structural Gaps
01
Post-hoc attribution breaks down
SHAP and LIME perturb a fixed input to map local output behaviour. In an agentic system the effective input is dynamically constructed at inference time from retrieved documents, prior tool outputs, and intermediate reasoning — making the perturbation space undefined.
02
Explanations can decrease trust
Rosenbacke et al. (2024) found that adding explanations increased trust in only 5 of 10 empirical studies. A confident-sounding trace attached to an erroneous recommendation may be more dangerous than a system that flags uncertainty without explanation.
03
No context-dependence
An EGFR mutation finding is decisive in one case and irrelevant in another. Current XAI outputs do not adapt to the clinical situation — they produce the same feature-importance framing regardless of what the clinician actually needs to verify.
04
No dialogic verification
A clinician reviewing a borderline segmentation result needs to ask follow-up questions. Static saliency maps cannot respond. The result is a one-way output where the most consequential step — verifying a specific intermediate finding — is structurally inaccessible.

The technical difficulty is worth making precise. Methods such as SHAP and LIME work by perturbing a model's inputs and observing how its outputs change — effectively mapping a local region of the model's input-output function. This approach is coherent when a model's decision is produced in a single deterministic pass from a fixed input. It breaks down when the input itself is not fixed but is dynamically constructed at inference time from retrieved documents, tool outputs, and previous reasoning steps. In an agentic system, the effective input to any given reasoning step — and thus the space that a post-hoc explanation must characterise — is not the original patient query. It is the accumulation of everything the agent has seen, retrieved, and concluded up to that point. This accumulated context is different for every patient and every run, making population-level feature attribution methods structurally inapplicable. Even if a perfect explanation of the final output were produced, it could not tell a clinician which of the thirty-seven intermediate reasoning steps contained the error that led to a harmful recommendation. What clinical AI explainability requires is not better post-hoc attribution of outputs. It is in-process transparency — visibility into reasoning as it unfolds, at a granularity that allows intervention before a conclusion hardens.
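The contrast can be made concrete in a few lines. The sketch below implements the core move shared by LIME-style methods: perturb a fixed input vector, fit a local linear surrogate, and read attributions off its slopes. Everything here is a toy of my own construction (the `local_attribution` helper and the three-feature "risk" model are illustrative, not from any cited system). The point is that the procedure is well-defined only because `x` is a fixed point in a known feature space; an agentic pipeline offers no equivalent `x` to perturb.

```python
import numpy as np

def local_attribution(model, x, n_samples=500, scale=0.1, seed=0):
    """LIME-style local attribution for a FIXED input vector.

    Perturbs x with Gaussian noise, queries the model, and fits a
    local linear surrogate whose slopes serve as feature attributions.
    Coherent only because the input space is fixed and well-defined,
    which is exactly the property agentic pipelines lack.
    """
    rng = np.random.default_rng(seed)
    X = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    y = np.array([model(row) for row in X])
    # Least-squares fit of the local surrogate: y ~ X @ w + b
    A = np.hstack([X, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1]  # per-feature local slopes = attributions

# Toy "risk" model over (lactate, blood pressure, age) -- illustrative only.
risk = lambda v: 0.6 * v[0] - 0.3 * v[1] + 0.1 * v[2]
attributions = local_attribution(risk, np.array([2.0, 0.9, 0.7]))
# Slopes recover the model's local weights: ~[0.6, -0.3, 0.1]
```

For a fixed-input classifier the surrogate recovers the local weights cleanly; for an agent whose effective input is rebuilt at every step from retrievals and prior conclusions, there is no stable neighbourhood to sample, so the method's central assumption fails before any fitting begins.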

Räz et al. (2025), writing in Frontiers in Radiology, identified three escalating demands that clinical AI explainability must meet. The first is context-dependence: explanations must be tailored to the specific clinical situation and user. The second is genuine dialogue: the system must respond to clinicians' follow-up questions, not just provide a static summary. The third, and currently unachieved, is social capability: understanding when an explanation is insufficient and adjusting accordingly. Current agentic systems reliably meet none of these three requirements.

3.2 The Trust Paradox: Explanations Can Make Things Worse

The intuitive response to the explainability problem is to build better explanation tools. The empirical evidence suggests this is not straightforwardly correct. Rosenbacke et al. (2024), screening 778 articles in JMIR AI and analysing 10 that met rigorous empirical criteria, found that explainable AI increased clinicians' trust compared to unexplained AI in only five of the ten studies. Three studies found no significant effect. Two found that XAI could either increase or decrease trust depending on explanation quality.

Interactive figure — The Trust Calibration Problem: explanation clarity and AI accuracy interact to produce different trust outcomes.

There is a second and more troubling finding embedded in this literature. A 2025 paper in Diagnostics developing a dynamic trust calibration framework for AI-assisted diagnosis identified excessive trust as a clinical risk comparable to insufficient trust. When clinicians defer to an AI recommendation without critical evaluation because its explanation seems authoritative, they become vulnerable to the system's errors in exactly those cases where the system is most confident but wrong. The goal of explainability in high-stakes clinical settings is not to maximise trust. It is to produce calibrated trust: a disposition in which clinicians rely on the AI when it is reliable and override it when it is not.
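One way to make "calibrated trust" operational is to score reliance decisions against AI correctness rather than measuring trust in isolation. The sketch below is a simplified operationalisation of my own, not a metric taken from the cited studies; the `reliance_metrics` helper and the toy audit data are illustrative assumptions.

```python
def reliance_metrics(records):
    """records: iterable of (ai_correct, clinician_followed) pairs.

    Calibrated reliance means following the AI when it is right and
    overriding it when it is wrong; overtrust and undertrust are the
    two distinct ways calibration can fail.
    """
    records = list(records)
    n = len(records)
    calibrated = sum(1 for ai_ok, followed in records if followed == ai_ok)
    overtrust = sum(1 for ai_ok, followed in records if followed and not ai_ok)
    undertrust = sum(1 for ai_ok, followed in records if ai_ok and not followed)
    return {
        "calibration": calibrated / n,
        "overtrust": overtrust / n,
        "undertrust": undertrust / n,
    }

# Toy audit of 10 AI-assisted decisions: (ai_correct, clinician_followed)
audit = [(True, True)] * 6 + [(False, True)] * 2 + [(False, False)] * 2
metrics = reliance_metrics(audit)
# calibration 0.8, overtrust 0.2, undertrust 0.0
```

Framed this way, a persuasive explanation attached to a wrong recommendation shows up directly as overtrust, which is the failure mode the Diagnostics framework identifies as clinically comparable to insufficient trust.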

This finding has a direct implication for system design. Explanation quality cannot be treated as a secondary concern to be addressed after accuracy is established. If XAI features are built after the core reasoning system is deployed, they will be optimised for persuasiveness rather than accuracy — which is precisely the failure mode the literature documents. Calibrated trust requires that explanation mechanisms are co-designed with the reasoning architecture, not retrofitted onto it.

3.3 The Accountability Vacuum: Who Is Responsible When the Agent Is Wrong?

The opacity of agentic reasoning chains is not only an epistemic problem. It is a legal and institutional one. When a patient is harmed by a clinical decision that an AI agent influenced, the question of accountability cannot be answered by examining the agent's output alone. It requires tracing the reasoning chain: which sources the agent retrieved, which tools it called, at which step a reasoning error occurred, and whether that error was attributable to a flaw in the underlying model, a gap in the retrieval corpus, a poorly calibrated tool, or a problem with input data quality.

The Governance-Innovation Gap
Already exists
  • ~90% benchmark accuracy on controlled oncology cases (Ferber et al. 2025)
  • ReAct loops, RAG pipelines, multi-agent MDT frameworks deployable now
  • EU AI Act: CDS classified as high-risk, logging and oversight required
  • GDPR / HIPAA: data protection requirements in force
Does not yet exist
  • Agreed standard for trace-level logging of intermediate reasoning steps
  • Post-incident attribution method — which step caused the harmful output
  • Definition of "sufficient oversight" for 7-agent, 50-step pipelines
  • Bias audit framework for multi-component pipelines (parametric + retrieval + tool)
Regulatory Gap

The EU AI Act classifies clinical decision support systems as high-risk AI applications, requiring comprehensive logging, transparent communication of capabilities and limitations, and human oversight sufficient to detect and correct errors. For a multi-agent clinical system spanning seven specialised agents and dozens of tool calls, what constitutes "sufficient understanding" for human oversight remains undefined. The regulatory framework has identified the right requirement. The technical and institutional infrastructure to meet it does not yet exist.

This accountability vacuum has a distributional dimension. Njei et al. (2026), in a scoping review published in PLOS ONE, found that technical innovation in clinical agentic systems is outpacing ethical governance frameworks, and that the gap is largest in lower-resource settings where the infrastructure for human oversight is thinnest. If agentic AI is deployed at scale in under-resourced hospitals precisely because it appears to reduce the need for specialist oversight, the populations most likely to receive AI-influenced decisions with least human supervision are also those least equipped to challenge or correct errors.

Designing Meaningful Human-in-the-Loop Intervention

Acknowledging that human oversight is necessary is not the same as specifying how it should work. Current clinical AI literature tends to treat "human-in-the-loop" as a binary: either a clinician reviews and approves the final output, or they do not. This framing is insufficient for agentic systems where consequential reasoning occurs at intermediate steps, not only at the final recommendation.

Consider a concrete failure scenario. At Observation 2 of the ReAct trace shown in Part 2, the imaging tool returns a structured finding with {"CNS_mets": false}. If the MedSAM model was underpowered for a particular lesion morphology and this field is wrong, every subsequent reasoning step builds on an incorrect foundation. By the time the final recommendation reaches the clinician, the error is baked into the conclusion in a way that is invisible unless the clinician reads the full reasoning trace. Approving or rejecting the final output is too late — the correction needed to occur at the intermediate observation.

Ideal Human-in-the-Loop Architecture — Midstream Intervention
Thought 1
Agent plans retrieval strategy
Action 1 + Observation 1
Guideline retrieval, normal
Observation 2 ⚠ Clinician Intervention Point
Imaging tool returns CNS_mets: false — clinician reviews imaging evidence, flags the finding as uncertain, edits observation node
System re-reasons from corrected node
Downstream thoughts and actions regenerated from the corrected observation, not from the original erroneous one
Final Output (revised)
Recommendation now reflects CNS uncertainty, recommends MRI before committing to first-line regimen

This is a materially different design from the standard "approve or reject" interface. It requires the UI to expose the reasoning trace to the clinician in a readable form, allow them to edit the content of specific observation nodes, and trigger a partial re-inference from the edited node forward. The technical infrastructure for this — sometimes called trace-level human feedback — exists in research prototypes but is absent from the clinical implementations described in the 2025–2026 literature reviewed here. Building it into deployable systems is one of the concrete governance requirements the EU AI Act implies but does not specify.
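As a rough illustration of what trace-level human feedback could look like, the sketch below assumes a ReAct-style trace stored as a node list; the `TraceNode` and `Trace` types are hypothetical names of my own, not from any system described in the literature above. A clinician replaces the content of one observation node, and everything downstream is discarded so that re-inference must proceed from the corrected node rather than from the stale conclusion.

```python
from dataclasses import dataclass, field

@dataclass
class TraceNode:
    kind: str      # "thought" | "action" | "observation"
    content: dict

@dataclass
class Trace:
    nodes: list = field(default_factory=list)

    def edit_observation(self, index, new_content):
        """Clinician replaces one observation node and invalidates
        everything downstream, forcing re-inference from the corrected
        node rather than patching the final answer."""
        if self.nodes[index].kind != "observation":
            raise ValueError("only observation nodes are editable")
        self.nodes[index] = TraceNode("observation", new_content)
        del self.nodes[index + 1:]   # drop stale downstream reasoning

# Hypothetical trace mirroring the CNS_mets failure scenario:
trace = Trace([
    TraceNode("thought", {"text": "plan staging assessment"}),
    TraceNode("observation", {"CNS_mets": False, "source": "imaging_tool"}),
    TraceNode("thought", {"text": "proceed assuming no CNS involvement"}),
])
trace.edit_observation(1, {"CNS_mets": "uncertain", "flagged_by": "clinician"})
# The stale downstream thought is discarded; the agent now re-reasons
# from the corrected observation before producing a revised output.
```

The essential design choice is the truncation: an "approve or reject" interface leaves the erroneous intermediate state in place, whereas editing the node and deleting its descendants guarantees the revised recommendation is actually derived from the correction.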

Note — Expanded data-flow surface
A static model processes one bounded clinical record per inference call. An agentic system retrieves external documents, calls APIs, and routes intermediate observations between specialist agents across multiple steps. Patient data must therefore be protected not just at input and output, but across every tool call, retrieval query, log entry, and inter-agent message. GDPR and HIPAA were not written for multi-step agentic pipelines and address this requirement only indirectly.

3.4 Bias Made Invisible: When Explainability Fails Equity

The opacity of agentic systems does not only obscure errors in individual cases. It obscures systematic patterns of unfairness across patient populations. A biased training dataset produces a biased model, but a biased model whose reasoning cannot be examined is a model whose bias cannot be identified, measured, or corrected.

In clinical oncology, the training data available to large language models and the clinical trials from which guideline recommendations derive are both historically skewed toward white, male, Western patient populations. Obermeyer et al. (2019), published in Science, demonstrated that a widely deployed commercial algorithm systematically underestimated the health needs of Black patients because it used healthcare cost as a proxy for health need — embedding the effects of historical underpayment into the risk model. The algorithm was accurate by its own metric. It was deeply unfair by any clinical standard. Its bias was invisible until a research team specifically designed a study to look for it.

01
Parametric Bias
Bias encoded in the language model's weights during pretraining on historically skewed clinical text, trial data, and medical literature.
02
Retrieval Corpus Bias
The guideline documents retrieved by the RAG layer reflect the populations on which clinical evidence was generated, predominantly Western, white, and male.
03
Tool Training Bias
Vision transformers for biomarker detection trained on homogeneous histopathology datasets may perform differently across patient populations.
04
Audit Impossibility
A 2025 meta-analysis of explainable AI in clinical decision support, published in Healthcare (MDPI), found that only a minority of the 62 studies reviewed had examined their systems for demographic performance disparities, and fewer still had proposed mechanisms for addressing them.

When a recommendation emerges from a multi-step reasoning chain spanning retrieved guidelines, genomic tools, and imaging analysis, there is no single point at which bias can be located and corrected. The implication is direct: a system that cannot be examined cannot be audited for fairness, and a system that cannot be audited for fairness cannot be trusted to serve all patients equitably. Adequate bias auditing for agentic clinical systems would require not just demographic disaggregation of final outputs — the minimum standard typically applied to static models — but component-level attribution that can determine whether a performance disparity originates in the retrieval corpus, the imaging tool, the parametric model, or their interaction. This attribution problem is analytically tractable in principle: one could run the system with and without specific corpus subsets, or substitute alternative vision models, and measure the change in outcome distributions across demographic groups. In practice, however, the computational cost of such ablations at clinical scale, combined with the absence of sufficiently large demographically annotated evaluation datasets, means that no framework for conducting this audit currently exists. Constructing one is an open research problem that the field has not yet seriously engaged.
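The ablation logic described above can be sketched in miniature. Everything below is a toy of my own construction: the pipeline, component functions, and case data are invented to show the shape of component-level bias attribution, not a clinically meaningful audit.

```python
def outcome_rate(pipeline, cases):
    """Fraction of cases receiving the favourable recommendation."""
    return sum(1 for c in cases if pipeline(c)) / len(cases)

def disparity(pipeline, cases_by_group):
    """Max-min gap in favourable-outcome rates across groups."""
    rates = [outcome_rate(pipeline, cs) for cs in cases_by_group.values()]
    return max(rates) - min(rates)

def attribute_bias(build_pipeline, components, baselines, cases_by_group):
    """For each component, substitute a reference ('debiased') variant
    and record how much the demographic disparity drops."""
    base = disparity(build_pipeline(components), cases_by_group)
    return {
        name: base - disparity(
            build_pipeline({**components, name: baselines[name]}),
            cases_by_group,
        )
        for name in components
    }

# Toy pipeline: recommend treatment iff retriever AND imager both fire.
build = lambda comps: lambda c: comps["retriever"](c) and comps["imager"](c)
components = {
    # Biased retriever: a stricter evidence threshold for group B.
    "retriever": lambda c: c["signal"] > (0.3 if c["group"] == "A" else 0.6),
    "imager":    lambda c: c["signal"] > 0.1,
}
baselines = {
    "retriever": lambda c: c["signal"] > 0.3,   # group-blind variant
    "imager":    lambda c: c["signal"] > 0.1,
}
cases = {
    g: [{"group": g, "signal": s} for s in (0.2, 0.4, 0.5, 0.7)]
    for g in ("A", "B")
}
contrib = attribute_bias(build, components, baselines, cases)
# {'retriever': 0.5, 'imager': 0.0} -- the retriever accounts for the
# full disparity; the imager contributes none.
```

In this toy the arithmetic is trivial; at clinical scale, each substitution means re-running the full agentic pipeline over a demographically annotated evaluation set, which is precisely the computational and data burden the paragraph above identifies as unsolved.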

Conclusion

The Promise and the Condition

The performance gap between agentic AI and its predecessors is real and substantial in controlled evaluations. The evidence reviewed here suggests that agentic systems, built on ReAct reasoning loops, retrieval-augmented generation, specialised tool suites, and multi-agent collaboration, can outperform unaided language models on complex oncology decision tasks by a substantial margin. The architecture is promising. The benchmarks are compelling, but still preliminary.

But performance on benchmarks is a necessary condition for clinical deployment, not a sufficient one. The evidence reviewed in Part 3 establishes three conditions that current agentic systems do not yet reliably meet.

First, the explanations they provide are not consistently the right kind of explanations for clinicians to form calibrated judgements. They can increase trust indiscriminately rather than appropriately, and the specific demands of context-dependence, genuine dialogue, and social capability that clinical explainability requires are unmet by any current system. Worse, the Rosenbacke et al. (2024) evidence suggests that providing explanations may actually decrease trust in some conditions — meaning more transparency is not a reliable remedy and could introduce new failure modes of its own.

Second, the accountability frameworks needed to govern multi-step agentic reasoning in high-stakes clinical settings do not yet exist at the institutional or regulatory level, despite the EU AI Act establishing the right requirements. The technical innovation has outpaced the governance. This gap is not merely administrative: without trace-level logging standards, there is no agreed method for determining at which reasoning step a harmful recommendation originated — making post-incident review, legal accountability, and iterative system improvement structurally impractical.

Third, the opacity of multi-agent reasoning chains actively impedes the bias auditing that equitable clinical deployment requires. When bias is distributed across parametric weights, retrieval corpora, and specialised tools, it cannot be identified by examining the final output alone. The distributional stakes are high: the Njei et al. (2026) scoping review found that the governance gap is largest precisely in lower-resource settings, which means the populations most likely to encounter inadequately governed agentic AI are also those with the fewest institutional mechanisms to detect or challenge its errors.

These three conditions are not independent. The same opacity that makes explanations inadequate also makes accountability elusive and bias invisible. They share a structural root: agentic systems reason through processes that were not designed with external auditability as a first-order requirement. Addressing any one of them therefore requires the same underlying investment — in trace-level logging, in interpretable intermediate representations, and in the governance infrastructure that can make use of them. This convergence is, in its way, good news: the research and engineering agenda required is coherent rather than fragmented.

None of these conditions is an argument against agentic AI in healthcare. They are arguments about the sequence in which development and deployment should proceed. The most honest conclusion is also the most demanding one. The question is not whether agentic AI will transform clinical decision-making. The evidence reviewed here suggests it already is. The question is whether the institutions deploying it are building the oversight infrastructure fast enough to ensure that transformation is equitable, accountable, and worthy of the trust it is beginning to receive.

Three unmet conditions for clinical deployment
01
Explainability
Current agentic systems cannot provide the context-dependent, dialogic explanations that calibrated clinical trust requires. Adding explanations may increase or decrease trust unpredictably.
Unresolved
02
Accountability
Without trace-level logging standards, a harmful recommendation cannot be attributed to a specific reasoning step, component, or responsible party. Regulation names the requirement; the infrastructure to meet it does not yet exist.
Institutional vacuum
03
Equity
Distributed bias across parametric weights, retrieval corpora, and tool training datasets cannot be detected by output inspection alone. Governance gaps are largest in under-resourced settings.
Structurally invisible
Research Agenda

The three conditions identified above share a structural root in the opacity of multi-step agentic reasoning. A coherent research agenda follows: (1) standardised trace-level logging formats that persist intermediate observations and tool outputs in auditable, machine-readable form; (2) interpretable intermediate representations that allow downstream verifiers — human and algorithmic — to inspect reasoning at the step level rather than only at the output; and (3) prospective bias auditing frameworks that decompose bias attribution across parametric, retrieval, and tool components rather than treating the pipeline as a monolithic black box. None of these is a purely technical problem. Each requires sustained collaboration between ML researchers, clinicians, ethicists, and the regulatory bodies that will ultimately determine whether agentic clinical AI earns the institutional trust it is already beginning to receive in practice.
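As an indication of what requirement (1) might look like in practice, the sketch below writes each reasoning step as one JSON line with a hashed patient reference. The schema, field names, and `log_step` helper are my own assumptions for illustration, not a proposed or existing standard.

```python
import hashlib
import io
import json
import time

def log_step(fp, run_id, step, kind, payload, patient_id):
    """Append one reasoning step as a JSON line to an audit log.

    kind is "thought", "action", or "observation". The patient
    identifier is hashed so traces can be joined for post-incident
    review without writing raw identifiers into every entry (a real
    deployment would need a keyed hash or tokenisation service).
    """
    record = {
        "run_id": run_id,
        "step": step,
        "kind": kind,
        "patient_ref": hashlib.sha256(patient_id.encode()).hexdigest()[:16],
        "ts": time.time(),
        "payload": payload,
    }
    fp.write(json.dumps(record, sort_keys=True) + "\n")

log = io.StringIO()
log_step(log, "run-001", 4, "observation",
         {"tool": "imaging", "CNS_mets": False}, "patient-123")
entry = json.loads(log.getvalue())
# One machine-readable record per step: an auditor can now locate
# the exact observation a downstream error was built on.
```

The JSONL format matters less than the properties it buys: append-only records, one per intermediate step, keyed by run so that a post-incident review can replay the chain and ask at which line the reasoning went wrong.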

References

Selected references supporting the technical and ethical analysis above. Accessed April 2026.