From Passive Analysis to Active Decision-Making: The Case for Agentic AI in Clinical Care
1.1 The Domain: Medicine's Data Problem
Modern healthcare generates data at a scale that exceeds human capacity to process it. A single oncology patient may generate large volumes of information across their care pathway: genomic sequencing results, histopathology slides, radiological images, electronic health records spanning years of clinical notes, and a continuously expanding body of treatment guidelines. The clinician tasked with integrating this information into a treatment decision does so under time pressure, with incomplete access to the latest evidence, and without computational tools capable of reasoning across all of it simultaneously.
Medical errors and diagnostic failures are a major source of preventable harm. One influential Johns Hopkins analysis estimated that more than 250,000 deaths per year in the United States may be associated with medical error, while more recent diagnostic-error research suggests that misdiagnosis contributes to hundreds of thousands of serious harms — including death and permanent disability — each year. In oncology specifically, where treatment decisions depend on the intersection of molecular pathology, imaging findings, and rapidly evolving guideline recommendations, the cognitive demands on clinicians have grown faster than any individual's capacity to meet them. The question is not whether artificial intelligence has a role in supporting these decisions. It is which kind of AI is adequate to the task.
1.2 The Challenge: Why Static AI Falls Short
The first generation of clinical AI addressed narrow, well-defined tasks. A convolutional neural network trained to detect diabetic retinopathy from fundus photographs. A gradient-boosted model predicting 30-day readmission risk from structured EHR data. A transformer fine-tuned to extract medication names from clinical notes. Each of these systems demonstrated impressive performance on its specific benchmark. None of them could reason across tasks, integrate heterogeneous data types, or adapt mid-inference when new information emerged.
1.3 The Case for Agentic AI: A New Paradigm
The emergence of agentic AI represents a qualitatively different response to all three challenges. Where static models produce outputs, agentic systems execute plans. An agentic AI does not simply generate a response to a clinical query. It decomposes the query into sub-tasks, selects appropriate tools for each, executes them sequentially or in parallel, evaluates intermediate results, and iterates until it reaches a conclusion it can defend with verifiable citations.
Early empirical evidence suggests that this architecture can substantially improve performance on controlled oncology decision-making tasks. In a study published in Nature Cancer, Ferber et al. (2025) at TU Dresden developed and validated an autonomous AI agent for clinical decision-making in oncology. Evaluated on 20 realistic multimodal patient cases, the agent reached correct clinical conclusions in roughly 90% of cases, compared with just 30.3% for the base language model operating alone. The near-tripling of performance was achieved not by scaling the underlying model, but by giving it the ability to plan, retrieve evidence, execute specialised tools, and revise its reasoning.
This architecture also addresses the reliability problem directly. By anchoring generation in retrieved evidence rather than parametric memory, the system produced citations in 82.5% of responses that were accurately aligned with its assertions — enabling clinicians to verify the basis of each recommendation rather than accepting it on trust.
The picture is not uniformly optimistic. Zhao et al. (2026), in a review of over 500 studies of AI agents in healthcare published in npj Artificial Intelligence, concluded that technical innovation is outpacing the ethical governance frameworks needed to deploy these systems responsibly. Clinical validation of agentic systems remains predominantly limited to laboratory and simulation settings rather than prospective clinical trials with real patients. The gap between benchmark accuracy and deployable safety is not a technical detail. It is the central problem this blog examines. Part 2 examines how agentic architectures work in practice. Part 3 confronts the harder question: whether a system that reaches the right answer through a reasoning process that clinicians cannot follow, audit, or challenge is one that medicine should trust.
Inside the Clinical Agent: How Agentic AI Reasons, Retrieves, and Acts
2.1 The Architecture Shift: From Generation to Agency
To understand what makes agentic AI different from the language models that preceded it, it helps to start with what a standard large language model cannot do. When a clinician queries a model with a complex oncology case, the model generates a response in a single forward pass through its parameters. It draws entirely on knowledge encoded during training, cannot access real-time information, cannot call external tools, and cannot revise its reasoning when a gap appears mid-inference.
Agentic AI breaks this single-pass constraint. The dominant architectural paradigm enabling this is ReAct, introduced by Yao et al. (2023), which interleaves reasoning traces with tool-use actions in a continuous loop: think, act, observe, revise, repeat. The annotated trace below shows what this looks like for a real oncology query.
You do not need to follow every clinical detail. The important point is the loop: the agent asks what is missing, retrieves evidence, checks constraints, and revises before answering.
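A condensed version of such a trace, with hypothetical clinical details (the lesion values, tool names, and query strings are illustrative, not taken from the Ferber et al. system), might read:

```
Thought 1:     Staging is incomplete. The report gives tumour size and nodal
               status, but no dedicated brain imaging has been reviewed.
Action 1:      imaging_tool(study="CT chest", task="characterise primary lesion")
Observation 1: {"lesion_site": "RUL", "diameter_mm": 32, "lymph_nodes": "N2",
                "pleural_effusion": false, "confidence": 0.93}
Thought 2:     N2 disease. CNS involvement must be excluded before first-line
               selection.
Action 2:      imaging_tool(study="MRI brain", task="detect metastases")
Observation 2: {"CNS_mets": false, "confidence": 0.91}
Thought 3:     Retrieve current guidance for this stage and mutation profile.
Action 3:      retrieve(query="first-line therapy, stage IIIA NSCLC, EGFR ex19del")
Observation 3: [3 guideline chunks and 1 trial summary, all above threshold]
Thought 4:     Evidence is consistent and no contraindication is flagged.
               Draft the recommendation, citing Observations 1-3.
Final:         Recommendation with inline citations to the retrieved chunks.
```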
The trace makes visible what a static model elides entirely: the reasoning is structured, iterative, and auditable step by step. The performance gain is architectural rather than parametric — the same base model, when embedded in an agentic system, moves from 30.3% to roughly 90% accuracy not by scaling, but by placing reasoning inside a closed process of decomposition, retrieval, and self-revision. Each subproblem is narrower and easier to validate than the global clinical question, and retrieval-augmented generation (RAG) grounds the agent in external evidence rather than parametric memory, shifting from recall-based plausibility to source-based justification.
2.2 The Pipeline in Practice: Ferber et al.'s Oncology Agent
Before examining the Ferber et al. system in detail, it is worth making explicit a distinction the benchmarking literature often blurs. Not all AI systems that involve language models, retrieval, or tool use are agentic. The term covers a wide spectrum, and the governance implications differ substantially across it.
The most rigorously validated clinical agentic system published to date is the autonomous oncology agent developed by Ferber and colleagues, reported in Nature Cancer in August 2025. Its architecture instantiates agentic design principles in a medically demanding context: personalised treatment decision-making where inputs span histopathology, radiology, genomics, and clinical guidelines simultaneously.
What matters in this pipeline is not any single module in isolation, but the closed-loop interaction between decomposition, evidence retrieval, specialised tool invocation, and iterative verification. Patient data are first rendered into a sequence of tractable questions; those questions trigger targeted retrieval and expert tools; the outputs of those tools are then reintroduced into the reasoning loop, where they can confirm, refine, or overturn the agent's prior hypothesis. The system therefore behaves less like a chatbot with attachments than like a closed-loop decision system, in which each stage conditions the next and each intermediate result remains available for correction. Its clinical strength lies precisely in this feedback structure: evidence does not merely decorate the answer after the fact, but actively reshapes the decision while it is being formed.
Under the Hood: RAG and Tool Integration
The architecture diagram above describes what the agent does at each step. It is equally important to understand how each step is implemented: the specific technical choices that make retrieval and tool use work reliably in a high-stakes clinical context.
A practical implementation of this kind typically chunks guidelines (such as ESMO, NCCN, and NICE documents) into overlapping passages of around 512 tokens to avoid boundary artefacts. Each chunk is embedded using a biomedical-domain encoder such as PubMedBERT or BioLORD, and stored in a vector index such as FAISS. At query time, a dense path based on embedding similarity and a sparse path (BM25) based on keyword overlap are run in parallel, with results merged and reranked. Only chunks exceeding a confidence threshold are passed into the context window. This hybrid approach addresses the known failure mode of dense-only retrieval, where semantically similar but factually divergent passages can surface without sufficient discriminative filtering.
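A minimal sketch of this hybrid path in Python, assuming the sentence-transformers, faiss, and rank_bm25 libraries; the embedding model name, fusion weight, and confidence threshold are illustrative choices, not the published configuration:

```python
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# Toy stand-ins for overlapping ~512-token guideline chunks.
chunks = [
    "ESMO: osimertinib is preferred first-line for advanced NSCLC with EGFR ex19del.",
    "NCCN: platinum doublet chemotherapy for EGFR wild-type stage IV NSCLC.",
    "NICE: exclude CNS involvement before selecting first-line systemic therapy.",
]

encoder = SentenceTransformer("NeuML/pubmedbert-base-embeddings")  # biomedical encoder
emb = encoder.encode(chunks, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(emb.shape[1])   # inner product on unit vectors = cosine
index.add(emb)

bm25 = BM25Okapi([c.lower().split() for c in chunks])   # sparse keyword path

def hybrid_retrieve(query: str, k: int = 2, alpha: float = 0.6, threshold: float = 0.3):
    """Run dense and sparse paths, fuse scores, keep only chunks above threshold."""
    q = encoder.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q, len(chunks))
    dense = np.zeros(len(chunks))
    dense[ids[0]] = scores[0]                        # align scores to chunk indices
    sparse = np.asarray(bm25.get_scores(query.lower().split()))
    sparse = sparse / (sparse.max() + 1e-9)          # normalise sparse scores to [0, 1]
    fused = alpha * dense + (1 - alpha) * sparse     # simple weighted fusion as reranker
    top = np.argsort(-fused)[:k]
    return [(chunks[i], float(fused[i])) for i in top if fused[i] >= threshold]

print(hybrid_retrieve("first-line treatment for EGFR ex19del NSCLC"))
```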
MedSAM and vision transformers do not return raw segmentation masks to the reasoning agent, since masks are not language-model-readable. Instead, a post-processing layer converts spatial outputs into a structured JSON finding object before it enters the ReAct context. For a CT scan, this might look like: {"lesion_site": "RUL", "diameter_mm": 32, "lymph_nodes": "N2", "pleural_effusion": false, "CNS_mets": false, "confidence": 0.93}. The agent treats this structured observation identically to a retrieved text chunk: it can cite it, query it further, or flag it for human review if the confidence score falls below a safety threshold. This serialisation step is architecturally significant. It is what allows heterogeneous modalities to be unified under a single reasoning loop without requiring the language model itself to process images.
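A sketch of that serialisation layer; the geometry is deliberately simplified, and the fixed fields are placeholders for what dedicated anatomical and nodal classifiers would supply:

```python
import json
import numpy as np

def mask_to_finding(mask: np.ndarray, spacing_mm: float, confidence: float) -> dict:
    """Convert a binary lesion mask into a language-model-readable finding object."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        diameter_px = 0
    else:
        diameter_px = int(max(xs.max() - xs.min(), ys.max() - ys.min())) + 1
    return {
        "lesion_site": "RUL",            # placeholder: from an anatomical atlas lookup
        "diameter_mm": round(diameter_px * spacing_mm, 1),
        "lymph_nodes": "N2",             # placeholder: from a nodal-staging classifier
        "pleural_effusion": False,
        "CNS_mets": False,
        "confidence": round(confidence, 2),
    }

mask = np.zeros((64, 64), dtype=bool)
mask[20:52, 10:42] = True                # a 32 px lesion at 1.0 mm pixel spacing
print(json.dumps(mask_to_finding(mask, spacing_mm=1.0, confidence=0.93)))
```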
Before the final recommendation is released, a dedicated verifier component, implemented as a separate LLM call with a constrained checking prompt, re-examines each factual claim in the draft output against the retrieved source chunks. It performs three checks: (1) attribution, confirming that each cited claim is actually present in the cited chunk; (2) contradiction, checking whether any retrieved evidence conflicts with the proposed recommendation; and (3) completeness, verifying that flagged contraindications in the retrieved guidelines have been addressed. Claims that fail any check are returned to the ReAct loop as a new observation, triggering an additional reasoning step rather than allowing an ungrounded recommendation to reach the clinician. In the Ferber et al. system, this verification step was associated with 82.5% of final responses carrying accurately aligned citations, a substantially higher rate than retrieval-augmented generation without explicit grounding verification.
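A sketch of such a verifier, with call_llm standing in for any chat-completion client; the prompt wording and PASS/FAIL protocol are illustrative, not the published implementation:

```python
VERIFIER_PROMPT = """You are a verification module. Answer strictly PASS or FAIL.
Check to perform: {check}
Claim under review: {claim}
Retrieved evidence: {evidence}"""

CHECKS = {
    "attribution": "Is the cited claim actually present in the cited chunk?",
    "contradiction": "Is the claim free of conflict with any retrieved evidence?",
    "completeness": "Are all contraindications flagged in the evidence addressed?",
}

def verify(claim: str, evidence: str, call_llm) -> list[str]:
    """Return names of failed checks; any failure re-enters the ReAct loop."""
    failed = []
    for name, question in CHECKS.items():
        verdict = call_llm(
            VERIFIER_PROMPT.format(check=question, claim=claim, evidence=evidence)
        )
        if not verdict.strip().upper().startswith("PASS"):
            failed.append(name)   # surfaced to the agent as a new observation
    return failed
```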
The verifier architecture raises an uncomfortable question: who monitors the monitor? If the verifier is itself a large language model, it is subject to the same failure modes it is meant to prevent. Chief among these is sycophancy — the documented tendency of LLMs to agree with or validate the outputs of other LLMs in a shared context, particularly when those outputs are expressed with high confidence. A verifier operating purely through generative inference may ratify a plausible-sounding but incorrect recommendation because its parametric priors align with the orchestrator's conclusion, rather than because the citation evidence genuinely supports it. This is not a hypothetical concern: sycophancy in LLM-to-LLM evaluation has been documented in the alignment literature (Sharma et al., 2023) and represents a structural risk in any multi-agent pipeline where one generative model is asked to grade another's output.
The practical implication for clinical deployment is that robust verification cannot rely on a generative LLM alone. Production-grade systems increasingly integrate rule-based checks and symbolic logic constraints — for example, a formal drug-interaction database query that returns a deterministic contraindication flag, or a structured eligibility checker against trial criteria expressed in machine-readable logic. These non-generative components are not susceptible to sycophancy because they do not produce outputs by predicting plausible text. They break the LLM self-reinforcement loop that would otherwise allow a confident hallucination to pass through verification unchallenged.
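A sketch of one such deterministic check; the interaction table is a toy stand-in for a curated drug-interaction database:

```python
CONTRAINDICATED = {
    frozenset({"osimertinib", "rifampicin"}),     # strong CYP3A4 induction
    frozenset({"methotrexate", "trimethoprim"}),  # additive antifolate toxicity
}

def interaction_flags(proposed: list[str], current_meds: list[str]) -> list[tuple[str, str]]:
    """Return every contraindicated pair between proposed and current drugs."""
    return [
        (p, m)
        for p in proposed
        for m in current_meds
        if frozenset({p.lower(), m.lower()}) in CONTRAINDICATED
    ]

# A non-empty result vetoes the draft recommendation regardless of how
# confident the generative pipeline is: set membership cannot be persuaded.
print(interaction_flags(["Osimertinib"], ["rifampicin", "metformin"]))
```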
As multimodal evidence accumulates, context windows become crowded, coordination burdens increase, and the management of parallel clinical subtasks becomes harder to sustain within one reasoning thread. The natural next step is therefore not simply a more capable solitary agent, but a system in which coordination itself becomes an architectural primitive.
2.3 Multi-Agent Collaboration: The Next Design Horizon
The single-agent system described by Ferber and colleagues represents a significant proof of concept but also a design ceiling. Complex clinical scenarios often exceed what any single agent can reason about within a single context window. The emerging direction in 2025 and 2026 is multi-agent collaboration, where distinct agents specialise in different components of a clinical reasoning task and coordinate through structured communication protocols.
The motivating analogy is the structure of clinical teams. A cancer patient's care involves a radiologist, a pathologist, a molecular oncologist, a pharmacist, and a multidisciplinary tumour board, each contributing specialised expertise. Multi-agent AI systems attempt to replicate this division of labour computationally. The most technically interesting design challenge, however, is not how to build specialised agents but how to decide which agents to invoke and when.
On receiving a clinical query, a complexity classifier, itself a prompted LLM call, evaluates three dimensions: (1) number of data modalities involved, (2) presence of conflicting evidence or contraindications, and (3) whether the query involves drug dosing, trial eligibility, or rare-disease reasoning. Based on this classification, the system routes to one of three collaboration regimes:
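Tier 1 resolves simple, single-modality queries with a direct single-agent response; Tier 3 escalates to the full multi-agent MDT consultation with a debate round, as described below; Tier 2, on a reasonable reading of the tiering, sits between the two with an orchestrator invoking selected specialist agents. A minimal routing sketch, with illustrative thresholds:

```python
from dataclasses import dataclass

@dataclass
class Complexity:
    n_modalities: int            # (1) number of data modalities involved
    conflicting_evidence: bool   # (2) conflicting evidence or contraindications
    high_stakes_domain: bool     # (3) dosing, trial eligibility, or rare disease

def route(c: Complexity) -> int:
    """Map a classified query to a collaboration regime (Tier 1-3)."""
    if c.high_stakes_domain or (c.n_modalities >= 3 and c.conflicting_evidence):
        return 3   # full multi-agent MDT with debate round
    if c.n_modalities >= 2 or c.conflicting_evidence:
        return 2   # orchestrator plus targeted specialist agents
    return 1       # single agent answers directly

print(route(Complexity(n_modalities=4, conflicting_evidence=True, high_stakes_domain=False)))
```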
The Cost of Deliberation: Latency and Token Economics
Routing logic determines not only which agents to invoke but how long a clinical team will wait for a recommendation. This is among the most underappreciated engineering constraints in agentic clinical AI. A Tier 3 full MDT consultation — involving five specialist agents, a debate round, and multiple RAG retrieval cycles — may consume tens of thousands of tokens and take several minutes to complete. In an elective oncology MDT meeting, this latency is acceptable. In an emergency department managing a patient in septic shock, it is not.
These figures are not fixed — they depend on model size, retrieval index scale, and parallelisation strategy — but they illustrate a structural tension that the benchmarking literature largely ignores. Liu et al. (2026) in npj Digital Medicine found that agentic systems required substantially greater computational resources than baseline LLMs, yet their benchmarks assessed accuracy in isolation from latency. A system that reaches ~90% accuracy in eight minutes may be transformative for treatment planning and counterproductive for triage.
This tension has a direct parallel in the course framing of integrated machine learning systems. An agent that is accurate in isolation but incompatible with the time constraints of the clinical environment in which it is embedded is not a well-integrated system. The engineering challenge is not simply to build a capable agent but to build one whose latency profile matches the decision latency of the clinical setting it serves. Tier 1 queries must remain under the threshold where a clinician would simply look up the answer independently. Tier 3 consultations must be reserved for decisions where a wrong fast answer is genuinely worse than a correct slow one — a determination that requires the system to know not just what it thinks, but how long it is allowed to think.
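A back-of-envelope sketch makes the trade-off concrete; every figure here (decode throughput, per-tier token counts, retrieval overhead) is an assumption for illustration, not a measured benchmark:

```python
DECODE_TOK_PER_S = 50     # assumed autoregressive decode throughput
RETRIEVAL_S = 1.5         # assumed overhead per retrieval cycle
DEBATE_ROUND_S = 30       # assumed overhead per debate round

def tier_latency_s(gen_tokens: int, retrieval_cycles: int, debate_rounds: int) -> float:
    """Sequential-path latency; parallel specialist calls are not double-counted."""
    return (gen_tokens / DECODE_TOK_PER_S
            + retrieval_cycles * RETRIEVAL_S
            + debate_rounds * DEBATE_ROUND_S)

for tier, (toks, cycles, rounds) in {
    1: (800, 1, 0),       # single agent, one retrieval
    2: (4_000, 3, 0),     # orchestrator plus specialists
    3: (20_000, 8, 1),    # full MDT with a debate round
}.items():
    print(f"Tier {tier}: ~{tier_latency_s(toks, cycles, rounds) / 60:.1f} min")
```

Under these assumptions a Tier 1 query returns in seconds while a Tier 3 consultation takes roughly seven minutes, which is consistent with the tension described above.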
The routing diagram shows the logic; the research literature traces how the field arrived there. The timeline below marks the key systems and benchmarks that established the current state of multi-agent clinical AI.
The architecture underlying these systems has a consistent topology across all the frameworks the literature evaluates, whether LangGraph, CrewAI, or bespoke implementations: orchestrator, parallel specialists, verifier. The diagram below shows how information flows through a Tier 3 (full MDT) deployment. The case enters the orchestrator, fans out to five specialist agents executing in parallel, and is then consolidated by a verifier before a recommendation reaches the clinician.
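The same topology can be sketched in code; the agents below are stubs standing in for full LLM-plus-tool pipelines:

```python
import asyncio

SPECIALISTS = ["radiology", "pathology", "molecular_oncology", "pharmacy", "guidelines"]

async def specialist(name: str, case: dict) -> dict:
    await asyncio.sleep(0.1)          # stands in for an LLM call plus tool use
    return {"agent": name, "finding": f"{name} assessment of {case['id']}"}

async def verifier(draft: dict) -> dict:
    await asyncio.sleep(0.1)          # grounding checks before release (see Part 2)
    return {**draft, "verified": True}

async def run_mdt(case: dict) -> dict:
    # Fan out: the orchestrator dispatches all specialists concurrently.
    findings = await asyncio.gather(*(specialist(s, case) for s in SPECIALISTS))
    # Fan in: consolidate, then verify before the clinician sees anything.
    return await verifier({"case": case["id"], "findings": findings})

print(asyncio.run(run_mdt({"id": "case-042"})))
```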
What these architectures collectively represent is a shift in how AI systems engage with clinical complexity. Rather than compressing all relevant knowledge into a single large model, they distribute reasoning across specialised components connected by explicit communication and verification protocols. The result is a system more capable than any of its individual parts, and whose reasoning process can, at least in principle, be traced through the interaction log between agents.
Yet the same features that make this architecture powerful are precisely what make it difficult to govern. A static model can be tested against a fixed input-output benchmark: given this patient record, does the model produce the correct risk score? An agentic system, by contrast, may retrieve different source documents, call different tools, and follow a different reasoning path each time it encounters a similar case, depending on what is in its retrieval index, what its tools return, and how its planner interprets intermediate results. This context-dependence makes evaluation, accountability, and clinical responsibility structurally harder than for any previous generation of clinical AI. Part 3 confronts these consequences directly.
Imagine the retrieval module selects an outdated guideline fragment — one published before a recent trial changed first-line recommendations for a specific mutation subtype. The imaging tool returns a high-confidence segmentation that misclassifies a borderline lymph node as negative. The language model, reasoning over these inputs, constructs a coherent treatment recommendation supported by citations and a plausible reasoning trace. The final output looks more trustworthy than a static model's prediction precisely because it contains steps, tools, and references. And yet every visible feature of trustworthiness — the trace, the citations, the tool outputs — has been built on compounding errors that are individually undetectable from the outside.
This is the central paradox of agentic clinical AI: the structural features that enable better performance also enable more convincing failure. A wrong answer from a static model looks like a wrong answer. A wrong answer from a well-designed agentic system can look like a well-reasoned clinical decision. Recognising and addressing this asymmetry is the work of Part 3.
The Trust Problem: Why Explainability Is Not Enough
The ethical challenges of agentic clinical AI are not three separate issues — trust, accountability, and fairness — sitting neatly alongside one another. They share a single structural root: the decision is produced through a multi-step process that is difficult to inspect from outside. Explainability fails because the reasoning unfolds across dozens of intermediate steps. Accountability fails because errors cannot be traced to a specific component without trace-level logging. Bias auditing fails because there is no single output to disaggregate — the bias is distributed across the pipeline. Part 3 develops each of these consequences in turn, but their common origin is worth naming first.
3.1 The Technical Root: Why Agentic AI Is Harder to Explain Than Any Model Before It
The explainability problem in clinical AI is not new. Physicians have been wary of black-box models for over a decade, and techniques such as SHAP and LIME were developed largely in response to that wariness. For a single-model classifier predicting sepsis risk from structured EHR data, these tools are genuinely useful: they can tell a clinician that elevated lactate, falling blood pressure, and patient age were the three features most responsible for a high-risk prediction.
Agentic systems make this problem structurally harder. A clinical agent does not produce a single prediction from a fixed input. It produces a conclusion at the end of a reasoning trajectory that may span dozens of steps, multiple tool calls, and hundreds of retrieved text passages. A post-hoc explanation tool applied to the final output cannot reconstruct this trajectory. It can only characterise the relationship between the final answer and the inputs visible to the explanation model, which is a fraction of what actually shaped the agent's conclusion.
"Future AI systems may need to provide medical professionals with explanations of AI predictions and decisions. While current XAI methods match these requirements in principle, they are too inflexible and not sufficiently geared toward clinicians' needs to fulfill this role."
— Räz, Pahud De Mortanges & Reyes, Frontiers in Radiology, 2025

The technical difficulty is worth making precise. Methods such as SHAP and LIME work by perturbing a model's inputs and observing how its outputs change — effectively mapping a local region of the model's input-output function. This approach is coherent when a model's decision is produced in a single deterministic pass from a fixed input. It breaks down when the input itself is not fixed but is dynamically constructed at inference time from retrieved documents, tool outputs, and previous reasoning steps. In an agentic system, the effective input to any given reasoning step — and thus the space that a post-hoc explanation must characterise — is not the original patient query. It is the accumulation of everything the agent has seen, retrieved, and concluded up to that point. This accumulated context is different for every patient and every run, making population-level feature attribution methods structurally inapplicable. Even if a perfect explanation of the final output were produced, it could not tell a clinician which of the thirty-seven intermediate reasoning steps contained the error that led to a harmful recommendation. What clinical AI explainability requires is not better post-hoc attribution of outputs. It is in-process transparency — visibility into reasoning as it unfolds, at a granularity that allows intervention before a conclusion hardens.
Räz et al. (2025), writing in Frontiers in Radiology, identified three escalating demands that clinical AI explainability must meet. The first is context-dependence: explanations must be tailored to the specific clinical situation and user. The second is genuine dialogue: the system must respond to clinicians' follow-up questions, not just provide a static summary. The third, and currently unachieved, is social capability: understanding when an explanation is insufficient and adjusting accordingly. Current agentic systems reliably meet none of these three requirements.
3.2 The Trust Paradox: Explanations Can Make Things Worse
The intuitive response to the explainability problem is to build better explanation tools. The empirical evidence suggests this is not straightforwardly correct. Rosenbacke et al. (2024), in a systematic review published in JMIR AI that screened 778 articles and analysed the 10 meeting rigorous empirical criteria, found that explainable AI increased clinicians' trust compared to unexplained AI in only five of the ten studies. Three studies found no significant effect. Two found that XAI could either increase or decrease trust depending on explanation quality.
There is a second and more troubling finding embedded in this literature. A 2025 paper in Diagnostics developing a dynamic trust calibration framework for AI-assisted diagnosis identified excessive trust as a clinical risk comparable to insufficient trust. When clinicians defer to an AI recommendation without critical evaluation because its explanation seems authoritative, they become vulnerable to the system's errors in exactly those cases where the system is most confident but wrong. The goal of explainability in high-stakes clinical settings is not to maximise trust. It is to produce calibrated trust: a disposition in which clinicians rely on the AI when it is reliable and override it when it is not.
This finding has a direct implication for system design. Explanation quality cannot be treated as a secondary concern to be addressed after accuracy is established. If XAI features are built after the core reasoning system is deployed, they will be optimised for persuasiveness rather than accuracy — which is precisely the failure mode the literature documents. Calibrated trust requires that explanation mechanisms are co-designed with the reasoning architecture, not retrofitted onto it.
3.3 The Accountability Vacuum: Who Is Responsible When the Agent Is Wrong
The opacity of agentic reasoning chains is not only an epistemic problem. It is a legal and institutional one. When a patient is harmed by a clinical decision that an AI agent influenced, the question of accountability cannot be answered by examining the agent's output alone. It requires tracing the reasoning chain: which sources did the agent retrieve, which tools did it call, at which step did a reasoning error occur, and was that error attributable to a flaw in the underlying model, a gap in the retrieval corpus, a poorly calibrated tool, or an input data quality problem.
In place today:
- ~90% benchmark accuracy on controlled oncology cases (Ferber et al. 2025)
- ReAct loops, RAG pipelines, multi-agent MDT frameworks deployable now
- EU AI Act: CDS classified as high-risk, logging and oversight required
- GDPR / HIPAA: data protection requirements in force

Still missing:
- Agreed standard for trace-level logging of intermediate reasoning steps
- Post-incident attribution method — which step caused the harmful output
- Definition of "sufficient oversight" for 7-agent, 50-step pipelines
- Bias audit framework for multi-component pipelines (parametric + retrieval + tool)
The EU AI Act classifies clinical decision support systems as high-risk AI applications, requiring comprehensive logging, transparent communication of capabilities and limitations, and human oversight sufficient to detect and correct errors. For a multi-agent clinical system spanning seven specialised agents and dozens of tool calls, what constitutes "sufficient understanding" for human oversight remains undefined. The regulatory framework has identified the right requirement. The technical and institutional infrastructure to meet it does not yet exist.
This accountability vacuum has a distributional dimension. Njei et al. (2026), in a scoping review published in PLOS ONE, found that technical innovation in clinical agentic systems is outpacing ethical governance frameworks, and that the gap is largest in lower-resource settings where the infrastructure for human oversight is thinnest. If agentic AI is deployed at scale in under-resourced hospitals precisely because it appears to reduce the need for specialist oversight, the populations most likely to receive AI-influenced decisions with least human supervision are also those least equipped to challenge or correct errors.
Designing Meaningful Human-in-the-Loop Intervention
Acknowledging that human oversight is necessary is not the same as specifying how it should work. Current clinical AI literature tends to treat "human-in-the-loop" as a binary: either a clinician reviews and approves the final output, or they do not. This framing is insufficient for agentic systems where consequential reasoning occurs at intermediate steps, not only at the final recommendation.
Consider a concrete failure scenario. At Observation 2 of the ReAct trace shown in Part 2, the imaging tool returns a structured finding with {"CNS_mets": false}. If the MedSAM model was underpowered for a particular lesion morphology and this field is wrong, every subsequent reasoning step builds on an incorrect foundation. By the time the final recommendation reaches the clinician, the error is baked into the conclusion in a way that is invisible unless the clinician reads the full reasoning trace. Approving or rejecting the final output is too late — the correction needed to occur at the intermediate observation.
Example intervention: on the observation CNS_mets: false, the clinician reviews the imaging evidence, flags the finding as uncertain, and edits the observation node.

This is a materially different design from the standard "approve or reject" interface. It requires the UI to expose the reasoning trace to the clinician in a readable form, allow them to edit the content of specific observation nodes, and trigger a partial re-inference from the edited node forward. The technical infrastructure for this — sometimes called trace-level human feedback — exists in research prototypes but is absent from the clinical implementations described in the 2025–2026 literature reviewed here. Building it into deployable systems is one of the concrete governance requirements the EU AI Act implies but does not specify.
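A sketch of that mechanism, with agent_step standing in for one iteration of the ReAct loop and a trace model far simpler than anything a production system would persist:

```python
from dataclasses import dataclass

@dataclass
class TraceNode:
    kind: str        # "thought" | "action" | "observation" | "final"
    content: object

def edit_and_replay(trace: list[TraceNode], index: int, new_content, agent_step,
                    max_steps: int = 20) -> list[TraceNode]:
    """Overwrite one observation node, drop stale downstream nodes, resume the loop."""
    trace[index].content = new_content     # e.g. mark CNS_mets as uncertain
    resumed = trace[: index + 1]           # everything after the edit is stale
    for _ in range(max_steps):
        node = agent_step(resumed)         # one ReAct iteration over the edited trace
        resumed.append(node)
        if node.kind == "final":
            break
    return resumed
```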
3.4 Bias Made Invisible: When Explainability Fails Equity
The opacity of agentic systems does not only obscure errors in individual cases. It obscures systematic patterns of unfairness across patient populations. A biased training dataset produces a biased model, but a biased model whose reasoning cannot be examined is a model whose bias cannot be identified, measured, or corrected.
In clinical oncology, the training data available to large language models and the clinical trials from which guideline recommendations derive are both historically skewed toward white, male, Western patient populations. Obermeyer et al. (2019), published in Science, demonstrated that a widely deployed commercial algorithm systematically underestimated the health needs of Black patients because it used healthcare cost as a proxy for health need — embedding the effects of historical underpayment into the risk model. The algorithm was accurate by its own metric. It was deeply unfair by any clinical standard. Its bias was invisible until a research team specifically designed a study to look for it.
When a recommendation emerges from a multi-step reasoning chain spanning retrieved guidelines, genomic tools, and imaging analysis, there is no single point at which bias can be located and corrected. The implication is direct: a system that cannot be examined cannot be audited for fairness, and a system that cannot be audited for fairness cannot be trusted to serve all patients equitably. Adequate bias auditing for agentic clinical systems would require not just demographic disaggregation of final outputs — the minimum standard typically applied to static models — but component-level attribution that can determine whether a performance disparity originates in the retrieval corpus, the imaging tool, the parametric model, or their interaction. This attribution problem is analytically tractable in principle: one could run the system with and without specific corpus subsets, or substitute alternative vision models, and measure the change in outcome distributions across demographic groups. In practice, however, the computational cost of such ablations at clinical scale, combined with the absence of sufficiently large demographically annotated evaluation datasets, means that no framework for conducting this audit currently exists. Constructing one is an open research problem that the field has not yet seriously engaged.
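In outline, such a component-ablation audit is straightforward to express; run_pipeline, the corpus arguments, and the concordance metric below are all hypothetical stand-ins for the full system:

```python
from collections import defaultdict

def audit_corpus_subset(cases, run_pipeline, corpus_full, corpus_ablated):
    """Compare per-group outcome rates with and without one retrieval subset."""
    rates = {}
    for label, corpus in (("full", corpus_full), ("ablated", corpus_ablated)):
        outcomes = defaultdict(list)
        for case in cases:
            result = run_pipeline(case, corpus=corpus)     # full agentic run per case
            outcomes[case["group"]].append(result["concordant"])
        rates[label] = {g: sum(v) / len(v) for g, v in outcomes.items()}
    # A group disparity that appears only in one condition localises the bias
    # to the ablated component rather than the parametric model.
    return rates
```

The sketch also makes the cost problem visible: every cell of the audit requires a complete agentic run per patient case, which is exactly why such ablations are tractable in principle yet unaffordable at clinical scale today.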
The Promise and the Condition
The performance gap between agentic AI and its predecessors is real and substantial in controlled evaluations. The evidence reviewed here suggests that agentic systems, built on ReAct reasoning loops, retrieval-augmented generation, specialised tool suites, and multi-agent collaboration, can outperform unaided language models on complex oncology decision tasks by a substantial margin. The architecture is promising. The benchmarks are compelling, but still preliminary.
But performance on benchmarks is a necessary condition for clinical deployment, not a sufficient one. The evidence reviewed in Part 3 establishes three conditions that current agentic systems do not yet reliably meet.
First, the explanations they provide are not consistently the right kind of explanations for clinicians to form calibrated judgements. They can increase trust indiscriminately rather than appropriately, and the specific demands of context-dependence, genuine dialogue, and social capability that clinical explainability requires are unmet by any current system. Worse, the Rosenbacke et al. (2024) evidence suggests that providing explanations may actually decrease trust in some conditions — meaning more transparency is not a reliable remedy and could introduce new failure modes of its own.
Second, the accountability frameworks needed to govern multi-step agentic reasoning in high-stakes clinical settings do not yet exist at the institutional or regulatory level, despite the EU AI Act establishing the right requirements. The technical innovation has outpaced the governance. This gap is not merely administrative: without trace-level logging standards, there is no agreed method for determining at which reasoning step a harmful recommendation originated — making post-incident review, legal accountability, and iterative system improvement structurally impractical.
Third, the opacity of multi-agent reasoning chains actively impedes the bias auditing that equitable clinical deployment requires. When bias is distributed across parametric weights, retrieval corpora, and specialised tools, it cannot be identified by examining the final output alone. The distributional stakes are high: the Njei et al. (2026) scoping review found that the governance gap is largest precisely in lower-resource settings, which means the populations most likely to encounter inadequately governed agentic AI are also those with the fewest institutional mechanisms to detect or challenge its errors.
These three conditions are not independent. The same opacity that makes explanations inadequate also makes accountability elusive and bias invisible. They share a structural root: agentic systems reason through processes that were not designed with external auditability as a first-order requirement. Addressing any one of them therefore requires the same underlying investment — in trace-level logging, in interpretable intermediate representations, and in the governance infrastructure that can make use of them. This convergence is, in its way, good news: the research and engineering agenda required is coherent rather than fragmented.
None of these conditions is an argument against agentic AI in healthcare. They are arguments about the sequence in which development and deployment should proceed. The most honest conclusion is also the most demanding one. The question is not whether agentic AI will transform clinical decision-making. The evidence reviewed here suggests it already is. The question is whether the institutions deploying it are building the oversight infrastructure fast enough to ensure that transformation is equitable, accountable, and worthy of the trust it is beginning to receive.
The three conditions identified above share a structural root in the opacity of multi-step agentic reasoning. A coherent research agenda follows: (1) standardised trace-level logging formats that persist intermediate observations and tool outputs in auditable, machine-readable form; (2) interpretable intermediate representations that allow downstream verifiers — human and algorithmic — to inspect reasoning at the step level rather than only at the output; and (3) prospective bias auditing frameworks that decompose bias attribution across parametric, retrieval, and tool components rather than treating the pipeline as a monolithic black box. None of these is a purely technical problem. Each requires sustained collaboration between ML researchers, clinicians, ethicists, and the regulatory bodies that will ultimately determine whether agentic clinical AI earns the institutional trust it is already beginning to receive in practice.
Selected references supporting the technical and ethical analysis above. Accessed April 2026.
- Ferber, D. et al. (2025) 'Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology', Nature Cancer, 6(8), pp. 1337–1349. doi:10.1038/s43018-025-00991-6
- Liu, Y., Carrero, Z.I., Jiang, X. et al. (2026) 'Benchmarking large language model-based agent systems for clinical decision tasks', npj Digital Medicine, 9, 25.
- Zhao, L., Liu, S., Xin, T. et al. (2026) 'AI agent in healthcare: applications, evaluations, and future directions', npj Artificial Intelligence, 2, 31.
- Spieser, J., Balapour, A., Meller, J., Patra, K.C. and Shamsaei, B. (2026) 'A Review of Multi-Agent AI Systems for Biological and Clinical Data Analysis', Methods and Protocols (MDPI), 9(2), 33.
- Kim, Y. et al. (2024) 'MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making', Advances in Neural Information Processing Systems (NeurIPS 2024). arXiv:2404.15155
- Abbas, Q., Jeong, W. and Lee, S.W. (2025) 'Explainable AI in Clinical Decision Support Systems: A Meta-Analysis of Methods, Applications, and Usability Challenges', Healthcare (MDPI), 13(17), 2154.
- Räz, T., Pahud De Mortanges, A. and Reyes, M. (2025) 'Explainable AI in medicine: challenges of integrating XAI into the future clinical routine', Frontiers in Radiology, 5:1627169. doi:10.3389/fradi.2025.1627169
- Rosenbacke, R., Melhus, Å., McKee, M. and Stuckler, D. (2024) 'How Explainable Artificial Intelligence Can Increase or Decrease Clinicians' Trust in AI Applications in Health Care: Systematic Review', JMIR AI, 3:e53207.
- Obermeyer, Z. et al. (2019) 'Dissecting racial bias in an algorithm used to manage the health of populations', Science, 366(6464), pp. 447–453. doi:10.1126/science.aax2342
- Yao, S. et al. (2023) 'ReAct: Synergizing Reasoning and Acting in Language Models', ICLR 2023. arXiv:2210.03629
- Njei, B., Al-Ajlouni, Y.A., Kanmounye, U.S. et al. (2026) 'Artificial intelligence agents in healthcare research: A scoping review', PLOS ONE, 21(2):e0342182.