United Kingdom | Technology | 4 days ago

Towards autonomous medical artificial intelligence agents

The article discusses the progress of large language models (LLMs) in healthcare, highlighting their capabilities in various medical tasks such as question-answering, reasoning, and diagnostic challenges. It also mentions real-world applications like decision-support tools for medical guidelines, data extraction from clinical notes, and generating clinical codes. The text emphasizes that while LLMs have made significant strides, their current narrow applications do not fully leverage their potential in complex clinical workflows. Effective clinical decision-making involves multiple steps, and…

Main

LLMs have shown impressive performance on medical benchmarks, ranging from traditional question-answering tasks 3, 4 to more challenging reasoning scenarios 5, 6 and multimodal diagnostic challenges 7. Moreover, several initiatives have demonstrated the practical utility of LLMs in real-world healthcare settings, including their application as decision-support tools for medical guideline information 8, extracting and structuring data 9 from clinical notes, and generating clinical codes 10. However, as LLMs evolve towards more generalist, reasoning models, these current, narrowly defined applications in healthcare drastically underutilize the broader potential of LLMs across many medical tasks and fall short in addressing the multifaceted demands of clinical workflows, which require optimizing diagnostic accuracy without overutilizing medical resources. Within these workflows, effective clinical decision-making is a multi-step process whereby physicians need to repeatedly gather patient information through history taking and diagnostic tests, then combine and reason over the results until they feel confident enough to establish a working hypothesis and initiate a treatment. In this context, nearly all tasks are performed within an EHR system. Within such systems, physicians order laboratory tests on blood or urine samples, request microbiological studies and imaging procedures, and order interventions or medications. Crucially, the execution and documentation of these actions are managed within systems that must adhere to standards such as the Fast Healthcare Interoperability Resources (FHIR), which provide a protocol ensuring consistent exchange of information across different systems. Overall, the multi-step clinical workflows physicians need to follow mirror the emerging paradigm of artificial intelligence (AI) agents: LLM-based systems that solve problems autonomously step by step, leveraging external tools or executing software programs 11. This concept holds great potential in healthcare, in which virtual AI copilots could collaborate with medical professionals on cases under varying levels of supervision.
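
The step-by-step pattern described above can be sketched as a simple loop. The following is a minimal illustration only, not the implementation of any system mentioned in this article: the tool names, the `CaseState` class and the `choose_next_action` stub (a stand-in for an LLM policy) are hypothetical, and the tool results are placeholders rather than clinical data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class CaseState:
    """Everything the agent has learned and done so far (hypothetical)."""
    findings: Dict[str, str] = field(default_factory=dict)
    actions: List[str] = field(default_factory=list)


# Hypothetical tools. Each returns a placeholder string, not real data.
TOOLS: Dict[str, Callable[[], str]] = {
    "take_history": lambda: "<history placeholder>",
    "order_lab": lambda: "<lab result placeholder>",
    "order_imaging": lambda: "<imaging report placeholder>",
}


def choose_next_action(state: CaseState) -> Optional[str]:
    """Stand-in for an LLM policy: pick the first unused tool, else stop."""
    for name in TOOLS:
        if name not in state.findings:
            return name
    return None  # nothing left to gather; the agent would now decide


def run_agent(max_steps: int = 10) -> CaseState:
    """Gather information step by step until the policy chooses to stop."""
    state = CaseState()
    for _ in range(max_steps):
        action = choose_next_action(state)
        if action is None:
            break
        state.findings[action] = TOOLS[action]()  # execute the chosen tool
        state.actions.append(action)
    return state


state = run_agent()
print(state.actions)
```

In a real agent the stub policy would be replaced by a model call that reads the accumulated findings, and the `max_steps` cap is one simple way to bound resource use.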

Several recent studies have explored the use of AI agents in healthcare, from task-level agents operating within FHIR-compatible environments 12 to benchmarks simulating clinical decision-making 13, 14. These include AMIE, a conversational diagnostic system optimized for patient–physician dialogue 1, and MAI-DxO 15, a multi-agent diagnostician that improved diagnostic accuracy and cost efficiency on complex case vignettes. In live clinical practice, OpenAI and Penda Health have built a non-autonomous safety-net assistant, embedded in primary-care workflows, that provides suggestions to physicians 16. Although these efforts have increased the realism of AI evaluations in healthcare, they do not evaluate the clinical action capabilities of AI. In this regard, another study evaluated various LLMs on full clinical workflows using real-world data from the MIMIC-IV (Medical Information Mart for Intensive Care) dataset 17, but its design did not integrate established medical coding systems such as FHIR or encompass essential components of realistic clinical workflows, such as direct patient communication or the management of pre-admission medication. It concluded that current models still lack the reliability necessary to autonomously manage complex medical cases 18. Thus, despite notable progress in developing increasingly autonomous and generalist LLMs, two critical challenges in healthcare remain unresolved. First, there is a gap regarding the integration of AI agents into existing workflows. Second, the performance and safety of such agents have not yet been evaluated in full patient care workflows spanning communication, diagnosis, treatment decisions and admission.

To address these gaps, we present MIRA, an autonomous AI agent that operates within a controlled, sandboxed virtual EHR. We evaluate it by running full emergency department care workflows on more than 500 cases from MIMIC-IV, in which the agent executes diagnostic and therapeutic decisions across surgery, internal medicine and oncology. MIRA interacts via chat with a patient agent whose responses strictly mirror the documented history of present illness (HPI) extracted from clinical notes, and uses 11 tools with more than 85,000 options to order and interpret laboratory, microbiology and imaging studies, generate diagnostic hypotheses, and execute treatment plans, including scheduling procedures, prescribing medications, and arranging admissions. It navigates a large fraction of the option space available to physicians while complying with FHIR and six coding systems (International Classification of Diseases (ICD), Logical Observation Identifiers Names and Codes (LOINC), Anatomical Therapeutic Chemical (ATC), National Drug Code (NDC), RxNorm and SNOMED-CT). The whole workflow of MIRA is summarized in Fig. 1 and explained in more…
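
To make the role of FHIR and the coding systems more concrete, the sketch below builds a FHIR-style `ServiceRequest` for a single laboratory order. It is an illustrative assumption about what such an order could look like, not MIRA's actual tool interface: the helper `build_lab_order` and the patient identifier are hypothetical, and only three of the six coding systems are shown.

```python
import json

# FHIR system URIs for three of the coding systems named in the text.
CODE_SYSTEMS = {
    "LOINC": "http://loinc.org",
    "SNOMED-CT": "http://snomed.info/sct",
    "RxNorm": "http://www.nlm.nih.gov/research/umls/rxnorm",
}


def build_lab_order(patient_id: str, system: str, code: str, display: str) -> dict:
    """Return a FHIR-shaped ServiceRequest as a dict (hypothetical helper)."""
    if system not in CODE_SYSTEMS:
        raise ValueError(f"unknown code system: {system}")
    return {
        "resourceType": "ServiceRequest",
        "status": "active",   # the request is live
        "intent": "order",    # an actual order, not a proposal
        "subject": {"reference": f"Patient/{patient_id}"},
        "code": {
            "coding": [
                {
                    "system": CODE_SYSTEMS[system],
                    "code": code,
                    "display": display,
                }
            ]
        },
    }


# Example: a haemoglobin test identified by its LOINC code.
# "example-1" is a made-up patient identifier.
order = build_lab_order(
    "example-1", "LOINC", "718-7", "Hemoglobin [Mass/volume] in Blood"
)
print(json.dumps(order, indent=2))
```

Rejecting unknown code systems up front is one simple way an agent's tool layer could keep free-text model output from reaching the record as an uncoded order.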

Source document: Nature News


Bias read (Center): The article provides a technical overview of the development and application of large language models in healthcare. It does not take a stance on any political issue, nor does it exhibit biased language or selective sourcing. The content remains focused on technological advancements and their impact…
