LLM-Driven System Replicates Expert Feature Extraction Across Clinical Cohorts

A research team introduced SNOW, a multi‑agent large language model (LLM) workflow that autonomously generates patient‑level clinical features from unstructured electronic health record (EHR) notes. The system was evaluated on a 147‑patient prostate‑cancer cohort for five‑year recurrence prediction and on a separate cohort of 2,084 patients with heart failure with preserved ejection fraction (HFpEF) for mortality prediction. It matched manual expert extraction on the prostate‑cancer task and outperformed baseline methods on the HFpEF task, while cutting expert labor on the prostate‑cancer cohort roughly 48‑fold.

Study Design and Manual Clinician Feature Generation

Domain experts applied a rigorous Clinician Feature Generation (CFG) protocol, manually reviewing prostate‑cancer notes to define nuanced features for each patient. This labor‑intensive process produced a high‑fidelity ground‑truth feature table that served as the benchmark for subsequent automation.

SNOW Architecture and Automation Process

SNOW employs a transparent, modular set of LLM agents that replicate the iterative reasoning and validation steps performed by clinicians. The agents ingest raw notes, propose candidate features, and iteratively refine them under limited expert oversight, thereby creating a complete patient‑level feature set without manual abstraction.
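The propose, validate, and extract flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not SNOW's implementation: the source does not describe the agents' interfaces, so every name here (`FeatureSpec`, `propose_features`, `validate_feature`, `extract_value`, `build_feature_table`, the `min_coverage` threshold) is hypothetical, and simple keyword matching stands in for the LLM calls. The sketch also runs a single pass, whereas the article describes iterative refinement.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FeatureSpec:
    """A candidate feature: a short name plus a human-readable definition."""
    name: str
    definition: str
    approved: bool = False


def propose_features(notes: list[str]) -> list[FeatureSpec]:
    """Hypothetical 'proposer' agent. A keyword stub replaces the LLM call."""
    vocabulary = {
        "gleason": "Gleason score is mentioned in the note",
        "psa": "PSA value is mentioned in the note",
    }
    text = " ".join(notes).lower()
    return [FeatureSpec(k, v) for k, v in vocabulary.items() if k in text]


def validate_feature(spec: FeatureSpec, notes: list[str],
                     min_coverage: float = 0.5) -> bool:
    """Hypothetical 'validator' agent: keep features found in enough notes."""
    covered = sum(spec.name in note.lower() for note in notes)
    return covered / len(notes) >= min_coverage


def extract_value(spec: FeatureSpec, note: str) -> int:
    """Hypothetical 'extractor' agent: 1 if the feature appears, else 0."""
    return int(spec.name in note.lower())


def build_feature_table(notes_by_patient: dict[str, str]) -> dict[str, dict[str, int]]:
    """Run one propose -> validate -> extract pass over all patients."""
    notes = list(notes_by_patient.values())
    candidates = propose_features(notes)
    for spec in candidates:
        spec.approved = validate_feature(spec, notes)
    kept = [spec for spec in candidates if spec.approved]
    return {
        patient_id: {spec.name: extract_value(spec, note) for spec in kept}
        for patient_id, note in notes_by_patient.items()
    }
```

In a real system, the point where `approved` is set is where limited expert oversight would most naturally fit: a clinician reviews a short list of feature definitions rather than every note.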

Performance Evaluation on the Prostate‑Cancer Cohort

When predicting five‑year cancer recurrence, SNOW attained an AUC‑ROC of 0.767, closely matching the manual CFG baseline of 0.762. The system outperformed structured data baselines, a clinician‑guided LLM extraction approach, and six alternative representational feature‑generation methods.
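For readers unfamiliar with the metric, AUC‑ROC is the probability that a randomly chosen positive case (here, a patient with recurrence) receives a higher predicted score than a randomly chosen negative case, with ties counted as half. The minimal sketch below computes it directly from that definition. The function name `auc_roc` and the toy data are illustrative assumptions, not the study's evaluation code.

```python
def auc_roc(labels, scores):
    """AUC-ROC by pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 outcomes; scores: predicted risk per patient.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

On this scale, 0.5 corresponds to random ranking and 1.0 to perfect ranking. The reported 0.767 and 0.762 differ by 0.005, which is small for a 147‑patient cohort.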

Efficiency Gains and Resource Reduction

After initial configuration, SNOW generated the full feature table for the prostate‑cancer cohort in 12 hours with only five hours of clinician oversight, representing an approximate 48‑fold reduction in expert labor compared with the manual CFG process.
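The summary does not state how many hours the manual CFG process took, but the two reported figures imply a rough value. The back‑of‑envelope sketch below assumes the 48‑fold figure compares clinician hours only, so the 12 hours of machine run time are excluded. The function name is illustrative and the result is an inference, not a number reported in the source.

```python
def implied_manual_hours(oversight_hours: float, fold_reduction: float) -> float:
    """Manual expert hours implied by a reported fold reduction.

    Assumes fold_reduction = manual_hours / oversight_hours.
    """
    return oversight_hours * fold_reduction


# 5 clinician hours at a ~48-fold reduction implies roughly 240 manual hours.
print(implied_manual_hours(5, 48))  # 240
```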

External Validation on an HFpEF Cohort

Without task‑specific tuning, SNOW was applied to a heart failure with preserved ejection fraction (HFpEF) cohort drawn from MIMIC‑IV (n = 2,084). The features it produced yielded AUC‑ROC scores of 0.851 for 30‑day mortality and 0.763 for 1‑year mortality, surpassing both baseline and alternative feature‑generation techniques.

Implications for Clinical AI Deployment

These findings demonstrate that a modular LLM‑agent framework can scale expert‑level feature generation from unstructured clinical text, preserve interpretability, and maintain generalizability across diverse disease settings. The approach offers a pathway to integrate rich narrative data into predictive models while alleviating the bottleneck of manual abstraction.

This report is based on the abstract of a research paper published on arXiv, licensed under Academic Preprint / Open Access. The full text is available via arXiv.
