Task-specific NLP more accurate than general-purpose LLMs for identifying ILNs

Kate Madden Yee, Senior Editor, AuntMinnie.com

Task-specific natural language processing (NLP) models are more accurate and reliable than general-purpose large language models (LLMs) for extracting incidental lung nodule (ILN) data from unstructured chest CT radiology reports, researchers have found.

"Artificial intelligence-based tools that automatically identify and flag actionable findings can standardize and optimize patient tracking and promote more equitable adherence," wrote a group led by João Martins da Fonseca, MD, of the University of Florida in Gainesville. The team's results were published February 5 in Radiology: Cardiothoracic Imaging.

ILNs are often found on chest CT, the group explained, noting that those measuring 6 mm or more require CT follow-up and those measuring 8 mm or more may require further diagnostic workup, such as PET/CT imaging or biopsy. But follow-up for ILNs "remains inadequate, with adherence below 50% in some cohorts -- underscoring the need for alternatives to improve care," the group wrote.

LLMs offer general-purpose natural language processing (NLP) capabilities, but task-specific NLP tools may provide greater reliability for extracting clinically useful data from radiology reports. Da Fonseca's group compared the performance of a task-specific NLP model with that of seven LLMs for extracting ILN-related information from unstructured radiology reports.

The NLP tool (dubbed Focused incidental Nodule detection, or FiNd) was developed with data from 21,542 radiology reports. The investigators used it to analyze 1,016 chest CT radiology reports and compared its performance with that of the following seven LLMs:

  1. Gemma (Google)
  2. Haiku (Anthropic)
  3. Sonnet 2 (Anthropic)
  4. GPT-4o (OpenAI)
  5. DeepSeek (DeepSeek)
  6. Phi-4 (Microsoft)
  7. MedGemma (Google)

The researchers prompted the models to identify ILNs and to assign each nodule to one of three size categories (< 6 mm or unspecified, 6 mm to 7.9 mm, or ≥ 8 mm). They then assessed the models' accuracy, sensitivity, and specificity on three classification tasks: ILN detection, identification of nodules ≥ 6 mm, and identification of nodules ≥ 8 mm.
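The evaluation described above can be illustrated with a minimal Python sketch. It is not the study's code: the function names (`size_category`, `task_labels`, `binary_metrics`, `evaluate`) are hypothetical, and the handful of reference and predicted labels are invented toy values, not data from the paper. It assumes each report has already been reduced to a pair of "ILN present?" and "largest reported size in mm, if any."

```python
from typing import Optional

def size_category(size_mm: Optional[float]) -> str:
    """Map a reported nodule size to one of the three categories named in the article."""
    if size_mm is None or size_mm < 6.0:
        return "<6 mm or unspecified"
    if size_mm < 8.0:
        return "6-7.9 mm"
    return ">=8 mm"

def task_labels(has_iln: bool, size_mm: Optional[float]) -> dict:
    """Derive the three binary classification labels for one report."""
    sized = has_iln and size_mm is not None
    return {
        "ILN detection": has_iln,
        "nodule >= 6 mm": sized and size_mm >= 6.0,
        "nodule >= 8 mm": sized and size_mm >= 8.0,
    }

def binary_metrics(truth, pred):
    """Return (accuracy, sensitivity, specificity) for paired boolean labels."""
    tp = sum(t and p for t, p in zip(truth, pred))
    tn = sum((not t) and (not p) for t, p in zip(truth, pred))
    fp = sum((not t) and p for t, p in zip(truth, pred))
    fn = sum(t and (not p) for t, p in zip(truth, pred))
    accuracy = (tp + tn) / len(truth)
    # Sensitivity and specificity are undefined when a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return accuracy, sensitivity, specificity

def evaluate(reference, predicted):
    """Score all three tasks; inputs are lists of (has_iln, size_mm) pairs."""
    results = {}
    for task in ("ILN detection", "nodule >= 6 mm", "nodule >= 8 mm"):
        truth = [task_labels(*r)[task] for r in reference]
        pred = [task_labels(*p)[task] for p in predicted]
        results[task] = binary_metrics(truth, pred)
    return results

# Toy labels for six reports (invented for illustration only).
REFERENCE = [(True, 9.0), (True, 6.5), (True, None), (False, None), (False, None), (True, 4.0)]
PREDICTED = [(True, 9.0), (True, 8.0), (True, None), (True, 5.0), (False, None), (False, None)]

if __name__ == "__main__":
    for task, (acc, sens, spec) in evaluate(REFERENCE, PREDICTED).items():
        print(f"{task}: accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Scoring each threshold as its own binary task, rather than as one three-way classification, matches how the article reports results: a model can detect nodules reliably yet still misjudge which ones cross the 6 mm or 8 mm follow-up thresholds.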

The team reported the following:

Overall performance of task-specific NLP model compared to general-purpose LLMs for identifying lung nodules
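| Measure | Gemma | Haiku | Sonnet 2 | DeepSeek | GPT-4o | Phi-4 | MedGemma |
|---|---|---|---|---|---|---|---|
| Accuracy | 77.7% | 86.6% | 86.6% | 83.4% | 87.9% | 83.7% | 88.6% |
| Sensitivity | 98.1% | 98.9% | 98.5% | 98.1% | 94.8% | 96.7% | 95.8% |
| Specificity | 70.3% | 39.7% | 82.3% | 78% | 85.5% | 79% | 86% |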

When it came to identifying nodules of specific sizes, FiNd showed high accuracy (96.8%) and balanced sensitivity and specificity for nodules measuring 6 mm or more, and the highest accuracy (97.4%) for nodules measuring 8 mm or more.

"The NLP model, tailored to ILN terminology and structured parsing rules, outperformed general LLMs in key diagnostic tasks," the team wrote, and urged future research to "explore adapting LLMs using radiology-specific training and real-time clinical integration."

