Radiology Report Labeling on MIMIC-CXR
Using CLIP to retrieve radiology reports for chest X-rays
Abstract
This research explores using CLIP (Contrastive Language-Image Pre-training) to label chest radiographs from the MIMIC-CXR dataset, which contains 377,110 chest X-ray images with associated radiologist notes. A subset of 18,015 image/text pairs was used, each embedded into a 512-dimensional vector using CLIP. Two experiments were conducted: a zero-shot approach and a fine-tuned approach. In both, the training text examples were embedded to create a search space, and each test image was matched to the most relevant text vector by nearest-neighbor search with cosine similarity. Results showed limited success due to the specificity of medical images and the small training dataset.
Introduction
CLIP is a multi-modal embedding model which can be used to identify similarities between text and images. This study investigates CLIP’s performance on specialized medical datasets. The MIMIC-CXR dataset from Beth Israel Deaconess Medical Center contains chest radiographs with free-text radiology reports. The problem is framed as finding the most similar report for a new patient’s X-ray to potentially help doctors diagnose patients more efficiently.
Dataset
A total of 18,015 image/text pairs from the MIMIC-CXR Database v2.0.0 were used; the full database contains 377,110 images from 227,835 radiographic studies. The focus was on the most diagnostically relevant section of each report: the impression section where available, the findings section otherwise, or the final section of the report as a last resort.
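As a rough illustration of this section-selection rule, the sketch below shows one way it could be implemented. The regex-based header parsing and the `select_report_section` helper are assumptions for illustration, not the exact preprocessing code used.

```python
import re

def select_report_section(report_text: str) -> str:
    """Return the most diagnostically relevant section of a free-text report:
    IMPRESSION if present, otherwise FINDINGS, otherwise the final section."""
    # Split on all-caps section headers such as "FINDINGS:" or "IMPRESSION:".
    parts = re.split(r"\n\s*([A-Z][A-Z ]+):", report_text)
    sections = {name.strip(): body.strip()
                for name, body in zip(parts[1::2], parts[2::2])}

    if "IMPRESSION" in sections:
        return sections["IMPRESSION"]
    if "FINDINGS" in sections:
        return sections["FINDINGS"]
    # Fall back to the last section, or the whole report if no headers were found.
    return list(sections.values())[-1] if sections else report_text.strip()
```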
Methodology
Unlike previous approaches that train CNN models on labeled classifications, CLIP was used directly on X-ray images and text report pairs. This preserves important details that might be lost when converting to simple labels (e.g., “severe pneumonia” vs. “mild pneumonia”).
Figure 1: An example of the primary use case of the model. (Left) The input image the model has never seen before. (Center) The model outputs a text description that is semantically similar to (Right) the actual report.
Two methods were employed: zero-shot CLIP and a fine-tuned version of CLIP, both using the ViT-B/32 architecture. The 18,015 image/text pairs from MIMIC were split 80/10/10 into train/validation/test sets, giving 14,412 image/text pairs for fine-tuning.
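A minimal sketch of the zero-shot retrieval step, using OpenAI's `clip` package: training reports are embedded into 512-dimensional text vectors to form the search space, and each test image is matched to the report with the highest cosine similarity. Reports longer than CLIP's 77-token context are simply truncated here, and the helper names are illustrative rather than the actual code used.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def build_text_index(train_reports):
    # Embed every training report into a 512-d vector and L2-normalize it.
    # (In practice the reports would be embedded in batches.)
    tokens = clip.tokenize(train_reports, truncate=True).to(device)
    text_emb = model.encode_text(tokens)
    return text_emb / text_emb.norm(dim=-1, keepdim=True)

@torch.no_grad()
def retrieve_report(image_path, text_index, train_reports):
    # Embed the query X-ray, then return the training report whose embedding
    # has the highest cosine similarity (nearest neighbor on normalized vectors).
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    best = (img_emb @ text_index.T).argmax(dim=-1).item()
    return train_reports[best]
```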
For fine-tuning the CLIP model, both the text and image training data from the MIMIC-CXR dataset were used. The early layers of the model were frozen so that training concentrated on the deeper layers. The model was then trained for 20 epochs with settings similar to those in the original CLIP paper, but with a lowered learning rate and weight decay. CLIP's contrastive loss was tracked on the validation set, and the checkpoint with the best validation performance was saved.
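The fine-tuning setup might look roughly like the sketch below. The choice of which blocks to unfreeze, the learning rate, and the weight decay are placeholders standing in for "freeze the early layers and lower the hyperparameters relative to the original CLIP paper", not the exact values used.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
model.float()  # train in fp32 to avoid fp16 gradient issues

# Freeze everything, then unfreeze only the last two transformer blocks of each
# tower plus the projection layers (this exact split is an assumption).
for p in model.parameters():
    p.requires_grad = False
for block in list(model.visual.transformer.resblocks)[-2:]:
    for p in block.parameters():
        p.requires_grad = True
for block in list(model.transformer.resblocks)[-2:]:
    for p in block.parameters():
        p.requires_grad = True
for p in [model.visual.proj, model.text_projection]:
    p.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-6, weight_decay=0.05,  # lowered relative to the original CLIP settings
)

def clip_loss(images, tokens):
    # Symmetric cross-entropy over the in-batch image/text similarity matrix,
    # i.e. CLIP's contrastive objective.
    logits_per_image, logits_per_text = model(images, tokens)
    labels = torch.arange(images.size(0), device=images.device)
    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2

# One epoch over a hypothetical DataLoader yielding (image_batch, token_batch):
# for images, tokens in train_loader:
#     optimizer.zero_grad()
#     loss = clip_loss(images.to(device), tokens.to(device))
#     loss.backward()
#     optimizer.step()
```

The same contrastive loss is the quantity tracked on the validation set for checkpoint selection.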
For evaluation, the output text was converted into multi-label predictions using the CheXpert labeler and compared to ground-truth labels across metrics including precision, recall, F1-score, accuracy, and AUROC.
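Once the CheXpert labeler has converted both the retrieved and ground-truth reports into per-condition binary labels, the per-condition metrics can be computed as in the sketch below. The array layout and the `per_condition_metrics` helper are assumptions; the CheXpert labeler itself runs as a separate tool upstream of this step.

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score,
                             accuracy_score, roc_auc_score)

def per_condition_metrics(y_true, y_pred, conditions):
    """y_true, y_pred: (n_samples, n_conditions) binary arrays, one column per
    CheXpert condition, with uncertain/blank labels already mapped to 0 or 1."""
    rows = {}
    for j, name in enumerate(conditions):
        t, p = y_true[:, j], y_pred[:, j]
        prec = precision_score(t, p, zero_division=0)
        rec = recall_score(t, p, zero_division=0)
        # F1 is undefined (reported as NaN in the tables below) when
        # precision + recall == 0.
        f1 = float("nan") if prec + rec == 0 else 2 * prec * rec / (prec + rec)
        rows[name] = {
            "precision": prec,
            "recall": rec,
            "f1": f1,
            "accuracy": accuracy_score(t, p),
            # AUROC needs both classes present in the ground truth.
            "auroc": roc_auc_score(t, p) if len(np.unique(t)) > 1 else float("nan"),
            "positives": int(t.sum()),
        }
    return rows
```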
Results
Zero-shot Model Results:
Condition | Precision | Recall | F1 | Accuracy | AUROC | Positives |
---|---|---|---|---|---|---|
No Finding | 0.396 | 0.058 | 0.102 | 0.608 | 0.502 | 686 |
Enlarged Cardiomediastinum | 0.000 | 0.000 | NaN | 0.906 | 0.495 | 154 |
Cardiomegaly | 0.301 | 0.222 | 0.256 | 0.674 | 0.524 | 455 |
Lung Lesion | 0.055 | 0.064 | 0.059 | 0.912 | 0.507 | 78 |
Lung Opacity | 0.236 | 0.584 | 0.337 | 0.462 | 0.504 | 421 |
Edema | 0.218 | 0.157 | 0.183 | 0.648 | 0.485 | 451 |
Consolidation | 0.000 | 0.000 | NaN | 0.903 | 0.499 | 171 |
Pneumonia | 0.242 | 0.527 | 0.331 | 0.456 | 0.479 | 461 |
Atelectasis | 0.216 | 0.377 | 0.274 | 0.545 | 0.486 | 411 |
Pneumothorax | 0.192 | 0.728 | 0.303 | 0.345 | 0.490 | 353 |
Pleural Effusion | 0.319 | 0.525 | 0.397 | 0.443 | 0.462 | 629 |
Pleural Other | 0.000 | 0.000 | NaN | 0.979 | 0.498 | 32 |
Fracture | 0.031 | 0.111 | 0.048 | 0.847 | 0.493 | 63 |
Support Devices | 0.253 | 0.561 | 0.349 | 0.512 | 0.529 | 421 |
Zero-shot Model Overall Results:
Metric | Value |
---|---|
Average F1 Score | 0.2399 |
Average Accuracy | 0.6600 |
Average AUROC | 0.497 |
(Note that the F1 score above is calculated from 11 valid categories, excluding the 3 categories with NaN values.)
Fine-tuned Model Results:
Condition | Precision | Recall | F1 | Accuracy | AUROC | Positives |
---|---|---|---|---|---|---|
No Finding | 0.337 | 0.214 | 0.262 | 0.541 | 0.478 | 686 |
Enlarged Cardiomediastinum | 0.031 | 0.019 | 0.024 | 0.864 | 0.481 | 154 |
Cardiomegaly | 0.234 | 0.081 | 0.121 | 0.701 | 0.496 | 455 |
Lung Lesion | 0.050 | 0.077 | 0.061 | 0.897 | 0.505 | 78 |
Lung Opacity | 0.240 | 0.380 | 0.294 | 0.573 | 0.506 | 421 |
Edema | 0.251 | 0.166 | 0.200 | 0.667 | 0.500 | 451 |
Consolidation | 0.114 | 0.135 | 0.123 | 0.819 | 0.512 | 171 |
Pneumonia | 0.236 | 0.293 | 0.262 | 0.577 | 0.484 | 461 |
Atelectasis | 0.238 | 0.151 | 0.185 | 0.696 | 0.504 | 411 |
Pneumothorax | 0.186 | 0.232 | 0.207 | 0.650 | 0.492 | 353 |
Pleural Effusion | 0.342 | 0.243 | 0.284 | 0.573 | 0.496 | 629 |
Pleural Other | 0.083 | 0.031 | 0.045 | 0.977 | 0.513 | 32 |
Fracture | 0.020 | 0.063 | 0.031 | 0.860 | 0.476 | 63 |
Support Devices | 0.229 | 0.197 | 0.212 | 0.657 | 0.497 | 421 |
Fine-tuned Model Overall Results:
Metric | Value |
---|---|
Average F1 Score | 0.1651 |
Average Accuracy | 0.7180 |
Average AUROC | 0.4960 |
Both zero-shot and fine-tuned models performed similarly, with the fine-tuned model slightly better on AUROC for most of the five clinically important pathologies (atelectasis, cardiomegaly, consolidation, edema, and pleural effusion). However, overall performance was lower than CNN models directly trained on labels in previous research, which achieved AUROC scores >0.7 or even >0.8, while this approach achieved scores mostly around 0.5.
Conclusion
This is a difficult problem. To the untrained eye the radiographs do not show any clear differences, and CLIP pre-training does not focus on enabling the network to distinguish such high-frequency patterns within the overall image. While fine-tuning improved the model slightly, the results remained far from clinically useful standards. We are also limited by training our model to output full text descriptions, which are then categorized into labels using CheXpert; models trained directly on CheXpert labels would naturally perform better when evaluated on those same labels. Finally, we are limited by the amount of training data and compute available. Future research directions include:
- Fine-tuning CLIP on a larger portion of the dataset
- Experimenting with k-nearest neighbors search for the labels
- Better utilizing multiple associated images from the same study
- Incorporating CheXpert labeling while preserving detailed text
We believe further development could improve pathology classification beyond simple labeling.