ROCO-Radiology-CLIP (ViT-B/32)

A specialized vision-language model for radiology, fine-tuned on the ROCO dataset.

This model aligns medical images (X-rays, CTs, MRIs) with their textual descriptions, enabling zero-shot classification and semantic search for radiology concepts (both demonstrated in the Usage section below).

Performance (Test Set)

  • Batch-wise Recall@1: 70.83% (state-of-the-art among models fine-tuned on a single T4 GPU)
  • Batch-wise Recall@5: 96.99%
  • Global Retrieval Recall@1: ~6% (500x better than random chance)
  • Global Retrieval Recall@5: ~16%. Global retrieval recall is still quite low and needs substantial work; these figures will be updated in a newer version (see the metric sketch below).
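
A rough sketch of how these two metrics differ (an illustration of the standard Recall@K recipe, not this project's evaluation code): batch-wise Recall@K checks whether the matching caption ranks in the top K within a single small batch of candidates, while global retrieval ranks every caption in the test set, a far larger candidate pool, which is why the global numbers are much lower. The helper below assumes pre-computed, L2-normalized image and text embeddings where row i of each tensor belongs to the same image-caption pair.

import torch

def recall_at_k(image_embs: torch.Tensor, text_embs: torch.Tensor, k: int) -> float:
    # Cosine similarity of every image against every caption: (N, N)
    sims = image_embs @ text_embs.T
    # Indices of the k most similar captions for each image: (N, k)
    topk = sims.topk(k, dim=1).indices
    # The correct caption for image i sits at column i
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

The batch-wise numbers average this over small batches (a few dozen candidates each); the global figures come from a single call over the entire test set.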

Usage

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load the fine-tuned weights and the matching base processor
model = CLIPModel.from_pretrained("spicy03/CLIP-ROCO-v1")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Zero-shot classification: score one image against candidate labels
image = Image.open("chest_xray.jpg")
labels = ["Pneumonia", "Normal", "Edema"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_labels);
# softmax turns the similarity scores into per-label probabilities
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0]):
    print(f"{label}: {p.item():.3f}")
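
The same checkpoint supports semantic search by using the image and text encoders separately: embed a collection of scans once, then rank them against free-text queries. The sketch below continues with the model and processor loaded above and follows the standard CLIP retrieval recipe; the file names and query string are placeholders, not assets shipped with this model.

# Semantic search: rank a (hypothetical) image collection against a text query
paths = ["scan_001.jpg", "scan_002.jpg", "scan_003.jpg"]
images = [Image.open(p) for p in paths]

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_embs = model.get_image_features(**image_inputs)
    text_inputs = processor(text=["axial CT showing a pulmonary nodule"],
                            return_tensors="pt", padding=True)
    query_emb = model.get_text_features(**text_inputs)

# Normalize so the dot product equals cosine similarity
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)

scores = (query_emb @ image_embs.T).squeeze(0)
for idx in scores.argsort(descending=True).tolist():
    print(paths[idx], f"{scores[idx].item():.3f}")

Image embeddings can be cached (for example in a vector index), so each new query costs only one text-encoder forward pass. As with other CLIP-style models, zero-shot results often improve with prompt templates such as "a radiograph showing {label}", though this has not been verified for this checkpoint.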

Model details

  • Format: Safetensors
  • Model size: 0.2B params
  • Tensor type: F32

Model tree for spicy03/CLIP-ROCO-v1

  • Base model: openai/clip-vit-base-patch32

Dataset used to train spicy03/CLIP-ROCO-v1

  • ROCO