# ROCO-Radiology-CLIP (ViT-B/32)

A specialized vision-language model for radiology, fine-tuned on the ROCO (Radiology Objects in COntext) dataset.
This model aligns medical images (X-rays, CTs, MRIs) with their textual descriptions, enabling zero-shot classification and semantic search for radiology concepts.
## Performance (Test Set)
- Batch-wise Recall@1: 70.83% (state-of-the-art among models fine-tuned on a single T4 GPU)
- Batch-wise Recall@5: 96.99%
- Global Retrieval Recall@1: ~6% (roughly 500× better than random chance)
- Global Retrieval Recall@5: ~16%

Global retrieval recall remains low and is the main target for improvement; these numbers will be updated in a newer version. The sketch below shows how the batch-wise and global metrics differ.
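"Batch-wise" recall ranks each image's caption only against the other captions in the same mini-batch, while "global" recall ranks it against every caption in the test set, which is why the two sets of numbers differ so sharply. The following is a minimal sketch of Recall@K under the usual convention that row *i* of the image embeddings pairs with row *i* of the text embeddings (the function name and tensor layout are illustrative, not part of this repository):

```python
import torch

def recall_at_k(image_embeds: torch.Tensor, text_embeds: torch.Tensor, k: int) -> float:
    """Fraction of images whose paired caption (same row index) appears in the
    top-k captions ranked by cosine similarity. The size of the candidate pool
    sets the difficulty: a small mini-batch gives batch-wise recall, the full
    test set gives global recall."""
    # Normalize so the dot product equals cosine similarity
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    sims = image_embeds @ text_embeds.T                 # (N, N) image-to-text similarities
    topk = sims.topk(k, dim=-1).indices                 # top-k caption indices per image
    targets = torch.arange(sims.size(0)).unsqueeze(1)   # ground truth: the matching row
    return (topk == targets).any(dim=-1).float().mean().item()
```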
## Usage
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load the fine-tuned weights; the processor is unchanged from the base CLIP checkpoint
model = CLIPModel.from_pretrained("spicy03/CLIP-ROCO-v1")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Zero-shot classification: score one image against candidate text labels
image = Image.open("chest_xray.jpg")
labels = ["Pneumonia", "Normal", "Edema"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the image-text similarity logits yields per-label probabilities
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)
```
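The same embeddings also support semantic search over an image collection: embed the images once, embed a text query, and rank by cosine similarity. A minimal sketch (the file paths and query string are placeholders, not files shipped with this model):

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("spicy03/CLIP-ROCO-v1")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder paths; point these at your own radiology images
paths = ["scan_001.jpg", "scan_002.jpg", "scan_003.jpg"]
images = [Image.open(p) for p in paths]

with torch.no_grad():
    image_embeds = model.get_image_features(**processor(images=images, return_tensors="pt"))
    text_embeds = model.get_text_features(
        **processor(text=["pleural effusion on chest X-ray"], return_tensors="pt", padding=True)
    )

# Rank images by cosine similarity to the query
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (text_embeds @ image_embeds.T).squeeze(0)
for i in scores.argsort(descending=True).tolist():
    print(f"{paths[i]}: {scores[i].item():.3f}")
```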
## Model Tree

Base model: openai/clip-vit-base-patch32