LSE-DINOv2: Local Scale Equivariant DINOv2
A DINOv2 Vision Transformer equipped with Deep Equilibrium Model (DEM) based local scale adaptation for improved scale equivariance (local scale consistency).
Key Features
- Learned Local Scaling: Learns to adaptively scale different image regions based on content
- Deep Equilibrium Model: Uses fixed-point iteration to find optimal deformation parameters
- Multi-Layer Adaptation: Applies scaling at multiple transformer layers
- ImageNet-1K Trained: Fine-tuned on ImageNet-1k for classification
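To illustrate the fixed-point idea behind the Deep Equilibrium Model, here is a minimal sketch of plain fixed-point iteration on a toy scalar map. It is not the model's solver: the function `fixed_point` and the toy map `math.cos` are assumptions chosen only for illustration, and the default `max_iter=5` simply mirrors the "DEQ Iterations" value listed in the architecture table.

```python
import math


def fixed_point(f, z0, max_iter=5, tol=1e-6):
    """Iterate z <- f(z) until the update is below tol or max_iter is reached."""
    z = z0
    for step in range(1, max_iter + 1):
        z_next = f(z)
        if abs(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iter


# Toy contraction (illustrative only): z = cos(z) has a unique fixed point.
z_star, steps = fixed_point(math.cos, z0=0.0, max_iter=100)
print(f"z* = {z_star:.4f} after {steps} iterations")
```

In a DEQ layer the map `f` would be a small network conditioned on the input features, and the equilibrium `z*` plays the role of the deformation parameters.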
Installation
```shell
pip install torch torchvision timm transformers safetensors
# Optional but recommended: provides the DEQ solver. The model falls back to simple iteration if torchdeq is not installed.
pip install torchdeq
```
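A quick way to see which solver path is likely to be used is to check whether `torchdeq` is importable. This is a small sketch using only the standard library; the helper name `has_module` is an assumption for illustration, and the model's own fallback logic may be implemented differently.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(name) is not None


solver = "torchdeq" if has_module("torchdeq") else "simple iteration"
print(f"DEQ solver backend: {solver}")
```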
Quick Start
Basic Inference
Important: This model uses custom code. The example below automatically downloads the custom code files (`modeling_lse_dinov2.py` and `configuration_lse_dinov2.py`) to the cache using `trust_remote_code=True` and `snapshot_download`. Make sure you trust the source before running custom code.
```python
import sys

import torch
from huggingface_hub import snapshot_download
from PIL import Image
from torchvision import transforms
from transformers import AutoConfig


# Helper function to load the model (handles custom code automatically)
def load_lse_dinov2(repo_name="ashiq24/lse-dinov2-base"):
    """Load LSE-DINOv2 model from HuggingFace Hub."""
    # Load config with trust_remote_code to download custom code
    config = AutoConfig.from_pretrained(repo_name, trust_remote_code=True)

    # Download model files to cache and get the directory
    cache_dir = snapshot_download(repo_id=repo_name, allow_patterns="*.py")

    # Add cache directory to Python path
    if cache_dir not in sys.path:
        sys.path.insert(0, cache_dir)

    # Import model class from downloaded files
    from modeling_lse_dinov2 import LSEDinoV2ForImageClassification

    return LSEDinoV2ForImageClassification.from_pretrained(repo_name)


# Load model
model = load_lse_dinov2("ashiq24/lse-dinov2-base")
model.eval()

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Prepare image with ImageNet preprocessing
transform = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load and preprocess image
image = Image.open("your_image.jpg").convert("RGB")
pixel_values = transform(image).unsqueeze(0).to(device)

# Run inference
with torch.no_grad():
    outputs = model(pixel_values)

predicted_class = outputs.logits.argmax(-1).item()
print(f"Predicted class: {predicted_class}")
```
Deformation Visualization
The model learns content-aware local scaling transformations (canonicalization). Here's a visualization showing original images and their deformed versions at different layers:
Each row shows an input image and its learned deformations at the input, Layer 4, and Layer 8, along with the deformation magnitude heatmaps.
Extracting Learned Deformation Parameters
The model learns content-aware deformation parameters (phi), which can be extracted per adapted layer:
```python
# Get phi parameters for visualization
with torch.no_grad():
    phi_x_list, phi_y_list = model.get_phi_parameters(pixel_values)

for i, (phi_x, phi_y) in enumerate(zip(phi_x_list, phi_y_list)):
    print(f"Layer {i}: phi_x shape {phi_x.shape}, phi_y shape {phi_y.shape}")
```
Architecture
Base Model
| Component | Value |
|---|---|
| Architecture | DINOv2 ViT-Base |
| Patch Size | 14×14 |
| Image Size | 224×224 |
| Embedding Dim | 768 |
| Transformer Blocks | 12 |
| Attention Heads | 12 |
| Register Tokens | 4 |
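The table values determine the transformer's sequence length. The arithmetic below is a small worked example. The single `[CLS]` token is an assumption based on the standard DINOv2 design and is not stated in the table.

```python
image_size = 224
patch_size = 14
num_register_tokens = 4
num_cls_tokens = 1  # assumption: one [CLS] token, as in standard DINOv2

patches_per_side = image_size // patch_size   # 224 / 14
num_patches = patches_per_side ** 2           # square patch grid
sequence_length = num_cls_tokens + num_register_tokens + num_patches

print(patches_per_side, num_patches, sequence_length)  # 16 256 261
```

Under that assumption, each 224×224 input yields a 16×16 patch grid and 261 tokens per image.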
DEM Adapter
| Component | Value |
|---|---|
| Phi Layers | 3 |
| Local Scaling Grids | [16x16, 8x8, 8x8] |
| DEQ #Layers | 2 |
| CNN #Channels | [3, 64, 128] |
| DEQ Iterations | 5 |
| Applied at Layers | [input, 4, 8] |
Additional Resources
For detailed information about the model architecture, training procedures, and experimental results, please refer to:
- GitHub Repository: local-scale-equivariance - Full codebase, training scripts, and documentation
- Project Website: Local Scale Equivariance - Visualizations and additional resources
- Paper: Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer
Training Details
| Parameter | Value |
|---|---|
| Dataset | ImageNet-1K |
| Optimizer | AdamW |
| Backbone LR | 5e-5 |
| DEM LR | 1e-5 |
| Epochs | 20 |
| Batch Size | 200 |
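The two learning rates in the table suggest separate optimizer parameter groups for the backbone and the DEM adapter. The sketch below shows one way to build such groups from `(name, parameter)` pairs. The helper `build_param_groups` and the name prefix `"dem_adapter"` are hypothetical and were not taken from the released code, so inspect the real parameter names before using this.

```python
def build_param_groups(named_params, dem_prefix="dem_adapter",
                       backbone_lr=5e-5, dem_lr=1e-5):
    """Split (name, param) pairs into backbone and DEM groups with separate LRs."""
    backbone, dem = [], []
    for name, param in named_params:
        # Hypothetical naming rule: adapter parameters share a common prefix.
        (dem if name.startswith(dem_prefix) else backbone).append(param)
    return [
        {"params": backbone, "lr": backbone_lr},
        {"params": dem, "lr": dem_lr},
    ]


# Stand-in names and values; a real model would supply model.named_parameters().
demo = [("encoder.layer.0.weight", "w0"), ("dem_adapter.conv.weight", "w1")]
groups = build_param_groups(demo)
print([(len(g["params"]), g["lr"]) for g in groups])
```

The returned list of dictionaries follows the per-parameter-group format that PyTorch optimizers accept, so it could be passed as `torch.optim.AdamW(build_param_groups(model.named_parameters()))`.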
Limitations
- Computational Overhead: Additional inference time due to DEQ iterations
- Memory Usage: Slightly higher than standard DINOv2
- Dependencies: Requires `timm` and optionally `torchdeq`
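To quantify the inference overhead on your own hardware, a simple timing helper can be used. This is a generic sketch using only the standard library. The helper `median_latency` is an assumption for illustration, and the workload shown is a stand-in rather than the model.

```python
import statistics
import time


def median_latency(fn, warmup=2, repeats=10):
    """Return the median wall-clock time in seconds of calling fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; replace with a closure that runs model(pixel_values).
latency = median_latency(lambda: sum(range(10_000)))
print(f"median latency: {latency * 1e3:.3f} ms")
```

When timing on a GPU, call `torch.cuda.synchronize()` inside the closure so that asynchronous kernels are included in the measurement.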
License
Apache 2.0
Acknowledgements
Citation
```bibtex
@inproceedings{rahman2025local,
  title={Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer},
  author={Rahman, Md Ashiqur and Yang, Chiao-An and Cheng, Michael N and Hao, Lim Jun and Jiang, Jeremiah and Lim, Teck-Yian and Yeh, Raymond A},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={10527--10537},
  year={2025}
}
```
