LSE-DINOv2: Local Scale Equivariant DINOv2


A DINOv2 Vision Transformer equipped with a Deep Equilibrium Model (DEM)-based local scale adapter that improves scale equivariance (local scale consistency).

πŸ”‘ Key Features

  • Learned Local Scaling: Learns to adaptively scale different image regions based on content
  • Deep Equilibrium Model: Uses fixed-point iteration to find optimal deformation parameters (a minimal sketch follows this list)
  • Multi-Layer Adaptation: Applies scaling at multiple transformer layers
  • ImageNet-1K Trained: Fine-tuned on ImageNet-1k for classification
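
To make the fixed-point idea concrete, below is a minimal, illustrative sketch of a DEQ-style solve: a learned update is applied to the deformation parameters until they stop changing. The update function, initialization, and tolerance here are hypothetical stand-ins, not the model's actual solver (which uses torchdeq when available):

import torch

def fixed_point_solve(update_fn, phi0, max_iters=5, tol=1e-4):
    """Iterate phi <- update_fn(phi) until it converges to a fixed point.

    update_fn is a hypothetical stand-in for the learned DEM update;
    max_iters=5 mirrors the "DEQ Iterations" setting listed below.
    """
    phi = phi0
    for _ in range(max_iters):
        phi_next = update_fn(phi)
        if torch.norm(phi_next - phi) < tol:  # stop once the update is a no-op
            break
        phi = phi_next
    return phi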

πŸ“¦ Installation

pip install torch torchvision timm transformers safetensors
pip install torchdeq  # optional but recommended: provides the DEQ solver; the model falls back to simple fixed-point iteration if it is not installed
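
For reference, optional dependencies like torchdeq are typically handled with an import guard; a minimal sketch of that pattern (illustrative, not the repository's actual code):

try:
    import torchdeq  # preferred: dedicated fixed-point solvers
    HAS_TORCHDEQ = True
except ImportError:
    HAS_TORCHDEQ = False  # fall back to plain fixed-point iteration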

πŸš€ Quick Start

Basic Inference

⚠️ Important: This model uses custom code. The example below automatically handles downloading the custom code files (modeling_lse_dinov2.py and configuration_lse_dinov2.py) to the cache using trust_remote_code=True and snapshot_download. Make sure you trust the source before running custom code.

import torch
import sys
from PIL import Image
from torchvision import transforms
from transformers import AutoConfig
from huggingface_hub import snapshot_download

# Helper function to load the model (handles custom code automatically)
def load_lse_dinov2(repo_name="ashiq24/lse-dinov2-base"):
    """Load LSE-DINOv2 model from HuggingFace Hub."""
    
    # Load config with trust_remote_code to download custom code
    config = AutoConfig.from_pretrained(repo_name, trust_remote_code=True)
    
    # Download model files to cache and get the directory
    cache_dir = snapshot_download(repo_id=repo_name, allow_patterns="*.py")
    
    # Add cache directory to Python path
    if cache_dir not in sys.path:
        sys.path.insert(0, cache_dir)
    
    # Import model class from downloaded files
    from modeling_lse_dinov2 import LSEDinoV2ForImageClassification
    return LSEDinoV2ForImageClassification.from_pretrained(repo_name)

# Load model
model = load_lse_dinov2("ashiq24/lse-dinov2-base")
model.eval()

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Prepare image with ImageNet preprocessing
transform = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Load and preprocess image
image = Image.open("your_image.jpg").convert("RGB")
pixel_values = transform(image).unsqueeze(0).to(device)

# Run inference
with torch.no_grad():
    outputs = model(pixel_values)
    predicted_class = outputs.logits.argmax(-1).item()

print(f"Predicted class: {predicted_class}")

Deformation Visualization

The model learns content-aware local scaling transformations (canonicalization). Here's a visualization showing original images and their deformed versions at different layers:

[Figure: deformation visualization]

Each row shows an input image and its learned deformations at the input, Layer 4, and Layer 8, along with the deformation magnitude heatmaps.

Extracting Learned Deformation Parameters

The model learns content-aware deformation parameters (phi), which can be extracted for visualization:

# Get phi parameters for visualization
with torch.no_grad():
    phi_x_list, phi_y_list = model.get_phi_parameters(pixel_values)

for i, (phi_x, phi_y) in enumerate(zip(phi_x_list, phi_y_list)):
    print(f"Layer {i}: phi_x shape {phi_x.shape}, phi_y shape {phi_y.shape}")

πŸ—οΈ Architecture

Base Model

| Component          | Value           |
|--------------------|-----------------|
| Architecture       | DINOv2 ViT-Base |
| Patch Size         | 14Γ—14           |
| Image Size         | 224Γ—224         |
| Embedding Dim      | 768             |
| Transformer Blocks | 12              |
| Attention Heads    | 12              |
| Register Tokens    | 4               |

DEM Adapter

| Component           | Value             |
|---------------------|-------------------|
| Phi Layers          | 3                 |
| Local Scaling Grids | [16Γ—16, 8Γ—8, 8Γ—8] |
| DEQ #Layers         | 2                 |
| CNN #Channels       | [3, 64, 128]      |
| DEQ Iterations      | 5                 |
| Applied at Layers   | [input, 4, 8]     |
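
The adapter adds relatively few parameters on top of the ViT-Base backbone; the Hub page reports roughly 86.8M parameters in total, which you can verify directly:

# Count the model's parameters
n_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {n_params / 1e6:.1f}M")  # ~86.8M for lse-dinov2-base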

πŸ“Š Additional Resources

For detailed information about the model architecture, training procedure, and experimental results, please refer to the paper cited below.

πŸ“ˆ Training Details

| Parameter   | Value       |
|-------------|-------------|
| Dataset     | ImageNet-1K |
| Optimizer   | AdamW       |
| Backbone LR | 5e-5        |
| DEM LR      | 1e-5        |
| Epochs      | 20          |
| Batch Size  | 200         |
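
For reference, a two-learning-rate AdamW setup matching the table might look like the sketch below. The attribute names model.backbone and model.dem_adapter are hypothetical placeholders for however the backbone and DEM parameters are actually grouped:

from torch.optim import AdamW

# Hypothetical parameter grouping: separate LRs for backbone vs. DEM adapter
optimizer = AdamW([
    {"params": model.backbone.parameters(), "lr": 5e-5},     # backbone LR
    {"params": model.dem_adapter.parameters(), "lr": 1e-5},  # DEM LR
])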

⚠️ Limitations

  • Computational Overhead: Additional inference time due to DEQ iterations (a simple timing sketch follows this list)
  • Memory Usage: Slightly higher than standard DINOv2
  • Dependencies: Requires timm and optionally torchdeq
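
A rough way to quantify the DEQ overhead on your own hardware is a simple latency measurement; the sketch below reuses model, device, and pixel_values from the Quick Start (on CUDA, synchronization is needed for accurate timing):

import time

with torch.no_grad():
    for _ in range(3):  # warm-up runs
        model(pixel_values)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        model(pixel_values)
    if device.type == "cuda":
        torch.cuda.synchronize()
    print(f"Mean latency: {(time.perf_counter() - start) / 10 * 1000:.1f} ms")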

πŸ“„ License

Apache 2.0

πŸ™ Acknowledgements

  • DINOv2 by Meta AI
  • timm by Ross Wightman
  • torchdeq for Deep Equilibrium Model implementation

πŸ“š Citation

@inproceedings{rahman2025local,
  title={Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer},
  author={Rahman, Md Ashiqur and Yang, Chiao-An and Cheng, Michael N and Hao, Lim Jun and Jiang, Jeremiah and Lim, Teck-Yian and Yeh, Raymond A},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={10527--10537},
  year={2025}
}