DBNet++ RepViT (Chinese)

A lightweight text detection model that pairs DBNet++ with a RepViT backbone for efficient inference. The weights are pretrained on Chinese text detection datasets.

Model Details

Component	Configuration
Architecture	DBNet++ (Differentiable Binarization)
Backbone	RepViT (lightweight ViT-inspired CNN)
Neck	RSEFPN (in: [48, 96, 192, 384], out: 96)
Head	DBNetPPHead (inner: 24, k: 50)
Parameters	~3M
Input Size	640x640 (flexible)
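
The `k: 50` entry in the head corresponds to the amplifying factor of the differentiable binarization step described in the DB papers, which approximates hard thresholding with B = 1 / (1 + exp(-k (P - T))), where P is the probability map and T is the threshold map. The following is a minimal sketch of that formula on scalar values in plain Python; the function name is illustrative and is not part of OCR-Factory or OpenOCR.

```python
import math


def differentiable_binarization(p: float, t: float, k: float = 50.0) -> float:
    """Approximate binarization B = 1 / (1 + exp(-k * (p - t))).

    p: probability value in [0, 1]
    t: threshold value in [0, 1]
    k: amplifying factor (50 in the head configuration above)
    """
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


# With k = 50 the curve is steep: values slightly above the threshold
# are pushed towards 1, values slightly below towards 0.
for p in (0.2, 0.5, 0.8):
    print(p, round(differentiable_binarization(p, t=0.5), 4))
```

A large k keeps the step close to a hard threshold while remaining differentiable, which is what allows the binarization to be trained jointly with the rest of the network.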

Training Data

The weights were converted from OpenOCR pretrained weights, which were trained on Chinese text detection datasets.

Recommended datasets for fine-tuning:

MSRA-TD500 (Chinese + English)
ICDAR2017 RCTW (Chinese)
CTW1500

Note: For English-only text detection, fine-tuning on English datasets (ICDAR2015, Total-Text) is recommended.

Usage

With Hugging Face

from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(
    repo_id="thisisiron/dbnetpp_repvit_ch",
    filename="dbnetpp_repvit_ch.pth"
)

# Load weights
state_dict = torch.load(model_path, map_location="cpu")
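
One way to sanity-check the downloaded checkpoint against the "~3M" figure in the table above is to sum the element counts of its tensors. The sketch below works on a mapping from names to shapes, so it runs without PyTorch; the helper name and the example layer names and shapes are hypothetical and are not taken from the actual checkpoint.

```python
import math
from typing import Dict, Tuple


def count_parameters(shapes: Dict[str, Tuple[int, ...]]) -> int:
    """Sum the element counts of all tensors, given their shapes."""
    # math.prod(()) is 1, so scalar entries count as one element.
    return sum(math.prod(shape) for shape in shapes.values())


# Hypothetical layer names and shapes, for illustration only.
example_shapes = {
    "conv.weight": (24, 96, 3, 3),
    "conv.bias": (24,),
}
print(count_parameters(example_shapes))
```

Assuming the loaded `state_dict` is a flat mapping of names to tensors, it could be passed in as `count_parameters({k: tuple(v.shape) for k, v in state_dict.items()})`. The total may include non-trainable buffers such as normalization statistics, so it can differ somewhat from the trainable parameter count.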

With OCR-Factory

import torch
from ocrfactory.models.detect import DBNetPP

# Build model
model = DBNetPP(
    backbone={"name": "RepViT"},
    neck={
        "name": "RSEFPN",
        "in_channels": [48, 96, 192, 384],
        "out_channels": 96,
        "shortcut": True
    },
    head={
        "name": "DBNetPPHead",
        "in_channels": 96,
        "inner_channels": 24,
        "k": 50,
        "use_asf": False
    }
)

# Load weights
state_dict = torch.load("dbnetpp_repvit_ch.pth", map_location="cpu")
model.load_state_dict(state_dict, strict=True)
model.eval()

# Inference
x = torch.randn(1, 3, 640, 640)
with torch.no_grad():
    output = model(x)
    shrink_map = output["shrink_map"]  # (1, 1, 640, 640)
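
The shrink map is a per-pixel score map, so turning it into text regions requires a post-processing step. The sketch below is a simplified stand-in for that step: it thresholds the map and returns an axis-aligned bounding box for each 4-connected region. The function name, the 0.3 threshold, and the toy input are assumptions made for illustration; the DB post-processing described in the papers works on polygon contours, scores each region, and expands the shrunk regions back to full text size, none of which is done here.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def boxes_from_shrink_map(prob: List[List[float]], thresh: float = 0.3) -> List[Box]:
    """Threshold a 2D score map and box each 4-connected region."""
    h = len(prob)
    w = len(prob[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes: List[Box] = []
    for y0 in range(h):
        for x0 in range(w):
            if seen[y0][x0] or prob[y0][x0] <= thresh:
                continue
            # Breadth-first flood fill starting from an unvisited text pixel.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            x_min, y_min, x_max, y_max = x0, y0, x0, y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and not seen[ny][nx] and prob[ny][nx] > thresh):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Toy 3x5 map with two separate regions (made-up values).
toy_map = [
    [0.9, 0.8, 0.0, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.7],
]
print(boxes_from_shrink_map(toy_map))
```

Assuming `shrink_map` holds probabilities in [0, 1], a single image could be passed in as `shrink_map[0, 0].tolist()`. A pure-Python flood fill is slow on a 640x640 map, so this is meant to show the idea rather than to be used in production.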

Training Config (YAML)

architecture:
  backbone:
    name: RepViT
  neck:
    name: RSEFPN
    in_channels: [48, 96, 192, 384]
    out_channels: 96
    shortcut: true
  head:
    name: DBNetPPHead
    in_channels: 96
    inner_channels: 24
    k: 50
    use_asf: false
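
In the config above, the neck's `out_channels` and the head's `in_channels` are both 96, since the head consumes the neck's output. When editing the config, a small check like the one below can catch a mismatch early. This is a sketch under assumptions: the config is written as a plain Python dict mirroring the YAML, and `check_channels` is a hypothetical helper, not an OCR-Factory function.

```python
from typing import Any, Dict, List

# Python dict mirroring the YAML architecture section above.
architecture: Dict[str, Any] = {
    "backbone": {"name": "RepViT"},
    "neck": {
        "name": "RSEFPN",
        "in_channels": [48, 96, 192, 384],
        "out_channels": 96,
        "shortcut": True,
    },
    "head": {
        "name": "DBNetPPHead",
        "in_channels": 96,
        "inner_channels": 24,
        "k": 50,
        "use_asf": False,
    },
}


def check_channels(arch: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means consistent."""
    problems: List[str] = []
    neck, head = arch["neck"], arch["head"]
    if neck["out_channels"] != head["in_channels"]:
        problems.append(
            f"neck.out_channels ({neck['out_channels']}) != "
            f"head.in_channels ({head['in_channels']})"
        )
    if any(c <= 0 for c in neck["in_channels"]):
        problems.append("neck.in_channels must all be positive")
    return problems


print(check_channels(architecture))
```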

Performance

Dataset	Precision	Recall	H-mean
MSRA-TD500	-	-	-

Performance metrics will be updated after benchmarking.
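
For reference, the three columns are related as follows: precision is the fraction of predicted regions that match a ground-truth region, recall is the fraction of ground-truth regions that are matched, and H-mean is their harmonic mean, 2PR / (P + R). The sketch below computes them from match counts. The function name is illustrative, and the counts in the example are made up to show the arithmetic; they are not benchmark results for this model.

```python
from typing import Tuple


def detection_metrics(
    num_matched: int, num_predicted: int, num_ground_truth: int
) -> Tuple[float, float, float]:
    """Return (precision, recall, hmean) from detection match counts."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    total = precision + recall
    hmean = 2 * precision * recall / total if total else 0.0
    return precision, recall, hmean


# Made-up counts, for illustration only (not results for this model).
precision, recall, hmean = detection_metrics(
    num_matched=80, num_predicted=100, num_ground_truth=160
)
print(f"precision={precision:.3f} recall={recall:.3f} hmean={hmean:.3f}")
```

How a prediction is matched to a ground-truth region (for example, the overlap threshold) depends on the benchmark protocol and is outside the scope of this sketch.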

References

OpenOCR: https://github.com/Topdu/OpenOCR
RepViT: https://github.com/THU-MIG/RepViT
DBNet++: Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion (arXiv:2202.10304, published Feb 21, 2022)

License

Apache 2.0