Baseer-Nakba HTR: A State-of-the-Art VLM for Arabic Handwritten Text Recognition
Overview
This repository contains the model weights and inference pipeline for our submission to the NAKBA NLP 2026 Arabic Handwritten Text Recognition (HTR) competition. Our approach adapts the 3B-parameter Baseer Vision-Language Model (VLM) to effectively parse and recognize highly cursive, historical Arabic manuscripts.
By combining a progressive training pipeline, domain-matched data augmentation, and checkpoint merging, this unified model mitigates the challenges of varying writer styles, age-related document degradation, and Arabic morphological complexity.
Competition Results
Our final model secured top placements on the official Nakba hidden test set leaderboard.
| Metric | Score | Rank |
|---|---|---|
| Word Error Rate (WER) | 0.25 | 1st |
| Character Error Rate (CER) | 0.09 | 2nd |
Training Methodology
Our model was trained using a multi-stage Supervised Fine-Tuning (SFT) curriculum:
1. Data Augmentation: The Muharaf enhancement dataset was converted to grayscale to match the visual complexity and tonal distribution of the Nakba competition data.
2. Decoder-Only SFT: We first trained the text decoder autoregressively on the structurally similar Muharaf dataset to condition the language modeling head.
3. Full Encoder-Decoder Tuning: We subsequently unfroze the vision encoder and trained the full architecture on the Nakba dataset.
4. Checkpoint Merging: To stabilize predictions and maximize generalization, we averaged the weights of our top-performing epochs (Epoch 1 and Epoch 5).
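The grayscale conversion in the augmentation step can be sketched as follows. This is a minimal illustration using Pillow; `match_nakba_tonality` is a hypothetical helper name, not part of the released pipeline.

```python
from PIL import Image


def match_nakba_tonality(img: Image.Image) -> Image.Image:
    """Collapse a color Muharaf scan to grayscale, then replicate it back
    to 3 channels so the vision encoder's RGB input shape is unchanged."""
    return img.convert("L").convert("RGB")
```

Converting back to `RGB` after the luminance collapse keeps the tensor shape the encoder expects while removing the color cues absent from the Nakba scans.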
Training Hyperparameters
All supervised experiments used the standardized hyperparameters listed below.
| Parameter | Value |
|---|---|
| Hardware | 2 NVIDIA H100 GPUs |
| Base Model | 3B-parameter Baseer VLM |
| Epochs | 5 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Learning Rate Schedule | Cosine |
| Batch Size | 128 |
| Max Sequence Length | 1200 tokens |
| Input Image Resolution | 644 x 644 pixels |
| Decoder-Only Learning Rate | 1e-4 |
| Encoder-Decoder Learning Rate | Text Decoder: 1e-4, Vision Encoder: 9e-6 |
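The cosine learning-rate schedule used in both stages can be expressed as a closed-form decay from a base rate to zero. The sketch below is illustrative (the function name and `min_lr` parameter are assumptions, not taken from the training code); it would be evaluated separately for the decoder (base 1e-4) and encoder (base 9e-6) parameter groups.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, with `base_lr=1e-4` the rate starts at 1e-4, passes 5e-5 at the halfway point, and decays to `min_lr` at the end of training.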
Merge Method
This model was merged using the SLERP (spherical linear interpolation) merge method.
Models Merged
The following models were included in the merge:
- Basser_Nakab_ep_5
- Basser_Nakab_ep_1
Configuration
The following YAML configuration was used to produce this model:
```yaml
merge_method: slerp
base_model: Basser_Nakab_ep_1
models:
  - model: Basser_Nakab_ep_1
  - model: Basser_Nakab_ep_5
parameters:
  t:
    - value: 0.50
dtype: bfloat16
```