Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain
Abstract
A framework for developing specialized Turkish legal language models through domain adaptation, featuring encoder models pre-trained from scratch and decoder models adapted via continual pre-training for legal text processing.
This paper presents the Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that the optimal checkpoints reach their best retrieval scores before the pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) reaching performance comparable to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency relative to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall while requiring fewer computational resources. Whereas SOTA models rely on multi-stage, computationally intensive training pipelines, our single-stage pre-training followed by efficient post-training offers a cost-effective alternative. (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. Four-phase CPT with optimal sample ratios enables a gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves a 36.2% perplexity reduction on Turkish legal text, demonstrating domain adaptation gains.
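The checkpoint selection idea in contribution (1) can be illustrated with a minimal sketch: instead of keeping the final, lowest-loss checkpoint, every saved checkpoint is scored on a downstream retrieval dev set and the best-scoring one is kept. The loader and scoring functions below are passed in as callables and, like the paths in the usage comment, are hypothetical placeholders rather than APIs from the paper.

```python
from typing import Any, Callable, Iterable, Tuple


def select_best_checkpoint(
    checkpoint_paths: Iterable[str],
    load_encoder: Callable[[str], Any],        # e.g. loads a ModernBERT-style encoder
    retrieval_score: Callable[[Any], float],   # e.g. NDCG@10 on a Turkish retrieval dev set
) -> Tuple[str, float]:
    """Return the checkpoint with the highest downstream retrieval score."""
    best_path, best_score = "", float("-inf")
    for path in checkpoint_paths:
        model = load_encoder(path)
        score = retrieval_score(model)
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score


# Illustrative usage (hypothetical paths and callables):
# best, score = select_best_checkpoint(
#     [f"checkpoints/step_{s}" for s in range(10_000, 110_000, 10_000)],
#     load_encoder=my_loader,
#     retrieval_score=my_dev_retrieval_metric,
# )
# As the abstract notes, the best retrieval score can occur at a checkpoint
# saved before the pre-training loss reaches its minimum.
```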
Community
The Mecellem paper proposes Turkish legal-domain encoders and decoders trained from scratch and via continual pre-training.
ModernBERT-based encoders (112.7B tokens) achieve top-3 Turkish retrieval results with high production efficiency, while Qwen3-based decoders show 36.2% perplexity reduction on legal text.
Models and datasets are released via Hugging Face to support reproducible and cost-effective legal NLP for Turkish and other low-resource languages.
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/mecellem-models-turkish-models-trained-from-scratch-and-continually-pre-trained-for-the-legal-domain
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Persian-Phi: Efficient Cross-Lingual Adaptation of Compact LLMs via Curriculum Learning (2025)
- TabiBERT: A Large-Scale ModernBERT Foundation Model and A Unified Benchmark for Turkish (2025)
- TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction (2026)
- Bielik 11B v3: Multilingual Large Language Model for European Languages (2025)
- AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages (2026)
- Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-training (2026)
- MiniLingua: A Small Open-Source LLM for European Languages (2025)