# Model Card for IndoTranslit Multilingual Transliterator

## Model Summary
This is a multilingual Marian-based Seq2Seq transliterator trained on the combined IndoTranslit dataset (2.7M pairs).
It can transliterate both Romanized Hindi (Hindish) and Romanized Bengali (Banglish) into their respective native scripts using a single shared model.
- Architecture: MarianMT (Seq2Seq Transformer)
- Parameters: ~60M
- Training Data: IndoTranslit dataset (Hindi 1.77M, Bengali 975k)
- Languages: Hindi + Bengali
## Intended Use
- General-purpose transliteration for South Asian languages in Romanized script.
- Works for multilingual inputs, code-mixed text, and noisy social media writing (see the batched example below).
## Example Usage
```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "sk-community/indotranslit_multilingual"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Example Hindi
input_hi = "aap kaise ho"
hi_inputs = tokenizer(input_hi, return_tensors="pt")
hi_outputs = model.generate(**hi_inputs)
print(tokenizer.decode(hi_outputs[0], skip_special_tokens=True))
# Output: "आप कैसे हो"

# Example Bengali
input_bn = "tumi amar bondhu"
bn_inputs = tokenizer(input_bn, return_tensors="pt")
bn_outputs = model.generate(**bn_inputs)
print(tokenizer.decode(bn_outputs[0], skip_special_tokens=True))
# Output: "তুমি আমার বন্ধু"
```
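Because the Hindi and Bengali directions share one checkpoint and vocabulary, a code-mixed batch can be transliterated in a single call. The sketch below is illustrative rather than the authors' recommended setup: the extra example sentence, beam size, and `max_length` are assumptions, not values taken from this card.

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "sk-community/indotranslit_multilingual"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Mixed Romanized Hindi and Bengali inputs handled by the single shared model.
romanized = ["aap kaise ho", "tumi amar bondhu", "kal milte hain"]

# Pad the batch so sentences of different lengths can be decoded together.
batch = tokenizer(romanized, return_tensors="pt", padding=True)
outputs = model.generate(**batch, num_beams=4, max_length=64)  # illustrative decoding settings

for src, ids in zip(romanized, outputs):
    print(src, "->", tokenizer.decode(ids, skip_special_tokens=True))
```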
## Performance
- BLEU (Hindi): 77.57
- BLEU (Bengali): 77.82
- BLEU (Multilingual): 73.15 (an illustrative evaluation sketch follows)
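The card does not state which test split, tokenization, or BLEU implementation produced the scores above, so the snippet below is only a minimal sketch of how a comparable corpus-level BLEU could be computed, assuming the `sacrebleu` package and a small hypothetical set of reference pairs.

```python
import sacrebleu
from transformers import MarianMTModel, MarianTokenizer

# Hypothetical held-out pairs; replace with the actual IndoTranslit test split.
sources = ["aap kaise ho", "tumi amar bondhu"]
references = ["आप कैसे हो", "তুমি আমার বন্ধু"]

model_name = "sk-community/indotranslit_multilingual"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Transliterate the source side in one padded batch.
batch = tokenizer(sources, return_tensors="pt", padding=True)
hypotheses = tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True)

# Corpus-level BLEU; sacrebleu expects a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```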
## Citation

```bibtex
@article{gharami2025indotranslit,
  title={Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration},
  author={Kanchon Gharami and Quazi Sarwar Muhtaseem and Deepti Gupta and Lavanya Elluri and Shafika Showkat Moni},
  year={2025}
}
```