KiteFish-A1-1.5B

KiteFish-A1-1.5B is a ~1.5B parameter decoder-only transformer trained from scratch on raw arXiv LaTeX sources across mathematics, computer science, and theoretical physics.

📄 Paper: https://arxiv.org/abs/2602.17288
💻 GitHub: https://github.com/kitefishai/KiteFish-A1-1.5B-Math

This is a base scientific language model (not instruction-tuned).

Overview

KiteFish-A1-1.5B explores what it takes to train a domain-specialized scientific language model directly from structured LaTeX archives.

Training Scale

  • ~52B pretraining tokens
  • ~5B additional post-training tokens
  • ~200GB processed scientific corpus
  • LLaMA-compatible tokenizer (~102k vocab; see the quick check after this list)
  • 2× NVIDIA A100 (80GB) GPUs
  • 24 experimental training runs
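
The tokenizer figures above can be sanity-checked with a minimal snippet, assuming the released checkpoint bundles the tokenizer described here:

from transformers import AutoTokenizer

# Assumes the tokenizer shipped with the checkpoint is the one described above.
tokenizer = AutoTokenizer.from_pretrained("KiteFishAI/KiteFish-A1-1.5B-Math")

print(len(tokenizer))  # vocabulary size, expected to be roughly 102k

# Inspect how a short LaTeX fragment is segmented into tokens.
print(tokenizer.tokenize(r"\int_0^1 x^2 \, dx = \frac{1}{3}"))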

The focus of this project is scientific language modeling robustness, not benchmark optimization.

Model Architecture

  • 24 Transformer layers
  • Hidden size: 2048
  • FFN size: 5504
  • 16 attention heads
  • Context length: 4096 (trained on 768-token sequences)
  • Dense LLaMA-style architecture (see the config sketch after this list)
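
For illustration, the dimensions above map roughly onto a Hugging Face LlamaConfig as follows; the config.json shipped with the checkpoint is authoritative and may differ in details such as the exact vocabulary size:

from transformers import LlamaConfig

# Illustrative reconstruction of the listed dimensions (not the shipped config).
config = LlamaConfig(
    vocab_size=102_000,            # approximate (~102k)
    hidden_size=2048,
    intermediate_size=5504,        # FFN size
    num_hidden_layers=24,
    num_attention_heads=16,
    max_position_embeddings=4096,  # context length; training used 768-token sequences
)
print(config)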

Optimization

  • AdamW (see the setup sketch after this list)
  • Learning rate: 2e-4
  • Warmup: 500 steps
  • Weight decay: 0.1
  • Gradient accumulation: 32
  • bf16 mixed precision
  • Gradient checkpointing enabled
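
A minimal sketch of an equivalent optimizer setup in PyTorch, using a stand-in module; the actual training code is not part of this release:

import torch
import torch.nn as nn

model = nn.Linear(2048, 2048)  # stand-in module; substitute the real model in practice

# Hyperparameters mirror the list above.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / 500)  # linear warmup over 500 steps
)

# In the training loop: run forward/backward under
# torch.autocast(device_type="cuda", dtype=torch.bfloat16), accumulate gradients
# over 32 micro-batches, then optimizer.step(), scheduler.step(), optimizer.zero_grad().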

Validation Perplexity: ~4.2 (held-out scientific corpus)
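
Here perplexity is the exponential of the mean token-level cross-entropy. The held-out corpus is not distributed, but the metric can be reproduced on any text sample along these lines:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "KiteFishAI/KiteFish-A1-1.5B-Math"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = r"Let $f, g \colon [0,1] \to \mathbb{R}$ be continuous functions."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels supplied, the model returns the mean cross-entropy over the sequence.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity = {torch.exp(loss).item():.2f}")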

Intended Use

KiteFish-A1-1.5B is suitable for:

  • Scientific text modeling research
  • Mathematical language modeling experiments
  • Pretraining initialization for domain fine-tuning
  • Tokenization and symbolic modeling research
  • Studying LaTeX structure modeling

It is not optimized for:

  • Instruction following
  • Chat-based applications
  • General conversational AI
  • Benchmark leaderboard performance

Performance Notes

This model was trained under moderate compute constraints and without instruction tuning or alignment stages.

Observed characteristics:

  • Strong familiarity with scientific writing style
  • Stable LaTeX structural modeling
  • Reasonable symbolic fluency
  • Limited reasoning depth
  • Low downstream benchmark accuracy without fine-tuning

Performance improves significantly with supervised fine-tuning (SFT), LoRA adaptation, or domain-specific instruction tuning.
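
As an illustration of the LoRA route, a minimal sketch with the peft library; the rank, alpha, and target modules below are placeholder choices, not settings validated for this model:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("KiteFishAI/KiteFish-A1-1.5B-Math")

lora_config = LoraConfig(
    r=16,                                 # placeholder rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumes LLaMA-style attention module names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights should be trainable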

Limitations

  • Not instruction-tuned
  • No RLHF or preference alignment
  • Trained at 768-token sequence length
  • Domain restricted to selected arXiv categories
  • Not optimized for reasoning benchmarks
  • General NLP benchmark scores may be low

This release is intended primarily for research and experimentation.

Example Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "KiteFishAI/KiteFish-A1-1.5B-Math"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# This is a base model: expect a continuation in scientific/LaTeX style rather
# than an instruction-style answer.
prompt = "Prove that the sum of two continuous functions is continuous."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Citation

If you use this model in your research, please cite:

@article{kitefish_a1_2026,
  title={KiteFish-A1: Training a Scientific Language Model from Raw LaTeX Archives},
  author={...},
  year={2026},
  eprint={2602.17288},
  archivePrefix={arXiv}
}