KiteFish-A1-1.5B
KiteFish-A1-1.5B is a ~1.5B parameter decoder-only transformer trained from scratch on raw arXiv LaTeX sources across mathematics, computer science, and theoretical physics.
📄 Paper: https://arxiv.org/abs/2602.17288
💻 GitHub: https://github.com/kitefishai/KiteFish-A1-1.5B-Math
This is a base scientific language model (not instruction-tuned).
Overview
KiteFish-A1-1.5B explores what it takes to train a domain-specialized scientific language model directly from structured LaTeX archives.
Training Scale
- ~52B pretraining tokens
- ~5B additional post-training tokens
- ~200GB processed scientific corpus
- LLaMA-compatible tokenizer (~102k vocab)
- 2× NVIDIA A100 (80GB) GPUs
- 24 experimental training runs
The focus of this project is robustness in scientific language modeling, not benchmark optimization.
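As a quick sanity check on the tokenizer, the vocabulary can be inspected directly; the snippet below is a minimal sketch and assumes the repository id shown in the Example Usage section.

from transformers import AutoTokenizer

# Load only the tokenizer to inspect the ~102k-entry LLaMA-compatible vocabulary.
tokenizer = AutoTokenizer.from_pretrained("KiteFishAI/KiteFish-A1-1.5B-Math")
print(len(tokenizer))  # expected to be on the order of 102k entries

# LaTeX-heavy scientific text is the native domain of this tokenizer.
sample = r"Let $f \colon \mathbb{R} \to \mathbb{R}$ be a continuous function."
print(tokenizer.tokenize(sample))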
Model Architecture
- 24 Transformer layers
- Hidden size: 2048
- FFN size: 5504
- 16 attention heads
- Context length: 4096 (pretraining sequence length: 768 tokens)
- Dense LLaMA-style architecture
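The dimensions above correspond roughly to the following Hugging Face LlamaConfig. This is an illustrative sketch, not the shipped configuration; authoritative values (including vocab_size and rotary-embedding settings) should be read from the config.json in the model repository.

from transformers import LlamaConfig

# Illustrative sketch of the architecture listed above; exact values live in
# the repository's config.json.
config = LlamaConfig(
    num_hidden_layers=24,          # 24 transformer layers
    hidden_size=2048,              # model width
    intermediate_size=5504,        # FFN size
    num_attention_heads=16,        # attention heads
    max_position_embeddings=4096,  # supported context length
    vocab_size=102400,             # assumption: ~102k LLaMA-compatible vocabulary
)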
Optimization
- AdamW
- Learning rate: 2e-4
- Warmup: 500 steps
- Weight decay: 0.1
- Gradient accumulation: 32
- bf16 mixed precision
- Gradient checkpointing enabled
Validation Perplexity: ~4.2 (held-out scientific corpus)
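For orientation, the hyperparameters above map onto a standard Hugging Face TrainingArguments setup roughly as sketched below; the batch size and output path are assumptions rather than reported values, and the final line shows how the reported perplexity relates to the held-out cross-entropy loss.

import math
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters; per_device_train_batch_size
# and output_dir are assumptions, not reported settings.
args = TrainingArguments(
    output_dir="kitefish-a1-pretrain",
    learning_rate=2e-4,
    weight_decay=0.1,
    warmup_steps=500,
    gradient_accumulation_steps=32,
    per_device_train_batch_size=8,    # assumption
    bf16=True,                        # bf16 mixed precision
    gradient_checkpointing=True,
    optim="adamw_torch",              # AdamW
)

# Perplexity is exp(mean cross-entropy); a held-out loss of ~1.44 nats/token
# corresponds to the reported validation perplexity of roughly 4.2.
print(math.exp(1.44))  # ≈ 4.22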
Intended Use
KiteFish-A1-1.5B is suitable for:
- Scientific text modeling research
- Mathematical language modeling experiments
- Pretraining initialization for domain fine-tuning
- Tokenization and symbolic modeling research
- Studying LaTeX structure modeling
It is not optimized for:
- Instruction following
- Chat-based applications
- General conversational AI
- Benchmark leaderboard performance
Performance Notes
This model was trained under moderate compute constraints and without instruction tuning or alignment stages.
Observed characteristics:
- Strong familiarity with scientific writing style
- Stable LaTeX structural modeling
- Reasonable symbolic fluency
- Limited reasoning depth
- Low downstream benchmark accuracy without fine-tuning
Performance improves significantly with supervised fine-tuning (SFT), LoRA adaptation, or domain-specific instruction tuning.
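As a concrete starting point for such adaptation, a minimal LoRA setup with the peft library might look like the sketch below; the rank, alpha, dropout, and target modules are illustrative defaults for LLaMA-style models, not settings validated for this model.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("KiteFishAI/KiteFish-A1-1.5B-Math")

# Illustrative LoRA configuration; r, lora_alpha, and target_modules are
# assumptions chosen as common defaults, not values from this project.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable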
Limitations
- Not instruction-tuned
- No RLHF or preference alignment
- Trained at 768-token sequence length
- Domain restricted to selected arXiv categories
- Not optimized for reasoning benchmarks
- General NLP benchmark scores may be low
This release is intended primarily for research and experimentation.
Example Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "KiteFishAI/KiteFish-A1-1.5B-Math"

# Load the tokenizer and base model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# As a base model, it continues the prompt rather than following instructions.
prompt = "Prove that the sum of two continuous functions is continuous."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
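Because this is a base model, generate simply continues the prompt; for more varied completions, sampling parameters can be passed to generate (the values below are illustrative, not tuned recommendations).

# Optional: sampled decoding; temperature and top_p values are illustrative.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))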
Citation
If you use this model in your research, please cite:
@article{kitefish_a1_2026,
  title={KiteFish-A1: Training a Scientific Language Model from Raw LaTeX Archives},
  author={...},
  year={2026},
  eprint={2602.17288},
  archivePrefix={arXiv}
}