SmolLM2-70M

A SmolLM2-70M model pretrained on the Sutra-10B pedagogical dataset for 3 epochs (~30.6B tokens total). It demonstrates that a 69.2M-parameter model can be trained to near its capacity ceiling using dense, curated educational data.

Model Details

| Property | Value |
| --- | --- |
| Architecture | LlamaForCausalLM |
| Parameters | 69.2M |
| Hidden Size | 384 |
| Layers | 32 |
| Attention Heads | 6 (2 KV heads) |
| Context Length | 8,192 |
| Vocabulary | 49,152 |
| Precision | bfloat16 |
| Base Model | SmolLM2-70M |
| Training Dataset | Sutra-10B (10.2B tokens) |

Training

The model was trained for 3 epochs on the Sutra-10B dataset using a single NVIDIA L40S GPU (46GB). This release is the best-perplexity checkpoint from epoch 3.

| Epoch | Tokens | Training Time | Learning Rate | Best Perplexity |
| --- | --- | --- | --- | --- |
| 1 | 10.2B | 25.82h | 3e-4 → 3e-5 | 39.50 |
| 2 | 10.2B | 25.78h | 1e-4 → 1e-5 | 37.81 |
| 3 | 10.2B | 26.16h | 3e-5 → 3e-6 | 37.72 |
| Total | 30.6B | 77.76h | — | 37.72 |
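The reported perplexities map directly to mean per-token cross-entropy loss, since perplexity is exp(loss in nats). A quick check of the values in the table:

```python
import math

# Perplexity is exp(mean per-token cross-entropy loss in nats),
# so each reported perplexity maps back to a loss value.
def loss_from_ppl(ppl: float) -> float:
    return math.log(ppl)

for epoch, ppl in [(1, 39.50), (2, 37.81), (3, 37.72)]:
    print(f"epoch {epoch}: ppl={ppl} -> loss={loss_from_ppl(ppl):.3f} nats")
```

The epoch-2 to epoch-3 loss gap (~0.003 nats) is tiny, which is consistent with the capacity-ceiling interpretation below.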

Training configuration:

  • Optimizer: AdamW (fused), weight decay 0.1
  • Schedule: Cosine with warmup
  • Batch size: 4 per device, gradient accumulation 8 (effective ~262K tokens/step)
  • Sequence length: 8,192
  • Flash Attention 2, TF32 matmul, torch.compile
  • Throughput: ~110K tokens/sec
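The batch and wall-clock figures above are mutually consistent; a small sanity check using only numbers from this card:

```python
# Effective tokens per optimizer step: per-device batch x grad accum x sequence length.
per_device_batch = 4
grad_accum = 8
seq_len = 8192
tokens_per_step = per_device_batch * grad_accum * seq_len
print(tokens_per_step)  # 262144, i.e. the ~262K tokens/step quoted above

# Implied throughput over the full run: 30.6B tokens in 77.76 hours.
total_tokens = 30.6e9
hours = 77.76
tokens_per_sec = total_tokens / (hours * 3600)
print(round(tokens_per_sec))  # ~109K tokens/sec, matching the ~110K figure
```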

Benchmark Results

All benchmarks evaluated using lm-evaluation-harness v0.4.11. All tasks are 0-shot except GSM8K (5-shot).

This Model vs Training Progression

| Benchmark | E3-best | E3-final | E2-best | E2-final | E1-final |
| --- | --- | --- | --- | --- | --- |
| ARC-Easy | 33.00 | 33.16 | 32.83 | 33.12 | 33.46 |
| ARC-Challenge | 22.35 | 21.67 | 22.61 | 22.44 | 22.44 |
| BoolQ | 39.66 | 39.66 | 39.79 | 39.54 | 39.79 |
| HellaSwag | 26.14 | 26.03 | 26.08 | 25.91 | 26.03 |
| PIQA | 54.84 | 55.01 | 54.24 | 54.13 | 54.62 |
| SciQ | 45.20 | 46.30 | 44.10 | 45.50 | 43.60 |
| WinoGrande | 50.04 | 49.33 | 50.51 | 48.70 | 48.78 |
| TruthfulQA | 48.02 | 47.93 | 48.30 | 48.14 | 48.30 |
| GSM8K | 0.53 | 0.61 | 0.68 | 0.83 | 0.15 |
| MMLU | 22.96 | 22.87 | 23.00 | 22.98 | 22.99 |
| OpenBookQA | 27.60 | 27.60 | — | — | — |
| Average (10) | 34.27 | 34.26 | 34.21 | 34.13 | 34.02 |

Average (10) is computed over the first ten benchmarks; OpenBookQA is excluded because it was evaluated only for the epoch-3 checkpoints.

Comparison with 1B Token Baselines (SmolLM2-70M)

For comparison, the same SmolLM2-70M architecture was trained for 1 epoch on several 1B-token datasets from the Pre-training Dataset Samples collection. Sutra-10B at 3 epochs achieves the highest average for this model size.

| Dataset (1B tokens) | HellaSwag | PIQA | WinoGrande | ARC-C | MMLU | TruthfulQA | GSM8K | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sutra-10B (3 epochs) | 26.14 | 54.84 | 50.04 | 22.35 | 22.96 | 48.02 | 0.53 | 34.27 |
| Sutra-1B | 25.43 | 53.86 | 49.41 | 23.04 | 22.91 | 49.09 | 1.14 | 32.13 |
| FineWiki-1B | 25.56 | 51.69 | 48.86 | 24.15 | 23.34 | 51.16 | 0.91 | 32.24 |
| FinePDFs-1B | 25.58 | 52.56 | 50.51 | 22.44 | 22.95 | 51.41 | 1.21 | 32.38 |
| DCLM-Baseline-1B | 25.85 | 55.17 | 50.20 | 21.08 | 22.97 | 49.21 | 0.68 | 32.16 |
| FineWeb-Edu-1B | 25.72 | 55.11 | 50.36 | 21.25 | 22.96 | 48.11 | 1.21 | 32.10 |
| Essential-Web-1B | 26.02 | 55.44 | 48.30 | 20.99 | 22.95 | 49.59 | 1.29 | 32.08 |
| Synth-1B | 26.63 | 50.98 | 48.78 | 21.93 | 23.24 | 47.10 | 1.29 | 31.42 |

Note: the Sutra-10B (3 epochs) Avg is the 10-benchmark average from the progression table above; the 1B-baseline rows average the seven tasks shown here.

Key Findings

  1. Capacity ceiling: The 70M parameter model reaches its capacity ceiling at approximately 10B tokens. Additional epochs (up to 30.6B total tokens) yield only marginal improvements in benchmark scores (+0.25 average from epoch 1 to epoch 3), despite continued perplexity improvement (39.50 → 37.72).

  2. Perplexity vs benchmarks: Perplexity continues to decrease across epochs, but downstream benchmark performance plateaus, suggesting the model's representational capacity is the bottleneck rather than data exposure.

  3. Data quality matters: Even at 1B tokens, Sutra outperforms or matches larger web-crawled datasets (DCLM, FineWeb-Edu, Essential-Web) on average, demonstrating the value of curated pedagogical content.
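The capacity-ceiling finding is easy to contextualize: this run trains far beyond the compute-optimal data budget (roughly 20 tokens per parameter under Chinchilla-style scaling), so a benchmark plateau with slowly improving perplexity is the expected regime. A quick calculation from the numbers in this card:

```python
params = 69.2e6   # model parameters
tokens = 30.6e9   # total training tokens over 3 epochs
ratio = tokens / params
print(round(ratio))  # ~442 tokens per parameter
# For reference, Chinchilla-style compute-optimal training uses roughly
# 20 tokens per parameter, so this run over-trains by ~22x -- consistent
# with perplexity still improving while benchmark scores plateau.
```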

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pretrained base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("codelion/SmolLM2-70M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("codelion/SmolLM2-70M")

# This is a base model: it continues text rather than following instructions
input_text = "The theory of relativity states that"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Limitations

  • This is a 69M parameter base model (not instruction-tuned) — it generates completions, not conversational responses
  • Performance is at the capacity ceiling for this model size; larger models would benefit more from the Sutra-10B dataset
  • The model was trained primarily on English educational content

Related Resources

  • Dataset: codelion/sutra-10B — 10B token pedagogical pretraining dataset
  • Sutra Framework: Generates structured educational content optimized for LLM pretraining

License

Apache 2.0
