# JuliaGPTDistill

A ~5M-parameter LLaMA-style student model distilled from JuliaFluxGPT (10M parameters). It uses knowledge distillation with temperature scaling to compress the teacher's knowledge into a smaller architecture.
## Architecture

| Parameter | Value |
|---|---|
| Architecture | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) |
| Embedding dim | 256 |
| Layers | 4 |
| Query heads | 4 |
| KV heads | 2 (GQA ratio 2:1) |
| Head dim | 64 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Dropout | 0.1 |
| Weight tying | Yes |
| Framework | Julia + Flux.jl |
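
For orientation, the table above corresponds to a configuration roughly like the following sketch (the struct and field names are illustrative assumptions, not the repo's actual code):

```julia
# Hypothetical config struct mirroring the Architecture table above.
Base.@kwdef struct DistillConfig
    n_embd::Int       = 256   # embedding dimension
    n_layer::Int      = 4     # transformer blocks
    n_head::Int       = 4     # query heads
    n_kv_head::Int    = 2     # KV heads (GQA ratio 2:1)
    head_dim::Int     = 64    # per-head dimension (n_embd ÷ n_head)
    block_size::Int   = 256   # context length in tokens
    vocab_size::Int   = 2000  # ByteLevel BPE vocabulary
    dropout::Float64  = 0.1
    tie_weights::Bool = true  # share embedding and output projection
end

cfg = DistillConfig()
```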
## Distillation Settings

| Parameter | Value |
|---|---|
| Teacher model | JuliaFluxGPT (512d / 8 layers / 8 query heads / 2 KV heads) |
| KD temperature | 4.0 |
| KD alpha | 0.5 |
| Loss | 0.5 · CE + 0.5 · KL(teacher ‖ student) |
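
A minimal sketch of how that combined loss can be computed with Flux, assuming logits of shape (vocab, positions) and one-hot labels; the function name and signature are illustrative, and whether the repo applies the conventional T² scaling to the KL term is not stated:

```julia
using Flux
using Flux: logitcrossentropy
using Statistics: mean

# Combined KD objective: alpha * CE(student, labels) + (1 - alpha) * KL(teacher ‖ student),
# with soft targets computed at temperature T (here alpha = 0.5, T = 4.0).
function kd_loss(student_logits, teacher_logits, labels; T = 4.0, alpha = 0.5)
    # Hard-label cross-entropy on the student's raw logits
    ce = logitcrossentropy(student_logits, labels)

    # Softened teacher and student distributions at temperature T
    p_t     = softmax(teacher_logits ./ T; dims = 1)
    log_p_s = logsoftmax(student_logits ./ T; dims = 1)

    # KL(teacher ‖ student), averaged over positions
    kl = mean(sum(p_t .* (log.(p_t .+ eps(eltype(p_t))) .- log_p_s); dims = 1))

    # Optionally multiply the KL term by T^2 (standard KD practice; not specified for this repo)
    return alpha * ce + (1 - alpha) * kl
end
```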
## Training

| Parameter | Value |
|---|---|
| Dataset | philosophy-corpus |
| Tokenizer | BPE (2,000 vocab, ByteLevel) |
| Training steps | 4,089 |
| Best val loss | 7.44 |
| Hardware | NVIDIA RTX 3060 12GB |
## Inference Settings

| Parameter | Value |
|---|---|
| `vocab_size` | 2,000 |
| `context_length` | 256 |
| `temperature` | 0.8 |
| `top_k` | 40 |
Note: This model requires the same BPE tokenizer used by JuliaFluxGPT. No tokenizer file is included in this repo — use the tokenizer from JuliaFluxGPT.
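
As a rough illustration of how the settings above apply at generation time, here is a generic temperature + top-k sampler over next-token logits (a sketch, not necessarily the repo's sampling code):

```julia
using Flux: softmax

# Sample one token id from a vector of next-token logits using
# temperature scaling followed by top-k filtering.
function sample_token(logits::AbstractVector; temperature = 0.8, top_k = 40)
    scaled = logits ./ temperature
    # Indices of the top_k highest-scoring tokens
    idx = partialsortperm(scaled, 1:min(top_k, length(scaled)); rev = true)
    probs = softmax(scaled[idx])
    # Inverse-CDF sampling over the renormalized top-k probabilities
    r, c = rand(), 0.0
    for (i, p) in zip(idx, probs)
        c += p
        r <= c && return i
    end
    return idx[end]
end
```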
## Checkpoint Format

JLD2 files containing:

- `model_state` — Flux model weights
- `hyperparams` — `Dict("n_embd"=>256, "n_layer"=>4, "n_head"=>4, "n_kv_head"=>2, "vocab_size"=>2000, "block_size"=>256, "dropout"=>0.1, "kd_temperature"=>4.0, "kd_alpha"=>0.5)`
- `step`, `best_val_loss`, `train_losses`, `val_losses`
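
A minimal sketch of restoring one of these checkpoints with JLD2 and Flux; `build_model` is a hypothetical constructor standing in for whatever model builder the repo provides:

```julia
using JLD2, Flux

# Read the checkpoint's top-level keys into a Dict
ckpt = JLD2.load("best_model.jld2")
hp   = ckpt["hyperparams"]
println("step = ", ckpt["step"], ", best val loss = ", ckpt["best_val_loss"])

# `build_model` is hypothetical: construct the 256d / 4-layer / 4Q / 2KV model from `hp`,
# then copy the saved weights into it.
model = build_model(hp)
Flux.loadmodel!(model, ckpt["model_state"])
```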
## Files

| File | Description |
|---|---|
| `best_model.jld2` | Best validation loss checkpoint |
| `final_model.jld2` | Final training step checkpoint |
| `checkpoint_latest.jld2` | Latest periodic checkpoint |
## Provenance

Student of the JuliaFluxGPT teacher model, distilled on the philosophy-corpus dataset. The BPE tokenizer is shared with the teacher (see the JuliaFluxGPT repository).
## License

MIT