JuliaGPTDistill

A ~5M-parameter LLaMA-style student model distilled from JuliaFluxGPT (10M params). Training uses knowledge distillation with temperature scaling to compress the teacher's knowledge into a smaller architecture.

Architecture

Parameter Value
Architecture LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA)
Embedding dim 256
Layers 4
Query heads 4
KV heads 2 (GQA ratio 2:1)
Head dim 64
Context length 256 tokens
Vocabulary 2,000 (ByteLevel BPE)
Dropout 0.1
Weight tying Yes
Framework Julia + Flux.jl
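
As a sanity check on the "~5M" figure, the sketch below estimates the parameter count from the values in the table. Several details are assumptions that the card does not state: no bias terms, two RMSNorm weight vectors per layer plus a final one, and a SwiGLU hidden size of 4 × the embedding dim. The function name `estimate_params` is hypothetical.

```julia
# Rough parameter count for the student configuration listed above.
# ASSUMPTIONS (not stated in the model card): no bias terms, two RMSNorm
# weight vectors per layer, one final RMSNorm, SwiGLU hidden size = 4 * n_embd.
function estimate_params(; n_embd=256, n_layer=4, n_head=4, n_kv_head=2,
                         head_dim=64, vocab_size=2000,
                         ffn_hidden=4 * n_embd)   # ffn_hidden is an assumption
    attn = n_embd * n_head * head_dim +           # query projection
           2 * n_embd * n_kv_head * head_dim +    # key and value projections (GQA)
           n_head * head_dim * n_embd             # output projection
    ffn = 3 * n_embd * ffn_hidden                 # SwiGLU: gate, up, down matrices
    norms = 2 * n_embd                            # pre-attention and pre-FFN RMSNorm
    per_layer = attn + ffn + norms
    embedding = vocab_size * n_embd               # counted once because of weight tying
    return n_layer * per_layer + embedding + n_embd   # trailing n_embd: final RMSNorm
end

println(estimate_params())
```

Under these assumptions the estimate is 4,446,464 parameters, which is in line with the stated ~5M. A different feed-forward width would shift the total accordingly.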

Distillation Settings

Parameter Value
Teacher model JuliaFluxGPT (512d/8L/8Q/2KV)
KD temperature 4.0
KD alpha 0.5
Loss 0.5 * CE + 0.5 * KL(teacher || student)
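
The loss in the table can be sketched for a single token position as follows. This is a minimal illustration in plain Julia, not the training code from this repo. The helper names `log_softmax` and `kd_loss` are hypothetical, and it is an assumption that alpha weights the KL term and that no extra T² factor is applied, since the formula above does not show one.

```julia
# Numerically stable log-softmax over a vector of logits.
function log_softmax(z::AbstractVector{<:Real})
    s = z .- maximum(z)
    return s .- log(sum(exp.(s)))
end

# Distillation loss for one token position (hypothetical helper).
# T and alpha mirror the card's KD temperature and KD alpha.
# ASSUMPTIONS: alpha weights the KL term; no T^2 scaling is applied.
function kd_loss(student_logits, teacher_logits, target::Integer;
                 T=4.0, alpha=0.5)
    # Hard-label cross-entropy at temperature 1.
    ce = -log_softmax(student_logits)[target]
    # Softened teacher and student distributions at temperature T.
    log_p = log_softmax(teacher_logits ./ T)
    log_q = log_softmax(student_logits ./ T)
    # KL(teacher || student) = sum_i p_i * (log p_i - log q_i)
    kl = sum(exp.(log_p) .* (log_p .- log_q))
    return (1 - alpha) * ce + alpha * kl
end

student = [2.0, 0.5, -1.0]
teacher = [3.0, 0.0, -2.0]
println(kd_loss(student, teacher, 1))
```

When student and teacher logits are identical, the KL term vanishes and the loss reduces to half the cross-entropy.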

Training

Parameter Value
Dataset philosophy-corpus
Tokenizer BPE (2,000 vocab, ByteLevel)
Training steps 4,089
Best val loss 7.44
Hardware NVIDIA RTX 3060 12GB

Inference Settings

Parameter Value
vocab_size 2,000
context_length 256
temperature 0.8
top_k 40
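
The temperature and top_k settings above could be applied to a vector of next-token logits roughly as in the sketch below. The function name `sample_top_k` is hypothetical and the code uses only Julia's standard library. It illustrates the usual top-k sampling scheme rather than the generation code shipped with the model.

```julia
using Random

# Sample a 1-based token index from raw logits using temperature and top-k.
# Hypothetical helper; defaults mirror the inference settings in the card.
function sample_top_k(rng::AbstractRNG, logits::AbstractVector{<:Real};
                      temperature=0.8, top_k=40)
    k = min(top_k, length(logits))
    # Indices of the k largest logits, in descending order.
    idx = partialsortperm(logits, 1:k; rev=true)
    # Temperature-scaled softmax restricted to those k candidates.
    scaled = logits[idx] ./ temperature
    probs = exp.(scaled .- maximum(scaled))
    probs ./= sum(probs)
    # Inverse-CDF sampling over the k candidates.
    r = rand(rng)
    acc = 0.0
    for (i, p) in zip(idx, probs)
        acc += p
        r <= acc && return i
    end
    return idx[end]   # guard against floating-point round-off
end

rng = MersenneTwister(42)
logits = randn(rng, 2000)          # stand-in for real model output
println(sample_top_k(rng, logits))
```

Lower temperatures concentrate probability on the highest-scoring candidates, and top_k = 1 reduces to greedy decoding.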

Note: This model requires the same BPE tokenizer used by JuliaFluxGPT. No tokenizer file is included in this repo; use the tokenizer from JuliaFluxGPT.

Checkpoint Format

JLD2 files containing:

  • model_state — Flux model weights
  • hyperparams — Dict("n_embd"=>256, "n_layer"=>4, "n_head"=>4, "n_kv_head"=>2, "vocab_size"=>2000, "block_size"=>256, "dropout"=>0.1, "kd_temperature"=>4.0, "kd_alpha"=>0.5)
  • step, best_val_loss, train_losses, val_losses
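
Before rebuilding the model from a checkpoint, it can help to confirm that the stored hyperparameters are mutually consistent. The sketch below does this for the dictionary shown above. The function `check_hyperparams` is hypothetical. In practice the dictionary would be read from the checkpoint file (for example with JLD2.jl); that step is left out here to keep the example dependency-free.

```julia
# Hyperparameters as listed in the checkpoint description above.
hyperparams = Dict("n_embd"=>256, "n_layer"=>4, "n_head"=>4, "n_kv_head"=>2,
                   "vocab_size"=>2000, "block_size"=>256, "dropout"=>0.1,
                   "kd_temperature"=>4.0, "kd_alpha"=>0.5)

# Hypothetical helper: validate the attention geometry and derive the
# per-head dimension and the GQA ratio (query heads per KV head).
function check_hyperparams(hp::AbstractDict)
    head_dim, leftover = divrem(hp["n_embd"], hp["n_head"])
    leftover == 0 || error("n_embd must be divisible by n_head")
    hp["n_head"] % hp["n_kv_head"] == 0 ||
        error("n_head must be divisible by n_kv_head")
    0 <= hp["dropout"] < 1 || error("dropout must lie in [0, 1)")
    return (head_dim = head_dim, gqa_ratio = div(hp["n_head"], hp["n_kv_head"]))
end

println(check_hyperparams(hyperparams))
```

For the values above this yields a head dimension of 64 and a GQA ratio of 2, matching the Architecture table.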

Files

File Description
best_model.jld2 Best validation loss checkpoint
final_model.jld2 Final training step checkpoint
checkpoint_latest.jld2 Latest periodic checkpoint

Provenance

License

MIT
