MicroJulia
A GPT-2 style character-level transformer trained on classical philosophy texts, implemented in Julia with Flux.jl. The first model in the Julia SLM lineage: a minimal proof-of-concept that established the training and serving infrastructure.
Model Family Context
MicroJulia is the starting point of an architectural progression:
| Model | Generation | Architecture | Tokenizer | Framework |
|---|---|---|---|---|
| MicroJulia | 1st | GPT-2 (LayerNorm, GELU, learned pos) | Character-level | Flux.jl |
| JuliaFluxGPT | 2nd | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) | BPE 2000 | Flux.jl |
| JuliaSLM | 3rd | Modern Transformer (RMSNorm, SwiGLU, RoPE) | BPE 2000 | Lux.jl |
| MonarchSLM | 3rd | Monarch Mixer (sub-quadratic) | BPE 2000 | Lux.jl |
| SymbioSLM | 3rd | Symbiogenesis (3 organelles) | BPE 2000 | Lux.jl |
Architecture
Classic GPT-2 design, deliberately minimal:
```
GPT (GPT-2 style)
+-- wte: Embedding(vocab_size -> n_embd)    [token embeddings]
+-- wpe: Embedding(block_size -> n_embd)    [learned position embeddings]
+-- drop: Dropout
+-- blocks x N:
|   +-- ln1: LayerNorm(n_embd)
|   +-- attn: CausalSelfAttention
|   |   +-- qkv: Dense(n_embd -> 3*n_embd)  [fused Q/K/V projection]
|   |   +-- proj: Dense(n_embd -> n_embd)
|   +-- ln2: LayerNorm(n_embd)
|   +-- ffwd: FeedForward
|       +-- Dense(n_embd -> 4*n_embd)
|       +-- GELU
|       +-- Dense(4*n_embd -> n_embd)
+-- ln_f: LayerNorm(n_embd)
+-- lm_head: Dense(n_embd -> vocab_size)
```
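The per-block data flow above (pre-norm, fused QKV, causal attention, 4x GELU FFN) can be sketched in plain Julia without Flux. This is an illustrative single-head sketch, not MicroJulia's actual code: names, shapes, and the single-sequence `(n_embd, T)` layout are assumptions, and the real `CausalSelfAttention` splits the projection into `n_head` heads.

```julia
using Statistics  # mean, var

# Illustrative sketch of one pre-norm GPT-2 block (plain Julia, no Flux).
# x is (n_embd, T) for a single sequence; all weights are plain matrices.

layernorm(x) = (x .- mean(x; dims=1)) ./ sqrt.(var(x; dims=1) .+ 1f-5)
gelu(x) = 0.5f0 .* x .* (1f0 .+ tanh.(0.79788456f0 .* (x .+ 0.044715f0 .* x .^ 3)))

function block(x, Wqkv, Wproj, W1, W2)
    n, T = size(x)
    # --- attention sub-layer (with residual) ---
    h = layernorm(x)
    qkv = Wqkv * h                                        # fused Q/K/V projection
    q, k, v = qkv[1:n, :], qkv[n+1:2n, :], qkv[2n+1:3n, :]
    scores = (k' * q) ./ sqrt(Float32(n))                 # scores[i, j]: key i vs query j
    mask = [i > j ? -Inf32 : 0f0 for i in 1:T, j in 1:T]  # causal: no future keys
    a = exp.(scores .+ mask)
    a = a ./ sum(a; dims=1)                               # softmax over keys
    x = x .+ Wproj * (v * a)
    # --- feed-forward sub-layer (4x expansion, GELU, with residual) ---
    x = x .+ W2 * gelu(W1 * layernorm(x))
    return x
end
```

Because the mask zeros out attention to future positions, perturbing the last token cannot change the outputs at earlier positions.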
Key Design Choices (GPT-2 era)
| Component | MicroJulia (GPT-2) | Later Models (LLaMA-style) |
|---|---|---|
| Normalization | LayerNorm (with bias) | RMSNorm (no bias) |
| Activation | GELU | SwiGLU |
| Position encoding | Learned embeddings | RoPE |
| QKV projection | Fused single Dense | Separate Q, K, V |
| FFN | Standard 4x expansion | SwiGLU (hidden width scaled by 2/3) |
| Output head | Separate lm_head | Weight-tied with embedding |
| Tokenizer | Character-level (28 chars) | BPE (2000 tokens) |
Character-Level Tokenization
Uses a minimal character vocabulary:
a-z, space, period (28 characters)
Each character maps directly to a token ID. No subword segmentation: the model must learn word boundaries, morphology, and syntax from individual characters.
Trade-offs:
- Simpler tokenizer implementation
- No OOV (out-of-vocabulary) issues
- Model must spend capacity on character-level patterns
- Less efficient than BPE for the same context window
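A character-level encode/decode over this 28-symbol vocabulary is a few lines of plain Julia. The helper names below are illustrative, not MicroJulia's actual API:

```julia
# Illustrative character-level tokenizer over the stated vocabulary:
# a-z (IDs 1-26), space (27), period (28).
const CHARS = collect("abcdefghijklmnopqrstuvwxyz .")
const CHAR_TO_ID = Dict(c => i for (i, c) in enumerate(CHARS))

encode(s::AbstractString) = [CHAR_TO_ID[c] for c in lowercase(s)]
decode(ids) = String(CHARS[ids])
```

Since each character consumes one position, a block_size-T context holds only T characters, which is the efficiency cost relative to BPE noted above.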
Model Details
| Parameter | Value |
|---|---|
| Architecture | GPT-2 style (pre-norm Transformer) |
| Tokenizer | Character-level (28 characters) |
| Position encoding | Learned position embeddings |
| Normalization | LayerNorm |
| Activation | GELU |
| Output projection | Separate Dense (not weight-tied) |
| Framework | Julia + Flux.jl |
Exact dimensions (vocab_size, n_embd, n_layer, n_head, block_size) are stored in the checkpoint hyperparams dict and loaded dynamically.
Training
| Setting | Value |
|---|---|
| Dataset | Classical philosophy texts |
| Tokenizer | Character-level mapping |
| Framework | Julia + Flux.jl |
| Hardware | Google Colab / NVIDIA GPU |
| Precision | Float32 |
Implementation Notes
Causal Masking
Uses a pre-computed additive upper-triangular mask (global constant), applied to attention scores before softmax:

```julia
CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)
```
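A quick standalone way to see the mask's effect (block_size is shrunk to 4 here for readability; this is a sketch, not the repo's code):

```julia
using LinearAlgebra  # triu

block_size = 4
mask = triu(fill(-Inf32, block_size, block_size), 1)  # -Inf above the diagonal, 0 elsewhere

# Adding the mask to (zero) scores and softmaxing row-wise: row i (query)
# attends only to positions 1..i (keys), uniformly since scores are zero.
scores = zeros(Float32, block_size, block_size) .+ mask
w = exp.(scores)
w = w ./ sum(w; dims=2)
```

Row 1 puts all its weight on position 1, while row 4 spreads weight evenly over all four positions.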
Position Embeddings
Learned absolute position embeddings (not RoPE):
```julia
tok = wte(token_ids)   # (C, T, B)
pos = wpe(1:T)         # (C, T, 1), broadcast across the batch
x = tok .+ pos
```
Limited to the trained block_size β no length extrapolation.
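In plain Julia terms, both embedding tables are just matrices indexed by ID, so the lookup amounts to column selection plus an elementwise add. Dimensions below are illustrative, not the trained model's:

```julia
n_embd, vocab_size, block_size = 8, 28, 16
wte = randn(Float32, n_embd, vocab_size)   # token embedding table
wpe = randn(Float32, n_embd, block_size)   # learned position table

token_ids = [3, 1, 20]                     # "cat" under the a-z vocabulary
T = length(token_ids)
tok = wte[:, token_ids]                    # (n_embd, T)
pos = wpe[:, 1:T]                          # (n_embd, T)
x = tok .+ pos                             # positions > block_size have no row: no extrapolation
```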
Usage
OpenAI-Compatible API
Served via MicroJulia Space:
```bash
curl -X POST https://lisamegawatts-microjulia.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "hello"}],
    "stream": true
  }'
```
Files
| File | Description |
|---|---|
| checkpoint.jld2 | Trained model weights + hyperparams (JLD2 format) |
| vocab.json | Character vocabulary mapping |
Checkpoint contains:
- model_state: Flux model weights
- hyperparams: Dict with vocab_size, n_embd, block_size, n_layer, n_head
- step: Training step
- best_val_loss: Best validation loss
Provenance
- Author: LisaMegaWatts
- Repository: DavinciDreams/micro-julia
- Training date: February 2026
- Architecture reference: GPT-2 (Radford et al., 2019), nanoGPT (Karpathy, 2023)
- Lineage: Evolved into JuliaGPT (custom autograd) and the Lux.jl model family
References
- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2).
- Karpathy, A. (2023). nanoGPT. GitHub repository.
Citation
```bibtex
@misc{microjulia2026,
  title={MicroJulia: A Minimal Character-Level GPT in Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MicroJulia}
}
```
License
MIT