MicroJulia
A GPT-2 style character-level transformer trained on classical philosophy texts, implemented in Julia with Flux.jl. The first model in the Julia SLM lineage: a minimal proof-of-concept that established the training and serving infrastructure.
Model Family Context
MicroJulia is the starting point of an architectural progression:
| Model | Generation | Architecture | Tokenizer | Framework |
|---|---|---|---|---|
| MicroJulia | 1st | GPT-2 (LayerNorm, GELU, learned pos) | Character-level | Flux.jl |
| JuliaFluxGPT | 2nd | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) | BPE 2000 | Flux.jl |
| JuliaSLM | 3rd | Modern Transformer (RMSNorm, SwiGLU, RoPE) | BPE 2000 | Lux.jl |
| MonarchSLM | 3rd | Monarch Mixer (sub-quadratic) | BPE 2000 | Lux.jl |
| SymbioSLM | 3rd | Symbiogenesis (3 organelles) | BPE 2000 | Lux.jl |
Architecture
Classic GPT-2 design, deliberately minimal:
```
GPT (GPT-2 style)
+-- wte: Embedding(vocab_size -> n_embd)    [token embeddings]
+-- wpe: Embedding(block_size -> n_embd)    [learned position embeddings]
+-- drop: Dropout
+-- blocks x N:
|   +-- ln1: LayerNorm(n_embd)
|   +-- attn: CausalSelfAttention
|   |   +-- qkv: Dense(n_embd -> 3*n_embd)  [fused Q/K/V projection]
|   |   +-- proj: Dense(n_embd -> n_embd)
|   +-- ln2: LayerNorm(n_embd)
|   +-- ffwd: FeedForward
|       +-- Dense(n_embd -> 4*n_embd)
|       +-- GELU
|       +-- Dense(4*n_embd -> n_embd)
+-- ln_f: LayerNorm(n_embd)
+-- lm_head: Dense(n_embd -> vocab_size)
```
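The per-block data flow above (pre-norm, fused QKV, causal attention, 4x GELU FFN) can be sketched in plain Julia without Flux. This is an illustrative single-head sketch, not MicroJulia's actual code: names, shapes, and the single-sequence `(n_embd, T)` layout are assumptions, and the real `CausalSelfAttention` splits the projection into `n_head` heads.

```julia
using Statistics  # mean, var

# Illustrative sketch of one pre-norm GPT-2 block (plain Julia, no Flux).
# x is (n_embd, T) for a single sequence; all weights are plain matrices.

layernorm(x) = (x .- mean(x; dims=1)) ./ sqrt.(var(x; dims=1) .+ 1f-5)
gelu(x) = 0.5f0 .* x .* (1f0 .+ tanh.(0.79788456f0 .* (x .+ 0.044715f0 .* x .^ 3)))

function block(x, Wqkv, Wproj, W1, W2)
    n, T = size(x)
    # --- attention sub-layer (with residual) ---
    h = layernorm(x)
    qkv = Wqkv * h                                        # fused Q/K/V projection
    q, k, v = qkv[1:n, :], qkv[n+1:2n, :], qkv[2n+1:3n, :]
    scores = (k' * q) ./ sqrt(Float32(n))                 # scores[i, j]: key i vs query j
    mask = [i > j ? -Inf32 : 0f0 for i in 1:T, j in 1:T]  # causal: no future keys
    a = exp.(scores .+ mask)
    a = a ./ sum(a; dims=1)                               # softmax over keys
    x = x .+ Wproj * (v * a)
    # --- feed-forward sub-layer (4x expansion, GELU, with residual) ---
    x = x .+ W2 * gelu(W1 * layernorm(x))
    return x
end
```

Because the mask zeros out attention to future positions, perturbing the last token cannot change the outputs at earlier positions.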
Key Design Choices (GPT-2 era)
| Component | MicroJulia (GPT-2) | Later Models (LLaMA-style) |
|---|---|---|
| Normalization | LayerNorm (with bias) | RMSNorm (no bias) |
| Activation | GELU | SwiGLU |
| Position encoding | Learned embeddings | RoPE |
| QKV projection | Fused single Dense | Separate Q, K, V |
| FFN | Standard 4x expansion | SwiGLU (hidden width scaled by 2/3) |
| Output head | Separate lm_head | Weight-tied with embedding |
| Tokenizer | Character-level (28 chars) | BPE (2000 tokens) |
Character-Level Tokenization
Uses a minimal character vocabulary:
a-z, space, period (28 characters)
Each character maps directly to a token ID. No subword segmentation: the model must learn word boundaries, morphology, and syntax from individual characters.
Trade-offs:
- Simpler tokenizer implementation
- No OOV (out-of-vocabulary) issues
- Model must spend capacity on character-level patterns
- Less efficient than BPE for the same context window
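A character-level encode/decode over this 28-symbol vocabulary is a few lines of plain Julia. The helper names below are illustrative, not MicroJulia's actual API:

```julia
# Illustrative character-level tokenizer over the stated vocabulary:
# a-z (IDs 1-26), space (27), period (28).
const CHARS = collect("abcdefghijklmnopqrstuvwxyz .")
const CHAR_TO_ID = Dict(c => i for (i, c) in enumerate(CHARS))

encode(s::AbstractString) = [CHAR_TO_ID[c] for c in lowercase(s)]
decode(ids) = String(CHARS[ids])
```

Since each character consumes one position, a block_size-T context holds only T characters, which is the efficiency cost relative to BPE noted above.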
Model Details
| Parameter | Value |
|---|---|
| Architecture | GPT-2 style (pre-norm Transformer) |
| Tokenizer | Character-level (28 characters) |
| Position encoding | Learned position embeddings |
| Normalization | LayerNorm |
| Activation | GELU |
| Output projection | Separate Dense (not weight-tied) |
| Framework | Julia + Flux.jl |
Exact dimensions (vocab_size, n_embd, n_layer, n_head, block_size) are stored in the checkpoint hyperparams dict and loaded dynamically.
Training
| Setting | Value |
|---|---|
| Dataset | Classical philosophy texts |
| Tokenizer | Character-level mapping |
| Framework | Julia + Flux.jl |
| Hardware | Google Colab / NVIDIA GPU |
| Precision | Float32 |
Implementation Notes
Causal Masking
Uses a pre-computed additive upper-triangular mask (global constant), applied to attention scores before softmax:

```julia
CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)
```
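A quick standalone way to see the mask's effect (block_size is shrunk to 4 here for readability; this is a sketch, not the repo's code):

```julia
using LinearAlgebra  # triu

block_size = 4
mask = triu(fill(-Inf32, block_size, block_size), 1)  # -Inf above the diagonal, 0 elsewhere

# Adding the mask to (zero) scores and softmaxing row-wise: row i (query)
# attends only to positions 1..i (keys), uniformly since scores are zero.
scores = zeros(Float32, block_size, block_size) .+ mask
w = exp.(scores)
w = w ./ sum(w; dims=2)
```

Row 1 puts all its weight on position 1, while row 4 spreads weight evenly over all four positions.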
Position Embeddings
Learned absolute position embeddings (not RoPE):
```julia
tok = wte(token_ids)   # (C, T, B)
pos = wpe(1:T)         # (C, T, 1), broadcast across the batch
x = tok .+ pos
```
Limited to the trained block_size β no length extrapolation.
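In plain Julia terms, both embedding tables are just matrices indexed by ID, so the lookup amounts to column selection plus an elementwise add. Dimensions below are illustrative, not the trained model's:

```julia
n_embd, vocab_size, block_size = 8, 28, 16
wte = randn(Float32, n_embd, vocab_size)   # token embedding table
wpe = randn(Float32, n_embd, block_size)   # learned position table

token_ids = [3, 1, 20]                     # "cat" under the a-z vocabulary
T = length(token_ids)
tok = wte[:, token_ids]                    # (n_embd, T)
pos = wpe[:, 1:T]                          # (n_embd, T)
x = tok .+ pos                             # positions > block_size have no row: no extrapolation
```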
Usage
OpenAI-Compatible API
Served via MicroJulia Space:
```bash
curl -X POST https://lisamegawatts-microjulia.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "hello"}],
    "stream": true
  }'
```
Files
| File | Description |
|---|---|
| checkpoint.jld2 | Trained model weights + hyperparams (JLD2 format) |
| vocab.json | Character vocabulary mapping |
Checkpoint contains:
- model_state: Flux model weights
- hyperparams: Dict with vocab_size, n_embd, block_size, n_layer, n_head
- step: Training step
- best_val_loss: Best validation loss
Provenance
- Author: LisaMegaWatts
- Repository: DavinciDreams/micro-julia
- Training date: February 2026
- Architecture reference: GPT-2 (Radford et al., 2019), nanoGPT (Karpathy, 2023)
- Lineage: Evolved into JuliaGPT (custom autograd) and the Lux.jl model family
References
- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2).
- Karpathy, A. (2023). nanoGPT. GitHub repository.
Citation
```bibtex
@misc{microjulia2026,
  title={MicroJulia: A Minimal Character-Level GPT in Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MicroJulia}
}
```
License
MIT