# MonarchSLM
A 4.98M parameter decoder-only Monarch Mixer model trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. To our knowledge, this is the first Monarch Mixer implementation in Julia.
Part of the Julia SLM family of models exploring alternative sequence mixing architectures.
## Model Family
MonarchSLM is the Monarch Mixer variant in a family of three architectures trained on the same data with matched parameter budgets:
| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| JuliaSLM | Transformer | 4-head causal attention + RoPE | 34.5 | 5.04M |
| MonarchSLM | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| SymbioSLM | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |
## Architecture

```
JuliaGPTModel (monarch)
+-- tok_emb: Embedding(2000 -> 256) [weight-tied with output head]
+-- blocks x 8:
|   +-- ln1: RMSNorm(256)
|   +-- seq_mixer: MonarchSequenceMixer
|   |   +-- conv: CausalDepthwiseConv1d(256, kernel=4)
|   |   +-- monarchs: 8 x MonarchMatrix(T=256, p=16)
|   |   |   +-- L1: (16, 16, 16)  # block-diagonal factor 1
|   |   |   +-- L2: (16, 16, 16)  # block-diagonal factor 2
|   |   +-- gate: LearnedGate(256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```
## How Monarch Sequence Mixing Works
Monarch matrices (Dao et al., 2022) factorize a T x T mixing matrix as:

M = P^T * BlockDiag(L1) * P * BlockDiag(L2)

where T = p^2 (here T = 256, p = 16), P is the reshape-transpose permutation, and L1, L2 are (p, p, p) tensors holding p block-diagonal p x p factors each.
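As an illustration of this factorization, here is a small NumPy sketch (Python rather than the repo's Julia; `blockdiag` and the explicit permutation matrix are illustrative constructions, not names from this codebase), using T = 16, p = 4 instead of the model's T = 256, p = 16:

```python
import numpy as np

p = 4
T = p * p  # the model uses p = 16, T = 256
rng = np.random.default_rng(0)

# L1, L2: (p, p, p) tensors -> p learned p x p diagonal blocks each
L1 = rng.standard_normal((p, p, p))
L2 = rng.standard_normal((p, p, p))

def blockdiag(L):
    """Embed p blocks of shape (p, p) into a T x T block-diagonal matrix."""
    M = np.zeros((T, T))
    for i in range(p):
        M[i*p:(i+1)*p, i*p:(i+1)*p] = L[i]
    return M

# P: the reshape-transpose permutation, sending index a*p + b to b*p + a
P = np.zeros((T, T))
for a in range(p):
    for b in range(p):
        P[b*p + a, a*p + b] = 1.0

# M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
M = P.T @ blockdiag(L1) @ P @ blockdiag(L2)

# causal variant: zero out the upper triangle before mixing
M_causal = np.tril(np.ones((T, T))) * M
```

Note the parameter saving: the factors store 2 * p^3 = 2 * T^(3/2) values (128 here) instead of the T^2 = 256 of a dense mixing matrix, and the gap widens as T grows.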
Per-head forward pass:
- Realize the T x T mixing matrix M from learned factors L1, L2
- Apply a multiplicative 0/1 causal mask (lower triangular)
- Multiply: each head's channel slice (32 channels) is mixed across the sequence dimension
- A short causal convolution (kernel=4) provides complementary local n-gram context
- Conv and Monarch outputs are combined via a learned sigmoid gate
No positional encoding is needed: the Monarch matrices learn position-dependent mixing patterns directly.
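The per-head forward pass above can be sketched in NumPy (a toy Python analogue, not the repo's Julia code; the exact gating form `g * conv + (1 - g) * monarch` is an assumption, as the source only says the branches are combined via a learned sigmoid gate):

```python
import numpy as np

T, H, Dh, K = 16, 2, 4, 4  # seq len, heads, channels/head, kernel (model: 256, 8, 32, 4)
rng = np.random.default_rng(1)
x = rng.standard_normal((T, H * Dh))

# per-head mixing matrices, realized from Monarch factors and causally masked
M = [np.tril(rng.standard_normal((T, T))) for _ in range(H)]

# depthwise causal conv: each channel sees only the current and K-1 previous steps
w = rng.standard_normal((H * Dh, K))
conv_out = np.zeros_like(x)
for t in range(T):
    for k in range(K):
        if t - k >= 0:
            conv_out[t] += w[:, k] * x[t - k]

# Monarch branch: each head's channel slice is mixed across the sequence dim
monarch_out = np.zeros_like(x)
for h in range(H):
    sl = slice(h * Dh, (h + 1) * Dh)
    monarch_out[:, sl] = M[h] @ x[:, sl]

# learned sigmoid gate blends the two branches per channel
g = 1.0 / (1.0 + np.exp(-rng.standard_normal(H * Dh)))
y = g * conv_out + (1.0 - g) * monarch_out
```

Because both the conv and the masked Monarch matrices are lower-triangular in time, position t of `y` depends only on positions <= t, preserving autoregressive causality.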
## Key Differences from Transformer
| Property | Transformer | Monarch Mixer |
|---|---|---|
| Sequence mixing | Dynamic (input-dependent attention) | Fixed (learned mixing matrices) |
| Position encoding | RoPE (separate) | None (implicit in Monarch matrices) |
| Complexity | O(T^2 * D) | O(T^(3/2)) realize + O(T^2) apply |
| Seq mixer params/block | 262K | 67K (74% reduction) |
| Layers (same param budget) | 6 | 8 (extra layers from param savings) |
## Parameter Efficiency
The 74% reduction in sequence mixing parameters (67K vs 262K per block) enables 2 extra layers at the same total parameter budget:
| Component | Params per block |
|---|---|
| CausalDepthwiseConv1d (K=4) | 1,024 |
| 8 x MonarchMatrix (2 x 16^3 each) | 65,536 |
| LearnedGate | 256 |
| Total sequence mixing | 66,816 |
| SwiGLU FFN | 491,520 |
| RMSNorm x 2 | 512 |
| Block total | 558,848 |
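The per-block counts in the table follow directly from the layer shapes; a quick arithmetic check (this assumes the SwiGLU FFN has bias-free gate, up, and down projections, which matches the totals):

```python
embed_dim, K, heads, p = 256, 4, 8, 16
ffn_hidden = 640

conv = embed_dim * K                # depthwise kernel: 256 * 4 = 1,024
monarch = heads * 2 * p**3          # 8 heads x two (16, 16, 16) factors = 65,536
gate = embed_dim                    # one gate value per channel = 256
seq_mixing = conv + monarch + gate  # 66,816

# SwiGLU: gate and up projections (256 -> 640 each) plus down (640 -> 256)
swiglu = 2 * embed_dim * ffn_hidden + ffn_hidden * embed_dim  # 491,520
rmsnorm = 2 * embed_dim             # two RMSNorm scale vectors = 512

block_total = seq_mixing + swiglu + rmsnorm  # 558,848
```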
## Model Details
| Parameter | Value |
|---|---|
| Total parameters | 4,983,040 |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 8 |
| Channels per head | 32 |
| Block size (p) | 16 (T = p^2 = 256) |
| Conv kernel size | 4 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | None (learned in Monarch matrices) |
| Weight tying | Yes |
## Training

| Parameter | Value |
|---|---|
| Dataset | philosophy-corpus |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 89 minutes |
| Throughput | ~19K tok/s |
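The optimizer and warmup rows above describe a standard linear-warmup, cosine-decay schedule; a sketch of its shape (the function below is illustrative and not taken from the training code):

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup=500, max_steps=12305):
    """Linear warmup to max_lr over `warmup` steps, then cosine decay to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / (max_steps - warmup)  # progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```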
## Training Curves
| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 7.28 | 5.58 | 265.4 |
| 2,000 | 4.29 | 4.21 | 67.6 |
| 6,000 | 3.83 | 3.81 | 45.3 |
| 10,000 | 3.69 | 3.68 | 39.6 |
| 12,305 | 3.66 | 3.65 | 38.4 |
## Key Findings
- Monarch Mixer achieves 89% of the baseline Transformer quality at the same parameter budget
- The 4x parameter reduction in sequence mixing (67K vs 262K per block) enables 2 extra layers
- The model learns coherent language generation using only fixed learned mixing patterns, with no dynamic attention
- Throughput is 27% lower than Transformer due to Monarch matrix realization overhead
- Both models generate coherent English with dialogue, grammar, and philosophical content
## Relationship to Symbiogenesis
MonarchSLM's Monarch matrices serve as one of three "organelles" in the Symbiogenesis architecture. In Symbiogenesis, Monarch provides the global sub-quadratic mixing component alongside CausalConv (local patterns) and LongConv (dense causal filtering), all fused via a learned per-channel OrganelleGate.
The biological metaphor: MonarchSLM is like a prokaryote, a single-organelle organism. SymbioSLM is the eukaryote, multiple organelles fused into one cell.
## Implementation

Built entirely in Julia:
- Lux.jl: explicit-parameter neural network framework
- Zygote.jl: automatic differentiation
- CUDA.jl: GPU acceleration
- NNlib.jl: batched_mul for Monarch realization, softmax, activations
Monarch matrix realization uses NNlib.batched_mul for the block-diagonal matrix multiplications, making it fully differentiable through Zygote.
Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
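For intuition, here is a NumPy analogue of that batched block-diagonal multiply (the repo uses NNlib.batched_mul in Julia; this Python sketch only mirrors the idea): applying BlockDiag(L) to a vector reduces to one batched (p, p) x (p,) multiply over the block dimension, never materializing the T x T matrix.

```python
import numpy as np

p = 4
T = p * p
rng = np.random.default_rng(2)
L = rng.standard_normal((p, p, p))  # p diagonal blocks of shape (p, p)
x = rng.standard_normal(T)

# batched form: one matmul per block over the leading "batch" dimension
y_batched = np.einsum('bij,bj->bi', L, x.reshape(p, p)).reshape(T)

# dense reference: materialize BlockDiag(L) and multiply
BD = np.zeros((T, T))
for i in range(p):
    BD[i*p:(i+1)*p, i*p:(i+1)*p] = L[i]
y_dense = BD @ x
```

Because the batched form is a plain einsum/batched matmul, it stays differentiable end-to-end, which is the property the Zygote-based training path relies on.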
## Usage

### OpenAI-Compatible API

Served via the MonarchSLM Space:

```bash
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```
### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="monarch", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=8, n_heads=4, head_dim=64,
    n_monarch_heads=8, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```
## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format, 74MB) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |
## Provenance
- Author: LisaMegaWatts
- Training code: DavinciDreams/julia-slm
- Data pipeline: DavinciDreams/text-pipeline
- Training date: February 2026
- Architecture reference: Monarch Mixer (Fu et al., 2023), adapted for Julia/Lux.jl
- First Julia implementation of Monarch Mixer sequence mixing
## References

- Dao, T., et al. (2022). Monarch: Expressive Structured Matrices for Efficient and Accurate Training. ICML 2022.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. NeurIPS 2023.
- Karpathy, A. (2023). nanoGPT. GitHub repository.
## Citation

```bibtex
@misc{monarchslm2026,
  title={MonarchSLM: A Monarch Mixer Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MonarchSLM}
}
```
## License
MIT
## Evaluation results

- Val PPL on philosophy-corpus (self-reported): 38.40
- Val loss on philosophy-corpus (self-reported): 3.65