# MonarchSLM
A 4.98M parameter decoder-only Monarch Mixer model trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. To our knowledge, this is the first Monarch Mixer implementation in Julia.
Part of the Julia SLM family of models exploring alternative sequence mixing architectures.
## Model Family
MonarchSLM is the Monarch Mixer variant in a family of three architectures trained on the same data with matched parameter budgets:
| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| JuliaSLM | Transformer | 4-head causal attention + RoPE | 34.5 | 5.04M |
| MonarchSLM | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| SymbioSLM | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |
## Architecture

```
JuliaGPTModel (monarch)
+-- tok_emb: Embedding(2000 -> 256) [weight-tied with output head]
+-- blocks x 8:
|   +-- ln1: RMSNorm(256)
|   +-- seq_mixer: MonarchSequenceMixer
|   |   +-- conv: CausalDepthwiseConv1d(256, kernel=4)
|   |   +-- monarchs: 8 x MonarchMatrix(T=256, p=16)
|   |   |   +-- L1: (16, 16, 16)  # block-diagonal factor 1
|   |   |   +-- L2: (16, 16, 16)  # block-diagonal factor 2
|   |   +-- gate: LearnedGate(256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```
## How Monarch Sequence Mixing Works
Monarch matrices (Dao et al., 2022) factorize a T x T mixing matrix as:

M = P^T * BlockDiag(L1) * P * BlockDiag(L2)

where T = p^2 (here T = 256, p = 16), P is the reshape-transpose permutation, and L1, L2 are (p, p, p) tensors holding p block-diagonal p x p factors each.
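As an illustration of this factorization, here is a small NumPy sketch (Python rather than the repo's Julia; `blockdiag` and the explicit permutation matrix are illustrative constructions, not names from this codebase), using T = 16, p = 4 instead of the model's T = 256, p = 16:

```python
import numpy as np

p = 4
T = p * p  # the model uses p = 16, T = 256
rng = np.random.default_rng(0)

# L1, L2: (p, p, p) tensors -> p learned p x p diagonal blocks each
L1 = rng.standard_normal((p, p, p))
L2 = rng.standard_normal((p, p, p))

def blockdiag(L):
    """Embed p blocks of shape (p, p) into a T x T block-diagonal matrix."""
    M = np.zeros((T, T))
    for i in range(p):
        M[i*p:(i+1)*p, i*p:(i+1)*p] = L[i]
    return M

# P: the reshape-transpose permutation, sending index a*p + b to b*p + a
P = np.zeros((T, T))
for a in range(p):
    for b in range(p):
        P[b*p + a, a*p + b] = 1.0

# M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
M = P.T @ blockdiag(L1) @ P @ blockdiag(L2)

# causal variant: zero out the upper triangle before mixing
M_causal = np.tril(np.ones((T, T))) * M
```

Note the parameter saving: the factors store 2 * p^3 = 2 * T^(3/2) values (128 here) instead of the T^2 = 256 of a dense mixing matrix, and the gap widens as T grows.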
Per-head forward pass:
- Realize the T x T mixing matrix M from learned factors L1, L2
- Apply a multiplicative 0/1 causal mask (lower triangular)
- Multiply: each head's channel slice (32 channels) is mixed across the sequence dimension
- A short causal convolution (kernel=4) provides complementary local n-gram context
- Conv and Monarch outputs are combined via a learned sigmoid gate
No positional encoding is needed: the Monarch matrices learn position-dependent mixing patterns directly.
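The per-head forward pass above can be sketched in NumPy (a toy Python analogue, not the repo's Julia code; the exact gating form `g * conv + (1 - g) * monarch` is an assumption, as the source only says the branches are combined via a learned sigmoid gate):

```python
import numpy as np

T, H, Dh, K = 16, 2, 4, 4  # seq len, heads, channels/head, kernel (model: 256, 8, 32, 4)
rng = np.random.default_rng(1)
x = rng.standard_normal((T, H * Dh))

# per-head mixing matrices, realized from Monarch factors and causally masked
M = [np.tril(rng.standard_normal((T, T))) for _ in range(H)]

# depthwise causal conv: each channel sees only the current and K-1 previous steps
w = rng.standard_normal((H * Dh, K))
conv_out = np.zeros_like(x)
for t in range(T):
    for k in range(K):
        if t - k >= 0:
            conv_out[t] += w[:, k] * x[t - k]

# Monarch branch: each head's channel slice is mixed across the sequence dim
monarch_out = np.zeros_like(x)
for h in range(H):
    sl = slice(h * Dh, (h + 1) * Dh)
    monarch_out[:, sl] = M[h] @ x[:, sl]

# learned sigmoid gate blends the two branches per channel
g = 1.0 / (1.0 + np.exp(-rng.standard_normal(H * Dh)))
y = g * conv_out + (1.0 - g) * monarch_out
```

Because both the conv and the masked Monarch matrices are lower-triangular in time, position t of `y` depends only on positions <= t, preserving autoregressive causality.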
## Key Differences from Transformer
| Property | Transformer | Monarch Mixer |
|---|---|---|
| Sequence mixing | Dynamic (input-dependent attention) | Fixed (learned mixing matrices) |
| Position encoding | RoPE (separate) | None (implicit in Monarch matrices) |
| Complexity | O(T^2 * D) | O(T^(3/2)) realize + O(T^2) apply |
| Seq mixer params/block | 262K | 67K (74% reduction) |
| Layers (same param budget) | 6 | 8 (extra layers from param savings) |
## Parameter Efficiency
The 74% reduction in sequence mixing parameters (67K vs 262K per block) enables 2 extra layers at the same total parameter budget:
| Component | Params per block |
|---|---|
| CausalDepthwiseConv1d (K=4) | 1,024 |
| 8 x MonarchMatrix (2 x 16^3 each) | 65,536 |
| LearnedGate | 256 |
| Total sequence mixing | 66,816 |
| SwiGLU FFN | 491,520 |
| RMSNorm x 2 | 512 |
| Block total | 558,848 |
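The per-block counts in the table follow directly from the layer shapes; a quick arithmetic check (this assumes the SwiGLU FFN has bias-free gate, up, and down projections, which matches the totals):

```python
embed_dim, K, heads, p = 256, 4, 8, 16
ffn_hidden = 640

conv = embed_dim * K                # depthwise kernel: 256 * 4 = 1,024
monarch = heads * 2 * p**3          # 8 heads x two (16, 16, 16) factors = 65,536
gate = embed_dim                    # one gate value per channel = 256
seq_mixing = conv + monarch + gate  # 66,816

# SwiGLU: gate and up projections (256 -> 640 each) plus down (640 -> 256)
swiglu = 2 * embed_dim * ffn_hidden + ffn_hidden * embed_dim  # 491,520
rmsnorm = 2 * embed_dim             # two RMSNorm scale vectors = 512

block_total = seq_mixing + swiglu + rmsnorm  # 558,848
```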
## Model Details
| Parameter | Value |
|---|---|
| Total parameters | 4,983,040 |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 8 |
| Channels per head | 32 |
| Block size (p) | 16 (T = p^2 = 256) |
| Conv kernel size | 4 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | None (learned in Monarch matrices) |
| Weight tying | Yes |
## Training

| Parameter | Value |
|---|---|
| Dataset | philosophy-corpus |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 89 minutes |
| Throughput | ~19K tok/s |
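The optimizer and warmup rows above describe a standard linear-warmup, cosine-decay schedule; a sketch of its shape (the function below is illustrative and not taken from the training code):

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup=500, max_steps=12305):
    """Linear warmup to max_lr over `warmup` steps, then cosine decay to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / (max_steps - warmup)  # progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```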
## Training Curves
| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 7.28 | 5.58 | 265.4 |
| 2,000 | 4.29 | 4.21 | 67.6 |
| 6,000 | 3.83 | 3.81 | 45.3 |
| 10,000 | 3.69 | 3.68 | 39.6 |
| 12,305 | 3.66 | 3.65 | 38.4 |
## Key Findings
- Monarch Mixer achieves 89% of the baseline Transformer quality at the same parameter budget
- The 4x parameter reduction in sequence mixing (67K vs 262K per block) enables 2 extra layers
- The model learns coherent language generation using only fixed learned mixing patterns, with no dynamic attention
- Throughput is 27% lower than Transformer due to Monarch matrix realization overhead
- Both models generate coherent English with dialogue, grammar, and philosophical content
## Relationship to Symbiogenesis
MonarchSLM's Monarch matrices serve as one of three "organelles" in the Symbiogenesis architecture. In Symbiogenesis, Monarch provides the global sub-quadratic mixing component alongside CausalConv (local patterns) and LongConv (dense causal filtering), all fused via a learned per-channel OrganelleGate.
The biological metaphor: MonarchSLM is like a prokaryote, a single-organelle organism. SymbioSLM is the eukaryote, multiple organelles fused into one cell.
## Implementation

Built entirely in Julia:
- Lux.jl: explicit-parameter neural network framework
- Zygote.jl: automatic differentiation
- CUDA.jl: GPU acceleration
- NNlib.jl: batched_mul for Monarch realization, softmax, activations
Monarch matrix realization uses NNlib.batched_mul for the block-diagonal matrix multiplications, making it fully differentiable through Zygote.
Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
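For intuition, here is a NumPy analogue of that batched block-diagonal multiply (the repo uses NNlib.batched_mul in Julia; this Python sketch only mirrors the idea): applying BlockDiag(L) to a vector reduces to one batched (p, p) x (p,) multiply over the block dimension, never materializing the T x T matrix.

```python
import numpy as np

p = 4
T = p * p
rng = np.random.default_rng(2)
L = rng.standard_normal((p, p, p))  # p diagonal blocks of shape (p, p)
x = rng.standard_normal(T)

# batched form: one matmul per block over the leading "batch" dimension
y_batched = np.einsum('bij,bj->bi', L, x.reshape(p, p)).reshape(T)

# dense reference: materialize BlockDiag(L) and multiply
BD = np.zeros((T, T))
for i in range(p):
    BD[i*p:(i+1)*p, i*p:(i+1)*p] = L[i]
y_dense = BD @ x
```

Because the batched form is a plain einsum/batched matmul, it stays differentiable end-to-end, which is the property the Zygote-based training path relies on.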
## Usage

### OpenAI-Compatible API

Served via the MonarchSLM Space:

```bash
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```
### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="monarch", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=8, n_heads=4, head_dim=64,
    n_monarch_heads=8, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```
## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format, 74MB) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |
## Provenance
- Author: LisaMegaWatts
- Training code: DavinciDreams/julia-slm
- Data pipeline: DavinciDreams/text-pipeline
- Training date: February 2026
- Architecture reference: Monarch Mixer (Fu et al., 2023), adapted for Julia/Lux.jl
- First Julia implementation of Monarch Mixer sequence mixing
## References

- Dao, T., et al. (2022). Monarch: Expressive Structured Matrices for Efficient and Accurate Training. ICML 2022.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. NeurIPS 2023.
- Karpathy, A. (2023). nanoGPT. GitHub repository.
## Citation

```bibtex
@misc{monarchslm2026,
  title={MonarchSLM: A Monarch Mixer Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MonarchSLM}
}
```
## License
MIT
## Evaluation results

- Val PPL on philosophy-corpus (self-reported): 38.40
- Val loss on philosophy-corpus (self-reported): 3.65