MonarchSLM

A 4.98M parameter decoder-only Monarch Mixer model trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. To our knowledge, this is the first Monarch Mixer implementation in Julia.

Part of the Julia SLM family of models exploring alternative sequence mixing architectures.

Model Family

MonarchSLM is the Monarch Mixer variant in a family of three architectures trained on the same data with matched parameter budgets:

| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| JuliaSLM | Transformer | 4-head causal attention + RoPE | 34.5 | 5.04M |
| MonarchSLM | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| SymbioSLM | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

Architecture

```
JuliaGPTModel (monarch)
+-- tok_emb: Embedding(2000 -> 256)     [weight-tied with output head]
+-- blocks x 8:
|   +-- ln1: RMSNorm(256)
|   +-- seq_mixer: MonarchSequenceMixer
|   |   +-- conv: CausalDepthwiseConv1d(256, kernel=4)
|   |   +-- monarchs: 8 x MonarchMatrix(T=256, p=16)
|   |   |   +-- L1: (16, 16, 16)  # block-diagonal factor 1
|   |   |   +-- L2: (16, 16, 16)  # block-diagonal factor 2
|   |   +-- gate: LearnedGate(256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```

How Monarch Sequence Mixing Works

Monarch matrices (Dao et al., 2022) factorize a T x T mixing matrix as:

M = P^T * BlockDiag(L1) * P * BlockDiag(L2)

where T = p^2 (here T=256, p=16), P is the reshape-transpose permutation, and L1 and L2 are (p, p, p) tensors, each holding the p dense p x p blocks of one block-diagonal factor.

Per-head forward pass:

  1. Realize the T x T mixing matrix M from learned factors L1, L2
  2. Apply a multiplicative 0/1 causal mask (lower triangular)
  3. Multiply: each head's channel slice (32 channels) is mixed across the sequence dimension
  4. A short causal convolution (kernel=4) provides complementary local n-gram context
  5. Conv and Monarch outputs are combined via a learned sigmoid gate

No positional encoding is needed: the Monarch matrices learn position-dependent mixing patterns directly.
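The factorization and per-head forward pass above can be sketched in a few lines of Julia. This is a toy illustration with made-up function names, not the repository's actual API:

```julia
using LinearAlgebra

# Realize M = P' * BlockDiag(L1) * P * BlockDiag(L2), then apply the causal mask.
function realize_monarch(L1::Array{Float64,3}, L2::Array{Float64,3})
    p = size(L1, 1); T = p * p
    # BlockDiag(L): place the p dense p x p blocks along the diagonal
    blockdiag(L) = cat((L[:, :, i] for i in 1:p)...; dims=(1, 2))
    # P: the reshape-transpose permutation on vectors of length T = p^2
    P = zeros(T, T)
    for i in 1:p, j in 1:p
        P[(j-1)*p + i, (i-1)*p + j] = 1.0   # index (i,j) -> (j,i)
    end
    M = P' * blockdiag(L1) * P * blockdiag(L2)
    # multiplicative 0/1 causal mask: zero out contributions from future positions
    return M .* tril(ones(T, T))
end

p = 4                        # toy size; the model uses p = 16, T = 256
L1 = randn(p, p, p); L2 = randn(p, p, p)
M = realize_monarch(L1, L2)  # (T, T) causal mixing matrix
x = randn(p * p, 32)         # one head's channel slice: (T, channels-per-head)
y = M * x                    # mix across the sequence dimension
```

In the real model this runs once per head per block, and the conv and gate steps (4 and 5 above) combine `y` with a local causal convolution of the input.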

Key Differences from Transformer

| Property | Transformer | Monarch Mixer |
|---|---|---|
| Sequence mixing | Dynamic (input-dependent attention) | Fixed (learned mixing matrices) |
| Position encoding | RoPE (separate) | None (implicit in Monarch matrices) |
| Complexity | O(T^2 * D) | O(T^(3/2)) realize + O(T^2) apply |
| Seq mixer params/block | 262K | 67K (74% reduction) |
| Layers (same param budget) | 6 | 8 (extra layers from param savings) |

Parameter Efficiency

The 74% reduction in sequence mixing parameters (67K vs 262K per block) enables 2 extra layers at the same total parameter budget:

| Component | Params per block |
|---|---|
| CausalDepthwiseConv1d (K=4) | 1,024 |
| 8 x MonarchMatrix (2 x 16^3 each) | 65,536 |
| LearnedGate | 256 |
| Total sequence mixing | 66,816 |
| SwiGLU FFN | 491,520 |
| RMSNorm x 2 | 512 |
| Block total | 558,848 |
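The table's figures follow directly from the hyperparameters; a quick arithmetic check (assuming a bias-free depthwise conv, one gate weight per channel, and bias-free SwiGLU projections, consistent with the counts above):

```julia
D, p, heads, K, F = 256, 16, 8, 4, 640

conv    = D * K                  # depthwise conv: K weights per channel  -> 1,024
monarch = heads * 2 * p^3        # two (p, p, p) factors per head         -> 65,536
gate    = D                      # one gate weight per channel            -> 256
seq_mix = conv + monarch + gate  # total sequence mixing                  -> 66,816

ffn   = 3 * D * F                # SwiGLU: gate, up, and down projections -> 491,520
norms = 2 * D                    # two RMSNorms                           -> 512
block = seq_mix + ffn + norms    # block total                            -> 558,848
```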

Model Details

| Parameter | Value |
|---|---|
| Total parameters | 4,983,040 |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 8 |
| Channels per head | 32 |
| Block size (p) | 16 (T = p^2 = 256) |
| Conv kernel size | 4 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | None (learned in Monarch matrices) |
| Weight tying | Yes |

Training

| Setting | Value |
|---|---|
| Dataset | philosophy-corpus |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 89 minutes |
| Throughput | ~19K tok/s |
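The optimizer rows above describe a linear warmup followed by cosine decay to min_lr. A minimal sketch of that schedule (the assumed standard shape; the actual training script may differ in detail):

```julia
# Learning rate at a given step: linear warmup, then cosine decay to min_lr.
function lr_at(step; lr=6e-4, min_lr=6e-5, warmup=500, max_steps=12_305)
    step <= warmup && return lr * step / warmup            # linear warmup
    progress = (step - warmup) / (max_steps - warmup)      # 0 -> 1 after warmup
    return min_lr + 0.5 * (lr - min_lr) * (1 + cos(pi * progress))
end
```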

Training Curves

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 7.28 | 5.58 | 265.4 |
| 2,000 | 4.29 | 4.21 | 67.6 |
| 6,000 | 3.83 | 3.81 | 45.3 |
| 10,000 | 3.69 | 3.68 | 39.6 |
| 12,305 | 3.66 | 3.65 | 38.4 |

Key Findings

  • Monarch Mixer reaches 89% of the baseline Transformer's quality at the same parameter budget (val PPL 38.4 vs 34.5)
  • The 4x parameter reduction in sequence mixing (67K vs 262K per block) pays for 2 extra layers
  • The model learns coherent language generation using only fixed learned mixing patterns, with no dynamic attention
  • Throughput is 27% lower than the Transformer's due to Monarch matrix realization overhead
  • Both models generate coherent English with dialogue, grammar, and philosophical content

Relationship to Symbiogenesis

MonarchSLM's Monarch matrices serve as one of three "organelles" in the Symbiogenesis architecture. In Symbiogenesis, Monarch provides the global sub-quadratic mixing component alongside CausalConv (local patterns) and LongConv (dense causal filtering), all fused via a learned per-channel OrganelleGate.
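A hypothetical sketch of such a per-channel gate (the name, shapes, and parameterization here are illustrative assumptions, not SymbioSLM's actual code): softmax over three per-channel logits produces mixing weights that fuse the organelle outputs.

```julia
# Column-wise softmax: each channel's three logits become weights summing to 1.
softmax_cols(w) = exp.(w) ./ sum(exp.(w); dims=1)

# w: (3, D) logits; each organelle output is (D, T).
function fuse_organelles(w, conv_out, monarch_out, longconv_out)
    g = softmax_cols(w)
    # broadcast each per-channel weight vector down its organelle's rows
    return g[1, :] .* conv_out .+ g[2, :] .* monarch_out .+ g[3, :] .* longconv_out
end
```

With all-zero logits every channel weights its three organelles equally, so the fused output is their mean.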

The biological metaphor: MonarchSLM is like a prokaryote, a single-organelle organism. SymbioSLM is the eukaryote: multiple organelles fused into one cell.

Implementation

Built entirely in Julia:

  • Lux.jl: explicit-parameter neural network framework
  • Zygote.jl: automatic differentiation
  • CUDA.jl: GPU acceleration
  • NNlib.jl: batched_mul for Monarch realization, softmax, activations

Monarch matrix realization uses NNlib.batched_mul for the block-diagonal matrix multiplications, making it fully differentiable through Zygote.

Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).

Usage

OpenAI-Compatible API

Served via MonarchSLM Space:

```bash
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="monarch", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=8, n_heads=4, head_dim=64,
    n_monarch_heads=8, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```

Files

| File | Description |
|---|---|
| final.jld2 | Trained model parameters (JLD2 format, 74MB) |
| config.toml | Model architecture configuration |
| vocab.json | BPE vocabulary (2000 tokens) |
| merges.txt | BPE merge rules |

Provenance

  • Author: LisaMegaWatts
  • Training code: DavinciDreams/julia-slm
  • Data pipeline: DavinciDreams/text-pipeline
  • Training date: February 2026
  • Architecture reference: Monarch Mixer (Fu et al., 2023), adapted for Julia/Lux.jl
  • First Julia implementation of Monarch Mixer sequence mixing

References

  • Dao, T., et al. (2022). Monarch: Expressive Structured Matrices for Efficient and Accurate Training. ICML 2022.
  • Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. NeurIPS 2023.
  • Karpathy, A. (2023). nanoGPT. GitHub repository.

Citation

```bibtex
@misc{monarchslm2026,
  title={MonarchSLM: A Monarch Mixer Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MonarchSLM}
}
```

License

MIT
