`max_position_embeddings` is set to 1024.
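A quick way to confirm this from the published config (a minimal sketch; it assumes the `codelion/dhara-70m` checkpoint exposes a standard `max_position_embeddings` field and may need `trust_remote_code` if the repo ships custom modeling code):

```python
from transformers import AutoConfig

# Load the config from the Hub; trust_remote_code is only needed if the
# repo includes custom modeling code (an assumption here).
config = AutoConfig.from_pretrained("codelion/dhara-70m", trust_remote_code=True)
print(config.max_position_embeddings)  # expected: 1024
```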
For evaluation, we used lm-evaluation-harness with a custom wrapper to handle diffusion-specific probability calculations for multiple-choice tasks.
For inference, we used the standard Transformers library. The diffusion models use a custom generate() method that handles parallel token generation with a configurable number of diffusion steps. Throughput was measured with batch size 1, generating 100 tokens per prompt, averaged over multiple runs.
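A minimal sketch of how such a throughput measurement could look. The `diffusion_steps` keyword is an assumption about the custom generate() signature, not a documented API, and the prompt and run count are placeholders:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True pulls in the repo's custom generate() implementation.
model_id = "codelion/dhara-70m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")  # batch size 1

runs, new_tokens = 5, 100
start = time.perf_counter()
for _ in range(runs):
    # diffusion_steps is a hypothetical kwarg for the custom parallel decoder
    model.generate(**inputs, max_new_tokens=new_tokens, diffusion_steps=32)
elapsed = time.perf_counter() - start

print(f"throughput: {runs * new_tokens / elapsed:.1f} tokens/sec")
```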
Adding `transformers` as the library name
I ran the numbers on layer-only params (excluding embeddings):
| Config | Hidden | Layers | Layer Params | Score | Tier |
|---|---|---|---|---|---|
| 4L | 768 | 4 | 28.3M | 31.98% | Low |
| 12L | 512 | 12 | 37.7M | 38.15% | High |
| 16L | 448 | 16 | 38.5M | 32.61% | Low |
| 24L | 384 | 24 | 42.5M | 31.79% | Low |
| 32L | 384 | 32 | 56.6M | 38.50% | High |
| 48L | 320 | 48 | 59.0M | 32.45% | Low |
| 64L | 256 | 64 | 50.3M | 38.21% | High |
The 48L config has the most layer params (59M) but is in the Low tier, while 12L has fewer (37.7M) and is High tier.
The hidden-dimension threshold still dominates. Per-layer representation width seems critical: with hidden=320 or 256, you create an information bottleneck that more layers can't overcome, unless you hit the critical depth thresholds (32 or 64 layers) where something else compensates.
This suggests the finding should be reframed as: at small scale, you need sufficient hidden dimension AND appropriate depth.
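For a quick sanity check, the layer-only numbers in the table can be reproduced from the per-layer estimate of roughly 12·d² parameters (attention + MLP), which the breakdown below walks through. A minimal sketch:

```python
# Layer-only parameter estimate: ~12 * d_model^2 per transformer layer
configs = {
    "4L":  (768, 4),  "12L": (512, 12), "16L": (448, 16), "24L": (384, 24),
    "32L": (384, 32), "48L": (320, 48), "64L": (256, 64),
}
for name, (d, layers) in configs.items():
    print(f"{name:>3}: {layers * 12 * d * d / 1e6:.1f}M layer params")
```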
(BTW, based on your earlier comment I've added a note to the article clarifying the parameter matching limitations — thanks for the feedback!)
Here's the full breakdown of where parameters come from:
- Embeddings (scale linearly with d_model)
  - Token embeddings: vocab_size × d_model = 50,257 × d
  - Position embeddings: 1,024 × d
  - Total: ~51,281 × d
- Per transformer layer (scales quadratically with d_model)
  - Attention (Q, K, V, O): 4 × d²
  - MLP (up + down, with 4x intermediate): 2 × d × 4d = 8d²
  - LayerNorms: ~4d (negligible)
  - Total per layer: ~12d²
- LM head
  - Usually tied with the embeddings (free), otherwise d × vocab_size
4L × 768:
- Embeddings: 51,281 × 768 ≈ 39.4M
- Layers: 4 × 12 × 768² ≈ 28.3M
- Total: ~68M
12L × 512:
- Embeddings: 51,281 × 512 ≈ 26.3M
- Layers: 12 × 12 × 512² ≈ 37.7M
- Total: ~64M
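Putting the two pieces together reproduces the totals above (a quick check, assuming tied embeddings so the LM head adds nothing):

```python
VOCAB, MAX_POS = 50_257, 1_024

def total_params(d_model: int, n_layers: int) -> float:
    embeddings = (VOCAB + MAX_POS) * d_model      # token + position embeddings
    layers = n_layers * 12 * d_model ** 2         # attention (4d^2) + MLP (8d^2)
    return (embeddings + layers) / 1e6            # in millions; LM head tied

print(f"4L x 768:  {total_params(768, 4):.1f}M")   # ~67.7M
print(f"12L x 512: {total_params(512, 12):.1f}M")  # ~64.0M
```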
Thanks for the references; I'll take a look.
High-throughput deployment use cases
Key findings from our research on optimal architectures for small language models:
→ Depth beats width: a 32-layer model outperforms a 12-layer one at the same parameter count
→ Best-in-class factuality: 47.5% on TruthfulQA
→ 10x training efficiency using WSD (Warmup-Stable-Decay) conversion
→ Canon layers add only 0.13% parameters but improve reasoning
We trained on 1B tokens using the optimal 50-30-20 dataset mix (PDFs + filtered web + educational content), then converted to diffusion with just 100M additional tokens.
Blog: https://huggingface.co/blog/codelion/optimal-model-architecture
Model: codelion/dhara-70m