`max_position_embeddings` is set to 1024.
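A quick way to confirm this from the published config (a minimal sketch; it assumes the `codelion/dhara-70m` checkpoint exposes a standard `max_position_embeddings` field and may need `trust_remote_code` if the repo ships custom modeling code):

```python
from transformers import AutoConfig

# Load the config from the Hub; trust_remote_code is only needed if the
# repo includes custom modeling code (an assumption here).
config = AutoConfig.from_pretrained("codelion/dhara-70m", trust_remote_code=True)
print(config.max_position_embeddings)  # expected: 1024
```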
For evaluation, we used lm-evaluation-harness with a custom wrapper to handle diffusion-specific probability calculations for multiple-choice tasks.
For inference, we used the standard Transformers library. The diffusion models use a custom generate() method that handles parallel token generation with a configurable number of diffusion steps. Throughput was measured with batch size 1, generating 100 tokens per prompt, averaged over multiple runs.
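A minimal sketch of how such a throughput measurement could look. The `diffusion_steps` keyword is an assumption about the custom generate() signature, not a documented API, and the prompt and run count are placeholders:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True pulls in the repo's custom generate() implementation.
model_id = "codelion/dhara-70m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")  # batch size 1

runs, new_tokens = 5, 100
start = time.perf_counter()
for _ in range(runs):
    # diffusion_steps is a hypothetical kwarg for the custom parallel decoder
    model.generate(**inputs, max_new_tokens=new_tokens, diffusion_steps=32)
elapsed = time.perf_counter() - start

print(f"throughput: {runs * new_tokens / elapsed:.1f} tokens/sec")
```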
Adding `transformers` as the library name
I ran the numbers on layer-only params (excluding embeddings):
| Config | Hidden | Layers | Layer Params | Score | Tier |
|---|---|---|---|---|---|
| 4L | 768 | 4 | 28.3M | 31.98% | Low |
| 12L | 512 | 12 | 37.7M | 38.15% | High |
| 16L | 448 | 16 | 38.5M | 32.61% | Low |
| 24L | 384 | 24 | 42.5M | 31.79% | Low |
| 32L | 384 | 32 | 56.6M | 38.50% | High |
| 48L | 320 | 48 | 59.0M | 32.45% | Low |
| 64L | 256 | 64 | 50.3M | 38.21% | High |
The 48L config has the most layer params (59M) but is in the Low tier, while 12L has fewer (37.7M) and is High tier.
The hidden-dimension threshold still dominates. Per-layer representation width seems critical: with hidden=320 or 256, you create an information bottleneck that more layers can't overcome, unless you hit the critical depth thresholds (32 or 64 layers) where something else compensates.
This suggests the finding should be reframed as: at small scale, you need sufficient hidden dimension AND appropriate depth.
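For a quick sanity check, the layer-only numbers in the table can be reproduced from the per-layer estimate of roughly 12·d² parameters (attention + MLP), which the breakdown below walks through. A minimal sketch:

```python
# Layer-only parameter estimate: ~12 * d_model^2 per transformer layer
configs = {
    "4L":  (768, 4),  "12L": (512, 12), "16L": (448, 16), "24L": (384, 24),
    "32L": (384, 32), "48L": (320, 48), "64L": (256, 64),
}
for name, (d, layers) in configs.items():
    print(f"{name:>3}: {layers * 12 * d * d / 1e6:.1f}M layer params")
```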
(BTW, based on your earlier comment I've added a note to the article clarifying the parameter matching limitations — thanks for the feedback!)
Here's the full breakdown of where parameters come from:
- Embeddings (scale linearly with d_model)
  - Token embeddings: vocab_size × d_model = 50,257 × d
  - Position embeddings: 1,024 × d
  - Total: ~51,281 × d
- Per transformer layer (scales quadratically with d_model)
  - Attention (Q, K, V, O): 4 × d²
  - MLP (up + down, with 4x intermediate): 2 × d × 4d = 8d²
  - LayerNorms: ~4d (negligible)
  - Total per layer: ~12d²
- LM head
  - Usually tied with the embeddings (free), otherwise d × vocab_size
4L × 768:
- Embeddings: 51,281 × 768 ≈ 39.4M
- Layers: 4 × 12 × 768² ≈ 28.3M
- Total: ~68M
12L × 512:
- Embeddings: 51,281 × 512 ≈ 26.3M
- Layers: 12 × 12 × 512² ≈ 37.7M
- Total: ~64M
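Putting the two pieces together reproduces the totals above (a quick check, assuming tied embeddings so the LM head adds nothing):

```python
VOCAB, MAX_POS = 50_257, 1_024

def total_params(d_model: int, n_layers: int) -> float:
    embeddings = (VOCAB + MAX_POS) * d_model      # token + position embeddings
    layers = n_layers * 12 * d_model ** 2         # attention (4d^2) + MLP (8d^2)
    return (embeddings + layers) / 1e6            # in millions; LM head tied

print(f"4L x 768:  {total_params(768, 4):.1f}M")   # ~67.7M
print(f"12L x 512: {total_params(512, 12):.1f}M")  # ~64.0M
```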
Thanks for the references; I'll take a look.
High-throughput deployment use cases
Key findings from our research on optimal architectures for small language models:
→ Depth beats width: a 32-layer model outperforms a 12-layer one at the same parameter count
→ Best-in-class factuality: 47.5% on TruthfulQA
→ 10x training efficiency using WSD (Warmup-Stable-Decay) conversion
→ Canon layers add only 0.13% parameters but improve reasoning
We trained on 1B tokens using the optimal 50-30-20 dataset mix (PDFs + filtered web + educational content), then converted to diffusion with just 100M additional tokens.
Blog: https://huggingface.co/blog/codelion/optimal-model-architecture
Model: codelion/dhara-70m