7 11 46

Mitko Vasilev

mitkox

AI & ML interests

Make sure you own your AI. AI in the cloud is not aligned with you; it's aligned with the company that owns it.

Recent Activity

posted an update about 16 hours ago

GLM-4.7-Flash is fast, good and cheap. 3,074 tokens/sec peak at 200k tokens context window on my desktop PC. Works with Claude Code and opencode for hours. No errors, drop-in replacement of the Anthropic cloud AI. MIT licensed, open weights, free for commercial use and modifications. Supports speculative decoding using MTP, which is highly effective in mitigating latency. Great for on device AI coding as AWQ 4bit at 18.5 GB. Hybrid inference on a single consumer GPU + CPU RAM.

posted an update 19 days ago

I just stress-tested the Beast: MiniMax-M2.1 on Z8 Fury G5. 2101 tokens/sec. FORTY concurrent clients. That's 609 t/s out, 1492 t/s in. The model outputs fire faster than I can type, but feeds on data like a black hole on cheat day. But wait, there's more! Threw it into Claude Code torture testing with 60+ tools, 8 agents (7 sub-agents because apparently one wasn't enough chaos). It didn't even flinch. Extremely fast, scary good at coding. The kind of performance that makes you wonder if the model's been secretly reading Stack Overflow in its spare time lol 3 months ago, these numbers lived in my "maybe in “2030 dreams. Today it's running on my desk AND heaths my home office during the winter!

posted an update about 1 month ago

Got to 1199.8 tokens/sec with Devstral Small -2 on my desktop GPU workstation. vLLM nightly. Works out of the box with Mistral Vibe. Next is time to test the big one.

View all activity

Organizations

posted an update about 16 hours ago

Post

143

GLM-4.7-Flash is fast, good and cheap.
3,074 tokens/sec peak at 200k tokens context window on my desktop PC.
Works with Claude Code and opencode for hours. No errors, drop-in replacement of the Anthropic cloud AI.
MIT licensed, open weights, free for commercial use and modifications.
Supports speculative decoding using MTP, which is highly effective in mitigating latency.
Great for on device AI coding as AWQ 4bit at 18.5 GB. Hybrid inference on a single consumer GPU + CPU RAM.

1 reply

posted an update 19 days ago

Post

3255

I just stress-tested the Beast: MiniMax-M2.1 on Z8 Fury G5.
2101 tokens/sec. FORTY concurrent clients. That's 609 t/s out, 1492 t/s in. The model outputs fire faster than I can type, but feeds on data like a black hole on cheat day.
But wait, there's more! Threw it into Claude Code torture testing with 60+ tools, 8 agents (7 sub-agents because apparently one wasn't enough chaos). It didn't even flinch. Extremely fast, scary good at coding. The kind of performance that makes you wonder if the model's been secretly reading Stack Overflow in its spare time lol
3 months ago, these numbers lived in my "maybe in “2030 dreams. Today it's running on my desk AND heaths my home office during the winter!

3 replies

posted an update about 1 month ago

Post

2369

Got to 1199.8 tokens/sec with Devstral Small -2 on my desktop GPU workstation. vLLM nightly.
Works out of the box with Mistral Vibe. Next is time to test the big one.

3 replies

posted an update about 2 months ago

Post

3175

I run 20 AI coding agents locally on my desktop workstation at 400+ tokens/sec with MiniMax-M2. It’s a Sonnet drop-in replacement in my Cursor, Claude Code, Droid, Kilo and Cline peak at 11k tok/sec input and 433 tok/s output, can generate 1B+ tok/m.All with 196k context window. I'm running it for 6 days now with this config.

Today max performance was stable at 490.2 tokens/sec across 48 concurrent clients and MiniMax M2.

Z8 Fury G5, Xeon 3455, 4xA6K. Aibrix 0.5.0, vLLM 0.11.2,

5 replies

posted an update 2 months ago

Post

4177

I just threw Qwen3-0.6B in BF16 into an on device AI drag race on AMD Strix Halo with vLLM:

564 tokens/sec on short 100-token sprints
96 tokens/sec on 8K-token marathons

TL;DR You don't just run AI on AMD. You negotiate with it.

The hardware absolutely delivers. Spoiler alert; there is exactly ONE configuration where vLLM + ROCm + Triton + PyTorch + Drivers + Ubuntu Kernel to work at the same time. Finding it required the patience of a saint

Consumer AMD for AI inference is the ultimate "budget warrior" play, insane performance-per-euro, but you need hardcore technical skills that would make a senior sysadmin nod in quiet respect.

1 reply

posted an update 3 months ago

Post

391

I have just vibe coded a feature for ODA on-device AI with MiniMax M2, running locally on my Z8 Fury - and holy silicon, this thing SLAPS!
TL;DR the nerd stuff

Specialized in coding and agentic work
60 tokens/sec
Ryzen AI is getting some serious ROCm 7.0.2 brain implants
One extra script to rule them all and bind them to my GPU
Vibe coding feature implementation that actually worked on the first try. I know, I'm scared too

posted an update 3 months ago

Post

1876

I’m just reading that Ryzen AI 395 has to be 30% slower than DGX Spark in LLM inferencing… and only 96GB GPU RAM… good I haven’t RTFM upfront, so I made the AMD faster with 128GB unified RAM 🫡
Z2 mini G1a can run Qwen3 Coder 30B BF16 at 26.8 tok/sec in ~60GB GPU RAM

posted an update 3 months ago

Post

2805

Say hello to my little friends! I just unboxed this trio of HP Z2 G1a!

Three is always better than one!
3x AMD Ryzen AI Max+ Pro 395
384GB RAM
24TB of RAID storage
Ubuntu 24.04
ROCm 7.0.2
llama cpp, vLLM and Aibrix

Small, cheap GPUs are about to become the Raspberry Pi of edge AI inference. Sprinkle some kubectl fairy dust on top, and suddenly it's a high-availability, self-healing, cloud-native, enterprise-grade AI cluster camping in a closet.

Make sure you own your AI. AI in the cloud is not aligned with you; it’s aligned with the company that owns it.

3 replies

posted an update 3 months ago

Post

2821

I see all Chinese labs are turning TL;DR into TL;DRGB

Problem: 1M text tokens == 1 M opportunities for your GPU to file worker-comp
Solution: don’t feed the model War & Peace—feed it the movie poster.

This is Glyph, Zai’s new visual-text compression voodoo:
• 10 k words → 3 PNGs ≈ 3 k visual tokens
• Compression ratio: 4.3×
• Throughput: 40-60 tok/s i.e. your context window now finishes before my coffee does

So I did the only reasonable thing: asked GLM-4.6 to port Glyph for Qwen3-VL-8B-Thinking.
Translation: I made one model compress a novel into a comic strip, then made another model read the comic strip and still ace QA.
It’s basically passing notes in class, except the note is a 1920×1080 meme and the teacher is a transformer.

We've gone from "Attention is All You Need" to "Attention is Too Expensive, Just Use Your Eyes." Remember kids: in 2025 literacy is optional, but JPEG is forever.

updated a model 3 months ago

mitkox/google-jefferson

0.1B • Updated Oct 15, 2025 • 2 • 1

published a model 3 months ago

mitkox/google-jefferson

0.1B • Updated Oct 15, 2025 • 2 • 1

liked a model 3 months ago

Kwaipilot/KAT-Dev-72B-Exp

Text Generation • 73B • Updated Oct 13, 2025 • 111 • 159

posted an update 3 months ago

Post

319

Friday evening. KAT-Dev-72B-Exp is spinning in Aibrix K8s. The GPUs in the Z8 are fired up. It's a LAN party for one. After 6 months on a diet of MoEs, I'd forgotten the main-course feeling of a dense 72B model.

1 reply

upvoted 2 collections 4 months ago

DeepSeek-V3.2

Collection

4 items • Updated Dec 1, 2025 • 516

Qwen3-Omni

Collection

6 items • Updated 22 days ago • 181

posted an update 4 months ago

Post

5666

I’ve built my blocker for AI-generated content. It’s a local AI running on my laptop with a browser extension that classifies and scrubs synthetic content from my eyeballs. I’m too old for this synthetic noise.

TL;DR I’m going full John Connor on the AI content apocalypse

Think of it as an on device AI ad-blocker, but for:
Em-dash overdose. Seriously, why is everything suddenly revolutionary—disruptive—life-changing?
AI influencers’ auto-generated posts and images, auto-posted, all hands-free.
Fake news, fake images, fake people... puff.

Surprisingly, it works. I suppose it will block some human-generated content. However, I would rather read a 2007 Myspace blog than another “10 Growth Hacks Powered By ChatGPT” post.

3 replies

posted an update 5 months ago

Post

399

Hermes4 70B synthetic dataset generation on my desktop Z8 GPU rig:
307 tok/sec
1.1M tok/hour

The bottleneck for generating massive, high-quality reinforcement learning datasets is never the GPU compute; it's always the model's willingness to actually answer the darn question.

liked a model 5 months ago

deepseek-ai/DeepSeek-V3.1-Base

Text Generation • 685B • Updated Aug 26, 2025 • 12.4k • 1.01k

posted an update 6 months ago

Post

1771

Earlier today, humanity faced a critical threat from a catastrophic chart crime. I asked my local Qwen3 Coder Flash to fix it. Sleep well, fellow humans. The visualization singularity is now high, and it runs with zero warnings.

2 replies

posted an update 6 months ago

Post

3508

I run Claude Code with Qwen3 Coder Flash locally on my MacBook Air. It works offline, zero cloud, zero internet, zero EU AI Act anxiety. No limit with all tokens on the house.

It’s not great, not terrible- adequate performance for an on device AI agent chewing through code on a 1.24 kg laptop. I wrote an interpreter to broker peace between Claude Code and my local AI runtime.

Make sure you own your AI. AI in the cloud is not aligned with you; it’s aligned with the company that owns it.

3 replies

Mitko Vasilev

AI & ML interests

Recent Activity

Organizations

mitkox's activity