DeepSeek trained its V3 model in late 2024 for a claimed $5.576 million using 2,048 Nvidia H800 GPUs and a Mixture-of-Experts architecture. Its newer V3.2 matches GPT-5-level math reasoning while costing about 10x less per token. The lab credits innovations like Multi-Head Latent Attention and FP8 precision for the savings. A V4 release is widely rumored.
It is pretty wild to watch a lab most people had not heard of a year ago keep pace with the giants while spending a fraction of the money.
DeepSeek’s V3 landed in December 2024 with a claim that turned heads: a frontier-level model trained for about $5.576 million. They did it on 2,048 Nvidia H800 GPUs, using a Mixture-of-Experts setup that activates only a fraction of the model per token (about 37B of its 671B parameters).
That architecture choice is everything. Instead of lighting up the whole network for every token, a router sends each token to a small set of specialized experts. Fewer FLOPs, less memory, faster training. They paired that with Multi-Head Latent Attention, which compresses the key-value cache into a compact latent vector, and FP8 mixed-precision training to cut compute further.
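Here is a minimal sketch of what top-k routing looks like in practice. Everything in it is illustrative: the sizes, the plain linear router, and top_k=2 are assumptions for clarity, not DeepSeek's actual configuration, which uses far more experts plus shared experts and its own load-balancing scheme.

```python
# Toy top-k Mixture-of-Experts layer. Dimensions and top_k are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                             # x: (tokens, d_model)
        scores = self.router(x)                       # (tokens, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e              # tokens routed to expert e
                if mask.any():                        # idle experts cost nothing
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(16, 64)      # 16 tokens
print(TinyMoE()(x).shape)    # torch.Size([16, 64])
```

The savings are visible in the mask: for any given token, only top_k of the n_experts MLPs ever execute, so compute scales with active parameters rather than total parameters.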
The result was not just cheaper training but cheaper inference too. Their V3.2-Exp is now posting 96% on AIME 2025 and 99.2% on HMMT 2025, which puts it in GPT-5 territory for math, but at roughly $0.028 per million tokens versus over a dollar for US rivals.
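To make that gap concrete, here is the back-of-the-envelope math. The $1.25 figure standing in for "over a dollar" is an assumption for illustration, not a quoted price.

```python
# Rough monthly cost comparison using the per-million-token prices above.
deepseek_per_m = 0.028   # USD per million tokens, as cited in the article
rival_per_m = 1.25       # USD per million tokens, an assumed stand-in

tokens = 500_000_000     # e.g. 500M tokens of monthly traffic

print(f"DeepSeek: ${deepseek_per_m * tokens / 1e6:,.2f}")  # $14.00
print(f"Rival:    ${rival_per_m * tokens / 1e6:,.2f}")     # $625.00
print(f"Ratio:    {rival_per_m / deepseek_per_m:.0f}x")    # 45x
```

At that volume the difference is $14 versus $625 a month, which is the kind of spread that changes what projects are economically viable.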
And they open-sourced it under an MIT license. That is the part that really shook the market. Suddenly anyone could run a top-tier model locally or build a cheap API without begging for credits.
Interestingly enough, the $5-6 million number is debated. Critics point out that it likely covers only the final training run, not the data work, failed experiments, and staff time that came before it, a caveat DeepSeek's own technical report acknowledges. Even if the true all-in cost is higher, the efficiency gap is still real.
For developers, the impact is immediate. Prices across the board are dropping because DeepSeek set a new anchor. If you can get 90% of the performance for 10% of the cost, the expensive proprietary models have to justify themselves.
Honestly, the most interesting thing here is not the benchmark scores. It is the proof that clever engineering beats brute force. Hardware-aware co-design, sparse activation, and precision tricks matter as much as how many GPUs you can buy.
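To see what a precision trick actually buys, here is a toy NumPy simulation of FP8 (E4M3) rounding on a weight matrix. It is a sketch under simplifying assumptions: per-tensor scaling and no subnormal or underflow handling, whereas DeepSeek's recipe reportedly uses finer-grained block-wise scaling.

```python
# Simulate FP8 E4M3 quantization (4 exponent bits, 3 mantissa bits) in float64.
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in E4M3

def quantize_mantissa(x, mantissa_bits=3):
    # Round each value to `mantissa_bits` of mantissa, as E4M3 storage does.
    x = np.asarray(x, dtype=np.float64)
    exp = np.floor(np.log2(np.abs(x) + 1e-30))  # per-element exponent
    step = 2.0 ** (exp - mantissa_bits)         # value spacing at that exponent
    return np.round(x / step) * step

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))

scale = E4M3_MAX / np.abs(W).max()   # per-tensor scale into the FP8 range
W_fp8 = np.clip(quantize_mantissa(W * scale), -E4M3_MAX, E4M3_MAX)
W_back = W_fp8 / scale               # dequantize to compare against the original

rel_err = np.abs(W - W_back) / np.abs(W)
print(f"max relative error: {rel_err.max():.2%}")  # a few percent at most
```

Rounding to 3 mantissa bits caps the relative error near 2^-4, about 6%, while halving memory and bandwidth versus FP16; the engineering work is keeping that error from compounding across thousands of training steps.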
Now everyone is waiting on V4. Leaks suggest more experts, better tool use, and even lower inference costs. If it lands where people expect, we will see another wave of startups ditching the big closed APIs.
Big labs will catch up; they always do. But DeepSeek changed the conversation. You do not need a $100 million training run to play at the frontier anymore. You need a good idea, and the discipline to build around your hardware limits.
