Aashir Aftab
I study human cognition to code better.
Inside Cerebras: Wafer-Scale Architecture for 2000+ TPS AI Inference
Cerebras achieves 1000-2000+ tokens per second using their WSE-3 wafer-scale chip with 900,000 AI-optimized cores and 44GB of ultra-fast SRAM. Baseten optimizes standard NVIDIA GPUs for cost efficiency and scalability, achieving 341 TPS on Kimi K2.5. Mercury diffusion models (dLLMs) by Inception Labs generate tokens in parallel rather than sequentially, hitting 1000+ TPS on standard GPUs. Each approach suits different use cases: Cerebras for maximum speed, Baseten for flexibility and cost, Mercury for parallel, diffusion-style generation on commodity hardware.
- Cerebras WSE-3 is billed as the world's fastest AI processor: 900,000 cores and 44GB of on-chip SRAM, with single-clock-cycle memory access and a 2D mesh Swarm interconnect linking the cores.
- Cerebras partners include OpenAI (GPT-5.3-Codex-Spark), Meta (Scout at 2000+ TPS), GLM-4.7, Mayo Clinic, Notion, and Cognition.
- Baseten achieved 341 TPS on Kimi K2.5 and supports MiniMax M2.5, GLM 5, Whisper, and more with SOC 2 Type II and HIPAA compliance.
- Mercury diffusion models generate 5x faster than autoregressive models by producing tokens in parallel, with Mercury 2 offering 128K context.
- Pricing comparison: Mercury at $0.25/1M input and $0.75/1M output tokens. Cerebras and Baseten use competitive per-token models.
- Cerebras is OpenAI API compatible and SOC2/HIPAA certified. Mercury is available on AWS Bedrock and Azure Foundry.
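The pricing bullet above reduces to simple arithmetic. A minimal sketch using the quoted Mercury rates ($0.25 per 1M input tokens, $0.75 per 1M output tokens); the workload sizes are made up for illustration:

```python
def mercury_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate Mercury API cost from the per-million-token rates quoted above."""
    INPUT_RATE = 0.25   # USD per 1M input tokens (from the pricing bullet)
    OUTPUT_RATE = 0.75  # USD per 1M output tokens
    return input_tokens / 1e6 * INPUT_RATE + output_tokens / 1e6 * OUTPUT_RATE

# Hypothetical workload: 2M input tokens and 500K output tokens per day.
daily = mercury_cost_usd(2_000_000, 500_000)  # 0.50 + 0.375 = 0.875 USD
print(f"${daily:.2f} per day")
```

The point of writing it out: at these rates, input volume dominates only when you read far more than you generate; output tokens cost 3x as much.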
Finding Speed in Silicon
I've been thinking about the cost of waiting. Not the big delays, but the small slivers of time that disappear while we wait for a cursor to move. We often accept latency as a necessary tax for complexity. But after looking at Cerebras, I'm starting to think that tax might be optional.
Cerebras, based in Sunnyvale, built the Wafer-Scale Engine (WSE-3). It isn't just a chip. It is an entire wafer that was never cut into individual dies.
The silicon we choose not to cut
Standard GPUs are like separate islands connected by bridges. The interconnects are the bottleneck. You spend a lot of time just moving data between them. Cerebras is different because it is a single, contiguous continent of silicon.
900,000 cores. 44GB of SRAM. All on one uncut piece of silicon.
When memory access happens in a single clock cycle, the latency disappears. It is the hardware version of "ricing": optimizing the foundation until the friction is gone. It reminds me of trying to build the perfect local development setup. Minimal and focused.
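A back-of-envelope sketch of why single-cycle access matters. Every number here is an illustrative assumption for the sketch, not a vendor spec, and the model is a deliberate worst case (real hardware overlaps memory accesses; this pretends each one is paid serially):

```python
# Assumed figures, chosen only to show the shape of the comparison.
CLOCK_HZ = 1.0e9            # assume a 1 GHz core clock
CYCLE_NS = 1e9 / CLOCK_HZ   # 1 ns per cycle at 1 GHz

ON_WAFER_SRAM_NS = 1 * CYCLE_NS    # single-cycle access, per the WSE-3 claim
OFF_CHIP_HBM_NS = 100 * CYCLE_NS   # assumed ~100-cycle off-chip round trip

ACCESSES_PER_TOKEN = 1_000_000     # made-up count of memory touches per token

def stall_ms(latency_ns: float) -> float:
    """Worst-case memory-stall time per token if every access paid this latency serially."""
    return ACCESSES_PER_TOKEN * latency_ns / 1e6

for name, lat in [("on-wafer SRAM", ON_WAFER_SRAM_NS),
                  ("off-chip memory", OFF_CHIP_HBM_NS)]:
    print(f"{name}: {stall_ms(lat):.1f} ms of stalls per token (worst case)")
```

The absolute numbers are invented; the 100x ratio between the two lines is the argument. At a 0.5 ms-per-token budget (2000 TPS), a two-orders-of-magnitude latency gap is the difference between fitting in the budget and blowing it.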
Speed as a foundation
We usually talk about AI reasoning, but speed is a form of intelligence. When a model responds at 2000 tokens per second, the experience changes. It stops being a tool you call and starts being a stream of thought you can actually follow. Meta uses this for their Scout model, and OpenAI uses it for Codex.
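The difference these throughput numbers make is easiest to feel as wall-clock time. A quick sketch, using the TPS figures cited in this post and an arbitrary example response length:

```python
def seconds_to_stream(tokens: int, tps: float) -> float:
    """Wall-clock time to stream a response at a given tokens-per-second rate."""
    return tokens / tps

RESPONSE_TOKENS = 1_000  # arbitrary example response length

for label, tps in [("Baseten (Kimi K2.5)", 341),
                   ("Mercury (dLLM)", 1_000),
                   ("Cerebras (WSE-3)", 2_000)]:
    t = seconds_to_stream(RESPONSE_TOKENS, tps)
    print(f"{label:22s} {tps:>5} TPS -> {t:.2f} s")
```

A thousand tokens at 2000 TPS arrives in half a second. That is the threshold where a response stops feeling like output you wait for and starts feeling like a stream you follow.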
I'm interested in what happens when this becomes the standard.
Building with care is a phrase I keep coming back to. Speed isn't just about efficiency. It is about reducing the distance between having an idea and seeing it work. I am trying to raise the bar, even if I am still learning where the bar is.
The boundaries of a single continent
A single continent has its own limits. You can't run every model on this hardware. It requires a specific kind of architectural alignment. It is like choosing a minimalist framework. You gain speed and clarity, but you lose the flexibility of a larger, more bloated system.
Other providers take different paths. Baseten is practical. They optimize the GPU islands we already have. Mercury, from Inception Labs, rethinks the math of generation itself. They use diffusion models to generate tokens in parallel instead of one after another.
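The architectural difference behind Mercury can be caricatured in a few lines. This is a toy sketch of the control flow only, with random choices standing in for model forward passes (no real networks): an autoregressive decoder does one pass per token, while a diffusion-style decoder refines every position across a fixed number of passes, regardless of length.

```python
import random

random.seed(0)
VOCAB = ["the", "fast", "chip", "runs", "code"]

def autoregressive_decode(n_tokens: int) -> tuple[list[str], int]:
    """One forward pass per token: pass count scales with sequence length."""
    out, passes = [], 0
    for _ in range(n_tokens):
        out.append(random.choice(VOCAB))  # stand-in for a model forward pass
        passes += 1
    return out, passes

def diffusion_decode(n_tokens: int, n_steps: int = 8) -> tuple[list[str], int]:
    """Refine all positions in parallel: pass count scales with steps, not length."""
    out = [random.choice(VOCAB) for _ in range(n_tokens)]  # start from "noise"
    passes = 0
    for _ in range(n_steps):
        out = [random.choice(VOCAB) for _ in out]  # stand-in denoising pass
        passes += 1
    return out, passes

_, ar_passes = autoregressive_decode(256)
_, diff_passes = diffusion_decode(256)
print(ar_passes, diff_passes)  # prints "256 8"
```

The step count of 8 is an arbitrary choice for the sketch. The real insight is in the scaling: sequential decoding pays per token, parallel refinement pays per step, and that is where the parallelism of a GPU finally gets used for generation rather than just for the matmuls inside each pass.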
The work ahead
In the end, these are all different ways of trying to make the computer get out of the way. When the friction is gone, the tool disappears.
I am still figuring out my own approach to this. Building with care is a slow process, but it is the only way I know how to reach something that feels effortless. I am focused on the basics for now. One small win at a time. Every single day.
| Approach | Philosophy | Speed |
|---|---|---|
| Cerebras | Build a single continent. | 1000-2000+ TPS |
| Baseten | Optimize the bridges. | ~341 TPS |
| Mercury | Parallelize the generation. | 1000+ TPS |
The polish I want will come as I keep building.
Aftab, A. (2026). "Inside Cerebras: Wafer-Scale Architecture for 2000+ TPS AI Inference". Aashir Aftab's Portfolio. https://aftab.me/blog/fast-inference-cerebras-baseten