Aashir Aftab
I study human cognition to code better.
Inside Cerebras: Wafer-Scale Architecture for 2000+ TPS AI Inference
Cerebras achieves 1000-2000+ tokens per second using their WSE-3 wafer-scale chip with 900,000 AI-optimized cores and 44GB of ultra-fast SRAM. Baseten optimizes standard NVIDIA GPUs for cost efficiency and scalability, achieving 341 TPS on Kimi K2.5. Mercury diffusion models (dLLMs) by Inception Labs generate tokens in parallel rather than sequentially, hitting 1000+ TPS on standard GPUs. Each approach suits different use cases: Cerebras for maximum speed, Baseten for flexibility and cost, Mercury for parallel, diffusion-style generation on commodity hardware.
- Cerebras WSE-3 is billed as the world's fastest AI processor: 900,000 cores and 44GB of on-chip SRAM, with single-clock-cycle memory access and a 2D mesh Swarm interconnect linking the cores.
- Cerebras partners include OpenAI (GPT-5.3-Codex-Spark), Meta (Scout at 2000+ TPS), GLM-4.7, Mayo Clinic, Notion, and Cognition.
- Baseten achieved 341 TPS on Kimi K2.5 and supports MiniMax M2.5, GLM 5, Whisper, and more with SOC 2 Type II and HIPAA compliance.
- Mercury diffusion models generate 5x faster than autoregressive models by producing tokens in parallel, with Mercury 2 offering 128K context.
- Pricing comparison: Mercury at $0.25/1M input and $0.75/1M output tokens. Cerebras and Baseten use competitive per-token models.
- Cerebras is OpenAI API compatible and SOC2/HIPAA certified. Mercury is available on AWS Bedrock and Azure Foundry.
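The pricing bullet above reduces to simple arithmetic. A minimal sketch using the quoted Mercury rates ($0.25 per 1M input tokens, $0.75 per 1M output tokens); the workload sizes are made up for illustration:

```python
def mercury_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate Mercury API cost from the per-million-token rates quoted above."""
    INPUT_RATE = 0.25   # USD per 1M input tokens (from the pricing bullet)
    OUTPUT_RATE = 0.75  # USD per 1M output tokens
    return input_tokens / 1e6 * INPUT_RATE + output_tokens / 1e6 * OUTPUT_RATE

# Hypothetical workload: 2M input tokens and 500K output tokens per day.
daily = mercury_cost_usd(2_000_000, 500_000)  # 0.50 + 0.375 = 0.875 USD
print(f"${daily:.2f} per day")
```

The point of writing it out: at these rates, input volume dominates only when you read far more than you generate; output tokens cost 3x as much.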
Finding Speed in Silicon
I've been thinking about the cost of waiting. Not the big delays, but the small slivers of time that disappear while we wait for a cursor to move. We often accept latency as a necessary tax for complexity. But after looking at Cerebras, I'm starting to think that tax might be optional.
Cerebras, based in Sunnyvale, built the Wafer-Scale Engine (WSE-3). It isn't just a chip. It is an entire wafer that was never cut into individual dies.
The silicon we choose not to cut
Standard GPUs are like separate islands connected by bridges. The interconnects are the bottleneck. You spend a lot of time just moving data between them. Cerebras is different because it is a single, contiguous continent of silicon.
900,000 cores. 44GB of SRAM. All on one uncut piece of silicon.
When memory access happens in a single clock cycle, the latency disappears. It is the hardware version of "ricing": optimizing the foundation until the friction is gone. It reminds me of trying to build the perfect local development setup. Minimal and focused.
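A back-of-envelope sketch of why single-cycle access matters. Every number here is an illustrative assumption for the sketch, not a vendor spec, and the model is a deliberate worst case (real hardware overlaps memory accesses; this pretends each one is paid serially):

```python
# Assumed figures, chosen only to show the shape of the comparison.
CLOCK_HZ = 1.0e9            # assume a 1 GHz core clock
CYCLE_NS = 1e9 / CLOCK_HZ   # 1 ns per cycle at 1 GHz

ON_WAFER_SRAM_NS = 1 * CYCLE_NS    # single-cycle access, per the WSE-3 claim
OFF_CHIP_HBM_NS = 100 * CYCLE_NS   # assumed ~100-cycle off-chip round trip

ACCESSES_PER_TOKEN = 1_000_000     # made-up count of memory touches per token

def stall_ms(latency_ns: float) -> float:
    """Worst-case memory-stall time per token if every access paid this latency serially."""
    return ACCESSES_PER_TOKEN * latency_ns / 1e6

for name, lat in [("on-wafer SRAM", ON_WAFER_SRAM_NS),
                  ("off-chip memory", OFF_CHIP_HBM_NS)]:
    print(f"{name}: {stall_ms(lat):.1f} ms of stalls per token (worst case)")
```

The absolute numbers are invented; the 100x ratio between the two lines is the argument. At a 0.5 ms-per-token budget (2000 TPS), a two-orders-of-magnitude latency gap is the difference between fitting in the budget and blowing it.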
Speed as a foundation
We usually talk about AI reasoning, but speed is a form of intelligence. When a model responds at 2000 tokens per second, the experience changes. It stops being a tool you call and starts being a stream of thought you can actually follow. Meta uses this for their Scout model, and OpenAI uses it for Codex.
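The difference these throughput numbers make is easiest to feel as wall-clock time. A quick sketch, using the TPS figures cited in this post and an arbitrary example response length:

```python
def seconds_to_stream(tokens: int, tps: float) -> float:
    """Wall-clock time to stream a response at a given tokens-per-second rate."""
    return tokens / tps

RESPONSE_TOKENS = 1_000  # arbitrary example response length

for label, tps in [("Baseten (Kimi K2.5)", 341),
                   ("Mercury (dLLM)", 1_000),
                   ("Cerebras (WSE-3)", 2_000)]:
    t = seconds_to_stream(RESPONSE_TOKENS, tps)
    print(f"{label:22s} {tps:>5} TPS -> {t:.2f} s")
```

A thousand tokens at 2000 TPS arrives in half a second. That is the threshold where a response stops feeling like output you wait for and starts feeling like a stream you follow.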
I'm interested in what happens when this becomes the standard.
Building with care is a phrase I keep coming back to. Speed isn't just about efficiency. It is about reducing the distance between having an idea and seeing it work. I am trying to raise the bar, even if I am still learning where the bar is.
The boundaries of a single continent
A single continent has its own limits. You can't run every model on this hardware. It requires a specific kind of architectural alignment. It is like choosing a minimalist framework. You gain speed and clarity, but you lose the flexibility of a larger, more bloated system.
Other providers take different paths. Baseten is practical. They optimize the GPU islands we already have. Mercury, from Inception Labs, rethinks the math of generation itself. They use diffusion models to generate tokens in parallel instead of one after another.
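The architectural difference behind Mercury can be caricatured in a few lines. This is a toy sketch of the control flow only, with random choices standing in for model forward passes (no real networks): an autoregressive decoder does one pass per token, while a diffusion-style decoder refines every position across a fixed number of passes, regardless of length.

```python
import random

random.seed(0)
VOCAB = ["the", "fast", "chip", "runs", "code"]

def autoregressive_decode(n_tokens: int) -> tuple[list[str], int]:
    """One forward pass per token: pass count scales with sequence length."""
    out, passes = [], 0
    for _ in range(n_tokens):
        out.append(random.choice(VOCAB))  # stand-in for a model forward pass
        passes += 1
    return out, passes

def diffusion_decode(n_tokens: int, n_steps: int = 8) -> tuple[list[str], int]:
    """Refine all positions in parallel: pass count scales with steps, not length."""
    out = [random.choice(VOCAB) for _ in range(n_tokens)]  # start from "noise"
    passes = 0
    for _ in range(n_steps):
        out = [random.choice(VOCAB) for _ in out]  # stand-in denoising pass
        passes += 1
    return out, passes

_, ar_passes = autoregressive_decode(256)
_, diff_passes = diffusion_decode(256)
print(ar_passes, diff_passes)  # prints "256 8"
```

The step count of 8 is an arbitrary choice for the sketch. The real insight is in the scaling: sequential decoding pays per token, parallel refinement pays per step, and that is where the parallelism of a GPU finally gets used for generation rather than just for the matmuls inside each pass.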
The work ahead
In the end, these are all different ways of trying to make the computer get out of the way. When the friction is gone, the tool disappears.
I am still figuring out my own approach to this. Building with care is a slow process, but it is the only way I know how to reach something that feels effortless. I am focused on the basics for now. One small win at a time. Every single day.
| Approach | Philosophy | Speed |
|---|---|---|
| Cerebras | Build a single continent. | 1000-2000+ TPS |
| Baseten | Optimize the bridges. | ~341 TPS |
| Mercury | Parallelize the generation. | 1000+ TPS |
The polish I want will come as I keep building.
Aftab, A. (2026). "Inside Cerebras: Wafer-Scale Architecture for 2000+ TPS AI Inference". Aashir Aftab's Portfolio. https://aftab.me/blog/fast-inference-cerebras-baseten