Aashir Aftab

MoE at the Edge: Making Sparse Activation Work on Your Phone

2026-03-12
10 min read
Executive Summary

Mixture of Experts (MoE) provides the capacity of a large model with the compute footprint of a small one by selectively activating parameters. In my implementation, TinyMoE, I converted Qwen3-0.6B into a sparse MoE with 4 experts per layer, achieving 20 tokens/sec on edge devices. By using selective middle-layer upcycling and task-aware routing, we can dynamically scale compute based on device battery or thermal constraints, cutting energy use by 32% with minimal accuracy loss. This approach bridges the architectural gap between frontier models (GPT-4o, Claude 3.5 Sonnet, Mixtral) and local, mobile execution.

Key Insights
  • MoE activates only top-k experts per token (k=2), enabling immense capacity with minimal inference compute, mirroring the assumed architecture of GPT-4o and Mixtral 8x22B.
  • Upcycling preserves existing pre-trained knowledge by initializing experts from dense FFN weights with scaled noise, bypassing massive from-scratch training costs.
  • Selective layer conversion (layers 8, 12, 16, 20 of 28 in Qwen3-0.6B) targets the middle layers, where representations are diverse enough to benefit from experts, without disrupting early feature extraction.
  • Auxiliary load balancing loss (num_experts × Σ(load × importance)) prevents router collapse, ensuring token workload is uniformly distributed.
  • Task-aware routing adapts top-k utilization based on device state (battery life, thermal limits) and query difficulty, projecting a 32% energy reduction with <2% accuracy loss.
  • My TinyMoE (sub-1B) execution achieves ~20 tokens/sec across coding and creative tasks, proving sparsity is viable for constrained mobile deployment (under 3GB RAM).

The cost of carrying everything

We need to stop computing everything. A smartphone with limited thermal headroom shouldn't activate every parameter for a simple "hello" message. We can solve this with Mixture of Experts (MoE), an architecture where we split a model into specialized sub-networks and only fire the ones we need for a specific token. In my recent work, I converted Qwen3-0.6B into a sparse MoE (which I call TinyMoE), achieving 20 tokens/sec under mobile-equivalent constraints. The ultimate takeaway? We don't need trillion-parameter models on our phones; we just need selective activation.

Refusing to scale normally

For years, the paradigm in machine learning has been dense computation. Every neuron fires for every input token. But human cognition doesn't work that way (we don't activate our entire brain just to recognize a coffee cup...). Why should our models?

With frontier models like Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.1, we've seen the raw power of immense scale. But maintaining that scale locally, in our pockets, is impossible using dense architectures.

A brief primer on selective activation

A Mixture of Experts architecture splits the dense layers of a neural network into multiple smaller "experts." A routing network then decides which experts to use for a given token. It gives you the "superposition" of a massive model with the energy footprint of a small one.
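As a rough illustration of that routing step (the class name `TopKRouter` and all dimensions are my own, not any model's internals), the core mechanism fits in a few lines of PyTorch: a small linear gate scores every expert per token, and only the top-k scores survive:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Minimal sketch: score all experts per token, keep only the top-k."""

    def __init__(self, hidden_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (tokens, hidden_dim)
        logits = self.gate(x)                        # (tokens, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)       # renormalize over the chosen k
        return weights, topk_idx                     # which experts, and how much

router = TopKRouter(hidden_dim=64, num_experts=4, k=2)
weights, expert_ids = router(torch.randn(8, 64))
```

Each token activates exactly k=2 of the 4 expert FFNs, and the renormalized weights decide how their outputs are mixed. The other experts contribute zero compute for that token.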

This architectural sleight-of-hand is how today's frontier models scale so aggressively. It's why DeepSeek V4 can manage ~1 trillion total parameters while keeping active parameters constrained to ~37B. It's why Kimi k2.5 dominates coding benchmarks while remaining cheap to serve, and it's the foundation beneath Google's massive Gemini 3 Pro. They act like massive-parameter models but only activate a tiny, specialized fraction per forward pass.

The burden of choosing

The catch? The router has to actually learn how to route. If it gets lazy, it sends everything to Expert 0. We call this "router collapse," and it destroys the entire purpose of sparsity. Why learn to route when one expert is "good enough"?

Fighting the urge to rely on a single expert

We use an auxiliary loss during training. It's essentially a penalty for imbalance. If you multiply the load (how many tokens an expert receives) by the importance (the average routing probability) and sum it up, you get a value. A high value means extreme imbalance. We force the network to minimize this number alongside the main loss, ensuring the workload stays beautifully distributed.
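Concretely, that penalty is the num_experts × Σ(load × importance) formula from the summary above. Here is a sketch of how it can be computed from raw router logits (my own standalone version of the standard load-balancing loss, not TinyMoE's exact training code):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int, k: int = 2):
    """Auxiliary loss: num_experts * sum(load * importance).

    load       = fraction of dispatched token slots each expert receives
    importance = mean routing probability assigned to each expert
    Perfectly uniform routing gives the minimum value of 1.0; a collapsed
    router that sends everything to Expert 0 pushes it toward num_experts.
    """
    probs = F.softmax(router_logits, dim=-1)             # (tokens, experts)
    importance = probs.mean(dim=0)                       # avg prob per expert

    topk_idx = router_logits.topk(k, dim=-1).indices     # (tokens, k)
    mask = F.one_hot(topk_idx, num_experts).sum(dim=1).float()  # (tokens, experts)
    load = mask.mean(dim=0) / k                          # fraction of slots per expert

    return num_experts * (load * importance).sum()
```

Because both `load` and `importance` are high for an overused expert, their product blows up under imbalance, and gradient descent on the combined loss nudges the router back toward spreading tokens out.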

Shrinking the concept (TinyMoE)

I wanted to feel this constraint in my own hands. I decided to convert a pre-trained dense model (Qwen3-0.6B, representing the incredible Qwen 2.5/3 lineage) into an MoE. I call it TinyMoE.

Taking it apart to make it better

Instead of training from scratch (which requires millions of GPU hours), you upcycle. You initialize the new experts by duplicating the weights of the pre-trained dense feed-forward networks, adding a tiny amount of scaled noise to each. Then, you freeze most of the model and train only the router mechanism on your dataset.

# A conceptual snippet of TinyMoE upcycling
import copy
import torch
import torch.nn as nn

def upcycle_to_moe(dense_ffn, num_experts=4, noise_scale=0.01):
    experts = nn.ModuleList()
    for _ in range(num_experts):  # We used 4 experts
        # Copy dense weights and add slight perturbation for diversity
        expert = copy.deepcopy(dense_ffn)
        with torch.no_grad():
            for param in expert.parameters():
                param.add_(torch.randn_like(param) * noise_scale)
        experts.append(expert)
    return experts

I didn't convert everything. I selectively upcycled only the middle layers (layers 8, 12, 16, and 20 out of 28). Early layers handle raw feature extraction and late layers handle task-specific output. The middle is where representations become diverse enough to warrant true experts.
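Mechanically, the selective conversion is just a loop over the layer stack that swaps the FFN of the chosen layers and leaves the rest dense. A minimal sketch (the `.mlp` attribute name and `upcycle_fn` hook are assumptions for illustration; real model classes differ):

```python
import torch.nn as nn

MOE_LAYER_IDS = {8, 12, 16, 20}  # middle layers of the 28-layer stack

def convert_selected_layers(layers: nn.ModuleList, upcycle_fn) -> nn.ModuleList:
    """Replace the FFN of selected layers with an MoE block; keep the rest dense."""
    for i, layer in enumerate(layers):
        if i in MOE_LAYER_IDS:
            layer.mlp = upcycle_fn(layer.mlp)  # e.g. wrap with upcycle_to_moe
    return layers
```

Only 4 of the 28 layers gain extra expert copies, which is why total capacity grows modestly while early feature extraction and late task-specific layers stay untouched.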

The result? The active parameter count per token stayed incredibly low (~450M), but the absolute capacity increased by 19% to capture richer representations. Generation held at a respectable 20 tokens/second on edge hardware.

Keeping track of what matters

I'm learning that MoE isn't just a datacenter trick for training behemoths anymore. It's an elegant hardware optimization strategy.

What if our routing was task-aware? If my phone is at 10% battery, the router could dynamically drop from top-k=2 to top-k=1, gracefully degrading its reasoning depth to extend battery life. Total energy reduction? Roughly 32% in my tests with less than 2% drop in accuracy.
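The policy itself can be trivially simple. A sketch of what I mean (the function name and the 10%/45°C thresholds are illustrative placeholders, not tuned values from my experiments):

```python
def select_top_k(battery_pct: float, temp_celsius: float, base_k: int = 2) -> int:
    """Degrade routing width under resource pressure.

    With k=1, each token runs a single expert FFN instead of two, roughly
    halving the expert compute per token; that is where the energy headroom
    comes from. Thresholds here are illustrative, not tuned.
    """
    if battery_pct <= 10.0 or temp_celsius >= 45.0:
        return 1   # low-power mode: shallower routing, longer battery life
    return base_k  # normal mode: full top-2 routing
```

The router's k becomes a runtime knob rather than a fixed architectural constant, so the same weights serve both a plugged-in phone and one limping home at 8% battery.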

This is why I do this work. Not for the benchmarks, but for the craft. It's about "ricing" a system until the friction is completely gone. We don't just need bigger models. We need models that know how to care about our constraints as much as we do. It's the small wins that push us forward.

Cite this article
Aftab, A. (2026). "MoE at the Edge: Making Sparse Activation Work on Your Phone". Aashir Aftab's Portfolio. https://aftab.me/blog/mixture-of-experts-edge