The weight of a review
I’ve been thinking about the integrity of the systems we rely on. In academia, that system is peer review: the mechanism we trust to vet scientific claims before they enter the record. But lately, I’ve noticed a shift. The friction of writing a detailed, critical review is increasingly being bypassed with Large Language Models (LLMs), and we are seeing more and more reviews that are syntactically perfect but analytically hollow.
It is a subtle kind of degradation. When we accept generic feedback or “pal-reviews” (superficial comments from colleagues), we let flawed research slip through the cracks.
In our recent paper, Classifying Scientific Peer Reviews: Distinguishing Authentic, Generic, and AI-Generated Feedback, my co-authors and I decided to tackle this problem head-on. We wanted to see if we could automatically audit and categorize peer reviews to safeguard the academic publishing ecosystem.
Building the dataset
To train a model to catch a generic or AI-generated review, you first need a high-quality dataset. We curated a specialized dataset of 399 reviews divided into three categories:
- Authentic reviews: Sourced directly from journals that publish open peer reviews, such as MDPI and PeerJ. These serve as our baseline for human, expert-level critique.
- AI-Generated reviews: Prompted from models like ChatGPT, DeepSeek, and Claude using published paper abstracts.
- Generic reviews: Assembled using sentence concatenation tools to mimic low-effort “pal-reviews” (e.g., “The paper has several typos,” “The title is too long”).
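To make the three-class setup concrete, here is a minimal sketch of how such a labeled dataset could be assembled. The column names, label scheme, and example texts are illustrative, not the paper's actual schema or pipeline:

```python
import pandas as pd

# Hypothetical label scheme for the three review categories.
LABELS = {"authentic": 0, "ai_generated": 1, "generic": 2}

records = [
    # (paper title, abstract, review text, category) -- toy examples
    ("Example Paper", "An abstract...", "The methodology is sound, but the ablation is incomplete.", "authentic"),
    ("Example Paper", "An abstract...", "This paper presents a comprehensive and well-structured study.", "ai_generated"),
    ("Example Paper", "An abstract...", "The paper has several typos. The title is too long.", "generic"),
]

df = pd.DataFrame(records, columns=["title", "abstract", "review", "label"])
df["label_id"] = df["label"].map(LABELS)

# The classifier input concatenates title, abstract, and review text,
# mirroring the "Paper Title + Abstract + Review Text" input in the diagram below.
df["text"] = df["title"] + " [SEP] " + df["abstract"] + " [SEP] " + df["review"]
print(df[["label", "label_id"]])
```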
Finding the right filter
We wanted to evaluate how well different transformer architectures could handle the nuanced task of classifying academic text. We tested five models: BERT, DistilBERT, XLNet, RoBERTa, and GPT-4o-nano.
For the BERT-family models, we tokenized inputs (max length: 512), tuned batch sizes, and used focal loss to counter class imbalance. Training followed a rigorous routine with k-fold cross-validation and early stopping.
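Focal loss down-weights examples the model already classifies confidently, so training gradients concentrate on the hard, under-represented classes (here, the rare “Generic” reviews). A pure-NumPy sketch of the idea follows; the hyperparameter value is illustrative, not the paper's tuned setting:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: mean of (1 - p_t)^gamma * -log(p_t).

    p_t is the predicted probability of the true class;
    gamma=0 recovers plain cross-entropy.
    """
    p = softmax(logits)
    p_t = p[np.arange(len(targets)), targets]
    return float(np.mean((1.0 - p_t) ** gamma * -np.log(p_t)))

# Toy batch: 3 reviews, 3 classes (authentic / AI-generated / generic)
logits = np.array([[2.0, 0.1, 0.1],
                   [0.2, 1.5, 0.3],
                   [0.1, 0.2, 0.3]])
targets = np.array([0, 1, 2])
print(focal_loss(logits, targets))
```

Because `(1 - p_t)^gamma < 1` whenever the model has any confidence in the true class, the focal loss is always at most the plain cross-entropy; easy examples shrink the most.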
```mermaid
graph TD
    classDef input fill:#fff,stroke:#333,stroke-width:2px;
    classDef model fill:#f9f9f9,stroke:#333;
    classDef output fill:#e1f5fe,stroke:#333;
    Input["Paper Title + Abstract + Review Text"]:::input
    subgraph Architectures [Evaluated Models]
        M1["RoBERTa"]:::model
        M2["DistilBERT"]:::model
        M3["XLNet"]:::model
        M4["BERT"]:::model
        M5["GPT-4o-nano"]:::model
    end
    Input --> Architectures
    Architectures --> Classify
    Classify["Three-Class Categorization"]
    Classify --> O1["Authentic"]:::output
    Classify --> O2["AI-Generated"]:::output
    Classify --> O3["Generic"]:::output
```
The results: Why GPT-4o-nano won
On our held-out test set, the traditional BERT-family models struggled, particularly with the “Generic” class. DistilBERT and RoBERTa showed moderate performance, with RoBERTa reaching 71.3% accuracy.
However, GPT-4o-nano completely shifted the narrative. It substantially surpassed all other models:
- Accuracy: 80.7%
- Macro-F1: 80.6%
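Macro-F1 averages the per-class F1 scores without weighting by class frequency, so a weak “Generic” class drags the score down even when that class is small. A quick sketch with scikit-learn on toy labels (not our actual test set):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy three-class labels: 0 = authentic, 1 = AI-generated, 2 = generic
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 1]  # one "generic" review misclassified as AI-generated

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={acc:.3f}  macro-F1={macro_f1:.3f}")
```

Note that a single error in the small “generic” class costs more macro-F1 than accuracy, which is exactly why we report both.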
Class-wise F1 Scores
| Class | BERT | DistilBERT | RoBERTa | XLNet | GPT-4o-nano |
|---|---|---|---|---|---|
| AI-Generated | 0.690 | 0.770 | 0.840 | 0.820 | 0.880 |
| Authentic | 0.730 | 0.760 | 0.740 | 0.670 | 0.830 |
| Generic | 0.430 | 0.220 | 0.550 | 0.470 | 0.730 |
GPT-4o-nano demonstrated a massive leap in identifying “Generic” reviews—a category where older models failed to capture the subtle generative characteristics and lack of critical depth.
Future directions
We have made our training pipeline and the fine-tuned models openly available on the Hugging Face Model Hub and our code on GitHub for reproducibility and future research.
As LLMs continue to refine their mimicry of human language, distinguishing human-written from AI-generated text will only get harder. Future work will explore multi-modal approaches that incorporate reviewer history, paper metadata, or review sentiment to further improve classification accuracy. We also plan to investigate custom classification architectures built around CNNs and transformers such as Longformer or ELECTRA, as well as few-shot learning techniques for applying the models to new domains with limited data.