
When Smaller Models Lie About Confidence

Confidence after compression

I’ve been thinking about the kind of error that does not look like an error.

A model can choose the right label and still be wrong about how sure it is. That sounds harmless until the prediction becomes part of a workflow. In finance, healthcare, moderation, triage, or any setting where a human uses model confidence to decide what to inspect next, bad confidence is not metadata. It is part of the decision.

My IEEE MLSP 2026 work, Quantization-Robust Fuzzy Calibration for Edge-Deployed LLMs, started from that unease.

Edge deployment makes the problem sharper. If you want an LLM to run on a constrained device, you usually quantize it. INT4 is attractive because it cuts memory hard. But compression does not only affect weights. It can also distort the probability surface. The model may still answer, but its confidence stops meaning what it used to mean.

That is the quiet failure mode I wanted to study.

The cost of making models small

We talk a lot about accuracy because it is easy to see. Calibration is harder to feel.

If a model says it is 80% confident, then over many similar predictions it should be right about 80% of the time. Expected Calibration Error, or ECE, measures how far the model is from that behavior. Lower is better. A well-calibrated model does not just make predictions. It gives probabilities you can use.
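A minimal binned-ECE sketch makes the definition concrete. The bin count and equal-width binning are my illustrative choices, not anything specific to the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between confidence and accuracy.

    confidences: predicted max-class probabilities in [0, 1]
    correct:     1 if the prediction was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece

# A perfectly calibrated toy batch: 80% confident, right 4 times out of 5.
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))  # ≈ 0.0
```

If the same batch had been right only half the time, the 0.3 gap between stated confidence and realized accuracy would flow straight into the score.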

Quantization complicates this. When we push a model down to INT4, we are forcing continuous values into a very small number of buckets. That saves memory, but it introduces small distortions everywhere. Those distortions can move confidence boundaries in ways that standard calibration methods were not trained to handle.
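To build intuition for the bucketing, here is a toy uniform 4-bit quantizer applied to logits. Real INT4 schemes are group-wise and far more careful; this only shows how snapping values to 16 levels nudges the downstream probabilities:

```python
import numpy as np

def fake_int4_quantize(x, lo=-1.0, hi=1.0):
    """Toy uniform 4-bit quantization: snap each value to one of 16 levels."""
    levels = 16
    x = np.clip(np.asarray(x, dtype=float), lo, hi)
    step = (hi - lo) / (levels - 1)
    return lo + np.round((x - lo) / step) * step

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([0.42, 0.31, -0.05])
print(softmax(logits))                      # original confidence
print(softmax(fake_int4_quantize(logits)))  # confidence after 4-bit snapping
```

The label often survives the rounding; the confidence attached to it does not have to.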

The question became: can a calibrator learn to survive the compression?

graph LR
    classDef source fill:#3c474d,stroke:#dbbc7f,stroke-width:2px,color:#d3c6aa;
    classDef compression fill:#323c41,stroke:#e69875,stroke-width:2px,color:#d3c6aa;
    classDef train fill:#2b3339,stroke:#7fbbb3,stroke-width:2px,color:#d3c6aa;
    classDef calibrate fill:#323c41,stroke:#a7c080,stroke-width:2px,color:#d3c6aa;
    classDef output fill:#3c474d,stroke:#d699b6,stroke-width:2px,color:#d3c6aa;
    classDef edge fill:#2b3339,stroke:#859289,stroke-width:2px,stroke-dasharray: 4 4,color:#d3c6aa;

    Model["Classifier<br/>raw probabilities"]:::source
    Quant["INT4 deployment<br/>compressed model"]:::compression
    Noise["Simulated quantization noise<br/>training only"]:::train
    Fuzzy["Fuzzy membership gates<br/>confidence regions"]:::calibrate
    Dirichlet["Dirichlet correction<br/>region-specific mapping"]:::calibrate
    Output["Calibrated probabilities<br/>lower ECE"]:::output
    Labels["Linguistic confidence<br/>low / moderate / high"]:::output
    Edge["Tiny calibrator<br/>~0.0002 MB"]:::edge

    Model --> Quant
    Quant --> Noise
    Noise --> Fuzzy
    Fuzzy --> Dirichlet
    Dirichlet --> Output
    Output --> Labels
    Dirichlet --> Edge

    click Model showDiagramTip "Raw probabilities"
    click Quant showDiagramTip "INT4 compression"
    click Noise showDiagramTip "Quantization noise"
    click Fuzzy showDiagramTip "Fuzzy gates"
    click Dirichlet showDiagramTip "Dirichlet correction"
    click Output showDiagramTip "Calibrated output"
    click Labels showDiagramTip "Interpretable confidence"
    click Edge showDiagramTip "Edge overhead"

Fuzzy boundaries instead of one global fix

Temperature scaling is elegant, but it is blunt. It learns one global correction. That can help, but confidence errors are rarely uniform. A model may be too confident at the top end and underconfident in the middle. One knob cannot always fix that shape.
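For concreteness, temperature scaling is a single division before the softmax. This is a generic sketch of the technique, not code from the paper:

```python
import numpy as np

def temperature_scale(logits, T):
    """One global knob: divide all logits by temperature T before softmax."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 0.5, -1.0])
print(temperature_scale(logits, 1.0))  # raw confidence
print(temperature_scale(logits, 2.0))  # softened: same argmax, lower peak
```

Note that T > 1 softens every prediction by the same amount. It cannot simultaneously pull down an overconfident top end and push up an underconfident middle.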

So I used fuzzy-gated Dirichlet calibration.

The fuzzy part divides confidence into smooth regions. Not hard bins. Smooth membership functions. A prediction can partially belong to multiple confidence regions, which gives the calibrator room to move without creating brittle thresholds.
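A minimal sketch of what smooth gates look like, assuming Gaussian membership functions with hypothetical region centers and width (the paper's exact membership functions may differ):

```python
import numpy as np

def fuzzy_memberships(confidence, centers=(0.4, 0.65, 0.9), width=0.18):
    """Smooth Gaussian gates over confidence: low / moderate / high regions.

    centers and width are illustrative placeholders. Membership weights
    are normalized so each prediction distributes fully across regions.
    """
    c = np.asarray(centers, dtype=float)
    w = np.exp(-0.5 * ((confidence - c) / width) ** 2)
    return w / w.sum()

# A 0.55-confidence prediction belongs partly to "low", mostly to "moderate".
print(fuzzy_memberships(0.55))
```

Because the weights change smoothly with confidence, a prediction sitting near a boundary does not flip abruptly between corrections the way it would with hard bins.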

The Dirichlet part performs the probability correction inside those regions. Instead of forcing one global map over the whole probability space, the calibrator learns region-specific behavior.
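Standard Dirichlet calibration applies a linear map to the log-probabilities and renormalizes. This generic sketch shows the mechanics with identity parameters (in practice, W and b are what training fits, and in the fuzzy-gated variant each region contributes its own correction weighted by membership):

```python
import numpy as np

def dirichlet_calibrate(probs, W, b, eps=1e-12):
    """Dirichlet calibration: linear map on log-probabilities, then softmax."""
    z = W @ np.log(np.asarray(probs, dtype=float) + eps) + b
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Identity parameters leave the distribution (essentially) unchanged.
p = np.array([0.7, 0.2, 0.1])
print(dirichlet_calibrate(p, np.eye(3), np.zeros(3)))  # ≈ [0.7, 0.2, 0.1]
```

The full parameter matrix is what lets the correction reshape the probability vector per class, rather than scaling everything by one factor.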

Then I added simulated INT4 noise during training. That is the quantization-aware piece. The calibrator sees probability perturbations while it learns, so the final mapping is less fragile when the deployed model is compressed.
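One simple way to produce such perturbations during calibrator training is Gaussian jitter on the log-probabilities. The noise scale here is a hypothetical knob for illustration, not a value from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_probs(probs, noise_scale=0.02):
    """Training-time augmentation: jitter log-probs to mimic INT4 distortion.

    noise_scale is an illustrative placeholder; renormalization keeps the
    perturbed vector a valid probability distribution.
    """
    logits = np.log(np.asarray(probs, dtype=float) + 1e-12)
    logits = logits + rng.normal(0.0, noise_scale, size=logits.shape)
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = np.array([0.7, 0.2, 0.1])
print(perturb_probs(p))  # slightly shifted, still sums to 1
```

The calibrator never sees a pristine probability surface during training, so it has less opportunity to overfit to one.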

Testing across architectures

I did not want this to be a BERT-only trick. Edge deployment is messy now. Some models use encoder classification heads. Some use decoder-only prompting. Some newer architectures expose probability behavior through different paths entirely.

So I tested four models on financial sentiment classification:

| Model | Architecture | Accuracy | Calibrated ECE | INT4 Memory |
|---|---|---|---|---|
| FinBERT | Encoder | 60.08% | 0.033 | 132 MB |
| FinancialBERT | Encoder | 54.52% | 0.073 | 133 MB |
| Gemma 3 | Decoder-only | 55.90% | 0.046 | 909 MB |
| Qwen 3.5-0.8B | Gated Delta Network | 37.92% | 0.158 | 723 MB |

Across the models, fuzzy calibration reduced ECE by roughly 58.5-88.7%. The calibrator memory was basically noise in the budget: about 0.0002 MB.

FinBERT ended up being the practical winner. It had the best accuracy, the best calibrated ECE, and the smallest memory footprint. That matters because the edge story is not just “can this run?” It is “can this run and still tell the truth about uncertainty?”

The result that annoyed the obvious story

The most interesting result was not the decoder model. It was the comparison between FinBERT and FinancialBERT.

FinancialBERT is sentiment-specialized. On paper, that sounds like it should win. But FinBERT, trained more broadly on financial language, performed better:

| Comparison | FinBERT | FinancialBERT |
|---|---|---|
| Accuracy | 60.08% | 54.52% |
| Calibrated ECE | 0.033 | 0.073 |
| INT4 Memory | 132 MB | 133 MB |

That surprised me in a useful way. Narrow specialization is not always the better prior. Sometimes broad domain exposure gives the model a healthier representation, especially when the downstream task is noisy and compressed deployment adds another layer of distortion.

I would not overclaim this yet. The experiments are single runs on 5,000 samples from one domain. It needs multi-domain validation. But it is a thread worth pulling.

Accuracy is not enough

What I like about this work is that it moves past leaderboard thinking.

Accuracy answers one question: did the model choose the right class?

Calibration answers another: can I trust the probability attached to that choice?

On edge devices, we need both. A tiny model that is overconfident is not automatically useful. A compressed model with calibrated uncertainty is much more interesting because it can participate in a larger system. It can say “I think this is positive, but I am not very sure,” and that sentence changes how a human or another model should respond.

That is the direction I care about: models that are not just smaller, but more honest under constraint.