Confidence after compression
I’ve been thinking about the kind of error that does not look like an error.
A model can choose the right label and still be wrong about how sure it is. That sounds harmless until the prediction becomes part of a workflow. In finance, healthcare, moderation, triage, or any setting where a human uses model confidence to decide what to inspect next, bad confidence is not metadata. It is part of the decision.
My IEEE MLSP 2026 work, Quantization-Robust Fuzzy Calibration for Edge-Deployed LLMs, started from that unease.
Edge deployment makes the problem sharper. If you want an LLM to run on a constrained device, you usually quantize it. INT4 is attractive because it cuts memory hard. But compression does not only affect weights. It can also distort the probability surface. The model may still answer, but its confidence stops meaning what it used to mean.
That is the quiet failure mode I wanted to study.
The cost of making models small
We talk a lot about accuracy because it is easy to see. Calibration is harder to feel.
If a model says it is 80% confident, then over many similar predictions it should be right about 80% of the time. Expected Calibration Error, or ECE, measures how far the model is from that behavior. Lower is better. A well-calibrated model does not just make predictions. It gives probabilities you can use.
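To make that concrete, here is a minimal ECE computation in the usual equal-width-bin form. The bin count and binning scheme are illustrative defaults, not necessarily what the paper uses:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: the gap between accuracy and mean
    confidence in each bin, weighted by the bin's share of predictions."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Four predictions, each at 90% confidence, but only half are right:
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # 0.4
```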
Quantization complicates this. When we push a model down to INT4, we are forcing continuous values into a very small number of buckets. That saves memory, but it introduces small distortions everywhere. Those distortions can move confidence boundaries in ways that standard calibration methods were not trained to handle.
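Here is a toy round trip through symmetric round-to-nearest INT4, just to show how coarse the grid is. This is a generic sketch, not the exact quantization scheme from the experiments:

```python
import numpy as np

def fake_int4(w):
    """Symmetric round-to-nearest INT4: every weight snaps to one of
    16 levels. The return value is the dequantized tensor the deployed
    model effectively computes with."""
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=5)
print(w)             # continuous weights
print(fake_int4(w))  # the same weights, snapped to a 16-level grid
```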
The question became: can a calibrator learn to survive the compression?
```mermaid
graph LR
classDef source fill:#3c474d,stroke:#dbbc7f,stroke-width:2px,color:#d3c6aa;
classDef compression fill:#323c41,stroke:#e69875,stroke-width:2px,color:#d3c6aa;
classDef train fill:#2b3339,stroke:#7fbbb3,stroke-width:2px,color:#d3c6aa;
classDef calibrate fill:#323c41,stroke:#a7c080,stroke-width:2px,color:#d3c6aa;
classDef output fill:#3c474d,stroke:#d699b6,stroke-width:2px,color:#d3c6aa;
classDef edge fill:#2b3339,stroke:#859289,stroke-width:2px,stroke-dasharray: 4 4,color:#d3c6aa;
Model["Classifier<br/>raw probabilities"]:::source
Quant["INT4 deployment<br/>compressed model"]:::compression
Noise["Simulated quantization noise<br/>training only"]:::train
Fuzzy["Fuzzy membership gates<br/>confidence regions"]:::calibrate
Dirichlet["Dirichlet correction<br/>region-specific mapping"]:::calibrate
Output["Calibrated probabilities<br/>lower ECE"]:::output
Labels["Linguistic confidence<br/>low / moderate / high"]:::output
Edge["Tiny calibrator<br/>~0.0002 MB"]:::edge
Model --> Quant
Quant --> Noise
Noise --> Fuzzy
Fuzzy --> Dirichlet
Dirichlet --> Output
Output --> Labels
Dirichlet --> Edge
click Model showDiagramTip "Raw probabilities"
click Quant showDiagramTip "INT4 compression"
click Noise showDiagramTip "Quantization noise"
click Fuzzy showDiagramTip "Fuzzy gates"
click Dirichlet showDiagramTip "Dirichlet correction"
click Output showDiagramTip "Calibrated output"
click Labels showDiagramTip "Interpretable confidence"
click Edge showDiagramTip "Edge overhead"
```
Fuzzy boundaries instead of one global fix
Temperature scaling is elegant, but it is blunt. It learns one global correction. That can help, but confidence errors are rarely uniform. A model may be too confident at the top end and underconfident in the middle. One knob cannot always fix that shape.
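For contrast, the whole of temperature scaling fits in a few lines. One scalar T stretches or shrinks every confidence at once, which is exactly why it cannot reshape different regions differently:

```python
import numpy as np

def temperature_scale(logits, T):
    """The single global knob: divide logits by T, renormalize.
    T > 1 softens every prediction; T < 1 sharpens every prediction.
    It cannot soften the top end while sharpening the middle."""
    z = np.asarray(logits) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

print(temperature_scale([3.0, 1.0, 0.5], T=1.0).round(3))  # baseline distribution
print(temperature_scale([3.0, 1.0, 0.5], T=2.0).round(3))  # same ranking, flatter confidence
```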
So I used fuzzy-gated Dirichlet calibration.
The fuzzy part divides confidence into smooth regions. Not hard bins. Smooth membership functions. A prediction can partially belong to multiple confidence regions, which gives the calibrator room to move without creating brittle thresholds.
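A minimal sketch of what I mean, with Gaussian-style memberships. The centers and width here are illustrative placeholders, not the trained values:

```python
import numpy as np

def fuzzy_memberships(conf, centers=(0.4, 0.7, 0.9), width=0.15):
    """Soft membership of a confidence score in low / moderate / high
    regions. Unlike hard bins, a prediction at conf = 0.75 belongs
    partly to 'moderate' and partly to 'high'."""
    m = np.exp(-((conf - np.asarray(centers)) / width) ** 2)
    return m / m.sum()

print(fuzzy_memberships(0.75).round(2))  # ~[0.   0.71 0.29] -- no hard cutoff
```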
The Dirichlet part performs the probability correction inside those regions. Instead of forcing one global map over the whole probability space, the calibrator learns region-specific behavior.
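Dirichlet calibration is, at its core, a linear map on log-probabilities followed by a softmax. Here is one way the region-specific version could compose with the fuzzy gates; the exact parameterization in the paper may differ:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dirichlet_map(p, W, b, eps=1e-12):
    """Dirichlet calibration: a linear map on log-probabilities,
    renormalized by softmax. With K classes, W is K x K, b length K."""
    return softmax(np.log(p + eps) @ W.T + b)

def fuzzy_dirichlet(p, region_params, centers=(0.4, 0.7, 0.9), width=0.15):
    """Blend one Dirichlet map per confidence region, weighted by the
    fuzzy membership of the prediction's top confidence (assumed gating)."""
    m = np.exp(-((p.max() - np.asarray(centers)) / width) ** 2)
    m = m / m.sum()
    return sum(w * dirichlet_map(p, W, b)
               for w, (W, b) in zip(m, region_params))

# Identity-initialized maps for 3 classes and 3 regions (illustrative):
params = [(np.eye(3), np.zeros(3)) for _ in range(3)]
p = np.array([0.7, 0.2, 0.1])
print(fuzzy_dirichlet(p, params))  # equals p at identity initialization
```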
Then I added simulated INT4 noise during training. That is the quantization-aware piece. The calibrator sees probability perturbations while it learns, so the final mapping is less fragile when the deployed model is compressed.
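A sketch of that training-time perturbation, assuming Gaussian jitter on log-probabilities as a stand-in for whatever noise model the paper actually uses:

```python
import numpy as np

def add_int4_style_noise(p, scale=0.05, rng=None):
    """Training-time perturbation for the calibrator: jitter the
    log-probabilities, renormalize, and train on the result so the
    learned mapping stays stable when the deployed model is compressed.
    Gaussian noise on logits is my assumption, not the paper's model."""
    rng = rng or np.random.default_rng()
    z = np.log(np.asarray(p) + 1e-12)
    z = z + rng.normal(0.0, scale, size=z.shape)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

p = np.array([0.7, 0.2, 0.1])
print(add_int4_style_noise(p, rng=np.random.default_rng(0)))  # slightly shifted probabilities
```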
Testing across architectures
I did not want this to be a BERT-only trick. Edge deployment is messy right now. Some models use encoder classification heads. Some use decoder-only prompting. Some newer architectures expose probability behavior through different paths entirely.
So I tested four models on financial sentiment classification:
| Model | Architecture | Accuracy | Calibrated ECE | INT4 Memory |
|---|---|---|---|---|
| FinBERT | Encoder | 60.08% | 0.033 | 132 MB |
| FinancialBERT | Encoder | 54.52% | 0.073 | 133 MB |
| Gemma 3 | Decoder-only | 55.90% | 0.046 | 909 MB |
| Qwen 3.5-0.8B | Gated Delta Network | 37.92% | 0.158 | 723 MB |
Across the models, fuzzy calibration reduced ECE by roughly 58.5-88.7%. The calibrator's memory overhead was basically noise in the budget: about 0.0002 MB.
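That 0.0002 MB figure is consistent with a back-of-the-envelope parameter count. Everything in this sketch is assumed for illustration, not taken from the paper:

```python
# Assuming 3 classes, 3 fuzzy regions, float32, and one Dirichlet map
# (3x3 weight matrix + 3-vector bias) per region, plus a handful of
# membership parameters -- all assumptions, not the paper's numbers:
per_region = 3 * 3 + 3        # 12 parameters per Dirichlet map
total = 3 * per_region + 6    # ~42 parameters overall
print(total * 4 / 1e6, "MB")  # 0.000168 MB -- same order as 0.0002 MB
```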
FinBERT ended up being the practical winner. It had the best accuracy, the best calibrated ECE, and the smallest memory footprint. That matters because the edge story is not just “can this run?” It is “can this run and still tell the truth about uncertainty?”
The result that annoyed the obvious story
The most interesting result was not the decoder model. It was the comparison between FinBERT and FinancialBERT.
FinancialBERT is sentiment-specialized. On paper, that sounds like it should win. But FinBERT, trained more broadly on financial language, performed better:
| Comparison | FinBERT | FinancialBERT |
|---|---|---|
| Accuracy | 60.08% | 54.52% |
| Calibrated ECE | 0.033 | 0.073 |
| INT4 Memory | 132 MB | 133 MB |
That surprised me in a useful way. Narrow specialization is not always the better prior. Sometimes broad domain exposure gives the model a healthier representation, especially when the downstream task is noisy and compressed deployment adds another layer of distortion.
I would not overclaim this yet. The experiments are single runs on 5,000 samples from one domain. It needs multi-domain validation. But it is a thread worth pulling.
Accuracy is not enough
What I like about this work is that it moves past leaderboard thinking.
Accuracy answers one question: did the model choose the right class?
Calibration answers another: can I trust the probability attached to that choice?
On edge devices, we need both. A tiny model that is overconfident is not automatically useful. A compressed model with calibrated uncertainty is much more interesting because it can participate in a larger system. It can say “I think this is positive, but I am not very sure,” and that sentence changes how a human or another model should respond.
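Mechanically, that last step can be as simple as mapping calibrated confidence to a linguistic label, the low / moderate / high output in the diagram above. The thresholds below are placeholders, not the paper's:

```python
def linguistic_confidence(conf, low=0.5, high=0.8):
    """Turn a calibrated probability into a label a reviewer can act on.
    The thresholds are illustrative, not taken from the paper."""
    if conf < low:
        return "low"
    return "moderate" if conf < high else "high"

print(linguistic_confidence(0.62))  # "moderate": worth a second look
```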
That is the direction I care about: models that are not just smaller, but more honest under constraint.