The shape of a useful review
My paper, Classifying Scientific Peer Reviews: Distinguishing Authentic, Generic, and AI-Generated Feedback, is out on IEEE Xplore.
I’ve been thinking about the strange authority of a peer review. It is usually just text in a box, but it can decide whether months of work move forward or quietly disappear. That feels heavy. It should feel heavy.
A good review has friction. It pushes back. It notices the weak assumptions, the missing experiment, the strange claim that sounded fine until someone actually read it carefully. A bad review can still be fluent. It can still be polite. It can even sound academic. But it does not touch the paper.
That was the thing I wanted to study.
Can we tell the difference between a review that is real, a review that is generic, and a review that was generated by AI?
The review that sounds right
The hard part is that the surface is misleading. Generic reviews know the costume. They say the introduction needs work. They ask for more recent references. They mention typos. They recommend clearer motivation. None of that is wrong, exactly. It is just reusable.
AI-generated reviews have the same problem, but with better grammar and more confidence. They can look thoughtful while staying safely abstract. They can summarize the paper back to you and still avoid the actual critique.
So I did not want to build a simple “AI or human” detector. That framing is too small. A human can write a lazy review. A model can write a detailed one. The better question is whether the feedback has enough specificity to be useful.
The paper uses three classes: authentic, generic, and AI-generated. That third bucket matters, but the second one matters just as much.
```mermaid
graph LR
classDef source fill:#3c474d,stroke:#dbbc7f,stroke-width:2px,color:#d3c6aa;
classDef combine fill:#323c41,stroke:#7fbbb3,stroke-width:2px,color:#d3c6aa;
classDef model fill:#2b3339,stroke:#a7c080,stroke-width:2px,color:#d3c6aa;
classDef result fill:#323c41,stroke:#e69875,stroke-width:2px,color:#d3c6aa;
classDef action fill:#3c474d,stroke:#d699b6,stroke-width:2px,stroke-dasharray: 4 4,color:#d3c6aa;
Paper["Paper context<br/>title + abstract"]:::source
Review["Review text<br/>feedback to audit"]:::source
Pack["Combined input<br/>context + review"]:::combine
Encoder["Transformer encoder<br/>BERT / RoBERTa / XLNet / GPT"]:::model
Classifier["Three-way classifier<br/>review behavior"]:::model
Authentic["Authentic<br/>specific critique"]:::result
Generic["Generic<br/>reusable comments"]:::result
Generated["AI-generated<br/>fluent but suspect"]:::result
Flag["Editorial signal<br/>needs human look"]:::action
Paper --> Pack
Review --> Pack
Pack --> Encoder
Encoder --> Classifier
Classifier --> Authentic
Classifier --> Generic
Classifier --> Generated
Generic --> Flag
Generated --> Flag
click Paper showDiagramTip "Paper context"
click Review showDiagramTip "Review text"
click Pack showDiagramTip "Combined model input"
click Encoder showDiagramTip "Transformer encoder"
click Classifier showDiagramTip "Three-way classifier"
click Authentic showDiagramTip "Authentic reviews"
click Generic showDiagramTip "Generic reviews"
click Generated showDiagramTip "AI-generated reviews"
click Flag showDiagramTip "Editorial signal"
```
Building a small audit trail
We built a dataset of 399 review entries. Each row had the paper title, abstract, review text, and review type. The authentic reviews came from journals that publish open peer reviews, including MDPI and PeerJ. The generic reviews were assembled by concatenating sentences from an online review generator. The AI-generated reviews came from ChatGPT, DeepSeek AI, and Claude.ai, using the paper abstract as context.
It was not a huge dataset. It was intentionally narrow. I wanted the model to look at the title, the abstract, and the review together, then decide what kind of feedback it was seeing.
The split looked like this:
| Review Type | Count |
|---|---|
| Authentic | 202 |
| Generic | 117 |
| AI-generated | 80 |
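If you want to poke at a split like this yourself, a minimal loading sketch follows. The file name and column names are assumptions for illustration, not necessarily what the released dataset uses.

```python
# Minimal sketch: load the labeled reviews and make a stratified split.
# "reviews.csv" and the column names are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("reviews.csv")      # 399 rows: title, abstract, review, label
print(df["label"].value_counts())    # authentic 202, generic 117, ai_generated 80

# Stratify so the small AI-generated class shows up in the held-out set too.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
```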
There is something useful about keeping the setup small. You can feel the limits. You can see where the model starts confusing “sounds like a review” with “is actually a review.” That confusion is the whole problem.
Putting models in the reviewer seat
I tested the usual transformer family first: BERT, DistilBERT, XLNet, and RoBERTa. They are not glamorous anymore, but they are still useful baselines. They are small enough to reason about, cheap enough to train, and practical enough for the kind of screening tool a publisher might actually deploy.
Then we added GPT-4o-nano through supervised fine-tuning. That changed the ceiling.
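For context, supervised fine-tuning through the OpenAI API expects chat-formatted JSONL. This is a hedged sketch of what that preparation could look like; the prompt wording and label strings are my own illustration rather than the paper's exact setup, and `train_df` is the split from the sketch above.

```python
# Sketch: serialize labeled rows as chat-format JSONL for supervised
# fine-tuning. Prompt text and label strings are illustrative only.
import json

def to_example(row):
    user = (
        f"Title: {row['title']}\n"
        f"Abstract: {row['abstract']}\n"
        f"Review: {row['review']}\n"
        "Classify this review as authentic, generic, or ai_generated."
    )
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": row["label"]},
        ]
    }

with open("train.jsonl", "w") as f:
    for _, row in train_df.iterrows():
        f.write(json.dumps(to_example(row)) + "\n")
```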
| Model | Role in the study |
|---|---|
| BERT | The baseline. |
| DistilBERT | The lighter, faster option. |
| XLNet | The bidirectional context experiment. |
| RoBERTa | The strongest BERT-family held-out performer. |
| GPT-4o-nano | The model that raised the benchmark. |
The input was simple: title, abstract, and review text, packed together with special tokens. The question was whether the model could learn the difference between engagement and performance.
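To make "packed together with special tokens" concrete, here is a sketch of the text-pair encoding I have in mind: paper context as the first segment, review as the second, so the tokenizer inserts the model's own separator tokens between them. The model name, max length, and concatenation order are assumptions, not the paper's exact configuration.

```python
# Sketch: encode (paper context, review) as a text pair so the tokenizer
# adds the model's special tokens, then score it with a 3-way head.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=3  # ai_generated / authentic / generic
)

context = "Paper title. Paper abstract text goes here."                    # first segment
review = "The introduction needs work and the references are outdated."    # second segment

inputs = tokenizer(
    context,
    review,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
logits = model(**inputs).logits  # one logit per class; meaningful only after fine-tuning
```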
Where the signal showed up
GPT-4o-nano was the clear winner on the held-out test set:
| Metric | GPT-4o-nano |
|---|---|
| Held-out accuracy | 0.807 |
| Macro F1 | 0.806 |
| Weighted F1 | 0.810 |
| Macro precision | 0.817 |
| Macro recall | 0.807 |
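For anyone unfamiliar with the aggregations in that table: macro scores average the per-class metrics equally, while weighted scores average them by class support. A toy sketch with scikit-learn, using made-up labels:

```python
# Sketch: macro vs. weighted F1 on toy labels (not the paper's predictions).
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = ["authentic", "generic", "ai_generated", "authentic", "generic"]
y_pred = ["authentic", "authentic", "ai_generated", "authentic", "generic"]

print("accuracy:       ", accuracy_score(y_true, y_pred))
print("macro F1:       ", f1_score(y_true, y_pred, average="macro"))     # classes count equally
print("weighted F1:    ", f1_score(y_true, y_pred, average="weighted"))  # weighted by class size
print("macro precision:", precision_score(y_true, y_pred, average="macro"))
print("macro recall:   ", recall_score(y_true, y_pred, average="macro"))
```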
RoBERTa was the strongest of the smaller models, which was not too surprising. It usually handles this kind of text classification work better than plain BERT.
| Model | Accuracy | Macro F1 |
|---|---|---|
| BERT | 0.638 | 0.636 |
| DistilBERT | 0.675 | 0.611 |
| RoBERTa | 0.713 | 0.706 |
| XLNet | 0.633 | 0.653 |
| GPT-4o-nano | 0.807 | 0.806 |
The interesting part was the generic class. That is where the smaller models struggled most. Generic reviews are slippery because they borrow the right academic shape. They are not obviously broken. They are just hollow.
| Model | AI-Generated F1 | Authentic F1 | Generic F1 |
|---|---|---|---|
| BERT | 0.690 | 0.730 | 0.430 |
| DistilBERT | 0.770 | 0.760 | 0.220 |
| RoBERTa | 0.840 | 0.740 | 0.550 |
| XLNet | 0.820 | 0.670 | 0.470 |
| GPT-4o-nano | 0.880 | 0.830 | 0.730 |
GPT-4o-nano reached 0.73 F1 on generic reviews. That number matters to me more than the headline accuracy. It means the model was better at spotting the absence of specificity, not just the presence of AI-like phrasing.
The thing I keep coming back to
I keep coming back to this: detection is not the same as judgment.
A classifier should not replace editors. It should not become another automatic gate that people blindly trust. But it can act like a smoke alarm. It can say: this review looks unusually generic, or this review looks generated, or this review may need another human look before it becomes part of the record.
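As a sketch of what that smoke alarm could look like in code: take the classifier's probabilities and surface anything that is not confidently authentic. The label order and the 0.6 threshold are illustrative assumptions, not values from the paper.

```python
# Sketch: turn class probabilities into an editorial flag, not a verdict.
from typing import Optional
import torch

LABELS = ["ai_generated", "authentic", "generic"]  # assumed label order

def flag_for_human_look(logits: torch.Tensor, threshold: float = 0.6) -> Optional[str]:
    """Return a flag message when a review should get another human look."""
    probs = logits.softmax(dim=-1).squeeze()
    top = int(probs.argmax())
    confidence = probs[top].item()
    if LABELS[top] != "authentic" or confidence < threshold:
        return f"needs human look: classified {LABELS[top]} (p={confidence:.2f})"
    return None  # confidently authentic; nothing to flag
```

The point of a function like this is that its output lands in an editor's queue, never in an automatic decision.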
That feels like the right level of ambition for now.
I also like that the smaller models still have a role. RoBERTa did not beat GPT-4o-nano, but it is cheaper and easier to host. Sometimes the practical tool is not the largest model. Sometimes it is the one that can sit quietly inside an editorial workflow and flag the obvious failures.
Leaving the trail open
The paper is only a starting point. The next version needs more disciplines, more review styles, better long-context handling, and richer signals like reviewer history or paper metadata. But I think the basic direction is right: peer review needs tools that look at usefulness, not just authorship.
The work is open here:
| Resource | Link |
|---|---|
| IEEE paper | ieeexplore.ieee.org/document/11446645 |
| Training code | github.com/pnpm-aftab/auditing-peer-reviews |
| Model hub | huggingface.co/aashir021/peer-review-classifiers |
The thing I want is not a world where models review papers for us. I want a world where bad reviews are harder to hide behind polished language. That feels like a small but meaningful step toward research that is treated with the care it deserves.