The shape of a useful review
My paper, Classifying Scientific Peer Reviews: Distinguishing Authentic, Generic, and AI-Generated Feedback, is out on IEEE Xplore.
I’ve been thinking about the strange authority of a peer review. It is usually just text in a box, but it can decide whether months of work move forward or quietly disappear. That feels heavy. It should feel heavy.
A good review has friction. It pushes back. It notices the weak assumptions, the missing experiment, the strange claim that sounded fine until someone actually read it carefully. A bad review can still be fluent. It can still be polite. It can even sound academic. But it does not touch the paper.
That was the thing I wanted to study.
Can we tell the difference between a review that is real, a review that is generic, and a review that was generated by AI?
The review that sounds right
The hard part is that the surface is misleading. Generic reviews know the costume. They say the introduction needs work. They ask for more recent references. They mention typos. They recommend clearer motivation. None of that is wrong, exactly. It is just reusable.
AI-generated reviews have the same problem, but with better grammar and more confidence. They can look thoughtful while staying safely abstract. They can summarize the paper back to you and still avoid the actual critique.
So I did not want to build a simple “AI or human” detector. That framing is too small. A human can write a lazy review. A model can write a detailed one. The better question is whether the feedback has enough specificity to be useful.
The paper uses three classes: authentic, generic, and AI-generated. That third bucket matters, but the second one matters just as much.
```mermaid
graph LR
classDef source fill:#3c474d,stroke:#dbbc7f,stroke-width:2px,color:#d3c6aa;
classDef combine fill:#323c41,stroke:#7fbbb3,stroke-width:2px,color:#d3c6aa;
classDef model fill:#2b3339,stroke:#a7c080,stroke-width:2px,color:#d3c6aa;
classDef result fill:#323c41,stroke:#e69875,stroke-width:2px,color:#d3c6aa;
classDef action fill:#3c474d,stroke:#d699b6,stroke-width:2px,stroke-dasharray: 4 4,color:#d3c6aa;
Paper["Paper context<br/>title + abstract"]:::source
Review["Review text<br/>feedback to audit"]:::source
Pack["Combined input<br/>context + review"]:::combine
Encoder["Transformer encoder<br/>BERT / RoBERTa / XLNet / GPT"]:::model
Classifier["Three-way classifier<br/>review behavior"]:::model
Authentic["Authentic<br/>specific critique"]:::result
Generic["Generic<br/>reusable comments"]:::result
Generated["AI-generated<br/>fluent but suspect"]:::result
Flag["Editorial signal<br/>needs human look"]:::action
Paper --> Pack
Review --> Pack
Pack --> Encoder
Encoder --> Classifier
Classifier --> Authentic
Classifier --> Generic
Classifier --> Generated
Generic --> Flag
Generated --> Flag
click Paper showDiagramTip "Paper context"
click Review showDiagramTip "Review text"
click Pack showDiagramTip "Combined model input"
click Encoder showDiagramTip "Transformer encoder"
click Classifier showDiagramTip "Three-way classifier"
click Authentic showDiagramTip "Authentic reviews"
click Generic showDiagramTip "Generic reviews"
click Generated showDiagramTip "AI-generated reviews"
click Flag showDiagramTip "Editorial signal"
```
Building a small audit trail
We built a dataset of 399 review entries. Each row had the paper title, abstract, review text, and review type. The authentic reviews came from journals that publish open peer reviews, including MDPI and PeerJ. The generic reviews were assembled by concatenating sentences from an online review generator. The AI-generated reviews came from ChatGPT, DeepSeek AI, and Claude.ai, using the paper abstract as context.
It was not a huge dataset. It was intentionally narrow. I wanted the model to look at the title, the abstract, and the review together, then decide what kind of feedback it was seeing.
The split looked like this:
| Review Type | Count |
|---|---|
| Authentic | 202 |
| Generic | 117 |
| AI-generated | 80 |
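If you want to poke at a split like this yourself, a minimal loading sketch follows. The file name and column names are assumptions for illustration, not necessarily what the released dataset uses.

```python
# Minimal sketch: load the labeled reviews and make a stratified split.
# "reviews.csv" and the column names are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("reviews.csv")      # 399 rows: title, abstract, review, label
print(df["label"].value_counts())    # authentic 202, generic 117, ai_generated 80

# Stratify so the small AI-generated class shows up in the held-out set too.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
```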
There is something useful about keeping the setup small. You can feel the limits. You can see where the model starts confusing “sounds like a review” with “is actually a review.” That confusion is the whole problem.
Putting models in the reviewer seat
I tested the usual transformer family first: BERT, DistilBERT, XLNet, and RoBERTa. They are not glamorous anymore, but they are still useful baselines. They are small enough to reason about, cheap enough to train, and practical enough for the kind of screening tool a publisher might actually deploy.
Then we added GPT-4o-nano through supervised fine-tuning. That changed the ceiling.
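For context, supervised fine-tuning through the OpenAI API expects chat-formatted JSONL. This is a hedged sketch of what that preparation could look like; the prompt wording and label strings are my own illustration rather than the paper's exact setup, and `train_df` is the split from the sketch above.

```python
# Sketch: serialize labeled rows as chat-format JSONL for supervised
# fine-tuning. Prompt text and label strings are illustrative only.
import json

def to_example(row):
    user = (
        f"Title: {row['title']}\n"
        f"Abstract: {row['abstract']}\n"
        f"Review: {row['review']}\n"
        "Classify this review as authentic, generic, or ai_generated."
    )
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": row["label"]},
        ]
    }

with open("train.jsonl", "w") as f:
    for _, row in train_df.iterrows():
        f.write(json.dumps(to_example(row)) + "\n")
```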
| Model | Role in the study |
|---|---|
| BERT | The baseline. |
| DistilBERT | The lighter, faster option. |
| XLNet | The bidirectional context experiment. |
| RoBERTa | The strongest BERT-family held-out performer. |
| GPT-4o-nano | The model that raised the benchmark. |
The input was simple: title, abstract, and review text, packed together with special tokens. The question was whether the model could learn the difference between engagement and performance.
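To make "packed together with special tokens" concrete, here is a sketch of the text-pair encoding I have in mind: paper context as the first segment, review as the second, so the tokenizer inserts the model's own separator tokens between them. The model name, max length, and concatenation order are assumptions, not the paper's exact configuration.

```python
# Sketch: encode (paper context, review) as a text pair so the tokenizer
# adds the model's special tokens, then score it with a 3-way head.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=3  # ai_generated / authentic / generic
)

context = "Paper title. Paper abstract text goes here."                    # first segment
review = "The introduction needs work and the references are outdated."    # second segment

inputs = tokenizer(
    context,
    review,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
logits = model(**inputs).logits  # one logit per class; meaningful only after fine-tuning
```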
Where the signal showed up
GPT-4o-nano was the clear winner on the held-out test set:
| Metric | GPT-4o-nano |
|---|---|
| Held-out accuracy | 0.807 |
| Macro F1 | 0.806 |
| Weighted F1 | 0.810 |
| Macro precision | 0.817 |
| Macro recall | 0.807 |
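For anyone unfamiliar with the aggregations in that table: macro scores average the per-class metrics equally, while weighted scores average them by class support. A toy sketch with scikit-learn, using made-up labels:

```python
# Sketch: macro vs. weighted F1 on toy labels (not the paper's predictions).
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = ["authentic", "generic", "ai_generated", "authentic", "generic"]
y_pred = ["authentic", "authentic", "ai_generated", "authentic", "generic"]

print("accuracy:       ", accuracy_score(y_true, y_pred))
print("macro F1:       ", f1_score(y_true, y_pred, average="macro"))     # classes count equally
print("weighted F1:    ", f1_score(y_true, y_pred, average="weighted"))  # weighted by class size
print("macro precision:", precision_score(y_true, y_pred, average="macro"))
print("macro recall:   ", recall_score(y_true, y_pred, average="macro"))
```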
RoBERTa was the strongest of the smaller models, which was not too surprising. It usually handles this kind of text classification work better than plain BERT.
| Model | Accuracy | Macro F1 |
|---|---|---|
| BERT | 0.638 | 0.636 |
| DistilBERT | 0.675 | 0.611 |
| RoBERTa | 0.713 | 0.706 |
| XLNet | 0.633 | 0.653 |
| GPT-4o-nano | 0.807 | 0.806 |
The interesting part was the generic class. That is where the smaller models struggled most. Generic reviews are slippery because they borrow the right academic shape. They are not obviously broken. They are just hollow.
| Model | AI-Generated F1 | Authentic F1 | Generic F1 |
|---|---|---|---|
| BERT | 0.690 | 0.730 | 0.430 |
| DistilBERT | 0.770 | 0.760 | 0.220 |
| RoBERTa | 0.840 | 0.740 | 0.550 |
| XLNet | 0.820 | 0.670 | 0.470 |
| GPT-4o-nano | 0.880 | 0.830 | 0.730 |
GPT-4o-nano reached 0.73 F1 on generic reviews. That number matters to me more than the headline accuracy. It means the model was better at spotting the absence of specificity, not just the presence of AI-like phrasing.
The thing I keep coming back to
I keep coming back to this: detection is not the same as judgment.
A classifier should not replace editors. It should not become another automatic gate that people blindly trust. But it can act like a smoke alarm. It can say: this review looks unusually generic, or this review looks generated, or this review may need another human look before it becomes part of the record.
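As a sketch of what that smoke alarm could look like in code: take the classifier's probabilities and surface anything that is not confidently authentic. The label order and the 0.6 threshold are illustrative assumptions, not values from the paper.

```python
# Sketch: turn class probabilities into an editorial flag, not a verdict.
from typing import Optional
import torch

LABELS = ["ai_generated", "authentic", "generic"]  # assumed label order

def flag_for_human_look(logits: torch.Tensor, threshold: float = 0.6) -> Optional[str]:
    """Return a flag message when a review should get another human look."""
    probs = logits.softmax(dim=-1).squeeze()
    top = int(probs.argmax())
    confidence = probs[top].item()
    if LABELS[top] != "authentic" or confidence < threshold:
        return f"needs human look: classified {LABELS[top]} (p={confidence:.2f})"
    return None  # confidently authentic; nothing to flag
```

The point of a function like this is that its output lands in an editor's queue, never in an automatic decision.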
That feels like the right level of ambition for now.
I also like that the smaller models still have a role. RoBERTa did not beat GPT-4o-nano, but it is cheaper and easier to host. Sometimes the practical tool is not the largest model. Sometimes it is the one that can sit quietly inside an editorial workflow and flag the obvious failures.
Leaving the trail open
The paper is only a starting point. The next version needs more disciplines, more review styles, better long-context handling, and richer signals like reviewer history or paper metadata. But I think the basic direction is right: peer review needs tools that look at usefulness, not just authorship.
The work is open here:
| Resource | Link |
|---|---|
| IEEE paper | ieeexplore.ieee.org/document/11446645 |
| Training code | github.com/pnpm-aftab/auditing-peer-reviews |
| Model hub | huggingface.co/aashir021/peer-review-classifiers |
The thing I want is not a world where models review papers for us. I want a world where bad reviews are harder to hide behind polished language. That feels like a small but meaningful step toward research that is treated with the care it deserves.