The weight of a review
I’ve been thinking about the integrity of the systems we rely on. In academia, that system is peer review: the mechanism we trust to vet scientific claims before they enter the record. But lately, I’ve noticed a shift. The friction of writing a detailed, critical review is increasingly being bypassed with Large Language Models (LLMs), and we are seeing more and more reviews that are syntactically perfect but analytically hollow.
It is a subtle kind of degradation. When we accept generic feedback or “pal-reviews” (superficial comments from colleagues), we let flawed research slip through the cracks.
In our recent paper, Classifying Scientific Peer Reviews: Distinguishing Authentic, Generic, and AI-Generated Feedback, my co-authors and I decided to tackle this problem head-on. We wanted to see if we could automatically audit and categorize peer reviews to safeguard the academic publishing ecosystem.
Building the dataset
To train a model to catch a generic or AI-generated review, you first need a high-quality dataset. We curated a specialized dataset of 399 reviews divided into three categories:
- Authentic reviews: Sourced directly from journals that publish open peer reviews, such as MDPI and PeerJ. These serve as our baseline for human, expert-level critique.
- AI-Generated reviews: Prompted from models like ChatGPT, DeepSeek, and Claude using published paper abstracts.
- Generic reviews: Assembled using sentence concatenation tools to mimic low-effort “pal-reviews” (e.g., “The paper has several typos,” “The title is too long”).
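To make the three-class setup concrete, here is a minimal sketch of how such a labeled dataset could be assembled. The column names, label scheme, and example texts are illustrative, not the paper's actual schema or pipeline:

```python
import pandas as pd

# Hypothetical label scheme for the three review categories.
LABELS = {"authentic": 0, "ai_generated": 1, "generic": 2}

records = [
    # (paper title, abstract, review text, category) -- toy examples
    ("Example Paper", "An abstract...", "The methodology is sound, but the ablation is incomplete.", "authentic"),
    ("Example Paper", "An abstract...", "This paper presents a comprehensive and well-structured study.", "ai_generated"),
    ("Example Paper", "An abstract...", "The paper has several typos. The title is too long.", "generic"),
]

df = pd.DataFrame(records, columns=["title", "abstract", "review", "label"])
df["label_id"] = df["label"].map(LABELS)

# The classifier input concatenates title, abstract, and review text,
# mirroring the "Paper Title + Abstract + Review Text" input in the diagram below.
df["text"] = df["title"] + " [SEP] " + df["abstract"] + " [SEP] " + df["review"]
print(df[["label", "label_id"]])
```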
Finding the right filter
We wanted to evaluate how well different transformer architectures could handle the nuanced task of classifying academic text. We tested five models: BERT, DistilBERT, XLNet, RoBERTa, and GPT-4o-nano.
For the BERT-family models, we tokenized inputs (max length: 512), tuned batch sizes, and used focal loss to counter class imbalance. Training followed a rigorous routine with k-fold cross-validation and early stopping.
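Focal loss down-weights examples the model already classifies confidently, so training gradients concentrate on the hard, under-represented classes (here, the rare “Generic” reviews). A pure-NumPy sketch of the idea follows; the hyperparameter value is illustrative, not the paper's tuned setting:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: mean of (1 - p_t)^gamma * -log(p_t).

    p_t is the predicted probability of the true class;
    gamma=0 recovers plain cross-entropy.
    """
    p = softmax(logits)
    p_t = p[np.arange(len(targets)), targets]
    return float(np.mean((1.0 - p_t) ** gamma * -np.log(p_t)))

# Toy batch: 3 reviews, 3 classes (authentic / AI-generated / generic)
logits = np.array([[2.0, 0.1, 0.1],
                   [0.2, 1.5, 0.3],
                   [0.1, 0.2, 0.3]])
targets = np.array([0, 1, 2])
print(focal_loss(logits, targets))
```

Because `(1 - p_t)^gamma < 1` whenever the model has any confidence in the true class, the focal loss is always at most the plain cross-entropy; easy examples shrink the most.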
```mermaid
graph TD
    classDef input fill:#fff,stroke:#333,stroke-width:2px;
    classDef model fill:#f9f9f9,stroke:#333;
    classDef output fill:#e1f5fe,stroke:#333;
    Input["Paper Title + Abstract + Review Text"]:::input
    subgraph Architectures [Evaluated Models]
        M1["RoBERTa"]:::model
        M2["DistilBERT"]:::model
        M3["XLNet"]:::model
        M4["BERT"]:::model
        M5["GPT-4o-nano"]:::model
    end
    Input --> Architectures
    Architectures --> Classify
    Classify["Three-Class Categorization"]
    Classify --> O1["Authentic"]:::output
    Classify --> O2["AI-Generated"]:::output
    Classify --> O3["Generic"]:::output
```
The results: Why GPT-4o-nano won
On our held-out test set, the traditional BERT-family models struggled, particularly with the “Generic” class. DistilBERT and RoBERTa showed moderate performance, with RoBERTa reaching 71.3% accuracy.
However, GPT-4o-nano completely shifted the narrative. It substantially surpassed all other models:
- Accuracy: 80.7%
- Macro-F1: 80.6%
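Macro-F1 averages the per-class F1 scores without weighting by class frequency, so a weak “Generic” class drags the score down even when that class is small. A quick sketch with scikit-learn on toy labels (not our actual test set):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy three-class labels: 0 = authentic, 1 = AI-generated, 2 = generic
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 1]  # one "generic" review misclassified as AI-generated

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={acc:.3f}  macro-F1={macro_f1:.3f}")
```

Note that a single error in the small “generic” class costs more macro-F1 than accuracy, which is exactly why we report both.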
Class-wise F1 Scores
| Class | BERT | DistilBERT | RoBERTa | XLNet | GPT-4o-nano |
|---|---|---|---|---|---|
| AI-Generated | 0.690 | 0.770 | 0.840 | 0.820 | 0.880 |
| Authentic | 0.730 | 0.760 | 0.740 | 0.670 | 0.830 |
| Generic | 0.430 | 0.220 | 0.550 | 0.470 | 0.730 |
GPT-4o-nano demonstrated a massive leap in identifying “Generic” reviews—a category where older models failed to capture the subtle generative characteristics and lack of critical depth.
Future directions
We have made our training pipeline and the fine-tuned models openly available on the Hugging Face Model Hub and our code on GitHub for reproducibility and future research.
As LLMs continue to refine their mimicry of human language, distinguishing human-written from AI-generated text will only get harder. Future work will explore multi-modal approaches that incorporate reviewer history, paper metadata, or review sentiment to further improve classification accuracy. We also plan to investigate custom classification architectures built around CNNs and transformers such as Longformer or ELECTRA, as well as few-shot learning techniques for applying the models to new domains with limited data.