Users search for the RAID benchmark, AI-detection leaderboards, and dataset names like HC3 or GRiD. This page is a navigation hub: it explains the metrics and points to the source leaderboards and to our Accuracy report.
If your query includes “leaderboard”, you likely want an external scoreboard you can verify. Here are the main destinations:
For the full context (screenshots, explanations, and the set of datasets we track across releases), see the Accuracy page.
Benchmarks differ by domain (essays, Q&A, Reddit answers), language, and attack setup (paraphrasing, rewriting, style transfer). That’s why we reference multiple datasets instead of relying on a single test.
A large benchmark designed to test robustness beyond “clean” generations, including adversarial settings and multiple domains.
A unified framework aggregating multiple datasets to compare detectors in a standardised way.
Public datasets covering Reddit-style answers, Q&A pairs, and mixed writing tasks — useful for checking false positives and “edited AI” behaviour.
Education-focused essays (ASAP) and bilingual rewrite/translate operations (CUDRT) that stress-test detectors on realistic transformations.
Benchmark reports often mention ROC-AUC, FPR, Accuracy, and F1-score. In practical terms:
ROC-AUC — how well the detector ranks AI text above human text across all decision thresholds (1.0 is perfect, 0.5 is random guessing).
FPR (false-positive rate) — the share of human-written texts wrongly flagged as AI; the number that matters most for fairness to real authors.
Accuracy — the share of all texts classified correctly at a chosen threshold.
F1-score — the harmonic mean of precision and recall, balancing missed AI texts against false alarms.
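The metrics above can be computed from first principles. Here is a minimal sketch in pure Python on toy data (the labels and scores are illustrative, not taken from any leaderboard); it uses the rank-statistic view of ROC-AUC — the probability that a random AI sample scores above a random human sample:

```python
# Toy illustration of the four metrics. Labels: 1 = AI-written, 0 = human.
def confusion(labels, scores, threshold):
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp, fp, fn, tn

def metrics(labels, scores, threshold=0.5):
    tp, fp, fn, tn = confusion(labels, scores, threshold)
    accuracy = (tp + tn) / len(labels)
    fpr = fp / (fp + tn)  # share of human texts wrongly flagged as AI
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, fpr, f1

def roc_auc(labels, scores):
    # P(random AI sample scores above a random human sample), ties count half.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1]  # hypothetical detector scores
print(metrics(labels, scores))  # accuracy, FPR, F1 at the 0.5 threshold
print(roc_auc(labels, scores))
```

Note how accuracy and F1 depend on the chosen threshold, while ROC-AUC does not — which is why leaderboards usually report it alongside threshold-dependent numbers.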
If you want the “audit trail” (what benchmarks we run before shipping a new model), we describe it on Accuracy.
For queries like “compare AI detectors” or “best AI detector benchmark”, the only fair approach is to compare results on the same benchmark under the same constraints (especially at a fixed false-positive rate). We keep the comparison anchored to public benchmarks and link out to sources where possible.
Next step: open Accuracy and use it as the detailed report page (tables, links, explanations).