Users search for the RAID benchmark, AI-detection leaderboards, and dataset names like HC3 or GRiD. This page is a navigation hub: it explains the metrics and points to the source leaderboards and to our Accuracy report.
If your query includes “leaderboard”, you likely want an external scoreboard you can verify. Here are the main destinations:
For the full context (screenshots, explanations, and the set of datasets we track across releases), see the Accuracy page.
Benchmarks differ by domain (essays, Q&A, Reddit answers), language, and attack setup (paraphrasing, rewriting, style transfer). That’s why we reference multiple datasets instead of relying on a single test.
A large benchmark designed to test robustness beyond “clean” generations, including adversarial settings and multiple domains.
A unified framework aggregating multiple datasets to compare detectors in a standardised way.
Public datasets covering Reddit-style answers, Q&A pairs, and mixed writing tasks — useful for checking false positives and “edited AI” behaviour.
Education-focused essays (ASAP) and bilingual rewrite/translate operations (CUDRT) that stress-test detectors on realistic transformations.
Benchmark reports often mention ROC-AUC, FPR, Accuracy, and F1-score. In practical terms:
ROC-AUC — how well the detector ranks AI text above human text across all decision thresholds (1.0 is perfect, 0.5 is random guessing).
FPR (false-positive rate) — the share of human-written texts wrongly flagged as AI; the number that matters most for fairness to real authors.
Accuracy — the share of all texts classified correctly at a chosen threshold.
F1-score — the harmonic mean of precision and recall, balancing missed AI texts against false alarms.
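The metrics above can be computed from first principles. Here is a minimal sketch in pure Python on toy data (the labels and scores are illustrative, not taken from any leaderboard); it uses the rank-statistic view of ROC-AUC — the probability that a random AI sample scores above a random human sample:

```python
# Toy illustration of the four metrics. Labels: 1 = AI-written, 0 = human.
def confusion(labels, scores, threshold):
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp, fp, fn, tn

def metrics(labels, scores, threshold=0.5):
    tp, fp, fn, tn = confusion(labels, scores, threshold)
    accuracy = (tp + tn) / len(labels)
    fpr = fp / (fp + tn)  # share of human texts wrongly flagged as AI
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, fpr, f1

def roc_auc(labels, scores):
    # P(random AI sample scores above a random human sample), ties count half.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1]  # hypothetical detector scores
print(metrics(labels, scores))  # accuracy, FPR, F1 at the 0.5 threshold
print(roc_auc(labels, scores))
```

Note how accuracy and F1 depend on the chosen threshold, while ROC-AUC does not — which is why leaderboards usually report it alongside threshold-dependent numbers.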
If you want the “audit trail” (what benchmarks we run before shipping a new model), we describe it on Accuracy.
For queries like “compare AI detectors” or “best AI detector benchmark”, the only fair approach is to compare results on the same benchmark under the same constraints (especially at a fixed false-positive rate). We keep the comparison anchored to public benchmarks and link out to sources where possible.
Next step: open Accuracy and use it as the detailed report page (tables, links, explanations).