Datasets
Benchmark Datasets
All datasets used in Manuscript benchmarks are publicly available.
Text Datasets
HC3 (Human ChatGPT Comparison)
- Source: HuggingFace
- License: Apache 2.0
- Size: 37,000+ QA pairs
- Content: Human vs ChatGPT responses to questions
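To make the "human vs ChatGPT responses" structure concrete, here is a minimal sketch that flattens one QA record into labeled samples. It assumes each record carries a `question` plus two answer lists named `human_answers` and `chatgpt_answers`; those field names are an assumption to verify against the dataset card, and `flatten_qa_record` is a hypothetical helper, not part of Manuscript.

```python
from typing import Iterator


def flatten_qa_record(record: dict) -> Iterator[tuple[str, str]]:
    """Yield (text, label) pairs from one HC3-style QA record.

    Field names are assumed; check them against the dataset card.
    """
    for answer in record.get("human_answers", []):
        yield answer, "human"
    for answer in record.get("chatgpt_answers", []):
        yield answer, "ai"


# A hand-written record used only for illustration, not real dataset content.
example = {
    "question": "Why is the sky blue?",
    "human_answers": ["Air scatters short wavelengths more strongly."],
    "chatgpt_answers": ["The sky appears blue because of Rayleigh scattering."],
}

samples = list(flatten_qa_record(example))
for text, label in samples:
    print(f"{label}: {text}")
```

Each question can contribute several samples per class, so class balance should be checked after flattening rather than assumed from the number of QA pairs.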
Defactify-Text
- Source: arXiv:2510.22874
- License: Research
- Size: 58,000+ articles
- Content: NYT articles + LLM-generated versions (GPT-4, Gemma, Mistral)
HATC-2025
- Source: Hastewire
- License: Research
- Size: 50,000+ samples
- Content: Human vs AI passages benchmark
LLMSciTxt
- Source: arXiv:2507.05157
- License: Research
- Size: 10,000+ papers
- Content: Scientific papers: human vs ChatGPT/Gemini/Llama-3
Beemo Benchmark
- Source: Toloka
- License: Research
- Size: Varied
- Content: Human, machine-generated, and edited content
Image Datasets
MS COCOAI
- Source: arXiv:2601.00553
- License: Research
- Size: 96,000 pairs
- Content: MS COCO + SD3, SDXL, DALL-E 3, Midjourney v6
GenImage
- Source: GitHub
- License: Research
- Size: 1M+ images
- Content: Midjourney, Stable Diffusion, ADM, GLIDE, etc.
AIGIBench
- Source: arXiv:2505.12335
- License: Research
- Size: 6,000+ samples
- Content: SD-XL, SD-3, DALL-E 3, Midjourney v6, FLUX, Imagen-3
Human Image Sources
- Unsplash: Free high-resolution photos
- Pexels: Free stock photos
- COCO: Common Objects in Context dataset
Audio Datasets
WaveFake
- Source: Zenodo
- License: CC BY 4.0
- Size: 117,985 samples
- Content: 7 vocoder architectures
LibriSpeech
- Source: OpenSLR
- License: CC BY 4.0
- Size: Approximately 1,000 hours
- Content: Clean speech from audiobooks
LJSpeech
- Source: Keith Ito
- License: Public Domain
- Size: 13,100 clips
- Content: Single female speaker recordings
ASVspoof
- Source: asvspoof.org
- License: Research
- Size: 180,000+ samples
- Content: Spoofing and deepfake detection
TIMIT-ElevenLabs
- Source: arXiv:2307.07683
- License: Research
- Size: Varied
- Content: Real vs ElevenLabs cloned voices
Video Datasets
Deepfake-Eval-2024
- Source: arXiv:2503.02857
- License: Research
- Size: 44+ hours
- Content: In-the-wild deepfakes from 2024 (includes Sora)
DF40/DeepfakeBench
- Source: GitHub
- License: MIT
- Size: Large
- Content: 40 deepfake techniques
FaceForensics++
- Source: GitHub
- License: Research
- Size: 1.8M+ images
- Content: DeepFakes, Face2Face, FaceSwap, NeuralTextures
Microsoft Deepfake Dataset
- Source: Microsoft
- License: Research
- Size: 50,000+ samples
- Content: Real-world deepfakes and synthetic media
Kaggle DFD
- Source: Kaggle
- License: Research
- Size: 10,000+ samples
- Content: Original deepfake detection dataset
Downloading Datasets
Automatic Download
make download-benchmark-data
Manual Download
See benchmark/DATASET_SOURCES.md for detailed instructions on accessing each dataset.
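After either download path, it can help to confirm which datasets are actually on disk before running benchmarks. This page does not describe where the download step stores files, so the sketch below assumes a hypothetical `<root>/<modality>/<dataset>/` layout. The directory names in `EXPECTED` and the helper `missing_datasets` are illustrative only and should be adjusted to match the real output.

```python
from pathlib import Path

# Hypothetical layout: <root>/<modality>/<dataset>/
# These names are placeholders, not the paths the Makefile target really uses.
EXPECTED = {
    "text": ["hc3", "beemo"],
    "audio": ["wavefake", "ljspeech"],
}


def missing_datasets(root: Path, expected: dict = EXPECTED) -> list:
    """Return 'modality/name' entries whose directory is absent or empty."""
    missing = []
    for modality, names in expected.items():
        for name in names:
            path = root / modality / name
            # A dataset counts as present only if its directory exists
            # and contains at least one entry.
            if not path.is_dir() or not any(path.iterdir()):
                missing.append(f"{modality}/{name}")
    return missing


if __name__ == "__main__":
    for entry in missing_datasets(Path("benchmark/data")):
        print(f"missing: {entry}")
```

Treating an empty directory as missing catches downloads that were interrupted after the folder was created.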
License Compliance
| Dataset | License | Commercial Use |
|---|---|---|
| HC3 | Apache 2.0 | Yes |
| WaveFake | CC BY 4.0 | Yes (with attribution) |
| LibriSpeech | CC BY 4.0 | Yes (with attribution) |
| LJSpeech | Public Domain | Yes |
| DeepfakeBench | MIT | Yes |
| Others | Research | Academic only |
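The table above can be encoded as a small lookup so that a pipeline can refuse research-only data in a commercial setting. This is a minimal sketch: `LicenseInfo`, `license_for`, and the other helper names are hypothetical, and the values simply mirror the table. Any dataset not listed falls back to the "Others" row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseInfo:
    license: str
    commercial_use: bool
    attribution_required: bool = False


# Values copied from the License Compliance table.
LICENSES = {
    "HC3": LicenseInfo("Apache 2.0", True),
    "WaveFake": LicenseInfo("CC BY 4.0", True, attribution_required=True),
    "LibriSpeech": LicenseInfo("CC BY 4.0", True, attribution_required=True),
    "LJSpeech": LicenseInfo("Public Domain", True),
    "DeepfakeBench": LicenseInfo("MIT", True),
}

# The "Others" row: research license, academic use only.
RESEARCH_ONLY = LicenseInfo("Research", False)


def license_for(name: str) -> LicenseInfo:
    """Look up a dataset, defaulting to the research-only row."""
    return LICENSES.get(name, RESEARCH_ONLY)


def commercial_use_allowed(name: str) -> bool:
    return license_for(name).commercial_use


def needs_attribution(name: str) -> bool:
    return license_for(name).attribution_required


def commercially_usable() -> list:
    """Names of all listed datasets that permit commercial use, sorted."""
    return sorted(n for n, info in LICENSES.items() if info.commercial_use)


if __name__ == "__main__":
    print(commercially_usable())
    print(commercial_use_allowed("GenImage"))
```

Defaulting unknown names to research-only is the conservative choice: a newly added dataset stays excluded from commercial use until someone records its license explicitly.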
Contributing Datasets
We welcome contributions of:
- Labeled human vs AI content
- New AI generator samples
- Edge case examples
- Multi-language content
Submit via GitHub Issues.