Datasets

All datasets used in Manuscript benchmarks are publicly available.

  • Source: HuggingFace
  • License: Apache 2.0
  • Size: 37,000+ QA pairs
  • Content: Human vs ChatGPT responses to questions

  • Source: arXiv:2510.22874
  • License: Research
  • Size: 58,000+ articles
  • Content: NYT articles + LLM-generated versions (GPT-4, Gemma, Mistral)

  • Source: Hastewire
  • License: Research
  • Size: 50,000+ samples
  • Content: Human vs AI passages benchmark

  • Source: arXiv:2507.05157
  • License: Research
  • Size: 10,000+ papers
  • Content: Scientific papers: human vs ChatGPT/Gemini/Llama-3

  • Source: Toloka
  • License: Research
  • Size: Varied
  • Content: Human, machine-generated, and edited content

  • Source: arXiv:2601.00553
  • License: Research
  • Size: 96,000 pairs
  • Content: MS COCO + SD3, SDXL, DALL-E 3, Midjourney v6

  • Source: GitHub
  • License: Research
  • Size: 1M+ images
  • Content: Midjourney, Stable Diffusion, ADM, GLIDE, etc.

  • Source: arXiv:2505.12335
  • License: Research
  • Size: 6,000+ samples
  • Content: SD-XL, SD-3, DALL-E 3, Midjourney v6, FLUX, Imagen-3

  • Unsplash: Free high-resolution photos
  • Pexels: Free stock photos
  • COCO: Common Objects in Context dataset

  • Source: Zenodo
  • License: CC BY 4.0
  • Size: 117,985 samples
  • Content: 7 vocoder architectures

  • Source: OpenSLR
  • License: CC BY 4.0
  • Size: 1,000+ hours
  • Content: Clean speech from audiobooks

  • Source: Keith Ito
  • License: Public Domain
  • Size: 13,100 clips
  • Content: Single female speaker recordings

  • Source: asvspoof.org
  • License: Research
  • Size: 180,000+ samples
  • Content: Spoofing and deepfake detection

  • Source: arXiv:2307.07683
  • License: Research
  • Size: Varied
  • Content: Real vs ElevenLabs cloned voices

  • Source: arXiv:2503.02857
  • License: Research
  • Size: 44+ hours
  • Content: In-the-wild deepfakes from 2024 (includes Sora)

  • Source: GitHub
  • License: MIT
  • Size: Large
  • Content: 40 deepfake techniques

  • Source: GitHub
  • License: Research
  • Size: 1.8M+ images
  • Content: DeepFakes, Face2Face, FaceSwap, NeuralTextures

  • Source: Microsoft
  • License: Research
  • Size: 50,000+ samples
  • Content: Real-world deepfakes and synthetic media

  • Source: Kaggle
  • License: Research
  • Size: 10,000+ samples
  • Content: Original deepfake detection dataset
To download the benchmark datasets, run:
make download-benchmark-data

See benchmark/DATASET_SOURCES.md for detailed instructions on accessing each dataset.

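| Dataset       | License       | Commercial Use         |
|---------------|---------------|------------------------|
| HC3           | Apache 2.0    | Yes                    |
| WaveFake      | CC BY 4.0     | Yes (with attribution) |
| LibriSpeech   | CC BY 4.0     | Yes (with attribution) |
| LJSpeech      | Public Domain | Yes                    |
| DeepfakeBench | MIT           | Yes                    |
| Others        | Research      | Academic only          |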
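
When assembling a benchmark subset for a commercial project, it can help to encode the license table as data and filter on it. The snippet below is a minimal sketch of that idea, not part of the Manuscript codebase: the names `LICENSES`, `license_for`, `commercial_ok`, and `needs_attribution` are hypothetical, and any dataset not listed by name is assumed to fall under the "Others" row (Research, academic only).

```python
# License table from this page, encoded as data.
# Each entry maps a dataset name to (license, commercial-use status).
LICENSES = {
    "HC3": ("Apache 2.0", "yes"),
    "WaveFake": ("CC BY 4.0", "attribution"),
    "LibriSpeech": ("CC BY 4.0", "attribution"),
    "LJSpeech": ("Public Domain", "yes"),
    "DeepfakeBench": ("MIT", "yes"),
}

# Assumption: every dataset not named above is covered by the "Others" row.
DEFAULT = ("Research", "academic-only")


def license_for(name: str) -> tuple[str, str]:
    """Return (license, commercial-use status) for a dataset name."""
    return LICENSES.get(name, DEFAULT)


def commercial_ok(name: str) -> bool:
    """True if the table allows commercial use, with or without attribution."""
    return license_for(name)[1] != "academic-only"


def needs_attribution(name: str) -> bool:
    """True if commercial use is allowed only with attribution."""
    return license_for(name)[1] == "attribution"


if __name__ == "__main__":
    for name in ["HC3", "WaveFake", "SomeUnlistedDataset"]:
        lic, _ = license_for(name)
        print(f"{name}: {lic}, commercial={commercial_ok(name)}, "
              f"attribution={needs_attribution(name)}")
```

Treating unlisted datasets as academic only is the conservative default. Check the upstream license text for each dataset before relying on a lookup like this.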

We welcome contributions of:

  • Labeled human vs AI content
  • New AI generator samples
  • Edge case examples
  • Multi-language content

Submit via GitHub Issues.