
# Benchmark Overview

This page summarizes Manuscript’s detection performance across all content types.

| Content Type | Dataset Size | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| Text | 100 | 90.00% | 100.00% | 80.00% | 88.89% |
| Image | 100 | 50.00% | 50.00% | 100.00% | 66.67% |
| Audio | 100 | 46.00% | 47.92% | 92.00% | 63.01% |
| Video | 100 | Pending | - | - | - |

Benchmark run: January 2026 with Manuscript v0.2.0

| Content Type | Metric | v0.1.0 | v0.2.0 | Change |
|---|---|---|---|---|
| Audio | Accuracy | 38.00% | 46.00% | +8.00% |
| Audio | Recall | 76.00% | 92.00% | +16.00% |
| Audio | F1 Score | 55.07% | 63.01% | +7.94% |
| Text | Accuracy | 90.00% | 90.00% | No change |
| Image | Accuracy | 50.00% | 50.00% | No change |

The v0.2.0 enhancements (FFT spectral analysis, MFCC computation, temporal consistency) significantly improved audio detection.
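
As a rough illustration of what those three feature families look like, here is a minimal librosa-based sketch; the function name, parameters, and the temporal-consistency heuristic are assumptions for illustration, not Manuscript's actual implementation.

```python
# Illustrative sketch of the v0.2.0 audio feature families: FFT
# spectral analysis, MFCC computation, and temporal consistency.
# Names and parameters are assumptions, not Manuscript's code.
import numpy as np
import librosa

def extract_audio_features(path: str) -> dict:
    y, sr = librosa.load(path, sr=16000, mono=True)

    # FFT spectral analysis: short-time magnitude spectrum
    spectrum = np.abs(librosa.stft(y, n_fft=2048))
    centroid = librosa.feature.spectral_centroid(S=spectrum, sr=sr)

    # MFCC computation: compact summary of the spectral envelope
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Temporal consistency: variance of frame-to-frame MFCC deltas;
    # unnaturally uniform frames can indicate synthesized speech
    temporal_consistency = float(np.var(np.diff(mfcc, axis=1)))

    return {
        "spectral_centroid_mean": float(centroid.mean()),
        "mfcc_mean": mfcc.mean(axis=1).tolist(),
        "temporal_consistency": temporal_consistency,
    }
```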

## Text Detection

| Metric | Value |
|---|---|
| Accuracy | 90.00% |
| Precision | 100.00% |
| Recall | 80.00% |
| F1 Score | 88.89% |

Confusion matrix:

| | Predicted Human | Predicted AI |
|---|---|---|
| Actual Human | 50 | 0 |
| Actual AI | 10 | 40 |

- Text detection shows excellent precision (100%): zero false positives
- 80% recall indicates 10 AI samples were misclassified as human
- 90% accuracy exceeds the 58-65% baseline from academic benchmarks
- Primary challenge: some AI-generated text is indistinguishable from human writing
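
The headline numbers follow directly from the confusion matrix above; a few lines of plain Python reproduce them:

```python
# Recompute the text metrics from the confusion matrix above.
# TP: AI predicted AI, FP: human predicted AI,
# FN: AI predicted human, TN: human predicted human.
tp, fp, fn, tn = 40, 0, 10, 50

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.90
precision = tp / (tp + fp)                          # 1.00, zero false positives
recall = tp / (tp + fn)                             # 0.80, 10 AI samples missed
f1 = 2 * precision * recall / (precision + recall)  # ~0.8889

print(f"accuracy={accuracy:.2%}  precision={precision:.2%}  "
      f"recall={recall:.2%}  f1={f1:.2%}")
```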

## Image Detection

| Metric | Value |
|---|---|
| Accuracy | 50.00% |
| Precision | 50.00% |
| Recall | 100.00% |
| F1 Score | 66.67% |

Confusion matrix:

| | Predicted Human | Predicted AI |
|---|---|---|
| Actual Human | 0 | 50 |
| Actual AI | 0 | 50 |

- Image detection correctly identified all 50 AI-generated images (100% recall)
- However, all 50 human images were also flagged as AI (0% specificity)
- Root cause: downloaded images have their EXIF metadata stripped, removing a key signal of human capture
- The detector needs tuning to reduce false positives on web-sourced images
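
The metadata gap is easy to observe: stripped downloads carry no EXIF tags at all. A minimal Pillow sketch of such a check (a hypothetical heuristic, not Manuscript's actual logic):

```python
# Minimal sketch: EXIF presence check with Pillow. A heuristic that
# treats "no EXIF" as evidence of AI generation flags every
# web-sourced image whose metadata was stripped in transit;
# that is the failure mode behind the 0% specificity above.
from PIL import Image

def has_exif(path: str) -> bool:
    with Image.open(path) as img:
        return len(img.getexif()) > 0

# False for stripped downloads and AI generations alike
print(has_exif("sample.jpg"))
```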

For comparison with commercial detectors:

| Detector | Accuracy |
|---|---|
| Hive Moderation | 98-99.9% |
| AI or Not | 88.89% |
| Manuscript | 50.00% |

## Audio Detection

| Metric | v0.1.0 | v0.2.0 |
|---|---|---|
| Accuracy | 38.00% | 46.00% |
| Precision | 43.18% | 47.92% |
| Recall | 76.00% | 92.00% |
| F1 Score | 55.07% | 63.01% |

Confusion matrix (v0.2.0):

| | Predicted Human | Predicted AI |
|---|---|---|
| Actual Human | 0 | 50 |
| Actual AI | 4 | 46 |

- v0.2.0 improvements significantly increased AI audio detection (+16% recall)
- FFT spectral analysis and MFCC computation catch 8 more AI samples
- The false positive rate is unchanged: clean human audio still triggers false positives
- Root cause: LibriSpeech audiobook recordings are “too clean” and resemble synthesized audio (see the sketch below)
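
One way to make the “too clean” failure mode concrete is a noise-floor estimate over the quietest frames; studio-grade recordings sit far lower than typical consumer audio. This is an illustrative heuristic, and the threshold is an assumption, not Manuscript's actual value:

```python
# Sketch: estimate a recording's noise floor from its quietest frames.
# Studio-clean human audio (e.g., LibriSpeech) can fall below a
# threshold tuned on noisier recordings and be misread as synthetic.
import numpy as np
import librosa

def noise_floor_db(path: str) -> float:
    y, _ = librosa.load(path, sr=16000, mono=True)
    rms = librosa.feature.rms(y=y, frame_length=2048)[0]
    floor = np.percentile(rms, 5)  # 5th-percentile frame energy
    return float(20 * np.log10(floor + 1e-10))

# -60 dB is an illustrative cutoff, not Manuscript's actual value
if noise_floor_db("clip.wav") < -60.0:
    print("suspiciously clean: possible synthesis (or a good studio)")
```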

For comparison with a commercial classifier:

| Detector | Accuracy |
|---|---|
| ElevenLabs Classifier | >99% (unlaundered) |
| ElevenLabs Classifier | >90% (laundered) |
| Manuscript v0.2.0 | 46.00% |

## Video Detection

The video benchmark is pending; it requires video file downloads via API keys.

Based on industry benchmarks:

- Off-the-shelf detectors show 21.3% lower accuracy on Sora-like videos
- Target accuracy: >75%
- Primary challenges: new diffusion video models and compression

## Benchmark Datasets

### Text

| Source | Samples | Description |
|---|---|---|
| HC3 | 37,000+ | Human vs ChatGPT responses |
| Defactify-Text | 58,000+ | Articles + LLM versions |
| HATC-2025 | 50,000+ | Benchmark samples |

### Image

| Source | Samples | Description |
|---|---|---|
| MS COCO-AI | 96,000 | Real + SD3/DALL-E/Midjourney |
| GenImage | 1M+ | Multi-generator dataset |
| AIGIBench | 6,000+ | Latest generators |

### Audio

| Source | Samples | Description |
|---|---|---|
| WaveFake | 117,985 | 7 vocoder architectures |
| LibriSpeech | 1,000+ hrs | Real audiobook speech |
| ASVspoof | 180,000+ | Spoofing detection |

### Video

| Source | Samples | Description |
|---|---|---|
| Deepfake-Eval-2024 | 44+ hrs | In-the-wild deepfakes (includes Sora) |
| DeepfakeBench | Large | 40 deepfake techniques |
| FaceForensics++ | 1.8M+ | Face manipulation |

## Why Manuscript

Despite the accuracy gap relative to commercial solutions, Manuscript offers:

1. Privacy-First: No data leaves your infrastructure
2. Multi-Modal: A single tool for all content types
3. No API Costs: No per-request charges
4. Transparent: Open-source, auditable algorithms
5. Customizable: Adjustable detection weights (see the sketch below)
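
A hypothetical illustration of point 5, adjustable weights combined into a single score (the signal names and default weights are assumptions, not Manuscript's configuration keys):

```python
# Hypothetical weighted fusion of per-signal detection scores in [0, 1].
# Signal names and default weights are assumptions for this sketch.
WEIGHTS = {"spectral": 0.4, "mfcc": 0.4, "temporal": 0.2}

def combined_score(signals: dict) -> float:
    """Weighted average of per-signal scores."""
    total = sum(WEIGHTS.values())
    return sum(w * signals[name] for name, w in WEIGHTS.items()) / total

# Lowering a weight that over-fires on clean human audio is one way
# to trade a little recall for fewer false positives.
print(combined_score({"spectral": 0.9, "mfcc": 0.7, "temporal": 0.2}))
```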

## Running the Benchmarks

```bash
# Clone the repository
git clone https://github.com/vinpatel/manuscript
cd manuscript

# Download benchmark datasets
make download-benchmark-data

# Run the full benchmark suite
make benchmark-all

# Generate report
make benchmark-report
```

## Citation

```bibtex
@misc{manuscript2025benchmark,
  title={Manuscript Benchmark: Multi-Modal AI Content Detection Evaluation},
  author={Manuscript Contributors},
  year={2025},
  url={https://github.com/manuscript/manuscript/benchmark}
}
```