This page summarizes Manuscript’s detection performance across all content types.
| Content Type | Dataset Size | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| Text | 100 | 90.00% | 100.00% | 80.00% | 88.89% |
| Image | 100 | 50.00% | 50.00% | 100.00% | 66.67% |
| Audio | 100 | 46.00% | 47.92% | 92.00% | 63.01% |
| Video | 100 | Pending | - | - | - |
Benchmark run: January 2026 with Manuscript v0.2.0.

**Changes from v0.1.0 to v0.2.0:**
| Content Type | Metric | v0.1.0 | v0.2.0 | Change |
|---|---|---|---|---|
| Audio | Accuracy | 38.00% | 46.00% | +8.00% |
| Audio | Recall | 76.00% | 92.00% | +16.00% |
| Audio | F1 Score | 55.07% | 63.01% | +7.94% |
| Text | Accuracy | 90.00% | 90.00% | No change |
| Image | Accuracy | 50.00% | 50.00% | No change |
The v0.2.0 enhancements (FFT spectral analysis, MFCC computation, temporal consistency) significantly improved audio detection.
**Text detection (100 samples):**

| Metric | Value |
|---|---|
| Accuracy | 90.00% |
| Precision | 100.00% |
| Recall | 80.00% |
| F1 Score | 88.89% |
Confusion matrix:

| | Predicted Human | Predicted AI |
|---|---|---|
| Actual Human | 50 | 0 |
| Actual AI | 10 | 40 |
- Text detection shows excellent precision (100%): zero false positives
- 80% recall indicates that 10 AI samples were misclassified as human
- 90% accuracy exceeds the 58-65% baseline reported in academic benchmarks
- Primary challenge: some AI-generated text is indistinguishable from human writing
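For reference, the metric values above follow directly from the confusion matrix counts; a quick check in Python, treating AI as the positive class:

```python
# Text-detection confusion matrix counts ("AI" is the positive class)
tp, fp, fn, tn = 40, 0, 10, 50

accuracy = (tp + tn) / (tp + fp + fn + tn)          # (40+50)/100 = 0.9000
precision = tp / (tp + fp)                          # 40/40       = 1.0000
recall = tp / (tp + fn)                             # 40/50       = 0.8000
f1 = 2 * precision * recall / (precision + recall)  # 0.8889

print(f"accuracy={accuracy:.2%}  precision={precision:.2%}  "
      f"recall={recall:.2%}  f1={f1:.2%}")
```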
**Image detection (100 samples):**

| Metric | Value |
|---|---|
| Accuracy | 50.00% |
| Precision | 50.00% |
| Recall | 100.00% |
| F1 Score | 66.67% |
Confusion matrix:

| | Predicted Human | Predicted AI |
|---|---|---|
| Actual Human | 0 | 50 |
| Actual AI | 0 | 50 |
- Image detection correctly identified all 50 AI-generated images (100% recall)
- However, all 50 human images were also flagged as AI (0% specificity)
- Root cause: images downloaded from the web have their EXIF metadata stripped, so metadata-based signals fire on human photos as well (see the sketch below)
- The detector needs tuning to reduce false positives on web-sourced images
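A minimal sketch of this failure mode, assuming Pillow; the heuristic here only illustrates a metadata-based signal and is not Manuscript's actual detector logic:

```python
from PIL import Image  # pip install Pillow

def has_exif(path: str) -> bool:
    """Return True if the image file carries any EXIF tags."""
    return len(Image.open(path).getexif()) > 0

# Web pipelines routinely strip EXIF, so a human photo downloaded from
# the web looks exactly like a generated image to a metadata check.
if not has_exif("sample.jpg"):
    print("no EXIF metadata: an unreliable signal of AI generation")
```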
**Comparison with commercial image detectors:**

| Detector | Accuracy |
|---|---|
| Hive Moderation | 98-99.9% |
| AI or Not | 88.89% |
| Manuscript | 50.00% |
**Audio detection (100 samples):**

| Metric | v0.1.0 | v0.2.0 |
|---|---|---|
| Accuracy | 38.00% | 46.00% |
| Precision | 43.18% | 47.92% |
| Recall | 76.00% | 92.00% |
| F1 Score | 55.07% | 63.01% |
Confusion matrix (v0.2.0):

| | Predicted Human | Predicted AI |
|---|---|---|
| Actual Human | 0 | 50 |
| Actual AI | 4 | 46 |
- v0.2.0 improvements increased AI audio detection significantly (+16% recall)
- FFT spectral analysis and MFCC computation catch 8 more AI samples (see the feature sketch below)
- False positive rate unchanged: clean human audio still triggers false positives
- Root cause: LibriSpeech audiobook recordings are “too clean” and resemble synthesized audio
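A minimal sketch of the feature families v0.2.0 adds (FFT spectral analysis, MFCCs, temporal consistency), assuming librosa; this illustrates the techniques, not Manuscript's actual implementation:

```python
import numpy as np
import librosa  # pip install librosa

def audio_features(path: str) -> np.ndarray:
    """Illustrative FFT/MFCC feature vector for one audio clip."""
    y, sr = librosa.load(path, sr=16000, mono=True)

    # FFT spectral analysis: magnitude spectrogram statistics
    spec = np.abs(librosa.stft(y, n_fft=1024))
    centroid = librosa.feature.spectral_centroid(S=spec, sr=sr)

    # MFCCs: compact description of the spectral envelope
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Temporal consistency: frame-to-frame MFCC variation; unusually
    # low variation can indicate overly smooth, synthesized speech
    delta = np.diff(mfcc, axis=1)

    return np.concatenate([
        [centroid.mean(), centroid.std()],
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [np.abs(delta).mean()],
    ])
```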
**Comparison with commercial audio detectors:**

| Detector | Accuracy |
|---|---|
| ElevenLabs Classifier | >99% (unlaundered) |
| ElevenLabs Classifier | >90% (laundered) |
| Manuscript v0.2.0 | 46.00% |
**Video detection:** the benchmark is pending; it requires video file downloads via API keys.
Based on industry benchmarks:
- Off-the-shelf detectors show 21.3% lower accuracy on Sora-like videos
- Target accuracy: >75%
- Primary challenges: new diffusion video models and compression artifacts
**Text dataset sources:**

| Source | Samples | Description |
|---|---|---|
| HC3 | 37,000+ | Human vs ChatGPT responses |
| Defactify-Text | 58,000+ | Articles + LLM versions |
| HATC-2025 | 50,000+ | Benchmark samples |
**Image dataset sources:**

| Source | Samples | Description |
|---|---|---|
| MS COCOAI | 96,000 | Real + SD3/DALL-E/Midjourney |
| GenImage | 1M+ | Multi-generator dataset |
| AIGIBench | 6,000+ | Latest generators |
**Audio dataset sources:**

| Source | Samples | Description |
|---|---|---|
| WaveFake | 117,985 | 7 vocoder architectures |
| LibriSpeech | 1000+ hrs | Real audiobook speech |
| ASVspoof | 180,000+ | Spoofing detection |
**Video dataset sources:**

| Source | Samples | Description |
|---|---|---|
| Deepfake-Eval-2024 | 44+ hrs | In-the-wild deepfakes (includes Sora) |
| DeepfakeBench | Large | 40 deepfake techniques |
| FaceForensics++ | 1.8M+ | Face manipulation |
Despite the accuracy gap relative to commercial solutions, Manuscript offers:
- Privacy-First: No data leaves your infrastructure
- Multi-Modal: Single tool for all content types
- No API Costs: No per-request charges
- Transparent: Open-source, auditable algorithms
- Customizable: Adjustable detection weights (see the sketch below)
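As an illustration of that last point, a hypothetical weight configuration; the keys and values below are invented for this sketch and are not Manuscript's documented schema:

```python
# Hypothetical detection-weight configuration; key names and values are
# illustrative only, not Manuscript's documented schema.
DETECTION_WEIGHTS = {
    "audio": {
        "fft_spectral": 0.4,          # FFT spectral analysis (added in v0.2.0)
        "mfcc": 0.4,                  # MFCC features (added in v0.2.0)
        "temporal_consistency": 0.2,  # frame-to-frame smoothness
    },
    "image": {
        "exif_metadata": 0.2,         # down-weighted for web-sourced images
        "pixel_statistics": 0.8,      # placeholder for pixel-level signals
    },
}

# Each modality's weights sum to 1.0 so a weighted score stays in [0, 1].
assert all(abs(sum(w.values()) - 1.0) < 1e-9 for w in DETECTION_WEIGHTS.values())
```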
To reproduce the benchmark:

```bash
# Clone the repository and enter it
git clone https://github.com/vinpatel/manuscript
cd manuscript

# Download benchmark datasets
make download-benchmark-data

# Run the full benchmark suite (target name assumed; check the Makefile)
make benchmark
```
To cite this benchmark:

```bibtex
@misc{manuscript2025benchmark,
  title={Manuscript Benchmark: Multi-Modal AI Content Detection Evaluation},
  author={Manuscript Contributors},
  url={https://github.com/manuscript/manuscript/benchmark}
}
```