
Methodology

This document describes how Manuscript’s detection accuracy is measured.

We use standard machine learning metrics.

Accuracy — the percentage of correctly classified samples:

Accuracy = (True Positives + True Negatives) / Total Samples

Precision — the fraction of samples flagged as AI that actually are AI:

Precision = True Positives / (True Positives + False Positives)

Recall — the fraction of actual AI samples that were detected:

Recall = True Positives / (True Positives + False Negatives)

F1 — the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
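The four metrics above can be computed directly from confusion-matrix counts. A minimal sketch (the function name and the sample counts in the example are illustrative, not from the benchmark):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical run over 100 samples: 45 AI caught, 47 human passed,
# 3 human samples misflagged as AI, 5 AI samples missed.
m = classification_metrics(tp=45, tn=47, fp=3, fn=5)
```

Note that precision and recall are undefined when their denominators are zero (no positive predictions, or no AI samples); production code should guard those cases.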

For each content type:

1. Load dataset (50 human + 50 AI-generated samples)
2. Run Manuscript detection on each sample
3. Record: prediction, confidence, processing time
4. Calculate aggregate metrics
5. Analyze failure cases
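The five steps above can be sketched as a loop over labeled samples. Here `detect` is a stand-in for the actual Manuscript detection call (its signature is an assumption for illustration):

```python
import time

def run_benchmark(samples, detect):
    """samples: list of (content, true_label) pairs, labels are "ai" or "human".
    detect: callable mapping content -> (prediction, confidence)."""
    records, failures = [], []
    for content, true_label in samples:
        start = time.perf_counter()
        prediction, confidence = detect(content)          # step 2: run detection
        elapsed = time.perf_counter() - start
        records.append({"prediction": prediction,         # step 3: record results
                        "confidence": confidence,
                        "time": elapsed,
                        "label": true_label})
        if prediction != true_label:
            failures.append(content)                      # kept for step 5
    correct = sum(r["prediction"] == r["label"] for r in records)
    return {"accuracy": correct / len(records),           # step 4: aggregate
            "records": records,
            "failures": failures}
```

A stub detector can be passed in for testing the harness itself before wiring in the real detection pipeline.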
Text dataset:

Category              Count  Source
Human - Essays        15     Verified authors
Human - Articles      15     NYT/News
Human - Creative      10     Short stories, poetry
Human - Technical     10     Documentation
AI - GPT-4            15     OpenAI
AI - Claude           15     Anthropic
AI - Gemini           10     Google
AI - Llama-3          10     Meta

Image dataset:

Category                 Count  Source
Human - Photos           25     COCO/Unsplash
Human - Artwork          15     Digital art
Human - Screenshots      10     UI captures
AI - DALL-E 3            15     OpenAI
AI - Midjourney v6       15     Midjourney
AI - Stable Diffusion 3  15     Stability AI
AI - FLUX                5      Black Forest

Audio dataset:

Category         Count  Source
Human - Speech   20     LibriSpeech
Human - Podcast  15     Various
Human - Music    15     CC-licensed
AI - ElevenLabs  20     TTS synthesis
AI - WaveFake    15     Vocoders
AI - Music AI    15     Suno/Udio

Video dataset:

Category              Count  Source
Human - UGC           25     YouTube/Vimeo
Human - Professional  15     Stock footage
Human - Mobile        10     Smartphone
AI - Deepfake         20     DeepfakeBench
AI - Sora             10     OpenAI
AI - Runway           10     Text-to-video
AI - Other            10     Various tools
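Per step 1 of the evaluation, each content type's dataset is balanced at 50 human and 50 AI samples. A quick sanity check over the category counts listed above (counts are transcribed per modality in table order):

```python
# Category counts per modality, taken from the dataset tables above.
datasets = {
    "text":  {"human": [15, 15, 10, 10], "ai": [15, 15, 10, 10]},
    "image": {"human": [25, 15, 10],     "ai": [15, 15, 15, 5]},
    "audio": {"human": [20, 15, 15],     "ai": [20, 15, 15]},
    "video": {"human": [25, 15, 10],     "ai": [20, 10, 10, 10]},
}

for modality, groups in datasets.items():
    # Every modality should total 50 human + 50 AI samples.
    assert sum(groups["human"]) == 50, modality
    assert sum(groups["ai"]) == 50, modality
```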

Each detection signal contributes to the final score with different weights:

Text signals:

Signal                Weight
AI Phrase Detection   0.20
Vocabulary Richness   0.20
Sentence Variance     0.15
Contractions Usage    0.10
Punctuation Variety   0.10
Burstiness            0.10
Repetition Penalty    0.10
Word Length Variance  0.05

Image signals:

Signal                Weight
Metadata Score        0.25
Color Distribution    0.20
Edge Consistency      0.15
Noise Pattern         0.15
Compression Analysis  0.15
Symmetry Detection    0.10

Audio signals:

Signal              Weight
Metadata Score      0.25
Pattern Analysis    0.20
Format Analysis     0.15
Quality Indicators  0.15
AI Signatures       0.15
Noise Profile       0.10

Video signals:

Signal               Weight
Metadata Score       0.25
Container Analysis   0.20
Audio Presence       0.15
Temporal Pattern     0.15
Encoding Signature   0.15
Bitrate Consistency  0.10
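Each modality's weights sum to 1.0, so the weighted sum below doubles as a weighted average when all signals are present. A sanity check using the text weights (dictionary keys are illustrative snake_case renderings of the signal names):

```python
text_weights = {
    "ai_phrase_detection": 0.20,
    "vocabulary_richness": 0.20,
    "sentence_variance": 0.15,
    "contractions_usage": 0.10,
    "punctuation_variety": 0.10,
    "burstiness": 0.10,
    "repetition_penalty": 0.10,
    "word_length_variance": 0.05,
}

# Weights should sum to 1.0 (compare with a tolerance for float rounding).
assert abs(sum(text_weights.values()) - 1.0) < 1e-9
```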

The final confidence score is calculated as:

confidence = Σ(signal_value × signal_weight) / Σ(signal_weight)

A threshold of 0.5 determines the verdict:

  • confidence >= 0.5 → “ai”
  • confidence < 0.5 → “human”
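The score combination and verdict can be sketched directly from the formula above. The division by Σ(signal_weight) means the score remains a proper weighted average even if only a subset of signals is available. Signal names and values here are illustrative (taken from the image-signal table):

```python
def score(signals: dict, weights: dict) -> float:
    """Weighted average of signal values, each assumed to be in [0, 1]."""
    num = sum(signals[name] * w for name, w in weights.items())
    den = sum(weights.values())
    return num / den

def verdict(confidence: float, threshold: float = 0.5) -> str:
    return "ai" if confidence >= threshold else "human"

# Hypothetical image scored on three of the six image signals.
weights = {"metadata_score": 0.25, "color_distribution": 0.20, "edge_consistency": 0.15}
signals = {"metadata_score": 0.9, "color_distribution": 0.6, "edge_consistency": 0.4}
conf = score(signals, weights)
```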

All benchmark results can be reproduced:

git clone https://github.com/vinpatel/manuscript
cd manuscript
make download-benchmark-data
make benchmark-all

Results are saved to benchmark/results/ with timestamps.

Dataset limitations:

  • Text samples may over-represent certain writing styles
  • Image samples are web-sourced (stripped metadata)
  • Audio samples are high-quality studio recordings
  • Video samples require API downloads

Detection limitations:

  • Short content (<100 words text, <5s audio/video) is less reliable
  • Heavily edited content may evade detection
  • New AI models may not be in signature databases
  • Language support is primarily English

We regularly update:

  1. Signature databases with new AI tool markers
  2. Training datasets with fresh samples
  3. Detection algorithms based on failure analysis
  4. Benchmark reports with each release