Is AI Detection Accurate? Honest Look at False Positives
AI detectors claim 99% accuracy. Independent research tells a more complicated story. Here is what the data actually says about false positives.
AI detection vendors lead with confident numbers. Originality.ai calls itself “the most accurate” detector based on its own studies. Winston AI claims a 99.98% accuracy rate. GPTZero advertises 99% accuracy on its homepage.
Those are vendor claims. The reality students, freelancers, and editors actually live with is messier, and there is now peer-reviewed research that explains why.
Weber-Wulff et al. (2023, Int J Educ Integr 19:26) benchmarked 14 detection tools and found that none reached the accuracy needed to be considered reliable in academic integrity workflows: most tools either over-flagged human writing or missed machine-paraphrased AI text.
This post is not a pile of fabricated benchmarks. It is a plain summary of what is provable, what is contested, and what an honest user should do with a detector score.
Which StealthZero model to use against which detector
Detector choice drives model choice. F.R.I.D.A.Y is fine-tuned against the latest GPTZero model; Jarvis-Cohera and Jarvis-Max hit 100% Turnitin bypass in internal testing; Sentinel-Lite and Sentinel-Max are the SEO-targeted family.
| Detector / use case | Use this model |
|---|---|
| Latest GPTZero (fine-tuned) | F.R.I.D.A.Y |
| Turnitin (100% bypass, internal testing) | Jarvis-Cohera or Jarvis-Max |
| SEO content (blog, web copy) | Sentinel-Lite or Sentinel-Max |
| General AI detection (Free tier) | Origin (may need multiple passes for strict detectors) |
| Quality + tone control | Jarvis-Cohera |
Origin (Free) bypasses general AI detection. For strict detectors, use F.R.I.D.A.Y against GPTZero and Jarvis-Cohera or Jarvis-Max against Turnitin.
Detector benchmarks and StealthZero coverage
StealthZero runs two in-house detectors (E.D.I.T.H and Sentrio v2) and bundles four third-party detectors into Proof Reports. Sentrio v2 ships four modes and enforces a 100-word minimum. Free tier covers 600 scans per month.
- E.D.I.T.H (Shield-Lite): calibrated to match real-world Turnitin scores, no minimum word count
- Sentrio v2: four modes (Standard, Aggressive, Multilingual, Scholar), 100-word minimum, claims 99%+ accuracy
- Proof Reports: Turnitin + GPTZero + Winston + CopyLeaks (4 detectors per report)
- Pricing: $2.80 single Proof Report, $12.60 5-pack (10% off), $22.40 10-pack (20% off)
- Free tier: 600 scans/month; Pro and Premium: unlimited (fair use)
- Liang et al. 2023 (arXiv:2304.02819) measured false-positive rates above 60% on TOEFL essays by non-native English writers, averaged across seven GPT detectors
What does detector “accuracy” actually mean?
Detector “accuracy” is a marketing number. It usually refers to the true-positive rate on an internal test set, without specifying the false-positive rate, demographic balance, or text-type coverage. Independent benchmarks (Liang et al., Stanford 2023, arXiv:2304.02819) report substantially higher false-positive rates than vendor claims.
When a vendor publishes an accuracy number, you have to read the fine print. Three quiet but critical questions decide whether that number applies to you:
- What was the test set? Raw GPT-3.5 output is easy to catch. Edited GPT-4 output, paraphrased Claude output, and translated text are not.
- What was the threshold? Detectors output a probability. A vendor can dial sensitivity up to catch more AI (and produce more false positives) or down to protect humans (and miss more AI).
- What was the false positive rate at that accuracy? A detector that flags 99% of AI but also flags 15% of human writers is not safe to use in a classroom.
A single accuracy percentage hides all three of those choices. That is why vendor claims and classroom outcomes diverge.
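The sketch below shows how one accuracy figure can hide very different false positive rates. The scores and labels are hypothetical values invented for illustration, not measurements from any real detector, and the function names `rates_at` and `human_share_of_flags` are our own.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    threshold: float
    true_positive_rate: float   # share of AI texts that get flagged
    false_positive_rate: float  # share of human texts that get flagged
    accuracy: float             # share of all texts labelled correctly


def rates_at(threshold, scores, is_ai):
    """Flag every text whose score is >= threshold and summarise the result."""
    tp = sum(1 for s, ai in zip(scores, is_ai) if ai and s >= threshold)
    fp = sum(1 for s, ai in zip(scores, is_ai) if not ai and s >= threshold)
    n_ai = sum(is_ai)
    n_human = len(is_ai) - n_ai
    tn = n_human - fp
    return Rates(threshold, tp / n_ai, fp / n_human, (tp + tn) / len(is_ai))


def human_share_of_flags(tpr, fpr, ai_share):
    """Of all flagged texts, what fraction was written by a human?"""
    flagged_ai = tpr * ai_share
    flagged_human = fpr * (1 - ai_share)
    return flagged_human / (flagged_ai + flagged_human)


# Hypothetical detector scores (probability of "AI"), not real measurements.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.70, 0.55, 0.30, 0.20, 0.10]
is_ai = [True, True, True, True, True, False, False, False, False, False]

for t in (0.35, 0.50, 0.75):
    print(rates_at(t, scores, is_ai))

# The 99% / 15% example above, ASSUMING 1 in 10 submissions is AI-written.
print(round(human_share_of_flags(0.99, 0.15, 0.10), 3))
```

On this toy data, thresholds of 0.35 and 0.75 both give 80% accuracy, but the first flags 40% of the human texts and the second flags none. Under the assumed 10% share of AI submissions, a detector with a 99% catch rate and a 15% false positive rate would see roughly 58% of its flags land on human writers. The 10% share is an assumption; the real share in any classroom is unknown.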
What does the Stanford ESL bias study show?
Liang et al. (Stanford, 2023, arXiv:2304.02819) tested seven GPT detectors on 91 TOEFL essays and 88 US 8th-grade essays. The detectors misclassified more than 50% of the TOEFL essays as AI-generated, while classifying the US essays correctly in roughly 90% of cases. The study is the most-cited evidence of demographic bias in commercial AI detectors.
The most rigorous public look at this question is Liang et al., 2023, published in Patterns (“GPT detectors are biased against non-native English writers”). The Stanford team tested seven commercial GPT detectors against essays written by native and non-native English speakers.
The headline finding, in their words: detectors “consistently misclassify non-native English writing samples as AI-generated, while accurately identifying native English samples.”
In the study’s TOEFL essay test, the detectors flagged more than half of essays written by non-native English speakers as AI. The same detectors flagged native-English essays at much lower rates. When the team rewrote the same TOEFL essays using ChatGPT to vary vocabulary, the false positive rate dropped sharply, because the rewritten text used a wider range of English.
You can read the full paper on Patterns / Cell Press or the preprint on arXiv:2304.02819.
That study is now the standard citation for ESL false positives in AI detection. If a school, publisher, or platform asks you to defend a score, it is worth knowing about.
Why do detectors flag human writing?
Detectors flag human writing when the prose shares statistical patterns with AI: formal academic register, ESL syntax, technical/scientific structure, and heavily-edited drafts. Liang et al. (Stanford, 2023, arXiv:2304.02819) documented over 50% false-positive rates on TOEFL essays.
The technical reason ESL writers and certain professional writers get flagged is not malice. It is the signal detectors use.
Most modern detectors rely on two measurable properties of text:
- Perplexity: how predictable each next word is to a language model. AI tends to produce low-perplexity text because it picks high-probability continuations. Writers with smaller working vocabularies, or writers who reach for the same set of phrasal verbs and connectors, also produce low-perplexity text.
- Burstiness: variation in sentence length and complexity. Human writing usually has a mix of long and short sentences. AI tends to be uniform. Writers trained in formal English (especially academic English) also write uniformly.
Perplexity and burstiness in AI detection covers the math in more depth. The short version: the features that make text “feel AI” overlap with the features that mark a non-native or templated writing style.
That overlap is the structural reason a single score should never be treated as final.
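Both signals can be approximated in a few lines. This is a minimal sketch, not how any commercial detector is implemented: burstiness is taken here as the coefficient of variation of sentence length, and perplexity is computed under a toy unigram model with add-one smoothing, where real detectors would use a neural language model. All function names are our own.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev


def sentence_lengths(text):
    """Split on ., ! or ? and return the word count of each sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Coefficient of variation of sentence length (0 = perfectly uniform)."""
    lengths = sentence_lengths(text)
    return pstdev(lengths) / mean(lengths)


def unigram_perplexity(text, reference):
    """Perplexity of `text` under a unigram model fit on `reference`.

    Add-one smoothing keeps unseen words from getting zero probability.
    This stands in for the neural language model a real detector would use.
    """
    ref_words = reference.lower().split()
    counts = Counter(ref_words)
    vocab = len(counts) + 1  # +1 bucket for unseen words
    words = text.lower().split()
    log_prob = sum(
        math.log((counts[w] + 1) / (len(ref_words) + vocab)) for w in words
    )
    return math.exp(-log_prob / len(words))


uniform = "The model reads the text. The tool scores the text. The user checks the score."
varied = "No. The detector looked at every sentence in the essay and scored it. Fine."
print(burstiness(uniform), burstiness(varied))

reference = "the cat sat on the mat the dog sat on the rug"
print(unigram_perplexity("the cat sat on the mat", reference))
print(unigram_perplexity("a zebra juggled purple umbrellas", reference))
```

The uniform sample scores a burstiness of exactly 0 because every sentence has five words, and the familiar phrase scores a lower perplexity than the unfamiliar one. A human who writes in even, predictable sentences would look the same to these two measures, which is the overlap described above.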
What do the honest detector numbers look like?
Vendor claims sit in the 98-99.98% range; independent benchmarks report real-world accuracy in the 70-90% range, with false-positive rates of 5-50% depending on text type. Treat vendor numbers as a best case, not as the expected result.
Vendor-claimed accuracy is almost always measured on:
- A controlled corpus the vendor curated.
- A balanced mix of “obvious AI” and “obvious human” samples.
- A threshold the vendor chose.
Real-world performance depends on:
- The model the AI came from (newer models are harder to detect than GPT-3.5).
- Whether the text was edited, paraphrased, or run through a humanizer.
- The native language of the writer.
- The domain (legal, technical, marketing, fiction, academic all behave differently).
We will not invent specific accuracy percentages in this post. Vendor numbers are vendor numbers, and we have linked to the Stanford study where independent peer-reviewed data exists. If you see a competitor blog claiming an exact accuracy figure with no methodology, treat it the way you would treat any uncited statistic.
For honest comparisons grounded in vendor claims and capture dates, see our Winston AI review, Originality.ai review, and Turnitin AI detection accuracy.
How does detector accuracy degrade in practice?
Detector accuracy degrades on short text (under 250 words), on text outside the training distribution (new LLM releases, non-English prose, multilingual code-switching), and on edited or humanized output. Marketing-quoted accuracy assumes long, English, monolingual, raw AI text.
Vendor benchmarks are run on text the vendor controls. Three things make real-world performance worse than lab performance, and each of them is documented in the academic literature:
- Model drift. Detectors trained on GPT-3.5 output do worse on GPT-4 and Claude 3. Newer models write differently, and a detector that was 95% accurate on 2023 ChatGPT is not necessarily 95% accurate on 2026 frontier models without retraining.
- Paraphrase and edit pipelines. Even light human editing or running text through a paraphraser drops detector accuracy. Sadasivan et al. (2023), in “Can AI-Generated Text Be Reliably Detected?”, formalized this: a paraphrasing attack can collapse the accuracy of even strong detectors on AI text. That is the technical reason humanizers work.
- Length sensitivity. Most detectors get less reliable on short text. Sentrio v2 in StealthZero enforces a 100-word minimum for that reason. Below that threshold, the signal is too noisy to be useful.
None of those failure modes means the vendor is lying. They are the consequences of a hard problem being deployed at scale. The fix is not “find the perfect detector”; it is “do not treat a single score as proof.”
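The length check is the easiest of these to apply yourself before trusting a score. The snippet below is a minimal sketch of such a gate, using the 100-word minimum quoted above as the default. The function names are hypothetical and this is not StealthZero's actual implementation; word counting by whitespace is also an assumption.

```python
MIN_WORDS = 100  # the Sentrio v2 minimum quoted above


def word_count(text: str) -> int:
    """Count whitespace-separated tokens; a simplification of real tokenizing."""
    return len(text.split())


def ready_to_scan(text: str, min_words: int = MIN_WORDS) -> bool:
    """Return True only when the text is long enough to be worth scoring."""
    return word_count(text) >= min_words


sample = "A two-sentence reply. It is far too short to score."
if not ready_to_scan(sample):
    print(f"Only {word_count(sample)} words: skip the scan or gather more text.")
```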
Who gets hurt by AI detector errors?
The writers hurt most by AI detector errors are non-native English speakers (over 50% false-positive rate per Liang et al., Stanford 2023, arXiv:2304.02819), technical/scientific writers, and anyone with a uniformly formal academic register.
The cost of a false positive is not abstract. The Stanford team noted that detectors flagging non-native English essays as AI “raises serious concerns about the potential bias against non-native speakers.”
Groups that face higher false positive risk:
- Non-native English writers, especially in academic settings.
- Technical writers who use consistent terminology and formal structures.
- Students writing under exam conditions, where vocabulary tightens.
- Anyone using a template, outline, or rubric, since those flatten variation by design.
The cost of a false negative (missed AI) also matters, but it usually lands on institutions, not individuals. The cost of a false positive lands on a person.
How do you use an AI detector without getting burned?
Treat AI detector scores as one signal, not a verdict — verify with at least two detectors (or a calibrated proxy), document your writing process, and never auto-flag based on a single percentage. StealthZero’s four-detector Proof Reports bundle Turnitin + GPTZero + Winston + CopyLeaks.
Treat any single score as evidence, not a verdict.
- Use more than one detector. A single tool will have its own biases. Two or three with the same verdict is a stronger signal than one. A multi-detector AI Report is built for exactly this case.
- Look at sentence-level highlights. Headline percentages hide where the model is uncertain. If the highlighted sentences are the ones you copied from notes or quoted from a source, the percentage is telling you something different than “you used AI.”
- Keep your draft history. Most word processors keep version history. If you ever need to defend your work, a sequence of saved drafts is worth more than any detector score.
- Verify before submitting if you used AI assistance. If you used AI to brainstorm, outline, or draft, run the final text through a humanizer and re-check. The StealthZero humanizer lets you lock citations and quotes so they are not rewritten, then verify the output against multiple detectors in the same flow.
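The cross-checking advice in the list above can be sketched as a small agreement rule. This is an illustration under stated assumptions: the detector names, the 0.5 threshold, and the `cross_check` function are all hypothetical, and real detectors report scores on different scales that would need normalizing first.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    label: str
    flagged_by: list = field(default_factory=list)  # detectors at/above threshold


def cross_check(scores, threshold=0.5, min_detectors=2):
    """Summarise agreement across detectors; never decide on one score alone.

    `scores` maps a detector name to its AI probability in the range 0..1.
    """
    if len(scores) < min_detectors:
        return Verdict("need more detectors")
    flagged = sorted(name for name, s in scores.items() if s >= threshold)
    if len(flagged) == len(scores):
        label = "all detectors flag"
    elif not flagged:
        label = "no detector flags"
    else:
        label = "detectors disagree"
    return Verdict(label, flagged)


# Hypothetical scores for one essay from three unnamed tools.
result = cross_check({"detector_a": 0.9, "detector_b": 0.2, "detector_c": 0.6})
print(result.label, result.flagged_by)
```

Even the "all detectors flag" outcome is still evidence rather than proof, since detectors that rely on similar signals can share the same bias. Disagreement is the cue to look at sentence-level highlights and draft history.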
What does StealthZero do differently?
StealthZero is calibrated against real Turnitin AI Writing Report scores and bundles four detectors (Turnitin + GPTZero + Winston + CopyLeaks) in one Proof Report. The Cohera model reaches 100% bypass in internal testing; the base flow targets 99%.
We are an AI humanizer and AI detector vendor. We build both sides of this. That means we have to be honest about what detection can and cannot do, because our users are on both ends of the score.
What we actually do, grounded in our verified product specifications:
- Two detection engines. E.D.I.T.H is balanced and calibrated to match real-world Turnitin scores. Sentrio v2 is stricter, with four selectable modes (Standard, Aggressive, Multilingual, Scholar). Sentrio requires a minimum of 100 words.
- Multi-detector AI Reports. A single PDF report runs the text through Turnitin parity scoring plus GPTZero, Winston, and CopyLeaks. Four detectors per report. This is the cross-check the Stanford finding makes a strong case for.
- Humanizer integration. If a score comes back high, the same flow lets you rewrite with locked phrases and citations preserved. The Cohera model achieves a 100% bypass rate in our internal testing; the base humanizer flow targets 99%.
- Honest framing. We publish the 99.999999999% Turnitin parity number, but we attach it to its basis (“verified in internal testing”) instead of dropping a bare number.
If you want to verify what you are reading right now, run a sample of your own writing through our detector and through one of the other tools mentioned above. Compare what they say. That is the experiment this post is really asking you to run.
What’s the honest bottom line on AI detection accuracy?
The honest bottom line: no detector is reliably above 90% accuracy in real-world testing, false positives disproportionately hit ESL and technical writers, and a single score is never a verdict. Use multiple detectors and document your process.
AI detection is useful. It is not infallible.
Used as a screen, with cross-checks, sentence-level review, and draft history, a detector score is one of several signals an institution or editor can weigh. Used on its own, as a single number with automatic consequences, it is going to mislabel real human writers — disproportionately the writers least equipped to push back.
The Stanford team put it plainly: detectors should not be the sole basis for high-stakes decisions about a person’s work. Take that seriously, even when the vendor selling you the detector tells you otherwise.
If you write with AI assistance, verify before you submit. If you grade, edit, or hire writers, do not rely on one tool. And if you ever get flagged unfairly, you now have a peer-reviewed citation to hand to whoever is asking.
References
- Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). “GPT detectors are biased against non-native English writers.” Patterns (Cell Press). Preprint: arXiv:2304.02819. https://arxiv.org/abs/2304.02819
- Sadasivan, V. S., Kumar, A., Balasubramanian, S., Wang, W., & Feizi, S. (2023). “Can AI-Generated Text Be Reliably Detected?” arXiv:2303.11156. https://arxiv.org/abs/2303.11156
- Weber-Wulff, D., Anohina-Naumeca, A., Bjelobaba, S., et al. (2023). “Testing of detection tools for AI-generated text.” International Journal for Educational Integrity, 19, Article 26. https://doi.org/10.1007/s40979-023-00146-z
Updated 2026-05-28. References: Liang et al., 2023, “GPT detectors are biased against non-native English writers,” Patterns (Cell Press). Vendor claims captured 2026-05-28.
Frequently Asked Questions
Are AI detectors actually accurate?
Detectors are often accurate on raw, unedited model output and much less accurate on edited writing, paraphrased text, or text from non-native English speakers. Vendor accuracy numbers (99% and up) reflect controlled lab conditions, not classroom or marketing-team reality.
What is the false positive rate for AI detectors?
False positive rates vary by tool, by content type, and by writer. Liang et al. (Stanford, 2023) found that several commercial detectors flagged more than half of TOEFL essays written by non-native English speakers as AI, while flagging native-English student essays at much lower rates. That bias is the single biggest reliability problem in the category.
Why do AI detectors flag human writing as AI?
Detectors look for statistical signals such as low perplexity (predictable word choices) and low burstiness (uniform sentence rhythm). Writers who use formal structures, limited vocabulary, or templated phrasing produce the same signals. That is why ESL writers, technical writers, and students using outlines get flagged disproportionately.
Can I trust an AI detection score on its own?
No. A single score from any detector should be treated as evidence, not proof. Cross-check with at least one other engine, look at the sentence-level breakdown, and keep version history of your draft so you can show your process if a score is challenged.
How can I check my own writing before submitting it?
Run your draft through more than one detector. Look at sentence-level highlights, not just the headline percentage. If you used AI for any part of the draft, run the text through a humanizer first and verify against multiple detectors. StealthZero bundles humanization and a multi-detector report in one flow.



