Public benchmark
PII detection accuracy benchmark for NeutralAI
This page summarizes our reproducible benchmark comparing a Presidio-vanilla baseline against the current NeutralAI detection stack across multilingual and multi-entity prompt samples.
- Benchmark cases: 1000
- NeutralAI overall F1: 99.2%
- PERSON F1: 94.4%
- False positive rate: 0.0%
- Tracked exact-match accuracy: 98.4%
- Tracked extra entity rate: 0.0%
Headline result
NeutralAI clears the current public acceptance guard, with strong overall recall and a well-bounded PERSON quality profile across all benchmarked languages.
| System | Precision | Recall | F1 | False positive rate |
|---|---|---|---|---|
| NeutralAI | 100.0% | 98.4% | 99.2% | 0.0% |
| Presidio vanilla baseline | 100.0% | 40.3% | 57.5% | 0.0% |
All uplifts below are absolute percentage-point differences against the Presidio-vanilla baseline, not relative gains:

- Overall F1 uplift: 41.7 points
- Overall recall uplift: 58.1 points
- PERSON F1 uplift: 5.4 points
- Exact-match uplift: 58.1 points
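The headline figures follow directly from the published precision and recall values via the standard harmonic-mean F1 formula, and the uplifts are point differences. A minimal sketch using the published (rounded) values, so small rounding drift against the report is expected:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Published headline figures, as fractions of 1.
neutralai_p, neutralai_r = 1.000, 0.984
baseline_p, baseline_r = 1.000, 0.403

neutralai_f1 = f1(neutralai_p, neutralai_r)  # ~0.992, matches 99.2%
baseline_f1 = f1(baseline_p, baseline_r)     # ~0.574 (57.5% from unrounded inputs)

# Uplifts are absolute percentage-point differences, not ratios.
f1_uplift_pts = round((neutralai_f1 - baseline_f1) * 100, 1)    # ~41.7
recall_uplift_pts = round((neutralai_r - baseline_r) * 100, 1)  # 58.1
```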
What NeutralAI adds beyond the baseline
We use proven open components as a foundation, but the product difference is the operational layer we add around detection: multilingual entity coverage, PERSON false-positive calibration, locale-aware context gating, masking and tokenization flows, and enforcement inside the gateway and browser extension.
Detection hardening
Context-aware rules reduce false positives on names, phone numbers, and locale-specific identifiers while keeping recall high across supported languages.
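As a toy illustration of context gating (not NeutralAI's actual rule set: the regex, context-word list, and window size here are all assumptions), a recognizer can require a nearby contextual cue before accepting a candidate match, which suppresses look-alike numbers that appear without any telephony context:

```python
import re

# Illustrative pattern and cue list only; real recognizers are per-locale.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
CONTEXT_WORDS = {"call", "phone", "tel", "mobile", "telefon"}

def gated_phone_matches(text: str, window: int = 30) -> list:
    """Keep a candidate phone match only if a context cue appears nearby."""
    lowered = text.lower()
    hits = []
    for m in PHONE_RE.finditer(text):
        ctx = lowered[max(0, m.start() - window): m.end() + window]
        if any(word in ctx for word in CONTEXT_WORDS):
            hits.append(m.group())
    return hits

print(gated_phone_matches("Call me at +1 415 555 0100"))  # ['+1 415 555 0100']
print(gated_phone_matches("Invoice total 1234567890"))    # []  (no context cue)
```

The same gating idea generalizes to names and locale-specific identifiers: the pattern proposes candidates, and surrounding context decides whether they survive.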
Product enforcement
Detection is wired into masking, audit-safe handling, tenant controls, and extension enforcement so the benchmark reflects product behavior rather than a standalone recognizer demo.
Methodology note
This is a reproducible product benchmark, not an academic corpus. The public report is synthetic by design, tracks the published entity set, and compares a Presidio-vanilla baseline against the current NeutralAI production configuration. Internal shadow and holdout packs are used separately to watch for overfitting and generalization drift.
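The tracked exact-match and extra-entity metrics fall out of a span-level set comparison between gold annotations and predictions. A self-contained sketch with hypothetical `Span` fields and sample data (this is not the actual harness):

```python
# Illustrative scoring sketch; Span fields and sample spans are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    entity_type: str  # e.g. "PERSON", "EMAIL_ADDRESS"
    start: int        # character offsets into the prompt
    end: int

def score_case(gold: set, predicted: set) -> dict:
    """Exact-match scoring: a hit must match entity type and offsets exactly."""
    return {
        "hits": len(gold & predicted),    # exact-match detections
        "misses": len(gold - predicted),  # gold entities never produced
        "extras": len(predicted - gold),  # spurious entities (extra entity rate)
    }

gold = {Span("PERSON", 0, 8), Span("EMAIL_ADDRESS", 20, 38)}
pred = {Span("PERSON", 0, 8), Span("EMAIL_ADDRESS", 20, 38)}
print(score_case(gold, pred))  # {'hits': 2, 'misses': 0, 'extras': 0}
```

Aggregating hits, misses, and extras across all cases yields the published exact-match accuracy and extra entity rate.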
Coverage by language
The current public benchmark includes English, Turkish, Spanish, French, and German prompt samples.
| Language | Precision | Recall | F1 | False positive rate |
|---|---|---|---|---|
| DE | 100.0% | 100.0% | 100.0% | 0.0% |
| EN | 100.0% | 100.0% | 100.0% | 0.0% |
| ES | 100.0% | 90.1% | 94.8% | 0.0% |
| FR | 100.0% | 98.8% | 99.4% | 0.0% |
| TR | 100.0% | 100.0% | 100.0% | 0.0% |
Coverage by entity
This benchmark release tracks the entity families most relevant to our current public product posture.
| Entity | Precision | Recall | F1 | False positive rate |
|---|---|---|---|---|
| CREDIT_CARD | 100.0% | 100.0% | 100.0% | 0.0% |
| EMAIL_ADDRESS | 100.0% | 100.0% | 100.0% | 0.0% |
| IP_ADDRESS | 100.0% | 100.0% | 100.0% | 0.0% |
| PERSON | 100.0% | 89.4% | 94.4% | 0.0% |
| PHONE_NUMBER | 100.0% | 100.0% | 100.0% | 0.0% |
| TR_ID_NUMBER | 100.0% | 100.0% | 100.0% | 0.0% |
| UK_NHS_NUMBER | 100.0% | 100.0% | 100.0% | 0.0% |