Public benchmark

PII detection accuracy benchmark for NeutralAI

This page summarizes our reproducible benchmark comparing a Presidio-vanilla baseline against the current NeutralAI detection stack across multilingual and multi-entity prompt samples.

Benchmark cases

1000

NeutralAI overall F1

99.2%

PERSON F1

94.4%

False positive rate

0.0%

Tracked exact-match accuracy

98.4%

Tracked extra entity rate

0.0%

Headline result

NeutralAI clears the current public acceptance guard with strong overall recall; the lowest-scoring slice is multilingual PERSON detection, which stays within the published bounds at 94.4% F1.

Acceptance passed
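
The exact guard thresholds are not reproduced on this page. As an illustration only, a gate of this shape can be expressed as a simple threshold check; the field names and threshold values below are assumptions made for the sketch, not the published configuration.

```python
# Minimal sketch of an acceptance gate of this shape. The thresholds and
# field names are illustrative assumptions, not the published guard.
from dataclasses import dataclass


@dataclass
class BenchmarkSummary:
    precision: float            # 0..1
    recall: float               # 0..1
    f1: float                   # 0..1
    false_positive_rate: float  # 0..1


def passes_acceptance(summary: BenchmarkSummary,
                      min_f1: float = 0.95,
                      max_fpr: float = 0.01) -> bool:
    """Return True when a run clears the (assumed) public acceptance gate."""
    return summary.f1 >= min_f1 and summary.false_positive_rate <= max_fpr


# Values from the headline cards above.
current = BenchmarkSummary(precision=1.0, recall=0.984, f1=0.992,
                           false_positive_rate=0.0)
print(passes_acceptance(current))  # True
```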

NeutralAI

Precision
100.0%
Recall
98.4%
F1
99.2%
False positive rate
0.0%

Presidio vanilla baseline

Precision
100.0%
Recall
40.3%
F1
57.5%
False positive rate
0.0%
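
The rates above follow the standard span-level precision, recall, and F1 definitions. The sketch below shows the arithmetic; the raw counts are illustrative and chosen only so that they reproduce the published NeutralAI figures.

```python
# Minimal sketch of how the headline rates derive from span-level
# true positives (tp), false positives (fp), and false negatives (fn).
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative counts: 984 of 1000 expected spans detected, no false positives,
# which reproduces precision 100.0%, recall 98.4%, F1 ~99.2%.
print(prf(tp=984, fp=0, fn=16))
```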

Overall F1 uplift

+41.7 points

Overall recall uplift

+58.1 points

PERSON F1 uplift

+5.4 points

Exact-match uplift

+58.1 points
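
The uplift figures correspond to absolute percentage-point differences between the NeutralAI run and the Presidio-vanilla baseline, for example:

```python
# Uplift is reported as an absolute percentage-point difference, not a ratio.
neutralai_f1, baseline_f1 = 99.2, 57.5
print(f"Overall F1 uplift: +{neutralai_f1 - baseline_f1:.1f} points")  # +41.7 points
```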

What NeutralAI adds beyond the baseline

We use proven open components as a foundation, but the product difference is the operational layer we add around detection: multilingual entity coverage, PERSON false-positive calibration, locale-aware context gating, masking and tokenization flows, and enforcement inside the gateway and browser extension.

Detection hardening

Context-aware rules reduce false positives on names, phone numbers, and locale-specific identifiers while keeping recall high across supported languages.
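
As an illustration of the context-gating idea (not the production rule set), the sketch below registers a context-boosted recognizer through Presidio's public analyzer API; the regex, base score, and context words are assumptions made for the example.

```python
# Minimal sketch of a context-gated recognizer built on Presidio's public API.
# The pattern, score, and context words are illustrative, not production rules.
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# A locale-specific identifier candidate that only scores highly when the
# surrounding text mentions the identifier by name (context boosting).
tr_id_pattern = Pattern(name="tr_id_candidate", regex=r"\b\d{11}\b", score=0.3)
tr_id_recognizer = PatternRecognizer(
    supported_entity="TR_ID_NUMBER",
    patterns=[tr_id_pattern],
    context=["tc kimlik", "kimlik no", "national id"],  # boosts confidence nearby
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(tr_id_recognizer)

results = analyzer.analyze(
    text="My national id (tc kimlik) is 12345678901.",
    language="en",
)
for r in results:
    print(r.entity_type, r.start, r.end, round(r.score, 2))
```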

Product enforcement

Detection is wired into masking, audit-safe handling, tenant controls, and extension enforcement so the benchmark reflects product behavior rather than a standalone recognizer demo.
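
A minimal sketch of the masking step on top of Presidio's analyzer and anonymizer is shown below; the placeholder format is an assumption, and the gateway, audit, and extension wiring are out of scope for the example.

```python
# Minimal sketch of a masking flow: detect spans, then replace them with
# placeholder tokens before the prompt leaves the boundary. Placeholder
# formats are illustrative assumptions.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

prompt = "Email john.doe@example.com from 10.0.0.1 about the invoice."
findings = analyzer.analyze(text=prompt, language="en")

masked = anonymizer.anonymize(
    text=prompt,
    analyzer_results=findings,
    operators={
        "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL_ADDRESS>"}),
        "IP_ADDRESS": OperatorConfig("replace", {"new_value": "<IP_ADDRESS>"}),
        "DEFAULT": OperatorConfig("replace", {"new_value": "<PII>"}),
    },
)
print(masked.text)  # placeholders instead of the detected values
```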

Methodology note

This is a reproducible product benchmark, not an academic corpus. The public report is synthetic by design, tracks the published entity set, and compares a Presidio-vanilla baseline against the current NeutralAI production configuration. Internal shadow and holdout packs are used separately to watch for overfitting and generalization drift.
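
For orientation, the tracked exact-match metric can be read as: a case scores only when the predicted (entity, span) set equals the labeled set. The sketch below illustrates that scoring rule; the case format is an assumption, not the schema of the checked-in artifact.

```python
# Minimal sketch of exact-match scoring: a case counts only when predicted
# (entity, start, end) tuples equal the labeled set. Toy data, assumed format.
def exact_match(predicted: set[tuple[str, int, int]],
                expected: set[tuple[str, int, int]]) -> bool:
    return predicted == expected


cases = [
    ({("EMAIL_ADDRESS", 6, 26)}, {("EMAIL_ADDRESS", 6, 26)}),               # match
    ({("PERSON", 0, 4)}, {("PERSON", 0, 4), ("PHONE_NUMBER", 10, 22)}),     # missed entity
]
accuracy = sum(exact_match(p, e) for p, e in cases) / len(cases)
print(f"exact-match accuracy: {accuracy:.1%}")  # 50.0% on this toy sample
```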

Coverage by language

The current public benchmark includes English, Turkish, Spanish, French, and German prompt samples.

Language    Precision    Recall     F1         False positive rate
DE          100.0%       100.0%     100.0%     0.0%
EN          100.0%       100.0%     100.0%     0.0%
ES          100.0%       90.1%      94.8%      0.0%
FR          100.0%       98.8%      99.4%      0.0%
TR          100.0%       100.0%     100.0%     0.0%

Coverage by entity

This benchmark release tracks the entity families most relevant to our current public product posture.

CREDIT_CARD

Precision
100.0%
Recall
100.0%
F1
100.0%
False positive rate
0.0%

EMAIL_ADDRESS

Precision
100.0%
Recall
100.0%
F1
100.0%
False positive rate
0.0%

IP_ADDRESS

Precision
100.0%
Recall
100.0%
F1
100.0%
False positive rate
0.0%

PERSON

Precision
100.0%
Recall
89.4%
F1
94.4%
False positive rate
0.0%

PHONE_NUMBER

Precision
100.0%
Recall
100.0%
F1
100.0%
False positive rate
0.0%

TR_ID_NUMBER

Precision
100.0%
Recall
100.0%
F1
100.0%
False positive rate
0.0%

UK_NHS_NUMBER

Precision
100.0%
Recall
100.0%
F1
100.0%
False positive rate
0.0%

Report generated from the checked-in benchmark artifact on 2026-05-08.