Public benchmark

PII detection accuracy benchmark for NeutralAI

This page summarizes our reproducible benchmark comparing a Presidio-vanilla baseline against the current NeutralAI detection stack across multilingual and multi-entity prompt samples.

Benchmark cases

1000

NeutralAI overall F1

99.2%

PERSON F1

94.4%

False positive rate

0.0%

Tracked exact-match accuracy

98.4%

Tracked extra entity rate

0.0%

Headline result

NeutralAI clears the current public acceptance guard with strong overall recall; the lowest-scoring slice is multilingual PERSON detection, which stays within the published bounds at 94.4% F1.

Acceptance passed
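
The exact guard thresholds are not reproduced on this page. As an illustration only, a gate of this shape can be expressed as a simple threshold check; the field names and threshold values below are assumptions made for the sketch, not the published configuration.

```python
# Minimal sketch of an acceptance gate of this shape. The thresholds and
# field names are illustrative assumptions, not the published guard.
from dataclasses import dataclass


@dataclass
class BenchmarkSummary:
    precision: float            # 0..1
    recall: float               # 0..1
    f1: float                   # 0..1
    false_positive_rate: float  # 0..1


def passes_acceptance(summary: BenchmarkSummary,
                      min_f1: float = 0.95,
                      max_fpr: float = 0.01) -> bool:
    """Return True when a run clears the (assumed) public acceptance gate."""
    return summary.f1 >= min_f1 and summary.false_positive_rate <= max_fpr


# Values from the headline cards above.
current = BenchmarkSummary(precision=1.0, recall=0.984, f1=0.992,
                           false_positive_rate=0.0)
print(passes_acceptance(current))  # True
```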

NeutralAI

Precision
100.0%
Recall
98.4%
F1
99.2%
False positive rate
0.0%

Presidio vanilla baseline

Precision
100.0%
Recall
40.3%
F1
57.5%
False positive rate
0.0%
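
The rates above follow the standard span-level precision, recall, and F1 definitions. The sketch below shows the arithmetic; the raw counts are illustrative and chosen only so that they reproduce the published NeutralAI figures.

```python
# Minimal sketch of how the headline rates derive from span-level
# true positives (tp), false positives (fp), and false negatives (fn).
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative counts: 984 of 1000 expected spans detected, no false positives,
# which reproduces precision 100.0%, recall 98.4%, F1 ~99.2%.
print(prf(tp=984, fp=0, fn=16))
```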

Overall F1 uplift

+41.7 points

Overall recall uplift

+58.1 points

PERSON F1 uplift

+5.4 points

Exact-match uplift

+58.1 points
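
The uplift figures correspond to absolute percentage-point differences between the NeutralAI run and the Presidio-vanilla baseline, for example:

```python
# Uplift is reported as an absolute percentage-point difference, not a ratio.
neutralai_f1, baseline_f1 = 99.2, 57.5
print(f"Overall F1 uplift: +{neutralai_f1 - baseline_f1:.1f} points")  # +41.7 points
```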

What NeutralAI adds beyond the baseline

We use proven open components as a foundation, but the product difference is the operational layer we add around detection: multilingual entity coverage, PERSON false-positive calibration, locale-aware context gating, masking and tokenization flows, and enforcement inside the gateway and browser extension.

Detection hardening

Context-aware rules reduce false positives on names, phone numbers, and locale-specific identifiers while keeping recall high across supported languages.
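
As an illustration of the context-gating idea (not the production rule set), the sketch below registers a context-boosted recognizer through Presidio's public analyzer API; the regex, base score, and context words are assumptions made for the example.

```python
# Minimal sketch of a context-gated recognizer built on Presidio's public API.
# The pattern, score, and context words are illustrative, not production rules.
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# A locale-specific identifier candidate that only scores highly when the
# surrounding text mentions the identifier by name (context boosting).
tr_id_pattern = Pattern(name="tr_id_candidate", regex=r"\b\d{11}\b", score=0.3)
tr_id_recognizer = PatternRecognizer(
    supported_entity="TR_ID_NUMBER",
    patterns=[tr_id_pattern],
    context=["tc kimlik", "kimlik no", "national id"],  # boosts confidence nearby
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(tr_id_recognizer)

results = analyzer.analyze(
    text="My national id (tc kimlik) is 12345678901.",
    language="en",
)
for r in results:
    print(r.entity_type, r.start, r.end, round(r.score, 2))
```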

Product enforcement

Detection is wired into masking, audit-safe handling, tenant controls, and extension enforcement so the benchmark reflects product behavior rather than a standalone recognizer demo.
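
A minimal sketch of the masking step on top of Presidio's analyzer and anonymizer is shown below; the placeholder format is an assumption, and the gateway, audit, and extension wiring are out of scope for the example.

```python
# Minimal sketch of a masking flow: detect spans, then replace them with
# placeholder tokens before the prompt leaves the boundary. Placeholder
# formats are illustrative assumptions.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

prompt = "Email john.doe@example.com from 10.0.0.1 about the invoice."
findings = analyzer.analyze(text=prompt, language="en")

masked = anonymizer.anonymize(
    text=prompt,
    analyzer_results=findings,
    operators={
        "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL_ADDRESS>"}),
        "IP_ADDRESS": OperatorConfig("replace", {"new_value": "<IP_ADDRESS>"}),
        "DEFAULT": OperatorConfig("replace", {"new_value": "<PII>"}),
    },
)
print(masked.text)  # placeholders instead of the detected values
```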

Methodology note

This is a reproducible product benchmark, not an academic corpus. The public report is synthetic by design, tracks the published entity set, and compares a Presidio-vanilla baseline against the current NeutralAI production configuration. Internal shadow and holdout packs are used separately to watch for overfitting and generalization drift.
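
For orientation, the tracked exact-match metric can be read as: a case scores only when the predicted (entity, span) set equals the labeled set. The sketch below illustrates that scoring rule; the case format is an assumption, not the schema of the checked-in artifact.

```python
# Minimal sketch of exact-match scoring: a case counts only when predicted
# (entity, start, end) tuples equal the labeled set. Toy data, assumed format.
def exact_match(predicted: set[tuple[str, int, int]],
                expected: set[tuple[str, int, int]]) -> bool:
    return predicted == expected


cases = [
    ({("EMAIL_ADDRESS", 6, 26)}, {("EMAIL_ADDRESS", 6, 26)}),               # match
    ({("PERSON", 0, 4)}, {("PERSON", 0, 4), ("PHONE_NUMBER", 10, 22)}),     # missed entity
]
accuracy = sum(exact_match(p, e) for p, e in cases) / len(cases)
print(f"exact-match accuracy: {accuracy:.1%}")  # 50.0% on this toy sample
```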

Coverage by language

The current public benchmark includes English, Turkish, Spanish, French, and German prompt samples.

Language    Precision    Recall     F1         False positive rate
DE          100.0%       100.0%     100.0%     0.0%
EN          100.0%       100.0%     100.0%     0.0%
ES          100.0%       90.1%      94.8%      0.0%
FR          100.0%       98.8%      99.4%      0.0%
TR          100.0%       100.0%     100.0%     0.0%

Coverage by entity

This benchmark release tracks the entity families most relevant to our current public product posture.

CREDIT_CARD

Precision
100.0%
Recall
100.0%
F1
100.0%
False positive rate
0.0%

EMAIL_ADDRESS

Precision
100.0%
Recall
100.0%
F1
100.0%
False positive rate
0.0%

IP_ADDRESS

Precision
100.0%
Recall
100.0%
F1
100.0%
False positive rate
0.0%

PERSON

Precision
100.0%
Recall
89.4%
F1
94.4%
False positive rate
0.0%

PHONE_NUMBER

Precision
100.0%
Recall
100.0%
F1
100.0%
False positive rate
0.0%

TR_ID_NUMBER

Precision
100.0%
Recall
100.0%
F1
100.0%
False positive rate
0.0%

UK_NHS_NUMBER

Precision
100.0%
Recall
100.0%
F1
100.0%
False positive rate
0.0%

Report generated from the checked-in benchmark artifact on 2026-05-08.