Guard Brasil: 16 Brazilian PII patterns in 4ms

Enio Rocha

TL;DR: Guard Brasil is an open-source API that detects 16 Brazilian personal data patterns (CPF, CNPJ, RG, MASP, REDS, SUS card...) in 4ms. Real check digit validation, not just regex. MIT license, self-hostable, free tier of 500 calls per month. Try it now.

The problem: Brazilian data in generic APIs

When you use Microsoft Presidio or AWS Macie to detect personal data in text, they find emails, phone numbers, and credit cards. They do not find MASP (functional ID for state employees in Minas Gerais), REDS (police incident report number in MG), SUS card (national health system ID), NIS and PIS (worker registration), or Titulo de Eleitor (voter ID with its own check digits).

If you work with Brazilian data in chatbots, ERPs, health systems, or police investigations, these patterns matter as much as CPF and CNPJ, and no global library covers them.

The 16 patterns

| Category | Patterns | Validation |
|---|---|---|
| Identity | CPF, CNPJ, RG, CNH, Titulo de Eleitor | Real check digit (CPF, CNPJ, Titulo) |
| Health and government | NIS/PIS, SUS Card, MASP | Format and length |
| Investigation | REDS (MG), judicial process number (CNJ) | CNJ/REDS standard format |
| Contact | Email, Phone (landline and mobile BR), CEP | Regex plus BR format |
| Vehicles | Mercosul plate, legacy plate | Format ABC1D23 and ABC-1234 |
| Financial | Credit card | Luhn algorithm |
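
Two of the checks named in the table are simple enough to sketch. The snippet below is an illustration, not Guard Brasil's actual code: the regexes encode only the two plate layouts listed (ABC1D23 and ABC-1234), and `luhn_valid` is the standard Luhn checksum. All names are hypothetical.

```python
import re

# Plate layouts as listed in the table (names are illustrative).
MERCOSUL_PLATE = re.compile(r"\b[A-Z]{3}\d[A-Z]\d{2}\b")  # ABC1D23
LEGACY_PLATE = re.compile(r"\b[A-Z]{3}-\d{4}\b")          # ABC-1234


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A format match only says the text has the right shape; the Luhn step is what rejects a 16-digit string that is not a plausible card number.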

Each pattern has an associated LGPD (Brazil's data protection law, similar to GDPR) risk level. CPF, CNH, and health data are CRITICAL under Art. 5 and Art. 11. Email and CEP are MEDIUM. The classification follows the ANPD (Brazil's data protection authority) interpretation of sensitive personal data.
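
One way to picture the per-pattern risk levels is a lookup table plus a "highest level wins" rule. This is a sketch under assumptions: the article only states CRITICAL and MEDIUM, so the LOW and HIGH levels, their ordering, the default for unlisted types, treating the SUS card as health data, and every type name other than CPF and SUS_CARD are hypothetical.

```python
# Assumed ordering, lowest to highest; only MEDIUM and CRITICAL come from the article.
LEVELS = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]

# Levels stated in the article. SUS_CARD is treated as health data (assumption).
RISK_BY_TYPE = {
    "CPF": "CRITICAL",
    "CNH": "CRITICAL",
    "SUS_CARD": "CRITICAL",
    "EMAIL": "MEDIUM",
    "CEP": "MEDIUM",
}


def overall_risk(pattern_types, default="LOW"):
    """Return the highest risk level among the detected types, or None if empty.

    `default` is a placeholder for types the article does not classify.
    """
    levels = [RISK_BY_TYPE.get(t, default) for t in pattern_types]
    if not levels:
        return None
    return max(levels, key=LEVELS.index)
```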

Live test

The API is public. No signup, no credit card:

curl -X POST https://guard.egos.ia.br/v1/inspect \
  -H "Content-Type: application/json" \
  -d '{"text": "Patient CPF: 123.456.789-09, SUS card 898 0016 0045 0004"}'

Response in about 4ms:

{
  "patterns": [
    {"type": "CPF", "value": "123.456.789-09", "valid": true},
    {"type": "SUS_CARD", "value": "898 0016 0045 0004"}
  ],
  "lgpd_risk": "CRITICAL",
  "has_sensitive_data": true,
  "latency_ms": 4
}
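
The same call can be made from Python with only the standard library. This is a minimal sketch that assumes the endpoint and response shape shown above; the function names and the timeout value are illustrative.

```python
import json
import urllib.request

INSPECT_URL = "https://guard.egos.ia.br/v1/inspect"  # endpoint from the curl example


def build_inspect_request(text: str) -> urllib.request.Request:
    """Build the POST request with the JSON body used in the curl example."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        INSPECT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def inspect(text: str, timeout: float = 5.0) -> dict:
    """Send the text to the API and return the decoded JSON response."""
    with urllib.request.urlopen(build_inspect_request(text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize(result: dict) -> str:
    """One-line summary of a response shaped like the example above."""
    types = ", ".join(p["type"] for p in result.get("patterns", []))
    return f"{result.get('lgpd_risk')}: {types}"
```

Calling `summarize(inspect("Patient CPF: 123.456.789-09"))` would then give a short string suitable for logging, provided the service responds as in the example.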

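The field valid: true on the CPF means the check digits pass. This is what separates Guard from pure regex: a number such as 123.456.789-00 has the right shape but would return valid: false, because its last two digits do not match the pair computed from the first nine (09). Sequences of one repeated digit, such as 000.000.000-00, are a special case: they satisfy the check digit arithmetic, so validators conventionally reject them with an explicit rule. Pattern matching alone reports any 11 digits in the right layout as a CPF; check digit validation removes those false positives.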
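
For readers who have not seen it, the CPF check is a weighted sum modulo 11, applied twice. The sketch below is the publicly known algorithm, not Guard Brasil's source; function names are illustrative. It also rejects sequences of one repeated digit, which pass the arithmetic and are conventionally excluded by a separate rule.

```python
import re


def cpf_check_digit(digits):
    """Compute one CPF check digit from the digits that precede it."""
    # Weights count down to 2: 10..2 for the first digit, 11..2 for the second.
    weights = range(len(digits) + 1, 1, -1)
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def is_valid_cpf(value: str) -> bool:
    """Return True if `value` holds 11 digits whose last two are correct check digits."""
    digits = [int(ch) for ch in re.sub(r"\D", "", value)]
    if len(digits) != 11:
        return False
    # Repeated-digit sequences (000..., 111...) satisfy the sum but are not real CPFs.
    if len(set(digits)) == 1:
        return False
    first = cpf_check_digit(digits[:9])
    second = cpf_check_digit(digits[:9] + [first])
    return digits[9:] == [first, second]
```

For the example number, the first weighted sum is 210, which leaves remainder 1 and gives check digit 0; the second is 255, which leaves remainder 2 and gives 9, hence the suffix 09.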

Real use cases

| Use case | How Guard helps | Risk without it |
|---|---|---|
| LLM chatbot | Inspect user input before sending to model | CPF or CNH leaks to third-party API |
| ETL pipeline | Classify PII fields before writing to data lake | Sensitive data in table with no access control |
| Police investigation (our case) | Audit trail of who accessed investigation data | No LGPD Art. 37 compliance |
| Healthtech | Detect health data (Art. 11) in free-text fields | ANPD fine for irregular sensitive data treatment |
| Log sanitization | Find PII in application logs | Personal data in Elasticsearch without protection |
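
The LLM chatbot row can be sketched as a small gate in front of the model call. This is an illustration under assumptions: `inspect` is any callable that returns a dict shaped like the API response shown earlier, and refusing the input outright is one policy among several (an application could also warn the user or log the event). All names are hypothetical.

```python
from typing import Callable


class SensitiveInputError(Exception):
    """Raised when user input contains data that should not reach the model."""


def guard_prompt(text: str, inspect: Callable[[str], dict]) -> str:
    """Return `text` unchanged if the inspection finds nothing sensitive.

    `inspect` is assumed to return a dict with `has_sensitive_data` and
    `patterns`, as in the example response.
    """
    result = inspect(text)
    if result.get("has_sensitive_data"):
        types = sorted({p["type"] for p in result.get("patterns", [])})
        raise SensitiveInputError(f"input contains: {', '.join(types)}")
    return text
```

Passing the inspection function in as an argument keeps the gate testable without network access and lets the same code work against the hosted API or a self-hosted instance.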

Compliance, not masking

Guard Brasil does not mask data from operators by default. This is intentional. In a police precinct, the investigator needs to see the suspect's CPF. In a hospital, the doctor needs to see the patient's SUS number. Masking that data would break their work.

What Guard does is generate the audit trail: who accessed the data, when, what type of data it was, and at what risk level. That is what LGPD Art. 37 requires: a record of processing operations, not a block on legitimate access. Each call to the API internally generates a SHA-256 hash of the evidence as a provenance receipt, usable as auditable proof if ANPD requests it.
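
To make the idea of an audit record with a provenance receipt concrete, here is a minimal sketch. The article does not specify which fields Guard hashes, so the record layout, the field names, and the choice to hash a canonical JSON encoding are all assumptions for illustration. Note that the record stores data types, not the raw values.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(actor, data_types, risk_level, accessed_at=None):
    """Build an audit entry and attach a SHA-256 receipt over its contents."""
    accessed_at = accessed_at or datetime.now(timezone.utc)
    record = {
        "actor": actor,                         # who accessed
        "accessed_at": accessed_at.isoformat(), # when
        "data_types": sorted(data_types),       # what type of data
        "risk_level": risk_level,               # what risk level
    }
    # Canonical encoding (sorted keys, fixed separators) so the same record
    # always produces the same hash.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["receipt_sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record
```

Anyone holding the record can recompute the hash from the four fields and compare it with the stored receipt to check that the entry was not altered.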

Guard Brasil versus alternatives

| Metric | Guard Brasil | Presidio | AWS Macie |
|---|---|---|---|
| Latency p95 | 4ms | ~50ms (Python NLP) | Batch (minutes) |
| Native BR patterns | 16 | 2-3 if configured | Generic |
| Check digit validation | CPF, CNPJ, Titulo | Regex only | N/A |
| Self-hostable | Yes (MIT) | Yes (MIT) | No (AWS only) |
| LGPD classification | Native (Art. 5, 11) | Generic (GDPR) | Generic |
| Cost | Free tier 500/month | Free (self-host) | Pay-per-GB |

Guard does not replace Presidio or Macie. For global patterns (SSN, passport), use Presidio. For Brazilian structured data with real validation, use Guard Brasil. Running both in sequence is a valid architecture.
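
Running two detectors in sequence mostly comes down to merging their findings. The sketch below is deliberately generic: each detector is a plain callable returning dicts with `type` and `value`, which is an assumed shape and not the real Presidio or Guard client API. Detectors listed first win when both report the same value.

```python
def run_in_sequence(text, detectors):
    """Run (name, detect) pairs in order and merge findings, deduplicated by value.

    Each `detect(text)` is assumed to return a list of dicts with
    "type" and "value" keys.
    """
    seen = set()
    merged = []
    for name, detect in detectors:
        for finding in detect(text):
            if finding["value"] in seen:
                continue  # an earlier detector already reported this value
            seen.add(finding["value"])
            merged.append({**finding, "source": name})
    return merged
```

Putting the Brazilian detector first means a CPF keeps its validated classification even if a generic detector also matches the same digits under another label.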

What did not work

  • Free-text name detection: Guard detects structured patterns, not names or addresses in free text. For unstructured PII, combine with an NLP approach.
  • Partial masking heuristics: partially redacted data like 123.XXX.XXX-00 is not detected as CPF. Identifiers whose digits have been replaced by mask characters cannot be validated and are out of scope.
  • Volume at free tier: 500 calls per month is tight for high-traffic apps. Self-hosting is the intended path for production scale.
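
For the partial masking case, a caller who still wants to flag such strings could run a separate format-only check alongside Guard. This is a hypothetical workaround, not a Guard Brasil feature; the set of mask characters (X, x, *) is an assumption.

```python
import re

MASK_CHARS = "Xx*"  # assumed mask characters

# CPF layout where each position is a digit or a mask character.
MASKED_CPF = re.compile(
    r"(?<![\w.])[\dXx*]{3}\.[\dXx*]{3}\.[\dXx*]{3}-[\dXx*]{2}(?!\w)"
)


def find_masked_cpfs(text: str):
    """Return CPF-shaped strings that contain at least one mask character."""
    return [
        match.group(0)
        for match in MASKED_CPF.finditer(text)
        if any(ch in MASK_CHARS for ch in match.group(0))
    ]
```

Fully numeric CPFs are skipped on purpose, since those are the ones check digit validation can handle.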

Open questions

  • How to audit a Drive with thousands of files for PII retroactively at reasonable cost?
  • What is the right granularity for LGPD risk levels in a multi-tenant system where tenants have different compliance needs?
  • When does self-hosting Guard Brasil make more sense than the hosted API?

Files referenced in this article

Open source. Everything here is available at github.com/enioxt/egos. If you are building something similar or want to apply this in your context, reach out on X: @eniorocha_. Building in public.