Guard Brasil: 16 Brazilian PII patterns in 4ms

TL;DR: Guard Brasil is an open-source API that detects 16 Brazilian personal data patterns (CPF, CNPJ, RG, MASP, REDS, SUS card, and others) in about 4ms. It validates real check digits for CPF, CNPJ, and Titulo de Eleitor instead of relying on regex alone. MIT license, self-hostable, free tier of 500 calls per month. Try it now.

The problem: Brazilian data in generic APIs

When you use Microsoft Presidio or AWS Macie to detect personal data in text, they find emails, phone numbers, and credit cards. But they do not find MASP (functional ID for state employees in Minas Gerais), REDS (police incident report number in MG), SUS card (national health system ID), NIS and PIS (worker registration), or Titulo de Eleitor (voter ID with specific check digits). If you work with Brazilian data in chatbots, ERPs, health systems, or police investigations, these patterns matter as much as CPF and CNPJ. No global library covers them out of the box.

The 16 patterns

Category	Patterns	Validation
Identity	CPF, CNPJ, RG, CNH, Titulo de Eleitor	Real check digit (CPF, CNPJ, Titulo)
Health and government	NIS/PIS, SUS Card, MASP	Format and length
Investigation	REDS (MG), judicial process number (CNJ)	CNJ/REDS standard format
Contact	Email, Phone (landline and mobile BR), CEP	Regex plus BR format
Vehicles	Mercosul plate, legacy plate	Format ABC1D23 and ABC-1234
Financial	Credit card	Luhn algorithm
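The Luhn check listed for credit cards is a standard public algorithm. A minimal sketch of it follows. The function name and the 13 to 19 digit length bound are choices of this example, not taken from Guard Brasil's source.

```typescript
// Luhn checksum: walking from the rightmost digit, double every second
// digit, subtract 9 from any result above 9, and require sum % 10 === 0.
function passesLuhn(input: string): boolean {
  const digits = input.replace(/[\s-]/g, "");
  // Length bound is an assumption of this sketch, not a Guard Brasil rule.
  if (!/^\d{13,19}$/.test(digits)) return false;

  let sum = 0;
  let doubleNext = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // "0" is char code 48
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}
```

A 16-digit string that merely looks like a card number fails this check roughly nine times out of ten, which is why a checksum cuts false positives compared with a bare digit-count regex.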

Each pattern has an associated LGPD (Brazil's data protection law, similar to GDPR) risk level. CPF, CNH, and health data are CRITICAL under Art. 5 and Art. 11. Email and CEP are MEDIUM. The classification follows the ANPD (Brazil's data protection authority) interpretation of sensitive personal data.
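One way to picture how per-pattern levels roll up into a single lgpd_risk value is to take the highest level among the detected types. The sketch below encodes only the levels stated above. The type identifiers CNH, EMAIL, and CEP, the treatment of SUS_CARD as health data, and the "highest level wins" rule are assumptions of this example, not documented Guard Brasil behavior.

```typescript
type RiskLevel = "MEDIUM" | "CRITICAL";

// Only the levels named in the article; every other type is left
// unclassified in this sketch rather than guessed.
const RISK_BY_TYPE = new Map<string, RiskLevel>([
  ["CPF", "CRITICAL"],
  ["CNH", "CRITICAL"],
  ["SUS_CARD", "CRITICAL"], // treated here as health data (assumption)
  ["EMAIL", "MEDIUM"],
  ["CEP", "MEDIUM"],
]);

const RANK: Record<RiskLevel, number> = { MEDIUM: 1, CRITICAL: 2 };

// Returns the highest known risk among the detected types,
// or null when none of the types is classified in this sketch.
function overallRisk(types: string[]): RiskLevel | null {
  let worst: RiskLevel | null = null;
  for (const t of types) {
    const level = RISK_BY_TYPE.get(t);
    if (level === undefined) continue;
    if (worst === null || RANK[level] > RANK[worst]) worst = level;
  }
  return worst;
}
```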

Live test

The API is public. No signup, no credit card:

curl -X POST https://guard.egos.ia.br/v1/inspect \
  -H "Content-Type: application/json" \
  -d '{"text": "Patient CPF: 123.456.789-09, SUS card 898 0016 0045 0004"}'

Response in about 4ms:

{
  "patterns": [
    {"type": "CPF", "value": "123.456.789-09", "valid": true},
    {"type": "SUS_CARD", "value": "898 0016 0045 0004"}
  ],
  "lgpd_risk": "CRITICAL",
  "has_sensitive_data": true,
  "latency_ms": 4
}
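The same call can be made from TypeScript. The sketch below types the request and response using only what the curl command and the sample response show. The interface and function names are hypothetical, and the real API may return fields that are not modeled here.

```typescript
// Shapes inferred from the sample request and response; an assumption,
// not an official schema.
interface Finding {
  type: string;
  value: string;
  valid?: boolean; // absent on SUS_CARD in the sample response
}

interface InspectResponse {
  patterns: Finding[];
  lgpd_risk: string;
  has_sensitive_data: boolean;
  latency_ms: number;
}

interface InspectRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

const INSPECT_URL = "https://guard.egos.ia.br/v1/inspect";

// Builds the same POST request as the curl command.
function buildInspectRequest(text: string): InspectRequest {
  return {
    url: INSPECT_URL,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  };
}

// Parses a response body and fails loudly if the expected fields are missing.
function parseInspectResponse(raw: string): InspectResponse {
  const data = JSON.parse(raw) as Partial<InspectResponse>;
  if (!Array.isArray(data.patterns) || typeof data.lgpd_risk !== "string") {
    throw new Error("unexpected response shape");
  }
  return data as InspectResponse;
}

// Usage with the global fetch available in Node 18+ or a browser:
//   const req = buildInspectRequest("Patient CPF: 123.456.789-09");
//   const res = await fetch(req.url, req);
//   const result = parseInspectResponse(await res.text());
```

Keeping request building and response parsing as pure functions means both can be unit tested without touching the network.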

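The field valid: true on the CPF means the check digits pass. This is what separates Guard from pure regex: 000.000.000-00 matches the CPF format but would return valid: false. Strictly speaking, a repeated-digit sequence like this one satisfies the mod-11 arithmetic, so CPF validators conventionally reject it with an explicit rule, while a number such as 123.456.789-00 fails on the arithmetic itself. Pattern matching without validation creates false positives. Check digit validation is what filters them out.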
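For reference, the CPF check digit rule is public: each of the two check digits is a weighted sum of the preceding digits taken mod 11. The sketch below is an independent implementation of that rule, not Guard Brasil's code, and the function name is hypothetical. It also rejects repeated-digit sequences, which pass the arithmetic but are conventionally treated as invalid.

```typescript
// Validates a CPF in formatted (123.456.789-09) or digits-only form.
function isValidCpf(input: string): boolean {
  const digits = input.replace(/\D/g, "");
  if (digits.length !== 11) return false;
  // 000.000.000-00, 111.111.111-11, etc. satisfy the arithmetic
  // but are conventionally rejected.
  if (/^(\d)\1{10}$/.test(digits)) return false;

  const nums = digits.split("").map(Number);

  // Check digit over the first `count` digits, with weights running
  // from count + 1 down to 2. A remainder below 2 maps to 0.
  const checkDigit = (count: number): number => {
    let sum = 0;
    for (let i = 0; i < count; i++) {
      sum += nums[i] * (count + 1 - i);
    }
    const rest = sum % 11;
    return rest < 2 ? 0 : 11 - rest;
  };

  return checkDigit(9) === nums[9] && checkDigit(10) === nums[10];
}
```

For 123.456.789-09, the first weighted sum is 210, and 210 mod 11 is 1, so the first check digit is 0. The second sum is 255, and 255 mod 11 is 2, so the second check digit is 9.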

Real use cases

Use case	How Guard helps	Risk without it
LLM chatbot	Inspect user input before sending to model	CPF or CNH leaks to third-party API
ETL pipeline	Classify PII fields before writing to data lake	Sensitive data in table with no access control
Police investigation (our case)	Audit trail of who accessed investigation data	No LGPD Art. 37 compliance
Healthtech	Detect health data (Art. 11) in free-text fields	ANPD fine for irregular sensitive data treatment
Log sanitization	Find PII in application logs	Personal data in Elasticsearch without protection
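As an illustration of the chatbot row, a gate in front of a third-party model could look at the inspection result before forwarding the user's message. The policy in this sketch (block on CRITICAL risk or on any sensitive data) is a hypothetical example, and the field names are taken from the sample response shown earlier.

```typescript
// Subset of the inspect response that the gate needs.
interface InspectSummary {
  lgpd_risk: string;
  has_sensitive_data: boolean;
}

type GateDecision =
  | { forward: true }
  | { forward: false; reason: string };

// Hypothetical policy: never forward CRITICAL or sensitive input
// to a third-party model. Real deployments would tune this.
function gateForModel(summary: InspectSummary): GateDecision {
  if (summary.lgpd_risk === "CRITICAL") {
    return { forward: false, reason: "CRITICAL LGPD risk in user input" };
  }
  if (summary.has_sensitive_data) {
    return { forward: false, reason: "sensitive data in user input" };
  }
  return { forward: true };
}
```

Returning a reason alongside the decision lets the chatbot explain to the user why a message was held back instead of failing silently.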

Compliance, not masking

Guard Brasil does not mask data from operators by default. This is intentional. In a police precinct, the investigator needs to see the suspect's CPF. In a hospital, the doctor needs to see the patient's SUS number. Masking that data would break their work. What Guard does is generate the audit trail: who accessed, when, what type of data, what risk level. That is what LGPD Art. 37 requires: a record of processing operations, not a block on legitimate access. Each call to the API internally generates a SHA-256 hash of the evidence as a provenance receipt, which can serve as auditable proof if ANPD requests it.
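A record carrying those four facts plus a SHA-256 receipt could be sketched as follows. The field names, the canonical serialization, and the choice of what goes into the hash are assumptions of this example, since the article does not describe Guard Brasil's internal format. The hash uses Node's built-in crypto module.

```typescript
import { createHash } from "node:crypto";

interface AuditRecord {
  actor: string;        // who accessed
  accessedAt: string;   // when, as an ISO 8601 timestamp
  dataTypes: string[];  // what type of data (types only, never the values)
  riskLevel: string;    // what risk level
  evidenceHash: string; // SHA-256 over the four fields above
}

function buildAuditRecord(
  actor: string,
  accessedAt: Date,
  dataTypes: string[],
  riskLevel: string
): AuditRecord {
  const when = accessedAt.toISOString();
  const types = dataTypes.slice().sort();
  // Fixed field order and sorted types give a canonical form, so the
  // same event always produces the same receipt.
  const canonical = JSON.stringify([actor, when, types, riskLevel]);
  const evidenceHash = createHash("sha256")
    .update(canonical, "utf8")
    .digest("hex");
  return { actor, accessedAt: when, dataTypes: types, riskLevel, evidenceHash };
}
```

Recording only the pattern types keeps the audit log itself free of personal data, while the hash lets an auditor confirm that a stored record was not altered after the fact.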

Guard Brasil versus alternatives

Metric	Guard Brasil	Presidio	AWS Macie
Latency p95	4ms	~50ms (Python NLP)	Batch (minutes)
Native BR patterns	16	2-3 if configured	Generic
Check digit validation	CPF, CNPJ, Titulo	Regex only	N/A
Self-hostable	Yes (MIT)	Yes (MIT)	No (AWS only)
LGPD classification	Native (Art. 5, 11)	Generic (GDPR)	Generic
Cost	Free tier 500/month	Free (self-host)	Pay-per-GB

Guard does not replace Presidio or Macie. For global patterns (SSN, passport), use Presidio. For Brazilian structured data with real validation, use Guard Brasil. Running both in sequence is a valid architecture.

What did not work

Free-text name detection: Guard detects structured patterns, not names or addresses in free text. For unstructured PII, combine with an NLP approach.
Partial masking heuristics: partially redacted data like 123.XXX.XXX-00 is not detected as CPF. Structural PII without digits is outside scope.
Volume at free tier: 500 calls per month is tight for high-traffic apps. Self-hosting is the intended path for production scale.
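The partial masking limitation follows from how structured detection works: a digit-based pattern needs all 11 digits present, and check digit validation needs them too. The sketch below uses an illustrative regex for a formatted CPF. It is not Guard Brasil's actual pattern.

```typescript
// Illustrative shape of a formatted CPF: three dot-separated groups of
// three digits, a hyphen, and two check digits.
const CPF_SHAPE = /\b\d{3}\.\d{3}\.\d{3}-\d{2}\b/;

function looksLikeCpf(text: string): boolean {
  return CPF_SHAPE.test(text);
}

// looksLikeCpf("CPF: 123.456.789-09") is true
// looksLikeCpf("CPF: 123.XXX.XXX-00") is false: the X runs are not digits
```

Loosening the pattern to accept masked characters would match the redacted string, but there would be nothing left to validate, so every such match would be an unverified guess.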

Open questions

How to audit a Drive with thousands of files for PII retroactively at reasonable cost?
What is the right granularity for LGPD risk levels in a multi-tenant system where tenants have different compliance needs?
When does self-hosting Guard Brasil make more sense than the hosted API?

Files referenced in this article

packages/guard-brasil/ — Guard Brasil source (16 pattern modules, validators, classifier)
packages/guard-brasil/src/index.ts — entry point, exports guard.inspect()

Related in EGOS

Wrong Altitude: why Guard Brasil is a tool, not the central product
Documentation lies: the manifest that monitors Guard Brasil endpoints automatically

Open source. Everything here is available at github.com/enioxt/egos. If you are building something similar or want to apply this in your context, reach out on X: @eniorocha_. Building in public.
