mBERT for Medical Data Anonymization

Model description

mBERT-AnonyMED-BR-syn is a version of Multilingual BERT (mBERT) fine-tuned for medical data anonymization in Brazilian Portuguese, trained exclusively on synthetic medical records from the AnonyMED-BR dataset.

  • Base architecture: BERT Base Multilingual (cased)
  • Fine-tuning: AnonyMED-BR (synthetic subset only)
  • Language coverage: 104 languages
  • Domain: Healthcare / medical anonymization

Intended uses & limitations

Intended uses

  • Medical data anonymization in Brazilian Portuguese.
  • Named Entity Recognition (NER) for sensitive entities such as names, dates, IDs, hospitals, and locations.
  • Research in privacy-preserving NLP where synthetic data is preferred over real patient data.

Limitations

  • Fine-tuned only on synthetic data, so it may not fully capture the linguistic variability of real clinical notes.
  • Extractive NER approach: the model identifies and tags sensitive entities but does not rewrite the text itself (a masking sketch is included under "How to use" below).
  • Designed and evaluated in the medical domain; performance in other domains is not guaranteed.

How to use

Example usage with Hugging Face Transformers:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "Venturus/mBERT-AnonyMED-BR-syn"  # fine-tuned checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Token-classification (NER) pipeline that tags sensitive entities in the input text
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

text = "O paciente João da Silva foi internado no Hospital das Clínicas em 12/05/2023."
entities = nlp(text)

print(entities)

Example output (simplified; the full pipeline output also includes confidence scores and character offsets):

[
  {"word": "João", "entity": "PATIENT"},
  {"word": "da Silva", "entity": "PATIENT"},
  {"word": "Hospital das Clínicas", "entity": "HOSPITAL"},
  {"word": "12/05/2023", "entity": "DATE"}
]
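
The output above is illustrative. Because the model only tags entities (see Limitations), replacing them in the text is a separate post-processing step. The sketch below continues from the snippet above and shows one possible approach, not part of the original card: it re-creates the pipeline with aggregation_strategy="simple", which merges subword pieces into whole entity spans with character offsets, and replaces each span with its label.

# Group subword pieces into whole entity spans (returns "entity_group", "start", "end")
nlp_grouped = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

def mask_entities(text, entities):
    # Replace spans from right to left so earlier character offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"<{ent['entity_group']}>" + text[ent["end"]:]
    return text

print(mask_entities(text, nlp_grouped(text)))
# e.g. "O paciente <PATIENT> foi internado no <HOSPITAL> em <DATE>."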

Training procedure

  • Base architecture: BERT Base Multilingual (cased).
  • Fine-tuning: token classification (NER) on AnonyMED-BR (synthetic split only).
  • Hyperparameters:
    • Learning rate: 5e-5
    • Batch size: 4
    • Epochs: 5
    • Precision: FP32
  • Hardware: NVIDIA Tesla T4 (16 GB).
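
The listing below is a minimal sketch of how an equivalent run could be configured with the Hugging Face Trainer using the hyperparameters above; the output directory name is hypothetical and the original training script is not part of this card.

from transformers import TrainingArguments

# Hypothetical TrainingArguments mirroring the hyperparameters reported above;
# the actual training setup for this checkpoint is not published in this card.
training_args = TrainingArguments(
    output_dir="mbert-anonymed-br-syn",  # hypothetical output directory
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=5,
    fp16=False,                          # FP32 precision
)

# A Trainer would then be built around bert-base-multilingual-cased with a
# token-classification head and the tokenized AnonyMED-BR synthetic split, e.g.:
#   trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_train)
#   trainer.train()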

Evaluation results

The model was evaluated on the AnonyMED-BR test set.

Overall results

  • F1-score: 0.9257
  • Precision: 0.9191
  • Recall: 0.9359

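Entity-level precision, recall, and F1 of this kind are commonly computed with seqeval over BIO-tagged sequences. The snippet below is a small illustration with made-up gold and predicted tags; it is not the evaluation script used to produce the numbers above.

from seqeval.metrics import precision_score, recall_score, f1_score

# Toy example: gold vs. predicted BIO tags for two sentences (illustrative only)
y_true = [["B-PATIENT", "I-PATIENT", "O", "B-DATE"], ["B-HOSPITAL", "I-HOSPITAL", "O"]]
y_pred = [["B-PATIENT", "I-PATIENT", "O", "B-DATE"], ["B-HOSPITAL", "O", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
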
Entities covered

  • Personal data:

    • <PATIENT>
    • <DOCTOR>
    • <AGE>
    • <PROFESSION>
  • Identifiers:

    • <IDNUM>
    • <MEDICAL_RECORD>
    • <HEALTH_PLAN>
  • Locations:

    • <CITY>
    • <STATE>
    • <COUNTRY>
    • <STREET>
    • <HOSPITAL>
    • <LOCATION_OTHER>
    • <ZIP>
  • Other sensitive data:

    • <DATE>
    • <EMAIL>
    • <PHONE>
    • <ORGANIZATION>
    • <OTHER>
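
The tag set above can be cross-checked against the checkpoint itself, since the label mapping is stored in the model configuration; the short snippet below only reads that mapping.

from transformers import AutoConfig

# Print the id -> label mapping shipped with the checkpoint
config = AutoConfig.from_pretrained("Venturus/mBERT-AnonyMED-BR-syn")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)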

Citation

If you use this model, please cite:
