mBERT for Medical Data Anonymization
Model description
mBERT-AnonyMED-BR-syn is a fine-tuned version of Multilingual BERT (mBERT) for medical data anonymization in Brazilian Portuguese, trained exclusively on synthetic medical records from the AnonyMED-BR dataset.
- Base architecture: BERT Base Multilingual (cased)
- Fine-tuning: AnonyMED-BR (synthetic subset only)
- Language coverage: 104 languages
- Domain: Healthcare / medical anonymization
Intended uses & limitations
Intended uses
- Medical data anonymization in Brazilian Portuguese.
- Named Entity Recognition (NER) for sensitive entities such as names, dates, IDs, hospitals, and locations.
- Research in privacy-preserving NLP where synthetic data is preferred over real patient data.
Limitations
- Fine-tuned only on synthetic data, so it may not fully capture the linguistic variability of real clinical notes.
- Extractive NER approach: the model identifies and tags sensitive entities but does not rewrite the text.
- Designed and evaluated in the medical domain; performance in other domains is not guaranteed.
How to use
Example usage with Hugging Face Transformers:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "Venturus/mBERT-AnonyMED-BR-syn"  # fine-tuned model on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# aggregation_strategy="simple" merges subword tokens into full entity spans
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "O paciente João da Silva foi internado no Hospital das Clínicas em 12/05/2023."
entities = nlp(text)
print(entities)
```
Example output (simplified; the full pipeline output also includes confidence scores and character offsets):

```python
[
    {"word": "João da Silva", "entity_group": "PATIENT"},
    {"word": "Hospital das Clínicas", "entity_group": "HOSPITAL"},
    {"word": "12/05/2023", "entity_group": "DATE"}
]
```
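Since the model only tags spans (see Limitations), producing an anonymized document requires a small post-processing step. The following is a minimal sketch, not part of the original card: it assumes the `text` and `entities` variables from the example above, produced with `aggregation_strategy="simple"` so each entry carries `"start"`/`"end"` character offsets and an `"entity_group"` tag.

```python
# Minimal masking sketch: replace each detected span with its tag.
# Assumes `text` and `entities` from the pipeline example above.
def mask_entities(text, entities):
    # Work right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f'<{ent["entity_group"]}>' + text[ent["end"]:]
    return text

print(mask_entities(text, entities))
# Illustrative result: "O paciente <PATIENT> foi internado no <HOSPITAL> em <DATE>."
```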
Training procedure
- Base architecture: BERT Base Multilingual (cased).
- Fine-tuning: NER fine-tuning on AnonyMED-BR (synthetic split only).
- Hyperparameters:
  - Learning rate: 5e-5
  - Batch size: 4
  - Epochs: 5
  - Precision: FP32
- Hardware: NVIDIA Tesla T4 (16 GB).
</gr-replace>
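For reference, below is a minimal fine-tuning sketch consistent with the hyperparameters above. The toy one-sentence dataset and the reduced label list are assumptions; the original card does not show how the AnonyMED-BR synthetic split is loaded or how labels are aligned.

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base = "google-bert/bert-base-multilingual-cased"
labels = ["O", "B-PATIENT", "I-PATIENT"]  # illustrative subset of the full tag set
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=len(labels))

# Toy one-sentence dataset standing in for the AnonyMED-BR synthetic split.
enc = tokenizer("O paciente João foi internado.", truncation=True)
enc["labels"] = [0] * len(enc["input_ids"])  # dummy all-"O" labels
train_ds = Dataset.from_dict({k: [v] for k, v in enc.items()})

args = TrainingArguments(
    output_dir="mbert-anonymed-br-syn",
    learning_rate=5e-5,             # as reported above
    per_device_train_batch_size=4,  # as reported above
    num_train_epochs=5,             # as reported above
    fp16=False,                     # FP32 training, per the card
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```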
Evaluation results
The model was evaluated on the AnonyMED-BR test set.
Overall results
- F1-score: 0.9257
- Precision: 0.9191
- Recall: 0.9359
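As a reference for reproducing numbers like these, the sketch below shows how entity-level precision, recall, and F1 are commonly computed with the seqeval library. The use of seqeval and the toy labels are assumptions; the card does not name its evaluation tooling.

```python
# Entity-level metrics with seqeval (illustrative toy labels, not
# AnonyMED-BR data). seqeval scores whole entity spans, not tokens.
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["B-PATIENT", "I-PATIENT", "O", "B-DATE"]]
y_pred = [["B-PATIENT", "I-PATIENT", "O", "O"]]  # one missed entity

print(precision_score(y_true, y_pred))  # 1.0 (1 of 1 predicted spans correct)
print(recall_score(y_true, y_pred))     # 0.5 (1 of 2 gold spans found)
print(f1_score(y_true, y_pred))         # ~0.667
```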
Entities covered
- Personal data: <PATIENT>, <DOCTOR>, <AGE>, <PROFESSION>
- Identifiers: <IDNUM>, <MEDICAL_RECORD>, <HEALTH_PLAN>
- Locations: <CITY>, <STATE>, <COUNTRY>, <STREET>, <HOSPITAL>, <LOCATION_OTHER>, <ZIP>
- Other sensitive data: <DATE>, <EMAIL>, <PHONE>, <ORGANIZATION>, <OTHER>
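To verify the exact tag set shipped with the checkpoint, the config's id2label mapping can be inspected; this relies on standard Transformers behavior rather than anything specific documented in the card.

```python
# Inspect the label set stored in the model's config. id2label is the
# standard Transformers mapping from class index to tag name.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Venturus/mBERT-AnonyMED-BR-syn")
print(config.id2label)
```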
Citation
If you use this model, please cite:
Base model: google-bert/bert-base-multilingual-cased