mBERT for Medical Data Anonymization

Model description

mBERT-AnonyMED-BR-syn is a version of Multilingual BERT (mBERT) fine-tuned for medical data anonymization in Brazilian Portuguese, trained exclusively on synthetic medical records from the AnonyMED-BR dataset.

  • Base architecture: BERT Base Multilingual (cased)
  • Fine-tuning: AnonyMED-BR (synthetic subset only)
  • Language coverage: 104 languages
  • Domain: Healthcare / medical anonymization

Intended uses & limitations

Intended uses

  • Medical data anonymization in Brazilian Portuguese.
  • Named Entity Recognition (NER) for sensitive entities such as names, dates, IDs, hospitals, and locations.
  • Research in privacy-preserving NLP where synthetic data is preferred over real patient data.

Limitations

  • Fine-tuned only on synthetic data, so it may not fully capture the linguistic variability of real clinical notes.
  • Extractive NER approach: the model identifies and tags sensitive entities but does not rewrite the text itself (a masking sketch is included under "How to use" below).
  • Designed and evaluated in the medical domain; performance in other domains is not guaranteed.

How to use

Example usage with Hugging Face Transformers:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "Venturus/mBERT-AnonyMED-BR-syn"  # fine-tuned checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Token-classification (NER) pipeline that tags sensitive entities in the input text
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

text = "O paciente João da Silva foi internado no Hospital das Clínicas em 12/05/2023."
entities = nlp(text)

print(entities)

Example output (simplified; the full pipeline output also includes confidence scores and character offsets):

[
  {"word": "João", "entity": "PATIENT"},
  {"word": "da Silva", "entity": "PATIENT"},
  {"word": "Hospital das Clínicas", "entity": "HOSPITAL"},
  {"word": "12/05/2023", "entity": "DATE"}
]
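
The output above is illustrative. Because the model only tags entities (see Limitations), replacing them in the text is a separate post-processing step. The sketch below continues from the snippet above and shows one possible approach, not part of the original card: it re-creates the pipeline with aggregation_strategy="simple", which merges subword pieces into whole entity spans with character offsets, and replaces each span with its label.

# Group subword pieces into whole entity spans (returns "entity_group", "start", "end")
nlp_grouped = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

def mask_entities(text, entities):
    # Replace spans from right to left so earlier character offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"<{ent['entity_group']}>" + text[ent["end"]:]
    return text

print(mask_entities(text, nlp_grouped(text)))
# e.g. "O paciente <PATIENT> foi internado no <HOSPITAL> em <DATE>."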

Training procedure

  • Base architecture: BERT Base Multilingual (cased).
  • Fine-tuning: token classification (NER) on AnonyMED-BR (synthetic split only).
  • Hyperparameters:
    • Learning rate: 5e-5
    • Batch size: 4
    • Epochs: 5
    • Precision: FP32
  • Hardware: NVIDIA Tesla T4 (16 GB).
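
The listing below is a minimal sketch of how an equivalent run could be configured with the Hugging Face Trainer using the hyperparameters above; the output directory name is hypothetical and the original training script is not part of this card.

from transformers import TrainingArguments

# Hypothetical TrainingArguments mirroring the hyperparameters reported above;
# the actual training setup for this checkpoint is not published in this card.
training_args = TrainingArguments(
    output_dir="mbert-anonymed-br-syn",  # hypothetical output directory
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=5,
    fp16=False,                          # FP32 precision
)

# A Trainer would then be built around bert-base-multilingual-cased with a
# token-classification head and the tokenized AnonyMED-BR synthetic split, e.g.:
#   trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_train)
#   trainer.train()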

Evaluation results

The model was evaluated on the AnonyMED-BR test set.

Overall results

  • F1-score: 0.9257
  • Precision: 0.9191
  • Recall: 0.9359

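Entity-level precision, recall, and F1 of this kind are commonly computed with seqeval over BIO-tagged sequences. The snippet below is a small illustration with made-up gold and predicted tags; it is not the evaluation script used to produce the numbers above.

from seqeval.metrics import precision_score, recall_score, f1_score

# Toy example: gold vs. predicted BIO tags for two sentences (illustrative only)
y_true = [["B-PATIENT", "I-PATIENT", "O", "B-DATE"], ["B-HOSPITAL", "I-HOSPITAL", "O"]]
y_pred = [["B-PATIENT", "I-PATIENT", "O", "B-DATE"], ["B-HOSPITAL", "O", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
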
Entities covered

  • Personal data:

    • <PATIENT>
    • <DOCTOR>
    • <AGE>
    • <PROFESSION>
  • Identifiers:

    • <IDNUM>
    • <MEDICAL_RECORD>
    • <HEALTH_PLAN>
  • Locations:

    • <CITY>
    • <STATE>
    • <COUNTRY>
    • <STREET>
    • <HOSPITAL>
    • <LOCATION_OTHER>
    • <ZIP>
  • Other sensitive data:

    • <DATE>
    • <EMAIL>
    • <PHONE>
    • <ORGANIZATION>
    • <OTHER>
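
The tag set above can be cross-checked against the checkpoint itself, since the label mapping is stored in the model configuration; the short snippet below only reads that mapping.

from transformers import AutoConfig

# Print the id -> label mapping shipped with the checkpoint
config = AutoConfig.from_pretrained("Venturus/mBERT-AnonyMED-BR-syn")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)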

Citation

If you use this model, please cite:
