# Auden-Voice
Auden-Voice is a general-purpose voice encoder trained to learn robust speaker representations.
The model is trained with multi-task learning: jointly optimizing speaker identification, emotion, gender, and age classification objectives yields more general and transferable voice representations.
## Model Details
- Model type: Voice encoder
- Architecture: Zipformer
- Embedding dimension: 768
- Number of parameters: ~156M
- Framework: PyTorch
- Output: Frame-level embeddings `[B, T, D]`
- Pooling: User-defined (e.g., mean pooling for utterance-level embeddings; see the sketch below)
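
Because the encoder returns frame-level embeddings, any pooling that maps `[B, T, D]` to a fixed-size vector can sit on top. The usage example below applies a masked mean; as one alternative, here is a minimal sketch of attentive statistics pooling, a common choice for speaker tasks. This module is not shipped with Auden-Voice and is only an illustration of a user-defined pooling layer.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Learns a per-frame weight, then returns the weighted mean and std of the frames."""

    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, x, lengths):
        # x: [B, T, D] frame embeddings; lengths: [B] valid frame counts.
        mask = torch.arange(x.size(1), device=x.device)[None, :] < lengths[:, None]
        scores = self.attn(x).squeeze(-1).masked_fill(~mask, float("-inf"))
        w = torch.softmax(scores, dim=1).unsqueeze(-1)                 # [B, T, 1]
        mean = (w * x).sum(dim=1)                                      # [B, D]
        var = (w * (x - mean.unsqueeze(1)) ** 2).sum(dim=1)
        return torch.cat([mean, var.clamp_min(1e-9).sqrt()], dim=-1)  # [B, 2D]

# Usage: pooled = AttentiveStatsPooling(768)(frame_embeddings, x_lens)  # [B, 1536]
```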
## Training

### Training Strategy
Multi-task learning was found to work best. The model is jointly trained on the following tasks:
- Speaker identification
- Emotion classification
- Gender classification
- Age classification
This setup encourages the encoder to learn robust and general-purpose voice representations.
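
The released checkpoint contains only the encoder; the exact head architectures and loss weighting live in the paper and the training code linked below. A minimal sketch of this setup, assuming one linear head per task over a pooled utterance embedding and an unweighted sum of cross-entropy losses (class counts are illustrative: VoxCeleb2 has 5,994 training speakers, CREMA-D uses 6 emotion categories, and the 4-way age binning is hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    """One linear classifier per task, all sharing the same pooled encoder embedding."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.heads = nn.ModuleDict({task: nn.Linear(dim, n) for task, n in num_classes.items()})

    def forward(self, pooled):  # pooled: [B, D]
        return {task: head(pooled) for task, head in self.heads.items()}

heads = MultiTaskHeads(768, {"speaker": 5994, "emotion": 6, "gender": 2, "age": 4})

def multitask_loss(pooled, labels):
    """Unweighted sum of per-task cross-entropy losses (the weighting is an assumption)."""
    logits = heads(pooled)
    return sum(F.cross_entropy(logits[t], labels[t]) for t in logits)
```

In practice each batch would contribute losses only for the tasks its source dataset annotates, e.g. VoxCeleb2 provides speaker labels but no emotion labels.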
### Training Data
The model is trained on publicly available academic speech datasets, totaling approximately 2050 hours of audio.
| Task | Dataset(s) | #Samples | Hours |
|---|---|---|---|
| Speaker Identification | VoxCeleb2 | 974k | 2026 |
| Paralinguistic Tasks | CREMA-D, RAVDESS, IEMOCAP, TESS | 18.3k | 20 |
### Training Code
Full training scripts and configurations are available at:
https://github.com/AudenAI/Auden/tree/main/examples/voice
## Intended Use
This model is intended to be used as a general-purpose voice encoder for:
- Speaker identification and verification
- Speaker diarization
- Emotion, gender, and age classification
- Audio–text and text–audio retrieval (see the ranking sketch after this list)
- Speech-related downstream tasks that benefit from pretrained voice embeddings
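
For the retrieval use cases, Auden-Voice supplies only the audio side; text embeddings would come from a paired text encoder that is not part of this checkpoint. A minimal sketch of how ranking and R@1 could be computed, using random placeholder embeddings for both sides:

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: in practice, `audio_embs` would be pooled, L2-normalized
# Auden-Voice outputs [N, D], and `text_embs` [N, D] would come from a text encoder.
audio_embs = F.normalize(torch.randn(4, 768), dim=-1)
text_embs = F.normalize(torch.randn(4, 768), dim=-1)

scores = audio_embs @ text_embs.T                  # [N, N] cosine similarities
top1 = scores.argmax(dim=-1)                       # best text match for each audio clip
r_at_1 = (top1 == torch.arange(4)).float().mean()  # R@1 when pairs are index-aligned
print(f"Audio → Text R@1: {r_at_1:.2f}")
```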
## How to Use

### Load the Encoder
```python
import torch

from auden.auto.auto_model import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

encoder = AutoModel.from_pretrained("AudenAI/auden-encoder-voice")
encoder = encoder.to(device).eval()  # inference mode: disables dropout, etc.
```
### Extract Voice Embeddings
```python
import torch.nn.functional as F

audio_files = ["/path/to/audio1.wav", "/path/to/audio2.wav"]

embeddings_list = []
for i, audio_file in enumerate(audio_files):
    # Compute input features for one file; x_lens holds the valid length of each item.
    x, x_lens = encoder.extract_feature([audio_file])
    x, x_lens = x.to(device), x_lens.to(device)

    with torch.no_grad():
        encoder_output = encoder(x, x_lens)
    frame_embeddings = encoder_output["encoder_out"]  # [B, T, D]

    # Masked global average pooling (example for speaker verification):
    # zero out padded frames, then divide by the true frame count.
    T = frame_embeddings.size(1)
    mask = (torch.arange(T, device=device).unsqueeze(0) < x_lens.unsqueeze(1)).unsqueeze(-1).float()
    utterance_embedding = (frame_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
    embeddings_list.append(utterance_embedding)

    print(f"🎵 Audio {i + 1}:")
    print(f"Frame embeddings shape: {frame_embeddings.shape}")
    print(f"Utterance embedding shape: {utterance_embedding.shape}")

embeddings = torch.cat(embeddings_list, dim=0)  # [N, D]
embeddings = F.normalize(embeddings, p=2, dim=-1)

# For L2-normalized vectors, the dot product is the cosine similarity.
similarity = torch.matmul(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.4f}")
print(f"Same speaker: {'YES' if similarity > 0.5 else 'NO'}")  # 0.5 is an illustrative threshold
```
### Expected Output

```text
🎵 Audio 1:
Frame embeddings shape: torch.Size([1, 97, 768])
Utterance embedding shape: torch.Size([1, 768])
🎵 Audio 2:
Frame embeddings shape: torch.Size([1, 138, 768])
Utterance embedding shape: torch.Size([1, 768])
Cosine similarity: 0.7234
Same speaker: YES
```
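
With more than two files, the same L2-normalized embeddings yield every pairwise cosine similarity in a single matrix product. Note that the 0.5 same-speaker threshold in the snippet above is illustrative only and should be calibrated on a development set for the target domain.

```python
# All pairwise cosine similarities between the N pooled utterances: [N, N].
pairwise = embeddings @ embeddings.T
```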
## Performance

| Task | Dataset | Metric | Score |
|---|---|---|---|
| Speaker Identification | VoxCeleb2 | Accuracy | 95.25% |
| Speaker Verification | VoxCeleb1-O | EER | 3% |
| Speaker Diarization | VoxConverse | DER | 17% |
| Age Classification | CREMA-D | Accuracy | 93.91% |
| Gender Classification | CREMA-D | Accuracy | 99.72% |
| Gender Classification | RAVDESS | Accuracy | 100% |
| Emotion Classification | CREMA-D | Accuracy | 83.99% |
| Emotion Classification | RAVDESS | Accuracy | 89.71% |
| Audio → Text Retrieval | ParaspeechCaps | R@1 | 63.31 |
| Text → Audio Retrieval | ParaspeechCaps | R@1 | 61.69 |
| LLM-QA Emotion | AirBench-MELD | Accuracy | 27.23% |
| LLM-QA Emotion | AirBench-IEMOCAP | Accuracy | 84.70% |
| LLM-QA Gender | AirBench-MELD | Accuracy | 81.58% |
| LLM-QA Gender | AirBench-CommonVoice | Accuracy | 93.15% |
| LLM-QA Age | AirBench-CommonVoice | Accuracy | 58.27% |
## Limitations
- The model is trained primarily on English speech data and may not generalize well to other languages.
- The model is not evaluated on generative tasks such as speech synthesis or voice conversion.
- Utterance-level representations depend on the pooling strategy selected by the user.
## Citation
If you use this model in your research, please cite:
```bibtex
@article{huo2025auden,
  title={Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding},
  author={Huo, Mingyue and Tseng, Wei-Cheng and Shao, Yiwen and Zhang, Hao and Yu, Dong},
  journal={arXiv preprint arXiv:2511.15145},
  year={2025}
}
```