Publication Details

CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification

PENG Junyi, MOŠNER Ladislav, ZHANG Lin, PLCHOT Oldřich, STAFYLAKIS Themos, BURGET Lukáš and ČERNOCKÝ Jan. CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification. In: Proceedings of ICASSP 2025. Hyderabad: IEEE Biometric Council, 2025, pp. 1-5. ISBN 979-8-3503-6874-1. Available from: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10889058

Czech title

CA-MHFA: Kontextově orientovaný extraktor informace o mluvčím pro ověřování mluvčího na základě samoučení

Type

conference paper

Language

English

Authors

Peng Junyi, Msc. Eng. (DCGM FIT BUT)
Mošner Ladislav, Ing. (DCGM FIT BUT)
Zhang Lin, Ph.D. (FIT BUT)
Plchot Oldřich, Ing., Ph.D. (DCGM FIT BUT)
Stafylakis Themos (OMILIA)
Burget Lukáš, doc. Ing., Ph.D. (DCGM FIT BUT)
Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT)

URL

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10889058

Keywords

Self-supervised learning, speaker verification, speaker extractor, pooling mechanism, speech classification

Abstract

Self-supervised learning (SSL) models for speaker verification (SV) have gained significant attention in recent years. However, existing SSL-based SV systems often struggle to capture local temporal dependencies and generalize across different tasks. In this paper, we propose context-aware multi-head factorized attentive pooling (CA-MHFA), a lightweight framework that incorporates contextual information from surrounding frames. CA-MHFA leverages grouped, learnable queries to effectively model contextual dependencies while maintaining efficiency by sharing keys and values across groups. Experimental results on the VoxCeleb dataset show that CA-MHFA achieves EERs of 0.42%, 0.48%, and 0.96% on Vox1-O, Vox1-E, and Vox1-H, respectively, outperforming complex models like WavLM-TDNN with fewer parameters and faster convergence. Additionally, CA-MHFA demonstrates strong generalization across multiple SSL models and tasks, including emotion recognition and anti-spoofing, highlighting its robustness and versatility.
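
The abstract describes the pooling mechanism only at a high level, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes that each query group holds one learnable vector per context offset, that a frame's attention score is the sum of dot products between those vectors and the keys of the surrounding frames, and that all groups attend over the same keys and values. The function name `context_aware_pooling`, the zero-padding at utterance edges, the odd window width, and the random toy inputs are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def context_aware_pooling(keys, values, queries):
    """Hypothetical sketch of grouped-query attentive pooling with context.

    keys, values: one vector per frame, shared by every query group.
    queries[g][c]: query vector of group g for context offset (c - half).
    Returns the concatenation of one pooled value vector per group.
    """
    num_frames = len(keys)
    half = len(queries[0]) // 2  # assumes an odd context window
    value_dim = len(values[0])
    pooled = []
    for group in queries:
        scores = []
        for t in range(num_frames):
            score = 0.0
            for c, q in enumerate(group):
                u = t + c - half
                if 0 <= u < num_frames:  # frames outside the utterance add 0
                    score += dot(q, keys[u])
            scores.append(score)
        weights = softmax(scores)  # attention over time for this group
        pooled.extend(
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(value_dim)
        )
    return pooled


# Toy usage with random stand-ins for projected SSL features.
random.seed(0)
num_frames, key_dim, value_dim, num_groups, width = 6, 4, 3, 2, 3
keys = [[random.gauss(0, 1) for _ in range(key_dim)] for _ in range(num_frames)]
values = [[random.gauss(0, 1) for _ in range(value_dim)] for _ in range(num_frames)]
queries = [
    [[random.gauss(0, 1) for _ in range(key_dim)] for _ in range(width)]
    for _ in range(num_groups)
]
embedding = context_aware_pooling(keys, values, queries)
print(len(embedding))  # num_groups * value_dim = 6
```

With a window width of 1 the sketch reduces to ordinary attentive pooling with one query per group, which is one way to see what the surrounding-frame context adds. Because the keys and values are computed once and reused, adding a group costs only its own query vectors in this sketch.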

Published

2025

Pages

1-5

Proceedings

Proceedings of ICASSP 2025

Conference

ICASSP 2025, International Conference on Acoustics, Speech, and Signal Processing, Hyderabad, IN

ISBN

979-8-3503-6874-1

Publisher

IEEE Biometric Council

Place

Hyderabad, IN

DOI

10.1109/ICASSP49660.2025.10889058

BibTeX

@INPROCEEDINGS{FITPUB13521,
   author = "Junyi Peng and Ladislav Mo\v{s}ner and Lin Zhang and Old\v{r}ich Plchot and Themos Stafylakis and Luk\'{a}\v{s} Burget and Jan \v{C}ernock\'{y}",
   title = "CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification",
   pages = "1--5",
   booktitle = "Proceedings of ICASSP 2025",
   year = 2025,
   location = "Hyderabad, IN",
   publisher = "IEEE Biometric Council",
   ISBN = "979-8-3503-6874-1",
   doi = "10.1109/ICASSP49660.2025.10889058",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/13521"
}