Result details

Analysis of ABC Frontend Audio Systems for the NIST-SRE24

BARAHONA, S.; SILNOVA, A.; MOŠNER, L.; PENG, J.; PLCHOT, O.; ROHDIN, J.; ZHANG, L.; HAN, J.; PALKA, P.; LANDINI, F.; BURGET, L.; STAFYLAKIS, T.; CUMANI, S.; BOBOŠ, D.; HLAVAČEK, M.; KODOVSKY, M.; PAVLIČEK, T. Analysis of ABC Frontend Audio Systems for the NIST-SRE24. In Proceedings of the Annual Conference of the International Speech Communication Association Interspeech. Interspeech. Rotterdam: International Speech Communication Association, 2025. p. 5763-5767.

Type

conference proceedings paper

Language

English

Authors

Barahona Sara
Silnova Anna, M.Sc., Ph.D., UPGM (FIT)
Mošner Ladislav, Ing., Ph.D., UPGM (FIT)
Peng Junyi, UPGM (FIT)
Plchot Oldřich, Ing., Ph.D., UPGM (FIT)
Rohdin Johan Andréas, M.Sc., Ph.D., FIT (FIT), UPGM (FIT)
Zhang Lin, Ph.D.
Han Jiangyu, UPGM (FIT)
Pálka Petr, Ing., FIT (FIT), UPGM (FIT)
Landini Federico Nicolás, Ph.D.
Burget Lukáš, doc. Ing., Ph.D., UPGM (FIT)
Stafylakis Themos
Cumani Sandro, Ph.D.
Boboš Dominik, Ing.
Hlavaček Miroslav
Kodovsky Martin
Pavliček Tomaš

Abstract

We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed condition) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explore architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, and a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, which contains 110 thousand speakers across multiple languages. We observe good performance and robustness of VoxBlink-trained models, and our experiments show practical recipes for developing state-of-the-art frontends for speaker recognition.

Keywords

embedding extractors | NIST-SRE | speaker recognition | VoxBlink

URL

https://www.isca-archive.org/interspeech_2025/barahona25_interspeech.pdf

Year

2025

Pages

5763–5767

Journal

Interspeech, ISSN

Proceedings

Proceedings of the Annual Conference of the International Speech Communication Association Interspeech

Conference

Interspeech Conference

Publisher

International Speech Communication Association

Place

Rotterdam

DOI

10.21437/Interspeech.2025-2737

EID Scopus

2-s2.0-105020095403

BibTeX

@inproceedings{BUT199934,
  author="Sara {Barahona} and Anna {Silnova} and Ladislav {Mošner} and Junyi {Peng} and Oldřich {Plchot} and Johan Andréas {Rohdin} and Lin {Zhang} and Jiangyu {Han} and Petr {Pálka} and Federico Nicolás {Landini} and Lukáš {Burget} and Themos {Stafylakis} and Sandro {Cumani} and Dominik {Boboš} and Miroslav {Hlavaček} and Martin {Kodovsky} and Tomaš {Pavliček}",
  title="Analysis of ABC Frontend Audio Systems for the NIST-SRE24",
  booktitle="Proceedings of the Annual Conference of the International Speech Communication Association Interspeech",
  year="2025",
  journal="Interspeech",
  pages="5763--5767",
  publisher="International Speech Communication Association",
  address="Rotterdam",
  doi="10.21437/Interspeech.2025-2737",
  url="https://www.isca-archive.org/interspeech_2025/barahona25_interspeech.pdf"
}

Projects

Linguistics, Artificial Intelligence and Language and Speech Technologies: from Research to Applications, EU, Intersectoral Cooperation, EH23_020/0008518, start: 2025-01-01, end: 2028-12-31, ongoing
Exchanges for Speech Research and Technologies, EU, Horizon 2020, start: 2021-01-01, end: 2025-12-31, completed

Research groups

Speech Data Mining Research Group BUT Speech@FIT (VZ SPEECH)

Departments

Department of Computer Graphics and Multimedia (UPGM)
Speech Data Mining Research Group BUT Speech@FIT (VZ SPEECH)