Publication Details

Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization

PÁLKA Petr, LANDINI Federico Nicolás, KLEMENT Dominik, DIEZ Sánchez Mireia, SILNOVA Anna, DELCROIX Marc and BURGET Lukáš. Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization. In: Proceedings of Eusipco 2025. Palermo: IEEE Signal Processing Society, 2025, pp. 1-5.

Czech title

Společné tréninování extraktoru embeddingů mluvčích, detekce řeči a detekce překrytí mluvčích pro diarizaci

Type

conference paper

Language

English

Authors

Pálka Petr, Bc. (DCGM FIT BUT)
Landini Federico Nicolás (DCGM FIT BUT)
Klement Dominik, Ing. (DCGM FIT BUT)
Diez Sánchez Mireia, M.Sc., Ph.D. (DCGM FIT BUT)
Silnova Anna, MSc., Ph.D. (DCGM FIT BUT)
Delcroix Marc (NTT)
Burget Lukáš, doc. Ing., Ph.D. (DCGM FIT BUT)

Keywords

speaker diarization, speaker embedding, voice activity detection, overlapped speech detection

Abstract

In spite of the popularity of end-to-end diarization systems nowadays, modular systems comprised of voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still attain competitive performance in many conditions. However, one of the main drawbacks of modular systems is the need to run (and train) different modules independently. In this work, we propose an approach to jointly train a model to produce speaker embeddings, VAD and OSD simultaneously and reach competitive performance at a fraction of the inference time of a modular approach. Furthermore, the joint inference leads to a simplified overall pipeline which brings us one step closer to a unified clustering-based method that can be trained end-to-end towards a diarization-specific objective.

Published

2025 (in print)

Pages

1-5

Proceedings

Proceedings of Eusipco 2025

Conference

The 33rd European Signal Processing Conference (EUSIPCO 2025), Palermo, IT

Publisher

IEEE Signal Processing Society

Place

Palermo, IT

BibTeX

@INPROCEEDINGS{FITPUB13567,
   author = "Petr P\'{a}lka and Federico Nicol\'{a}s Landini and Dominik Klement and Mireia S\'{a}nchez Diez and Anna Silnova and Marc Delcroix and Luk\'{a}\v{s} Burget",
   title = "Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization",
   pages = "1--5",
   booktitle = "Proceedings of Eusipco 2025",
   year = 2025,
   location = "Palermo, IT",
   publisher = "IEEE Signal Processing Society",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/13567"
}