Publication Details

Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers

KUMAR, S.; MADIKERI, S.; NIGMATULINA, I.; VILLATORO-TELLO, E.; MOTLÍČEK, P.; PANDIA, K.; DUBAGUNTA, P.; GANAPATHIRAJU, A. Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Seoul: IEEE Signal Processing Society, 2024. p. 12592-12596. ISBN: 979-8-3503-4485-1.
Czech title
Víceúlohové rozpoznávání řeči a detekce změny mluvčího pro neznámý počet mluvčích
Type
conference paper
Language
English
Authors
KUMAR, S.
Madikeri Srikanth
NIGMATULINA, I.
VILLATORO-TELLO, E.
Motlíček Petr, doc. Ing., Ph.D. (DCGM)
PANDIA, K.
DUBAGUNTA, P.
GANAPATHIRAJU, A.
URL
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446130
Keywords

speaker change detection, speaker turn detection, speech recognition, multitask
learning, F1 score

Abstract

Traditionally, automatic speech recognition (ASR) and speaker change detection
(SCD) systems have been independently trained to generate comprehensive
transcripts accompanied by speaker turns. Recently, joint training of ASR and SCD
systems, by inserting speaker turn tokens in the ASR training text, has been
shown to be successful. In this work, we present a multitask alternative to the
joint training approach. Results obtained on the mix-headset audio of the AMI corpus
show that the proposed multitask training yields an absolute improvement of 1.8%
in the coverage- and purity-based F1 score on the SCD task without ASR degradation. We
also examine the trade-offs between the ASR and SCD performance when trained
using multitask criteria. Additionally, we validate the speaker change
information in the embedding spaces obtained after different transformer layers
of a self-supervised pre-trained model, such as XLSR-53, by integrating an SCD
classifier at the output of specific transformer layers. Results reveal that the
use of different embedding spaces from the XLSR-53 model for multitask ASR and SCD is
advantageous.
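
The abstract reports SCD quality as an F1 score based on coverage and purity. As an illustration only, the sketch below uses one common segment-level definition of these two quantities and combines them by harmonic mean. The function names, the segment representation, and the example boundaries are assumptions made for this sketch; the scoring setup actually used in the paper is not described in this record and may differ.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds; hypothetical representation


def _overlap(a: Segment, b: Segment) -> float:
    """Duration of the intersection of two segments (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def _best_overlap_ratio(targets: List[Segment], others: List[Segment]) -> float:
    """Share of total `targets` duration explained by each target's
    single best-overlapping segment in `others`."""
    total = sum(end - start for start, end in targets)
    if total == 0:
        return 0.0
    matched = sum(
        max((_overlap(t, o) for o in others), default=0.0) for t in targets
    )
    return matched / total


def coverage(reference: List[Segment], hypothesis: List[Segment]) -> float:
    """How well each reference segment is covered by one hypothesis segment."""
    return _best_overlap_ratio(reference, hypothesis)


def purity(reference: List[Segment], hypothesis: List[Segment]) -> float:
    """How well each hypothesis segment stays inside one reference segment."""
    return _best_overlap_ratio(hypothesis, reference)


def f1_score(cov: float, pur: float) -> float:
    """Harmonic mean of coverage and purity."""
    return 0.0 if cov + pur == 0 else 2 * cov * pur / (cov + pur)


# Made-up example: the reference has a speaker change at 10 s that the
# hypothesis misses, so coverage stays perfect while purity drops.
reference = [(0.0, 10.0), (10.0, 20.0)]
hypothesis = [(0.0, 20.0)]
cov = coverage(reference, hypothesis)
pur = purity(reference, hypothesis)
f1 = f1_score(cov, pur)
```

In this toy case a missed change lowers purity, whereas a spurious change would lower coverage. The harmonic mean penalises both kinds of error.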

Published
2024
Pages
12592–12596
Proceedings
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Conference
2024 IEEE International Conference on Acoustics, Speech and Signal Processing IEEE, Seoul, KR
ISBN
979-8-3503-4485-1
Publisher
IEEE Signal Processing Society
Place
Seoul
DOI
10.1109/ICASSP48485.2024.10446130
BibTeX
@inproceedings{BUT196785,
  author="KUMAR, S. and MADIKERI, S. and NIGMATULINA, I. and VILLATORO-TELLO, E. and MOTLÍČEK, P. and PANDIA, K. and DUBAGUNTA, P. and GANAPATHIRAJU, A.",
  title="Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers",
  booktitle="ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
  year="2024",
  pages="12592--12596",
  publisher="IEEE Signal Processing Society",
  address="Seoul",
  doi="10.1109/ICASSP48485.2024.10446130",
  isbn="979-8-3503-4485-1",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446130"
}