Publication Details
Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model
Kocour, M.
Žmolíková Kateřina, Ing., Ph.D. (FIT)
Ondel Yang, L.
Švec Ján, Ing. (DCGM)
Delcroix Marc
Ochiai, T.
Burget Lukáš, doc. Ing., Ph.D. (DCGM)
Černocký Jan, prof. Dr. Ing. (DCGM)
Keywords: Multi-talker speech recognition, Permutation invariant training, Factorial Hidden Markov models
In typical multi-talker speech recognition systems, a neural network-based
acoustic model predicts senone state posteriors for each speaker. These are later
used by a single-talker decoder, which is applied to each speaker-specific output
stream separately. In this work, we argue that such a scheme is sub-optimal and
propose a principled solution that decodes all speakers jointly. We modify the
acoustic model to predict joint state posteriors for all speakers, enabling the
network to express uncertainty about the attribution of parts of the speech
signal to the speakers. We employ a joint decoder that can make use of this
uncertainty together with higher-level language information. For this, we revisit
decoding algorithms used in factorial generative models in early multi-talker
speech recognition systems. In contrast with these early works, we replace the
GMM acoustic model with a DNN, which provides greater modeling power and simplifies
part of the inference. We demonstrate the advantage of joint decoding in
proof-of-concept experiments on a mixed-TIDIGITS dataset.
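The joint decoding idea described in the abstract can be illustrated, in very reduced form, as a Viterbi search over the product state space of two speakers. The sketch below is an illustration only, not the paper's actual decoder: `joint_viterbi` and its inputs are hypothetical names, the joint log-posteriors stand in for the DNN outputs described above, and the per-speaker transition scores are simply added in the log domain to form the joint transition model.

```python
import numpy as np

def joint_viterbi(log_post, log_trans1, log_trans2):
    """Viterbi decoding over the product state space of two speakers.

    log_post:   (T, S1, S2) joint state log-posteriors for all speakers
                (stand-in for the DNN acoustic model output)
    log_trans1: (S1, S1) log transition scores for speaker 1
    log_trans2: (S2, S2) log transition scores for speaker 2
    Returns the jointly decoded state sequences for both speakers.
    """
    T, S1, S2 = log_post.shape
    # Joint transition over pairs: (prev1, prev2) -> (next1, next2),
    # flattened so that pair (s1, s2) maps to product state s1 * S2 + s2.
    log_trans = (log_trans1[:, None, :, None]
                 + log_trans2[None, :, None, :]).reshape(S1 * S2, S1 * S2)
    delta = log_post[0].reshape(-1)          # best score per product state
    backptr = np.zeros((T, S1 * S2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # (prev state, next state)
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_post[t].reshape(-1)
    # Backtrace the best joint path.
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path // S2, path % S2             # per-speaker state sequences
```

Because the search runs over speaker pairs, mass that the acoustic model spreads across attributions (the "uncertainty" mentioned above) can be resolved by the transition/language-level scores rather than independently per stream; the price is a state space that grows multiplicatively with the number of speakers.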
@inproceedings{BUT179827,
  author="KOCOUR, M. and ŽMOLÍKOVÁ, K. and ONDEL YANG, L. and ŠVEC, J. and DELCROIX, M. and OCHIAI, T. and BURGET, L. and ČERNOCKÝ, J.",
  title="Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model",
  booktitle="Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
  year="2022",
  pages="4955--4959",
  publisher="International Speech Communication Association",
  address="Incheon",
  doi="10.21437/Interspeech.2022-10406",
  issn="1990-9772",
  url="https://www.isca-speech.org/archive/pdfs/interspeech_2022/kocour22_interspeech.pdf"
}