Publication Details

SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures

ŽMOLÍKOVÁ, K.; DELCROIX, M.; KINOSHITA, K.; OCHIAI, T.; NAKATANI, T.; BURGET, L.; ČERNOCKÝ, J. SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures. IEEE J-STSP, 2019, vol. 13, no. 4, p. 800-814. ISSN: 1932-4553.
Czech title
Neuronová síť poučená o mluvčím pro extrakci cílového mluvčího ze směsi řečových signálů
Type
journal article
Language
English
Authors
Žmolíková Kateřina, Ing., Ph.D. (FIT)
Delcroix Marc
Kinoshita Keisuke
Ochiai Tsuyoshi
Nakatani Tomohiro
Burget Lukáš, doc. Ing., Ph.D. (DCGM)
Černocký Jan, prof. Dr. Ing. (DCGM)
URL
Keywords

Speaker extraction, speaker-aware neural network, multi-speaker speech recognition.

Abstract

The processing of speech corrupted by interfering overlapping speakers is one of
the challenging problems for today's automatic speech recognition systems.
Recently, approaches based on deep learning have made great progress toward
solving this problem. Most of these approaches tackle the problem as speech
separation, i.e., they blindly recover all the speakers from the mixture. In
some scenarios, such as smart personal devices, we may however be interested in
recovering one target speaker from a mixture. In this paper, we introduce
SpeakerBeam, a method for extracting a target speaker from the mixture based on
an adaptation utterance spoken by the target speaker. Formulating the problem as
speaker extraction avoids certain issues such as label permutation and the need
to determine the number of speakers in the mixture. With SpeakerBeam, we jointly
learn to extract a representation from the adaptation utterance characterizing
the target speaker and to use this representation to extract the speaker. We
explore several ways to do this, mostly inspired by speaker adaptation in
acoustic models for automatic speech recognition. We evaluate the performance on
the widely used WSJ0-2mix and WSJ0-3mix datasets, and on these datasets modified
with more noise or more realistic overlapping patterns. We further analyze the
learned behavior by exploring the speaker representations and assessing the
effect of the length of the adaptation data. The results show the benefit of
including speaker information in the processing and the effectiveness of the
proposed method.
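The core idea described above, conditioning an extraction network on a speaker representation derived from an adaptation utterance, can be sketched as follows. This is a minimal illustration in NumPy, not the paper's implementation: the dimensions, random weights, and averaging-based embedding are simplifications of the sequence-summary network and multiplicative adaptation layer the abstract alludes to.

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_embedding(adaptation_feats, w_emb):
    # Project frame-level features and average over time into a
    # fixed-size speaker vector (a stand-in for the jointly learned
    # auxiliary network that characterizes the target speaker).
    return np.tanh(adaptation_feats @ w_emb).mean(axis=0)

def adapted_layer(mixture_feats, w, emb):
    # Multiplicative adaptation: the speaker vector element-wise
    # scales the hidden activations of the extraction network,
    # steering it toward the target speaker.
    hidden = np.tanh(mixture_feats @ w)
    return hidden * emb  # broadcasts the embedding over time frames

feat_dim, hidden_dim = 40, 64                 # arbitrary toy sizes
w_emb = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
w = rng.standard_normal((feat_dim, hidden_dim)) * 0.1

adapt = rng.standard_normal((50, feat_dim))   # adaptation utterance, 50 frames
mix = rng.standard_normal((120, feat_dim))    # speech mixture, 120 frames

emb = speaker_embedding(adapt, w_emb)
out = adapted_layer(mix, w, emb)
print(out.shape)  # (120, 64): one adapted activation vector per mixture frame
```

In the paper both the embedding network and the extraction network are trained jointly, so the speaker representation is optimized directly for the extraction task rather than fixed in advance.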

Published
2019
Pages
800–814
Journal
IEEE J-STSP, vol. 13, no. 4, ISSN 1932-4553
DOI
10.1109/JSTSP.2019.2922820
UT WoS
000477715300003
EID Scopus
BibTeX
@article{BUT159990,
  author="ŽMOLÍKOVÁ, K. and DELCROIX, M. and KINOSHITA, K. and OCHIAI, T. and NAKATANI, T. and BURGET, L. and ČERNOCKÝ, J.",
  title="SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures",
  journal="IEEE J-STSP",
  year="2019",
  volume="13",
  number="4",
  pages="800--814",
  doi="10.1109/JSTSP.2019.2922820",
  issn="1932-4553",
  url="https://ieeexplore.ieee.org/document/8736286"
}