Publication Details

Learning Speaker Representation for Neural Network Based Multichannel Speaker Extraction

ŽMOLÍKOVÁ, K.; DELCROIX, M.; KINOSHITA, K.; HIGUCHI, T.; OGAWA, A.; NAKATANI, T. Learning Speaker Representation for Neural Network Based Multichannel Speaker Extraction. In Proceedings of ASRU 2017. Okinawa: IEEE Signal Processing Society, 2017. p. 8-15. ISBN: 978-1-5090-4788-8.
Czech title
Učení reprezentací řečníků pro vícekanálovou extrakci jednoho řečníka založenou na neuronových sítích
Type
conference paper
Language
English
Authors
Žmolíková Kateřina, Ing., Ph.D. (FIT)
Delcroix Marc
Kinoshita Keisuke
Higuchi Takuya
Ogawa Atsunori
Nakatani Tomohiro
URL
http://www.fit.vutbr.cz/research/groups/speech/publi/2017/zmolikova_asru2017.pdf
Keywords

speaker extraction, speaker adaptive neural network, multi-speaker speech recognition, speaker representation learning, beamforming

Abstract

Recently, schemes employing deep neural networks (DNNs) for extracting speech from noisy observation have demonstrated great potential for noise robust automatic speech recognition. However, these schemes are not well suited when the interfering noise is another speaker. To enable extracting a target speaker from a mixture of speakers, we have recently proposed to inform the neural network using speaker information extracted from an adaptation utterance from the same speaker. In our previous work, we explored ways how to inform the network about the speaker and found a speaker adaptive layer approach to be suitable for this task. In our experiments, we used speaker features designed for speaker recognition tasks as the additional speaker information, which may not be optimal for the speaker extraction task. In this paper, we propose a usage of a sequence summarizing scheme enabling to learn the speaker representation jointly with the network. Furthermore, we extend the previous experiments to demonstrate the potential of our proposed method as a front-end for speech recognition and explore the effect of additional noise on the performance of the method.

Annotation

Recently, schemes employing deep neural networks (DNNs) for extracting speech from noisy observation have demonstrated great potential for noise robust automatic speech recognition. However, these schemes are not well suited when the interfering noise is another speaker. To enable extracting a target speaker from a mixture of speakers, we have recently proposed to inform the neural network using speaker information extracted from an adaptation utterance from the same speaker. In our previous work, we explored ways how to inform the network about the speaker and found a speaker adaptive layer approach to be suitable for this task. In our experiments, we used speaker features designed for speaker recognition tasks as the additional speaker information, which may not be optimal for the speaker extraction task. In this paper, we propose a usage of a sequence summarizing scheme enabling to learn the speaker representation jointly with the network. Furthermore, we extend the previous experiments to demonstrate the potential of our proposed method as a front-end for speech recognition and explore the effect of additional noise on the performance of the method.

Published
2017
Pages
8–15
Proceedings
Proceedings of ASRU 2017
Conference
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, JP
ISBN
978-1-5090-4788-8
Publisher
IEEE Signal Processing Society
Place
Okinawa
DOI
10.1109/ASRU.2017.8268910
UT WoS
000426066100002
EID Scopus
BibTeX
@inproceedings{BUT144503,
  author="Kateřina {Žmolíková} and Marc {Delcroix} and Keisuke {Kinoshita} and Takuya {Higuchi} and Atsunori {Ogawa} and Tomohiro {Nakatani}",
  title="Learning Speaker Representation for Neural Network Based Multichannel Speaker Extraction",
  booktitle="Proceedings of ASRU 2017",
  year="2017",
  pages="8--15",
  publisher="IEEE Signal Processing Society",
  address="Okinawa",
  doi="10.1109/ASRU.2017.8268910",
  isbn="978-1-5090-4788-8",
  url="http://www.fit.vutbr.cz/research/groups/speech/publi/2017/zmolikova_asru2017.pdf"
}