Publication Details
Speaker activity driven neural speech extraction
Keywords: Speech extraction, Speaker activity, Speech enhancement, Meeting recognition, Neural network
Target speech extraction, which extracts the speech of a target speaker in
a mixture given auxiliary speaker clues, has recently received increased
interest. Various clues have been investigated, such as pre-recorded enrollment
utterances, direction information, or video of the target speaker. In this paper,
we explore the use of speaker activity information as an auxiliary clue for
single-channel neural network-based speech extraction. We propose a speaker
activity driven speech extraction neural network (ADEnet) and show that it can
achieve performance levels competitive with enrollment-based approaches, without
the need for pre-recordings. We further demonstrate the potential of the proposed
approach for processing meeting-like recordings, where speaker activity obtained
from a diarization system is used as a speaker clue for ADEnet. We show that this
simple yet practical approach can successfully extract speakers after
diarization, which leads to improved ASR performance when using a single
microphone, especially in conditions with high speaker overlap, with a relative
word error rate reduction of up to 25%.
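The core idea, conditioning an extraction network on when the target speaker is active, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example, not the ADEnet architecture from the paper: it embeds a per-frame 0/1 activity sequence, fuses it with mixture spectrogram features, and estimates a time-frequency mask. All module names, shapes, and the mask-based design are assumptions made for illustration.

# Hypothetical sketch of activity-driven conditioning; not the authors'
# implementation. A per-frame activity clue (1 where the target speaker
# talks, 0 elsewhere) is projected and added to the mixture features
# before a BLSTM estimates a time-frequency mask for the target speech.
import torch
import torch.nn as nn

class ActivityDrivenExtractor(nn.Module):
    """Toy extraction network conditioned on target-speaker activity.

    Assumed shapes:
      mixture:  (batch, frames, n_bins)  magnitude spectrogram of the mixture
      activity: (batch, frames)          1.0 where the target speaker is active
    """
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.act_proj = nn.Linear(1, hidden)       # embed the 0/1 activity clue
        self.mix_proj = nn.Linear(n_bins, hidden)  # embed mixture features
        self.blstm = nn.LSTM(hidden, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_bins),
                                       nn.Sigmoid())

    def forward(self, mixture, activity):
        # Fuse the auxiliary activity clue with the mixture representation.
        x = self.mix_proj(mixture) + self.act_proj(activity.unsqueeze(-1))
        x, _ = self.blstm(x)
        mask = self.mask_head(x)   # time-frequency mask in [0, 1]
        return mask * mixture      # masked estimate of the target speech

if __name__ == "__main__":
    # The activity sequence could come from oracle labels during training
    # or from a diarization system at test time, as in the paper's setup.
    mix = torch.rand(1, 100, 257)
    act = (torch.rand(1, 100) > 0.5).float()
    est = ActivityDrivenExtractor()(mix, act)
    print(est.shape)  # torch.Size([1, 100, 257])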
@inproceedings{BUT171749,
author="DELCROIX, M. and ŽMOLÍKOVÁ, K. and OCHIAI, T. and KINOSHITA, K. and NAKATANI, T.",
title="Speaker activity driven neural speech extraction",
booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
year="2021",
pages="6099--6103",
publisher="IEEE Signal Processing Society",
address="Toronto",
doi="10.1109/ICASSP39728.2021.9414998",
isbn="978-1-7281-7605-5",
url="https://www.fit.vut.cz/research/publication/12479/"
}