Publication Details
Speaker activity driven neural speech extraction
Speech extraction, Speaker activity, Speech enhancement, Meeting recognition, Neural network
Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated, such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity driven speech extraction neural network (ADEnet) and show that it can achieve performance levels competitive with enrollment-based approaches, without the need for pre-recordings. We further demonstrate the potential of the proposed approach for processing meeting-like recordings, where speaker activity obtained from a diarization system is used as a speaker clue for ADEnet. We show that this simple yet practical approach can successfully extract speakers after diarization, which leads to improved ASR performance when using a single microphone, especially in high overlapping conditions, with relative word error rate reduction of up to 25%.
@inproceedings{BUT171749,
author="DELCROIX, M. and ŽMOLÍKOVÁ, K. and OCHIAI, T. and KINOSHITA, K. and NAKATANI, T.",
title="Speaker activity driven neural speech extraction",
booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
year="2021",
pages="6099--6103",
publisher="IEEE Signal Processing Society",
address="Toronto",
doi="10.1109/ICASSP39728.2021.9414998",
isbn="978-1-7281-7605-5",
url="https://www.fit.vut.cz/research/publication/12479/"
}