Publication Details

Target Speech Extraction with Pre-Trained Self-Supervised Learning Models

PENG, J.; DELCROIX, M.; OCHIAI, T.; PLCHOT, O.; ARAKI, S.; ČERNOCKÝ, J. Target Speech Extraction with Pre-Trained Self-Supervised Learning Models. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Seoul: IEEE Signal Processing Society, 2024. p. 10421-10425. ISBN: 979-8-3503-4485-1.

Czech title

Extrakce řeči cílového mluvčího pomocí předtrénovaných modelů získaných samoučením

Type

conference paper

Language

English

Authors

Peng Junyi (DCGM)
Delcroix Marc (FIT)
Ochiai T.
Plchot Oldřich, Ing., Ph.D. (DCGM)
Araki S.
Černocký Jan, prof. Dr. Ing. (DCGM)

Keywords

Target speech extraction, pre-trained models, self-supervised learning, feature aggregation

Abstract

Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker in a mixture guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework, i.e., to process the input mixture and to derive speaker embeddings from the enrollment. In this paper, we focus on how to effectively use SSL models for TSE. We first introduce a novel TSE downstream task following the SUPERB principles. This simple experiment shows the potential of SSL models for TSE, but extraction performance remains far behind the state-of-the-art. We then extend a powerful TSE architecture by incorporating two SSL-based modules: an Adaptive Input Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes intermediate representations from the CNN encoder by adjusting the time resolution of the CNN encoder and transformer blocks through progressive upsampling, capturing both fine-grained and hierarchical features. Our method outperforms current TSE systems, achieving an SI-SDR improvement of 14.0 dB on LibriMix. Moreover, we can further improve performance by 0.7 dB by fine-tuning the whole model, including the SSL model parameters.

Published

2024

Pages

10421–10425

Proceedings

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Conference

2024 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Seoul, KR

ISBN

979-8-3503-4485-1

Publisher

IEEE Signal Processing Society

Place

Seoul

DOI

10.1109/ICASSP48485.2024.10448315

BibTeX

@inproceedings{BUT189779,
  author="PENG, J. and DELCROIX, M. and OCHIAI, T. and PLCHOT, O. and ARAKI, S. and ČERNOCKÝ, J.",
  title="Target Speech Extraction with Pre-Trained Self-Supervised Learning Models",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2024",
  pages="10421--10425",
  publisher="IEEE Signal Processing Society",
  address="Seoul",
  doi="10.1109/ICASSP48485.2024.10448315",
  isbn="979-8-3503-4485-1",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10448315"
}

Files

pdf peng_icassp2024_Target_Speech_Extraction_with_Pre-Trained.pdf 2 MB