Publication Details

TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models

PENG Junyi, ASHIHARA Takanori, DELCROIX Marc, OCHIAI Tsubasa, PLCHOT Oldřich, ARAKI Shoko and ČERNOCKÝ Jan. TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Hyderabad: IEEE Signal Processing Society, 2025, pp. 1-5. ISBN 979-8-3503-6874-1. Available from: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10887574

Czech title

TS-SUPERB: Sada dat a experimentů ověření zpracování řeči cílového mluvčího pomocí modelů řeči získaných samoučením

Type

conference paper

Language

English

Authors

Peng Junyi, MSc. Eng. (DCGM FIT BUT)
Ashihara Takanori (NTT)
Delcroix Marc (NTT)
Ochiai Tsubasa (NTT)
Plchot Oldřich, Ing., Ph.D. (DCGM FIT BUT)
Araki Shoko (NTT)
Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT)

URL

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10887574

Keywords

Self-supervised learning, target-speaker speech processing, speech recognition, speech enhancement, voice activity detection

Abstract

Self-supervised learning (SSL) models have significantly advanced speech processing tasks, and several benchmarks have been proposed to validate their effectiveness. However, previous benchmarks have primarily focused on single-speaker scenarios, with less exploration of target-speaker tasks in noisy, multi-talker conditions, a more challenging yet practical case. In this paper, we introduce the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB), which includes four widely recognized target-speaker processing tasks that require identifying the target speaker and extracting information from the speech mixture. In our benchmark, the speaker embedding extracted from enrollment speech is used as a clue to condition downstream models. The benchmark results reveal the importance of evaluating SSL models in target-speaker scenarios, demonstrating that performance cannot be easily inferred from related single-speaker tasks. Moreover, by using a unified SSL-based target speech encoder, consisting of a speaker encoder and an extractor module, we also investigate joint optimization across TS tasks to leverage mutual information and demonstrate its effectiveness.

Published

2025

Pages

1-5

Proceedings

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Conference

ICASSP 2025, International Conference on Acoustics, Speech, and Signal Processing, Hyderabad, IN

ISBN

979-8-3503-6874-1

Publisher

IEEE Signal Processing Society

Place

Hyderabad, IN

DOI

10.1109/ICASSP49660.2025.10887574

EID Scopus

2-s2.0-105003873681

BibTeX

@INPROCEEDINGS{FITPUB13522,
   author = "Junyi Peng and Takanori Ashihara and Marc Delcroix and Tsubasa Ochiai and Old\v{r}ich Plchot and Shoko Araki and Jan \v{C}ernock\'{y}",
   title = "TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models",
   pages = "1--5",
   booktitle = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
   year = 2025,
   location = "Hyderabad, IN",
   publisher = "IEEE Signal Processing Society",
   ISBN = "979-8-3503-6874-1",
   doi = "10.1109/ICASSP49660.2025.10887574",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/13522"
}