Publication Details
Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations
STAFYLAKIS, T.
Mošner Ladislav, Ing. (DCGM)
KAKOUROS, S.
Plchot Oldřich, Ing., Ph.D. (DCGM)
Burget Lukáš, doc. Ing., Ph.D. (DCGM)
Černocký Jan, prof. Dr. Ing. (DCGM)
Speaker identification, speaker verification, emotion recognition,
self-supervised models
Self-supervised learning of speech representations from large amounts of
unlabeled data has enabled state-of-the-art results in several speech processing
tasks. Aggregating these speech representations across time is typically
approached by using descriptive statistics, and in particular, using the first-
and second-order statistics of representation coefficients. In this paper, we
examine an alternative way of extracting speaker and emotion information from
models trained with self-supervision, based on the correlations between the
coefficients of the representations, which we call correlation pooling. We show improvements
over mean pooling and further gains when the pooling methods are combined via
fusion. The code is available at github.com/Lamomal/s3prl_correlation.
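
The contrast between the two pooling strategies named in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the code from the repository above. It assumes the frame-level representations arrive as a list of T frames with C coefficients each, and it takes "correlation pooling" to mean the Pearson correlation of every pair of channels over time, keeping each pair once. The function names mean_pooling and correlation_pooling and the eps parameter are hypothetical.

```python
import math


def mean_pooling(frames):
    """Average each channel over time: T x C frames -> C values."""
    num_frames = len(frames)
    return [sum(channel) / num_frames for channel in zip(*frames)]


def correlation_pooling(frames, eps=1e-8):
    """Pearson correlation of every channel pair (i < j).

    T x C frames -> C * (C - 1) / 2 values. The eps term (an assumption
    of this sketch) avoids division by zero for constant channels.
    """
    means = mean_pooling(frames)
    # Center each channel over time; the result is stored channel-major (C x T).
    centered = [
        [value - mean for value in channel]
        for channel, mean in zip(zip(*frames), means)
    ]
    # Euclidean norm of each centered channel.
    norms = [math.sqrt(sum(v * v for v in channel)) for channel in centered]

    pooled = []
    for i in range(len(centered)):
        for j in range(i + 1, len(centered)):
            dot = sum(a * b for a, b in zip(centered[i], centered[j]))
            pooled.append(dot / (norms[i] * norms[j] + eps))
    return pooled


# Toy input: 3 frames, 3 channels (made-up numbers, not real model outputs).
toy_frames = [[1.0, 2.0, 1.0], [2.0, 4.0, 0.0], [3.0, 6.0, 2.0]]
print(mean_pooling(toy_frames))
print(correlation_pooling(toy_frames))
```

Mean pooling yields one value per channel, whereas this form of correlation pooling yields one value per channel pair, so its output grows quadratically with the number of channels. How the correlations are normalised, reduced, and fused with other pooling outputs in the paper is not stated in the abstract and is left open here.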
@inproceedings{BUT185160,
author="STAFYLAKIS, T. and MOŠNER, L. and KAKOUROS, S. and PLCHOT, O. and BURGET, L. and ČERNOCKÝ, J.",
title="Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations",
booktitle="2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings",
year="2023",
pages="1136--1143",
publisher="IEEE Signal Processing Society",
address="Doha",
doi="10.1109/SLT54892.2023.10023345",
isbn="978-1-6654-7189-3",
url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10023345"
}