Publication Details

Comparison of wav2vec 2.0 models on three speech processing tasks

KUNEŠOVÁ, M.; ZAJÍC, Z.; ŠMÍDL, L.; KARAFIÁT, M. Comparison of wav2vec 2.0 models on three speech processing tasks. International Journal of Speech Technology, 2024, vol. 27, no. 4, p. 847-859. ISSN: 1572-8110.

Czech title

Srovnání modelů wav2vec 2.0 na třech úlohách zpracování řeči

Type

journal article

Language

English

Authors

Zajíc Zbyněk, Ing., Ph.D.
Šmídl Luboš, Ing., Ph.D.
Karafiát Martin, Ing., Ph.D. (DCGM)
Kunešová Marie, Ing., Ph.D.

URL

https://link.springer.com/article/10.1007/s10772-024-10140-6

Keywords

Speaker change detection, Voice activity detection, Overlapped speech detection, Wav2vec 2.0

Abstract

The current state of the art for various speech processing problems is
a sequence-to-sequence model based on a self-attention mechanism known as
the transformer. The widely used wav2vec 2.0 is a self-supervised transformer model
pre-trained on large amounts of unlabeled speech and then fine-tuned for
a specific task. The data used for training and fine-tuning, along with the size
of the transformer model, play a crucial role in both of these training steps.
The most commonly used wav2vec 2.0 models are trained on relatively "clean" data
from sources such as the LibriSpeech dataset, but we can expect there to be a
benefit in using more realistic data gathered from a variety of acoustic
conditions. However, it is not entirely clear how big the difference would be.
Investigating this is the main goal of our article. To this end, we utilize
wav2vec 2.0 models in three fundamental speech processing tasks: speaker change
detection, voice activity detection, and overlapped speech detection, and test
them on four real conversation datasets. We compare four wav2vec 2.0 models with
different sizes and different data used for pre-training, and we fine-tune them
either on in-domain data from the same dataset or on artificial training data
created from the LibriSpeech corpus. Our results suggest that richer data that
are more similar to the task domain bring better performance than a larger
model.

Published

2024

Pages

847–859

Journal

International Journal of Speech Technology, vol. 27, no. 4, ISSN 1572-8110

DOI

10.1007/s10772-024-10140-6

EID Scopus

2-s2.0-85206375991

BibTeX

@article{BUT193586,
  author="Zbyněk {Zajíc} and Luboš {Šmídl} and Martin {Karafiát} and Marie {Kunešová}",
  title="Comparison of wav2vec 2.0 models on three speech processing tasks",
  journal="International Journal of Speech Technology",
  year="2024",
  volume="27",
  number="4",
  pages="847--859",
  doi="10.1007/s10772-024-10140-6",
  issn="1572-8110",
  url="https://link.springer.com/article/10.1007/s10772-024-10140-6"
}

Files

pdf kunesova_springer_2024_s10772-024-10140-6.pdf 1 MB