Publication Details

Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint

PRASAD, A.; CAROFILIS, A.; VANDERREYDT, G.; KHALIL, D.; MADIKERI, S.; MOTLÍČEK, P.; SCHUEPBACH, C. Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Seoul: IEEE Signal Processing Society, 2024. p. 11921-11925. ISBN: 979-8-3503-4485-1.
Czech title (English translation)
Fine-tuning of self-supervised models for language identification using an orthonormal constraint
Type
conference paper
Language
English
Authors
Prasad Amrutha (DCGM)
CAROFILIS, A.
VANDERREYDT, G.
KHALIL, D.
Madikeri Srikanth
Motlíček Petr, doc. Ing., Ph.D. (DCGM)
SCHUEPBACH, C.
URL
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446751
Keywords

Language Identification, Transformers, Wav2Vec2, fine-tuning, low-resource, out-of-domain

Abstract

Self-supervised models trained on data with high linguistic diversity, such as the
XLS-R model, can be effectively fine-tuned for the language recognition task.
Typically, a back-end classifier followed by a statistics pooling layer is added
during training. Commonly used back-end classifiers require a large number of
parameters to be trained, which is not ideal in limited-data conditions. In this
work, we explore back-ends with fewer parameters using the factorized Time Delay
Neural Network (TDNN-F). The TDNN-F architecture is also integrated into Emphasized
Channel Attention, Propagation and Aggregation in TDNN (ECAPA-TDNN) models, termed
ECAPA-TDNN-F, reducing the number of parameters by 30 to 50% absolute, with
competitive accuracies and no change in minimum cost. The results show that
ECAPA-TDNN-F can be extended to tasks where ECAPA-TDNN is suitable. We also test
the effectiveness of a linear classifier and a variant, the orthonormal linear
classifier, previously used in x-vector type systems. The models are trained on
NIST LRE17 data and evaluated on the NIST LRE17, LRE22 and ATCO2 LID datasets.
Both linear classifiers outperform conventional back-ends, with accuracy
improvements of 0.9% to 9.1%.
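Two ideas in the abstract can be illustrated briefly: factorizing a layer through a bottleneck to reduce parameters (as in TDNN-F), and constraining a weight matrix so that its rows stay orthonormal. The Python sketch below is not the authors' code. It assumes the constraint is enforced with the iterative update M ← M − ½(MMᵀ − I)M, in the spirit of the semi-orthogonal update proposed for TDNN-F layers; the helper names (`semi_orthonormal_step`, `factorized_params`, etc.), the example matrix and the layer sizes are all hypothetical.

```python
# Hypothetical helpers illustrating two ideas from the abstract; not the authors' code.


def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]


def gram_minus_identity(m):
    """Q = M M^T - I for a matrix with no more rows than columns."""
    p = matmul(m, transpose(m))
    return [[p[i][j] - (1.0 if i == j else 0.0) for j in range(len(p))]
            for i in range(len(p))]


def semi_orthonormal_step(m):
    """One update M <- M - 0.5 * (M M^T - I) M.

    Assumption: this mirrors the semi-orthogonal update used for TDNN-F layers.
    It drives every singular value of M towards 1, provided all of them start
    in the interval (0, sqrt(3)).
    """
    qm = matmul(gram_minus_identity(m), m)
    return [[m[i][j] - 0.5 * qm[i][j] for j in range(len(m[0]))]
            for i in range(len(m))]


def orthonormality_error(m):
    """Frobenius norm of M M^T - I (0 means the rows are orthonormal)."""
    q = gram_minus_identity(m)
    return sum(v * v for row in q for v in row) ** 0.5


def factorized_params(d_in, d_out, bottleneck):
    """Weight counts for a full d_in x d_out layer vs. two factors through a bottleneck.

    Biases are ignored; the sizes passed in are illustrative only.
    """
    full = d_in * d_out
    factorized = d_in * bottleneck + bottleneck * d_out
    return full, factorized


# Usage: push a small 2x3 weight matrix towards semi-orthonormality.
m = [[0.9, 0.2, 0.1],
     [0.1, 1.1, -0.2]]
print("initial error:", round(orthonormality_error(m), 4))
for _ in range(8):
    m = semi_orthonormal_step(m)
print("error after 8 steps:", orthonormality_error(m))

# Weight count for a hypothetical 512 -> 512 layer with a 128-dim bottleneck.
full, fact = factorized_params(512, 512, 128)
print(full, fact, f"reduction: {1 - fact / full:.1%}")
```

The update acts on each singular value σ of M as σ ← σ(3 − σ²)/2, which has a fixed point at σ = 1 and converges quadratically near it; this is why a handful of steps is enough in the example.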

Published
2024
Pages
11921–11925
Proceedings
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference
2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), Seoul, KR
ISBN
979-8-3503-4485-1
Publisher
IEEE Signal Processing Society
Place
Seoul
DOI
10.1109/ICASSP48485.2024.10446751
BibTeX
@inproceedings{BUT193354,
  author="PRASAD, A. and CAROFILIS, A. and VANDERREYDT, G. and KHALIL, D. and MADIKERI, S. and MOTLÍČEK, P. and SCHUEPBACH, C.",
  title="Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2024",
  pages="11921--11925",
  publisher="IEEE Signal Processing Society",
  address="Seoul",
  doi="10.1109/ICASSP48485.2024.10446751",
  isbn="979-8-3503-4485-1",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446751"
}