Publication details

CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset

YAN, B.; HAMED, I.; SHIMIZU, S.; LODAGALA, V.; CHEN, W.; IAKOVENKO, O.; TALAFHA, B.; HUSSEIN, A.; POLOK, A.; CHANG, K.; KLEMENT, D.; ALTHUBAITI, S.; PENG, P.; WIESNER, M.; SOLORIO, T.; ALI, A.; KHUDANPUR, S.; WATANABE, S. CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Interspeech. Rotterdam, Netherlands: ISCA, 2025. p. 743-747.

Type

conference proceedings paper

Language

English

Authors

Yan Brian
Hamed Injy
Shimizu Shuichiro
Lodagala Vasista Sai
Chen William
Iakovenko Olga
Talafha Bashar
Hussein Amir
Polok Alexander, Ing., UPGM (FIT)
Chang Kalvin
Klement Dominik, Ing., UPGM (FIT)
Althubaiti Sara
Peng Puyuan
Wiesner Matthew
Solorio Thamar
Ali Ahmed
Khudanpur Sanjeev
Watanabe Shinji

Abstract

We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech, 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research.

Keywords

code-switching, code-switched speech recognition, multilingual speech recognition and translation

URL

https://www.isca-archive.org/interspeech_2025/yan25c_interspeech.pdf

Year

2025

Pages

743–747

Journal

Interspeech, ISSN

Proceedings

Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Conference

Interspeech Conference

Publisher

ISCA

Location

Rotterdam, Netherlands

DOI

10.21437/interspeech.2025-2247

EID Scopus

2-s2.0-105020092489

BibTeX

@inproceedings{BUT199996,
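  author="Brian {Yan} and Injy {Hamed} and Shuichiro {Shimizu} and Vasista Sai {Lodagala} and William {Chen} and Olga {Iakovenko} and Bashar {Talafha} and Amir {Hussein} and Alexander {Polok} and Kalvin {Chang} and Dominik {Klement} and Sara {Althubaiti} and Puyuan {Peng} and Matthew {Wiesner} and Thamar {Solorio} and Ahmed {Ali} and Sanjeev {Khudanpur} and Shinji {Watanabe}",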
  title="CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset",
  booktitle="Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
  year="2025",
  journal="Interspeech",
  pages="743--747",
  publisher="ISCA",
  address="Rotterdam, Netherlands",
  doi="10.21437/interspeech.2025-2247",
  url="https://www.isca-archive.org/interspeech_2025/yan25c_interspeech.pdf"
}

Projects

Contemporary Methods for Processing, Analysis and Visualization of Multimedia and 3D Data, BUT, BUT Internal Projects, FIT-S-23-8278, start: 2023-03-01, end: 2026-02-28, in progress

Research groups

Speech Data Mining Research Group BUT Speech@FIT (VZ SPEECH)

Departments

Department of Computer Graphics and Multimedia (UPGM)
Speech Data Mining Research Group BUT Speech@FIT (VZ SPEECH)