Project Details
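Language Memory of the Regions of the Czech Republic. Machine Learning Methods for the Preservation, Documentation and Presentation of Czech Language Dialects (Jazyková paměť regionů České republiky. Metody strojového učení pro uchování, dokumentaci a prezentaci nářečí českého jazyka)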
Project Period: 1. 3. 2023 – 31. 12. 2027
Project Type: grant
Code: DH23P03OVV010
Agency: Ministry of Culture of the Czech Republic (Ministerstvo kultury ČR)
Keywords: Czech language, dialects, dialectology, artificial intelligence, speech and language data, automatic dialect identification, automatic speech recognition, interactive maps, language memory of regions
Language is a fundamental connecting element of every nation, and its territorial dialects are an important part of regional identity. In the modern world, dialects are gradually disappearing: their variability is diminishing and they are assimilating into the language of the mainstream media and the Internet. Because acquiring and annotating training language data is costly, dialects have virtually no support in modern artificial intelligence (AI) and machine learning (ML) technologies, represented mainly by automatic speech recognition (ASR). In Czechia, the dialectology department of the Czech Language Institute of the Czech Academy of Sciences (ÚJČ AV ČR) systematically researches colloquial phenomena of the Czech national language and is dedicated to the study of dialects. However, ÚJČ lacks modern technology for the automatic processing, storage, documentation and presentation of dialects. In addition, the outputs of the dialectology department are available primarily to the scientific community; there is a lack of modern interactive web applications or services that the general public could use. The project, proposed by ASR specialists (BUT), dialectologists (ÚJČ) and interactive map imaging experts (UPOL), aims to adapt existing technologies and develop new procedures for the automatic processing, storage, documentation and presentation of Czech dialects. A detailed methodology will be developed for transferring structured knowledge from dialectology to machine learning, where work with data is dominant. The existing Archive of Sound Recordings of Dialect Speech (built at ÚJČ from 1952 to the present and containing over 750 hours of recordings) will be supplemented with metadata and prepared for machine learning. As a prerequisite, we will develop software for detecting dialects from audio recordings.
Kocour Martin, Ing. (DCGM)
Kotolan Martin (CVT)
Plchot Oldřich, Ing., Ph.D. (DCGM)
Yusuf Bolaji (DCGM)
Žižka Josef, Ing. (DCGM)
2024
- BENEŠ, K.; KOCOUR, M.; BURGET, L. Hystoc: Obtaining Word Confidences for Fusion of End-To-End ASR Systems. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Seoul: IEEE Signal Processing Society, 2024. p. 11276-11280. ISBN: 979-8-3503-4485-1.
2023
- MATĚJKA, P.; SILNOVA, A.; SLAVÍČEK, J.; MOŠNER, L.; PLCHOT, O.; KLČO, M.; PENG, J.; STAFYLAKIS, T.; BURGET, L. Description and Analysis of ABC Submission to NIST LRE 2022. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Proceedings of Interspeech. Dublin: International Speech Communication Association, 2023. p. 511-515. ISSN: 1990-9772.