Publication Details

ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications

ZULUAGA-GOMEZ, J.; VESELÝ, K.; SZŐKE, I.; BLATT, A.; MOTLÍČEK, P.; KOCOUR, M.; RIGAULT, M.; CHOUKRI, K.; PRASAD, A.; SARFJOO, S.; NIGMATULINA, I.; CEVENINI, C.; KOLČÁREK, P.; TART, A.; ČERNOCKÝ, J.; KLAKOW, D. ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications. Journal of Machine Learning Research, vol. 2, no. 1, p. 1-45. ISSN: 1533-7928.
Czech title
Korpus ATCO2: Rozsáhlý soubor dat pro výzkum automatického rozpoznávání řeči a porozumění přirozenému jazyku v komunikaci řízení letového provozu
Type
journal article
Language
English
Authors
ZULUAGA-GOMEZ, J.
Veselý Karel, Ing., Ph.D. (DCGM)
Szőke Igor, Ing., Ph.D. (DCGM)
BLATT, A.
Motlíček Petr, doc. Ing., Ph.D. (DCGM)
Kocour Martin, Ing. (DCGM)
RIGAULT, M.
CHOUKRI, K.
Prasad Amrutha (DCGM)
Sarfjoo Seyyed Saeed
NIGMATULINA, I.
CEVENINI, C.
KOLČÁREK, P.
TART, A.
Černocký Jan, prof. Dr. Ing. (DCGM)
KLAKOW, D.
URL
https://openreview.net/forum?id=3CiWvVQVfw
Keywords

Automatic Speech Recognition, Spoken Language Understanding, Natural Language Processing, Air Traffic Control Communications

Abstract

Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very-high-frequency radio channels. To incorporate these novel technologies into ATC, large-scale annotated datasets are required to develop the data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU).
However, ATC is considered a low-resource domain. In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research in the challenging ATC field, which has lagged behind due to a lack of annotated data. In addition, we open-source a GitHub repository that contains data preparation and training scripts useful for replicating our ASR and NLU baselines.
The ATCO2 corpus covers 1) audio and radar data collection and pre-processing, 2) pseudo-transcriptions of speech audio, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets: (i) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold transcriptions for named-entity recognition (callsign, command, value) and speaker role detection. (ii) The ATCO2-test-set-1h corpus is a one-hour open-sourced subset of the 4h test set (free to download at https://www.atco2.org/data). (iii) The ATCO2-PL-set corpus consists of 5,281 hours of pseudo-transcribed ATC speech enriched with contextual information (a list of relevant n-gram sequences per utterance), speaker turn information, a signal-to-noise ratio estimate, and an English language detection score per sample. The whole ATCO2 corpus is publicly distributed through the ELDA catalog (https://catalog.elra.info/en-us/repository/browse/ELRA-S0484/). We expect the corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.

Published
Pages
1–45
Journal
Journal of Machine Learning Research, vol. 2, no. 1, ISSN 1533-7928
BibTeX
@article{BUT194022,
  author="ZULUAGA-GOMEZ, J. and VESELÝ, K. and SZŐKE, I. and BLATT, A. and MOTLÍČEK, P. and KOCOUR, M. and RIGAULT, M. and CHOUKRI, K. and PRASAD, A. and SARFJOO, S. and NIGMATULINA, I. and CEVENINI, C. and KOLČÁREK, P. and TART, A. and ČERNOCKÝ, J. and KLAKOW, D.",
  title="ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications",
  journal="Journal of Machine Learning Research",
  volume="2",
  number="1",
  pages="1--45",
  issn="1533-7928",
  url="https://openreview.net/forum?id=3CiWvVQVfw"
}