Publication Details
Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text
BASKAR, M.
WATANABE, S.
ASTUDILLO, R.
HORI, T.
Burget Lukáš, doc. Ing., Ph.D. (DCGM)
Černocký Jan, prof. Dr. Ing. (DCGM)
Sequence-to-sequence, end-to-end, ASR, TTS, semi-supervised, unsupervised, cycle consistency
Sequence-to-sequence automatic speech recognition (ASR) models require large quantities of data to attain high performance. For this reason, there has been a recent surge in interest for unsupervised and semi-supervised training in such models. This work builds upon recent results showing notable improvements in semi-supervised training using cycle-consistency and related techniques. Such techniques derive training procedures and losses able to leverage unpaired speech and/or text data by combining ASR with Text-to-Speech (TTS) models. In particular, this work proposes a new semi-supervised loss combining an end-to-end differentiable ASR→TTS loss with a TTS→ASR loss. The method is able to leverage both unpaired speech and text data to outperform recently proposed related techniques in terms of %WER. We provide extensive results analyzing the impact of data quantity and speech and text modalities and show consistent gains across the WSJ and LibriSpeech corpora. Our code is provided in ESPnet to reproduce the experiments.
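The combined semi-supervised objective described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the `asr` and `tts` callables stand in for full sequence-to-sequence networks, the reconstruction losses are simplified placeholders, and the weighting parameter `alpha` is a hypothetical name introduced here for clarity.

```python
# Toy sketch of a combined cycle-consistency objective: an ASR->TTS loss on
# unpaired speech plus a TTS->ASR loss on unpaired text. All model stubs and
# loss choices are illustrative assumptions, not the authors' actual code.

def asr_tts_loss(speech, asr, tts):
    """Cycle speech -> hypothesis text -> reconstructed speech, then compare
    the reconstruction to the input (no transcript needed)."""
    hypothesis = asr(speech)
    reconstruction = tts(hypothesis)
    return sum((a - b) ** 2 for a, b in zip(speech, reconstruction)) / len(speech)

def tts_asr_loss(text, tts, asr):
    """Cycle text -> synthesized speech -> hypothesis text, then compare
    the hypothesis to the input (no paired audio needed)."""
    synthesized = tts(text)
    hypothesis = asr(synthesized)
    return float(hypothesis != text)  # toy 0/1 reconstruction error

def semi_supervised_loss(speech_batch, text_batch, asr, tts, alpha=0.5):
    """Weighted combination of the two cycle losses over unpaired batches."""
    l_speech = sum(asr_tts_loss(s, asr, tts) for s in speech_batch) / len(speech_batch)
    l_text = sum(tts_asr_loss(t, tts, asr) for t in text_batch) / len(text_batch)
    return alpha * l_speech + (1 - alpha) * l_text
```

With perfect (identity) stub models both cycles reconstruct their inputs exactly, so the combined loss is zero; imperfect models incur a penalty on each unpaired example, which is what lets the method exploit speech without transcripts and text without audio.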
@inproceedings{BUT159996,
author="BASKAR, M. and WATANABE, S. and ASTUDILLO, R. and HORI, T. and BURGET, L. and ČERNOCKÝ, J.",
title="Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text",
booktitle="Proceedings of Interspeech",
year="2019",
journal="Proceedings of Interspeech",
volume="2019",
number="9",
pages="3790--3794",
publisher="International Speech Communication Association",
address="Graz",
doi="10.21437/Interspeech.2019-3167",
issn="1990-9772",
url="https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3167.pdf"
}