Publication Details
Hystoc: Obtaining Word Confidences for Fusion of End-To-End ASR Systems
confidence measures, system fusion, end-to-end systems, automatic speech
recognition
End-to-end (e2e) systems have recently gained wide popularity in automatic speech
recognition. However, these systems generally do not provide well-calibrated
word-level confidences. In this paper, we propose Hystoc, a simple method for
obtaining word-level confidences from hypothesis-level scores. Hystoc is an
iterative alignment procedure which turns hypotheses from an n-best output of the
ASR system into a confusion network. Word-level confidences are then
obtained as posterior probabilities in the individual bins of the confusion
network. We show that Hystoc provides confidences that correlate well with the
accuracy of the ASR hypothesis. Furthermore, we show that utilizing Hystoc in
fusion of multiple e2e ASR systems increases the gains from the fusion by up to
1% WER absolute on the Spanish RTVE2020 dataset. Finally, we experiment with using
Hystoc for direct fusion of n-best outputs from multiple systems, but we only
achieve minor gains when fusing very similar systems.
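The procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the softmax normalization of hypothesis scores, the unit edit costs, the `<eps>` marker for an empty slot, and all function names are assumptions made for illustration. Each hypothesis of an n-best list is aligned in turn to the growing confusion network, and its posterior mass is added to the bins it passes through.

```python
import math

EPS = "<eps>"  # hypothetical marker meaning "no word in this bin"


def hypothesis_posteriors(log_scores):
    """Assumed normalization: softmax over hypothesis-level log scores."""
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def align(bins, words):
    """Edit-distance alignment of a word sequence to a list of bins.

    Returns (bin_index, word) pairs in order. bin_index is None for a word
    that needs a new bin; word is EPS when an existing bin is skipped.
    """
    n, m = len(bins), len(words)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + (0 if EPS in bins[i - 1] else 1)
        back[i][0] = "skip"
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + 1
        back[0][j] = "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            b, w = bins[i - 1], words[j - 1]
            options = [
                (cost[i - 1][j - 1] + (0 if w in b else 1), "match"),
                (cost[i - 1][j] + (0 if EPS in b else 1), "skip"),
                (cost[i][j - 1] + 1, "ins"),
            ]
            cost[i][j], back[i][j] = min(options, key=lambda o: o[0])
    pairs, i, j = [], n, m
    while i > 0 or j > 0:  # trace the cheapest path back to the origin
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, words[j - 1]))
            i, j = i - 1, j - 1
        elif move == "skip":
            pairs.append((i - 1, EPS))
            i -= 1
        else:
            pairs.append((None, words[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def confusion_network(nbest):
    """nbest: list of (words, log_score). Returns bins as dicts word -> posterior."""
    posts = hypothesis_posteriors([score for _, score in nbest])
    bins, seen_mass = [], 0.0
    for (words, _), p in zip(nbest, posts):
        new_bins = []
        for idx, word in align(bins, words):
            if idx is None:
                # New bin: every earlier hypothesis had no word at this position.
                b = {EPS: seen_mass} if seen_mass > 0 else {}
            else:
                b = bins[idx]
            b[word] = b.get(word, 0.0) + p
            new_bins.append(b)
        bins, seen_mass = new_bins, seen_mass + p
    return bins


def best_words_with_confidence(bins):
    """Pick the most probable entry per bin; drop bins where EPS wins."""
    picks = (max(b.items(), key=lambda kv: kv[1]) for b in bins)
    return [(w, c) for w, c in picks if w != EPS]


# Toy 3-best list with made-up log scores.
nbest = [
    ("the cat sat".split(), -1.0),
    ("the cat sat down".split(), -2.0),
    ("a cat sat".split(), -3.0),
]
bins = confusion_network(nbest)
best = best_words_with_confidence(bins)
```

In this toy example, every bin sums to one, so the value attached to each output word can be read as a posterior probability within its bin. A word that appears in all hypotheses receives confidence 1, while a word contested by a competing hypothesis receives less.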
@inproceedings{BUT189696,
author="Karel {Beneš} and Martin {Kocour} and Lukáš {Burget}",
title="Hystoc: Obtaining Word Confidences for Fusion of End-To-End ASR Systems",
booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
year="2024",
pages="11276--11280",
publisher="IEEE Signal Processing Society",
address="Seoul",
doi="10.1109/ICASSP48485.2024.10446739",
isbn="979-8-3503-4485-1",
url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446739"
}