Publication Details
BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge
Polok Alexander
Klement Dominik, Bc. (DCGM)
Han Jiangyu (DCGM)
Sedláček Šimon, Ing. (DCGM)
Yusuf Bolaji (DCGM)
Maciejewski Matthew
Wiesner Matthew, PhD.
Burget Lukáš, doc. Ing., Ph.D. (DCGM)
multi-talker speech recognition, CHiME-8, NOTSOFAR-1, target-speaker
This paper presents our method for tackling the CHiME-8 challenge's NOTSOFAR-1
task, which requires participants to perform multi-speaker automatic speech
recognition (ASR) using audio from distant microphone arrays. We modify the
Pyannote3 diarization pipeline, incorporating pre-trained WavLM as local EEND to
adapt effectively to new domains, and we introduce two diarization-aware
approaches to ASR by conditioning Whisper on diarization outputs for
target-speaker ASR. The first method, which we refer to as Query-Key Biasing,
modifies Whisper's attention mechanism and positional embeddings with
a learnable attention mask to exclude non-target speaker segments in the audio.
The second method, called Frame-Level Diarization-Dependent Transformations,
applies affine, diarization-dependent transformations with trainable parameters
to the inputs of one or more transformer blocks. We also extend both the ASR and
diarization systems to a multichannel setup by incorporating cross-channel
communication into our models. Finally, we report the performance of these
approaches on the NOTSOFAR-1 dataset.
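The idea behind Frame-Level Diarization-Dependent Transformations can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each frame comes with a probability vector over diarization classes and that each class has its own affine map (W, b), with the per-class outputs mixed by those probabilities. The class set, the mixing rule, and the names `affine`, `fddt_frame`, and `fddt` are all hypothetical, since the abstract states only that the transformations are affine, diarization-dependent, and trainable.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def affine(W: Matrix, b: Vector, z: Sequence[float]) -> Vector:
    """Return W @ z + b for a single frame vector z."""
    return [sum(w * x for w, x in zip(row, z)) + b_i for row, b_i in zip(W, b)]


def fddt_frame(z: Sequence[float], probs: Sequence[float],
               weights: Sequence[Matrix], biases: Sequence[Vector]) -> Vector:
    """Transform one frame with a probability-weighted mix of per-class affine maps.

    Assumption (not from the abstract): probs[k] is the diarization
    probability of class k for this frame, and class k owns (weights[k], biases[k]).
    """
    out = [0.0] * len(biases[0])
    for p, W, b in zip(probs, weights, biases):
        transformed = affine(W, b, z)
        out = [o + p * t for o, t in zip(out, transformed)]
    return out


def fddt(frames: Sequence[Sequence[float]], diar_probs: Sequence[Sequence[float]],
         weights: Sequence[Matrix], biases: Sequence[Vector]) -> List[Vector]:
    """Apply the frame-level transformation to every frame of a sequence."""
    if len(frames) != len(diar_probs):
        raise ValueError("need one diarization vector per frame")
    return [fddt_frame(z, p, weights, biases) for z, p in zip(frames, diar_probs)]


# Toy setup with two hypothetical classes: "target" keeps the frame (identity),
# "non-target" suppresses it (zero matrix). In a real model these parameters
# would be trained rather than fixed by hand.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
weights = [identity, zeros]
biases = [[0.0, 0.0], [0.0, 0.0]]

frames = [[1.0, 2.0], [3.0, 4.0], [2.0, 4.0]]
diar_probs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

result = fddt(frames, diar_probs, weights, biases)
```

In this toy setup the first frame passes through unchanged, the second is zeroed out, and the third is halved. This shows how frame-level diarization output can steer what the following transformer block sees.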
@inproceedings{BUT194002,
author="Alexander {Polok} and Dominik {Klement} and Jiangyu {Han} and Šimon {Sedláček} and Bolaji {Yusuf} and Matthew {Maciejewski} and Matthew {Wiesner} and Lukáš {Burget}",
title="BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge",
booktitle="Proceedings of CHiME 2024 Workshop",
year="2024",
pages="18--22",
publisher="International Speech Communication Association",
address="Kos Island",
doi="10.21437/CHiME.2024-4",
url="https://www.isca-archive.org/chime_2024/polok24_chime.pdf"
}