Publication Details
BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge
Klement Dominik, Bc. (DCGM)
Han Jiangyu (DCGM)
Sedláček Šimon, Ing. (DCGM)
Yusuf Bolaji (DCGM)
Maciejewski Matthew
Wiesner Matthew, PhD.
Burget Lukáš, doc. Ing., Ph.D. (DCGM)
multi-talker speech recognition, CHiME-8, NOTSOFAR-1, target-speaker
This paper presents our method for tackling the CHiME-8 challenge's
NOTSOFAR-1 task, which requires participants to perform multi-speaker
automatic speech recognition (ASR) using audio from distant microphone
arrays. We modify the Pyannote3 diarization pipeline, incorporating
pre-trained WavLM as local EEND to adapt effectively to new domains, and
we introduce two diarization-aware approaches to ASR by conditioning
Whisper on diarization outputs for target-speaker ASR. The first method,
which we refer to as Query-Key Biasing, modifies Whisper's attention
mechanism and positional embeddings with a learnable attention mask to
exclude non-target speaker segments in the audio. The second method,
called Frame-Level Diarization-Dependent Transformations, applies affine,
diarization-dependent transformations with trainable parameters to the
inputs of one or more transformer blocks. We also extend both the ASR and
diarization systems to a multichannel setup by incorporating
cross-channel communication into our models. Finally, we report the
performance of these approaches on the NOTSOFAR-1 dataset.
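The two conditioning ideas named in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the four-way class order in CLASSES, the use of element-wise (diagonal) scales instead of full affine matrices, the fixed bias constant standing in for the learnable attention mask, and the function names fddt and biased_attention_weights.

```python
import math

# Hypothetical per-frame diarization classes (an assumption, not from the abstract).
CLASSES = ("silence", "target", "non_target", "overlap")

def fddt_frame(frame, probs, scales, biases):
    """Mix per-class affine transforms of one frame by its class probabilities."""
    out = [0.0] * len(frame)
    for p, scale, bias in zip(probs, scales, biases):
        for i, value in enumerate(frame):
            # Element-wise affine map (scale * x + bias), weighted by p.
            out[i] += p * (scale[i] * value + bias[i])
    return out

def fddt(frames, diar_probs, scales, biases):
    """Apply the frame-level transform to every frame of a sequence."""
    return [fddt_frame(f, p, scales, biases) for f, p in zip(frames, diar_probs)]

def biased_attention_weights(scores, is_target, bias=-1e9):
    """Softmax over one query's scores after biasing non-target key frames.

    A fixed constant is used here; in the abstract the mask is learnable.
    """
    biased = [s if keep else s + bias for s, keep in zip(scores, is_target)]
    peak = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Toy parameters: keep target and overlap frames, zero out silence and non-target.
SCALES = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
BIASES = [[0.0, 0.0] for _ in CLASSES]

frames = [[1.0, 2.0], [3.0, 4.0]]
diar_probs = [[0.0, 1.0, 0.0, 0.0],   # frame 0: target speaker
              [0.0, 0.0, 1.0, 0.0]]   # frame 1: non-target speaker
transformed = fddt(frames, diar_probs, SCALES, BIASES)
weights = biased_attention_weights([0.0, 0.0, 0.0], [True, False, True])
```

With these toy parameters, the target frame passes through unchanged while the non-target frame is zeroed, and the biased softmax places no attention weight on the non-target key. In a trained system the scales, biases, and mask values would be learned rather than set by hand.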
@inproceedings{BUT194002,
author="Alexander {Polok} and Dominik {Klement} and Jiangyu {Han} and Šimon {Sedláček} and Bolaji {Yusuf} and Matthew {Maciejewski} and Matthew {Wiesner} and Lukáš {Burget}",
title="BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge",
booktitle="Proceedings of CHiME 2024 Workshop",
year="2024",
pages="18--22",
publisher="International Speech Communication Association",
address="Kos Island",
doi="10.21437/CHiME.2024-4",
url="https://www.isca-archive.org/chime_2024/polok24_chime.pdf"
}