Publication Details
Phoneme Recognition from a Long Temporal Context
Matějka Pavel, Ing., Ph.D. (DCGM)
Černocký Jan, prof. Dr. Ing. (DCGM)
phoneme recognition, feature extraction, speech recognition
We investigate techniques for acoustic modeling in automatic recognition of context-independent phoneme strings. The recognizer was evaluated on the TIMIT database.
The baseline phoneme recognizer is based on TempoRAl Patterns (TRAP).
It is an HMM/Neural Network (HMM/NN) hybrid.
Critical-band energies are obtained in the conventional way: the speech signal is divided into 25 ms frames with a 10 ms shift, and the Mel filter bank is emulated by triangular weighting of the FFT-derived short-term spectrum to obtain short-term critical-band logarithmic spectral densities.
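This front-end can be sketched in plain NumPy. The parameter values below (number of bands, FFT size) are illustrative assumptions for the sketch, not values taken from the paper:

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale conversion
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def critical_band_energies(signal, sr=16000, frame_ms=25, shift_ms=10,
                           n_fft=512, n_bands=15):
    """Log critical-band spectral densities via a triangular mel filter
    bank applied to the FFT-derived short-term spectrum.
    n_fft and n_bands are illustrative choices."""
    frame_len = int(sr * frame_ms / 1000)   # 25 ms -> 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 10 ms -> 160 samples
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)

    # Triangular mel filters spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_bands + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(n_bands):
        lo, mid, hi = bin_pts[b], bin_pts[b + 1], bin_pts[b + 2]
        for k in range(lo, mid):
            fbank[b, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[b, k] = (hi - k) / max(hi - mid, 1)

    window = np.hamming(frame_len)
    feats = np.zeros((n_frames, n_bands))
    for t in range(n_frames):
        frame = signal[t * shift: t * shift + frame_len] * window
        spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2  # short-term power spectrum
        feats[t] = np.log(fbank @ spec + 1e-10)        # log critical-band energies
    return feats
```

One second of 16 kHz audio yields 98 frames of 15 log band energies with these settings.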
A TRAP feature vector describes a segment of the temporal evolution of spectral density within a single critical band. The central point is the current frame, with an equal number of frames in the past and in the future;
the length of this temporal context can vary. This vector forms the input to a classifier.
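The extraction of these per-band temporal trajectories can be sketched as follows; the context length of 50 frames on each side (101 frames total) is an illustrative choice for the sketch:

```python
import numpy as np

def trap_vectors(band_energies, context=50):
    """Extract TRAP feature vectors: for each frame and each critical band,
    take that band's log-energy trajectory over `context` frames in the
    past and `context` frames in the future around the centre frame.
    `band_energies` has shape (n_frames, n_bands); the result has shape
    (n_frames, n_bands, 2*context + 1)."""
    n_frames, n_bands = band_energies.shape
    # Repeat edge frames so every centre frame has a full context window
    padded = np.pad(band_energies, ((context, context), (0, 0)), mode="edge")
    out = np.zeros((n_frames, n_bands, 2 * context + 1))
    for t in range(n_frames):
        # temporal trajectory of each band around centre frame t
        out[t] = padded[t: t + 2 * context + 1].T
    return out
```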
The outputs of the classifier are posterior probabilities of the sub-word classes we want to distinguish among; in our case, these classes are context-independent phonemes or their parts (states). Such a classifier is applied in each critical band. The merger is another classifier whose function is to combine the band-classifier outputs into one.
Both the band classifiers and the merger are neural networks.
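The two-level hierarchy can be sketched structurally as below. The networks carry random, untrained weights, and the hidden-layer sizes and phoneme count are illustrative assumptions; the sketch only shows how per-band posteriors flow into the merger:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class OneHiddenLayerNet:
    """Minimal MLP standing in for both the band classifiers and the
    merger (random weights -- a structural sketch, not a trained model)."""
    def __init__(self, n_in, n_hidden, n_out, rng):
        self.W1 = rng.standard_normal((n_in, n_hidden)) * 0.1
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.standard_normal((n_hidden, n_out)) * 0.1
        self.b2 = np.zeros(n_out)

    def __call__(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        return softmax(h @ self.W2 + self.b2)  # class posteriors

def trap_posteriors(traps, n_phonemes=39, n_hidden=50, seed=0):
    """Hierarchy from the text: one classifier per critical band applied
    to that band's temporal trajectory, then a merger that combines all
    band posteriors into one posterior vector per frame."""
    rng = np.random.default_rng(seed)
    n_frames, n_bands, ctx_len = traps.shape
    band_nets = [OneHiddenLayerNet(ctx_len, n_hidden, n_phonemes, rng)
                 for _ in range(n_bands)]
    merger = OneHiddenLayerNet(n_bands * n_phonemes, n_hidden, n_phonemes, rng)
    # Run each band classifier, concatenate their posteriors, merge
    band_out = np.concatenate(
        [band_nets[b](traps[:, b, :]) for b in range(n_bands)], axis=1)
    return merger(band_out)  # (n_frames, n_phonemes), rows sum to 1
```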
The described techniques yield phoneme probabilities for the center frame. These probabilities are then fed into a Viterbi decoder, which produces phoneme strings.
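A minimal sketch of such a decoder, assuming per-frame phoneme log-posteriors and a phoneme transition log-probability matrix (which is also where a bigram phonotactic model could enter); collapsing consecutive repeats of the best path yields the phoneme string:

```python
import numpy as np

def viterbi_decode(log_post, log_trans):
    """Viterbi search over per-frame phoneme log-posteriors.
    log_post: (n_frames, n_phonemes); log_trans: (n_phonemes, n_phonemes)
    transition log-probabilities. Returns the best per-frame path and the
    phoneme string obtained by collapsing consecutive repeats."""
    n_frames, n_ph = log_post.shape
    delta = np.full((n_frames, n_ph), -np.inf)  # best score ending in each phoneme
    psi = np.zeros((n_frames, n_ph), dtype=int)  # backpointers
    delta[0] = log_post[0]
    for t in range(1, n_frames):
        scores = delta[t - 1][:, None] + log_trans  # (prev, cur)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_post[t]
    # Backtrack from the best final phoneme
    path = [int(delta[-1].argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    path.reverse()
    # Collapse consecutive repeats into a phoneme string
    string = [path[0]] + [p for a, p in zip(path, path[1:]) if p != a]
    return path, string
```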
This recognizer is further simplified and optimized to shorten processing times and reduce computational requirements. These simplifications and optimizations reduce the phoneme error rate (PER) by about 1.8% absolute.
More precise modeling was achieved by splitting phonemes
into three parts (states), which improved the system by 0.9% absolute. Separate modeling of the left and right phoneme contexts gained a further 0.38% with one-state models; finer modeling of these left and right contexts with three states led to a 3.76% improvement. Bi-gram language models are also incorporated into the system and evaluated.
All modifications lead to a faster system with about a 23.6% relative (6.84% absolute) improvement over the baseline in phoneme
error rate.
Work is in progress on porting this recognizer to the meeting-data domain. The recognizer will serve as
one of the front-ends for acoustic event spotting (the task of Brno within AMI).
@inproceedings{BUT17586,
author="Petr {Schwarz} and Pavel {Matějka} and Jan {Černocký}",
title="Phoneme Recognition from a Long Temporal Context",
booktitle="poster at JOINT AMI/PASCAL/IM2/M4 Workshop on Multimodal Interaction and Related Machine Learning Algorithms",
year="2004",
pages="1--1",
publisher="Institute for Perceptual Artificial Intelligence",
address="Martigny",
url="http://www.fit.vutbr.cz/~matejkap/publi/2004/ami2004.pdf"
}