Publication Details

AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions

KIŠŠ, M.; BENEŠ, K.; HRADIŠ, M. AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions. In Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science. Lausanne: Springer Nature Switzerland AG, 2021. p. 463-477. ISBN: 978-3-030-86336-4.
Czech title
AT-ST: Samoučící strategie adaptace pro OCR v doménách s omezeným počtem přepisů
Type
conference paper
Language
English
Authors
Martin Kišš, Karel Beneš, Michal Hradiš
URL
https://pero.fit.vutbr.cz/publications
Keywords

self-training, text recognition, language model, unlabelled data, confidence
measures, data augmentation.

Abstract

This paper addresses text recognition for domains with limited manual annotations
using a simple self-training strategy. Our approach should reduce human annotation
effort when target domain data is plentiful, such as when transcribing
a single person's correspondence or a large manuscript. We propose
to train a seed system on large scale data from related domains mixed with
available annotated data from the target domain. The seed system transcribes the
unannotated data from the target domain, which is then used to train a better
system. We study several confidence measures and choose the
posterior probability of a transcription for data selection. Additionally, we
propose to augment the data using an aggressive masking scheme. By self-training,
we achieve up to a 55 % reduction in character error rate for handwritten data and
up to 38 % on printed data. The masking augmentation alone reduces the error
rate by about 10 %, and its effect is more pronounced on difficult
handwritten data.
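
The two ingredients named in the abstract, confidence-based selection of machine-transcribed lines and masking augmentation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `keep_fraction` parameter, the use of vertical bands, and the constant fill value are all assumptions made for illustration, since the abstract does not specify these details. Text lines are represented as tuples carrying the log posterior probability of their transcription, and images as plain lists of pixel rows.

```python
import math
import random


def select_confident(lines, keep_fraction):
    """Keep the most confident fraction of machine-transcribed lines.

    `lines` holds (line_id, transcription, log_posterior) tuples, where
    log_posterior is the seed system's log probability of the transcription.
    The ranking criterion and `keep_fraction` are illustrative assumptions.
    """
    ranked = sorted(lines, key=lambda item: item[2], reverse=True)
    n_keep = int(len(ranked) * keep_fraction)
    return ranked[:n_keep]


def mask_columns(image, n_masks, max_width, rng, fill=0):
    """Return a copy of `image` with random vertical bands overwritten.

    `image` is a list of equally long pixel rows. The band shape and the
    constant `fill` value are assumptions, not taken from the paper.
    """
    width = len(image[0])
    masked = [row[:] for row in image]  # leave the input untouched
    for _ in range(n_masks):
        band = rng.randint(1, max_width)
        start = rng.randint(0, max(0, width - band))
        for row in masked:
            for x in range(start, min(width, start + band)):
                row[x] = fill
    return masked


# Hypothetical output of a seed system on unannotated target-domain lines.
pseudo_labelled = [
    ("line-1", "Dear Sir", math.log(0.95)),
    ("line-2", "Deer Sfr", math.log(0.40)),
    ("line-3", "Yours truly", math.log(0.80)),
    ("line-4", "Yonrs trnly", math.log(0.10)),
]
selected = select_confident(pseudo_labelled, keep_fraction=0.5)

# A toy 4x10 "text line image" with every pixel set to 1.
line_image = [[1] * 10 for _ in range(4)]
augmented = mask_columns(line_image, n_masks=2, max_width=3, rng=random.Random(0))
```

In a full self-training loop, the selected lines would be mixed with the annotated data and the masked images fed to the recognizer during retraining; both steps are outside the scope of this sketch.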

Published
2021
Pages
463–477
Proceedings
Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021
Series
Lecture Notes in Computer Science
Volume
12824
Conference
International Conference on Document Analysis and Recognition, Lausanne, Switzerland, CH
ISBN
978-3-030-86336-4
Publisher
Springer Nature Switzerland AG
Place
Lausanne
DOI
10.1007/978-3-030-86337-1_31
UT WoS
000711880100031
EID Scopus
BibTeX
@inproceedings{BUT175776,
  author="Martin {Kišš} and Karel {Beneš} and Michal {Hradiš}",
  title="AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions",
  booktitle="Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021",
  year="2021",
  series="Lecture Notes in Computer Science",
  volume="12824",
  pages="463--477",
  publisher="Springer Nature Switzerland AG",
  address="Lausanne",
  doi="10.1007/978-3-030-86337-1_31",
  isbn="978-3-030-86336-4",
  url="https://pero.fit.vutbr.cz/publications"
}