Project Details

Pokročilá extrakce a rozpoznávání obsahu tištěných a rukou psaných digitalizátů pro zvýšení jejich přístupnosti a využitelnosti

Project Period: 1. 3. 2018 – 31. 12. 2022

Project Type: grant

Code: DG18P02OVV055

Agency: Ministry of Culture of the Czech Republic

Program: Programme to Support Applied Research and Experimental Development of National and Cultural Identity for 2016 to 2022 (NAKI II)

English title

Advanced content extraction and recognition for printed and handwritten documents for better accessibility and usability

Keywords

Optical character recognition, handwriting recognition, natural language
processing, quality enhancement, language model, convolutional neural networks,
recurrent neural networks

Abstract

The project aims to create technology and tools that improve the accessibility
of digitized historic documents. These tools, based on state-of-the-art methods
from computer vision, machine learning, and language modeling, will enable
existing digital archives and libraries to provide full-text search and content
extraction for low-quality historic printed documents and for all handwritten
documents, which currently available tools cannot process automatically. The
project extends the automation and capabilities of the digitization pipeline by
providing tools for automated quality assessment and control, quality
improvement, automated text transcription of historic printed documents,
semi-automated handwritten text transcription, and automatic extraction of
semantic information from semi-structured documents (e.g. library catalogs and
birth records). The created tools and techniques will be validated by processing
selected collections of digitized materials and through a pilot operation in
cooperation with the Moravian Library.

Team members

Smrž Pavel, doc. RNDr., Ph.D. (DCGM) – research leader
Bařina David, Ing., Ph.D. (DCGM)
Beneš Karel, Ing., Ph.D. (DCGM)
Hájková Gabriela, Mgr. (DFIT)
Hradiš Michal, Ing., Ph.D. (DCGM)
Hříbek David, Ing.
Juránek Roman, Ing., Ph.D. (DCGM)
Kodym Oldřich, Ing., Ph.D.
Kopeczinski Daniela, Mgr. (Library)
Zemčík Pavel, prof. Dr. Ing., dr. h. c. (DCGM)

Publication Results

2022

DVOŘÁKOVÁ, M.; HRADIŠ, M.; ŽABIČKA, P.; KOHÚT, J.; KIŠŠ, M.; BENEŠ, K. Využití PERO OCR při přepisu rukopisů. Archivní časopis, 2022, vol. 72, no. 1, p. 14-27. ISSN: 0004-0398.
KIŠŠ, M.; KOHÚT, J.; BENEŠ, K.; HRADIŠ, M. Importance of Textlines in Historical Document Classification. In Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. Lecture Notes in Computer Science. La Rochelle: Springer Nature Switzerland AG, 2022. p. 158-170. ISBN: 978-3-031-06554-5.

2021

KIŠŠ, M.; BENEŠ, K.; HRADIŠ, M. AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions. In Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science. Lausanne: Springer Nature Switzerland AG, 2021. p. 463-477. ISBN: 978-3-030-86336-4.
KODYM, O.; HRADIŠ, M. Page Layout Analysis System for Unconstrained Historic Documents. In Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science. Lausanne: Springer Nature Switzerland AG, 2021. p. 492-506. ISBN: 978-3-030-86330-2.
KODYM, O.; HRADIŠ, M. TG2: text-guided transformer GAN for restoring document readability and perceived quality. International Journal on Document Analysis and Recognition, 2021, vol. 2021, no. 1, p. 1-14. ISSN: 1433-2825.
KOHÚT, J.; HRADIŠ, M. TS-Net: OCR Trained to Switch Between Text Transcription Styles. In Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science. Lausanne: Springer Nature Switzerland AG, 2021. no. 1, p. 478-493. ISBN: 978-3-030-86336-4. ISSN: 0302-9743.

2020

KIŠŠ, M.; HRADIŠ, M.; KODYM, O. Brno Mobile OCR Dataset. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR. Sydney: Institute of Electrical and Electronics Engineers, 2020. p. 1352-1357. ISBN: 978-1-7281-3015-6.

Applied Results

2022

Software for information extraction from semi-structured documents, software, 2022
Authors: HRADIŠ, M.; KIŠŠ, M.; KOHÚT, J.; BENEŠ, K.; KOSTELNÍK, M.

2021

Interactive semi-automatic handwriting recognition, software, 2021
Authors: HRADIŠ, M.; KIŠŠ, M.; KOHÚT, J.; BENEŠ, K.; KODYM, O.; BUCHAL, P.; HŘÍBEK, D.

2020

Software for adaptable text recognition of old prints, software, 2020
Authors: HRADIŠ, M.; KIŠŠ, M.; KODYM, O.; KOHÚT, J.; BENEŠ, K.; BUCHAL, P.
Device for digitization of specifically damaged documents, functioning sample, 2020
Authors: HRADIŠ, M.

2019

Automatic document quality assessment software module, software, 2019
Authors: BAKO, M.; BUCHAL, P.; HRADIŠ, M.
Software module for automatic enhancement of digitized documents, software, 2019
Authors: HRADIŠ, M.; KODYM, O.