Project Details
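Advanced extraction and recognition of the content of printed and handwritten digitized documents to increase their accessibility and usability (Pokročilá extrakce a rozpoznávání obsahu tištěných a rukou psaných digitalizátů pro zvýšení jejich přístupnosti a využitelnosti)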
Project Period: 1. 3. 2018 – 31. 12. 2022
Project Type: grant
Code: DG18P02OVV055
Agency: Ministry of Culture of the Czech Republic (Ministerstvo kultury ČR)
Keywords: optical character recognition, handwriting recognition, natural language
processing, quality enhancement, language model, convolutional neural networks,
recurrent neural networks
The project aims to create technology and tools that improve the accessibility
of digitized historic documents. These tools, based on state-of-the-art methods
from computer vision, machine learning, and language modeling, will enable
existing digital archives and libraries to provide full-text search and content
extraction for low-quality historic printed documents and for all handwritten
documents, which cannot be processed automatically by currently available tools.
The project extends the automation and capabilities of the digitization pipeline
by providing tools for automated quality assessment and control, quality
improvement, automated text transcription of historic printed documents,
semi-automated transcription of handwritten text, and automatic extraction of
semantic information from semi-structured documents (e.g. library catalogs and
birth records). The created tools and techniques will be validated by processing
selected collections of digitized materials and through a pilot operation in
cooperation with the Moravian Library.
Bařina David, Ing., Ph.D. (DCGM)
Beneš Karel, Ing. (DCGM)
Hradiš Michal, Ing., Ph.D. (DCGM)
Juránek Roman, Ing., Ph.D. (DCGM)
Kodym Oldřich, Ing., Ph.D.
Zemčík Pavel, prof. Dr. Ing., dr. h. c. (DCGM)
2022
- DVOŘÁKOVÁ, M.; HRADIŠ, M.; ŽABIČKA, P.; KOHÚT, J.; KIŠŠ, M.; BENEŠ, K. Využití PERO OCR při přepisu rukopisů. Archivní časopis, 2022, vol. 72, no. 1, p. 14-27. ISSN: 0004-0398.
- KIŠŠ, M.; KOHÚT, J.; BENEŠ, K.; HRADIŠ, M. Importance of Textlines in Historical Document Classification. In Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. Lecture Notes in Computer Science. La Rochelle: Springer Nature Switzerland AG, 2022. p. 158-170. ISBN: 978-3-031-06554-5.
2021
- KIŠŠ, M.; BENEŠ, K.; HRADIŠ, M. AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions. In Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science. Lausanne: Springer Nature Switzerland AG, 2021. p. 463-477. ISBN: 978-3-030-86336-4.
- KODYM, O.; HRADIŠ, M. Page Layout Analysis System for Unconstrained Historic Documents. In Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science. Lausanne: Springer Nature Switzerland AG, 2021. p. 492-506. ISBN: 978-3-030-86330-2.
- KODYM, O.; HRADIŠ, M. TG2: text-guided transformer GAN for restoring document readability and perceived quality. International Journal on Document Analysis and Recognition, 2021, vol. 2021, no. 1, p. 1-14. ISSN: 1433-2825.
- KOHÚT, J.; HRADIŠ, M. TS-Net: OCR Trained to Switch Between Text Transcription Styles. In Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science. Lausanne: Springer Nature Switzerland AG, 2021. p. 478-493. ISBN: 978-3-030-86336-4. ISSN: 0302-9743.
2020
- KIŠŠ, M.; HRADIŠ, M.; KODYM, O. Brno Mobile OCR Dataset. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR. Sydney: Institute of Electrical and Electronics Engineers, 2020. p. 1352-1357. ISBN: 978-1-7281-3015-6.