Project Details
Pokročilá extrakce a rozpoznávání obsahu tištěných a rukou psaných digitalizátů pro zvýšení jejich přístupnosti a využitelnosti
Project Period: 1. 3. 2018 – 31. 12. 2022
Project Type: grant
Code: DG18P02OVV055
Agency: Ministerstvo kultury ČR
Optical character recognition, handwriting recognition, natural language processing, quality enhancement, language model, convolutional neural networks recurrent neural networks
The project aims to create technology and tools which would improve accessibility of digitized historic documents. These tools, based on state of the art methods from computer vision, machine learning and language modeling, will enable existing digital archives and libraries to provide full-text search and content extraction for low quality historic printed and all hand written documents - which can not be automatically processed by the currently available tools. The project extends automation and capabilities of digitization pipeline by providing tools for automated quality assessment and control, quality improvement, automated text transcription of historic printed documents, semi-automated hand written text transcription, and automatic extraction of semantic information from semi-structured documents (e.g. library catalogs and birth records). The created tools and techniques will be validated by processing selected collections of digitized materials and by a pilot operation by cooperation with Moravian Library.
Bařina David, Ing., Ph.D. (DCGM)
Beneš Karel, Ing. (DCGM)
Hradiš Michal, Ing., Ph.D. (DCGM)
Juránek Roman, Ing., Ph.D. (DCGM)
Kodym Oldřich, Ing., Ph.D.
Zemčík Pavel, prof. Dr. Ing., dr. h. c. (DCGM)
2022
- DVOŘÁKOVÁ, M.; HRADIŠ, M.; ŽABIČKA, P.; KOHÚT, J.; KIŠŠ, M.; BENEŠ, K. Využití PERO OCR při přepisu rukopisů. Archivní časopis, 2022, roč. 72, č. 1,
s. 14-27. ISSN: 0004-0398. Detail - KIŠŠ, M.; KOHÚT, J.; BENEŠ, K.; HRADIŠ, M. Importance of Textlines in Historical Document Classification. In Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. Lecture Notes in Computer Science. La Rochelle: Springer Nature Switzerland AG, 2022.
p. 158-170. ISBN: 978-3-031-06554-5. Detail
2021
- KIŠŠ, M.; BENEŠ, K.; HRADIŠ, M. AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions. In Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science. Lausanne: Springer Nature Switzerland AG, 2021.
p. 463-477. ISBN: 978-3-030-86336-4. Detail - KODYM, O.; HRADIŠ, M. Page Layout Analysis System for Unconstrained Historic Documents. In Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science. Lausanne: Springer Nature Switzerland AG, 2021.
p. 492-506. ISBN: 978-3-030-86330-2. Detail - KODYM, O.; HRADIŠ, M. TG2: text-guided transformer GAN for restoring document readability and perceived quality. International Journal on Document Analysis and Recognition, 2021, vol. 2021, no. 1,
p. 1-14. ISSN: 1433-2825. Detail - KOHÚT, J.; HRADIŠ, M. TS-Net: OCR Trained to Switch Between Text Transcription Styles. In Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science. Lecture Notes in Computer Science. Lausanne: Springer Nature Switzerland AG, 2021.
p. 478-493. ISBN: 978-3-030-86336-4. ISSN: 0302-9743. Detail
2020
- KIŠŠ, M.; HRADIŠ, M.; KODYM, O. Brno Mobile OCR Dataset. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR. Sydney: Institute of Electrical and Electronics Engineers, 2020.
p. 1352-1357. ISBN: 978-1-7281-3015-6. Detail