Pokročilá extrakce a rozpoznávání obsahu tištěných a rukou psaných digitalizátů pro zvýšení jejich přístupnosti a využitelnosti

Project Period: 1. 3. 2018 – 31. 12. 2022

Project Type: grant

Code: DG18P02OVV055

Agency: Ministry of Culture of the Czech Republic (Ministerstvo kultury ČR)

Program: Programme to Support Applied Research and Experimental Development of National and Cultural Identity for the Years 2016 to 2022 (NAKI II)

English title
Advanced content extraction and recognition for printed and handwritten documents for better accessibility and usability

Optical character recognition, handwriting recognition, natural language
processing, quality enhancement, language model, convolutional neural networks,
recurrent neural networks


The project aims to create technology and tools that improve the accessibility
of digitized historical documents. These tools, based on state-of-the-art methods
from computer vision, machine learning, and language modeling, will enable
existing digital archives and libraries to provide full-text search and content
extraction for low-quality historical printed documents and all handwritten
documents, which cannot be processed automatically by currently available tools.
The project extends the automation and capabilities of the digitization pipeline
by providing tools for automated quality assessment and control, quality
improvement, automated text transcription of historical printed documents,
semi-automated transcription of handwritten text, and automatic extraction of
semantic information from semi-structured documents (e.g. library catalogs and
birth records). The created tools and techniques will be validated by processing
selected collections of digitized materials and through a pilot operation in
cooperation with Moravian

