Result Details
TextBite: A Historical Czech Document Dataset for Logical Page Segmentation
Hradiš Michal, Ing., Ph.D., UAMT (FEEC), DCGM (FIT)
Beneš Karel, Ing., Ph.D., DCGM (FIT)
Logical page segmentation is an important step in document analysis, enabling better semantic representations, information retrieval, and text understanding. Previous approaches define logical segmentation either through text or geometric objects, relying on OCR or precise geometry. To avoid the need for OCR, we define the task purely as segmentation in the image domain. Furthermore, to ensure the evaluation remains unaffected by geometrical variations that do not impact text segmentation, we propose to use only foreground text pixels in the evaluation metric and disregard all background pixels. To support research in logical document segmentation, we introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text. We propose a set of baseline methods combining text region detection and relation prediction. The dataset, baselines and evaluation framework can be accessed at https://github.com/DCGM/textbite-dataset.
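The idea of scoring only foreground text pixels can be illustrated with a minimal sketch. The abstract does not name the metric, so the Rand index used here is an assumption chosen only for illustration, and the function name `foreground_rand_index` and the list-of-lists inputs are hypothetical; the evaluation framework in the linked repository may work differently.

```python
from collections import Counter
from math import comb


def foreground_rand_index(pred, truth, foreground):
    """Rand index between two label maps, counted over foreground pixels only.

    pred, truth: 2D lists of integer segment ids with the same shape.
    foreground:  2D list of booleans, True where a pixel is text ink.
    (Hypothetical helper; the Rand index is an assumed stand-in metric.)
    """
    # Keep (predicted id, true id) only for foreground pixels.
    pairs = [
        (p, t)
        for prow, trow, frow in zip(pred, truth, foreground)
        for p, t, f in zip(prow, trow, frow)
        if f  # background pixels are dropped before any counting
    ]
    n = len(pairs)
    if n < 2:
        return 1.0  # fewer than two pixels: no pixel pair can disagree

    joint = Counter(pairs)
    pred_sizes = Counter(p for p, _ in pairs)
    truth_sizes = Counter(t for _, t in pairs)

    same_both = sum(comb(c, 2) for c in joint.values())
    same_pred = sum(comb(c, 2) for c in pred_sizes.values())
    same_truth = sum(comb(c, 2) for c in truth_sizes.values())
    total = comb(n, 2)

    # Agreeing pairs: together in both maps, or apart in both maps.
    agree = same_both + (total - same_pred - same_truth + same_both)
    return agree / total


# Toy page: two true segments; the prediction differs only on background.
truth = [[1, 1, 2, 2], [1, 1, 2, 2]]
pred = [[7, 7, 7, 9], [7, 9, 9, 9]]
ink = [[True, False, False, True], [True, False, False, True]]
everything = [[True] * 4 for _ in range(2)]

score_ink_only = foreground_rand_index(pred, truth, ink)
score_all_pixels = foreground_rand_index(pred, truth, everything)
```

In this toy case the two label maps disagree only where there is no ink, so the foreground-only score is perfect while the all-pixel score is penalized for boundary placement that does not affect any text.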
Dataset;Czech Historical Documents;Page Segmentation;Document Intelligence;Document Layout Analysis
@inproceedings{BUT197678,
author="Martin {Kostelník} and Michal {Hradiš} and Karel {Beneš}",
title="TextBite: A Historical Czech Document Dataset for Logical Page Segmentation",
booktitle="Document Analysis and Recognition – ICDAR 2025 Workshops",
year="2025",
pages="124--140",
publisher="Springer Nature Switzerland",
address="Cham",
doi="10.1007/978-3-032-09368-4_8",
isbn="978-3-032-09367-7",
url="https://link.springer.com/chapter/10.1007/978-3-032-09368-4_8"
}