Result Details

TextBite: A Historical Czech Document Dataset for Logical Page Segmentation

KOSTELNÍK, M.; HRADIŠ, M.; BENEŠ, K. TextBite: A Historical Czech Document Dataset for Logical Page Segmentation. In Document Analysis and Recognition – ICDAR 2025 Workshops. Cham: Springer Nature Switzerland, 2025. p. 124-140. ISBN: 978-3-032-09367-7.
Type
conference paper
Language
English
Authors
Kostelník Martin, Ing., DCGM (FIT)
Hradiš Michal, Ing., Ph.D., UAMT (FEEC), DCGM (FIT)
Beneš Karel, Ing., Ph.D., DCGM (FIT)
Abstract

Logical page segmentation is an important step in document analysis, enabling better semantic representations, information retrieval, and text understanding. Previous approaches define logical segmentation either through text or geometric objects, relying on OCR or precise geometry. To avoid the need for OCR, we define the task purely as segmentation in the image domain. Furthermore, to ensure the evaluation remains unaffected by geometrical variations that do not impact text segmentation, we propose to use only foreground text pixels in the evaluation metric and disregard all background pixels. To support research in logical document segmentation, we introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text. We propose a set of baseline methods combining text region detection and relation prediction. The dataset, baselines and evaluation framework can be accessed at https://github.com/DCGM/textbite-dataset.

Keywords

Dataset; Czech Historical Documents; Page Segmentation; Document Intelligence; Document Layout Analysis

URL
https://link.springer.com/chapter/10.1007/978-3-032-09368-4_8
Published
2025
Pages
124–140
Proceedings
Document Analysis and Recognition – ICDAR 2025 Workshops
Conference
International Conference on Document Analysis and Recognition
ISBN
978-3-032-09367-7
Publisher
Springer Nature Switzerland
Place
Cham
DOI
10.1007/978-3-032-09368-4_8
BibTeX
@inproceedings{BUT197678,
  author="Martin {Kostelník} and Michal {Hradiš} and Karel {Beneš}",
  title="TextBite: A Historical Czech Document Dataset for Logical Page Segmentation",
  booktitle="Document Analysis and Recognition – ICDAR 2025 Workshops",
  year="2025",
  pages="124--140",
  publisher="Springer Nature Switzerland",
  address="Cham",
  doi="10.1007/978-3-032-09368-4_8",
  isbn="978-3-032-09367-7",
  url="https://link.springer.com/chapter/10.1007/978-3-032-09368-4_8"
}
Projects
semANT - Semantic Document Exploration, MK, NAKI III (programme to support applied research in the area of national and cultural identity for the years 2023 to 2030), DH23P03OVV060, start: 2023-03-01, end: 2027-12-31, running