Result detail

TextBite: A Historical Czech Document Dataset for Logical Page Segmentation

KOSTELNÍK, M.; HRADIŠ, M.; BENEŠ, K. TextBite: A Historical Czech Document Dataset for Logical Page Segmentation. In Document Analysis and Recognition – ICDAR 2025 Workshops. Cham: Springer Nature Switzerland, 2025. p. 124-140. ISBN: 978-3-032-09367-7.
Type
conference proceedings paper
Language
English
Authors
Kostelník Martin, Ing., UPGM (FIT)
Hradiš Michal, Ing., Ph.D., UAMT (FEKT), UPGM (FIT)
Beneš Karel, Ing., Ph.D., UPGM (FIT)
Abstract

Logical page segmentation is an important step in document analysis, enabling better semantic representations, information retrieval, and text understanding. Previous approaches define logical segmentation either through text or geometric objects, relying on OCR or precise geometry. To avoid the need for OCR, we define the task purely as segmentation in the image domain. Furthermore, to ensure the evaluation remains unaffected by geometrical variations that do not impact text segmentation, we propose to use only foreground text pixels in the evaluation metric and disregard all background pixels. To support research in logical document segmentation, we introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text. We propose a set of baseline methods combining text region detection and relation prediction. The dataset, baselines and evaluation framework can be accessed at https://github.com/DCGM/textbite-dataset.

Keywords

Dataset; Czech Historical Documents; Page Segmentation; Document Intelligence; Document Layout Analysis

URL
Year
2025
Pages
124–140
Proceedings
Document Analysis and Recognition – ICDAR 2025 Workshops
Conference
International Conference on Document Analysis and Recognition
ISBN
978-3-032-09367-7
Publisher
Springer Nature Switzerland
Place
Cham
DOI
EID Scopus
BibTeX
@inproceedings{BUT197678,
  author="Martin {Kostelník} and Michal {Hradiš} and Karel {Beneš}",
  title="TextBite: A Historical Czech Document Dataset for Logical Page Segmentation",
  booktitle="Document Analysis and Recognition – ICDAR 2025 Workshops",
  year="2025",
  pages="124--140",
  publisher="Springer Nature Switzerland",
  address="Cham",
  doi="10.1007/978-3-032-09368-4_8",
  isbn="978-3-032-09367-7",
  url="https://link.springer.com/chapter/10.1007/978-3-032-09368-4_8"
}
Projects
semANT - Semantic Explorer of Textual Cultural Heritage, MK, NAKI III – programme to support applied research in the area of national and cultural identity for the years 2023 to 2030, DH23P03OVV060, start: 2023-03-01, end: 2027-12-31, in progress
Departments