Publication Details

Layout Based Information Extraction from HTML Documents

BURGET, R. Layout Based Information Extraction from HTML Documents. 9th International Conference on Document Analysis and Recognition ICDAR 2007. Curitiba: IEEE Computer Society, 2007. p. 624-629. ISBN: 0-7695-2822-8.
Czech title
Extrakce informace z HTML dokumentů založená na rozložení stránky
Type
conference paper
Language
English
Authors
Radek Burget
Keywords

page segmentation, layout analysis, information extraction

Abstract

We propose a method of information extraction from HTML documents based on
modelling the visual information in the document. A page segmentation algorithm
is used for detecting the document layout and subsequently, the extraction
process is based on the analysis of mutual positions of the detected blocks and
their visual features. This approach is more robust than the traditional
DOM-based methods and it opens new possibilities for the extraction task
specification.

Published
2007
Pages
624–629
Proceedings
9th International Conference on Document Analysis and Recognition ICDAR 2007
Conference
9th International Conference on Document Analysis and Recognition, Curitiba, BR
ISBN
0-7695-2822-8
Publisher
IEEE Computer Society
Place
Curitiba
BibTeX
@inproceedings{BUT28821,
  author="Radek {Burget}",
  title="Layout Based Information Extraction from HTML Documents",
  booktitle="9th International Conference on Document Analysis and Recognition ICDAR 2007",
  year="2007",
  pages="624--629",
  publisher="IEEE Computer Society",
  address="Curitiba",
  isbn="0-7695-2822-8"
}