Publication Details

Visual HTML Document Modeling for Information Extraction

BURGET, R. Visual HTML Document Modeling for Information Extraction. RAWS 2005. Ostrava: Faculty of Electrical Engineering and Computer Science, VSB-TU Ostrava, 2005. p. 17-24. ISBN: 80-248-0864-1.

Czech title

Visuální modelování HTML dokumentů pro extrakci informace

Type

conference paper

Language

English

Authors

Burget Radek, doc. Ing., Ph.D. (DIFS)

Keywords

HTML, Information Extraction, Document Modeling, Logical Document Structure, Visual Information

Abstract

Current methods for information extraction from HTML documents are mostly based on wrappers that read the HTML code and identify the data to be extracted by properties of the surrounding HTML tags and text. The bottleneck of this approach is the overly tight binding of the wrapper to the HTML code. HTML allows the desired document design to be achieved in various ways that can be arbitrarily combined, which limits wrappers to a narrow set of documents and a short time period. By contrast, there are generally accepted rules for the visual presentation of data in documents. Our approach uses visual information to identify the data in documents. We define formal models of the visual information and propose a method for information extraction based on unordered tree matching algorithms.
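
The abstract names unordered tree matching without giving details, so the following is only a minimal sketch of the general idea and not the algorithm from the paper. It assumes a hypothetical `Node` type whose labels stand for visual roles (for example `block`, `title`, `price`); these names are illustrative and do not come from the paper. A pattern matches a node when the labels agree and every pattern child can be paired with a distinct child of that node, regardless of order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; `label` stands for an assumed visual role."""
    label: str
    children: list["Node"] = field(default_factory=list)


def matches(pattern: Node, target: Node) -> bool:
    """True if `pattern` matches at `target`, ignoring the order of children."""
    if pattern.label != target.label:
        return False
    return _assign(pattern.children, target.children)


def _assign(p_children: list[Node], t_children: list[Node]) -> bool:
    """Backtracking search pairing each pattern child with a distinct target child."""
    if not p_children:
        return True  # extra target children are allowed
    first, rest = p_children[0], p_children[1:]
    for i, candidate in enumerate(t_children):
        remaining = t_children[:i] + t_children[i + 1:]
        if matches(first, candidate) and _assign(rest, remaining):
            return True
    return False


def find_matches(pattern: Node, root: Node) -> list[Node]:
    """Collect every node in the tree under `root` where `pattern` matches."""
    found = [root] if matches(pattern, root) else []
    for child in root.children:
        found.extend(find_matches(pattern, child))
    return found


# Illustrative data: two blocks, only the first has both a title and a price.
doc = Node("page", [
    Node("block", [Node("price"), Node("title")]),
    Node("block", [Node("title")]),
])
pattern = Node("block", [Node("title"), Node("price")])
print(len(find_matches(pattern, doc)))
```

The pattern lists `title` before `price` while the document has them the other way round, and the first block is still found, which is the property that makes such matching independent of the order in which the underlying HTML emits the elements. The backtracking search is exponential in the worst case and is chosen here only for brevity.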

Published

2005

Pages

17–24

Proceedings

RAWS 2005

ISBN

80-248-0864-1

Publisher

Faculty of Electrical Engineering and Computer Science, VSB-TU Ostrava

Place

Ostrava

BibTeX

@inproceedings{BUT18057,
  author="Radek {Burget}",
  title="Visual HTML Document Modeling for Information Extraction",
  booktitle="RAWS 2005",
  year="2005",
  pages="17--24",
  publisher="Faculty of Electrical Engineering and Computer Science, VSB-TU Ostrava",
  address="Ostrava",
  isbn="80-248-0864-1"
}