Publication Details

Visual HTML Document Modeling for Information Extraction

BURGET, R. Visual HTML Document Modeling for Information Extraction. RAWS 2005. Ostrava: Faculty of Electrical Engineering and Computer Science, VSB-TU Ostrava, 2005. p. 17-24. ISBN: 80-248-0864-1.
Czech title
Visuální modelování HTML dokumentů pro extrakci informace
Type
conference paper
Language
English
Authors
Radek Burget
Keywords

HTML, Information Extraction, Document Modeling, Logical Document Structure,
Visual Information

Abstract

Current methods for information extraction from HTML documents are mostly
based on wrappers that read the HTML code and identify the data to be extracted
by properties of the surrounding HTML tags and text. The bottleneck of this
approach is the overly tight binding of the wrapper to the HTML code. HTML
allows the desired document design to be achieved in various ways that can be
arbitrarily combined, which limits wrappers to a narrow set of documents and a
short time period. By contrast, there exist some generally accepted rules for
the visual presentation of data in documents. Our approach uses this visual
information to identify the data in documents. We define formal models of the
visual information and propose a method for information extraction based on
unordered tree matching algorithms.

Published
2005
Pages
17–24
Proceedings
RAWS 2005
ISBN
80-248-0864-1
Publisher
Faculty of Electrical Engineering and Computer Science, VSB-TU Ostrava
Place
Ostrava
BibTeX
@inproceedings{BUT18057,
  author="Radek {Burget}",
  title="Visual HTML Document Modeling for Information Extraction",
  booktitle="RAWS 2005",
  year="2005",
  pages="17--24",
  publisher="Faculty of Electrical Engineering and Computer Science, VSB-TU Ostrava",
  address="Ostrava",
  isbn="80-248-0864-1"
}