Publication Details

Extracting Visually Presented Element Relationships from Web Documents

BURGET, R.; SMRŽ, P. Extracting Visually Presented Element Relationships from Web Documents. International Journal of Cognitive Informatics and Natural Intelligence, 2013, vol. 2013, no. 2, p. 13-29. ISSN: 1557-3958.
Czech title
Extrakce vizuálně prezentovaných vztahů z webových dokumentů
Type
journal article
Language
English
Authors
Radek Burget, Pavel Smrž
Keywords

logical document structure; page segmentation; document analysis; web documents

Abstract

Many documents in the World Wide Web present structured information that consists
of multiple pieces of data with certain relationships among them. Although it is
usually not difficult to identify the individual data values in the document
text, their relationships are often not explicitly described in the document
content. They are expressed by visual presentation of the document content that
is expected to be interpreted by a human reader. In this paper, we propose
a formal generic model of logical relationships in a document based on an
interpretation of visual presentation patterns in the documents. The model
describes the visually expressed relationships between individual parts of the
contents independently of the document format and the particular way of
presentation. Therefore, it can be used as an appropriate document model in many
information retrieval or extraction applications. We formally define the model,
we introduce a method of extracting the relationships between the content parts
based on the visual presentation analysis and we discuss the expected
applications. We also present a new dataset consisting of programmes of
conferences and other scientific events and we discuss its suitability for the
task in hand. Finally, we use the dataset to evaluate results of the implemented
system.

Published
2013
Pages
13–29
Journal
International Journal of Cognitive Informatics and Natural Intelligence, vol. 2013, no. 2, ISSN 1557-3958
DOI
10.4018/ijcini.2013040102
BibTeX
@article{BUT105971,
  author="Radek {Burget} and Pavel {Smrž}",
  title="Extracting Visually Presented Element Relationships from Web Documents",
  journal="International Journal of Cognitive Informatics and Natural Intelligence",
  year="2013",
  volume="2013",
  number="2",
  pages="13--29",
  doi="10.4018/ijcini.2013040102",
  issn="1557-3958",
  url="https://www.fit.vut.cz/research/publication/10468/"
}