Publication Details

Two-Phase Categorization of Web Documents

BARTÍK, V.; BURGET, R. Two-Phase Categorization of Web Documents. Proceedings of the International Conference on Knowledge Discovery and Information Retrieval. Valencia: Institute for Systems and Technologies of Information, Control and Communication, 2010. p. 458-462. ISBN: 978-989-8425-28-7.
Czech title
Dvoufázová kategorizace webových dokumentů
Type
conference paper
Language
English
Authors
Vladimír Bartík, Radek Burget
Keywords

Web page categorization, visual block classification, term weighting, TF-IDF,
page segmentation

Abstract

The number of pages on the World Wide Web is constantly growing, and there is
a need to process pages efficiently and extract useful knowledge from them.
Web page categorization is an important task in this area. The method
proposed here takes both visual and textual information into consideration. It
consists of two phases. In the first phase, web page areas obtained by
segmentation are classified based on their visual properties. In the second
phase, pages are classified based on information from the first phase and
textual information. Several experiments with web pages taken from news web sites
are presented in the final part of the paper.
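
The two phases described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the block classes, the weights in BLOCK_WEIGHTS, the rule in classify_block, and the block fields font_size and link_ratio are all hypothetical, chosen only to show how visual block classes from the first phase might scale term weights in a TF-IDF representation used by the second phase.

```python
import math
from collections import Counter

# Hypothetical block classes and weights; the abstract does not state the
# actual classes or weighting scheme used in the paper.
BLOCK_WEIGHTS = {"heading": 3.0, "main_text": 1.0, "navigation": 0.0}


def classify_block(block):
    """Phase 1 stand-in: label a segmented block from assumed visual features."""
    if block["font_size"] >= 20:
        return "heading"
    if block["link_ratio"] > 0.5:
        return "navigation"
    return "main_text"


def weighted_tf(page):
    """Term frequencies, each occurrence scaled by the weight of its block class."""
    tf = Counter()
    for block in page:
        weight = BLOCK_WEIGHTS[classify_block(block)]
        for term in block["text"].lower().split():
            tf[term] += weight
    return tf


def tfidf_vectors(pages):
    """Phase 2 input: one block-weighted TF-IDF vector (dict) per page."""
    tfs = [weighted_tf(page) for page in pages]
    n = len(pages)
    # Document frequency counts only terms that carry a positive weight.
    df = Counter(t for tf in tfs for t, w in tf.items() if w > 0)
    return [
        {t: w * math.log(n / df[t]) for t, w in tf.items() if w > 0}
        for tf in tfs
    ]
```

Each page is passed as a list of blocks, where a block is a dict with the keys text, font_size, and link_ratio. The resulting vectors could then be given to any standard text classifier. Terms that appear only in blocks labeled as navigation receive zero weight and are dropped from the vector.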

Published
2010
Pages
458–462
Proceedings
Proceedings of the International Conference on Knowledge Discovery and Information Retrieval
Conference
International Conference on Knowledge Discovery and Information Retrieval, Valencia, ES
ISBN
978-989-8425-28-7
Publisher
Institute for Systems and Technologies of Information, Control and Communication
Place
Valencia
BibTeX
@inproceedings{BUT34415,
  author="Vladimír {Bartík} and Radek {Burget}",
  title="Two-Phase Categorization of Web Documents",
  booktitle="Proceedings of the International Conference on Knowledge Discovery and Information Retrieval",
  year="2010",
  pages="458--462",
  publisher="Institute for Systems and Technologies of Information, Control and Communication",
  address="Valencia",
  isbn="978-989-8425-28-7"
}