Dissertation Topic

Information Extraction from the WWW

Academic Year: 2024/2025

Supervisor: Burget Radek, doc. Ing., Ph.D.

Department: Department of Information Systems

Programs:
Information Technology (DIT) - full-time study
Information Technology (DIT) - combined study
Information Technology (DIT-EN) - full-time study
Information Technology (DIT-EN) - combined study

Identifying and extracting specific information from documents on the Web has long been a subject of intensive research. The main obstacles that make this problem difficult are the loose structure of HTML documents and the absence of meta-information (annotations) that would help recognize the content semantics. This missing information is therefore compensated for by analyzing various aspects of web documents, in particular the following:

Document HTML code (DOM)
Document text (keyword search, statistical text analysis, natural language processing methods)
Visual organization (page content layout, visual properties)
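
As a rough illustration of the first two aspects (DOM structure and document text), the following minimal sketch pairs each text fragment with the path of HTML tags that encloses it. It uses only Python's standard `html.parser` module; the names `TextPathCollector`, `collect_text_paths` and `VOID_TAGS` are hypothetical and introduced here for illustration only. It is not a method prescribed by the topic, and it ignores the visual aspect, which would require a rendering engine.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (illustrative subset, an assumption).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextPathCollector(HTMLParser):
    """Collects (tag path, text) pairs: a crude view of DOM structure plus text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched closing tags are
        # ignored, which tolerates loosely structured HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def collect_text_paths(html):
    """Return a list of (tag path, text) pairs for the given HTML string."""
    parser = TextPathCollector()
    parser.feed(html)
    parser.close()
    return parser.items
```

For example, `collect_text_paths("<p>Price: <b>10 EUR</b></p>")` yields the fragments `Price:` under the path `p` and `10 EUR` under `p/b`, so a later step can combine the textual content with its structural context.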

Background knowledge about the target domain and the commonly used presentation patterns is also necessary for successful information extraction. This knowledge allows a more precise recognition of the individual information fields in the document body.

Current approaches to information extraction from web documents focus mainly on modeling and analyzing the documents themselves; modeling the target information for more precise recognition has not yet been examined in detail in this context. The assumed goals of the dissertation are therefore the following:

Analysis of existing domain models such as UML class diagrams, E-R diagrams or ontologies.
Extending these models with specifications for recognizing particular data in documents (e.g. regular expressions, advanced text classification).
Design of information extraction methods based on a comparison of the structure of the information presented in the document and the expected structure of the target information.
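
The second goal above can be pictured with a minimal sketch, assuming a domain class is reduced to a flat list of fields and each field is annotated with a regular expression that recognizes its values. The `FieldSpec` class, the `PRODUCT_MODEL` example and the `recognize_fields` function are hypothetical names chosen for illustration; the patterns are deliberately simplistic and do not represent the methods the dissertation is expected to design.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One field of a domain model plus a recognizer for its values."""
    name: str
    pattern: str  # regular expression recognizing the field's value


# Hypothetical "Product" domain class with two annotated fields.
PRODUCT_MODEL = [
    FieldSpec("price", r"\b\d+(?:\.\d{2})?\s?(?:EUR|USD)\b"),
    FieldSpec("email", r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
]


def recognize_fields(model, fragments):
    """Map each field name to the first matching value found in the fragments."""
    found = {}
    for spec in model:
        regex = re.compile(spec.pattern)
        for fragment in fragments:
            match = regex.search(fragment)
            if match:
                found[spec.name] = match.group(0)
                break  # keep only the first match per field
    return found
```

Given the text fragments `["Super Widget", "Price: 19.99 EUR", "Contact: sales@example.com"]`, the sketch assigns `19.99 EUR` to `price` and `sales@example.com` to `email`. A real extended model would also need to capture relationships between fields, so that the expected structure of the target information can be compared with the structure found in the document.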

An experimental implementation of the proposed methods using existing tools and their experimental evaluation on real-world documents available on the WWW are also integral parts of the solution.