Open Access Repository

Text noise filtering methods for web information management



Kim, YS 2004, 'Text noise filtering methods for web information management', Coursework Master thesis, University of Tasmania.

Available under University of Tasmania Standard License.



As people increasingly use Web information as a major knowledge resource, the development of computerized Web information management systems is becoming one of the major streams of Internet research. There are three major problems in this development. The first is the ambiguity of target documents, the so-called 'ontology problem'; text mining and ontology research mainly focus on this aspect. The second is that it is not easy to locate information, which has been a well-known problem since the early stages of Web technology. Many researchers focus on push-style information delivery technology, such as RSS (Really Simple Syndication) and automated Web information monitoring systems, to replace the current pull style. The third is the complexity of the information on a Web page. This has received less attention in Web research, but it is now being recognized as a more crucial problem in real-world applications.
This thesis focuses on the third problem. The goal of the research is to identify the core information within heterogeneous Web page content. Core information is the material that publishers want to impart to users. However, Web pages also contain 'noisy information', such as redundant information and functional information. Whereas core information helps knowledge management, noisy information may impede it. The noisy text filtering method consists of the following filtering modules: a phrase length based filter, a tag based filter, a redundant words elimination filter, and a redundant phrases elimination filter. Extensive comparative experiments have been conducted on real-world data sets collected from online news Web service sites (ABC, BBC, and CNN). The experimental results show that this approach works efficiently and effectively.
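
The abstract names the filtering modules but does not describe how they work. The following Python code is only an illustrative sketch of how a tag based filter, a phrase length based filter, and a redundant phrases elimination filter might be chained. It is not the thesis's implementation. The set of noise tags, the four-word minimum, the rule that a phrase is redundant if it appears on more than one page, and all function names are assumptions made for this example. The redundant words elimination filter is left out because the abstract gives no hint of its behaviour.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: text inside these tags is treated as functional noise.
NOISE_TAGS = {"script", "style", "nav", "footer", "form"}


class PhraseExtractor(HTMLParser):
    """Collect text phrases, skipping text nested inside NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # > 0 while inside at least one noise tag
        self.phrases = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and self.noise_depth == 0:
            self.phrases.append(text)


def tag_filter(html):
    """Tag based filter (sketch): keep text found outside the noise tags."""
    parser = PhraseExtractor()
    parser.feed(html)
    parser.close()
    return parser.phrases


def length_filter(phrases, min_words=4):
    """Phrase length based filter (sketch): drop very short phrases."""
    return [p for p in phrases if len(p.split()) >= min_words]


def redundant_phrase_filter(pages, max_pages=1):
    """Redundant phrases filter (sketch): drop phrases seen on too many pages.

    `pages` is a list of phrase lists, one list per Web page.
    """
    page_counts = Counter(p for page in pages for p in set(page))
    return [[p for p in page if page_counts[p] <= max_pages] for page in pages]


def filter_pages(html_pages):
    """Run the three sketched filters over a batch of pages from one site."""
    pages = [length_filter(tag_filter(html)) for html in html_pages]
    return redundant_phrase_filter(pages)
```

Given two pages from the same site, `filter_pages` would discard the navigation text, short link labels such as "Read more", and any phrase repeated on both pages, leaving only the phrases unique to each page. Comparing phrases across pages is a design choice of this sketch; redundancy could equally be measured within a single page.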

Item Type: Thesis - Coursework Master
Authors/Creators: Kim, YS
Copyright Information:

Copyright 2004 the author - The University is continuing to endeavour to trace the copyright owner(s) and in the meantime this item has been reproduced here in good faith. We would be pleased to hear from the copyright owner(s).

Additional Information:

Thesis (MComp)--University of Tasmania, 2004. Includes bibliographical references.
