Siteseeing: Using machine learning to classify segmented web pages
Abstract
This thesis extends the SiteSeer system, previously developed by Dr. Cormier, by incorporating machine learning into its classification component. SiteSeer segments web pages into regions and then assigns each region a label from a set of labels called the ontology. This project applies machine learning to the second step: assigning a label to a region of the web page based on its appearance. Several machine learning architectures were tried; we focus on two of them, a Convolutional Neural Network (CNN) and a Random Forest ensemble. The Random Forest performed best, while the CNN's performance fell approximately in the middle among the architectures attempted. In the best case, accuracy improved by 40 percentage points over weighted random predictions (weighted according to the imbalance in class membership), from 35% to 75%. Accuracy also roughly tripled compared to the earlier system, from 20% to 59%. These results are a significant advance in the performance of the SiteSeer system and demonstrate that machine learning is an effective technique for classifying the present dataset, despite numerous challenges. The findings could be extended by combining the machine learning classifier with the Hidden Markov Tree from earlier iterations of the system, by collecting more data, and by improving the initial segmentation algorithm that generated the dataset.
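The weighted random baseline mentioned in the abstract can be illustrated with a short sketch. It assumes the baseline guesses each class with probability equal to that class's share of the dataset, so the expected accuracy is the sum of the squared class proportions. The abstract does not spell out this definition, and the label names and counts below are hypothetical, not taken from the SiteSeer ontology or dataset.

```python
from collections import Counter


def weighted_random_accuracy(labels):
    """Expected accuracy of guessing each class with probability equal to
    its frequency in `labels` (an assumed reading of "weighted random")."""
    counts = Counter(labels)
    total = len(labels)
    # A guess is correct when the guessed class and the true class coincide,
    # which happens with probability p_c * p_c for each class c.
    return sum((n / total) ** 2 for n in counts.values())


# Hypothetical, imbalanced label set (not the thesis data).
labels = ["text"] * 5 + ["image"] * 3 + ["navigation"] * 2
baseline = weighted_random_accuracy(labels)

# Gains over a baseline are differences in percentage points,
# e.g. the abstract's 75% versus 35% is a gain of 40 points.
gain_points = 75 - 35
```

Under this reading, a more imbalanced dataset gives a higher baseline, which is why a comparison against weighted random guessing is stricter than one against uniform guessing.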
