Siteseeing: Using machine learning to classify segmented web pages

dc.contributor.author: Dewan, Shekhar
dc.date.accessioned: 2024-12-16T14:32:44Z
dc.date.available: 2024-12-16T14:32:44Z
dc.date.issued: 2020
dc.description.abstract: The aim of this thesis is to extend the SiteSeer system, previously developed by Dr. Cormier, by incorporating machine learning into its classification component. The SiteSeer system segments web pages into regions, each of which is then assigned a label from a set of labels called the ontology. It is this second component, the assignment of a label based on the appearance of a region of the web page, to which we applied machine learning in this project. We tried several machine learning architectures and focus on two of them: a Convolutional Neural Network (CNN) and a Random Forest ensemble. The Random Forest showed the best performance, while the CNN's performance fell approximately in the middle. In the best case, we improved accuracy by 40 percentage points over weighted random predictions (weighted according to the imbalance in class membership), from 35% to 75%. We also roughly tripled accuracy compared to the earlier system, from 20% to 59%. These results constitute a significant advancement in the performance of the SiteSeer system and demonstrate that machine learning is an effective technique for classifying the present dataset, despite numerous challenges. The findings could be extended by combining the machine learning with the Hidden Markov Tree from earlier iterations of the system, by collecting more data, and by improving the initial segmentation algorithm that generated the dataset.
dc.format.extent: 58 p.
dc.identifier.other: mta:29189
dc.identifier.uri: https://hdl.handle.net/20.500.14662/453
dc.language: eng
dc.language.iso: iso639-2b
dc.publisher: Mount Allison University
dc.rights: author
dc.subject.discipline: Mathematics and Computer Science
dc.title: Siteseeing: Using machine learning to classify segmented web pages
dc.type: Text
dc.type: Dissertation/Thesis
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Mount Allison University
thesis.degree.level: Undergraduate
thesis.degree.name: Bachelor of Science
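
The abstract compares accuracy against "weighted random predictions", that is, a baseline that guesses each label with probability equal to its share of the dataset. As a rough illustration of how such a baseline's expected accuracy can be computed, here is a minimal Python sketch. The function name and the example labels are hypothetical and are not taken from the thesis; the thesis may have estimated its 35% baseline differently (for example, by sampling).

```python
from collections import Counter


def weighted_random_expected_accuracy(labels):
    """Expected accuracy of a guesser that predicts each class with
    probability equal to that class's frequency in `labels`.

    If class i has frequency p_i, the guess is correct with probability
    p_i whenever the true class is i, so the expectation is sum(p_i ** 2).
    """
    counts = Counter(labels)
    total = len(labels)
    return sum((count / total) ** 2 for count in counts.values())


# Hypothetical, imbalanced label set (not the SiteSeer ontology).
example_labels = ["navigation"] * 3 + ["footer"] * 1
baseline = weighted_random_expected_accuracy(example_labels)
print(f"Expected baseline accuracy: {baseline:.4f}")
```

The more imbalanced the classes, the higher this baseline, which is why it is a fairer reference point than uniform random guessing on a dataset with uneven class membership.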

Files

Original bundle

Name: mta_29189.pdf
Size: 2.43 MB
Format: Adobe Portable Document Format