Student research opportunities
Automated Hierarchical Classification of Web Pages or Documents
Project Code: CECS_48
This project is available at the following levels:
Honours, Summer Scholar
Please note that this project is only for undergraduate students.
Supervisor:
Dr Wray Buntine
Outline:
Classifying web pages according to Wikipedia's category hierarchy is a useful task for many kinds of document analysis. The challenges lie in processing the HTML and text content, applying supervised learning (we have many such tools available), scaling to large hierarchies (hundreds to thousands of categories), and achieving acceptable performance on a standard server. This is an open-ended project that could involve theoretical work (methods to address scaling and/or hierarchical issues), applied work (exploring the use of existing techniques), or both.
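To give a feel for the task, here is a minimal sketch of one common approach: top-down classification, where a page is routed from the root of the hierarchy down to a leaf by choosing the best-scoring child at each level. It is an illustration only, not the method the project prescribes. The toy hierarchy, the training texts, and all function names (html_to_tokens, train, score, classify) are assumptions made up for this example. The scorer is a simple multinomial model with add-one smoothing and uniform class priors, standing in for whichever supervised learner is actually used.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_tokens(html):
    """Strip markup and return lowercase alphabetic tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z]+", " ".join(parser.parts).lower())


# Hypothetical two-level hierarchy: parent -> children.
HIERARCHY = {
    "root": ["Science", "Arts"],
    "Science": ["Physics", "Biology"],
    "Arts": ["Music", "Painting"],
}

# Hypothetical training data: (text, path from below the root to a leaf).
TRAINING = [
    ("quantum particle energy physics experiment", ["Science", "Physics"]),
    ("cell gene protein biology experiment", ["Science", "Biology"]),
    ("melody orchestra composer music performance", ["Arts", "Music"]),
    ("canvas oil portrait painting gallery", ["Arts", "Painting"]),
]


def train(training):
    """Word counts per category; a document counts toward every category on its path."""
    counts = {}
    for text, path in training:
        for category in path:
            counts.setdefault(category, Counter()).update(text.split())
    return counts


def score(tokens, word_counts, vocab_size):
    """Log-likelihood of the tokens under an add-one smoothed multinomial model."""
    total = sum(word_counts.values())
    return sum(math.log((word_counts[t] + 1) / (total + vocab_size)) for t in tokens)


def classify(tokens, counts, hierarchy, root="root"):
    """Walk down from the root, picking the highest-scoring child at each level."""
    vocab_size = len({word for c in counts.values() for word in c})
    path, node = [], root
    while node in hierarchy:
        node = max(hierarchy[node], key=lambda child: score(tokens, counts[child], vocab_size))
        path.append(node)
    return path


if __name__ == "__main__":
    page = "<html><body><h1>Gene study</h1><p>A protein in the cell.</p></body></html>"
    print(classify(html_to_tokens(page), train(TRAINING), HIERARCHY))
```

Each decision in this sketch only compares the children of one node, so the cost per page grows with the depth and branching of the hierarchy instead of the total number of categories. The trade-off is that a wrong choice near the root cannot be corrected further down, which is one of the hierarchical issues the project could investigate.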
Requirements/Prerequisites
- good programming skills
- experience in understanding and extending existing code
- experience performing (computer) experiments and analyzing results
- machine learning background
Background Literature
- see the challenge write-up under Links below for background
- see the Wikipedia Categorization link below
Links
Wikipedia Categorization
PASCAL Classification Challenge Report



