Honours/Masters/Grad Dip/etc projects supervised by Ian Barnes

Last updated: Wednesday 24 January 2007, 3:58 PM


Word Processing Structure Analyser

Supervisors: Dr Ian Barnes and Dr Peter Raftos

Background

An important part of current research into the long-term preservation and accessibility of word processor (WP) documents is the conversion of (primarily) Word and OpenOffice files to some structured XML format (usually DocBook XML).

This work assumes that the WP documents have already been structured in a systematic manner using the WP software's built-in style functions. However, most users do not use styles; generally, they are not even aware that styles exist. And even if every user were to start using them today, there would remain a legacy of unstyled (and therefore unstructured) WP documents.

Most writers rely on the look of the document content to signal structure, through fairly arbitrary choices of font size and weight. It would be very useful to develop algorithms that can analyse (or guess at) the structure of a WP document. This should then lead to a working tool that imposes structure on unstructured WP documents.

The Work

Using a series of unstructured trial documents of varying complexity, develop:

  1. Algorithm(s) that can deduce structure within the document with a reasonable level of accuracy (80-90%). Basic structure includes body text and headings from level 1 to 6 with their associated sections, multi-level lists (both numbered and bulleted), block quotes and definition lists. This may be expanded.

  2. Using the algorithms, a Java/XML-based tool to convert set WP documents from Word or OpenOffice formats to structured (that is, styled) OpenOffice documents, suitable for submission to Dr Ian Barnes' Scholars Workbench (SWB) application.
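
As a very rough illustration of the kind of heuristic item 1 has in mind, the following Java sketch guesses heading levels from font size and weight alone. Everything in it is an assumption made for illustration and is not part of the project or of SWB: the StructureGuesser and Para names, the rule that the font size covering the most text is body text, and the 80-character cut-off for bold headings. It ignores lists, block quotes and definition lists entirely.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: guess heading levels from visual cues only. */
class StructureGuesser {

    /** A paragraph reduced to the cues a reader sees (hypothetical model). */
    static final class Para {
        final String text;
        final double fontSize;
        final boolean bold;

        Para(String text, double fontSize, boolean bold) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
        }
    }

    /** Returns one label per paragraph: 0 = body text, 1..6 = heading level. */
    static List<Integer> guessLevels(List<Para> paras) {
        // Assumption: the font size that covers the most characters is body text.
        Map<Double, Integer> charsPerSize = new HashMap<>();
        for (Para p : paras) {
            charsPerSize.merge(p.fontSize, p.text.length(), Integer::sum);
        }
        double bodySize = 0;
        int bestCount = -1;
        for (Map.Entry<Double, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                bodySize = e.getKey();
            }
        }

        // Distinct sizes larger than body text, largest first: rank 0 -> level 1.
        TreeSet<Double> headingSizes = new TreeSet<>(Comparator.reverseOrder());
        for (Para p : paras) {
            if (p.fontSize > bodySize) {
                headingSizes.add(p.fontSize);
            }
        }
        List<Double> ranked = new ArrayList<>(headingSizes);

        List<Integer> levels = new ArrayList<>();
        for (Para p : paras) {
            int rank = ranked.indexOf(p.fontSize);
            if (rank >= 0) {
                // Larger than body text: level follows size rank, capped at 6.
                levels.add(Math.min(rank + 1, 6));
            } else if (p.bold && p.fontSize == bodySize && p.text.length() <= 80) {
                // Short bold line at body size: one level below the size-based headings.
                levels.add(Math.min(ranked.size() + 1, 6));
            } else {
                levels.add(0);
            }
        }
        return levels;
    }

    public static void main(String[] args) {
        List<Para> doc = List.of(
            new Para("Annual Report", 18, true),
            new Para("Introduction", 14, true),
            new Para("This year we converted a large number of legacy documents to structured XML.", 11, false),
            new Para("Note", 11, true),
            new Para("Unstyled files needed the most manual work before conversion could start.", 11, false));
        System.out.println(guessLevels(doc));
    }
}
```

For the sample document in main, this prints [1, 2, 0, 3, 0], where 0 means body text. A real analyser would need many more cues (spacing, numbering, indentation) to approach the 80-90% accuracy target.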

There may be a scholarship attached to this project.


Lecture slide preservation

Supervisors: Dr Ian Barnes and Dr Peter Raftos

This is a tentative draft, not yet approved and quite open at this stage.

The idea here is to do for lecture slides (PowerPoint and OpenOffice) something similar to what we've already done for text documents (word processor files). The project will involve:

Obviously this is an enormous project, and part of your task will be to choose a reasonable subset of it and document that choice. The project can also be tilted towards research or towards implementation, depending on the course you are taking and your preference.

If you're interested, you might want to look at some or all of the following relevant references:


Any questions or comments to Ian Barnes.