Last updated: Wednesday 24 January 2007, 3:58 PM
Supervisors: Dr Ian Barnes and Dr Peter Raftos
An important part of current research into the long-term preservation and accessibility of word processor (WP) documents is the conversion of (primarily) Word and OpenOffice files to structured XML (usually DocBook-XML).
This work assumes that the WP documents have already been structured in a systematic manner using the WP software's inbuilt style functions. However, most users do not use styles; generally, they are not even aware they exist. And even if every user were to start using them from today, there remains a legacy of unstyled (and therefore unstructured) WP documents.
Most writers rely on the look of document content to indicate structure: reasonably arbitrary choices of font size and weighting. It would be very useful to develop algorithms that could analyse (or guess at) the structure of a WP document. This should then lead to the development of a working tool for structuring unstructured WP documents.
Using a series of unstructured trial documents of varying complexity, develop:
Algorithm(s) that can deduce structure within the document with a reasonable level of accuracy (80-90%). Basic structure includes body text and headings from level 1 to 6, as well as associated sections, multi-level lists (both numbered and bulleted), block quotes and definition lists. This may be expanded.
Using the algorithms, a Java/XML-based tool to convert a set of WP documents from Word or OpenOffice formats to structured (that is, styled) OpenOffice documents, suitable for submission to Dr Ian Barnes' Scholars Workbench (SWB) application.
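As a rough illustration of the kind of heuristic the first deliverable might start from, here is a minimal Java sketch that guesses heading levels from font size and weight alone. Everything in it is an assumption made for illustration: the `Para` class, the rule that the font size carrying the most characters is body text, and the 80-character cut-off for short bold headings. It does not attempt lists, block quotes or definition lists, and it says nothing about how paragraphs would be extracted from a Word or OpenOffice file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class StructureGuesser {

    /** Hypothetical minimal view of a paragraph: its text and dominant formatting. */
    static final class Para {
        final String text;
        final double fontSize; // in points
        final boolean bold;

        Para(String text, double fontSize, boolean bold) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
        }
    }

    /** Assumption: the font size that carries the most characters is body text. */
    static double bodySize(List<Para> paras) {
        Map<Double, Integer> charsPerSize = new HashMap<>();
        for (Para p : paras) {
            charsPerSize.merge(p.fontSize, p.text.length(), Integer::sum);
        }
        double best = 0;
        int bestCount = -1;
        for (Map.Entry<Double, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Returns one guessed role per paragraph: "h1".."h6" or "body". */
    static List<String> guess(List<Para> paras) {
        double body = bodySize(paras);

        // Distinct sizes larger than body text, largest first: these become h1, h2, ...
        TreeSet<Double> larger = new TreeSet<>();
        for (Para p : paras) {
            if (p.fontSize > body) {
                larger.add(p.fontSize);
            }
        }
        List<Double> ranked = new ArrayList<>(larger.descendingSet());

        List<String> roles = new ArrayList<>();
        for (Para p : paras) {
            int rank = ranked.indexOf(p.fontSize);
            if (rank >= 0) {
                roles.add("h" + Math.min(rank + 1, 6));
            } else if (p.bold && p.fontSize == body && p.text.length() < 80) {
                // Short bold line at body size: treat as the next heading level down.
                roles.add("h" + Math.min(ranked.size() + 1, 6));
            } else {
                roles.add("body");
            }
        }
        return roles;
    }

    public static void main(String[] args) {
        List<Para> doc = Arrays.asList(
            new Para("Introduction", 18, true),
            new Para("Most users format documents by eye rather than with styles.", 12, false),
            new Para("Background", 14, true),
            new Para("Legacy documents therefore carry no explicit structure at all.", 12, false),
            new Para("Note", 12, true));
        System.out.println(guess(doc)); // prints [h1, body, h2, body, h3]
    }
}
```

A heuristic this simple would be expected to fail on documents where headings share the body font size and weight, which is one reason the trial documents need to vary in complexity.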
There may be a scholarship attached to this project.
Supervisors: Dr Ian Barnes and Dr Peter Raftos
This is a tentative draft, not yet approved and quite open at this stage.
The idea here is to do for lecture slides (PowerPoint and OpenOffice) something similar to what we've already done for text documents (word processor files). The project will involve:
looking at the internals of the various formats available and understanding and documenting their strengths and weaknesses from the point of view of preservation and interoperability;
picking out the essential features and those that are best discarded (e.g. maybe we want to do away with, or at least relax, the choice of specific fonts, although there are problems with this...);
choosing a preservation format;
building or adapting or (if you're really lucky) choosing software to perform the conversion from the various formats in common use to the chosen preservation format.
Obviously this is an enormous project and part of your task will be to choose a reasonable subset of it and document that choice. Clearly also the project can be tilted in the direction of research or in the direction of implementation, depending on the course you are taking and your preference.
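For the first step in the list above (looking at format internals), one possible starting point is simply listing what a presentation package contains. OpenDocument presentation files (.odp) are zip archives whose parts include `mimetype`, `content.xml` and `styles.xml`, so the standard `java.util.zip` classes can enumerate them. The sketch below is illustrative only: `PackageLister` and `sampleArchive` are hypothetical names, and the sample archive is a tiny in-memory stand-in rather than a real slide deck. Binary PowerPoint (.ppt) files are not zip archives and would need a different approach.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class PackageLister {

    /** Lists the entry names of a zip-based document package, in archive order. */
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Builds a tiny stand-in archive in memory so the sketch runs without a real file. */
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (String name : new String[] {"mimetype", "content.xml", "styles.xml"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("placeholder for " + name).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // With a real file, pass a FileInputStream for the .odp instead.
        System.out.println(listEntries(new ByteArrayInputStream(sampleArchive())));
    }
}
```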
If you're interested, you might want to look at some or all of the following relevant references:
Dave Raggett's recent conference paper describing HTML Slidy and suggesting some interesting future directions, including PowerPoint conversion filters using OpenOffice.org and synchronising with audio.
Peter Sefton's blog post and derived Slidy presentation describing experimental work integrating Slidy and ICE.
Any questions or comments to Ian Barnes.