More discussion of the project software, including the high-level architectural design.
Explain some of the high-level system architecture.
Describe the inner workings of some of the crucial components.
The processing of an OpenOffice document follows these steps:
Use the Java library equivalent of the unzip program to extract the file content.xml from the compressed zip archive that is used for ODF – the Open Document Format.
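As a rough illustration of this step, the sketch below uses the standard java.util.zip classes to pull content.xml out of an ODF archive. The class name OdfExtractor and the method readContentXml are hypothetical, and the sketch assumes that content.xml is encoded as UTF-8; the project code may do this differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of the project code.
final class OdfExtractor {
    /** Returns the text of content.xml from the ODF archive at the given path. */
    static String readContentXml(String odfPath) throws IOException {
        try (ZipFile zip = new ZipFile(odfPath)) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                throw new IOException("no content.xml in " + odfPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // Assumption: the entry is UTF-8 text.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }
}
```

Calling OdfExtractor.readContentXml with the path of an .odt file would return the XML as one string, ready to hand to the scanner.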
Use the scanner to break content.xml into a stream of tokens. This process is called lexical analysis and is the first stage in most compilers. As part of this process, irregularities in white space are normalised: any unbroken sequence of spaces, tabs and newlines is replaced with a single space character. What is left is divided into logical chunks, whose granularity varies from scanner to scanner. For this scanner, the chunks are XML tags and the ordinary data (plain text) between them. Only a particular set of XML tags is recognised; the others are ignored.
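The sketch below shows one simple way to do this kind of scanning with a regular expression. It is not the project's scanner: the class name SimpleScanner is hypothetical, tokens are plain strings, and the filtering of unrecognised tags is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scanner sketch, not the project's scanner.
final class SimpleScanner {
    // A token is either a tag (<...>) or a run of text between tags.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");

    static List<String> scan(String xml) {
        // Collapse every run of spaces, tabs and newlines into a single space.
        String normalised = xml.replaceAll("\\s+", " ");
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(normalised);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

For the input "<p>Hello" followed by a mix of spaces, tabs and newlines and then "world</p>", this produces three tokens: the opening tag, the text "Hello world", and the closing tag.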
The stream of tokens from the scanner goes through a process called parsing. The end product of this is a data structure called a parse tree. (See below.) This is a recursive tree data structure representing the structure of the document. The leaf nodes (mostly) contain data (text). The non-leaf nodes are containers that give information about the structural roles and formatting of different parts of the text: headings, paragraphs, quotations, title, author's name, references, footnotes, italicised phrases and so on.
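To make the idea concrete, here is a minimal sketch of turning a token stream into a tree. It uses an explicit stack of open containers, which may not be how the project's parser works. The names ParseSketch and Node and the synthetic "document" root are assumptions. The whole tag content (including any attributes) is kept as the node name, and mismatched closing tags are not checked.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical parser sketch, not the project's parser.
final class ParseSketch {
    /** A container has a tag and children; a data node has text and no tag. */
    static final class Node {
        final String tag;   // null for data nodes
        final String text;  // null for containers
        final List<Node> children = new ArrayList<>();
        Node(String tag, String text) { this.tag = tag; this.text = text; }
    }

    static Node parse(List<String> tokens) {
        Node root = new Node("document", null);  // synthetic root (assumption)
        Deque<Node> open = new ArrayDeque<>();    // containers not yet closed
        open.push(root);
        for (String t : tokens) {
            if (t.startsWith("</")) {
                // Closing tag: finish the innermost open container.
                if (open.size() > 1) open.pop();
            } else if (t.endsWith("/>")) {
                // Self-closing tag: an empty container that is never opened.
                open.peek().children.add(new Node(t.substring(1, t.length() - 2), null));
            } else if (t.startsWith("<")) {
                // Opening tag: new container becomes the current parent.
                Node n = new Node(t.substring(1, t.length() - 1), null);
                open.peek().children.add(n);
                open.push(n);
            } else {
                // Plain text: a leaf data node.
                open.peek().children.add(new Node(null, t));
            }
        }
        return root;
    }
}
```

The leaf nodes end up holding the text, and the containers record which structural element each piece of text belongs to.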
The tree has various problems that need fixing. This is done using the visitor pattern to traverse and modify the tree. The version you will be given fixes:
empty paragraphs (by deleting them),
data nodes that only contain whitespace and that shouldn't really be there (again by deleting them),
paragraphs that contain only hyphens or underscores used for horizontal rules that should be done in a more portable way (again by deleting them),
obscure style names in the text (by building a lookup table to decode them).
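The first two fixes in this list can be sketched with a small visitor, as below. Everything here is illustrative: the class names are hypothetical, the element name "paragraph" is an assumption, and the sketch deletes every whitespace-only data node, which is cruder than deleting only those that shouldn't be there.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical visitor sketch, not the project's visitor classes.
final class VisitorSketch {
    interface Visitor {
        void visitContainer(Container c);
        void visitData(Data d);
    }

    static abstract class Element {
        abstract void accept(Visitor v);
    }

    static final class Container extends Element {
        final String name;
        final List<Element> children = new ArrayList<>();
        Container(String name) { this.name = name; }
        @Override void accept(Visitor v) { v.visitContainer(this); }
    }

    static final class Data extends Element {
        final String text;
        Data(String text) { this.text = text; }
        @Override void accept(Visitor v) { v.visitData(this); }
    }

    /** Deletes whitespace-only data nodes and paragraphs left with no children. */
    static final class Cleaner implements Visitor {
        private boolean delete;  // set by each visit: should the visited node go?

        @Override public void visitData(Data d) {
            delete = d.text.trim().isEmpty();
        }

        @Override public void visitContainer(Container c) {
            for (Iterator<Element> it = c.children.iterator(); it.hasNext();) {
                it.next().accept(this);   // the visit sets the flag for this child
                if (delete) it.remove();
            }
            delete = c.name.equals("paragraph") && c.children.isEmpty();
        }
    }
}
```

Calling accept on the root with a new Cleaner walks the whole tree. A paragraph whose only child was a blank data node loses that child first and is then removed by its own parent.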
Once the tree has been fixed, the program uses another visitor to extract metadata from the document, such as the title and the author's name and affiliation. It writes these out as plain text.
Finally the program converts the tree to three different output file formats.
The second writes out the document in formatted plain text, with indentations and line-filling.
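Line-filling means packing as many words as will fit onto each output line. The sketch below shows a simple greedy version of that idea. The class name LineFiller, the width parameter and the indent string are all assumptions made for illustration, not the project's output code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical line-filling sketch, not the project's output writer.
final class LineFiller {
    /** Breaks text into indented lines no longer than width where possible. */
    static List<String> fill(String text, int width, String indent) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder(indent);
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) continue;  // only happens for empty input
            boolean lineHasWords = line.length() > indent.length();
            // Start a new line if adding " word" would overflow the width.
            if (lineHasWords && line.length() + 1 + word.length() > width) {
                lines.add(line.toString());
                line = new StringBuilder(indent);
                lineHasWords = false;
            }
            if (lineHasWords) line.append(' ');
            line.append(word);
        }
        if (line.length() > indent.length()) lines.add(line.toString());
        return lines;
    }
}
```

A word longer than the available width is placed on a line of its own rather than being split.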
This is a UML class diagram giving some indication of the relationships between the classes that make up the parse tree structure. Each rectangle represents a class. Links with a triangle on them are inheritance, those with a diamond are aggregation. So the link on the left says something like: "An XML Container Element has a number of XML Elements (its children)." The top link says "An XML Element is either an XML Container Element or an XML Data Element".

This is an object diagram indicating the relationships between some objects making up part of the parse tree for a document. Each rectangle represents an object in a runtime structure. There are several objects of class XML_CONTAINER_ELEMENT and several of class XML_DATA_ELEMENT. Each data element has a string containing the actual data it represents.
These objects represent the subtree corresponding to the XML input:
<paragraph>Plain <bold>bold</bold> <italic>italic</italic> plain.</paragraph>
which corresponds (approximately) to the content
Plain bold italic plain.
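The class and object structure described here can be sketched in Java as follows. The names XmlElement, XmlContainerElement and XmlDataElement are Java-style stand-ins for the classes in the diagrams, and the text() method is a hypothetical addition used only to show that the tree holds the expected content.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the tree classes, not the project's own.
final class TreeSketch {
    /** Common parent: an element is either a container or a data element. */
    static abstract class XmlElement {
        abstract String text();  // the plain text beneath this node
    }

    /** Aggregation: a container has a number of elements as its children. */
    static final class XmlContainerElement extends XmlElement {
        final String name;
        final List<XmlElement> children = new ArrayList<>();
        XmlContainerElement(String name, XmlElement... kids) {
            this.name = name;
            children.addAll(List.of(kids));
        }
        @Override String text() {
            StringBuilder sb = new StringBuilder();
            for (XmlElement child : children) sb.append(child.text());
            return sb.toString();
        }
    }

    /** A leaf holding the actual string data. */
    static final class XmlDataElement extends XmlElement {
        final String data;
        XmlDataElement(String data) { this.data = data; }
        @Override String text() { return data; }
    }

    /** Builds the subtree for the simplified paragraph example. */
    static XmlElement example() {
        return new XmlContainerElement("paragraph",
            new XmlDataElement("Plain "),
            new XmlContainerElement("bold", new XmlDataElement("bold")),
            new XmlDataElement(" "),
            new XmlContainerElement("italic", new XmlDataElement("italic")),
            new XmlDataElement(" plain."));
    }
}
```

Calling TreeSketch.example().text() concatenates the data elements in order and gives back "Plain bold italic plain."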

(This diagram is a simplification in a few ways. Firstly, there should be separate array objects organising the children of the paragraph, bold and italic elements. Secondly, the XML isn't exactly what you'd see in the content.xml file. There are no elements of type Bold or Italic; instead those would both be Spans, with the special formatting determined by attributes. Instead of <italic>, you would really see <text:span text:style-name="T1">italic</text:span>. Finally, the Paragraph and Span elements are really just objects of class XML_CONTAINER_ELEMENT, each with an appropriate strategy object: PARAGRAPH_STRATEGY for the paragraph, BOLD_STRATEGY for the bold span and ITALIC_STRATEGY for the italic span.)
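The last point, one container class whose behaviour is supplied by a strategy object, might look roughly like the sketch below. The ElementStrategy interface, its wrap method and the markers each strategy emits are invented for illustration; only the idea of pairing a single container class with paragraph, bold and italic strategies comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical strategy sketch, not the project's strategy classes.
final class StrategySketch {
    /** Decides how a container's content is written out (assumed interface). */
    interface ElementStrategy {
        String wrap(String content);
    }

    static final class ParagraphStrategy implements ElementStrategy {
        @Override public String wrap(String content) { return content + "\n"; }
    }
    static final class BoldStrategy implements ElementStrategy {
        @Override public String wrap(String content) { return "*" + content + "*"; }
    }
    static final class ItalicStrategy implements ElementStrategy {
        @Override public String wrap(String content) { return "_" + content + "_"; }
    }

    interface Node { String render(); }

    static final class Data implements Node {
        private final String text;
        Data(String text) { this.text = text; }
        @Override public String render() { return text; }
    }

    /** One container class for every element kind; only the strategy differs. */
    static final class Container implements Node {
        private final ElementStrategy strategy;
        private final List<Node> children = new ArrayList<>();
        Container(ElementStrategy strategy, Node... kids) {
            this.strategy = strategy;
            children.addAll(List.of(kids));
        }
        @Override public String render() {
            StringBuilder sb = new StringBuilder();
            for (Node child : children) sb.append(child.render());
            return strategy.wrap(sb.toString());
        }
    }
}
```

With this arrangement, supporting a new kind of element means writing a new strategy class rather than a new subclass of the container.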
Copyright © 2006, 2007, 2009 Ian Barnes, Chris Johnson, The Australian National University
$Revision: 1.1 $ $Date: 2009/03/19 02:16:16 $ $Author: cwj $
Feedback & Queries to
comp2100@cs.anu.edu.au