[ANU] [DCS] [COMP2100/2500] [Description] [Schedule] [Lectures] [Labs] [Homework] [Assignments] [COMP2500] [Assessment] [PSP] [Java] [Reading] [Help]
COMP2100/2500
Lecture 15: Project II
Summary
More discussion of the project software, including the high-level architectural design.
Aims
Explain some of the high-level system architecture.
Describe the inner workings of some of the crucial components.
1. The process
The processing of an OpenOffice document follows these steps:
Use the unzip program to extract the file content.xml from the compressed zip archive.
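A minimal sketch of this extraction step, using Python's standard zipfile module in place of the unzip program. The names extract_content and make_demo_archive are illustrative assumptions, not part of the project software, and the demo archive is only a stand-in for a real document.

```python
import io
import zipfile


def extract_content(archive, member="content.xml"):
    """Return the bytes of `member` from a zip archive (a path or a file-like object)."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(member)


def make_demo_archive():
    """Build a tiny in-memory archive standing in for an OpenOffice document."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("content.xml", "<paragraph>Hello</paragraph>")
        zf.writestr("meta.xml", "<meta/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(extract_content(make_demo_archive()).decode("utf-8"))
```

Only content.xml is read; the other members of the archive are left alone.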
Use the scanner to break content.xml into a stream of tokens. This process is called lexical analysis and is the first stage in any compiler. As part of this process, irregularities in white space are normalised: any unbroken sequence of spaces, tabs and newlines is replaced with a single space character. What's left is divided up into logical chunks (the tokens). How fine-grained the chunks are depends on the scanner. For this scanner, the chunks are XML tags and the ordinary data (plain text) between them.
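The scanning step might look roughly like the sketch below. It is not the project's scanner: the function name scan and the (kind, value) token shape are assumptions made for illustration, and the regular expression ignores complications such as comments, CDATA sections and a ">" inside an attribute value.

```python
import re

# A tag runs from "<" to the next ">"; data is any run of characters without "<".
TOKEN = re.compile(r"<[^>]*>|[^<]+")


def scan(text):
    """Return a list of (kind, value) tokens, where kind is "tag" or "data"."""
    # Normalise white space: each run of spaces, tabs and newlines becomes one space.
    text = re.sub(r"[ \t\n]+", " ", text)
    tokens = []
    for match in TOKEN.finditer(text):
        chunk = match.group()
        kind = "tag" if chunk.startswith("<") else "data"
        tokens.append((kind, chunk))
    return tokens


if __name__ == "__main__":
    for token in scan("<paragraph>Plain \n\t <bold>bold</bold></paragraph>"):
        print(token)
```

Normalising the white space before splitting means that the data token "Plain " ends in exactly one space, however the input was laid out.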
The stream of tokens from the scanner goes through a process called parsing. The end product of this is a data structure called a parse tree. (See below.) This is a recursive tree data structure representing the structure of the document. The leaf nodes (mostly) contain data (text). The non-leaf nodes are containers that give information about the structural roles and formatting of different parts of the text: headings, paragraphs, quotations, title, author's name, references, footnotes, italicised phrases and so on.
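One common way to turn such a token stream into a tree is to keep a stack of the containers that are currently open. The sketch below assumes tokens in a hypothetical (kind, value) form and uses plain dictionaries and strings instead of the project's tree classes. It ignores attributes and assumes well-formed tags with no XML declaration or comments.

```python
def parse(tokens):
    """Build a tree from (kind, value) tokens.

    A container node is a dict {"name": ..., "children": [...]};
    a leaf (data) node is a plain string.
    """
    root = {"name": "#document", "children": []}
    stack = [root]  # the open containers, innermost last
    for kind, value in tokens:
        if kind == "data":
            stack[-1]["children"].append(value)
        elif value.startswith("</"):
            # Closing tag: it must match the innermost open container.
            name = value[2:-1].strip()
            if len(stack) == 1 or stack[-1]["name"] != name:
                raise ValueError("unexpected closing tag " + value)
            stack.pop()
        else:
            # Opening tag, possibly self-closing like <name/>.
            inner = value[1:-1].strip()
            self_closing = inner.endswith("/")
            name = inner.rstrip("/").split()[0]
            node = {"name": name, "children": []}
            stack[-1]["children"].append(node)
            if not self_closing:
                stack.append(node)
    if len(stack) != 1:
        raise ValueError("unclosed tag <" + stack[-1]["name"] + ">")
    return root


# Hand-written tokens for a small paragraph with one nested element.
EXAMPLE_TOKENS = [
    ("tag", "<paragraph>"), ("data", "Plain "),
    ("tag", "<bold>"), ("data", "bold"), ("tag", "</bold>"),
    ("data", " plain."), ("tag", "</paragraph>"),
]

if __name__ == "__main__":
    print(parse(EXAMPLE_TOKENS))
```

The data strings end up as leaves and every tag pair becomes a container, which is the shape described above.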
The tree has various problems that need fixing. This is done using the visitor pattern to traverse and modify the tree. The version you will be given fixes:
empty paragraphs (by deleting them),
data nodes that only contain whitespace and that shouldn't really be there (again by deleting them),
paragraphs that contain only hyphens or underscores, typed as makeshift horizontal rules that should be done in a more portable way (again by deleting them),
footnotes in the middle of the document (by moving them to the end), and
obscure style names in the text (by building a lookup table to decode them).
You will add to this a new reorganisation of the parse tree that "deepens the structure" by creating extra container nodes representing sections and subsections.
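As a rough illustration of how a visitor can traverse and modify the tree, the sketch below implements only the first fix in the list, deleting empty paragraphs. The class and method names (Data, Container, EmptyParagraphRemover, accept, visit_data, visit_container) are assumptions for this example and are not the names used in the project code. Treating a paragraph as empty when it has no children is also an assumption.

```python
class Data:
    """Leaf node holding text."""

    def __init__(self, text):
        self.text = text

    def accept(self, visitor):
        return visitor.visit_data(self)


class Container:
    """Non-leaf node with a name and a list of children."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])

    def accept(self, visitor):
        return visitor.visit_container(self)


class EmptyParagraphRemover:
    """Each visit_* method returns True if the visited node should be kept."""

    def visit_data(self, node):
        return True  # this visitor never deletes data nodes

    def visit_container(self, node):
        # Visit the children first, keeping only those that ask to be kept.
        node.children = [child for child in node.children if child.accept(self)]
        # A paragraph with no children left is dropped by its parent.
        return not (node.name == "paragraph" and not node.children)


if __name__ == "__main__":
    body = Container("body", [
        Container("paragraph"),
        Container("paragraph", [Data("Some text.")]),
    ])
    body.accept(EmptyParagraphRemover())
    print(len(body.children))
```

The tree classes know nothing about the fix itself; they only hand control to the visitor, so each further fix can be written as a separate visitor class without touching them.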
Once the tree has been fixed, the program uses another visitor to extract metadata from the document, such as the title and the author's name and affiliation.
Finally the program converts the tree to three different output file formats. The first just writes the document out in XML again, but with the tree structure represented by indentation. At the moment this doesn't write a valid XML file. You will modify this so that it does.
The second and third output files are written by visitors. One writes out the document in formatted plain text. The other writes HTML. You will modify these classes to use an OUTPUT_FORMATTER like the one you have been testing in Assignment 1. You will also have to modify that class, since it doesn't yet provide all the necessary features.
2. Class diagram for the tree
This is a UML class diagram giving some indication of the relationships between the classes that make up the parse tree structure. Each rectangle represents a class. Links with a triangle on them denote inheritance; those with a diamond denote aggregation. So the link on the left says something like: "An XML Container Element has a number of XML Elements (its children)." The top link says: "An XML Element is either an XML Container Element or an XML Data Element."
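The two kinds of link can be mirrored in code roughly as follows. This is a sketch, not the project's classes: the Python class names are adapted from the diagram, the text method is a hypothetical feature added so that there is something to call, and making the parent class abstract is an assumption based on the "either ... or" reading of the top link.

```python
from abc import ABC, abstractmethod


class XmlElement(ABC):
    """Common parent: every node in the tree is an XML element."""

    @abstractmethod
    def text(self):
        """Return all the text at or below this node."""


class XmlDataElement(XmlElement):
    """Inheritance link: a data element is an element. It holds a string."""

    def __init__(self, data):
        self.data = data

    def text(self):
        return self.data


class XmlContainerElement(XmlElement):
    """Inheritance link: a container is an element.

    Aggregation link: it has any number of XmlElement children.
    """

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def text(self):
        return "".join(child.text() for child in self.children)


if __name__ == "__main__":
    heading = XmlContainerElement("heading", [XmlDataElement("The process")])
    print(heading.text())
```

Because children are typed only as elements, a container can hold data elements and other containers alike, which is what makes the structure recursive.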
3. Object diagram for the tree
This is an object diagram indicating the relationships between some objects making up part of the parse tree for a document. Each rectangle represents an object in a runtime structure. There are several objects of class XML_CONTAINER_ELEMENT and several of class XML_DATA_ELEMENT. Each data element has a string containing the actual data it represents.
These objects represent the subtree corresponding to the XML input:
<paragraph>Plain <bold>bold</bold> <italic>italic</italic> plain.</paragraph>
(This diagram is a simplification in a few ways. Firstly, there should be separate array objects organising the children of the paragraph, bold and italic elements. Secondly, the XML isn't exactly what you'd see in the content.xml file. There are no elements of type Bold or Italic; those would both be Spans, with the special formatting determined by attributes. Instead of <italic>italic</italic>, you would really see <text:span text:style-name="T1">italic</text:span>. Finally, the Paragraph and Span elements are really just objects of class XML_CONTAINER_ELEMENT, each with an appropriate strategy object of type PARAGRAPH_STRATEGY, BOLD_STRATEGY or ITALIC_STRATEGY.)
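The last point, a single container class whose role comes from a strategy object, can be sketched as follows. Everything here is an assumption made for illustration: what the real strategy classes do is not described above, so this version simply has each strategy wrap its content in HTML-like tags. The style name "T2" for bold is invented, and the real lookup table is built from the document rather than written by hand. No escaping of special characters is attempted.

```python
class Strategy:
    """Hypothetical base: wraps already-rendered content in role-specific markup."""
    open_tag = ""
    close_tag = ""

    def render(self, inner):
        return self.open_tag + inner + self.close_tag


class ParagraphStrategy(Strategy):
    open_tag, close_tag = "<p>", "</p>"


class BoldStrategy(Strategy):
    open_tag, close_tag = "<b>", "</b>"


class ItalicStrategy(Strategy):
    open_tag, close_tag = "<i>", "</i>"


class Data:
    def __init__(self, text):
        self.text = text

    def render(self):
        return self.text


class Container:
    """One class for every container; the strategy object supplies the role."""

    def __init__(self, strategy, children):
        self.strategy = strategy
        self.children = list(children)

    def render(self):
        return self.strategy.render("".join(c.render() for c in self.children))


# Assumed lookup table decoding obscure style names into strategies.
STYLE_STRATEGIES = {"T1": ItalicStrategy, "T2": BoldStrategy}


def span(style_name, children):
    """Build a span container whose strategy is chosen by its style name."""
    return Container(STYLE_STRATEGIES[style_name](), children)


def example_paragraph():
    return Container(ParagraphStrategy(), [
        Data("Plain "), span("T2", [Data("bold")]), Data(" "),
        span("T1", [Data("italic")]), Data(" plain."),
    ])


if __name__ == "__main__":
    print(example_paragraph().render())
```

Supporting a new kind of formatting then means adding a strategy class and a table entry, with no change to the container class.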
Copyright © 2004, Ian Barnes, The Australian National University
Version 2004.2, Wednesday, 31 March 2004, 09:47:31 +1000
Feedback & Queries to
comp2100@cs.anu.edu.au