An introduction to the OOPS program and assignment project, including discussion of XML.
Introduce the project software
Give a quick introduction to XML
This project was largely the idea of Tom Worthington.
Academic journals now operate electronically, and articles are published both on paper and on the web.
Some journals ask authors for "camera-ready" copy, perhaps in the form of a PDF file, and simply publish that.
Better quality journals want to do more. They want to reformat articles so they can print them all in the journal in their house style, change the page size, add headers and footers, page numbers. They also want to publish them on the web with their own custom navigation, look-and-feel.
This means doing non-trivial processing.
Many journals, like those of the ACM (Association for Computing Machinery), provide authors with a Microsoft Word template to use for preparing their article. Authors then send in the completed Word file. The journal has to process it.
The software we will be working on this semester is a prototype for a system that will solve part of this problem.
A preflight system is one where the author can submit their file and then see what the finished article will look like in the journal. If it looks wrong (for example the software mixes up the title and the author's name) then the author can edit the file and resubmit.
Star Office is Sun's competitor for Microsoft's office software.
The Open Office project is an open-source office software project that started from the same code base but is now separate. The two separated a few years ago. Although Sun want to charge for their product, Open Office will always be free.
These are large, clunky pieces of software: 120 MB download!
They are difficult to use, overly complicated, don't always do what you think they should...
Open Office (and recent versions of Star Office) use an open file format based on XML.
The Open Office project includes input conversion filters that can read Microsoft Word/Office files and convert them into the Open Office XML-based format.
The projects for the last three years have been based around a program that can read an Open Office file, understand it, view it on screen and convert it to plain old ASCII text and HTML. That program was written in Eiffel. Alexei Khorev has been translating the core code into Java over the summer.
I have a prototype Eiffel program (of about 2500 lines of code) that can read and parse Open Office files. The new Java program will be able to do the same.
Open Office files are stored as a zip archive containing several XML files together with subdirectories for images and other stuff. (Demonstration?)
The program can reformat the XML making its tree structure easier to understand. (Demonstration?)
It can also reformat it as plain text. (Demonstration?)
It can also reformat as clean, simple HTML. (Demonstration?)
This is a good start.
Some or all of these might be parts of assignments in COMP2100 this semester.
Rewrite the tricky scanner stage using the Decorator Pattern, a standard part of the Java idiom.
Expand the program's understanding of Open Office XML so that it understands the special features of files created using the ACM article template. In particular the program needs to be able to extract metadata like the title, author's name etc.
Recognise and re-order material like the copyright notice so that output stages can place it in the correct position.
Work on a graphical user interface for the program.
Add calls to the Open Office input conversion filters, allowing the program to read Microsoft files directly. (At the moment we have to open them in Open Office and save Word documents in Open Office format, but OpenOffice has a Java API that lets Java programs use a running instance of OpenOffice as a server, so our program can potentially work directly from Word documents.)
Add more output formats: PDF? LaTeX? for high-quality printed output.
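The Decorator Pattern mentioned in the first task above can be sketched as follows. This is only an illustration of the idea, not the OOPS scanner: the interface `TokenSource` and the classes `ListSource`, `SkipBlank` and `CollapseSpaces` are hypothetical names made up for this example. Each decorator wraps another token source and adds one small piece of behaviour, in the same way that `BufferedReader` wraps another `Reader` in `java.io`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class DecoratorSketch {
    /** Common interface shared by the base source and every decorator (hypothetical). */
    interface TokenSource {
        String next(); // returns null when there are no more tokens
    }

    /** Base component: hands out tokens from a fixed list. */
    static class ListSource implements TokenSource {
        private final Iterator<String> it;
        ListSource(List<String> tokens) { this.it = tokens.iterator(); }
        public String next() { return it.hasNext() ? it.next() : null; }
    }

    /** Decorator: skips tokens that contain nothing but whitespace. */
    static class SkipBlank implements TokenSource {
        private final TokenSource inner;
        SkipBlank(TokenSource inner) { this.inner = inner; }
        public String next() {
            String t = inner.next();
            while (t != null && t.trim().isEmpty()) {
                t = inner.next();
            }
            return t;
        }
    }

    /** Decorator: collapses each run of whitespace inside a token to one space. */
    static class CollapseSpaces implements TokenSource {
        private final TokenSource inner;
        CollapseSpaces(TokenSource inner) { this.inner = inner; }
        public String next() {
            String t = inner.next();
            return t == null ? null : t.replaceAll("\\s+", " ");
        }
    }

    /** Stacks both decorators on a small list and joins the surviving tokens with '|'. */
    static String demo() {
        TokenSource source = new CollapseSpaces(new SkipBlank(new ListSource(
                Arrays.asList("<p>", " \n ", "Hello   world", "</p>"))));
        List<String> seen = new ArrayList<>();
        for (String t = source.next(); t != null; t = source.next()) {
            seen.add(t);
        }
        return String.join("|", seen);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because every decorator implements the same interface as the thing it wraps, the stages can be stacked in any order or left out without changing the code that consumes the tokens.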
These might be part of later work, by me or others. Anyone interested in doing an individual project (COMP3700 or honours)?
Add an editing capability, turning the program into a basic word processor.
Add an understanding of DTDs (or XML schemas) so that users can define document templates.
Add an understanding of XSL style sheets so that users can create style sheets for conversion and formatting of their documents. Perhaps a graphical way to specify formatting that then writes a style sheet?
Add more serious word processing capabilities, while trying to stay lean and mean, so that it becomes a genuinely useful tool.
Turn the file conversion part into a web service. (Nearly done - we're working on it.)
Turn the preflight system into a web service, so that authors can submit and check their article through the ACM web site. (We have made some progress on this.)
This section is based on Ramesh Sankaranarayana's introductory XML lecture from COMP3410 Information Technology in Electronic Commerce in 2000.
XML = eXtensible Markup Language.
Unlike HTML you can create your own tags.
Syntax rules are stricter.
Separate presentation from content.
Structured data.
1969 IBM developed Generalised Markup Language (GML), used for most IBM documents.
1974 Standard Generalised Markup Language (SGML), now used for a great deal of serious publishing work. It is complex but extremely powerful.
1990s Hyper Text Markup Language (HTML) used for web pages. Unlike the other markup languages, the set of tags is fixed, and describes the appearance more than the structure or meaning of the content.
1998 XML 1.0 attempts to combine the flexibility and power of SGML with the simplicity of HTML.
Usually containers.
Start with a start tag e.g. “<name>”.
Must end with an end tag e.g. “</name>”.
Content may be data (text, characters) or other elements.
End tags may never be omitted (unlike in HTML).
Empty elements can be shortened to e.g. “<name/>”.
Start tags (and empty elements) may have attributes.
These are name-value pairs.
Syntax is “name="value"”. Values must be quoted (unlike in HTML).
For example “<text:paragraph style-name="Standard">”.
An element may have any number of attributes. Order isn't important.
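As a small illustration of attributes, the sketch below parses the example element and looks up an attribute by name. It uses the DOM parser that ships with Java rather than our own scanner, and the class and method names (`AttributeSketch`, `rootAttribute`) as well as the extra `id` attribute are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AttributeSketch {
    /** Parses an XML string and returns the named attribute of its root element. */
    public static String rootAttribute(String xml, String attrName) throws Exception {
        // The default factory is not namespace-aware, so "text:paragraph"
        // is treated as an ordinary element name.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();
        return root.getAttribute(attrName); // empty string if the attribute is absent
    }

    public static void main(String[] args) throws Exception {
        // Same two attributes, written in a different order.
        String a = "<text:paragraph style-name=\"Standard\" id=\"p1\">Hello</text:paragraph>";
        String b = "<text:paragraph id=\"p1\" style-name=\"Standard\">Hello</text:paragraph>";
        System.out.println(rootAttribute(a, "style-name"));
        System.out.println(rootAttribute(b, "style-name"));
    }
}
```

Attributes are looked up by name, so both orderings give the same answer.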
Special characters can be specified using entities.
In particular, the less than (<), greater than (>) and ampersand (&) signs have special meaning in XML, so if you want them in the content, you have to use an entity:
| Entity | Character |
|---|---|
| &lt; | < |
| &gt; | > |
| &amp; | & |
You can also specify a special character using an entity and its character code: e.g. the special character ‘é’ (the letter ‘e’ with an acute accent on it) has character code 233. You can get it with the entity “&#233;”.
This only scratches the surface of entities and special characters, but it's enough for now.
If a document obeys the syntax rules then it is well-formed.
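An output stage that writes XML or HTML has to do this replacement itself. Here is a minimal sketch, assuming we want the three named entities for the special characters and numeric character codes for anything outside plain ASCII. The class and method names are made up for this example.

```java
public class EntitySketch {
    /** Replaces characters that need an entity in XML content. */
    public static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // full character code, even outside the 16-bit range
            i += Character.charCount(cp);
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (cp > 127) {
                // Numeric form: "&#" + decimal character code + ";"
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("caf\u00e9 & <tea>"));
    }
}
```

The ampersand has to be handled in the same pass as the other characters. Escaping it in a separate later step would also mangle the entities that had just been written.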
To make serious use of structured information, we need to specify which elements can (or must) go inside which others, and what attributes the different elements may have (and the types of their values).
This is done with a Document Type Definition (DTD).
A document that conforms to a particular DTD is said to be valid.
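The difference between the two is easy to check in practice. The sketch below tests only well-formedness, using the XML parser that ships with Java. Checking validity would also need the DTD, which is not shown here. The class and method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedSketch {
    /** Returns true if the text obeys the XML syntax rules, false otherwise. */
    public static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a syntax rule was broken somewhere
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<name>Ian</name>"));        // properly nested
        System.out.println(isWellFormed("<name><first>Ian</name>")); // missing end tag
        System.out.println(isWellFormed("<name id=1/>"));            // unquoted attribute value
    }
}
```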
Part of the Open Office project is the development of an XML DTD for describing office documents.
The description of this format is over 500 pages long.
We will only use a tiny fraction of it.
An Open Office document is stored as several XML files, bundled into one binary file using the zip archiving and compression utility (same as PKZIP on Windows machines).
To open an Open Office file, we have to:
Unzip it and extract the XML file we want (content.xml).
Separate it into meaningful chunks called tokens, smoothing over irrelevant details like extra spaces, line endings and so on. This is called lexical analysis or just scanning.
Work through the XML keeping track of the nesting of elements and building up a tree that represents the structure of the document. This is called parsing.
Once we have the parse tree we can write code to traverse it in different ways, extracting, processing or modifying the information stored in it.
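The steps above can be sketched in Java as follows. This is a rough illustration under some assumptions, not the project code: it lets the DOM parser that ships with Java do the scanning and parsing, where our program has its own scanner and parser, and it builds a tiny stand-in archive in memory so that it runs without a real Open Office file. The element names in the sample (`doc`, `title`, `para`) are invented and are not the real Open Office vocabulary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class OpenDocumentSketch {
    /** Finds content.xml inside the zip archive and parses it into a tree. */
    public static Document readContent(InputStream archive) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("content.xml")) {
                    continue;
                }
                // Copy the entry into memory before handing it to the parser.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buf.toByteArray()));
            }
        }
        throw new IllegalArgumentException("no content.xml in archive");
    }

    /** Depth-first traversal: one indented line per element or non-blank text node. */
    public static void outline(Node node, int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append(node.getNodeName()).append('\n');
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.append(indent).append('"').append(text).append("\"\n");
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            outline(children.item(i), depth + 1, out);
        }
    }

    /** Builds a stand-in archive with invented content (not a real Open Office file). */
    public static byte[] sampleArchive() throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write("<doc><title>OOPS</title><para>Hello</para></doc>"
                    .getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = readContent(new ByteArrayInputStream(sampleArchive()));
        StringBuilder out = new StringBuilder();
        outline(doc.getDocumentElement(), 0, out);
        System.out.print(out);
    }
}
```

The `outline` method is one example of a traversal. A plain text or HTML output stage would walk the same tree and emit different text at each node.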
COMP3410 Information Technology in Electronic Commerce. See particularly the lectures on XML from 2000 and 2001.
A Technical Introduction to XML by Norman Walsh, one of the major figures in the development of XML.
The Official XML 1.0 Specification from the W3C. (A difficult document to read, but all the information is there - more than you will ever need.)
XML Basics, an article from Linux Mag that gives a good gentle introduction to XML.
Copyright © 2005, 2007, 2009 Ian Barnes, Chris Johnson, The Australian
National University
$Revision: 1.1 $ $Date: 2009/03/19 02:15:01 $ $Author: cwj $
Feedback & Queries to
comp2100@cs.anu.edu.au