[ANU] [DCS] [COMP2100/2500] [Description] [Schedule] [Lectures] [Labs] [Homework] [Assignments] [COMP2500] [Assessment] [PSP] [Java] [Reading] [Help]
COMP2100/2500
Assignment 1: The Parse TreeDue at 12 midday on Friday 8th April 2005
This assignment is to be done in pairs. You must find a partner to work with. If you can't find a partner, email comp2100@cs.anu.edu.au and I will find you one.
Introduction
In this assignment, you and your partner will work on the new Java port of Oops, the Open Office Publication System. At the moment this program reads an Open Office word processor file (converted from a Microsoft Word file that was prepared using the ACM journal template), scans and parses its XML content, builds a parse tree representing the logical structure of the document, and then writes:
a .txt file containing a formatted plain text version of the document,
a .xml file that shows the XML source with indentation to indicate the tree structure of the document.
Your assignment is to extend the program toward the eventual goal of making it a useful publication system.
You are to carry out the following tasks:
Add a new visitor class called TreeFixer that traverses the tree removing all paragraphs that consist only of hyphens or underscore characters. (The ACM template uses these to imitate horizontal rules. We don't want them in our tree or our output.)
Improve the way the TextRenderer handles ordered (numbered) lists by making it understand and implement the ‘text:continue-numbering’ attribute. An ordered list with text:continue-numbering="true" should not have its items numbered starting from 1, but from the next number after the last item of the previous ordered list.
Add a new visitor called StyleDecoder that goes over the tree looking at style information and extracts some of this information (about parent styles) into a lookup table.
Add a new visitor called MetadataExtractor that goes over the tree and uses the style lookup table to identify, extract and print document metadata such as might be used to compile a table of contents for a journal.
Add an HTMLRenderer that operates in a similar fashion to the TextRenderer but instead writes a simple, well-formatted HTML version of the document.
(Short but tricky) Fix the mistakes the TextRenderer makes with spaces around ‘words’. It always adds them; this is incorrect. Work out how to do it right.
(Hard) The scanner has some problems. It makes some mistakes with white space, removing some data tokens that it shouldn't. It is also poorly designed and inefficient. Rewrite the scanner using the Decorator pattern, and make it deal with white space correctly. This will probably mean leaving the question of removing white-space-only data elements to a later stage, in which case you should add code to TreeFixer to accomplish this task.
1. Write a tree fixer
If you run the program and look at the .xml file you will see that there are some paragraph elements in the tree that contain a single data element whose content consists only of hyphens (or only of underscore characters). These are a nuisance. They are meant to represent a horizontal line on the page, but they're not the right way to do it. They don't stand up to a change in paper size, they don't work in all fonts... We're going to get rid of them.
Write a new implementation of class Visitor, called TreeFixer, that traverses the tree hunting down and removing these rogue paragraphs. Tree fixers should produce no output (although you may want to have it write some diagnostic output during debugging) and should make no other changes to the tree. Since the children of a container node are stored in a Vector, you can use one of class Vector's remove... methods to remove a child node; all the remaining children will renumber themselves automatically, sliding down to fill the gap.
But watch out! It is easy to come up with solutions here that involve the computing equivalent of sawing off the branch you're sitting on. It's also easy to mess up the traversal when you change the structure you're traversing.
Once you have written your tree fixer, you'll need to “uncomment” the code in class Converter that creates a TreeFixer object and runs it over the tree. Test the output carefully to be sure that your code works as expected. (You might want to consider printing out the tree before and after the tree fixer traverses it, and comparing the resulting files, just to be sure that you're not deleting any other paragraphs.)
2. Implement list continuation
Our documents can contain two types of lists: unordered lists where the items are marked with ‘bullets’ and ordered lists where the items are numbered. The text renderer already knows how to format these correctly.
Sometimes an author will insert some other stuff between the items of a list. In sample2.sxw it is a few paragraphs of stuff that's supposed to be at the bottom of the first page: copyright notices and disclaimers and acknowledgments and so on. In sample9.sxw, it is just another paragraph, an interruption to the sequence of numbered list items. In this case, OpenOffice terminates the list, adds the new material, and then starts a new list for the remaining items. The second list has an attribute text:continue-numbering="true" to indicate that the item numbers should not start again at 1, but should continue from where they left off. The code you have been given ignores this attribute, and gets the item numbers wrong on the continuation of the list. It is your task to fix this.
This isn't hard. Your text renderer will have to remember what list item number it is up to, and reset that to zero (or one) at the start of a new list but not if the list is a continuation of an earlier list.
You do not have to deal with the possibility of nested lists. (This would make this question much harder, especially since OpenOffice doesn't make very nice XML for nested lists. Try it and see, if you like.)
3. Write a style decoder
If you look at one of the XML output files, for example sample1.xml you will see that a lot of it is concerned with the different styles used in the document. There's even more of this in the styles.xml file inside the zip archive. (Just unzip the file sample1.sxw and take a look.)
In the styles.xml file it defines a whole lot of styles with names like “Title”, “Author's Name”, “Affiliation”, “Abstract”, “Categories”, “Primary Head”, “Secondary Head”, “Text Body” and so on. These are all paragraph styles. You might see something in the document like:
<text:p text:style-name="Title"> "Title" </text:p>This paragraph is the title of the document. In the next question you will look at this node and extract its content. The problem comes when you get to the next paragraph, which looks like this:
<text:p text:style-name="P1"> "AUTHOR" </text:p>By looking at the content you can guess that this must be the author's name, but how can our software work that out just by looking at the XML tags? The style name “P1” tells us nothing. The answer is to go back to the earlier styles part of the document, where you will see an element <office:automatic-styles>. Each child of this is a style definition, based on a parent style but with some (perhaps minor or even non-existent) modifications. We don't care about the formatting stuff. What we do care about is the name of the parent style. Here's the one we're looking for:
<style:style style:parent-style-name="Author's Name" style:family="paragraph" style:name="P1">This tells us that style “P1” is based on a style called “Author's Name”, and so this paragraph contains an author's name. We need to be able to grab its content in the next question.
Your task here is to write a new visitor (a class that implements the Visitor interface) called StyleDecoder that traverses the tree looking for these automatic style definitions and recording any it finds that have both a style:name attribute and a style:parent-style-name attribute. Note that you should only record these relationships for styles defined inside an <office:automatic-styles> element. There may be others, but your program must ignore them.
Your class must store these relationships in a public field lookupTable of type Hashtable. This is a standard library class; you will need to look it up in the API documentation to figure out how to use it. You can also get an idea of how to create and use a lookup table by examining the code for handling attributes in class comp2100.oops.tree.AttributeList.
Your style decoder must also have a toString() method that returns a String showing the relationships it has found. The string should look like this:
P1 -> Author's Name P2 -> Affiliation P3 -> Abstract P4 -> Affiliationand so on.
Modify class Converter to run your style decoder on the tree after the tree fixer has finished. (Just “uncomment” the relevant lines. Actually, you'll have to change one of those commented-out lines. See Question 2 in the Assignment 1 FAQ.)
4. Write a metadata extractor
Write another visitor called MetadataExtractor that traverses the tree and locates the document title, authors' names and affiliations. It will need to use the lookup table prepared in Question 3 (so you will need to have completed Question 3 before you can do this) in order to identify this special content correctly in all cases.
Look at the (commented out) lines in class Converter that create and run the metadata extractor. You will see that you need a constructor that takes a single argument of type Hashtable, which is assumed to contain pairs of styles and parent styles.
Your metadata extractor must find and store all of these as it traverses the document. It must also have a printMetadata() method that writes the collected metadata to System.out in the following format:
Title = "This is the Title of the Paper" Author's Name = "AUTHOR 1", "AUTHOR 2 AND AUTHOR 3" Affiliation = "Affiliation 1", "and", "Affiliation 2"A few points to note about the output format. The content must be enclosed in double quotes. Where there are several paragraphs with the same style (or with styles that inherit from that same style), they must all be listed, in the order that they appear in the document, separated by commas, as above.
Modify class Converter to run your metadata extractor over the tree after the style decoder and before generating any of the output files. (Again, this is just a matter of “uncommenting” the relevant lines.)
5. Write an HTML renderer
Add a new visitor called HTMLRenderer that produces an HTML version of the document tree.
You will find that HTML and OpenOffice document structures are quite similar. In this version you won't be putting any information from the top-level elements before office:body directly into the HTML output, but the body of the two documents will have the same structure.
Paragraph styles. Most of the content in the body of the OpenOffice document is in paragraph nodes (text:p). Your HTML renderer must use the style attributes on these to determine their formatting. It should format them according to this table:
Style Name Before After Title <H1 ALIGN="CENTER"> </H1> Abstract <BLOCKQUOTE><FONT SIZE="-1"> </FONT></BLOCKQUOTE> Author's Name <P ALIGN="CENTER"><B> </B></P> Affiliation <P ALIGN="CENTER"> </P> Categories <P><FONT SIZE="-1"> </FONT></P> Text Body <P> </P> Quoted Text <BLOCKQUOTE> </BLOCKQUOTE> Footnote <P><FONT SIZE="-1"> </FONT></P> Figure Caption <P ALIGN="CENTER"><FONT SIZE="-1"> </FONT></P> Table Head <P ALIGN="CENTER"> </P> References <P> </P> Primary Head <H2> </H2> Secondary Head <H3> </H3> Displayed Equation <BLOCKQUOTE><I> </I></BLOCKQUOTE> Initial Body Text <P> </P> To do this correctly, your HTML renderer will need the help of the style decoder from Question 3.
Think hard about how you code this. There are several options. You should choose a method that will lead to simple, clear Java code that is easy to maintain. I don't want to see huge switch or if-then-else if-then-else... constructions inside visitParagraph().
Metadata in the HTML document header. To help with online searching, the web version of an article needs metadata. This is inserted in the HEAD element using META tags. You should also use the document title in the TITLE element. You will need the document metadata extracted by the metadata extractor in Question 4. So for example, with sample9.sxw, the HEAD element should look like this:
<HEAD> <TITLE> Nested ordered lists?</TITLE> <META NAME="Title" CONTENT="Nested ordered lists?"> <META NAME="Author's Name" CONTENT="Ian Barnes"> <META NAME="Affiliation" CONTENT="The Australian National University"> </HEAD>You may want to add some accessor methods getTitle(), getAuthor() and getAffiliation() to class MetadataExtractor to help you with this part.
Italic and bold spans. The other thing your HTML renderer has to be able to do is handle text:span elements. If you look in sample8.xml, you will see a paragraph with two spans, one that is supposed to be in italics and the other in boldface. These are marked up as text:span with an attribute text:style-name. The style names are automatically generated in the same way as the ones you dealt with in Question 3. The one for the italic span looks like this:
<style:style style:family="text" style:name="T2"> <style:properties style:font-style-asian="italic" fo:font-style="italic" style:font-style-complex="italic"/> </style:style>The interesting information is in the child element style:properties in its fo:font-style attribute. When you traverse this part of the tree you need to make a lookup table with this information for all styles with style-family="text". For boldface it is the fo:font-weight attribute you're looking for.
When you later hit a text:span element, you need to use the information in this lookup table to decide what HTML formatting to apply. If the span has fo:font-style="italic" then its content should be inside <I>...</I>. If the span has fo:font-weight="bold" then it should be inside <B>...</B>. (If it has both, make sure that you close them in the correct order.)
How to get started on this question. You will probably find it easiest to start by making class HTMLRenderer a direct copy of TextRenderer and then making modifications to add tags and so on. This will do for this assignment. Obviously a better solution would be to abstract the common code into a third class and have both TextRenderer and HTMLRenderer inherit from that. Don't worry about doing that for this assignment. Think of it this way: you're creating a prototype. Later on in the development process, other engineers will refactor the code to improve the design.
Note: The specs above are somewhat incomplete. For more information about how to do this question, look at the Assignment 1 FAQ Page, particularly item 13.
6*. Fix bad spaces in text renderer
This is meant to be tricky. If you are aiming for a Pass or Credit on this assignment, you don't need to attempt it.
Download the new sample file sample10.sxw and run the processor on it. If you look carefully at the output file sample10.txt, you will notice some problems with the way the names are printed in the last paragraph. The first letters of the names are separated from the rest of the name by a space. There is also a space between the name and any following punctuation. This is because there is a problem with the way the text renderer outputs data elements. What it currently does is to split them into ‘words’ and then add each word in turn to the current line, putting a space before it if it isn't right at the start of the line. (Take a close look at the methods visitData() and addWord() in class TextRenderer.)
This is wrong. Sometimes the first word in a content element should not have a space before it, as in the sample file, where part of the word is in a span element, or where a span element is immediately followed by punctuation. Check the contents of sample10.xml to see what's really in the document.
Your task is to modify the behaviour of those two routines so that they operate correctly. A space or newline should only be put into the output if there is a space in the input.
This does not require a lot of code. In the end you will probably change very few lines. If you find yourself writing lots of code, considering lots of different cases, then stop and think some more.
Note that when you get this right, the first part of sample10 will suddenly be wrong and there will not be a space between “italic” and “bold”. Fixing that is part of the next question.
7**. Rewrite the scanner
This is longer and harder than the other parts of the assignment. You will need to write a few new classes, with the details of the design being up to you. Only attempt this if you have completed everything else and you are sure that what you have done is correct. Your marker will only look at this if the rest of your assignment is complete and has no serious problems. You do not need to attempt this if you only want a Pass, Credit or Distinction for the assignment.
The scanner is a mess. It also has some errors in it. Your task here is to replace the existing class Scanner with something better.
The current scanner reads the entire input file into a String, then performs various operations on that string, eventually producing an array of tokens that client classes can move through using the methods advance(), item() and endOfInput(). (It also allows clients to go back one token using retreat() and to go right back to the beginning of the token stream using reset(), but these methods are never used.)
Reimplement the scanner with a different and more streamlined design. Instead of one large class, there will be several small classes. Instead of storing the entire input file and the entire list of tokens, the process must be dynamic, with the scanner producing one token at a time, when requested, and reading through the input file only as needed.
Your scanner must have the same public interface as the old one (except for retreat() and reset() which you should omit). That is, it must have a public constructor that takes a single argument of class Reader, and it must have public methods public void advance(), public Token item() and public boolean endOfInput(). Using the new scanner should not require a single change to the rest of the program. (I want to be able to test your program with your scanner and with the original one.)
Your design should use the Decorator pattern. (Look it up.) That is, the various stages of cleaning up the input should each be peformed by a different object (of a different class). Between the scanner itself and the Reader it reads characters from, you put a CommentFilter object, a ProcessingInstructionFilter object, a DoctypeDeclarationFilter object and a WhitespaceFilter object. Each of these needs to have a readCharacter() method (although you can call it whatever you like). The idea is that one call to readCharacter() will get passed down through all the filters resulting in possibly several calls to read() on the actual input stream as it skips over comments, processing instructions, doctype declarations and irregular white space. A single call to advance() on the Scanner will result in it reading several characters until it has read a complete tag or a complete data token. It assembles this token and makes it available to clients as item().
You may not use class java.util.Scanner from the Java 1.5 API.
There is one error in the current scanner implementation that you should correct in your version. (There may well be other errors, but I don't know about them yet. If you find any, you may fix them too, but you should also write to me about them to confirm that they are really errors and not features or misunderstandings.) The scanner trims leading and trailing whitespace (space characters) from data tokens and throws away any data tokens that consist only of whitespace. Your scanner should not do this. When you have fixed this, you will find that the space between the words “italic” and “bold” in sample10.txt will reappear.
In earlier versions of the program, this would mean that a large number of extra data tokens, each consisting of a single space character, would appear in the token stream and would be inserted into the tree. Most of these would not belong there and would have to be removed by the tree fixer. They came from newline characters in the content.xml file created by OpenOffice. I think more recent versions of OpenOffice don't put newlines into content.xml, and so the problem disappears, but it's worth checking. If you find anything funny, let me know.
Getting started
Download the JAR file a1start.jar. This contains the source code for the starting version of the program. Uncompress and unpack the archive.
Compile the program by typing build. (The ‘build’ command is a one-line shell script that recompiles every Java source file. It just saves typing. We'll learn about shell scripts and build tools later this semester, but this will do for now.) There are eight sample files provided: sample1.sxw, ... sample8.sxw. To run the program, type for example ./oops sample1.sxw. (Again, the ‘oops’ command is a one-line shell script that saves you typing ‘java comp2100.oops.Converter’ every time.)
By the way, the first two sample files were downloaded from the ACM web site's Instructions for Authors. I haven't included the original Microsoft Word versions of these documents, but I can post them on the class web site if anyone wants to check anything. I created the third document in Open Office by modifying one of the others. If you look closely at sample3.xml or sample3.html you will see that I have messed it up. Somehow I managed to put the major section headings inside one-item numbered lists. This illustrates the point of a preflight system. I can put my document into it, see that the output is wrong, fix it up and try again. A fixed version of this document is in sample4.sxw.
Take plenty of time to read and understand the code before you try to modify it. Don't just dive in and start changing things until you understand how they work.
Each time you modify a class, you should also modify the author and version information at the top. We haven't covered version control yet in lectures, so remove the automatically generated stuff between the dollar signs, and update the author and version fields like this:
* @author Ian Barnes (original Eiffel version) * @author Alexei B Khorev (Java port) * @author Modified by Josephine Bloggs for COMP2100 Assignment 1 * @version 14th March 2005If you don't understand something in the requirements, send your questions to comp2100@cs.anu.edu.au sooner rather than later. I will post some of the answers to student questions on the Assignment 1 FAQ and Hints Page, so check there before you write to me.
This will probably be quite a challenging assignment for many of you, particularly because you will have to understand at least part of quite a large program before you can start modifying it. I urge you strongly to start soon. Don't leave this until the last minute. Also remember that there is a large range of ability in this class. You don't have to complete all parts of the assignment to pass, get a Credit, or even get a Distinction. Good solutions to the first few parts will take you a long way.
Submission
You and your partner must submit only one assignment representing your joint work. The way we have set up the submission system, only the partner with the lower UniID will be able to submit.
To submit your assignment:
Open a shell window on the ANU DCS student system and cd to your assignment top directory. This is the directory containing the sample documents and the build and oops scripts.
Create a jar file called a1.jar containing all your Java sources by typing:
jar cvf a1.jar comp2100/oops/*.java comp2100/oops/*/*.java(Instead of typing, you can just cut and paste that line.)
Submit this jar file using the mark command:
mark comp2100 a1 a1.jarCheck that your submission succeeded using the markshow command:
markshow comp2100 a1If you do not see your submission in the output of this command, then it has not succeeded and you should seek help from the TSG immediately.
The mark and markshow commands will prompt you for your password. This is normal.
Do not use the submit command. It won't work.
Hint: You may submit the assignment as many times as you want. Only your last submission will be marked. I advise you to try submitting a few days before the deadline just to check that it works. This way you can avoid any panic in the last minutes before 12 noon on Friday 8th April.
Marking
Your submission will be marked out of 30 according to a standard marking guide. Your program will be compiled and run on test data as part of the marking process.
Your assignment mark will be based on: the correctness of your submitted program, the clarity, readability and simplicity of added or modified code, documentation of added or modified code using Javadoc comments and other comments. Make sure that the version you submit will compile and run, even if it won't do everything it is supposed to. You will be penalised heavily if your program crashes or fails to compile.
We will make every effort to return your assignment to you in your lab class in Week 8.
Late submissions
Late submissions will be accepted up to one week after the deadline. They will be penalised 20% (6 marks).
[ANU] [DCS] [COMP2100/2500] [Description] [Schedule] [Lectures] [Labs] [Homework] [Assignments] [COMP2500] [Assessment] [PSP] [Java] [Reading] [Help]
Copyright © 2005, Ian Barnes, The Australian National University
Version 2005.13, Monday, 18 April 2005, 15:23:01 +1000
Feedback & Queries to
comp2100@cs.anu.edu.au