ANU The Australian National University




COMP2100/2500
Assignment 1 Hints and FAQ

Hints

Tuesday 5th April: Today I did most of the assignment. I came across a funny error in the starting code. The part that is supposed to replace some common entities with the appropriate characters doesn't work. This is easy to fix, so you might as well.

The code in the static block in class TextRenderer that sets up the lookup table is wrong. It has the keys and corresponding values in the wrong order. As a result, no entities ever get replaced. The way to fix this, if you want to, is to reverse the order of the arguments in each call to entities.put(). That block of code should look like this when you're finished:

entities.put("&lt;", "<");
entities.put("&gt;", ">");
entities.put("&amp;", "&");
entities.put("&apos;", "'");
entities.put("&quot;", "\"");
entities.put("\u0194\u0169", "\u0169"); // Copyright symbol

How did this happen? Because the code was translated from Eiffel, and the Eiffel equivalent of a Java Hashtable is a DICTIONARY. The put method for an Eiffel DICTIONARY takes its arguments in the order value, key, but the put method for a Java Hashtable takes them in the order key, value.
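To illustrate the argument order, here is a minimal sketch. It is not the code in TextRenderer: the class name EntityDemo and the helper decode() are made up for this example, and it uses generics for brevity. Only Hashtable.put(key, value) and Hashtable.get(key) are real library calls.

```java
import java.util.Hashtable;

class EntityDemo {
    // Entity (key) -> replacement text (value).
    static final Hashtable<String, String> entities = new Hashtable<String, String>();

    static {
        // Java's Hashtable.put takes the key first, then the value.
        entities.put("&lt;", "<");
        entities.put("&gt;", ">");
        entities.put("&amp;", "&");
        entities.put("&apos;", "'");
        entities.put("&quot;", "\"");
    }

    // Hypothetical helper: replace each known entity in one left-to-right pass,
    // so that text produced by one replacement is never decoded a second time.
    static String decode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int semi = text.indexOf(';', i);
            if (text.charAt(i) == '&' && semi > i) {
                String replacement = entities.get(text.substring(i, semi + 1));
                if (replacement != null) {
                    out.append(replacement);
                    i = semi + 1;
                    continue;
                }
            }
            out.append(text.charAt(i));
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Author&apos;s Name &lt;a &amp; b&gt;"));
    }
}
```

With the arguments the wrong way round, get("&lt;") would return null, which is why no entity was ever replaced.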

The last line (for the copyright symbol) still doesn't work, but that's probably OK. We should really remove that line and look at the text files our system produces with UTF-8 character interpretation turned on. When you do that there's no need to substitute for the copyright symbol or any of the other multibyte characters that appear (fancy quote marks etc).


Frequently Asked Questions

If you have a question about the assignment, send it to comp2100@cs.anu.edu.au. I will answer it (but not always immediately) and if I think the answer might be useful to other students, I'll post the question and answer here.

  1. When I try to run your build or oops commands, I get a message about “permission denied”. What am I doing wrong?

    Nothing. It seems that the jar tool doesn't preserve permissions on files. In order to use those shell scripts you need to change their permissions to make them executable. You can do this with the chmod command like this:

    chmod +x build oops

    In the past I've distributed code using the standard Unix archiving tool tar, which does preserve permissions. Because we're using Java now, I thought I should use the Java archiving tool instead. In many ways they're similar. Actually jar does a little more than tar in that it does compression as well as archiving. (People usually compress tar files with gzip, leading to software download files with the extension .tar.gz.)

  2. In the specification, you say we've got to write a toString() method for StyleDecoder. But the commented-out code in class Converter calls styleDecoder.lookupTable.toString(), which calls the toString() method of class Hashtable, not the one we wrote for class StyleDecoder. Is this correct?

    No, it's a mistake in the commented-out code in class Converter. It should print out styleDecoder.toString(), not styleDecoder.lookupTable.toString().

  3. In the style decoder, do we need to convert the entity back? (That is, do we have to convert the ‘&apos;’ in “Author's Name” to ‘'’?)

    No, don't worry about the entity. You'll notice that the text renderer converts some entities, but there's no need to do it in the style decoder's output.

  4. Does the toString() method in class StyleDecoder need to return the keys in any particular order? (In the specification they are in increasing order: P1, P2, P3 etc.)

    No, don't worry about sorting the keys. Whatever order the Hashtable gives them to you is fine with me.

  5. I tried to run Javadoc over the sources you gave us, and it made a mess with lots of errors. Also the comments at the top of the files disappeared. What can I do?

    It's true. There are mistakes in the way the Javadoc comments have been put into the code. Most of the errors are because there's a colon after the parameter name in an @param line. There are also errors in the references to other classes in the @see lines. The class comments disappeared because Javadoc wants them to be immediately before the class declaration rather than at the top of the file.

    You don't have to do anything about this. If you want to fix it, you have two options:

    1. Fix these mistakes by hand. That's what I did, and it took me about half an hour or so.

    2. Use these Perl scripts supplied by Jan Vaughan. (Thanks Jan.) The first one, oopsjdocfixer1, fixes the @param and @see lines. The second one, oopsjdocfixer2, fixes the class header comments. Neither Jan nor I provide any guarantee for these scripts, so you use them at your own risk. I recommend making a backup copy of all your sources before you start, and doing careful checking afterwards.

  6. In Q3, it is certainly possible that style P1 inherits from style Author&apos;s Name. However, is it possible that style P1 inherits from style AnotherStyle, which in turn inherits from Author&apos;s Name?

    If this is the case, then our MetadataExtractor will have to consider this ‘double’ inheritance, and get a bit of a recursive thing going, e.g. does parent(parent("P1")) == "Author&apos;s Name"? And so on... Why stop at two?

    You may assume that there are no multi-level inheritances among styles. That is, you only need to check the immediate parent.
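    A one-level check along these lines might look like the sketch below. The class StyleCheckDemo, the table parents and the method isOrInheritsFrom() are hypothetical names chosen for this example, not part of the supplied code. The sketch assumes your style decoder gives you a table mapping each style name to the name of its immediate parent.

```java
import java.util.Hashtable;

class StyleCheckDemo {
    // Assumed lookup table: style name -> name of its immediate parent style.
    static final Hashtable<String, String> parents = new Hashtable<String, String>();

    // True if style is the target itself, or if its immediate parent is the target.
    // Only one level is checked; chains of inheritance are deliberately not followed.
    static boolean isOrInheritsFrom(String style, String target) {
        if (style.equals(target)) {
            return true;
        }
        String parent = parents.get(style);
        return parent != null && parent.equals(target);
    }

    public static void main(String[] args) {
        parents.put("P1", "Author&apos;s Name");
        parents.put("P2", "P1");
        System.out.println(isOrInheritsFrom("P1", "Author&apos;s Name")); // true: immediate parent
        System.out.println(isOrInheritsFrom("P2", "Author&apos;s Name")); // false: two levels away
    }
}
```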

  7. In the TreeFixer, is there any point in visiting the children of a style node? It does not make sense to have a paragraph inside a style node. This applies to several other node types, such as lists, spans, style properties — none of these would ever contain a paragraph.

    So what I'm asking here is whether we have to traverse the entire XML tree. It would be more efficient to say “We've come across a style node, don't bother trying to find paragraphs in its children.” On the other hand, we could make our TreeFixer more robust by always traversing the entire XML tree, in case the XML standards change.

    Actual document content can only be inside the office:body element, so your TreeFixer doesn't have to search all those styles elements for rogue paragraphs of underscores. You may shorten the search by not exploring the other subtrees. Similarly, definitions of style inheritance can only be inside office:automatic-styles, so your StyleDecoder doesn't have to search the other branches of the tree for them. And metadata content can only be inside paragraphs, which can only be inside office:body, so your MetadataExtractor only has to search there also. (Thanks to James Barker for looking up the Open Office XML definition and confirming this.)

    Acting on this will make your program a little faster, but it is not essential. You won't lose any marks if all your visitors traverse the entire tree. You will lose marks if you try to speed up the search and end up missing something.

  8. In the MetadataExtractor, if a paragraph should contain metadata (i.e. its text:style-name attribute is or inherits from Title, Author&apos;s Name, or Affiliation), and it has no actual data within it, should we store an empty string and output that at the end? Or just ignore it?

    Ignore it. You don't have to store and print empty strings. Only extract non-empty metadata content.

  9. There's nothing in Q5 to say how we should handle ordered and unordered lists. What do we do?

    Ordered lists in HTML are placed inside <ol>...</ol> and unordered lists inside <ul>...</ul>. List items go inside <li>...</li>. A paragraph inside a list item in the XML tree should come out as a paragraph inside a list item in the HTML.

  10. How much of the system are we allowed to change? Will you be relying on some classes staying unchanged?

    To do Q1-6 of the assignment you probably only need to modify the existing classes Converter and TextRenderer and add new classes TreeFixer, StyleDecoder, MetadataExtractor and HTMLRenderer. To do Q7 will also require modifying (or rewriting from scratch) class Scanner, and adding a few new classes to that package.

    But I'm not going to insist that you only change those. If you want to spend your valuable time modifying other parts of the system, be my guest.

    What I usually do with assignments like this is to run whatever program each student submits, and to print out the differences between their submission and the starting system. (That is, don't print unchanged files, print new files in full, and print a ‘diff’ of modified files.) In order not to pick up too many non-changes (like if you just changed the indentation of a line using DrJava so that tabs got replaced by spaces), the options for diff will be carefully chosen, with files possibly pre-processed as well.

    I know there's a lot of room for improvement throughout the system. I'd advise you to resist the temptation to rewrite the whole thing. Or if you can't resist, keep that version separate and submit a version in which you've made minimal changes to the existing code. This will make the really important changes stand out, and make it easier for your marker to identify the brilliant work you've done carrying out the tasks in the assignment.

  11. Is there a place on the student server to share our work so that my partner can also have a copy of all the work that we have done?

    No, as far as I know there's nothing like that. It's a pity. You'd need an area where only you and your partner have access, but not the rest of the class.

    One thing you can do is to email your changes to each other. (Using the jar tool to archive and compress the code will make this quicker and easier.)

    Another possibility is to use scp (secure copy) to copy work between your accounts. To do this you will need to be logged in to one account and then type the password for the other. The right way to do that is not to give your partner your password, but to do the copy while you are both present. You log in to your account, type the scp command, and then when it prompts you for your partner's password, he or she types it in. You can use scp to copy files to or from your account. You can copy multiple files, whole directories etc. Be careful doing this, because it is easy to copy an old version over a new one.

    You should read the manual page (type “man scp”) before you use this, but here's a quick guide. To copy from your partner's account (u1234567) to yours:

    scp u1234567@partch.anu.edu.au:directory/file.java yourdirectory/file.java

    To copy from your account to your partner's, reverse the order. (The order is scp from to.)

    To recursively copy entire directories together with their subdirectories and contents, use the -r option. (This is a little tricky and it sometimes puts things one level deeper than you might expect.)

    One of the topics we will study after the break is version control, when you will learn about tools that can assist you with file sharing and keeping track of the changes to your program. Still doesn't change the need for some sort of shared access though.

  12. Your build file doesn't work on my Windows machine. What can I do?

    Replace it with a file build.bat with the following contents:

    javac comp2100\oops\*.java 
    javac comp2100\oops\scanner\*.java 
    javac comp2100\oops\strategies\*.java 
    javac comp2100\oops\tree\*.java 
    javac comp2100\oops\visitor\*.java

    (Thanks to Tim Butler & Pat Bernardi for this.)

  13. I'm confused about the extent to which the visit...() structure has to be imposed on/integrated within the HTMLRenderer. It's not at all clear on the Assignment page whether things such as Headings, Lists, etc. have to be handled within the visitParagraph() method (as opposed to in the visitHeading(), visitList(), etc. methods). For example, we're told that within the Paragraph elements we have to handle things listed as Headings (Primary and Secondary) - that's not a problem in itself, but do the same format specifications apply to headings found by a visitHeading() method?

    This is made worse by the existence of some visit...() methods with obvious HTML equivalents that you haven't mentioned in your specs... The best example is the visitAnchor() method. You haven't specified whether or not we must deal with this.

    Sorry, you're right. The specs are seriously incomplete for Q5. This should fill the gap.

    First thing though, is that there's a difference between an XML element text:h that should be dealt with by visitHeading() and an XML element text:p text:style-name="Primary Head", which is dealt with in visitParagraph(). It's the second of those two that the table in the assignment specification is about: paragraphs with all sorts of different styles on them (some of them in fact types of headings, but not recognised as that by the OpenOffice XML structure).

    OK, so let's go through the different visit...() methods that your HTMLRenderer must have. (It must have them because it must implement the Visitor interface.)

    visitDocument()

    should produce the <HTML>...</HTML> pair. It might also produce <HEAD>...</HEAD>, but the details of where that's done are up to you. This is where the XML document structure doesn't exactly mirror the HTML structure you're producing, so you need to think a little more about how it will work. Inside the HEAD element you need the TITLE and the META elements mentioned in the assignment specification.

    visitHeading()

    should output an HTML <Hn>...</Hn> pair if the XML heading element has an attribute text:level="n". (That is, if the level is 1, put out <H1>...</H1>, etc.) If there is no attribute, output an H2 element. Of course in between the start and end tags you print, you need to make this visitor traverse the children of this heading element, which you do by calling visitChildren().

    visitAutomaticStyles(), visitStyle(), visitStyleProperties()

    shouldn't produce any output, but you'll need them for catching the style names associated with bold and italic spans. If you want to catch small caps too, that would be great, but it's not required.

    visitBody()

    should produce <BODY>...</BODY>. Inside those is where all the content of the document should go.

    visitParagraph()

    produces formatting as described in the table in the specifications, depending on the style of the paragraph that you recover from the attributes and your style decoder.

    visitUnorderedList()

    produces <UL>...</UL>.

    visitOrderedList()

    produces <OL>...</OL>.

    visitListItem()

    produces <LI>...</LI>.

    visitSpan()

    has to look at the style attributes and then the lookup table for text styles that I described in the specifications. Depending on what it reads there, it either produces <I>...</I> or <B>...</B> or <I><B>...</B></I>. If you decided to do small caps (which will make the examples look nice), it might also produce <SPAN STYLE="font-variant:small-caps">...</SPAN>. If there is nothing your processor recognises in the attributes of the span, then it should do nothing other than visit its children. (One example of this is the superscript ‘2’ in the equation E = mc² that appears in sample6.sxw and sample8.sxw. You don't have to handle these.)

    visitAnchor()

    looks for an attribute xlink:href="url". If that attribute is present, then it should produce <A HREF="url">...</A>. Otherwise it should do nothing except visit children.

    visitSpace()

    should produce a non-breaking space character: &nbsp;.

    visitLineBreak()

    should produce <BR>. (Or if you want to write nice XHTML, write <BR /> instead. That's like an XML empty element tag, but you need to leave the space before the slash because some browsers still don't understand it and get confused otherwise.)

    visitUnknown()

    should just visit its children without doing anything.

    visitData()

    should print out its contents. There's no need to do all the fancy line breaking like the text renderer, but also no particular reason not to, since you already have the technology for that — just copy it from the text renderer if you like.

    (But whatever formatting of your HTML output you do or don't do, please don't output the entire HTML file as one really, really long line. Even if you do no wrapping or indenting, output a newline at the ends of all the visitWhatever() methods except anchors, spans, spaces, data & unknowns... and of course the styles ones that don't produce any output.)
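    The tag choices described above for headings, spans and anchors can be sketched as plain string helpers. Everything here is hypothetical: the class HtmlTagDemo and its methods are not part of the supplied code, and a real HTMLRenderer would print the start tag, call visitChildren(), then print the end tag, rather than building strings. The sketch also assumes the attribute values have already been read from the element, with null meaning "attribute absent".

```java
class HtmlTagDemo {
    // Heading tag name for a text:level value; a missing attribute defaults to H2.
    static String headingTag(String level) {
        return (level == null) ? "H2" : "H" + level;
    }

    // Wrap already-rendered child content in italic and/or bold tags.
    // With neither flag set, the content is returned unchanged.
    static String wrapSpan(String content, boolean italic, boolean bold) {
        String result = content;
        if (bold) {
            result = "<B>" + result + "</B>";
        }
        if (italic) {
            result = "<I>" + result + "</I>";
        }
        return result;
    }

    // Produce a link only when an xlink:href value is present.
    static String wrapAnchor(String content, String href) {
        if (href == null) {
            return content;
        }
        return "<A HREF=\"" + href + "\">" + content + "</A>";
    }

    public static void main(String[] args) {
        System.out.println("<" + headingTag("1") + ">Title</" + headingTag("1") + ">");
        System.out.println(wrapSpan("both", true, true));
        System.out.println(wrapAnchor("ANU", "http://www.anu.edu.au/"));
    }
}
```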

  14. To what extent must we use the Assert methods? Are we to take the most anal-retentive “Bertrand Meyer would be proud” approach of specifying pre- and post-conditions for every program structure, i.e. every method, loop, etc? Or should we only use them where we feel the program complexity justifies an explicit statement of intent?

    I won't be assigning any marks to your use of assertions, so the decision is entirely up to you. This assignment is hard enough without putting that extra pressure on you.

    Personally I favour keeping the code as clean as possible except in the case of troublesome complexity. So I'd tend towards your second option. But there are plenty of computer scientists and software engineers who would tell me to wash my mouth out for saying that...

    I guess my thinking is that what's really important is that human readers can understand the code. If assertions add to that, then great, put them in. (In a really nasty loop, seeing the invariant can be the key to figuring out what the hell is going on.) But if there are so many assertions that it's hard to find the bits of code that actually do something, then I'm not so sure that they're such a good thing.

  15. In Q6 when the TextRenderer is fixed correctly, you mentioned that the first part of sample10.txt will be wrong and there will not be a space between “italic” and “bold”. Should this also happen between the first and last names, for example between “Bertrand” and “Meyer” and also between “reference list in” and “sample2.sxw”? In all these cases the two parts appear in separate text:span elements with no leading/trailing spaces.

    Yes, indeed it will. I hadn't thought of that, but looking in the sample10.xml file it's clear that is what will happen.

    To fix all of these, you need to do Q7, correcting the behaviour of the scanner. (Basically the scanner does a trim() on the strings before it stores them in data tokens, which become data elements in the tree. That's not right.)

  16. How do you access a child of an XmlContainerElement? I tried this:

    XmlContainerElement child = x.children.elementAt(i);

    but it doesn't work. What am I doing wrong?

    OK, x.children is a Vector, which really means a Vector of Object since we're not using the Java 1.5 generics. So x.children.elementAt(i) is an Object, which is why the assignment won't work. You can't assign an Object to child because you've declared child to be an XmlContainerElement. You need a cast:

    XmlContainerElement child = (XmlContainerElement) x.children.elementAt(i);

    Actually you can't even do that, because you don't know that x.children.elementAt(i) is an XmlContainerElement. All you know is that it's an XmlElement. It might be an XmlDataElement instead.
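    One safe pattern is to cast to XmlElement first and then test the actual type before casting further. The sketch below uses stand-in classes: only the class names and the children Vector come from the discussion above, while the class bodies, the data field, addChild() and describe() are assumptions made so that the example compiles on its own.

```java
import java.util.Vector;

// Stand-ins for the assignment's tree classes (bodies are assumed).
class XmlElement { }

class XmlDataElement extends XmlElement {
    String data;
    XmlDataElement(String data) { this.data = data; }
}

class XmlContainerElement extends XmlElement {
    Vector children = new Vector(); // raw Vector: elementAt() returns Object

    @SuppressWarnings("unchecked")
    void addChild(XmlElement child) { children.add(child); }
}

class ChildAccessDemo {
    // Check the run-time type before casting down to a more specific class.
    static String describe(XmlElement child) {
        if (child instanceof XmlContainerElement) {
            XmlContainerElement container = (XmlContainerElement) child;
            return "container with " + container.children.size() + " children";
        } else if (child instanceof XmlDataElement) {
            return "data: " + ((XmlDataElement) child).data;
        }
        return "unknown element";
    }

    public static void main(String[] args) {
        XmlContainerElement x = new XmlContainerElement();
        x.addChild(new XmlDataElement("hello"));
        x.addChild(new XmlContainerElement());
        for (int i = 0; i < x.children.size(); i++) {
            // This cast is always safe: every child is some kind of XmlElement.
            XmlElement child = (XmlElement) x.children.elementAt(i);
            System.out.println(describe(child));
        }
    }
}
```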

  17. The assignment spec says: “Your scanner must have the same public interface as the old one (except for retreat() and reset() which you should omit).” But the code in Converter.java calls scanner.reset();, which conflicts with your statement: “Using the new scanner should not require a single change to the rest of the program.”

    Oops... How did I miss that?

    OK, using the new scanner will require exactly one change to Converter.java: the deletion of the line “scanner.reset();”.

    Sorry about that.

  18. In the ‘Marking’ section of the assignment specification you wrote “documentation of added or modified code using Javadoc comments and other comments.” Does this mean we don't have to write the comments using Javadoc style?

    No, it doesn't mean that. You should have a proper Javadoc comment at the top of each class and before each method. You may want to add other (non-Javadoc) comments in your code to explain complex or difficult parts.

  19. For Question 6, I just did it in class TextRenderer. Do I need to do it in HTMLRenderer as well?

    No, you only need to do it in the text renderer.

  20. For Question 7 I can't find a definition of “irregular white space” anywhere. What is the white space filter supposed to do?

    Replace every sequence of consecutive white space characters (' ', '\r', '\n', or '\t') with a single space character (' ').

    In practice I suspect that this stage isn't really necessary when dealing with files fresh from OpenOffice. Looking at the content.xml files OpenOffice creates, I don't think they have any excess white space in them at all — they've already been white space filtered, in effect. But I'd still like you to make a white space filter for your scanner. You never know, I might just doctor up a nasty fake OpenOffice file just for testing...
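    As a rough illustration of that rule, here is a sketch that works on a whole String. The class WhiteSpaceFilterDemo and its methods are hypothetical, and your scanner's filter will probably work on a stream of characters rather than a String, so treat this only as a statement of the intended behaviour.

```java
class WhiteSpaceFilterDemo {
    // The four characters that count as white space for this filter.
    static boolean isWhiteSpace(char c) {
        return c == ' ' || c == '\r' || c == '\n' || c == '\t';
    }

    // Replace every run of consecutive white space characters with one space.
    static String filter(String input) {
        StringBuilder out = new StringBuilder();
        boolean inWhiteSpace = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isWhiteSpace(c)) {
                if (!inWhiteSpace) {
                    out.append(' '); // first character of a run: emit one space
                }
                inWhiteSpace = true;
            } else {
                out.append(c);
                inWhiteSpace = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + filter("italic \r\n\t bold ") + "]");
    }
}
```

    Note that this does not trim: leading or trailing white space is reduced to a single space, not removed.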


Copyright © 2005, Ian Barnes, The Australian National University
Version 2005.11, Tuesday, 5 April 2005, 19:26:00 +1000
Feedback & Queries to comp2100@cs.anu.edu.au