COMP2100
Assignment 2 Hints and FAQ
I will do the assignment myself over Easter and post more hints and FAQs if anything comes up. If you have any questions, send them in.
Sample Files
Here are three more sample files (in Open Office .sxw format only):
sample4.sxw. This is a revision of sample3.sxw that corrects some of the formatting errors in that file. Your program should run on both files, but might give nicer HTML output for this one.
sample5.sxw. I created this file as a good test of the "deepening the structure" part of the assignment. It's fairly similar to the example given in that question.
sample6.sxw. This is the file that I referred to a couple of times in the assignment sheet and then forgot to include with the code. Now you should be able to reproduce those examples.
That's all the sample files I will provide. If you want to create more, you can do it yourself using Open Office, which is running on the student system. You should be able to open one of the existing sample files just by double-clicking on it in the window manager. Remember that our processor uses particular styles, so you have to apply those styles to the paragraphs you create. (Hint: Go to Format - Styles - Catalog, and then look at the Custom Styles.)
29th April: OK, one more. This document, sample7.sxw, has an element <text:h level="1"> so you can test the upper-case mode in your text output. The body of the element is in mixed case (as you can check by looking at the .xml file) but it should appear in upper case in the .txt file.
Hints
While you're working, you may find it helpful to modify the Ace file so that debug statements are compiled into the system. This will create lots of output, but it might help you to track down any problems. Make sure you change it back before you submit.
In Question 3 you have to modify the tree structure as you traverse it. This is a process fraught with danger. Make sure that the changes you make do not interfere with the correct (depth-first) movement of your visitor object over the tree. You may need to think hard about this to be sure you have it right.
Frequently Asked Questions
In Question 3, how do I create a new container node?
You will have to add a new creation routine to class XML_CONTAINER_ELEMENT before you will be able to do this.
It seems like I need to be an XML-expert to do this assignment. I don't know anything about XML. What should I do?
Don't panic. You don't need to be an expert to do this assignment. Look again -- carefully -- at Lecture 5 and if necessary follow the links at the end of the notes. Everything you need to know about XML is covered in that lecture, but you may need to read through it a couple of times.
I tried to compile this program at home and it failed. The error message said something about not being able to find class ANY. What's wrong?
The Ace file tells the SmartEiffel compiler where to look for the library classes. If you're still using the old SmallEiffel compiler at home then it won't work because the libraries have been reorganised between versions. Download and install SmartEiffel.
What is CSS? Where can I find out about it?
As with XML, I've tried to design this assignment so that you don't need to know anything about CSS. Just use the chunks I've supplied. Each little section specifies the formatting for one type of HTML element. For example
P.Primary-Head { font-size: large; font-weight: bold; }
means that whenever the browser sees an element <P CLASS="Primary-Head">, it should format it with a large bold font.
If you want to know more about CSS (because it's a really cool way to format web pages, and a very useful marketable skill if you know it) a good place to start would be the W3C's CSS Home Page. It has lots of links, including to the formal specification.
In Question 1 when you say "paragraph level or lower", what does this actually mean? What is above paragraph level besides "office:body" and "office:document-content"?
Once you have done Q3 there will also be section elements. There are also lists and list items. (The list items contain paragraphs.)
Before the body of the document there is all the messy style stuff. That should all remain indented and on separate lines as it was before.
So it's easier to say what has to change than what should stay the same. Keep indenting everything as before except the contents of paragraphs. Within paragraphs run everything together on one line.
Once we have integrated the output formatter into the visitors that produce the text and HTML files, should the new output appear exactly as the output did under the original method with no use of the output formatter? Basically I am asking if I run diff over the new output and the original output the system initially produced am I allowed to have minor differences?
The idea was that this should be a replacement, a new implementation of the existing software with the requirements unchanged. So the simple, first approximation answer is "Yes", if you run diff across the two outputs, they should be identical.
But in practice there will be some small differences unless you make major changes to the output formatter, which I don't want you to do. In particular, the following differences are OK:
The old version allowed line breaks between calls to add_word, even if there was no space there. If you get the spaces thing right, that won't happen in the new version so you'll see some different line breaks in the HTML file, where the old version had an end tag on a new line, and the new version puts the word before the end tag, together with the end tag on the new line. For example, in sample2.html, when I do a diff between the old and the new, I get this. (The "-" lines are from the old version, the "+" lines from the new.)
-<P>Fig. 5. Figure caption is set underneath the illustration.
-</P>
+<P>Fig. 5. Figure caption is set underneath the
+illustration.</P>
For the same reason, the old version allowed line breaks in the middle of a word (!!) if part of the word was in a different tree element and printed with a different call to add_word. Again in sample2.html:
-Reasoning</I>, Toronto, Canada, May 1989, H. BRACHMAN AND R. R
-EITER, Eds. Morgan Kaufmann, San Mateo, CA, 276-288. </P>
+Reasoning</I>, Toronto, Canada, May 1989, H. BRACHMAN AND R.
+REITER, Eds. Morgan Kaufmann, San Mateo, CA, 276-288. </P>
The old HTML formatter only indented at the left, while the output formatter (and the old text renderer) indent at the left and right margins. This means that you'll get some different line breaks for list items in the HTML output. Again in sample2.html:
- <LI><P>Does DCM afford more reflective cognition and concious
- processing of concepts?</P>
+ <LI><P>Does DCM afford more reflective cognition and
+ concious processing of concepts?</P>
(Note the misspelling of the word "conscious". Having a fancy publication system doesn't prevent simple errors like that.)
For Q1, is it alright to assume that the only elements within paragraphs will be data elements, or tags named either text:span or text:s? Is there a better way of determining which elements are below paragraph level?
As you traverse the tree, you notice when you hit a paragraph element. Everything underneath that element -- in other words the children, grandchildren etc -- is "below paragraph level" and should be treated differently.
My thought on how to implement this was to add an extra parameter to the pretty_print routine that tells it whether to add line breaks and indentation. Paragraph elements set it to false when they make the recursive call to their children. The initial call to pretty print the root of the tree sets it to true. In all other situations its value is just passed on unchanged.
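To make the idea of the extra parameter concrete, here is a minimal sketch. It is written in Python rather than Eiffel purely for brevity, and it is not the assignment code: the Element class, the parameter name indenting and the two-space indent are all assumptions made for illustration, and attributes are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """Hypothetical stand-in for a container node; strings are data nodes."""
    name: str
    children: List[Union["Element", str]] = field(default_factory=list)


def pretty_print(node, depth=0, indenting=True, indent="  "):
    """Return the node as text, adding no white space below text:p."""
    if isinstance(node, str):
        # Data gets its own indented line only above paragraph level.
        return f"{indent * depth}{node}\n" if indenting else node
    # A paragraph switches the flag off for its children.
    # Every other element just passes the current value on.
    child_indenting = indenting and node.name != "text:p"
    inner = "".join(
        pretty_print(child, depth + 1, child_indenting, indent)
        for child in node.children
    )
    start, end = f"<{node.name}>", f"</{node.name}>"
    if not indenting:
        # Already inside a paragraph: run everything together.
        return start + inner + end
    pad = indent * depth
    if not child_indenting:
        # The paragraph itself: indented start tag, contents on the same line.
        return f"{pad}{start}{inner}{end}\n"
    return f"{pad}{start}\n{inner}{pad}{end}\n"


doc = Element("office:body", [
    Element("text:list-item", [
        Element("text:p", ["Hello ", Element("text:span", ["world"]), "!"]),
    ]),
])
print(pretty_print(doc), end="")
```

The initial call leaves indenting at its default of true, so only the paragraph case ever changes the flag, which matches the description above.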
For Q2, I am not sure how much we are allowed to alter the OUTPUT_FORMATTER class. Should the only alterations be those four points you mention in the question? Also, should we be able to do the question without altering the export status (the ANY or NONE thing) of some of the features of the class OUTPUT_FORMATTER?
OK, now that I've done that part of the assignment, I can be sure. The third of those four points (getting it to do a line break) is already done for you in the code I gave you. (My mistake, which just made the assignment a little easier.) You need to address the other three points.
If you want to get identical HTML output to the old version (and see my comments on this a couple of questions back), you will need to build in a facility for changing the indentation increment. It's 4 spaces for the text file output, but should be only 2 for HTML. This is not an important point though.
You should move the routines for adding tags from the HTML renderer to the output formatter also.
You are allowed to make other changes if you wish, but don't change the existing part of the interface without a very good reason. I don't think you need to change the export status of any features of class OUTPUT_FORMATTER. I thought you did, but I'd already done it for you in the version I gave you (when I was sorting out the line break business).
Can you release the correct output (sample*.txt sample*.html and sample*.xml) of the program?
No. You need to work out for yourself from the requirements whether your output is correct or not. If you have specific questions, ask me by email.
In Q4, can I make a link from my HTML document to an external CSS document instead of embedding the CSS inside the HTML document? Isn't that a better way to do it?
Certainly it is better in general, but for this assignment I want you to embed the CSS in the HTML document anyway. It just means there are fewer files to keep track of and less chance of having a broken link when we come to marking your assignment.
In Q2, would it be OK to make the feature "line" of class OUTPUT_FORMATTER a "feature {ANY}" so that it can be accessed by anyone?
I'd very much rather you didn't do that. That line buffer is the most important internal data for the formatter. If other code can modify its contents, all sorts of horrible things could happen. I think a much better solution would be to work out what you want the other parts of the program to be able to do to "line" and then add new routines to the output formatter to perform those actions. That way you're guaranteed that nothing nasty goes wrong.
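As a rough illustration of that design, the sketch below keeps the line buffer private and offers two small routines instead. It is in Python rather than Eiffel, and the names at_line_start and break_line, as well as the simplified add_word, are hypothetical; the real OUTPUT_FORMATTER may need quite different operations.

```python
class OutputFormatter:
    """Toy formatter: the line buffer is internal and never handed out."""

    def __init__(self, width=72):
        self._width = width
        self._line = ""      # current line, private to the formatter
        self._lines = []     # completed lines

    def add_word(self, word):
        """Append a word, starting a new line if it would not fit."""
        candidate = word if not self._line else self._line + " " + word
        if len(candidate) > self._width and self._line:
            self.break_line()
            candidate = word
        self._line = candidate

    def at_line_start(self):
        """Query: is the current line still empty?"""
        return self._line == ""

    def break_line(self):
        """Command: finish the current line if there is anything on it."""
        if self._line:
            self._lines.append(self._line)
            self._line = ""

    def output(self):
        """Return everything formatted so far as one string."""
        pending = [self._line] if self._line else []
        return "\n".join(self._lines + pending)
```

Client code can ask whether a line has been started or force a break, but it has no way to put the buffer into an inconsistent state.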
Can I assume that the tag <text:p> is the node before the leaf node? That is, can I assume the paragraph level is able to have children, but not grand-children?
No, you can't assume that. If there is any italic, bold or underlined text in your paragraph, it will be in a special node called a "span", so your paragraph will have grandchildren.
I'm still not clear what you mean by "same level or below text:p". When you say "same level" do you mean anything at the same level, like all children of office:body, which includes text:sequence? Or is it just text:p?
You're right that it's not clear in the specification. Let me try again. What I want is for everything to be indented as it is already except the contents of paragraphs. (Paragraphs are nodes with name text:p.) All those style and sequence things should still be on new lines and indented. Only the contents of paragraphs should be run together on one line.
I am just about to complete Q2, but I have noticed some quirks with the formatting. For instance, in sample3.sxw, there is a section on indentation of a paragraph. The text in the sample says the first line should be indented, however the original class TEXT_FORMATTER does not indent the first line of this paragraph. My question is, should I be making the generated text as similar as possible to what the original text_renderer did, or should I be following the instructions on the samples?
Ignore the formatting instructions in the sample documents. (Sorry, they shouldn't really be there. They're left over from last year or the year before, and for this year's assignment they're just confusing.)
As a first approximation, your output should be identical to that produced by the old version of the program. For more details, see Q6 above.
In sample2.doc the third list item (just after the copyright notice) has the label "3. ", but in your HTML and text files it's item 1, as if it was in a new list. Is this a problem we're supposed to fix, or a bug in MSWord or OpenOffice, or what?
It's a problem with the program we're working on, but not something you have to worry about. If you look at your sample2.xml, you will see that there are two lists, with some other paragraphs in between (like that copyright notice). The second list just contains one item, and has the attribute text:continue-numbering="true". This tells OpenOffice not to start numbering again from 1, but to continue the numbers from the previous list. Our program doesn't know about this attribute, and so it gets the numbering wrong.
You don't have to worry about it or fix it or anything.
Where does the upper case thing come in? I can't seem to find any functions in the HTML renderer or the text renderer that change the mode to upper case. Am I missing something?
Look hard at the visit_heading routine in the text renderer. For a Level 1 heading, it sets upper_case to True, then processes the contents, then sets it back to False afterwards. Your output formatter has to be able to do this too.
Unfortunately the existing sample documents don't seem to test this. If you look at the XML you'll see that the content of the headings is already in upper case. You should create a new document with a primary heading in lower case and check that it comes out in upper case in the .txt file.
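The pattern described for visit_heading can be sketched as follows. This is Python rather than Eiffel, and the class and its text routine are simplified stand-ins invented for illustration; only the names upper_case, add_word and visit_heading come from the discussion above.

```python
class OutputFormatter:
    """Toy formatter with an upper-case mode that callers switch on and off."""

    def __init__(self):
        self.upper_case = False
        self._words = []

    def add_word(self, word):
        # The mode is checked at the moment each word is added.
        self._words.append(word.upper() if self.upper_case else word)

    def text(self):
        return " ".join(self._words)


def visit_heading(formatter, level, words):
    """Level 1 headings are rendered in upper case, other levels unchanged."""
    if level == 1:
        formatter.upper_case = True
    for word in words:
        formatter.add_word(word)
    # Always restore the mode so later text is not affected.
    formatter.upper_case = False


formatter = OutputFormatter()
visit_heading(formatter, 1, ["Primary", "Heading"])
visit_heading(formatter, 2, ["Secondary", "Heading"])
print(formatter.text())
```

A mixed-case primary heading, like the one in sample7.sxw, is exactly the input that shows whether the mode is being applied.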
In Q4, I don't understand what to do with the automatic paragraph styles like "P1", "P2" etc. For example in the XML file, if you have a style like
<style:style style:parent-style-name="Affiliation" style:family="paragraph" style:name="P2">
Does that mean that in the HTML body, whenever I encounter <P class="P2"> I must replace "P2" with "Affiliation" etc?
Don't do the substitution in the HTML. Leave it so that what you end up with in your HTML is <P class="P2">. Add a new section to the CSS that has
P.P2 { whatever the formatting was for P.Affiliation }
You should be able to arrange that this happens as you traverse the styles part of the tree. You'll need a dictionary to store the formatting for Affiliation (and the others) so that you can insert it when you need to.
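Here is one way the dictionary idea might look, sketched in Python rather than Eiffel. The function name build_css, the shape of its two arguments and the italic declaration used for Affiliation are all placeholders chosen for illustration; in the real program the style names and parents would be collected while visiting the style elements.

```python
def build_css(named_styles, automatic_styles):
    """Return CSS text with one rule per named or automatic paragraph style.

    named_styles:     dict mapping a style name to its CSS declarations
    automatic_styles: list of (style name, parent style name) pairs
    """
    rules = dict(named_styles)
    for name, parent in automatic_styles:
        # Copy the parent's formatting; the HTML keeps class="P2" as it is.
        # Automatic styles whose parent is unknown are silently skipped here.
        if parent in rules:
            rules[name] = rules[parent]
    return "\n".join(f"P.{name} {{ {decl} }}" for name, decl in rules.items())


css = build_css(
    {"Affiliation": "font-style: italic;"},   # placeholder formatting
    [("P2", "Affiliation")],
)
print(css)
```

Because the lookup happens when the CSS is generated, nothing in the HTML body needs to know about parent styles.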
In Q2, I've got rid of most of the extraneous spaces, but I'm still getting extra spaces in between the first letter of a fully capitalised word and the rest of the word, like this: A BDELBAR instead of ABDELBAR. But this doesn't seem to happen for a word like "The". Could you give me any hints?
First, look at the middle bullet point in Q6 above.
For more detail, look at the XML. For that example, look at sample2.xml, in the references section, near the end. In those names, everything after the first letter is inside a <span> with an attribute telling it to format that in a "small caps" font. But they're part of the same word and there shouldn't be a space in between. This is really tricky, and you'll need to think hard to get it right. (The old version in the text renderer and HTML renderer gets it slightly wrong too, as mentioned in Q6 above, because it sometimes puts a line break in between.) It comes down to controlling carefully when your output formatter inserts a space. You'll probably have to modify the add_word routine, because the one I gave you always inserts a space, except when what it is going to add is at the beginning of a new line.
This is one of the small tricky points about this assignment. There is a simple solution, but you may have to think for a while to find it.
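For what it's worth, here is one possible approach, sketched in Python rather than Eiffel. It is only an assumption about how the problem could be tackled, not the intended solution: instead of add_word it uses a hypothetical routine add_text that receives raw chunks of character data and treats a word as finished only when it actually sees white space, so a word split across two tree elements stays in one piece.

```python
class OutputFormatter:
    """Toy formatter that only breaks lines where the data contains white space."""

    def __init__(self, width=72):
        self._width = width
        self._lines = []   # completed lines
        self._line = ""    # line being filled
        self._word = ""    # word being assembled, possibly across chunks

    def add_text(self, chunk):
        """Accept a chunk of data; chunk boundaries are not word boundaries."""
        for ch in chunk:
            if ch.isspace():
                self._flush_word()
            else:
                self._word += ch

    def _flush_word(self):
        """Move the finished word onto the current line or onto a new one."""
        if not self._word:
            return
        if not self._line:
            self._line = self._word
        elif len(self._line) + 1 + len(self._word) <= self._width:
            self._line += " " + self._word
        else:
            self._lines.append(self._line)
            self._line = self._word
        self._word = ""

    def finish(self):
        """Flush whatever is pending and return the formatted text."""
        self._flush_word()
        if self._line:
            self._lines.append(self._line)
            self._line = ""
        return "\n".join(self._lines)


formatter = OutputFormatter(width=20)
for chunk in ["A", "BDELBAR", " and R. R", "EITER", ", Eds."]:
    formatter.add_text(chunk)
print(formatter.finish())
```

The chunks "A" and "BDELBAR" arrive separately, as they would from a paragraph and a span inside it, but no space or line break can appear between them because none was present in the data.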
In Q1, how should we format a paragraph inside a list item, inside a list?
Like this:
<text:p> ... </text:p>
<text:ordered-list text:style-name="WW8Num1">
  <text:list-item>
    <text:p text:style-name="P10">Is a shift from DOM to DCM conducive to effective learning?</text:p>
  </text:list-item>
The principle is this: indent everything exactly as before, except when you see a paragraph element (<text:p>). The paragraph start tag should be on a new line and indented just as for any other element, but the contents of the paragraph (other tags for spans etc, data) should be on the same line as the start paragraph tag, with no added white space, finishing with the paragraph end tag. Then you do a newline and resume indenting.
We use newlines and indents to show the structure of the document as much as possible. But within paragraphs white space (spaces, tabs and newlines) is significant, so we don't add any until we get to the end of the paragraph.
In Q3 you said to create a routine "visit_section" and make it do nothing. Are we allowed to make it visit its children? Because if not, the text inside a section node would not print.
Yes, of course. You're absolutely right.
What I should have written was "make it do nothing except the recursive call to visit its children."
In Q3, can we assume that a secondary heading only comes after a primary heading? If not, then what level of indentation should a secondary heading have?
Firstly, the indentation level is not something you set. It's a result of the position of a node in the tree. Every time you descend another level in the tree, the pretty-printer adds one extra level of indentation. It's the same with the HTML renderer.
I don't want you to assume that a Secondary Heading has to come after a Primary Heading. Real people don't always follow rules like that, but we should still be able to format their work. You should be able to do the rearrangement by following these rules:
When you see a primary head, make a level 1 section and put the heading inside it, along with everything up to the next primary head.
When you see a secondary head, make a level 2 section and put the heading inside it, along with everything up to the next primary or secondary head.
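The two rules above can be sketched like this, in Python rather than Eiffel. Several things here are assumptions for illustration: the flat list of (kind, text) pairs standing in for the children of the body, the Section class, and the choice to nest a level 2 section inside the current level 1 section when there is one. The sketch also builds a new list instead of modifying the tree during traversal, which sidesteps the danger mentioned in the hints but is not necessarily how the visitor has to work.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Section:
    """Hypothetical section node holding a heading and what follows it."""
    level: int
    children: List[Union["Section", Tuple[str, str]]] = field(default_factory=list)


def deepen(items):
    """Group a flat list of (kind, text) items into nested sections."""
    top = []
    level1 = None   # currently open level 1 section, if any
    level2 = None   # currently open level 2 section, if any
    for kind, text in items:
        if kind == "primary":
            # A primary head closes both open sections and opens a new level 1.
            level1, level2 = Section(1), None
            top.append(level1)
            target = level1
        elif kind == "secondary":
            # A secondary head closes the open level 2 section and opens a new
            # one, even when no primary head has been seen yet.
            level2 = Section(2)
            (level1.children if level1 is not None else top).append(level2)
            target = level2
        else:
            target = level2 if level2 is not None else level1
        if target is None:
            top.append((kind, text))      # material before any heading
        else:
            target.children.append((kind, text))
    return top


flat = [
    ("secondary", "Stray secondary head"),
    ("para", "a"),
    ("primary", "First primary head"),
    ("para", "b"),
    ("secondary", "Nested secondary head"),
    ("para", "c"),
]
for node in deepen(flat):
    print(node)
```

Note that the first item is a secondary head with no primary head before it, and it still ends up in its own level 2 section.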
Copyright © 2004, Ian Barnes, The Australian National University
Feedback & Queries to
comp2100@iwaki.anu.edu.au
Version 2004.4, 29 April 2004, 14:40:49