ANU The Australian National University

COMP2100
Assignment 2: The Parse Tree

Due at 5pm on Friday 30th April 2004

Hints and FAQ

Introduction

Following your successful testing work, you have been moved into the development team for Oops, the Open Office Publication System. At the moment this program reads an Open Office word processor file converted from a Microsoft Word file that was prepared using the ACM journal template, scans and parses its XML content, builds a parse tree representing the logical structure of the document and then writes its output files, which include the ".tree" file and the plain text and HTML renderings discussed below.

Your assignment is to extend the program toward the eventual goal of making it a useful publication system.

You are to carry out the following tasks:

  1. Modify the tree file output so that it writes correct XML (and change the name of the output file to ".xml").

  2. Integrate the new OUTPUT_FORMATTER class to the plain text and HTML renderers so as to remove repeated code.

  3. Add a new visitor that goes over the tree and deepens the structure by adding elements representing sections and subsections.

  4. Improve the HTML_RENDERER so that it uses Cascading Style Sheets to specify formatting.


1. Modify the tree file output to write correct XML

The ".tree" file that the program currently creates is not correct XML. This is because the program puts quote marks and extra white space around the data elements when it prints them. Your task here is to change this so that the output is good XML. To do this, you first have to understand the role of white space (spaces, tabs and newlines) in XML files.

In XML, some white space is significant, meaning that it is part of the data content of the file, and other white space is insignificant, meaning that it can be removed, changed or added to without affecting the meaning of the file. In our situation, white space inside data elements and between elements at the paragraph level or lower is significant, while white space at a higher level is not. This means that it is OK to add newlines and spaces between high-level tags in order to show structure by indentation, but that it is not OK to add extra white space inside "<text:p>" elements or others at the same level or lower in the document.

You must change the code that prints the tree file so that it no longer puts quote marks and extra white space in these places.

For example, here is the last part of sample6.tree:

      <text:p text:style-name="P10">
         <text:span text:style-name="T5">
            "MEYER"
         </text:span>
         ", B. 1997. "
         <text:s/>
         <text:span text:style-name="T1">
            "Object-Oriented Software Construction,("
         </text:span>
         <text:span text:style-name="T6">
            "Second Edition),"
         </text:span>
      </text:p>
      <text:p text:style-name="Footnote">
         "Third and last footnote."
      </text:p>
   </office:body>
</office:document-content>

This should be reformatted to look like this:

      <text:p text:style-name="P10"><text:span text:style-name="T5">MEYER</text:span>, B. 1997. <text:s/><text:span text:style-name="T1">Object-Oriented Software Construction,(</text:span><text:span text:style-name="T6">Second Edition),</text:span></text:p>
      <text:p text:style-name="Footnote">Third and last footnote.</text:p>
   </office:body>
</office:document-content>

Notice that each paragraph is just one long line with all the tags inside it formatted "inline" rather than each on a separate line and indented. Notice that the start-paragraph, end-body and end-document-content tags are indented according to their proper level of nesting within the document.
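
The rule can be sketched roughly as follows. This is a minimal illustration written in Python rather than Eiffel, and it is not the Oops code: the Node class, the INLINE_FROM set and the function names are all hypothetical, text is represented as plain strings, and the text is assumed to be already escaped for XML.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Assumption for this sketch: elements with these names, and everything
# inside them, contain significant white space and are written on one line.
INLINE_FROM = {"text:p"}


@dataclass
class Node:
    name: str
    attrs: dict = field(default_factory=dict)
    children: List[Union["Node", str]] = field(default_factory=list)


def start_tag(node, empty=False):
    attrs = "".join(f' {key}="{value}"' for key, value in node.attrs.items())
    slash = "/" if empty else ""
    return f"<{node.name}{attrs}{slash}>"


def inline(item):
    """Serialise an item with no added white space and no quote marks."""
    if isinstance(item, str):
        return item
    if not item.children:
        return start_tag(item, empty=True)
    body = "".join(inline(child) for child in item.children)
    return f"{start_tag(item)}{body}</{item.name}>"


def write_tree(node, depth=0, step=3):
    """Return the output lines; indentation is added only above paragraph level."""
    pad = " " * (depth * step)
    if isinstance(node, str):
        return [pad + node]
    if node.name in INLINE_FROM or not node.children:
        return [pad + inline(node)]
    lines = [pad + start_tag(node)]
    for child in node.children:
        lines.extend(write_tree(child, depth + 1, step))
    lines.append(f"{pad}</{node.name}>")
    return lines
```

The point of the design is that the decision to indent is made once, at the element where significant white space begins; below that element nothing is added between tags and text.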

Finally, change the root class so that this file has the extension .xml rather than .tree. You may want to change one line of the Makefile as well.


2. Integrate the output formatter

In Assignment 1 you tested an output formatter class. This is a new class that is not yet part of the system. Instead, the existing system does output directly from the two visitor classes TEXT_RENDERER and HTML_RENDERER. This code is repeated in both classes, and makes them large and unwieldy. This is poor design, and is the reason I wrote the new class OUTPUT_FORMATTER. Your task now is to remove all the output formatting code from the text renderer and the HTML renderer, and to replace it with calls to an output formatter object. When you have finished, all output from those renderers must go through an output formatter. There must be no direct output in the renderers whatsoever.

You will find, if you look closely at this problem, that the output formatter class I have given you is not yet capable of doing the job. It lacks some important features that are needed by the renderers. You will have to add to the output formatter so that it has the necessary capabilities. In particular, you will need to add:

The last point there is more subtle than the others. It comes up in situations like the reference list in sample6.tree where you will see that the first letter of each author's name is in ordinary upper-case, but the rest is inside a span with style T6. If you look carefully at the style definitions at the top of the file, you'll see that style T6 is what is called "small caps". So although the formatting for the first letter is different from that of the rest of the word, they're all still part of the same word, and should not be separated by a space.
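
One possible way to handle pieces of the same word is sketched below. It is written in Python rather than Eiffel and is not the OUTPUT_FORMATTER class: the LineFiller class, its put_word and finish routines and the join_previous flag are hypothetical names made up for this illustration.

```python
class LineFiller:
    """Hypothetical stand-in for an output formatter that fills lines to a width."""

    def __init__(self, width=40):
        self.width = width
        self.lines = []   # completed lines
        self.words = []   # words collected for the current line

    def put_word(self, text, join_previous=False):
        """Add text; with join_previous it is glued to the last word, no space."""
        if join_previous and self.words:
            text = self.words.pop() + text
        candidate = " ".join(self.words + [text])
        if self.words and len(candidate) > self.width:
            # The word does not fit: close the current line and start a new one.
            self.lines.append(" ".join(self.words))
            self.words = [text]
        else:
            self.words.append(text)

    def finish(self):
        """Flush the last partial line and return all lines."""
        if self.words:
            self.lines.append(" ".join(self.words))
            self.words = []
        return self.lines
```

A renderer using something like this would call put_word("M") for the initial letter and put_word("EYER,", join_previous=True) for the small caps span, so that the two pieces are wrapped as a single word.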

Note: The version of the output formatter I have given you is one that satisfies most of the requirements given in Assignment 1. Most of the problems with the output formatter came about because clients were allowed to change the formatting (width, indentation etc) in the middle of a paragraph. You can choose to simplify things by adopting a convention that the clients never do this. Adding this to the requirements for the output formatter would have made Assignment 1 much easier.


3. Deepen the structure

At the moment the structure of our documents is quite flat. Except for list items, each different component at the level of paragraphs is at the same level in the tree, immediately below the office:body element. What you have to do in this part is to change that so that the document has more structure. You will do this by writing a new visitor class called TREE_DEEPENER (a bit like the existing class TREE_FIXER) that traverses the tree reorganising its nodes.

Every time you encounter a node whose name is text:p with an attribute text:style-name="Primary Head", you must insert a new node with name "section" and with an attribute level="1", and move the heading and all nodes after it until the next primary head so that they become children of the section node.

Every time you encounter a node whose name is text:p with an attribute text:style-name="Secondary Head", you must insert a new node with name "section" and with an attribute level="2", and move the heading and all nodes after it until the next primary or secondary head so that they become children of the section node.

For example, if you start with this:

<text:p text:style-name="Primary Head">Section 1</text:p>
<text:p>First paragraph in Section 1.</text:p>
<text:p text:style-name="Secondary Head">Subsection 1.1</text:p>
<text:p>First paragraph in Subsection 1.1.</text:p>
<text:p text:style-name="Secondary Head">Subsection 1.2</text:p>
<text:p>First paragraph in Subsection 1.2.</text:p>
<text:p text:style-name="Primary Head">Section 2</text:p>
<text:p>First paragraph in Section 2.</text:p>
<text:p text:style-name="Primary Head">Section 3</text:p>
<text:p>First paragraph in Section 3.</text:p>
<text:p text:style-name="Secondary Head">Subsection 3.1</text:p>
<text:p>First paragraph in Subsection 3.1.</text:p>

you should end up with this:

<section level="1">
   <text:p text:style-name="Primary Head">Section 1</text:p>
   <text:p>First paragraph in Section 1.</text:p>
   <section level="2">
      <text:p text:style-name="Secondary Head">Subsection 1.1</text:p>
      <text:p>First paragraph in Subsection 1.1.</text:p>
   </section>
   <section level="2">
      <text:p text:style-name="Secondary Head">Subsection 1.2</text:p>
      <text:p>First paragraph in Subsection 1.2.</text:p>
   </section>
</section>
<section level="1">
   <text:p text:style-name="Primary Head">Section 2</text:p>
   <text:p>First paragraph in Section 2.</text:p>
</section>
<section level="1">
   <text:p text:style-name="Primary Head">Section 3</text:p>
   <text:p>First paragraph in Section 3.</text:p>
   <section level="2">
      <text:p text:style-name="Secondary Head">Subsection 3.1</text:p>
      <text:p>First paragraph in Subsection 3.1.</text:p>
   </section>
</section>

You will find that at the moment objects belonging to the class XML_CONTAINER_ELEMENT can only be created while parsing a stream of tokens from a scanner. You will need to add another creation routine to this class that allows you to create new objects and insert them into the tree.

You will need to add a new class to the strategies cluster, a SECTION_STRATEGY. Model this on the other strategy classes. Make sure that when you create a new section node, you attach an appropriate strategy object to it.

Of course you'll need to add some code to the root class CONVERTER to create one of these tree deepeners and run it over the tree. You should do this after the tree fixer and before any of the output. That way you'll be able to check the .xml file to see if this is working.

Once you've done this, you'll find that none of your visitors work any more. You'll need to add a new routine visit_section to class VISITOR and each of its subclasses. For now, have them do nothing.

One point to be wary of: each node in the tree has a link to its parent node. You will need to update these on any node that you move. Later stages in processing may rely on this being correct.
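
The regrouping described in this part can be sketched with a stack of currently open sections. The following is only an illustration in Python, not Eiffel and not the TREE_DEEPENER you are asked to write: the Node class and the deepen function are hypothetical, and strategy objects are left out entirely.

```python
class Node:
    """Hypothetical tree node with a parent link, loosely like the Oops tree."""

    def __init__(self, name, attrs=None, text=""):
        self.name = name
        self.attrs = dict(attrs or {})
        self.text = text
        self.parent = None
        self.children = []

    def append(self, child):
        child.parent = self      # keep the parent link correct on every move
        self.children.append(child)


HEAD_LEVELS = {"Primary Head": 1, "Secondary Head": 2}


def heading_level(node):
    """Return 1 or 2 for a heading paragraph, otherwise None."""
    if node.name != "text:p":
        return None
    return HEAD_LEVELS.get(node.attrs.get("text:style-name"))


def deepen(body):
    """Regroup the children of body under nested section nodes."""
    old_children, body.children = body.children, []
    open_sections = []           # stack: innermost open section is last
    for node in old_children:
        level = heading_level(node)
        if level is not None:
            # A heading closes every open section at the same or a deeper level.
            while open_sections and int(open_sections[-1].attrs["level"]) >= level:
                open_sections.pop()
            section = Node("section", {"level": str(level)})
            (open_sections[-1] if open_sections else body).append(section)
            open_sections.append(section)
        # The node (heading or not) goes into the innermost open section.
        (open_sections[-1] if open_sections else body).append(node)
    return body
```

Because every move goes through append, the parent links are updated as a side effect and cannot be forgotten for an individual node.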


4. Improve the HTML output

As it stands, the HTML_RENDERER formats every paragraph the same, no matter whether it is a primary heading, the article's title, or just an ordinary paragraph of text. This is because it just converts each text:p node in the tree into a <P> tag in the HTML output. Your task here is to improve the HTML formatting by adding "class" attributes to paragraphs and adding a Cascading Style Sheet section in the HTML header specifying formatting.

For the paragraph nodes, all you have to do is modify the output so that instead of just writing <P> to the HTML output, it adds an attribute with name "class" and value equal to the value of the "text:style-name" attribute in the tree. So for example, if your renderer encounters a node with name "text:p" and style-name "Primary Head" it should write "<P class="Primary-Head">". Notice that spaces in the style name must be replaced by hyphens.

While you're at it, add code to visit_section so that it creates an HTML <DIV> element for each section. Add an attribute class="section level 1" or class="section level 2" as appropriate.
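
As a small illustration of the tags described above, here is a sketch in Python (not Eiffel, and not the HTML_RENDERER itself). The function names are hypothetical, and writing a bare <P> for a paragraph with no style name is an assumption of the sketch, not a stated requirement.

```python
def css_class(style_name):
    """Turn a style name into a class name by replacing spaces with hyphens."""
    return style_name.replace(" ", "-")


def paragraph_open_tag(style_name):
    """Opening tag for a text:p node; style_name may be None (assumption)."""
    if style_name is None:
        return "<P>"
    return f'<P class="{css_class(style_name)}">'


def section_open_tag(level):
    """Opening tag for a section node with the given level (1 or 2)."""
    return f'<DIV class="section level {level}">'
```

For example, paragraph_open_tag("Primary Head") gives <P class="Primary-Head"> and section_open_tag(2) gives <DIV class="section level 2">.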

The second change you need to make to the HTML_RENDERER is to have it write a Cascading Style Sheet into the HTML header. At the moment the only thing it puts into the HEAD element is the document title. Now it should also add a style sheet, like this:

<style type="text/css">
<!--
---- CSS style information goes here ----
-->
</style>

Most of what will go in the style sheet section will be the same for every article. (This is our "house style".)

P.Title { 
    font-size: large; 
    font-weight: bold;
    text-align: center; }
P.Abstract {
    margin-left: 10%;
    margin-right: 10%;
    font-size: small; }
P.Author's-Name {
    font-weight: bold;
    text-align: center; }
P.Affiliation {
    text-align: center; }
P.Categories {
    font-size: small; }
P.Text-Body {}
P.Quoted-Text {
    margin-left: 10%;
    margin-right: 10%; }
P.Numbered-List {}
P.Footnote {
    font-size: small; }
P.Figure-Caption {
    text-align: center;
    font-size: small; }
P.Table-Head {
    text-align: center; }
P.References {}
P.Primary-Head {
    font-size: large;
    font-weight: bold; }
P.Secondary-Head {
    font-weight: bold; }
P.Displayed-Equation {
    text-align: center;
    font-style: italic; }
P.Initial-Body-Text {}

If that was all there was to it, this would be easy. But there's more. In the section of the tree under the node "office:automatic-styles" there is more style information. Some of this is about text styles for spans. These are chunks of text within a paragraph that should be formatted differently, for example a phrase in italics. You can identify these styles because they have an attribute style:family="text". For each of these you must add an entry to the CSS section, and in the main part of the document you must make sure that the spans are written to the output as <span class="style-name">. At the moment the renderer does some fancy footwork to replace these with <I> and <B> elements, but it misses things like subscripts and superscripts and the small caps. This solution is more general.

For example, in sample2.tree you will find

<style:style style:family="text"
             style:name="T1">
   <style:properties style:font-weight-asian="bold"
                     fo:font-weight="bold"/>
</style:style>

For this you should add a section to your CSS:

span.T1 { font-weight: bold; }

Basically what you need to do is find the style:properties element and copy the content of any attribute that has the prefix fo:. So fo:font-weight="bold" gets converted to font-weight: bold, fo:font-style="italic" gets converted to font-style: italic and so on. Ignore all the other attributes.
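
The conversion just described can be sketched like this. It is an illustration in Python rather than Eiffel; the function names are hypothetical and the attributes of the style:properties element are assumed to be available as a dictionary of name/value pairs.

```python
def css_declarations(properties):
    """Keep only the fo:-prefixed attributes, dropping the prefix."""
    prefix = "fo:"
    return [f"{name[len(prefix):]}: {value};"
            for name, value in properties.items()
            if name.startswith(prefix)]


def span_rule(style_name, properties):
    """Build the CSS rule for one text style, e.g. span.T1 { ... }."""
    body = " ".join(css_declarations(properties))
    return f"span.{style_name} {{ {body} }}"
```

Called with "T1" and the two attributes from the sample2.tree example above, span_rule returns span.T1 { font-weight: bold; } because the style:font-weight-asian attribute is ignored.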

The final thing you need to work out is what to do with all those paragraph styles (the ones with style:family="paragraph") like "P1", "P2" and so on. Each of these has a parent style that it inherits from. You will need to add an entry in the CSS for each of these styles, and associate it with the same formatting instructions as its parent style. This means that you probably want to put the formatting instructions into a lookup table. Try using the library class DICTIONARY.
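
The lookup table idea might look something like the sketch below. It uses a Python dict where the Eiffel version would use DICTIONARY; the names HOUSE_STYLE and paragraph_rules are hypothetical, only three house style entries are shown, and the mapping from each automatic style to its parent style name is assumed to have been collected from the tree already.

```python
# A few entries of the house style, keyed by CSS class name.
HOUSE_STYLE = {
    "Primary-Head": "font-size: large; font-weight: bold;",
    "Secondary-Head": "font-weight: bold;",
    "Quoted-Text": "margin-left: 10%; margin-right: 10%;",
}


def paragraph_rules(parents, table):
    """Build one CSS rule per automatic paragraph style.

    parents maps an automatic style name such as "P1" to the name of its
    parent style; table maps class names to formatting instructions.
    """
    rules = []
    for name, parent in parents.items():
        key = parent.replace(" ", "-")          # same hyphen rule as class names
        declarations = table.get(key, "")       # unknown parent: empty rule
        rules.append(f"P.{name} {{ {declarations} }}")
    return rules
```

With parents equal to {"P1": "Primary Head"}, the result is a single rule, P.P1 { font-size: large; font-weight: bold; }, which repeats the formatting of the parent style.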


Getting started

Download the compressed archive file a2start.tar.gz. This contains the source code for the starting version of the program. Uncompress and unpack the archive.

Note that the starting code is almost the same as last year's starting code. It does not contain all the extra features that students added last year, and that I spoke about in the lecture.

Compile the program by typing make. There are three sample files provided: sample1.sxw, sample2.sxw and sample3.sxw. To run the program, type for example oops sample1. I will provide more sample files over the coming days. Keep an eye on the Assignment 2 FAQ and Hints Page for these.

Take plenty of time to read and understand the code before you try to modify it. Don't dive in and start changing things before you understand how they work.

Each time you modify a class, you must also modify the Author and Date information at the top. We haven't covered version control yet in lectures, so remove the dollar signs and the Revision line, and update the author and date like this:

-- Author: Ian Barnes, modified by Jane Bloggs
-- Date: 10 May 2002

By the way, the first two sample files were downloaded from the ACM web site's Instructions for Authors. I have included the original Microsoft Word versions of these documents in case you have Word and would like to check anything. I created the third document in Open Office by modifying one of the others. I have included a version of it saved in Word format. If you look closely at sample3.tree or sample3.html you will see that I have messed it up. Somehow I managed to put the major section headings inside one-item numbered lists. This illustrates the point of a preflight system. I can put my document into it, see that the output is wrong, fix it up and try again. (I'll do that, and a fixed version will probably appear soon as sample4.sxw.)

If you don't understand something in the requirements, send your questions to comp2100@iwaki sooner rather than later. I will post some of the answers to the Assignment 2 FAQ and Hints Page, so check there before you write to me.

This will probably be quite a challenging assignment for many of you, particularly because you will have to understand at least part of quite a large program before you can start modifying it. I urge you strongly to start soon. Don't leave this until the last minute.


Submission

Submit your assignment as a compressed tar archive by moving to the working directory and then typing (or cutting and pasting from your browser):

tar cvf a2.tar Makefile oops.ace *.e */*.e
gzip a2.tar
/dept/dcs/comp2100/bin/submit_a2 a2.tar.gz

Make sure you follow those steps precisely, or some of your files won't get collected in the tar archive. Read carefully through the output of the submit_a2 script to make sure that there were no problems with your submission.

Marking

Your submission will be marked out of 20 according to a standard marking guide. Your program will be compiled and run on test data as part of the marking process.

Your assignment mark will be based on: the correctness of your submitted program, the clarity, readability and simplicity of added or modified code, documentation of added or modified code using header comments and other comments, and compliance with the Eiffel style guidelines on the Eiffel Coding Standard Page. Make sure that the version you submit will compile and run, even if it won't do everything it is supposed to. You will be penalised heavily if your program crashes or fails to compile.

We will make every effort to return your assignment to you in your lab class in Week 10.

Late submissions

Late submissions will be accepted up to one week after the deadline. They will be penalised four marks (20%).


Copyright © 2004, Ian Barnes, The Australian National University
Feedback & Queries to comp2100@iwaki.anu.edu.au
Version 2004.4, 7 April 2004, 14:57:21