COMP2100/2500/6442
Homework 8

Continue filling in a new Time Recording Log and Weekly Time Use Summary each week.

Write the following program, following the PSP to make estimates of your program length and programming time, and record the kinds of defects. This is getting harder: you will need to pay closer attention if you want to improve. Fill in the Project Plan Summary and a Defect Recording Log.

Write a Java program called ‘Commonest’ that reads text from the standard input and then prints out a table of the 5 most common words, together with their number of occurrences, in descending order of frequency.

For example, suppose that the file stuff.txt contains the text:

Eiffel! eiffel? eiffel `EIFFEL' eiFfeL
java Java "JAVA" java
pascal Pascal%$#@!&*PASCAL
C, C; C.
bash-bash
BASIC
COBOL
Fortran66
Fortran77
Fortran90
SQL

Then we could have the following interaction with the finished program:

[comp2100@partch]$ java Commonest < stuff.txt
  5 eiffel
  4 java
  3 c
  3 pascal
  2 bash

Notice that the program discards the punctuation characters and reduces all words to lower-case for comparison. A word is defined as an unbroken sequence of digits and letters (both upper- and lower-case), and words are separated by sequences of other stuff (anything else: punctuation, special symbols, white space, ends of lines etc). Note that the underscore character is not included in the words.

Words with the same frequency should appear in alphabetical order and only the first five words should be printed, with their frequency counts.

Quick reading comprehension check: why didn't fortran appear as a word with frequency 3?


Hints

You will probably almost certainly need to write more than one class. This is a decision to make during the Design phase of development. By the time you start the Coding phase, you should know exactly what classes you will write, what library classes you will use and how they will work together. The design part of this homework is hard – don't rush through it.

It may make this whole thing easier if you split the task into stages.

For week 8: attempt part 1 and 2: split the input into words, and count their frequency. Parts 3 and 4 come later for homework 10.

1. Read and split

The first stage is to read the input and split it into words. (Hint: research the class StreamTokenizer or the method String.split). Transform words into a common form (so that all of the different cases of characters in input words will be treated the same from now on: we do not need to preserve the original spellings for the output).

2. Count the words

The second stage is to do something with those words so that you end up counting how many times each word appears: that is, some relation between words and their frequencies (the number of times a word occurs). (Hint: you could consider using a binary tree for sorting the words as strings, with a counter value at each tree node, as well as the word's string value (the key); or a HashTable, with a counter value for each word. The benefit of doing this homework task is really about your own use of data structures that you build, so adapting your own binary tree from homework 6 is preferable). If you have not done homework 6 you need to do it sometime — do it before you go any further!

Homework 10

Read the rest of the specification all the way through before you make any design or estimation steps!

task 3. Sort the words by frequency; 4. Select the N commonest

The third stage is to sort this list into descending order of frequency, and for words with the same frequency, by alphabetical order. The final stage is to print the first N of them (the first five, for example). (It may make sense to do these together.)

As usual, don't hard code the number '5' into your program but define a constant value. A more general program would get the number from the command arguments; see the Extensions below.

Design – this is important

The most crucial design decision is what data structure to use for storing the words and frequencies, because you have to sort things twice: once for the words themselves, and again for the frequencies (to extract the 5 top ones). Lecturer Ian Barnes created a new class that had two attributes: a string for the word, and an integer for its frequency, and wrote a method to do the sorting himself. But it's not easy to make this work. There may be better ways. Lecturer Chris Johnson created a generic class that could be used for sorting things of any comparable kind, and used it twice over.

The whole program is probably the hardest homework for the semester. Treat it with respect. Take enough time in the Analysis and Design phases. When Ian Barnes did this exercise, all my estimates were out by a factor of two: it took twice as long and the code was twice as large as he thought. So he didn't spend enough time in the early phases and ended up finding lots of serious defects during the Test phase. Use his experience: don't make the same mistake.


Extension tasks

  1. Many writers in English use the apostrophe to abbreviate words like “I'm”, “it's”, “we're”, “can't”, and “wouldn't”. The way the program is specified above, it would split each of those into two words: “I” and “m”, “it” and “s” and so on. Modify the requirements so that a single apostrophe in the middle of a word isn't removed, but that quote marks before or after a word are removed, along with all other punctuation. So running the program on the sentence:

    "I won't, won't! WON'T!!", shouted the baby.

    would give the results:

      3 won't
      1 baby
      1 i
      1 shouted
      1 the

    This means that you have to be smarter than the simple use of the split method.

  2. for homework 10
    Get the number of commonest words to be selected from the command argument. For example:
        java Commonest 8 < testwords.txt
    
    to deliver the 8 most common words.

Copyright © 2005, 2009. Ian Barnes, Chris Johnson, The Australian National University
$Revision: 603 $ $Date: 2011-04-26 18:19:15 +1000 (Tue, 26 Apr 2011) $ $Author: cwj $
Feedback & Queries to comp2100@cs.anu.edu.au