[ANU] [DCS] [COMP2100/2500] [Description] [Schedule] [Lectures] [Labs] [Homework] [Assignments] [COMP2500] [Assessment] [PSP] [Java] [Reading] [Help]
COMP2100/2500
Homework 12Continue filling in a new Time Recording Log and Weekly Time Use Summary each week.
Write the following program, following the enhanced PSP as described in Lecture 19 and filling in the Project Plan Summary and a Defect Recording Log. Use one of the Project Plan Summary forms that doesn't have greyed-out sections.
Write a Java program called ‘Commonest’ that reads text from the standard input and then prints out a table of the 5 most common words, together with their number of occurrences, in descending order of frequency.
For example, suppose that the file stuff.txt contains the text:
Eiffel! eiffel? eiffel `EIFFEL' eiFfeL java Java "JAVA" java pascal Pascal%$#@!&*PASCAL C, C; C. bash-bash BASIC COBOL Fortran66 Fortran77 Fortran90 SQLThen we could have the following interaction with the finished program:
[comp2100@partch]$ java Commonest < stuff.txt 5 eiffel 4 java 3 c 3 pascal 2 bashNotice that the program ignores all punctuation and reduces all words to lower-case for comparison. A word is defined as an unbroken sequence of letters (both upper- and lower-case) and digits, and words are separated by sequences of other stuff (anything else: punctuation, special symbols, white space, ends of lines etc).
Words with the same frequency should appear in alphabetical order and only the first five words should be printed.
Hints
You will probably need to write more than one class. This is a decision to make during the Design phase of development. By the time you start the Coding phase, you should know exactly what classes you will write, what library classes you will use and how they will work together.
It may make this whole thing easier if you split the task into stages. The first stage is to read the input and split it into words. The second stage is to do something with that list of words so that you end up with a relation between words and frequencies (the number of times a word occurs). The third stage is to sort this into descending order of frequency, and for words with the same frequency, by alphabetical order. The final stage is to print the first five. (It may make sense to do the last two stages together.)
The most crucial design decision is what data structure to use for storing the words and frequencies. I created a new class that had two attributes: a string for the word, and an integer for its frequency. But it's not easy to make this work. There may be better ways.
This is definitely the hardest homework for the semester. Treat it with respect. Take enough time in the Analysis and Design phases. When I did this exercise, all my estimates were out by a factor of two: it took twice as long and the code was twice as large as I thought. So I didn't spend enough time in the early phases and ended up finding lots of serious defects during the Test phase. Don't make the same mistake.
Extension tasks
Many writers in English use the apostrophe to abbreviate words like “I'm”, “it's”, “we're”, “can't”, and “wouldn't”. The way the program is specified above, it would split each of those into two words: “I” and “m”, “it” and “s” and so on. Modify the requirements so that a single apostrophe in the middle of a word isn't removed, but that quote marks before or after a word are removed, along with all other punctuation. So running the program on the sentence:
"I won't, won't! WON'T!!", shouted the baby.
would give the results:
3 won't 1 baby 1 i 1 shouted 1 the
[ANU] [DCS] [COMP2100/2500] [Description] [Schedule] [Lectures] [Labs] [Homework] [Assignments] [COMP2500] [Assessment] [PSP] [Java] [Reading] [Help]
Copyright © 2005, Ian Barnes, The Australian National University
Version 2005.1, Wednesday, 4 May 2005, 07:03:39 +1000
Feedback & Queries to
comp2100@cs.anu.edu.au