|
|
COMP8400 - Assignment 2 - Due week 12 (Wednesday 27 May, 5 pm)
This assignment is worth 15% of your total course mark. It will be
marked out of 15 as indicated below.
Association rules mining / Classification
Objectives
The objectives of this second assignment are to apply the topics learned
in the middle part of the course by (1) conducting association rules
mining on a very small database; (2) calculating classifier accuracy
measures; and (3) by either carrying out a classification project in
Rattle evaluating three classifiers on a publicly available data
set, or by implementing, testing and evaluating a simple k-NN (nearest
neighbour) classifier.
The estimated time we expect you to spend on this assignment is
around 15 hours in total (1 hour per mark).
Submission
You will have to submit one file, named u1234567-ass2.pdf (please
replace u1234567 with your ANU university ID) that contains three
parts:
- the detailed results of your association rules mining, based on a small
database created using your ANU university ID, which should follow
exactly the example given in task 1 below,
- the calculations and results of your classifier accuracy measures,
again based on your ANU university ID, which should follow exactly
the example given in task 2 below, and
- a report that contains the details of your classification project,
or your k-NN classifier implementation, as described in task 3
below.
Important:
- Make sure that your submitted file on the first page at the top
contains your name and ANU university ID!
- The maximum total length of your submission must not be more than
eight (8) A4 pages (i.e. maximum four (4) sheets of paper printed
double-sided).
- You have to use a font of at least size 12 points.
- You have to send your submission to
comp8400@cs.anu.edu.au.
- The file must be submitted by 5 pm on Wednesday 27 May.
Extensions
Students will only be granted an extension on the submission deadline in
exceptional circumstances. Work and sporting commitments are normally
NOT sufficient grounds. If you think you have grounds for an
extension, you should notify the course coordinator as soon as possible
and provide written evidence in support of your case (for example a medical
certificate). The course coordinator will then decide whether to grant
an extension and inform you as soon as practical.
Penalties
Penalties for late submissions are as follows:
| How late | less than 6 hours | 6 to 24 hours |
24 to 48 hours | 48 to 72 hours |
72 to 96 hours | more than 96 hours |
| Penalty from 15 marks |
-0.5 | -1 | -2 | -4 |
-8 | -15 (forget it!) |
Penalties for submission that are longer than 8 pages are as follows:
| Number of pages | 9 | 10
| 11 | 12 or more |
| Penalty from 15 marks |
-1 | -2 | -4 | -6 |
Plagiarism
No group work is permitted for the assignment. We do
encourage you to discuss your work in the labs and lectures,
but we expect you to do the assignment work by yourself.
You should read the chapter in the
Department of Computer Science Student Handbook that discusses
assessment (Chapter 6, pages 18-24), particularly the sections headed
Misconduct in examinations (which also applies to assignments
and other forms of assessment) and Collaborations versus
misconduct in assignments.
If you do include material from some other documents (for example
graphics, figures, tables or formulas extracted from a paper, a book,
lecture slides or a Web site), then you clearly have to make attribution,
for example by writing the name of the paper, book, etc., and where you
got it from. You may include URL links to external documents.
Tasks
- Association rules mining (6 marks)
In this first task, you have to manually create a small
database with seven transactions as specified below, and then
manually find all frequent item-sets of lengths 2 and 3, and
frequent rules of length 3 for the first two of the length 3 frequent
item-sets, as well as their support, confidence and lift values.
The following table contains item-sets with five items a
to e. You have to create a database using your
7-digit ANU university ID as detailed below the table.
| Digit | First | Second | Third |
Fourth |
| 0 |
a, b, c |
a, b, d |
c, e |
a, c, e |
| 1 |
b, c, d, e |
a, c, d, e |
a, d, e |
b, d, e |
| 2 |
a, b, c |
b, c, d |
a, c, e |
a, b, c, e |
| 3 |
b, c, e |
a, c, d |
a, b, c |
a, c, d |
| 4 |
a, c, d |
b, c, d, e |
a, b, d, e |
a, b, e |
| 5 |
a, c, e |
a, b, c, e |
c, e |
a, b |
| 6 |
b, c, d, e |
a, b, d, e |
b, e |
a, b |
| 7 |
a, b, c, e |
a, b, e |
a, d, e |
a, c |
| 8 |
a, b, c, e |
c, e |
a, b, c |
a, c |
| 9 |
a, c, d, e |
a, c, d |
a, b, e |
a, b |
Conduct the following steps manually and write your results into
your assignment report as detailed in the example below.
- Create your database according to your ANU university ID.
For each of the seven digits of your ID, select the corresponding
item-set from the table. If your ID contains a digit twice, or even
three or four times, select the first, second, third or fourth
entry in a table row for the first, second, third or fourth
occurrence of a digit in your ANU university ID.
For example, for the ANU university ID u4024122, the
following seven item-sets (transactions) would be selected from
the table:
- (TID 1) 4 (first):
a, c, d
- (TID 2) 0 (first):
a, b, c
- (TID 3) 2 (first):
a, b, c
- (TID 4) 4 (second): b, c, d, e
- (TID 5) 1 (first):
b, c, d, e
- (TID 6) 2 (second): b, c, d
- (TID 7) 2 (third):
a, c, e
- Find all the frequent item-sets of length 2 that have a count
of at least 2 transactions (i.e. minimum support = 2).
- Find all the frequent item-sets of length 3 that have a count
of at least 2 transactions (i.e. minimum support = 2).
- Sort the frequent item-sets of length 3 alphabetically (according
to the items they contain), and for the first two of them
generate all rules of length 3 they contain. For each of these rules
calculate their support and confidence (as percentage
numbers between 1% and 100%, rounded to one digit after the comma),
and their lift (rounded to two digits after the comma).
The output you have to write into your report should follow the example
given here (which is based on the example ANU university ID
u4024122):
Task 1: ANU student ID: u4024122
------------------------
1. Data set: Digit | Item set
------------------
4 | a, c, d
0 | a, b, c
2 | a, b, c
4 | b, c, d, e
1 | b, c, d, e
2 | b, c, d
2 | a, c, e
2. Frequent 2 item-sets: Item set | Count
----------------
a, b | 2
a, c | 4
b, c | 5
b, d | 3
b, e | 2
c, d | 4
c, e | 3
d, 3 | 2
3. Frequent 3 item-sets: Item set | Count (sorted alphabetically)
-----------------
a, b, c | 2
b, c, d | 3
b, c, e | 2
b, d, e | 2
c, d, e | 2
4. Frequent rules of length 3 from first two frequent 3 item-sets:
Rule | Support | Confidence | Lift
---------------------------------------------
(a, b) -> c | 28.6% | 100.0% | 1.00
(a, c) -> b | 28.6% | 50.0% | 0.70
(b, c) -> a | 28.6% | 40.0% | 0.70
(b, c) -> d | 42.9% | 60.0% | 1.05
(b, d) -> c | 42.9% | 100.0% | 1.00
(c, d) -> b | 42.9% | 75.0% | 1.05
Marking:
You will receive one (1) mark each for the:
- correct data set you generate,
- correct and complete length 2 frequent item-sets,
- correct and complete length 3 frequent item-sets,
- correct rules generated and their correct support values,
- correct confidence values for these rules, and
- correct lift values for these rules.
Clarifications:
- Don't try to use software to solve this simple association rules
mining problem, do it manually, similar to the example given in
the 'Introduction to Association Rules' lecture (slide number 13).
- Classifier accuracy measures (3 marks)
In this second task you have to manually calculate several
accuracy measures based on a confusion matrix which will be
populated as follows. The numbers in the confusion matrix are assumed
to come from a binary (2-class) supervised classifier.
- Take the seven digits of your ANU university ID.
- The number of true positives are the first four (4)
digits of your ANU university ID.
- The number of false positives are the last three (3)
digits of your ANU university ID.
- The number of false negatives are the first three (3)
digits of your ANU university ID.
- The number of true negatives are the last four (4)
digits of your ANU university ID.
Once you have these four numbers (true positives, true negatives,
false positives, and false negatives), you have to complete (and write
into your report to be submitted) the following tasks:
- Fill-in a confusion matrix similar to the example given below.
You should also write down the 'Total' number (sum of the
four entries in the confusion matrix) of training examples.
- Calculate a second normalised confusion matrix, where the
numbers are proportions of the total number of training examples.
Round the numbers to three digits after the comma.
- Calculate the accuracy and error rate as percentages (rounded to
2 digits after the comma) based on the normalised confusion matrix.
- Calculate the specificity, precision and recall as percentages
(rounded to 2 digits after the comma) based on the normalised
confusion matrix.
The output you have to write into your report should follow the example
given here (which is based on the example ANU university ID
u4024122):
Task 2: ANU student ID: u4024122
------------------------
True positives: 4024
False positives: 122
False negatives: 402
True negatives: 4122
1. Confusion matrix:
-----------------------
| Pred Pos | Pred Neg |
|----------+----------|
True Pos | 4024 | 402 |
-----------------------
True Neg | 122 | 4122 |
----------------------- Total: 8670
2. Normalised confusion matrix:
-----------------------
| Pred Pos | Pred Neg |
|----------+----------|
True Pos | 0.464 | 0.046 |
-----------------------
True Neg | 0.014 | 0.475 |
-----------------------
3. Accuracy = 93.96%
Error rate = 6.04%
4. Specificity = 97.13%
Precision = 97.06%
Recall = 90.92%
Marking:
You will receive one (1) mark each for the:
- correct confusion matrix, 'Total' number and normalised
confusion matrix,
- correct accuracy and error rate percentage numbers, and
- correct specificity, precision and recall percentage numbers.
- Classification project / k-NN classifier implementation (6 marks)
There will be two options you can choose from (you only need to do
one of them!):
- This task is similar to the first option (option A) in task 3
of assignment 1, but this time you have
to conduct a classification project using Rattle, rather
than a clustering project.
You have to select a data set from the
UCI Machine Learning
repository with the following restrictions:
- You are not allowed to use any of the data sets that we
have used in the COMP8400 labs.
- You have to use a data set which has two classes
(suitable for binary (2-class) supervised learning).
- You have to use a different data set from what you have
used in the first assignment (we will check that!).
Once you have chosen a data set, conduct a classification project
using Rattle. You have to use and compare at least three
different classification techniques:
- decision tree induction, as used in lab 3,
- support vector machines, as used in lab 4, and
- one of the various other
two-class methods implemented in Rattle.
The third method might require you to read more in the text
book or the Rattle documentation, or search for relevant
information on the Web.
You have to write and submit a report (maximum 5 A4 pages
long) on your classification project that contains the following
information (please put them into clearly indicated sections):
- An executive summary, maximum 10 lines of text long,
which you would give to your supervisor, boss or manager. This
summary should contain a short description of the data set you
have chosen, and the main results/outcomes of your
classification project.
Write this summary without any data mining specific
language, in such a way that a non-specialist, even a
non-computer scientist or non-IT person, could understand
it.
- The name of the data set and where you got it from (URL of
where the data set is available), and a description of the
data set, including the number of records and number of
attributes it contains, and details about these attributes
(their names and types, e.g. numerical, categorical, ordinal,
etc.).
- A description of the data exploration steps you have done
using Rattle, and what you found out about the quality
of this data set, such as number of missing values, out of
range values, distribution of values (minimum, maximum,
mean and median values for numerical attributes, number of
categories for categorical attributes, histogram plots, box
plots, correlation plots, etc).
- A description of the data cleaning and transformation steps
you have done (or not - in which case describe why no
transformation was needed), and a description of attribute
selection and/or feature construction you have done, and the
reasoning behind this.
- A description of the classification approaches you have done
using the three classifiers. This should include a description
of the parameter values chosen for a technique, and why you
choose these values. This section should contain one paragraph
for each of the three classifiers you have used.
Note: You should aim to achieve a classification accuracy as
high as possible for each of the three classifiers.
- A presentation of the classification results for each of the
three classifiers you have used. This should include the
confusion matrix (both with raw numbers and normalised
proportions, similar to task 2 above) for the best
classification result you achieved for each of the three
classifiers, and corresponding appropriate visualisations
(using one of the evaluation methods from Rattle, such
as ROC curve or precision/recall curve). You should also
describe your findings and compare the results you got from the
three classifiers, and describe the models produced by the
classifiers (such as decision trees or rules generated, etc.).
- Finally, a general description of your classification project,
including problems and difficulties encountered, things you
have learned, and steps you would possibly do differently the
next time.
Marking:
You will receive the following marks for your classification report:
- one mark (1) for your executive summary,
- half a mark (0.5) for your data set description,
- one mark (1) for your description of the data exploration
step you have done,
- one mark (1) for your description of the data cleaning
and transformation you have performed,
- one mark (1) for your description of your classification
approaches, including classifiers chosen, their parameters,
etc.,
- one mark (1) for your presentation of the classifier
results, and
- half a mark (0.5) for your general description of your
classification project.
- Implement a simple k-NN (nearest neighbour) classifier in a
programming language of your choice (preferably Python, C, C++
or Java).
Your program must read a text file (an example is given below) which
contains the following information:
- Line 1: k (the number of nearest neighbours to consider),
- Line 2: dim (the dimensionality of the input records),
- Line 3: dist (the distance function to be used - possible
values can be: `euc' for Euclidean distance, 'man' for
Manhattan distance, and 'can' for Canberra distance),
- Line 3: num_train_rec (the number of training records
to follow),
- Following lines: the actual num_train_rec training
records, which each have to consist of (dim+1) comma
separated numbers, with the last number being the class (`0'
or `1'),
- Next line: num_test_rec (the number of test records to
follow), and
- Following lines: the actual num_test_rec test records,
each consisting of dim comma separated numbers).
Here is a simple example input text file:
3
2
euc
4
0, 0, 0
0, 1, 0
3, 5, 1
5, 6, 1
3
1, 1
6, 7
2, 3
It contains the following values: k=3, dimension=2,
distance measure used is Euclidean, number of training
records is four, and the number of test records is three.
When your k-NN classifier program is run using the above example
input file, it should produce an output as follows:
k = 3
dim = 2
Distance: euc
Number of training records = 4
Number of test records = 3
Test record [1.0, 1.0] classified as "0"
Test record [6.0, 7.0] classified as "1"
Test record [2.0, 3.0] classified as "0"
You will have to write and submit a report (maximum 5 A4 pages long)
which contains the following:
- The program listing, which should be well structured and
nicely formatted, and which has to contain enough comments
to understand what the program is doing. Specifically, there
have to be three functions/methods for the three distance
measures you need to implement (Euclidean, Manhattan, and
Canberra distance), and a function/method that performs the
k-NN classification for one test record.
- A description of how the program works (i.e. what the
functions and/or methods in you program are doing).
- A description of how you tested your program (i.e. how you
assured that the program works correctly).
- The output produced by your program on the following three
test input files:
Marking:
You will receive the following marks for your k-NN classifier
implementation:
- two marks (2) for the program listing, its readability,
structure, commenting, etc.,
- half a mark (0.5) for your description of how your
program works,
- half a mark (0.5) for your description of how you tested your
program,
- one mark (1) each for correctly classifying all the test
records in the three test input files.
The report you have to submit for this third task of the assignment
must not be longer than five (5) A4 pages, including any graphics,
plots and program code listings. Do not use fonts smaller than 12
points (except for the program listing, which can be smaller - but has
to be readable).
The total length of the assignment report you have to submit must not be
longer than eight (8) A4 pages (i.e. four (4) sheets of paper printed
double-sided), including any graphics, plots, and references.
Do not use fonts smaller than 12 points!
Last modified: 21/05/2009, 08:46
|