|
|
COMP3420 - Assignment 2 - Due Friday 29th May 2009, 12:00 (noon)
This assignment is worth 20% of your total course mark. It will be
marked out of 20 as indicated below.
Data Mining
Objectives
The objective of this second assignment is to apply the topics presented
in the COMP3420 data mining lectures and labs/tutorials, by (1) using the
Rattle data mining tool to do a small data exploration project;
(2) doing step-by-step association rules mining on a small toy database;
(3) calculating four distances using different distance measures, and
(4) calculating classifier accuracy measures.
The estimated time we expect you to spend on this assignment is around
20 hours in total (1 hour per mark).
Submission
You will have to submit one file, named datamining.pdf that
contains four parts:
- a report describing your data exploration project, as described in
task 1 below,
- the detailed results of your association rule mining, based on a
small toy database, that should follow exactly the example given
in task 2 below,
- the answers of the distance calculation question given in task 3
below (including your workings), and
- the calculations and results of your classifier accuracy measures
(based on your ANU university ID), that should follow exactly the
example given in task 4 below.
Assignment submission will be electronic. Submit your assignment
with the following command from a terminal window (console) in the CSIT
computer labs:
submit comp3420 a2 datamining.pdf
Important:
- Make sure you practise submitting your file well before the
deadline.
- It is possible and perfectly acceptable to submit updated versions of
your file. Our system will collect them all and your most recent
submission will be the one assessed.
- You should check that your submission has been successful by
clicking the Assessments Show button in StReaMS
(http://cs.anu.edu.au/streams/). You should see a list of files
together with the time and date of submission.
- Make sure that your submitted file on the first page at the top
contains your name, ANU university ID, and your tutorial group!
- Your submitted assignment report must not be longer than eight (8)
A4 pages (i.e. maximum 4 sheets of paper printed double-sided).
- You have to use a font of at least size 12 points.
- The file must be submitted by 12:00 (noon) on Friday 29th May
2009.
Extensions
Students will only be granted an extension on the submission deadline in
exceptional circumstances. Work and sporting commitments are normally
NOT sufficient grounds. If you think you have grounds for an
extension, you should notify
Peter
Christen as soon as possible and provide written evidence in support
of your case (for example a medical certificate). Peter will then decide
whether to grant an extension and inform you as soon as practical.
Penalties
Penalties for late submissions are as follows:
| How late | less than 6 hours | 6 to 24 hours |
24 to 48 hours | 48 to 72 hours |
72 to 96 hours | more than 96 hours |
| Penalty from 20 marks |
-0.5 | -1 | -2 | -4 |
-8 | -15 (forget it!) |
Penalties for submission that are longer than eight (8) pages are as
follows:
| Number of pages | 9 | 10
| 11 | 12 |
13 | 14 or more |
| Penalty from 20 marks |
-1 | -2 | -3 | -4 |
-5 | -6 |
Plagiarism
No group work is permitted for the assignment. We do encourage
you to discuss your work in the labs, tutorials and lectures, but we expect
you to do the assignment work by yourself.
You should read the chapter in the
Department of Computer Science Student Handbook that discusses
assessment (Chapter 6, pages 18-24), particularly the sections headed
Misconduct in examinations (which also applies to assignments
and other forms of assessment), and Collaborations versus
misconduct in assignments.
If you do include material from some other documents (for example
graphics, figures, tables or formulas extracted from a paper, a book,
lecture slides or a Web site), then you clearly have to make attribution,
for example by writing the name of the paper, book, etc., and where you
got it from. You may include URL links to external documents.
If you have any questions about plagiarism please discuss them with
Peter
Christen.
Tasks
- Rattle data exploration project (6 marks)
As a first task, you should conduct a data exploration project using
the Rattle data mining tool as used in the labs. For
this, you have to select a publicly available data set from the
University of
California Irvine (UCI) Machine Learning Repository.
You will have to write and submit a report (maximum 4 pages long)
which details the steps you have done in your data exploration
project. Your report must contain the following information (please
use clear section headers for each):
- The name of the data set from the UCI Machine Learning Repository
you are exploring, as well as the URL within the UCI repository
that describes this data set. For example, for the `Yeast' data
set,the URL would be:
http://archive.ics.uci.edu/ml/datasets/Yeast
- A description, in your own words, of the data set,
including the number of records and attributes it contains, and
details about these attributes (their names and types, for
example if they are numerical, categorical, or ordinal
attributes, etc.)
- A description of the data exploration steps you have done
using the Rattle data mining tool, and what you found out
about this data set. This should include the number of missing
values, out of range values, distribution of values (mean, medians,
modes, minimum and maximum, histograms, etc.). You should also
include some graphs produced by Rattle to illustrate some
of the data set attributes - for example attributes that have
an unusual distribution of values - and describe these graphs.
Marking:
You will receive the following marks:
- Association rule mining (6 marks)
In this second part, you have to use a small database made of six
transactions to manually conduct association rules mining following
the Apriori algorithm as presented in the lectures. You have to
generate all large item-sets of lengths 2 and 3, and then generate the
rules of length 3 for the first frequent item-set of length 3. You
also have to calculate the support and confidence percentage values of
the rules you generate. A detailed example of what is expected is given
below.
Conduct the following steps manually:
- Generate the toy database
The database will contain six (6) transactions with items
a, b, c, d, and
e. The first four (4) transactions of this database
are given here:
TID | Item set
----------------
1 | a, b, c, e
2 | b, c, e
3 | c, d, e
4 | a, b, c, d
Transactions number 5 and 6 you have to generated according to
the last two digits of your ANU university ID. This is done as
follows.
The following table contains two item-sets each for the 10
possible digits.
| 0 |
a, b, c |
a, b, d |
| 1 |
b, c, d, e |
a, c, e |
| 2 |
a, b, c |
a, b, c, e |
| 3 |
b, c, e |
a, c, d |
| 4 |
a, c, d |
a, b, d, e |
| 5 |
a, c, d |
a, b, c, e |
| 6 |
b, c, e |
a, b, d, e |
| 7 |
a, b, c, e |
a, b, e |
| 8 |
a, b, c, e |
a, c, e |
| 9 |
a, c, d, e] |
a, c, d |
Now take the last two digits of your ANU university ID. For
example, lets assume your ANU ID is u1234567, then the
last two digits are 6 and 7.
Transaction number five in your database will correspond to the
first item-set in the above table listed for digit 6
(this is item-set: b, c, e), and transaction number six
in your database will correspond to the first item-set in
the above table listed for digit 7 (this is item-set:
a, b, c, e).
Important: Only use the second item-set for a certain
digit in the above table if both the last two digits in your ANU
university ID are the same. For example, if your ANU ID is
u4024122, then you should select item-set
a, b, c (the first item-set for digit 2 in the above
table) as your fifth transaction in your database, and
item-set a, b, c, e (the second item-set for digit
2 in the above table) as your fifth transaction in your
database.
Lets make a complete example: Assuming your ANU university ID is
u4024122, then your database would look like this:
TID | Item set
----------------
1 | a, b, c, e
2 | b, c, e
3 | c, d, e
4 | a, b, c, d
5 | a, b, c
6 | a, b, c, e
Now that you have your database, you should manually
conduct association rules mining as follows.
- Find all the frequent item-sets of length 2 that have a
count (support) of at least 2 transactions (i.e. minimum
support = 2).
- Next, find all the frequent item-sets of length 3 that
have a count (support) of at least 2 transactions (i.e.
minimum support = 2).
- Now sort the frequent item-sets of length 3 alphabetically
(according to the items they contain), and for the first
frequent item-set of length 3, generate all rules of length 3
it contains. Then, for each of these rules calculate their
support and confidence as percentage numbers
between 1% and 100%, rounded to one digit behind the comma.
What you have to write into your assignment 2 report (the
datamining.pdf document) should follow the example given here
(which is based on the example ANU university ID u4024122):
PART 1: ANU student ID: u4024122
------------------------
A. Data set: TID | Item set
------------------
1 | a, b, c, e
2 | b, c, e
3 | c, d, e
4 | a, b, c, d
5 | a, b, c
6 | a, b, c, e
B. Large 2-item sets: Item set | Count
-----------------
a, b | 4
a, c | 4
a, e | 2
b, c | 5
b, e | 3
c, d | 2
c, e | 4
C. Large 3-item sets: Item set | Count
-----------------
a, b, c | 4
a, b, e | 2
a, c, e | 2
b, c, e | 3
D. Frequent rules of length 3 from first large 3-item sets:
Rule | Support | Confidence
-----------------------------------
(a, b) -> c | 66.7% | 100.0%
(a, c) -> b | 66.7% | 100.0%
(b, c) -> a | 66.7% | 80.0%
Note: Do not try to use any software to solve this simple
association rules mining problem, do it manually, similar to the
example given in the `Introduction to Association Rules' lecture.
Marking:
You will receive the following marks:
- one (1) mark for correctly generating your data set (part A),
- one (1) mark for correctly and completely generating the
frequent item-sets of length 2 (part B),
- one (1) mark for correctly and completely generating the
frequent item-sets of length 3 (part C),
- one (1) mark for correctly and completely generating the
frequent rules of length 3 from the first large 3-item sets
(part D),
- one (1) mark for correctly calculating the support values of
these rules (part D), and
- one (1) mark for correctly calculating the confidence values of
these rules (part D).
- Distance measures for clustering (4 marks)
You have to calculate several distance measures for two records
(data tuples) that contain numerical values. These two records
will again be based on your ANU university ID and are calculated as
follows.
- Take your ANU university ID (the seven digit number of the form
u1234567) and split it into three: (a) the first
three digits, (b) next two digits, and (c) the last two digits.
For example, 1234567 becomes: 123 / 45 /
67.
- Now reverse your ANU university ID (u1234567 becomes
7654321) and repeat the splitting process.
So, our example becomes: 765 / 43 / 21.
- The two triplets you have just generated are the two records
(data tuples). For our example: (123, 45, 67) and
(765, 43, 21)
Using the two records (data tuples) based on your ANU university
ID (not the examples given above), you have to calculate the
following four distance measures between these two data tuples:
- Euclidean distance
- Manhattan distance
- Minkowski distance using q = 3
- Canberra distance (formula not given in text book or
lectures - you need to do a bit of research to find out how
this distance is calculated!)
You must write down your workings, i.e. provide details of how you
calculated these distances. If you only provide the numerical
results you will not receive any marks!
For each correct answer you will receive one (1) mark. You will not
receive any marks if you don't get the calculation of your two data
tuples correct!
- Classifier accuracy measures (4 marks)
In this last task you have to manually calculate several accuracy
measures based on a confusion matrix which will be populated
as follows. The numbers in the confusion matrix are assumed to come
from a binary (2-class) supervised classifier.
- Take the seven digits of your ANU university ID.
- The number of true positives are the last four (4)
digits of your ANU university ID.
- The number of false positives are the first three (3)
digits of your ANU university ID.
- Now reverse your ANU university ID, i.e. the first digit becomes
the last and the last becomes the first, etc. For example, an ID
(u at beginning is removed) 1234567 will be
reversed into 7654321.
- The number of true negatives are the last four (4)
digits of your reversed ANU university ID.
- The number of false negatives are the first three (3)
digits of your reversed ANU university ID.
Once you have these four numbers (true positives, true negatives,
false positives, and false negatives), you have to calculate and
write into your assignment report (the datamining.pdf document)
the following tasks:
- (1) Fill-in a confusion matrix similar to the example given below.
You should also write down the 'Total' number (sum of the
four entries in the confusion matrix) of training examples.
(2) Calculate a second normalised confusion matrix, where the
numbers are proportions of the total number or training examples.
Round the numbers to three digits after the comma.
- Calculate the accuracy and error rate as percentages (rounded to
two digits after the comma) based on the confusion matrix.
- Calculate the specificity, precision and recall as percentages
(rounded to two digits after the comma) based on the
confusion matrix.
The output you have to write into your report should follow the example
given here (which is based on the example ANU university ID
u4024122):
PART 3: ANU student ID: u4024122
------------------------
ANU university ID: 4024122 -> 402|4122
ANU university ID reversed: 2214204 -> 221|4204
True positives: 4122
False positives: 402
True negatives: 4204
False negatives: 221
A. Confusion matrix:
---------------------------------------------
| Predicted Positives | Predicted Negatives |
|---------------------+---------------------|
True Positives | 4122 | 221 |
---------------------------------------------
True Negatives | 402 | 4204 |
--------------------------------------------- Total: 8949
B. Normalised confusion matrix:
---------------------------------------------
| Predicted Positives | Predicted Negatives |
|---------------------+---------------------|
True Positives | 0.461 | 0.025 |
---------------------------------------------
True Negatives | 0.045 | 0.470 |
---------------------------------------------
C. Accuracy = 93.04%
Error rate = 6.96%
D. Specificity = 91.27%
Precision = 91.11%
Recall = 94.91%
Marking:
You will receive one (1) mark each for the:
- correct confusion matrix and 'Total' number (part A),
- correct normalised confusion matrix (part B),
- correct accuracy and error rate percentage numbers (part C),
and
- correct specificity, precision and recall percentage numbers
(part D).
The total length of the assignment report (the datamining.pdf document)
you have to submit must not be longer than eight (8) A4 pages
(i.e. 4 sheets of paper printed double-sided), including any graphics, plots,
and references. Do not use fonts smaller than 12 points!
Last modified: 21/05/2009, 08:46
|