CECS Home | ANU Home | Search ANU
The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science
Printer Friendly Version of this Document

UniSAFE

COMP3420 - Assignment 2 - Due Friday 29th May 2009, 12:00 (noon)

This assignment is worth 20% of your total course mark. It will be marked out of 20 as indicated below.

Data Mining

Objectives

The objective of this second assignment is to apply the topics presented in the COMP3420 data mining lectures and labs/tutorials, by (1) using the Rattle data mining tool to do a small data exploration project; (2) doing step-by-step association rules mining on a small toy database; (3) calculating four distances using different distance measures, and (4) calculating classifier accuracy measures.

The estimated time we expect you to spend on this assignment is around 20 hours in total (1 hour per mark).

Submission

You will have to submit one file, named datamining.pdf that contains four parts:

  1. a report describing your data exploration project, as described in task 1 below,

  2. the detailed results of your association rule mining, based on a small toy database, that should follow exactly the example given in task 2 below,

  3. the answers of the distance calculation question given in task 3 below (including your workings), and

  4. the calculations and results of your classifier accuracy measures (based on your ANU university ID), that should follow exactly the example given in task 4 below.

Assignment submission will be electronic. Submit your assignment with the following command from a terminal window (console) in the CSIT computer labs:

submit comp3420 a2 datamining.pdf

Important:

  • Make sure you practise submitting your file well before the deadline.

  • It is possible and perfectly acceptable to submit updated versions of your file. Our system will collect them all and your most recent submission will be the one assessed.

  • You should check that your submission has been successful by clicking the Assessments Show button in StReaMS (http://cs.anu.edu.au/streams/). You should see a list of files together with the time and date of submission.

  • Make sure that your submitted file on the first page at the top contains your name, ANU university ID, and your tutorial group!

  • Your submitted assignment report must not be longer than eight (8) A4 pages (i.e. maximum 4 sheets of paper printed double-sided).

  • You have to use a font of at least size 12 points.

  • The file must be submitted by 12:00 (noon) on Friday 29th May 2009.

Extensions

Students will only be granted an extension on the submission deadline in exceptional circumstances. Work and sporting commitments are normally NOT sufficient grounds. If you think you have grounds for an extension, you should notify Peter Christen as soon as possible and provide written evidence in support of your case (for example a medical certificate). Peter will then decide whether to grant an extension and inform you as soon as practical.

Penalties

Penalties for late submissions are as follows:

How late less than 6 hours 6 to 24 hours 24 to 48 hours 48 to 72 hours 72 to 96 hours more than 96 hours
Penalty from 20 marks -0.5 -1 -2 -4 -8 -15 (forget it!)

 
Penalties for submission that are longer than eight (8) pages are as follows:

Number of pages   9     10     11     12     13   14 or more
Penalty from 20 marks -1 -2 -3 -4 -5 -6

Plagiarism

No group work is permitted for the assignment. We do encourage you to discuss your work in the labs, tutorials and lectures, but we expect you to do the assignment work by yourself.

You should read the chapter in the Department of Computer Science Student Handbook that discusses assessment (Chapter 6, pages 18-24), particularly the sections headed Misconduct in examinations (which also applies to assignments and other forms of assessment), and Collaborations versus misconduct in assignments.

If you do include material from some other documents (for example graphics, figures, tables or formulas extracted from a paper, a book, lecture slides or a Web site), then you clearly have to make attribution, for example by writing the name of the paper, book, etc., and where you got it from. You may include URL links to external documents.

If you have any questions about plagiarism please discuss them with Peter Christen.


Tasks

  1. Rattle data exploration project (6 marks)

    As a first task, you should conduct a data exploration project using the Rattle data mining tool as used in the labs. For this, you have to select a publicly available data set from the University of California Irvine (UCI) Machine Learning Repository.

    You will have to write and submit a report (maximum 4 pages long) which details the steps you have done in your data exploration project. Your report must contain the following information (please use clear section headers for each):

    1. The name of the data set from the UCI Machine Learning Repository you are exploring, as well as the URL within the UCI repository that describes this data set. For example, for the `Yeast' data set,the URL would be: http://archive.ics.uci.edu/ml/datasets/Yeast

    2. A description, in your own words, of the data set, including the number of records and attributes it contains, and details about these attributes (their names and types, for example if they are numerical, categorical, or ordinal attributes, etc.)

    3. A description of the data exploration steps you have done using the Rattle data mining tool, and what you found out about this data set. This should include the number of missing values, out of range values, distribution of values (mean, medians, modes, minimum and maximum, histograms, etc.). You should also include some graphs produced by Rattle to illustrate some of the data set attributes - for example attributes that have an unusual distribution of values - and describe these graphs.

    Marking:

    You will receive the following marks:

    • Up to two (2) marks for your description of the data set. Here we will be looking for completeness and correctness of your description (i.e. are you describing ALL attributes in the data set, and is your description of the attribute types CORRECT).

      Note: You will not get ANY marks if you simply copy some text from the data set description from the UCI Web site or from a .names file that describes a data set. In fact, this is plagiarism - please read the paragraph about this topic above!

    • Up to four (4) marks for your description of the exploration steps you describe in your assignment report. Here we will look at what you have done, how clearly you describe what you have done, how you present it (i.e. text and graphs), and how complete your data exploration is.


  2. Association rule mining (6 marks)

    In this second part, you have to use a small database made of six transactions to manually conduct association rules mining following the Apriori algorithm as presented in the lectures. You have to generate all large item-sets of lengths 2 and 3, and then generate the rules of length 3 for the first frequent item-set of length 3. You also have to calculate the support and confidence percentage values of the rules you generate. A detailed example of what is expected is given below.

    Conduct the following steps manually:

    1. Generate the toy database

      The database will contain six (6) transactions with items a, b, c, d, and e. The first four (4) transactions of this database are given here:

                TID | Item set
                ----------------
                 1  | a, b, c, e
                 2  | b, c, e
                 3  | c, d, e
                 4  | a, b, c, d
                
      Transactions number 5 and 6 you have to generated according to the last two digits of your ANU university ID. This is done as follows.

      The following table contains two item-sets each for the 10 possible digits.

      0 a, b, c a, b, d
      1 b, c, d, e a, c, e
      2 a, b, c a, b, c, e
      3 b, c, e a, c, d
      4 a, c, d a, b, d, e
      5 a, c, d a, b, c, e
      6 b, c, e a, b, d, e
      7 a, b, c, e a, b, e
      8 a, b, c, e a, c, e
      9 a, c, d, e] a, c, d

       
      Now take the last two digits of your ANU university ID. For example, lets assume your ANU ID is u1234567, then the last two digits are 6 and 7.

      Transaction number five in your database will correspond to the first item-set in the above table listed for digit 6 (this is item-set: b, c, e), and transaction number six in your database will correspond to the first item-set in the above table listed for digit 7 (this is item-set: a, b, c, e).

      Important: Only use the second item-set for a certain digit in the above table if both the last two digits in your ANU university ID are the same. For example, if your ANU ID is u4024122, then you should select item-set a, b, c (the first item-set for digit 2 in the above table) as your fifth transaction in your database, and item-set a, b, c, e (the second item-set for digit 2 in the above table) as your fifth transaction in your database.

      Lets make a complete example: Assuming your ANU university ID is u4024122, then your database would look like this:

                TID | Item set
                ----------------
                 1  | a, b, c, e
                 2  | b, c, e
                 3  | c, d, e
                 4  | a, b, c, d
                 5  | a, b, c
                 6  | a, b, c, e
                

      Now that you have your database, you should manually conduct association rules mining as follows.

    2. Find all the frequent item-sets of length 2 that have a count (support) of at least 2 transactions (i.e. minimum support = 2).

    3. Next, find all the frequent item-sets of length 3 that have a count (support) of at least 2 transactions (i.e. minimum support = 2).

    4. Now sort the frequent item-sets of length 3 alphabetically (according to the items they contain), and for the first frequent item-set of length 3, generate all rules of length 3 it contains. Then, for each of these rules calculate their support and confidence as percentage numbers between 1% and 100%, rounded to one digit behind the comma.

    What you have to write into your assignment 2 report (the datamining.pdf document) should follow the example given here (which is based on the example ANU university ID u4024122):

         PART 1: ANU student ID: u4024122
                 ------------------------
    
         A. Data set:  TID | Item set
                        ------------------
                         1  | a, b, c, e
                         2  | b, c, e
                         3  | c, d, e
                         4  | a, b, c, d
                         5  | a, b, c
                         6  | a, b, c, e
    
         B.  Large 2-item sets:  Item set  | Count
                                 -----------------
                                  a, b     |   4
                                  a, c     |   4
                                  a, e     |   2
                                  b, c     |   5
                                  b, e     |   3
                                  c, d     |   2
                                  c, e     |   4
    
         C.  Large 3-item sets:  Item set  | Count
                                 -----------------
                                   a, b, c |   4
                                   a, b, e |   2
                                   a, c, e |   2
                                   b, c, e |   3
    
         D.  Frequent rules of length 3 from first large 3-item sets:
    
                     Rule         | Support | Confidence
                     -----------------------------------
                     (a, b) -> c  |  66.7%  |   100.0%
                     (a, c) -> b  |  66.7%  |   100.0%
                     (b, c) -> a  |  66.7%  |    80.0%
         
    Note: Do not try to use any software to solve this simple association rules mining problem, do it manually, similar to the example given in the `Introduction to Association Rules' lecture.

    Marking:

    You will receive the following marks:

    • one (1) mark for correctly generating your data set (part A),
    • one (1) mark for correctly and completely generating the frequent item-sets of length 2 (part B),
    • one (1) mark for correctly and completely generating the frequent item-sets of length 3 (part C),
    • one (1) mark for correctly and completely generating the frequent rules of length 3 from the first large 3-item sets (part D),
    • one (1) mark for correctly calculating the support values of these rules (part D), and
    • one (1) mark for correctly calculating the confidence values of these rules (part D).


  3. Distance measures for clustering (4 marks)

    You have to calculate several distance measures for two records (data tuples) that contain numerical values. These two records will again be based on your ANU university ID and are calculated as follows.

    • Take your ANU university ID (the seven digit number of the form u1234567) and split it into three: (a) the first three digits, (b) next two digits, and (c) the last two digits.

      For example, 1234567 becomes: 123 / 45 / 67.

    • Now reverse your ANU university ID (u1234567 becomes 7654321) and repeat the splitting process.

      So, our example becomes: 765 / 43 / 21.

    • The two triplets you have just generated are the two records (data tuples). For our example: (123, 45, 67) and (765, 43, 21)

    Using the two records (data tuples) based on your ANU university ID (not the examples given above), you have to calculate the following four distance measures between these two data tuples:

    1. Euclidean distance
    2. Manhattan distance
    3. Minkowski distance using q = 3
    4. Canberra distance (formula not given in text book or lectures - you need to do a bit of research to find out how this distance is calculated!)

    You must write down your workings, i.e. provide details of how you calculated these distances. If you only provide the numerical results you will not receive any marks!

    For each correct answer you will receive one (1) mark. You will not receive any marks if you don't get the calculation of your two data tuples correct!


  4. Classifier accuracy measures (4 marks)

    In this last task you have to manually calculate several accuracy measures based on a confusion matrix which will be populated as follows. The numbers in the confusion matrix are assumed to come from a binary (2-class) supervised classifier.

    1. Take the seven digits of your ANU university ID.
    2. The number of true positives are the last four (4) digits of your ANU university ID.
    3. The number of false positives are the first three (3) digits of your ANU university ID.
    4. Now reverse your ANU university ID, i.e. the first digit becomes the last and the last becomes the first, etc. For example, an ID (u at beginning is removed) 1234567 will be reversed into 7654321.
    5. The number of true negatives are the last four (4) digits of your reversed ANU university ID.
    6. The number of false negatives are the first three (3) digits of your reversed ANU university ID.

    Once you have these four numbers (true positives, true negatives, false positives, and false negatives), you have to calculate and write into your assignment report (the datamining.pdf document) the following tasks:

    1. (1) Fill-in a confusion matrix similar to the example given below. You should also write down the 'Total' number (sum of the four entries in the confusion matrix) of training examples.
      (2) Calculate a second normalised confusion matrix, where the numbers are proportions of the total number or training examples. Round the numbers to three digits after the comma.
    2. Calculate the accuracy and error rate as percentages (rounded to two digits after the comma) based on the confusion matrix.
    3. Calculate the specificity, precision and recall as percentages (rounded to two digits after the comma) based on the confusion matrix.

    The output you have to write into your report should follow the example given here (which is based on the example ANU university ID u4024122):

         PART 3: ANU student ID: u4024122
                 ------------------------
    
         ANU university ID:          4024122 -> 402|4122
         ANU university ID reversed: 2214204 -> 221|4204
    
         True positives:  4122
         False positives:  402
         True negatives:  4204
         False negatives:  221
    
         A. Confusion matrix:
                            ---------------------------------------------
                            | Predicted Positives | Predicted Negatives |
                            |---------------------+---------------------|
            True Positives  |        4122         |         221         |
                            ---------------------------------------------
            True Negatives  |         402         |        4204         |
                            ---------------------------------------------  Total: 8949
         B. Normalised confusion matrix:
                            ---------------------------------------------
                            | Predicted Positives | Predicted Negatives |
                            |---------------------+---------------------|
            True Positives  |       0.461         |       0.025         |
                            ---------------------------------------------
            True Negatives  |       0.045         |       0.470         |
                            ---------------------------------------------
    
         C.  Accuracy =    93.04%
             Error rate =   6.96%
    
         D.  Specificity = 91.27%
             Precision =   91.11%
             Recall =      94.91%
         

    Marking:

    You will receive one (1) mark each for the:

    • correct confusion matrix and 'Total' number (part A),
    • correct normalised confusion matrix (part B),
    • correct accuracy and error rate percentage numbers (part C), and
    • correct specificity, precision and recall percentage numbers (part D).


The total length of the assignment report (the datamining.pdf document) you have to submit must not be longer than eight (8) A4 pages (i.e. 4 sheets of paper printed double-sided), including any graphics, plots, and references. Do not use fonts smaller than 12 points!


Last modified: 21/05/2009, 08:46