CECS Home | ANU Home | Search ANU
The Australian National University
ANU College of Engineering and Computer Science
Department of Computer Science
Printer Friendly Version of this Document

UniSAFE

COMP8400 - Assignment 2 - Due week 12 (Wednesday 27 May, 5 pm)

This assignment is worth 15% of your total course mark. It will be marked out of 15 as indicated below.

Association rules mining / Classification

Objectives

The objectives of this second assignment are to apply the topics learned in the middle part of the course by (1) conducting association rules mining on a very small database; (2) calculating classifier accuracy measures; and (3) by either carrying out a classification project in Rattle evaluating three classifiers on a publicly available data set, or by implementing, testing and evaluating a simple k-NN (nearest neighbour) classifier.

The estimated time we expect you to spend on this assignment is around 15 hours in total (1 hour per mark).

Submission

You will have to submit one file, named u1234567-ass2.pdf (please replace u1234567 with your ANU university ID) that contains three parts:

  1. the detailed results of your association rules mining, based on a small database created using your ANU university ID, which should follow exactly the example given in task 1 below,

  2. the calculations and results of your classifier accuracy measures, again based on your ANU university ID, which should follow exactly the example given in task 2 below, and

  3. a report that contains the details of your classification project, or your k-NN classifier implementation, as described in task 3 below.

Important:

  • Make sure that your submitted file on the first page at the top contains your name and ANU university ID!

  • The maximum total length of your submission must not be more than eight (8) A4 pages (i.e. maximum four (4) sheets of paper printed double-sided).

  • You have to use a font of at least size 12 points.

  • You have to send your submission to comp8400@cs.anu.edu.au.

  • The file must be submitted by 5 pm on Wednesday 27 May.

Extensions

Students will only be granted an extension on the submission deadline in exceptional circumstances. Work and sporting commitments are normally NOT sufficient grounds. If you think you have grounds for an extension, you should notify the course coordinator as soon as possible and provide written evidence in support of your case (for example a medical certificate). The course coordinator will then decide whether to grant an extension and inform you as soon as practical.

Penalties

Penalties for late submissions are as follows:

How late less than 6 hours 6 to 24 hours 24 to 48 hours 48 to 72 hours 72 to 96 hours more than 96 hours
Penalty from 15 marks -0.5 -1 -2 -4 -8 -15 (forget it!)

Penalties for submission that are longer than 8 pages are as follows:

Number of pages   9     10     11     12 or more
Penalty from 15 marks -1 -2 -4 -6

Plagiarism

No group work is permitted for the assignment. We do encourage you to discuss your work in the labs and lectures, but we expect you to do the assignment work by yourself.

You should read the chapter in the Department of Computer Science Student Handbook that discusses assessment (Chapter 6, pages 18-24), particularly the sections headed Misconduct in examinations (which also applies to assignments and other forms of assessment) and Collaborations versus misconduct in assignments.

If you do include material from some other documents (for example graphics, figures, tables or formulas extracted from a paper, a book, lecture slides or a Web site), then you clearly have to make attribution, for example by writing the name of the paper, book, etc., and where you got it from. You may include URL links to external documents.


Tasks

  1. Association rules mining (6 marks)

    In this first task, you have to manually create a small database with seven transactions as specified below, and then manually find all frequent item-sets of lengths 2 and 3, and frequent rules of length 3 for the first two of the length 3 frequent item-sets, as well as their support, confidence and lift values.

    The following table contains item-sets with five items a to e. You have to create a database using your 7-digit ANU university ID as detailed below the table.

    Digit First Second Third Fourth
    0 a, b, c a, b, d c, e a, c, e
    1 b, c, d, e a, c, d, e a, d, e b, d, e
    2 a, b, c b, c, d a, c, e a, b, c, e
    3 b, c, e a, c, d a, b, c a, c, d
    4 a, c, d b, c, d, e a, b, d, e a, b, e
    5 a, c, e a, b, c, e c, e a, b
    6 b, c, d, e a, b, d, e b, e a, b
    7 a, b, c, e a, b, e a, d, e a, c
    8 a, b, c, e c, e a, b, c a, c
    9 a, c, d, e a, c, d a, b, e a, b

     
    Conduct the following steps manually and write your results into your assignment report as detailed in the example below.

    1. Create your database according to your ANU university ID. For each of the seven digits of your ID, select the corresponding item-set from the table. If your ID contains a digit twice, or even three or four times, select the first, second, third or fourth entry in a table row for the first, second, third or fourth occurrence of a digit in your ANU university ID.

      For example, for the ANU university ID u4024122, the following seven item-sets (transactions) would be selected from the table:

      1. (TID 1)   4 (first):     a, c, d
      2. (TID 2)   0 (first):     a, b, c
      3. (TID 3)   2 (first):     a, b, c
      4. (TID 4)   4 (second): b, c, d, e
      5. (TID 5)   1 (first):     b, c, d, e
      6. (TID 6)   2 (second): b, c, d
      7. (TID 7)   2 (third):    a, c, e

    2. Find all the frequent item-sets of length 2 that have a count of at least 2 transactions (i.e. minimum support = 2).

    3. Find all the frequent item-sets of length 3 that have a count of at least 2 transactions (i.e. minimum support = 2).

    4. Sort the frequent item-sets of length 3 alphabetically (according to the items they contain), and for the first two of them generate all rules of length 3 they contain. For each of these rules calculate their support and confidence (as percentage numbers between 1% and 100%, rounded to one digit after the comma), and their lift (rounded to two digits after the comma).

    The output you have to write into your report should follow the example given here (which is based on the example ANU university ID u4024122):

         Task 1: ANU student ID: u4024122
                 ------------------------
    
         1.  Data set:  Digit | Item set
                        ------------------
                          4   | a, c, d
                          0   | a, b, c
                          2   | a, b, c
                          4   | b, c, d, e
                          1   | b, c, d, e
                          2   | b, c, d
                          2   | a, c, e
    
         2.  Frequent 2 item-sets:  Item set | Count
                                    ----------------
                                      a, b   |   2
                                      a, c   |   4
                                      b, c   |   5
                                      b, d   |   3
                                      b, e   |   2
                                      c, d   |   4
                                      c, e   |   3
                                      d, 3   |   2
    
         3.  Frequent 3 item-sets:  Item set  | Count   (sorted alphabetically)
                                    -----------------
                                     a, b, c  |   2
                                     b, c, d  |   3
                                     b, c, e  |   2
                                     b, d, e  |   2
                                     c, d, e  |   2
    
         4.  Frequent rules of length 3 from first two frequent 3 item-sets:
    
                          Rule          | Support  | Confidence |  Lift
                          ---------------------------------------------
                           (a, b) -> c  |   28.6%  |    100.0%  |  1.00
                           (a, c) -> b  |   28.6%  |     50.0%  |  0.70
                           (b, c) -> a  |   28.6%  |     40.0%  |  0.70
                           (b, c) -> d  |   42.9%  |     60.0%  |  1.05
                           (b, d) -> c  |   42.9%  |    100.0%  |  1.00
                           (c, d) -> b  |   42.9%  |     75.0%  |  1.05
         

    Marking:

    You will receive one (1) mark each for the:

    • correct data set you generate,
    • correct and complete length 2 frequent item-sets,
    • correct and complete length 3 frequent item-sets,
    • correct rules generated and their correct support values,
    • correct confidence values for these rules, and
    • correct lift values for these rules.

    Clarifications:

    • Don't try to use software to solve this simple association rules mining problem, do it manually, similar to the example given in the 'Introduction to Association Rules' lecture (slide number 13).


  2. Classifier accuracy measures (3 marks)

    In this second task you have to manually calculate several accuracy measures based on a confusion matrix which will be populated as follows. The numbers in the confusion matrix are assumed to come from a binary (2-class) supervised classifier.

    1. Take the seven digits of your ANU university ID.

    2. The number of true positives are the first four (4) digits of your ANU university ID.

    3. The number of false positives are the last three (3) digits of your ANU university ID.

    4. The number of false negatives are the first three (3) digits of your ANU university ID.

    5. The number of true negatives are the last four (4) digits of your ANU university ID.

    Once you have these four numbers (true positives, true negatives, false positives, and false negatives), you have to complete (and write into your report to be submitted) the following tasks:

    1. Fill-in a confusion matrix similar to the example given below. You should also write down the 'Total' number (sum of the four entries in the confusion matrix) of training examples.

    2. Calculate a second normalised confusion matrix, where the numbers are proportions of the total number of training examples. Round the numbers to three digits after the comma.

    3. Calculate the accuracy and error rate as percentages (rounded to 2 digits after the comma) based on the normalised confusion matrix.

    4. Calculate the specificity, precision and recall as percentages (rounded to 2 digits after the comma) based on the normalised confusion matrix.

    The output you have to write into your report should follow the example given here (which is based on the example ANU university ID u4024122):

         Task 2: ANU student ID: u4024122
                 ------------------------
    
         True positives:  4024
         False positives:  122
         False negatives:  402
         True negatives:  4122
    
         1.  Confusion matrix:
                      -----------------------
                      | Pred Pos | Pred Neg |
                      |----------+----------|
             True Pos |    4024  |     402  |
                      -----------------------
             True Neg |     122  |    4122  |
                      -----------------------  Total: 8670
    
         2.  Normalised confusion matrix:
                      -----------------------
                      | Pred Pos | Pred Neg |
                      |----------+----------|
             True Pos |   0.464  |   0.046  |
                      -----------------------
             True Neg |   0.014  |   0.475  |
                      -----------------------
    
         3.  Accuracy =    93.96%
             Error rate =   6.04%
    
         4.  Specificity = 97.13%
             Precision =   97.06%
             Recall =      90.92%
         

    Marking:

    You will receive one (1) mark each for the:

    • correct confusion matrix, 'Total' number and normalised confusion matrix,
    • correct accuracy and error rate percentage numbers, and
    • correct specificity, precision and recall percentage numbers.


  3. Classification project / k-NN classifier implementation (6 marks)

    There will be two options you can choose from (you only need to do one of them!):

    1. This task is similar to the first option (option A) in task 3 of assignment 1, but this time you have to conduct a classification project using Rattle, rather than a clustering project.

      You have to select a data set from the UCI Machine Learning repository with the following restrictions:

      • You are not allowed to use any of the data sets that we have used in the COMP8400 labs.

      • You have to use a data set which has two classes (suitable for binary (2-class) supervised learning).

      • You have to use a different data set from what you have used in the first assignment (we will check that!).

      Once you have chosen a data set, conduct a classification project using Rattle. You have to use and compare at least three different classification techniques:

      1. decision tree induction, as used in lab 3,
      2. support vector machines, as used in lab 4, and
      3. one of the various other two-class methods implemented in Rattle.

      The third method might require you to read more in the text book or the Rattle documentation, or search for relevant information on the Web.

      You have to write and submit a report (maximum 5 A4 pages long) on your classification project that contains the following information (please put them into clearly indicated sections):

      1. An executive summary, maximum 10 lines of text long, which you would give to your supervisor, boss or manager. This summary should contain a short description of the data set you have chosen, and the main results/outcomes of your classification project.

        Write this summary without any data mining specific language, in such a way that a non-specialist, even a non-computer scientist or non-IT person, could understand it.

      2. The name of the data set and where you got it from (URL of where the data set is available), and a description of the data set, including the number of records and number of attributes it contains, and details about these attributes (their names and types, e.g. numerical, categorical, ordinal, etc.).

      3. A description of the data exploration steps you have done using Rattle, and what you found out about the quality of this data set, such as number of missing values, out of range values, distribution of values (minimum, maximum, mean and median values for numerical attributes, number of categories for categorical attributes, histogram plots, box plots, correlation plots, etc).

      4. A description of the data cleaning and transformation steps you have done (or not - in which case describe why no transformation was needed), and a description of attribute selection and/or feature construction you have done, and the reasoning behind this.

      5. A description of the classification approaches you have done using the three classifiers. This should include a description of the parameter values chosen for a technique, and why you choose these values. This section should contain one paragraph for each of the three classifiers you have used.

        Note: You should aim to achieve a classification accuracy as high as possible for each of the three classifiers.

      6. A presentation of the classification results for each of the three classifiers you have used. This should include the confusion matrix (both with raw numbers and normalised proportions, similar to task 2 above) for the best classification result you achieved for each of the three classifiers, and corresponding appropriate visualisations (using one of the evaluation methods from Rattle, such as ROC curve or precision/recall curve). You should also describe your findings and compare the results you got from the three classifiers, and describe the models produced by the classifiers (such as decision trees or rules generated, etc.).

      7. Finally, a general description of your classification project, including problems and difficulties encountered, things you have learned, and steps you would possibly do differently the next time.

      Marking:

      You will receive the following marks for your classification report:

      • one mark (1) for your executive summary,
      • half a mark (0.5) for your data set description,
      • one mark (1) for your description of the data exploration step you have done,
      • one mark (1) for your description of the data cleaning and transformation you have performed,
      • one mark (1) for your description of your classification approaches, including classifiers chosen, their parameters, etc.,
      • one mark (1) for your presentation of the classifier results, and
      • half a mark (0.5) for your general description of your classification project.


    2. Implement a simple k-NN (nearest neighbour) classifier in a programming language of your choice (preferably Python, C, C++ or Java).

      Your program must read a text file (an example is given below) which contains the following information:

      1. Line 1: k (the number of nearest neighbours to consider),
      2. Line 2: dim (the dimensionality of the input records),
      3. Line 3: dist (the distance function to be used - possible values can be: `euc' for Euclidean distance, 'man' for Manhattan distance, and 'can' for Canberra distance),
      4. Line 3: num_train_rec (the number of training records to follow),
      5. Following lines: the actual num_train_rec training records, which each have to consist of (dim+1) comma separated numbers, with the last number being the class (`0' or `1'),
      6. Next line: num_test_rec (the number of test records to follow), and
      7. Following lines: the actual num_test_rec test records, each consisting of dim comma separated numbers).

      Here is a simple example input text file:

                3
                2
                euc
                4
                0, 0, 0
                0, 1, 0
                3, 5, 1
                5, 6, 1
                3
                1, 1
                6, 7
                2, 3
                
      It contains the following values: k=3, dimension=2, distance measure used is Euclidean, number of training records is four, and the number of test records is three.

      When your k-NN classifier program is run using the above example input file, it should produce an output as follows:

                k = 3
                dim = 2
                Distance: euc
      
                Number of training records = 4
                Number of test records = 3
      
                Test record [1.0, 1.0] classified as "0"
                Test record [6.0, 7.0] classified as "1"
                Test record [2.0, 3.0] classified as "0"
                

      You will have to write and submit a report (maximum 5 A4 pages long) which contains the following:

      1. The program listing, which should be well structured and nicely formatted, and which has to contain enough comments to understand what the program is doing. Specifically, there have to be three functions/methods for the three distance measures you need to implement (Euclidean, Manhattan, and Canberra distance), and a function/method that performs the k-NN classification for one test record.

      2. A description of how the program works (i.e. what the functions and/or methods in you program are doing).

      3. A description of how you tested your program (i.e. how you assured that the program works correctly).

      4. The output produced by your program on the following three test input files:

      Marking:

      You will receive the following marks for your k-NN classifier implementation:

      • two marks (2) for the program listing, its readability, structure, commenting, etc.,
      • half a mark (0.5) for your description of how your program works,
      • half a mark (0.5) for your description of how you tested your program,
      • one mark (1) each for correctly classifying all the test records in the three test input files.

    The report you have to submit for this third task of the assignment must not be longer than five (5) A4 pages, including any graphics, plots and program code listings. Do not use fonts smaller than 12 points (except for the program listing, which can be smaller - but has to be readable).


The total length of the assignment report you have to submit must not be longer than eight (8) A4 pages (i.e. four (4) sheets of paper printed double-sided), including any graphics, plots, and references. Do not use fonts smaller than 12 points!


Last modified: 21/05/2009, 08:46