The Australian National University
ANU College of Engineering and Computer Science
Department of Computer Science

COMP8400 - Assignment 1 - Due week 9 (Wednesday 6 May, 5 pm)

This assignment is worth 15% of your total course mark. It will be marked out of 15 as indicated below.

Data issues in data mining / Clustering

Objectives

The objective of this first assignment is to apply the topics learned in the first part of the course by (1) writing a short essay (answering several questions) in the area of data issues (such as pre-processing and quality); (2) answering a set of questions related to clustering; and (3) either using a data mining tool to conduct a clustering project or implementing, testing and evaluating a simple clustering technique.

The estimated time we expect you to spend on this assignment is around 15 hours in total (1 hour per mark).

Submission

You will have to submit one file, named u1234567-ass1.pdf (please replace u1234567 with your ANU university ID) that contains three parts:

  1. An essay / report, answering the questions given in part 1 below (maximum length 3 A4 pages).

  2. Your answers to the theoretical clustering question (on distance measures) given in part 2 below.

  3. A report that contains the details of your clustering project or implementation as described in part 3 below (maximum length 5 A4 pages).

Important:

  • Make sure that the first page of your submitted file contains your name and ANU university ID!

  • Your submission must not be longer than eight (8) A4 pages in total.

  • You have to use a font size of at least 12 points.

  • You have to send your submission to comp8400@cs.anu.edu.au.

  • The file must be submitted by 5 pm on Wednesday 6 May.

Extensions

Students will only be granted an extension on the submission deadline in exceptional circumstances. Work and sporting commitments are normally NOT sufficient grounds. If you think you have grounds for an extension, you should notify the course coordinator as soon as possible and provide written evidence in support of your case (for example a medical certificate). The course coordinator will then decide whether to grant an extension and inform you as soon as practical.

Penalties

Penalties for late submissions are as follows:

How late                less than 6 hours   6 to 24 hours   24 to 48 hours   48 to 72 hours   72 to 96 hours   more than 96 hours
Penalty from 15 marks   -0.5                -1              -2               -4               -8               -15 (forget it!)

Penalties for submissions that are longer than 8 pages are as follows:

Number of pages         9     10    11    12 or more
Penalty from 15 marks   -1    -2    -4    -6

Plagiarism

No group work is permitted for the assignment. We do encourage you to discuss your work in the labs and lectures, but we expect you to do the assignment work by yourself.

You should read the chapter in the Department of Computer Science Student Handbook that discusses assessment (Chapter 6, pages 18-24), particularly the sections headed Misconduct in examinations (which also applies to assignments and other forms of assessment) and Collaborations versus misconduct in assignments.

If you include material from other documents (for example graphics, figures, tables or formulas extracted from a paper, a book, lecture slides or a Web site), you have to attribute it clearly, for example by giving the name of the paper, book, etc., and stating where you got it from. You may include URL links to external documents.


Tasks

  1. Data mining at a national institute of sports (6 marks)

    Many nations, including Australia, have national institutes to foster sports and produce world-class athletes. In Australia, the Australian Institute of Sport, located in Canberra, takes on that role.

    Increasingly, sports researchers use sophisticated data analysis techniques to investigate how the performance of athletes can be optimised further, and therefore how to increase their chances of winning more gold medals, for example at the next Olympic Games.

    Imagine you are being hired as a new data mining specialist by such a national institute of sports. Your first tasks are to

    1. investigate the data that is being collected by this institute,
    2. build a data warehouse that is capable of holding all data collected by the institute, and
    3. look into what data mining projects could be conducted using this data.

    For this first part of the assignment, you should write a report addressing the following six questions. Your report should be at most three (3) A4 pages long. You have to use a font size of at least 12 points.

    1. Describe in more detail what kind of data such a national institute of sports would collect (data types, sources, formats, volume, etc).

    2. How would you design a data warehouse for this national institute of sports? What would the main dimensions be? And what kind of data would be stored in this data warehouse?

    3. What kind of data pre-processing and data integration issues will you have to consider when designing such a data warehouse? What kind of external data do you think would be of interest and should be integrated into such a data warehouse?

    4. What kind of questions do you think the sports researchers and athletes of this national institute of sports would like to have addressed by data mining? What could the outcomes of such data mining projects be?

    5. What challenges will you most likely face as a data miner at this national institute of sports?

    6. Are there any legal/regulatory requirements that might limit the use of data mining at this national institute of sports?

    You will receive up to one mark for each of the above six questions.


  2. Distance measures for clustering (2 marks)

    You have to calculate several distance measures for two records (data tuples) that contain numerical values. These two records are based on your ANU university ID and are constructed as follows.

    • Take your ANU university ID (the seven-digit number of the form u1234567) and split it into three: (a) the first three digits, (b) the next two digits, and (c) the last two digits.

      For example, u1234567 becomes: 123 / 45 / 67.

    • Now reverse your ANU university ID (u1234567 becomes 7654321) and repeat the splitting process.

      So, our example becomes: 765 / 43 / 21.

    • The two triplets you have just generated are the two records (data tuples). For our example: (123, 45, 67) and (765, 43, 21)
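The construction above can be sketched in Python as follows. This is only an illustration of the splitting and reversing steps; the function name `id_to_records` is hypothetical, and the sketch assumes that a group with a leading zero (for example 05) is read as the plain number (5).

```python
def id_to_records(uni_id):
    """Build the two records (data tuples) from an ID of the form 'u1234567'."""
    digits = uni_id.lstrip("uU")
    if len(digits) != 7 or not digits.isdigit():
        raise ValueError("expected an ID of the form u1234567")

    def split(d):
        # (a) first three digits, (b) next two digits, (c) last two digits.
        # Assumption: leading zeros are dropped, so '05' becomes 5.
        return (int(d[:3]), int(d[3:5]), int(d[5:]))

    # The second record uses the same split on the reversed digit string.
    return split(digits), split(digits[::-1])


first, second = id_to_records("u1234567")
print(first, second)
```

For the example ID u1234567 this gives (123, 45, 67) and (765, 43, 21), matching the triplets stated above.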

    Using the two records (data tuples) based on your ANU university ID (not the examples given above), you have to calculate the following four distance measures between these two data tuples:

    1. Euclidean distance
    2. Manhattan distance
    3. Minkowski distance using q = 3
    4. Canberra distance (formula not given in textbook or lectures)

    You must write down your workings, i.e. provide details of how you calculated these distances. If you only provide the numerical results you will not receive any marks!

    For each correct answer you will receive 1/2 mark. You will not receive any marks if your two data tuples are not constructed correctly!
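A short Python sketch like the one below may help you check your hand calculations (it does not replace the written workings). The function names are illustrative. Euclidean and Manhattan distance are treated as Minkowski distance with q = 2 and q = 1. The Canberra formula used here is the commonly cited definition, the sum over attributes of |x_i - y_i| / (|x_i| + |y_i|), with a term counted as 0 when both values are 0; that zero-handling convention is an assumption, so verify the formula against your own source.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q (q = 1: Manhattan, q = 2: Euclidean)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def canberra(x, y):
    """Canberra distance: sum of |a - b| / (|a| + |b|) over all attributes."""
    total = 0.0
    for a, b in zip(x, y):
        denominator = abs(a) + abs(b)
        if denominator != 0:  # assumption: a 0/0 term contributes 0
            total += abs(a - b) / denominator
    return total


# Two made-up records, not derived from any university ID.
x, y = (0, 0, 0), (3, 4, 0)
print("Euclidean:", minkowski(x, y, 2))
print("Manhattan:", minkowski(x, y, 1))
print("Minkowski (q = 3):", minkowski(x, y, 3))
print("Canberra:", canberra(x, y))
```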


  3. Clustering (7 marks)

    There are two options to choose from (you only need to do one of them!):

    1. Conduct a clustering project using Rattle (or another data mining tool) on a publicly available data set of your choice. You will have to write and submit a report (maximum 5 A4 pages long) which details the steps you have taken in your clustering project. Your report must contain the following information (please use clear section headers for each):

      1. You must provide: (a) the name of the data set, (b) where you got it from (the name of the data repository and the URL to a page where the data set is described or from where it is available), and (c) a description of the data set that must include the number of records and attributes (variables) it contains, and details about these attributes (their names and types, for example numerical, categorical, ordinal, etc.)
        Note: You are not allowed to use any of the data sets used in the COMP8400 labs.

      2. A description of the data exploration steps you have done using the data mining tool, and what you found out about the data quality of this data set, for example number of missing values, out-of-range values, distribution of values (mean values, minimum and maximum, histograms, etc.).

      3. A description of the data cleaning and transformation steps you might have done (or not - in which case describe why no transformation was needed).

      4. A description of attribute selection and/or feature construction you have done, and the reasoning behind this.

      5. A description of the clustering approach taken, which has to include (a) the clustering technique (or techniques) used, (b) the reason why you chose this technique (or these techniques), and (c) the description of the parameter values you have chosen (for example, if you use k-means provide the number of clusters k and distance measure you have chosen, and why).

      6. A description of the clusters you have found in the data set, and whether these clusters are sensible, i.e. correspond to what you would have expected in such a data set.

      7. Finally, a general description of your clustering project, including problems and difficulties encountered, things you have learned, and steps you would possibly do differently the next time.

      Clarifications:

      • Q1: I'm not sure what you mean by part A.4: A description of attribute selection and/or feature construction you have done, and the reasoning behind this.

        A1: When selecting a data set, you will likely find that not all of the attributes in this data set are suitable for clustering, or some might not be useful for various reasons (depending upon the data set). Additionally, you might even decide to create new features (possibly outside of Rattle), or impute some of the attributes in order to improve the data for clustering.
        So you should describe why you do (or do not) select some of the attributes, why you do data imputation, or feature construction on the data set that you are using.

      • Q2: Could you tell me what data is suitable for the clustering part in the project?
        How many rows should be enough? How many attributes?

        A2: I would imagine a suitable data set would contain several hundred to several thousand records (rows), and also several (up to maybe 10) attributes (of which you possibly do not use all for the clustering).


    2. Program a simple clustering algorithm (like k-means, PAM, or Farthest-First) in a programming language of your choice (preferably Python, C, C++ or Java), and test and evaluate it on a publicly available data set.

      Your program must read a data set in comma separated values (CSV) text format, and print the produced clusters (e.g. centroids) to standard output.

      You will have to write and submit a report (maximum 5 A4 pages long) which contains the following:

      1. The program listing, which should be well structured and nicely formatted, and which contains enough comments to understand what the program is doing.

      2. A description of how the program works (i.e. what the functions or blocks in your program are doing).

      3. A description of how you tested your program.

      4. The name of the data set and where you got it from (including the name of the data repository and the URL to a page where the data set is described or from where it is available).

      5. The output produced by your program on the data set you used.

      6. A description of your observations when running your program on this data set, for example the number of iterations and total time it took, amount of memory used, etc. Please also describe on what computer (operating system, amount of memory, CPU type and speed, etc.) you developed and ran your program.

    The report you have to submit for this third part of the assignment must not be longer than five (5) A4 pages, including any graphics, plots and program code listings. Do not use fonts smaller than 12 points (except for the program listing, which can be smaller - but has to be readable).
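For option 2, the overall shape of such a program (read a CSV file, cluster, print the centroids to standard output) might look like the minimal k-means sketch below. It is an illustration under stated assumptions, not a model solution: it assumes the CSV file has one header row and only numerical columns, uses squared Euclidean distance, picks the initial centroids at random from the data, and keeps the old centroid when a cluster becomes empty. All function names are hypothetical.

```python
import csv
import random
import sys


def load_csv(path, has_header=True):
    """Read rows from a CSV file; assumes every column is numerical."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    if has_header:
        rows = rows[1:]
    return [tuple(float(value) for value in row) for row in rows if row]


def squared_distance(a, b):
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, clusters, iterations used)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    for iteration in range(1, max_iter + 1):
        # Assignment step: put each record into the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid becomes the mean of its cluster members.
        new_centroids = [
            tuple(sum(column) / len(members) for column in zip(*members))
            if members else centroids[i]  # empty cluster: keep the old centroid
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, clusters, iteration


def main(path, k):
    centroids, clusters, iterations = kmeans(load_csv(path), k)
    print("Iterations:", iterations)
    for centroid, members in zip(centroids, clusters):
        print(len(members), "records, centroid:", centroid)


# Runs only when called as: python kmeans.py data.csv 3
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], int(sys.argv[2]))
```

The file name `kmeans.py` and the command-line arguments are placeholders. A real submission would also need the testing and the run-time and memory observations asked for above, which this sketch does not cover.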


Marking scheme

  1. For part 1: 6 marks in total, with 1 mark per question.

  2. For part 2: 2 marks in total, with 0.5 marks per correctly calculated distance measure (including workings).

  3. For part 3 A (cluster data mining project): seven (7) marks in total, with one (1) mark per question.
  4. For part 3 B (cluster algorithm programming): seven (7) marks in total, with two (2) marks for question 1 (program listing), and one (1) mark each for all other questions.


Last modified: 31/03/2009, 11:30