COMP8400 - Assignment 1 - Due week 9 (Wednesday 6 May, 5 pm)
This assignment is worth 15% of your total course mark. It will be
marked out of 15 as indicated below.
Data issues in data mining / Clustering
Objectives
The objective of this first assignment is to apply the topics learned in
the first part of the course by (1) writing a short essay (answering
several questions) in the area of data issues (like pre-processing and
quality); (2) answering a set of questions related to clustering; and (3)
either using a data mining tool to conduct a clustering project or
implementing, testing and evaluating a simple clustering technique.
We expect you to spend around 15 hours in total on this assignment
(1 hour per mark).
Submission
You will have to submit one file, named u1234567-ass1.pdf (please
replace u1234567 with your ANU university ID) that contains three
parts:
- An essay / report, answering the questions given in part 1 below
(maximum length 3 A4 pages).
- The answers to the theoretical clustering question (on distance
measures) given in part 2 below.
- A report that contains the details of your clustering project or
implementation as described in part 3 below (maximum length 5
A4 pages).
Important:
- Make sure that the first page of your submitted file contains your
name and ANU university ID!
- The total length of your submission must not exceed
eight (8) A4 pages.
- You have to use a font size of at least 12 points.
- You have to send your submission to
comp8400@cs.anu.edu.au.
- The file must be submitted by 5 pm on Wednesday 6 May.
Extensions
Students will only be granted an extension on the submission deadline in
exceptional circumstances. Work and sporting commitments are normally
NOT sufficient grounds. If you think you have grounds for an
extension, you should notify the course coordinator as soon as possible
and provide written evidence in support of your case (for example a medical
certificate). The course coordinator will then decide whether to grant
an extension and inform you as soon as practical.
Penalties
Penalties for late submissions are as follows:
| How late | less than 6 hours | 6 to 24 hours | 24 to 48 hours | 48 to 72 hours | 72 to 96 hours | more than 96 hours |
| Penalty from 15 marks | -0.5 | -1 | -2 | -4 | -8 | -15 (forget it!) |
Penalties for submissions that are longer than 8 pages are as follows:
| Number of pages | 9 | 10 | 11 | 12 or more |
| Penalty from 15 marks | -1 | -2 | -4 | -6 |
Plagiarism
No group work is permitted for the assignment. We do
encourage you to discuss your work in the labs and lectures,
but we expect you to do the assignment work by yourself.
You should read the chapter in the
Department of Computer Science Student Handbook that discusses
assessment (Chapter 6, pages 18-24), particularly the sections headed
Misconduct in examinations (which also applies to assignments
and other forms of assessment) and Collaborations versus
misconduct in assignments.
If you include material from other documents (for example
graphics, figures, tables or formulas extracted from a paper, a book,
lecture slides or a Web site), you must clearly attribute it,
for example by giving the name of the paper, book, etc., and where you
got it from. You may include URL links to external documents.
Tasks
- Data mining at a national institute of sports
(6 marks)
Many nations, including Australia, have national institutes to
foster sports and produce world-class athletes. In Australia, the
Australian Institute of
Sport, located in Canberra, takes on that role.
Increasingly, sports researchers use sophisticated data analysis
techniques to investigate how the performance of athletes
can be optimised further, and thereby increase their chances of
winning more gold medals, for example at the next Olympic games.
Imagine you are being hired as a new data mining specialist by
such a national institute of sports. Your first tasks are to
- investigate the data that is being collected by this
institute,
- build a data warehouse that is capable of holding all data
collected by the institute, and
- look into what data mining projects could be conducted
using this data.
For this first part of the assignment, you should write a report
addressing the following six questions. Your report should be
at most three (3) A4 pages long. You have to use a font size of at
least 12 points.
- Describe in more detail what kind of data such a national
institute of sports would collect (data types, sources,
formats, volume, etc.).
- How would you design a data warehouse for this national
institute of sports? What would the main dimensions be? And
what kind of data would be stored in this data warehouse?
- What kind of data pre-processing and data integration issues
will you have to consider when designing such a data
warehouse? What kind of external data do you think would be of
interest and should be integrated into such a data warehouse?
- What kind of questions do you think the sports researchers
and athletes of this national institute of sports would like to
have addressed by data mining?
What could the outcomes of such data mining projects be?
- What challenges will you most likely face as a data miner
at this national institute of sports?
- Are there any legal/regulatory requirements that might limit
the use of data mining at this national institute of sports?
You will receive up to one mark for each of the above six questions.
- Distance measures for clustering (2 marks)
You have to calculate several distance measures for two records
(data tuples) that contain numerical values. These two records
will be based on your ANU university ID and are calculated as
follows.
- Take your ANU university ID (the seven-digit number of the form
u1234567) and split it into three parts: (a) the first
three digits, (b) the next two digits, and (c) the last two digits.
For example, u1234567 becomes: 123 / 45 /
67.
- Now reverse your ANU university ID (u1234567 becomes
7654321) and repeat the splitting process.
So, our example becomes: 765 / 43 / 21.
- The two triplets you have just generated are the two records
(data tuples). For our example: (123, 45, 67) and
(765, 43, 21)
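The splitting steps above can be sketched in a few lines of Python. This is only an illustration using the example ID from the text; the function name `id_to_records` is hypothetical, and the sketch assumes that each part is read as a plain integer (so a part such as "05" would become 5).

```python
def id_to_records(uni_id: str):
    """Build the two records from an ID of the form 'u1234567'."""
    digits = uni_id.lstrip("uU")
    if len(digits) != 7 or not digits.isdigit():
        raise ValueError("expected 'u' followed by seven digits")

    def split(d: str):
        # (a) first three digits, (b) next two digits, (c) last two digits
        return int(d[:3]), int(d[3:5]), int(d[5:])

    # First record: the ID as given; second record: the reversed ID.
    return split(digits), split(digits[::-1])


print(id_to_records("u1234567"))  # ((123, 45, 67), (765, 43, 21))
```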
Using the two records (data tuples) based on your ANU university
ID (not the examples given above), you have to calculate the
following four distance measures between these two data tuples:
- Euclidean distance
- Manhattan distance
- Minkowski distance using q = 3
- Canberra distance (formula not given in textbook or
lectures)
You must write down your workings, i.e. provide details of how you
calculated these distances. If you only provide the numerical
results you will not receive any marks!
For each correct answer you will receive 1/2 mark. You will not receive
any marks if you don't get the calculation of your two data tuples
correct!
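To check hand calculations, the four measures can be sketched in Python as below. The sketch uses the example tuples from the text, not a real student ID. It assumes the usual definitions: Minkowski distance as the q-th root of the sum of |x_i - y_i|^q (Manhattan is q = 1, Euclidean is q = 2), and Canberra distance as the sum of |x_i - y_i| / (|x_i| + |y_i|). Counting a term with a zero denominator as 0 is a common convention and an assumption here. The function names are illustrative.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length tuples."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def canberra(x, y):
    """Canberra distance: sum of |a - b| / (|a| + |b|) over all attributes."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom != 0:  # assumption: a 0/0 term contributes 0
            total += abs(a - b) / denom
    return total


x, y = (123, 45, 67), (765, 43, 21)  # example tuples from the text

manhattan = minkowski(x, y, 1)   # |642| + |2| + |46|
euclidean = minkowski(x, y, 2)   # sqrt(642^2 + 2^2 + 46^2)
minkowski3 = minkowski(x, y, 3)  # cube root of 642^3 + 2^3 + 46^3
canberra_d = canberra(x, y)      # 642/888 + 2/88 + 46/88

for name, value in [("Manhattan", manhattan), ("Euclidean", euclidean),
                    ("Minkowski (q=3)", minkowski3), ("Canberra", canberra_d)]:
    print(f"{name}: {value:.4f}")
```

Code output of this kind does not replace the written workings that the task asks for.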
- Clustering (7 marks)
There are two options you can choose from (you only need to do
one of them!):
- Conduct a clustering project using Rattle (or another
data mining tool) on a publicly available data set of your
choice. You will have to write and submit a report (maximum
5 A4 pages long) which details the steps you have done in your
clustering project. Your report must contain the following
information (please use clear section headers for each):
- You must provide: (a) the name of the data set, (b) where
you got it from (the name of the data repository and the
URL to a page where the data set is described or from
where it is available), and (c) a description of the data
set that must include the number of records and attributes
(variables) it contains, and details about these attributes
(their names and types, for example numerical, categorical,
ordinal, etc.)
Note: You are not allowed to use any of the data sets
used in the COMP8400 labs.
- A description of the data exploration steps you have done
using the data mining tool, and what you found out about the
data quality of this data set, for example number of missing
values, out-of-range values, distribution of values (mean
values, minimum and maximum, histograms, etc.).
- A description of the data cleaning and transformation steps
you might have done (or not - in which case describe why
no transformation was needed).
- A description of attribute selection and/or feature construction
you have done, and the reasoning behind this.
- A description of the clustering approach taken, which has to
include (a) the clustering technique (or techniques) used,
(b) the reason why you chose this technique(s), and (c) the
description of the parameter values you have chosen (for
example, if you use k-means provide the number of
clusters k and distance measure you have chosen, and
why).
- A description of the clusters you have found in the data set,
and whether these clusters are sensible, i.e. correspond to what
you would have expected in such a data set.
- Finally, a general description of your clustering project,
including problems and difficulties encountered, things you
have learned, and steps you would possibly do differently the
next time.
Clarifications:
- Q1: I'm not sure what you mean by part A.5: A description of
attribute selection or feature construction you have done,
and the reasoning behind this.
A1: When selecting a data set, you will likely find that not
all of the attributes in this data set are suitable for
clustering, or some might not be useful for various reasons
(depending upon the data set). Additionally, you might
even decide to create new features (possibly outside of
Rattle), or impute some of the attributes in order
to improve the data for clustering.
So you should describe why you do (or do not) select some
of the attributes, why you do data imputation, or feature
construction on the data set that you are using.
- Q2: Could you tell me what data is suitable for the clustering
part in the project?
How many rows should be enough? How many attributes?
A2: I would imagine a suitable data set would contain several
hundred to several thousand records (rows), and also
several (up to maybe 10) attributes (of which you possibly
do not use all for the clustering).
- Program a simple clustering algorithm (like k-means,
PAM, or Farthest-First) in a programming language of
your choice (preferably Python, C, C++ or Java), and test and
evaluate it on a publicly available data set.
Your program must read a data set in comma separated values (CSV)
text format, and print the produced clusters (e.g. centroids) to
standard output.
You will have to write and submit a report (maximum 5 A4 pages long)
which contains the following:
- The program listing, which should be well structured and
nicely formatted, and which contains enough comments to
understand what the program is doing.
- A description of how the program works (i.e. what the
functions or blocks in your program are doing).
- A description of how you tested your program.
- The name of the data set and where you got it from (including
the name of the data repository and the URL to a page where the
data set is described or from where it is available).
- The output produced by your program on the data set you
used.
- A description of your observations when running your
program on this data set, for example the number of iterations
and total time it took, amount of memory used, etc. Please
also describe on what computer (operating system, amount of
memory, CPU type and speed, etc.) you developed and ran your
program.
The report you have to submit for this third part of the assignment
must not be longer than five (5) A4 pages, including any graphics,
plots and program code listings. Do not use fonts smaller than 12
points (except for the program listing, which can be smaller - but has to
be readable).
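As a rough illustration of the overall program shape for option B (read a CSV file, cluster, print centroids to standard output), a minimal k-means sketch in Python might look as follows. It is one possible starting point under several assumptions that are not part of the assignment text: the CSV file has no header row, every column is numeric with no missing values, the initial centroids are k randomly chosen records, and squared Euclidean distance is used. All function names are hypothetical.

```python
import csv
import random
import sys


def read_csv(path):
    """Read numeric rows from a CSV file (assumes no header, no missing values)."""
    with open(path, newline="") as f:
        return [[float(v) for v in row] for row in csv.reader(f) if row]


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means (Lloyd's algorithm) with max_iter >= 1.

    Returns (centroids, assignments, number of iterations run).
    """
    rng = random.Random(seed)
    # Assumption: initialise with k distinct records chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each record goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no record changed cluster: converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments, iteration


# Usage (hypothetical file name): python kmeans.py data.csv 3
if __name__ == "__main__" and len(sys.argv) == 3:
    data = read_csv(sys.argv[1])
    cents, assign, n_iter = kmeans(data, int(sys.argv[2]))
    print(f"Converged after {n_iter} iterations")
    for i, cent in enumerate(cents):
        size = assign.count(i)
        print(f"Cluster {i} ({size} records): {cent}")
```

Fixing the random seed keeps runs repeatable, which makes it easier to describe testing and observations in the report. A real submission would also need to handle headers, non-numeric attributes and missing values as appropriate for the chosen data set.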
Marking scheme
- For part 1: 6 marks in total, with 1 mark per question.
- For part 2: 2 marks in total, with 0.5 marks per correctly calculated
distance measure (including workings).
- For part 3 A (cluster data mining project): seven (7) marks in total,
with one (1) mark per question.
- For part 3 B (cluster algorithm programming): seven (7) marks in total,
with two (2) marks for question 1 (program listing), and one (1)
mark each for all other questions.
Last modified: 31/03/2009, 11:30