COMP3420 - Tutorial 5 - Week 11 (18-22 May)
Association rules and clustering in Rattle (unsupervised learning)
Objectives
The objectives of this second Rattle lab are to experiment with
the association rule and clustering algorithms available in Rattle,
in order to better understand the issues involved with these data mining
techniques, and to become more familiar with the Rattle tool.
Preliminaries
If you haven't done so yet, I suggest you create a tutorial5
directory (or folder) within your comp3420 directory.
Now copy the simple example data set
audit.csv into your tutorial5
directory. It contains artificial data created by Graham Williams
(developer of Rattle) and is supplied with Rattle. To
quote from the Rattle documentation: "It consists of 2,000
fictional clients who have been audited, perhaps for compliance with
regard to the amount of a tax refund that is being claimed. For each
case an outcome is recorded (whether the taxpayer's claims had to be
adjusted or not) and any amount of adjustment that resulted is also
recorded."
Association mining in Rattle is based on the extension package
arules (one of the many extension packages available for
R). For more information about arules please have a
look at the
arules package information page.
Note: The Rattle version currently installed in the CSIT
computer labs seems to have some bugs that affect the way association
mining can be conducted. Specifically, the arules package
brings up error messages when variables that contain missing values are used,
and also for several other, currently unknown, reasons. I (Peter Christen)
have been in contact with Graham Williams (Rattle developer)
trying to fix this problem. We will keep you informed about this issue.
Our apologies if such error messages appear during the labs.
You should also read the sections on
unsupervised modelling in the online
Rattle documentation.
Tasks
- Start Rattle as described in the
tutorial 4 lab sheet.
- Load the CSV data set audit.csv (make sure the
CSV File radio button is selected). Click on
Execute to load the data set.
- For association mining, we do want to use all records in this
data set. Therefore, un-tick the Sample box.
- Now set the role of all variables that contain missing values to
Ignore. Also set the role of the variable ID to
Ignore - think about why we don't want to have the identifier
variable in our association mining experiments. Make sure you
click on Execute to confirm your variable settings.
- Now explore this data set on the Explore tab, similar to
what you have done in the previous
tutorial.
- Once you have a good understanding of the content of the
audit.csv data set, go to the Associate tab.
- On the Associate tab, make sure Baskets is not
selected. Click on Freq Plot and inspect the graph
that is shown.
What does this graph tell you, for example, about the
distribution of records with Female and Male gender,
or the distribution of the Marital values? Why are
Marital=Widowed or Marital=Separated, for example,
not shown?
Now change the Support value to a lower or a higher value,
and re-generate the Freq Plot. What do you see?
- Change the Support back to its original value of 0.1
and start the association mining algorithm by clicking on the
Execute button.
Inspect the output generated. How many rules were generated? Where
is most of the time spent?
- If you click on Show Rules, all the generated rules will be
displayed in the main Rattle output area. Inspect the
generated rules. How are they sorted?
What is the meaning of rules that have an empty left-hand side? (Hint:
Compare these rules with the frequency plot you have generated earlier).
- Now play around with different values for Support and
Confidence. For example, what values would you set to get rules
that appear in at least a quarter of all records?
Note that you have to click Execute each time you change these
parameter values, followed by Show Rules.
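To get a feel for what the Support and Confidence thresholds measure, it can help to compute both by hand on a tiny example. The following is only a minimal sketch in Python, not what Rattle or arules run internally; the four records are made up for illustration and are not taken from audit.csv, and the function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical toy records (NOT from audit.csv); each record is a set of
# "Variable=value" items, similar to how a categorical row is itemised.
records = [
    {"Gender=Male", "Marital=Married"},
    {"Gender=Male", "Marital=Married"},
    {"Gender=Male", "Marital=Widowed"},
    {"Gender=Female", "Marital=Widowed"},
]

def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    hits = sum(1 for r in records if itemset <= r)
    return Fraction(hits, len(records))

def confidence(lhs, rhs, records):
    """support(lhs and rhs together) divided by support(lhs), for lhs => rhs."""
    return support(lhs | rhs, records) / support(lhs, records)

lhs = {"Gender=Male"}
rhs = {"Marital=Married"}
rule_support = support(lhs | rhs, records)        # 2 of the 4 records
rule_confidence = confidence(lhs, rhs, records)   # 2 of the 3 male records
print(rule_support, rule_confidence)
```

A rule is only reported if both numbers reach the thresholds you set on the Associate tab.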
- Next we want to include a variable that has missing values. Go to
the Transform tab, select Impute, then select
Constant and type in a string value in the corresponding box.
A value like 'Missing', 'Not available', or 'NA'
would make sense, of course. Think about what values you do not want to
impute.
Select the Occupation variable, and click on Execute to
start the data imputation process. Once done, go to the Data tab
to check that a new variable has been created, and make sure the role
of this new variable is set as Input. Remember, if you change
any of the variable roles you need to confirm this with a click on
Execute.
Go to the Explore tab and check that the new variable is now
free of missing values, and that it contains the new string value you
have just imputed.
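Conceptually, constant imputation just replaces every missing entry with the same string. A minimal Python sketch of the idea follows; the values are hypothetical, the way missing entries are represented (None or an empty string) is an assumption, and how Rattle names or stores the new variable is not shown here.

```python
def impute_constant(values, constant="Missing"):
    """Return a copy of values with None or empty strings replaced by constant."""
    return [constant if v is None or v == "" else v for v in values]

# Hypothetical occupation column with two missing entries.
occupation = ["Service", None, "Clerical", "", "Service"]
imputed = impute_constant(occupation)
print(imputed)
```

After imputation, the constant behaves like any other category, so it can appear in rules in its own right.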
- Go back to the Associate tab and re-do the association mining.
What new rules do you get?
- As a last step we want to include the Age variable, by
converting it into a categorical variable with 6 categories.
Note: Due to the above-mentioned problems with the arules
package, it is recommended that you quit Rattle and restart
it before continuing (you then need to re-do steps 2, 3 and 4 from
above).
Go to the Transform tab, select Remap, then change the
Number field to 6.
Select the Age variable, and click on Execute to
start the data re-mapping process. Once done go to the Data tab
to check that a new variable has been created, and make sure the role
of this new variable is set as Input. Also make sure the role of
the original Age variable is set to Ignore (again,
remember to click on Execute to confirm any changes in roles).
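Turning a numeric variable into categories is often called binning. The sketch below uses equal-width bins, which is an assumption made for illustration only: this lab sheet does not say which binning method Rattle applies for the Remap option, and the ages are made up.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        # The maximum value would compute to index n_bins, so clamp it
        # into the last bin.
        bins.append(min(int((v - lo) / width), n_bins - 1))
    return bins

ages = [17, 25, 33, 41, 58, 77]  # hypothetical ages, not from audit.csv
age_bins = equal_width_bins(ages, 6)
print(age_bins)
```

Each bin index then acts as one category value, which is what allows association mining to treat Age like any other categorical variable.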
- Go back to the Associate tab, click on Freq Plot, and then
re-do the association mining. What new rules do you get?
- Now we start looking at clustering in Rattle. For this, we will
also use the audit.csv data set. It is best to quit and restart
Rattle, and then repeat steps 2 and 3 from above.
- The clustering algorithms implemented in Rattle only work on
numerical attributes (variables), so please go to the Data tab
and set the role of all Categorical type variables to
Ignore.
To allow better visualisation of the clustering process, we will
only use the following two variables: Income and Hours.
Please set their role to Input on the Data tab, and for the
time being set the roles of Age, Deductions,
Adjustment and Adjusted to Ignore as well.
Don't forget to click Execute to confirm your variable selection.
- Go to the Cluster tab, make sure that K Means is
selected. Leave the number of clusters as 10 and click Execute.
Inspect the output printed in the main Rattle output area,
especially the cluster centroids. Do they look like well separated
clusters to you?
Further statistical information can be generated by clicking the
Stats button, and you can get a plot of the clusters using the
Data Plot button.
- Now click on Data Plot to visualise the clustering graphically.
What kind of clusters has the k-means algorithm produced? Along which
variable has the data been clustered? Can you imagine why?
- In order to improve the clustering, you have to normalise the
variables you are using. To do so, go to the Transform tab and
use the Scale [0..1] normalisation (select the two variables
Hours and Income and click Execute to perform the
normalisation). You should see two new variables being generated:
R01_Income and R01_Hours.
Go back to the Cluster tab and re-do the clustering. How do the
clusters look now?
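The effect of the [0..1] scaling on distances can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the three Income and Hours values are made up, and the helper names are hypothetical. It rescales each variable to the range 0 to 1 and compares how much of the squared Euclidean distance between two records comes from Income before and after scaling.

```python
def rescale_01(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def income_share(record_a, record_b):
    """Share of the squared Euclidean distance due to the first coordinate."""
    d_income = (record_a[0] - record_b[0]) ** 2
    d_hours = (record_a[1] - record_b[1]) ** 2
    return d_income / (d_income + d_hours)

# Hypothetical values, not taken from audit.csv.
income = [20000.0, 60000.0, 100000.0]
hours = [20.0, 40.0, 60.0]

raw_share = income_share((income[0], hours[0]), (income[1], hours[1]))

r01_income = rescale_01(income)
r01_hours = rescale_01(hours)
scaled_share = income_share((r01_income[0], r01_hours[0]),
                            (r01_income[1], r01_hours[1]))
print(raw_share, scaled_share)
```

On the raw values nearly all of the distance comes from Income; after scaling, both variables contribute equally in this example.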
- Now play with the number of clusters. Reduce them, and click
Execute each time to re-run the k-means clustering algorithm.
Stop when you feel the clusters generated are well separated and
intuitively make sense. Three or four clusters seem to be a good
number for this example data set and variable selection.
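If you want to see what a k-means run does step by step, the following minimal Python sketch implements the textbook procedure (Lloyd's algorithm). It is not the implementation Rattle calls, the starting centroids are passed in explicitly to keep the example deterministic, and the four points are made up.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster where it is
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = k_means(points, centroids=[points[0], points[2]])
print(labels, centroids)
```

Because the result depends on the starting centroids and on the number of clusters you ask for, it is worth re-running with several settings, as suggested above.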
- Now change the clustering algorithm to Hierarchical. First,
let's leave the various parameters at their given values and click
Execute to run this algorithm.
- Click on Dendrogram to view the hierarchical clustering
generated. Then change the number of clusters and again click
Dendrogram for various cluster numbers. How do the
generated plots differ? You can always see the actual clusters by
clicking on Data Plot.
- Next change the Distance and Agglomerate parameters,
followed by clicking Execute to run the clustering algorithm,
and click on Dendrogram to visualise the results.
- Now set the number of clusters to the same as you had previously
with k-means clustering, and click on Data Plot to visualise
the generated clustering. How does it differ from the clusters
generated by k-means?
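The idea behind hierarchical (agglomerative) clustering can also be sketched in a few lines of Python. This is a simplified illustration, not Rattle's implementation: it uses Euclidean distance, supports only single and complete linkage through a hypothetical linkage argument, and runs on made-up points. The names used here are the standard textbook ones and may not match the option labels shown for the Distance and Agglomerate parameters.

```python
def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def agglomerate(points, n_clusters, linkage=min):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Clusters are returned as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(euclidean(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical 2-D points: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
three = agglomerate(points, 3)
two = agglomerate(points, 2)
print(three, two)
```

Cutting the same merge sequence at different cluster numbers is what you see when you redraw the Dendrogram for various cluster counts.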
- Now go back to the Data tab and select one or more other
variables of type Numeric as Input variables, followed
by clicking Execute to confirm your selection. Then go back
to the Cluster tab and re-run the various clustering
approaches (you might have to do a normalisation on these new
variables as well).
Please note that variables that contain missing values may cause
the k-means algorithm to crash, so make sure you impute
missing values first.
- Time permitting, select another data set from the Library on the
Data tab, and run association mining or clustering on it.
A small example file is dvdtrans.csv, which
contains transactions of the DVD movies purchased by customers.
Note: This data set contains transactions in the form of
market baskets, and therefore you need to check the tick box
Baskets on the Associate tab.
Alternatively, you can download a data set from an online
collection, such as the
UCI Machine Learning
Repository.
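The market basket form mentioned for dvdtrans.csv means that each row holds one purchased item together with a transaction identifier, and all rows sharing an identifier make up one basket. A minimal Python sketch of that grouping follows; the rows and item names are hypothetical and are not the contents of dvdtrans.csv.

```python
from collections import defaultdict

# Hypothetical (transaction id, item) rows; NOT the contents of dvdtrans.csv.
rows = [
    (1, "Movie A"), (1, "Movie B"),
    (2, "Movie A"),
    (3, "Movie B"), (3, "Movie C"),
]

def to_baskets(rows):
    """Group (transaction id, item) rows into one set of items per transaction."""
    baskets = defaultdict(set)
    for transaction_id, item in rows:
        baskets[transaction_id].add(item)
    return dict(baskets)

baskets = to_baskets(rows)
print(baskets)
```

Association rules are then mined over these baskets rather than over the columns of a single record.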
- Quit Rattle as described in the
previous tutorial.
Last modified: 18/05/2009, 14:48