The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science

COMP3420 - Tutorial 5 - Week 11 (18-22 May)

Association rules and clustering in Rattle (unsupervised learning)

Objectives

The objectives of this second Rattle lab are to experiment with the association rule and clustering algorithms available in Rattle, to better understand the issues involved with these data mining techniques, and to become more familiar with the Rattle tool.


Preliminaries

If you haven't done so yet, I suggest you create a tutorial5 directory (or folder) within your comp3420 directory.

Now copy the simple example data set audit.csv into your tutorial5 directory. It contains artificial data created by Graham Williams (developer of Rattle) and is supplied with Rattle. To quote from the Rattle documentation: "It consists of 2,000 fictional clients who have been audited, perhaps for compliance with regard to the amount of a tax refund that is being claimed. For each case an outcome is recorded (whether the taxpayer's claims had to be adjusted or not) and any amount of adjustment that resulted is also recorded."

Association mining in Rattle is based on the extension package arules (one of the many extension packages available for R). For more information about arules please have a look at the arules package information page.

Note: The Rattle version currently installed in the CSIT computer labs seems to have some bugs that affect the way association mining can be conducted. Specifically, the arules package produces error messages when variables that contain missing values are used, and also for several other, currently unknown, reasons. I (Peter Christen) have been in contact with Graham Williams (Rattle developer) to try to fix this problem. We will keep you informed about this issue. Our apologies if such error messages appear during the labs.

You should also read the sections on unsupervised modelling in the online Rattle documentation.


Tasks

  1. Start Rattle as described in the tutorial 4 lab sheet.

  2. Load the CSV data set audit.csv (make sure the CSV File radio button is selected). Click on Execute to load the data set.

  3. For association mining, we do want to use all records in this data set. Therefore, un-tick the Sample box.

  4. Now set the role of all variables that contain missing values to Ignore. Also set the role of the variable ID to Ignore, and think about why we don't want to have the identifier variable in our association mining experiments. Make sure you click on Execute to confirm your variable settings.

  5. Now explore this data set on the Explore tabs, similar to what you have done in the previous tutorial.

  6. Once you have a good understanding of the content of the audit.csv data set, go to the Associate tab.

  7. On the Associate tab, make sure Baskets is not selected. Click on Freq Plot and inspect the graph that is shown.

    What does this graph tell you, for example, about the distribution of records with Female and Male gender, or the distribution of the Marital values? Why are Marital=Widowed or Marital=Separated, for example, not shown?

    Now change the Support value to a lower or a higher value, and re-generate the Freq Plot. What do you see?

  8. Change the Support back to its original value of 0.1 and start the association mining algorithm by clicking on the Execute button.

    Inspect the output generated. How many rules were generated? Where is most of the time spent?

  9. If you click on Show Rules, all the generated rules will be displayed in the main Rattle output area. Inspect the generated rules. How are they sorted?

    What is the meaning of rules that have an empty left-hand side? (Hint: Compare these rules with the frequency plot you have generated earlier).

  10. Now play around with different values for Support and Confidence. For example, what values would you set to get rules that appear in at least a quarter of all records? Note that you have to click Execute each time you change these parameter values, followed by Show Rules.
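
The Support and Confidence parameters used in steps 7 to 10 can be illustrated with a minimal Python sketch. It is not how arules is implemented; the records below are hypothetical (they are not taken from audit.csv), and the function names support, confidence and frequent_items are assumptions made for this illustration only.

```python
# Hypothetical records, each a set of "variable=value" items (not the real audit.csv).
records = [
    {"Gender=Male", "Marital=Married"},
    {"Gender=Male", "Marital=Married"},
    {"Gender=Male", "Marital=Absent"},
    {"Gender=Female", "Marital=Absent"},
    {"Gender=Female", "Marital=Divorced"},
    {"Gender=Male", "Marital=Married"},
    {"Gender=Female", "Marital=Widowed"},
    {"Gender=Male", "Marital=Absent"},
]


def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for r in records if itemset <= r) / len(records)


def confidence(lhs, rhs, records):
    """Support of lhs and rhs together, divided by the support of lhs."""
    return support(set(lhs) | set(rhs), records) / support(lhs, records)


def frequent_items(records, min_support):
    """Single items whose support reaches min_support, in sorted order."""
    items = sorted(set().union(*records))
    return [i for i in items if support({i}, records) >= min_support]


# Items that occur in at least a quarter of all records.
print(frequent_items(records, 0.25))
# Confidence of the rule {Gender=Male} => {Marital=Married}.
print(confidence({"Gender=Male"}, {"Marital=Married"}, records))
# A rule with an empty left-hand side: its confidence equals the support of the right-hand side.
print(confidence(set(), {"Gender=Male"}, records))
```

In this toy data, Marital=Widowed and Marital=Divorced each occur in only one of eight records, so they fall below a minimum support of 0.25 and would not appear in a frequency plot drawn at that threshold.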

  11. Next we want to include a variable that has missing values. Go to the Transform tab, select Impute, then select Constant and type in a string value in the corresponding box. A value like 'Missing', 'Not available', or 'NA' would make sense, of course. Think about what values you do not want to impute.

    Select the Occupation variable, and click on Execute to start the data imputation process. Once done, go to the Data tab to check that a new variable has been created, and make sure the role of this new variable is set as Input. Remember, if you change any of the variable roles you need to confirm this with a click on Execute.

    Go to the Explore page and check that the new variable is now free of missing values, and that it contains the new string value you have just imputed.
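
Conceptually, constant imputation as used in step 11 replaces every missing entry with one fixed string. The sketch below is a hedged illustration in plain Python, not Rattle's implementation: the function impute_constant, the set of missing markers and the example values are all assumptions. It also shows one kind of value you may not want to impute, namely a constant that already occurs as a real category.

```python
# Markers treated as "missing" in this sketch (an assumption, not Rattle's rule).
MISSING_MARKERS = (None, "", "NA")


def impute_constant(values, constant="Missing"):
    """Return a copy of values with missing entries replaced by constant."""
    observed = {v for v in values if v not in MISSING_MARKERS}
    if constant in observed:
        # Imputing an existing category would merge unknown with known values.
        raise ValueError(f"{constant!r} is already a real category")
    return [constant if v in MISSING_MARKERS else v for v in values]


# Hypothetical Occupation values with two missing entries.
occupation = ["Clerical", None, "Service", "", "Clerical"]
print(impute_constant(occupation))
```

After imputation the missing entries form their own category, which is why new association rules involving the imputed value can appear.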

  12. Go back to the Associate tab and re-do the association mining. What new rules do you get?

  13. As a last step we want to include the Age variable, by converting it into a categorical variable with 6 categories.

    Note: Due to the above-mentioned problems with the arules package, it is recommended that you quit Rattle and re-start it before continuing (you then need to re-do steps 2, 3 and 4 from above).

    Go to the Transform tab, select Remap, then change the Number field to 6.

    Select the Age variable, and click on Execute to start the data re-mapping process. Once done go to the Data tab to check that a new variable has been created, and make sure the role of this new variable is set as Input. Also make sure the role of the original Age variable is set to Ignore (again, remember to click on Execute to confirm any changes in roles).
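
Converting a numeric variable such as Age into 6 categories (step 13) can be pictured as binning. The sketch below assumes equal-width bins; the binning method Rattle applies by default is not stated on this sheet, so treat this as one possible scheme. The function name equal_width_bins and the example ages are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical ages, not taken from audit.csv.
ages = [20, 29, 30, 45, 59, 61, 79, 80]
print(equal_width_bins(ages, 6))
```

Each bin index then plays the role of one categorical value, which is what allows the binned Age variable to take part in association mining.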

  14. Go back to the Associate tab, click on Freq Plot, and then re-do the association mining. What new rules do you get?

     

  15. Now we start looking at clustering in Rattle. For this, we will also use the audit.csv data set. It is best to quit and re-start Rattle, and then repeat steps 2 and 3 from above.

  16. The clustering algorithms implemented in Rattle only work on numerical attributes (variables), so please go to the Data tab and set the role of all Categoric type variables to Ignore.

    To allow better visualisation of the clustering process, we will only use the following two variables:

    • Hours
    • Income

    Please set their role to Input on the Data tab, and for the time being set the roles of Age, Deductions, Adjustment and Adjusted to Ignore as well.

    Don't forget to click Execute to confirm your variable selection.

  17. Go to the Cluster tab, make sure that K Means is selected. Leave the number of clusters as 10 and click Execute. Inspect the output printed in the main Rattle output area, especially the cluster centroids. Do they look like well separated clusters to you?

    Further statistical information can be generated by clicking the Stats button, and you can get a plot of the clusters using the Data Plot button.

  18. Now click on Data Plot to visualise the clustering graphically. What kind of clusters has the k-means algorithm produced? Along which variable has the data been clustered? Can you imagine why?

  19. In order to improve the clustering, you have to normalise the variables you are using. To do so, go to the Transform tab and use the Scale [0..1] normalisation (select the two variables Hours and Income and click Execute to perform the normalisation). You should see two new variables being generated: R01_Income and R01_Hours.

    Go back to the Cluster tab and re-do the clustering. How do the clusters look now?
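
Why normalisation matters in step 19 can be seen from the Euclidean distance itself: a variable with a large numeric range dominates the distance. The following sketch rescales each variable to [0, 1] with a min-max transformation; the function rescale01 and the three example clients are hypothetical, and the code only illustrates the idea rather than Rattle's own transformation code.

```python
import math


def rescale01(values):
    """Min-max rescaling of a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Three hypothetical clients (not from audit.csv).
hours = [20, 40, 60]
income = [30000.0, 31000.0, 90000.0]

raw_points = list(zip(hours, income))
scaled_points = list(zip(rescale01(hours), rescale01(income)))

# Distance between the first two clients before and after rescaling.
print(math.dist(raw_points[0], raw_points[1]))
print(math.dist(scaled_points[0], scaled_points[1]))
```

Before rescaling, the distance between the first two clients is driven almost entirely by the income difference of 1000; after rescaling, their difference in hours contributes most of the distance.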

  20. Now play with the number of clusters. Reduce it step by step, and click Execute each time to re-run the k-means clustering algorithm. Stop when you feel the clusters generated are well separated and intuitively make sense. Three or four clusters seem to be a good number for this example data set and variable selection.
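
For reference, the k-means procedure run in steps 17 to 20 alternates between assigning each record to its nearest centroid and recomputing each centroid as the mean of its records. The sketch below is a minimal pure-Python version under simplifying assumptions: the first k points serve as starting centroids (real implementations usually start randomly), and the function name kmeans and the example points are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Minimal k-means: returns (centroids, labels) for a list of numeric tuples."""
    # Simplistic deterministic start: the first k points (an assumption of this sketch).
    centroids = [tuple(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # No assignment changed, so the algorithm has converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # An empty cluster keeps its previous centroid.
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


# Two well separated hypothetical groups of points.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centroids, labels = kmeans(points, 2)
print(centroids)
print(labels)
```

Because the assignment step relies on plain Euclidean distance, running this on unscaled variables with very different ranges would split the data mainly along the variable with the largest range.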

  21. Now change the clustering algorithm to Hierarchical. First, let's leave the various parameters at their given values and click Execute to run this algorithm.

  22. Click on Dendrogram to view the hierarchical clustering generated. Then change the number of clusters and again click Dendrogram for various cluster numbers. How do the generated plots differ? You can always see the actual clusters by clicking on Data Plot.

  23. Next change the Distance and Agglomerate parameters, followed by clicking Execute to run the clustering algorithm, and click on Dendrogram to visualise the results.

  24. Now set the number of clusters to the same as you had previously with k-means clustering, and click on Data Plot to visualise the generated clustering. How does it differ from the clusters generated by k-means?
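
The hierarchical approach used in steps 21 to 24 can be sketched as repeated merging: start with every record in its own cluster and merge the two closest clusters until the requested number remains. The code below is a naive illustration with Euclidean distance and two common agglomeration rules (single and complete linkage). Whether these match the Distance and Agglomerate options in your Rattle version is an assumption, and the function name agglomerative and the example points are hypothetical.

```python
import math


def agglomerative(points, n_clusters, linkage="complete"):
    """Naive agglomerative clustering; returns clusters as lists of point indices."""
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    # Complete linkage uses the largest pairwise distance, single the smallest.
    agg = max if linkage == "complete" else min

    def cluster_distance(a, b):
        return agg(math.dist(points[i], points[j]) for i in a for j in b)

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Two well separated hypothetical groups of points.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
print(agglomerative(points, 2, linkage="complete"))
print(agglomerative(points, 2, linkage="single"))
```

The sequence of merges is what a dendrogram draws; choosing a number of clusters corresponds to cutting that tree at a particular height.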

  25. Now go back to the Data page and select one or more other variables of type Numeric as Input variables, followed by clicking Execute to confirm your selection. Then go back to the Cluster page and re-run the various clustering approaches (you might have to do a normalisation on these new variables as well).

    Please note that variables that contain missing values may cause the k-means algorithm to crash, so make sure you impute missing values first.

  26. Time permitting, select another data set from the Library on the Data page, and run association mining or clustering on it.

    A small example file is dvdtrans.csv, which contains transactions of the DVD movies purchased by customers. Note: This data set contains transactions in the form of market baskets, and therefore you need to check the tick box Baskets on the Associate tab.

    Alternatively, you can download a data set from an online collection, such as the one of the UCI Machine Learning repository.
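
The market basket format mentioned for dvdtrans.csv differs from the audit.csv layout: instead of one row per record with one column per variable, each row holds a single transaction identifier and a single item. A minimal sketch of grouping such rows into baskets is shown below. The column names ID and Item and the example rows are assumptions for illustration and are not copied from the actual file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in long format: one (ID, Item) pair per line.
raw = """ID,Item
1,Movie A
1,Movie B
2,Movie A
2,Movie C
3,Movie B
"""


def read_baskets(text, id_col="ID", item_col="Item"):
    """Group long-format rows into one set of items per transaction ID."""
    baskets = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text)):
        baskets[row[id_col]].add(row[item_col])
    return dict(baskets)


print(read_baskets(raw))
```

Each resulting basket is one transaction, and association rules are then mined over the items that occur together in the same basket.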

  27. Quit Rattle as described in the previous tutorial.


Last modified: 18/05/2009, 14:48