The Australian National University
ANU College of Engineering and Computer Science
Department of Computer Science

COMP8400 - Lab 2 - Thursday 26 March

Association mining in Rattle

Objectives

The objective of this second lab is to experiment with the association mining package available in R and Rattle, to better understand the issues involved with this data mining technique, and to become more familiar with the Rattle tool.


Preliminaries

If you haven't done so yet, I suggest you create a lab2 directory (or folder) within your comp8400 directory.

Now copy the simple example data set audit.csv into your lab2 directory. It contains artificial data created by Graham Williams and is supplied with Rattle. To quote from the Rattle documentation: "It consists of 2,000 fictional clients who have been audited, perhaps for compliance with regard to the amount of a tax refund that is being claimed. For each case an outcome is recorded (whether the taxpayer's claims had to be adjusted or not) and any amount of adjustment that resulted is also recorded."

Association mining in Rattle is based on the extension package arules (one of the many extension packages available for R). For more information about arules, please have a look at the arules package information page.

Note: The Rattle version currently installed in the CSIT computer labs seems to have some bugs that affect the way association mining can be conducted. Specifically, the arules package produces error messages when variables contain missing values, and also for several other, currently unknown, reasons. I (Peter Christen) have been in contact with Graham Williams (Rattle developer) in the past week trying to fix this problem. We will keep you informed about this issue. Our apologies if such error messages appear during the labs.


Tasks

  1. Start Rattle as described in the first lab sheet.

  2. Load the CSV data set audit.csv (make sure the CSV File radio button is selected). Click on Execute to load the data set.

  3. For association mining, we want to use all records in this data set. Therefore, un-tick the Sample box.

  4. Now set the role of all variables that contain missing values to Ignore. Also set the role of the variable ID to Ignore - think about why we don't want to have the identifier variable in our association mining experiments. Make sure you click on Execute to confirm your variable settings.

  5. Now explore this data set on the Explore tab, similar to what you did in the first lab.

  6. Once you have a good understanding of the content of the audit.csv data set, go to the Associate tab.

  7. On the Associate tab, make sure Baskets is not selected. Click on Freq Plot and inspect the graph that is shown.

    What does this graph tell you about, for example, the distribution of records with Female and Male gender, or the distribution of the Marital values? Why are Marital=Widowed or Marital=Separated, for example, not shown?

    Now change the Support value to a lower or a higher value, and re-generate the Freq Plot. What do you see?

  8. Change the Support back to its original value of 0.1 and start the association mining algorithm by clicking on the Execute button.

    Inspect the output generated. How many rules were generated? Where is most of the time spent?

  9. If you click on Show Rules, all the generated rules will be displayed in the main Rattle output area. Inspect the generated rules. How are they sorted?

    What is the meaning of rules that have an empty left-hand side? (Hint: Compare these rules with the frequency plot you have generated earlier).

  10. Now play around with different values for Support and Confidence. For example, what values would you set to get rules that appear in at least a quarter of all records? Note that you have to click Execute each time you change these parameter values, followed by Show Rules.

  11. Next we want to include a variable that has missing values. Go to the Transform tab, select Impute, then select Constant and type a string value into the corresponding box. A value like 'Missing', 'Not available', or 'NA' would make sense, of course. Think about which values you would not want to impute.

    Select the Occupation variable, and click on Execute to start the data imputation process. Once done, go to the Data tab to check that a new variable has been created, and make sure the role of this new variable is set as Input. Remember, if you change any of the variable roles you need to confirm this with a click on Execute.

    Go to the Explore tab and check that the new variable is now free of missing values, and that it contains the new string value you have just imputed.

  12. Go back to the Associate tab and re-do the association mining. What new rules do you get?

  13. As a last step we want to include the Age variable, by converting it into a categorical variable with 6 categories.

    Note: Due to the above-mentioned problems with the arules package, it is recommended that you quit Rattle and re-start it before continuing (you then need to re-do steps 2, 3 and 4 from above).

    Go to the Transform tab, select Remap, then change the Number field to 6.

    Select the Age variable, and click on Execute to start the data re-mapping process. Once done, go to the Data tab to check that a new variable has been created, and make sure the role of this new variable is set as Input. Also make sure the role of the original Age variable is set to Ignore (again, remember to click on Execute to confirm any changes in roles).

  14. Go back to the Associate tab, click on Freq Plot, and then re-do the association mining. What new rules do you get?

  15. Quit Rattle as described in the first lab sheet.
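
The Support and Confidence values used in tasks 7 to 10 can be illustrated outside Rattle with a few lines of plain Python. This is only a sketch on made-up records, not what Rattle or arules do internally: the records, the `Variable=Value` item naming and the helper functions `to_items`, `support`, `confidence` and `frequent_items` are all assumptions chosen for illustration.

```python
# Hypothetical toy records; the variable names mimic audit.csv,
# but the values are invented for this illustration.
records = [
    {"Gender": "Male", "Marital": "Married"},
    {"Gender": "Male", "Marital": "Married"},
    {"Gender": "Female", "Marital": "Absent"},
    {"Gender": "Male", "Marital": "Absent"},
    {"Gender": "Female", "Marital": "Married"},
    {"Gender": "Male", "Marital": "Widowed"},
    {"Gender": "Female", "Marital": "Absent"},
    {"Gender": "Male", "Marital": "Married"},
    {"Gender": "Male", "Marital": "Married"},
    {"Gender": "Female", "Marital": "Divorced"},
]


def to_items(record):
    """Turn one record into a set of 'Variable=Value' items."""
    return frozenset(f"{name}={value}" for name, value in record.items())


transactions = [to_items(r) for r in records]


def support(itemset):
    """Fraction of all records that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs):
    """Support of lhs and rhs together, divided by the support of lhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)


def frequent_items(min_support):
    """Single items whose support reaches the minimum support threshold."""
    items = sorted(set().union(*transactions))
    return {i: support([i]) for i in items if support([i]) >= min_support}


# Items below the threshold drop out, as in a frequency plot.
print(frequent_items(0.2))

# Confidence of the rule {Gender=Male} => {Marital=Married}.
print(confidence(["Gender=Male"], ["Marital=Married"]))

# A rule with an empty left-hand side: every record contains the empty set.
print(confidence([], ["Marital=Married"]), support(["Marital=Married"]))
```

Changing the threshold passed to `frequent_items` in this sketch mirrors changing the Support value before re-generating the Freq Plot.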

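The two transformations in tasks 11 and 13 (imputing a constant and converting Age into 6 categories) can be sketched in plain Python as well. The values below are made up, the helper names `impute_constant` and `equal_width_bins` are hypothetical, and equal-width binning is an assumption: Rattle's Remap may use a different binning scheme.

```python
# Hypothetical values; None marks a missing entry.
occupations = ["Clerical", None, "Sales", None, "Executive"]
ages = [17, 25, 38, 44, 52, 61, 77]


def impute_constant(values, constant="Missing"):
    """Replace every missing entry by the same string constant."""
    return [constant if v is None else v for v in values]


def equal_width_bins(values, number=6):
    """Map each numeric value to one of `number` equal-width intervals.

    Assumes the values are not all identical (otherwise the width is zero).
    The last interval is closed so that it includes the maximum value.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / number
    labels = []
    for v in values:
        # The maximum would land in bin `number`, so clamp it to the last bin.
        index = min(int((v - lo) / width), number - 1)
        left = lo + index * width
        right = lo + (index + 1) * width
        closing = "]" if index == number - 1 else ")"
        labels.append(f"[{left:g},{right:g}{closing}")
    return labels


print(impute_constant(occupations))
print(equal_width_bins(ages, number=6))
```

After binning, each interval label acts as an ordinary categorical value, so a numeric variable such as Age can take part in association mining like any other categorical variable.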

Last modified: 3/04/2009, 13:13