The Australian National University
ANU College of Engineering and Computer Science
Department of Computer Science

COMP8400 - Lab 3 - Thursday 9 April

Decision trees in Rattle

Objectives

The objectives of this third lab are to experiment with the decision tree package available in R and Rattle, in order to better understand the issues involved with this data mining technique; and to experiment with the different evaluation methods for supervised classification available in the Rattle tool.

Preliminaries

For this lab, we will mainly use the audit.csv data set, which you used in lab 2. If you want to run more experiments on another data set at the end of the lab, please do so.

The binary (2-class) supervised decision tree classifier in Rattle is based on the R package rpart (Recursive partitioning and regression trees). You can get help on this package by typing the following three commands into the R console (the terminal window where you started Rattle):

  • library(rpart)
  • help(rpart)
  • help(rpart.control)

You should also read through the section Building Classification Models in the Rattle Data Miner documentation.


Tasks

  1. Start Rattle as described in the first lab sheet.

  2. Load the CSV data set audit.csv (make sure you have CSV File selected in the Data tab, and the Header box is ticked).

  3. Click Execute to load the data into Rattle.

  4. Now make sure the variable (attribute) Adjusted is selected as the Target variable, and that you sample the data (e.g. leave the sampling value at 70 percent). Also make sure that the variable ID is set to the role Ident (identifier).
    You can also set other variables to Ignore if you feel they are not suitable for decision tree classification (after building a decision tree you might want to come back to the Data tab and change your variable selection).
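
For readers who want to see what these Data tab choices amount to in plain R, here is a minimal sketch. The helper split_audit() is hypothetical (it is not part of Rattle or rpart), the seed is arbitrary, and dropping Adjustment is an assumption: it is the risk variable mentioned in task 12 and would otherwise act as a predictor.

```r
# Hypothetical helper: mimic the Data tab roles and the 70 percent sample.
split_audit <- function(audit, train_fraction = 0.7, seed = 42,
                        drop = c("ID", "Adjustment")) {
  set.seed(seed)  # arbitrary seed, only so the split is repeatable
  n <- nrow(audit)
  train_rows <- sample(n, size = floor(train_fraction * n))

  # Identifier and risk columns are removed so they cannot become predictors.
  keep <- setdiff(names(audit), drop)
  audit <- audit[, keep, drop = FALSE]

  # A factor target tells rpart to build a classification tree.
  audit$Adjusted <- factor(audit$Adjusted)

  list(train = audit[train_rows, , drop = FALSE],
       test  = audit[-train_rows, , drop = FALSE])
}
```

Assuming audit.csv is in the working directory, parts <- split_audit(read.csv("audit.csv")) would give parts$train and parts$test for the later steps.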

  5. Next you might want to explore the data set in order to again become familiar with it. Specifically, you should examine the values of the target variable Adjusted. You might also want to have a look at the actual data (which you can do in the Data tab by clicking on View Data).

  6. Now go to the Model tab and make sure the Tree type radio button is selected. As you can see, there are various parameters that can be set and modified. Please read the Rattle documentation on decision trees for more information. You can also get additional help for these parameters from R by typing into the R console: help(rpart.control).

  7. To begin with, set the value for Complexity as low as possible (simply type in 0; it will then automatically be set to the smallest possible value).

  8. To generate a decision tree, click on Execute and inspect what is printed into the main Rattle output area. Next, click on Draw and a window with a decision tree will be shown.
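
The Model tab settings map onto arguments of rpart.control, so tasks 6 to 8 can be sketched directly in the R console. The wrapper fit_tree() below is hypothetical; its defaults follow the rpart.control defaults, except that cp is set to 0 as suggested in task 7. It assumes a training data frame with a factor column named Adjusted.

```r
library(rpart)

# Hypothetical wrapper: one argument per Model tab field.
fit_tree <- function(train, cp = 0, minsplit = 20, minbucket = 7,
                     maxdepth = 30) {
  rpart(Adjusted ~ ., data = train, method = "class",
        control = rpart.control(cp = cp,              # Complexity
                                minsplit = minsplit,  # Min Split
                                minbucket = minbucket,  # Min Bucket
                                maxdepth = maxdepth))   # Max Depth
}
```

With fit <- fit_tree(train), calling print(fit) lists the nodes in text form, and plot(fit) followed by text(fit) draws a basic tree (this is base rpart plotting, so it will look different from the Rattle drawing).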

  9. Compare the decision tree drawing with the Summary of the rpart model in the main Rattle output area. Each leaf node in the drawing has a coloured number (which corresponds to the leaf node number), a 0 or 1 (which is the class label from the audit data set according to the target variable Adjusted), and a percentage (which corresponds to the accuracy on the training records that fall into this leaf node).
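
The same three pieces of information can be read from the fitted rpart object, which may help when checking the drawing against the text output. This is a sketch only: leaf_summary() is a hypothetical helper, and it assumes a classification tree fitted with default priors and losses, where the dev column of fit$frame counts the misclassified training records in each node.

```r
# Hypothetical helper: one row per leaf node of a classification tree.
leaf_summary <- function(fit) {
  fr <- fit$frame
  leaves <- fr$var == "<leaf>"          # rpart marks leaf rows this way
  data.frame(
    node     = as.integer(rownames(fr)[leaves]),          # leaf node number
    class    = attr(fit, "ylevels")[fr$yval[leaves]],     # predicted label
    n        = fr$n[leaves],                              # training records
    accuracy = 1 - fr$dev[leaves] / fr$n[leaves]          # share correct
  )
}
```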

  10. Now go to the Evaluate tab and examine the different options for evaluating the accuracy of the decision tree you just generated. Make sure the Testing data and Error matrix radio buttons are both selected, and then click on Execute. You should check the Confusion matrix (and write down the four numbers for each tree you generate).

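
The four numbers of the confusion matrix can also be computed by hand, which makes it clear where they come from. A minimal sketch, assuming a fitted rpart classification tree and a test data frame with the target column Adjusted; evaluate_tree() is a hypothetical helper, not a Rattle function.

```r
library(rpart)

# Hypothetical helper: confusion matrix and overall accuracy on test data.
evaluate_tree <- function(fit, test) {
  predicted <- predict(fit, newdata = test, type = "class")
  # Use the tree's own class levels so the table is always square.
  actual <- factor(test$Adjusted, levels = attr(fit, "ylevels"))
  confusion <- table(Actual = actual, Predicted = predicted)
  list(confusion = confusion,
       accuracy  = sum(diag(confusion)) / sum(confusion))
}
```

The diagonal of the table holds the correctly classified test records; the two off-diagonal cells are the two kinds of error.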
  11. Next, select the different graphical measures available, and for each click on Execute. You can read more on these evaluation measures in the Rattle documentation section on evaluation and deployment.

  12. If you want to examine the Risk graph you need to go back to the Data tab and select a risk variable. For the audit.csv data set this should be the variable Adjustment (make sure you confirm your variable role change with a click on Execute on the Data tab).

  13. Now experiment with different values for the parameters Complexity, Min Bucket, Min Split and Max Depth. Which tree will give you the best accuracy, which one the worst? Which tree is the easiest to interpret? Which is hardest?
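
If you prefer to run this experiment from the R console rather than by hand, a small loop over parameter settings is one option. The sketch below varies only Complexity (cp) and Max Depth; compare_trees() and its default grid are hypothetical choices, and it assumes train and test data frames with a target column Adjusted.

```r
library(rpart)

# Hypothetical helper: fit one tree per parameter combination and
# record its size and its accuracy on the test data.
compare_trees <- function(train, test,
                          grid = expand.grid(cp = c(0, 0.01, 0.05),
                                             maxdepth = c(2, 5, 30))) {
  grid$leaves <- NA_integer_
  grid$test_accuracy <- NA_real_
  for (i in seq_len(nrow(grid))) {
    fit <- rpart(Adjusted ~ ., data = train, method = "class",
                 control = rpart.control(cp = grid$cp[i],
                                         maxdepth = grid$maxdepth[i]))
    pred <- predict(fit, newdata = test, type = "class")
    grid$leaves[i] <- sum(fit$frame$var == "<leaf>")
    # Compare as character so differing factor levels cannot cause an error.
    grid$test_accuracy[i] <- mean(as.character(pred) ==
                                    as.character(test$Adjusted))
  }
  grid
}
```

The leaves column gives a rough measure of how hard a tree is to read, so the result can be set against the accuracy column when answering the questions above.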

  14. Also click on Rules to see the rules generated from a given tree. What is easier to understand, the tree diagram or the rules?
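
A rule is simply the list of split conditions on the path from the root to one leaf. The rpart function path.rpart returns these paths, so a rough equivalent of the rule listing can be sketched as follows; tree_rules() is a hypothetical helper, and its output shows only the conditions, not the extra statistics that Rattle prints with each rule.

```r
library(rpart)

# Hypothetical helper: one condition string per leaf node.
tree_rules <- function(fit) {
  fr <- fit$frame
  leaves <- as.numeric(rownames(fr)[fr$var == "<leaf>"])
  paths <- path.rpart(fit, nodes = leaves, print.it = FALSE)
  # Each path starts with "root"; the remaining entries are the splits.
  vapply(paths, function(p) paste(p[-1], collapse = " AND "), character(1))
}
```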

  15. If you have time, you might want to use different data sets, e.g. from the UCI Machine Learning Repository, and explore how you can build decision trees on them.

  16. Quit Rattle as described in the first lab sheet.


Last modified: 9/04/2009, 07:33