The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science

COMP3420 - Tutorial 6 - Week 13 (1-5 June)

Classification in Rattle (supervised learning)

Draft - Expect changes and clarifications
Please e-mail Peter Christen if you see mistakes or parts are unclear to you.

Objectives

The objectives of this third Rattle tutorial are to experiment with the decision tree and SVM classification algorithms available in Rattle, in order to better understand the issues involved with these data mining techniques, and to become more familiar with the Rattle tool.


Preliminaries

If you haven't done so yet, I suggest you create a tutorial6 directory (or folder) within your comp3420 directory.

For this lab, we will mainly use the audit.csv data set, which you used previously in tutorial 5. If you want to use another data set to conduct more experiments at the end of the tutorial, please do so.

The binary (2-class) supervised decision tree classifier in Rattle is based on the R package rpart (Recursive partitioning and regression trees). You can get help on this package by typing the following three commands into the R console (the terminal window where you started Rattle):

  • library(rpart)
  • help(rpart)
  • help(rpart.control)
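
For orientation, a tree of the same kind can also be fitted directly in the R console. The following is a minimal sketch, not Rattle's own code. It uses the kyphosis data set that ships with rpart as a stand-in (with the tutorial data you would read audit.csv and use Adjusted as the target). The assumption that Rattle's Complexity, Min Split, Min Bucket and Max Depth fields correspond to the cp, minsplit, minbucket and maxdepth arguments of rpart.control is based on the parameter names, and the values below are arbitrary.

```r
library(rpart)

# Stand-in data: kyphosis ships with rpart (two-class target Kyphosis).
# With the tutorial data you would read audit.csv and use Adjusted instead.
data(kyphosis)

set.seed(42)                                   # make the random split repeatable
n <- nrow(kyphosis)
train_idx <- sample(n, size = round(0.7 * n))  # 70% training sample
train <- kyphosis[train_idx, ]
test  <- kyphosis[-train_idx, ]

# Tree-growing parameters (illustrative values, not recommendations)
ctrl <- rpart.control(cp = 0.001, minsplit = 10, minbucket = 3, maxdepth = 5)

tree <- rpart(Kyphosis ~ Age + Number + Start, data = train,
              method = "class", control = ctrl)
print(tree)                                    # text summary of the splits

# Predict class labels for the held-out 30% and tabulate against the truth
pred <- predict(tree, newdata = test, type = "class")
conf <- table(Actual = test$Kyphosis, Predicted = pred)
print(conf)
accuracy <- sum(diag(conf)) / sum(conf)        # proportion classified correctly
print(accuracy)
```

Changing the values passed to rpart.control and re-running the last few lines gives a quick way to see how the tree size and the test accuracy react to each parameter.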

The support vector machine classifier in Rattle is based on the R package kernlab (Kernel Methods Lab), and specifically on the ksvm class from this package. You can get help on this class by typing the following two commands into the R console (the terminal window where you started R and Rattle):

  • library(kernlab)
  • help(ksvm)
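
As a rough illustration of what happens behind the SVM option, the sketch below calls ksvm directly. It is an assumption-laden example rather than Rattle's own code: it uses two of the three species in R's built-in iris data as a stand-in 2-class problem, and the kernel name and split are chosen only for illustration.

```r
library(kernlab)

# Stand-in two-class data: versicolor vs. virginica from the built-in iris data.
iris2 <- droplevels(subset(iris, Species != "setosa"))

set.seed(42)                                    # make the random split repeatable
n <- nrow(iris2)
train_idx <- sample(n, size = round(0.7 * n))   # 70% training sample
train <- iris2[train_idx, ]
test  <- iris2[-train_idx, ]

# Radial basis function kernel; other kernlab kernels include
# "polydot" (polynomial) and "vanilladot" (linear).
svm_model <- ksvm(Species ~ ., data = train, kernel = "rbfdot")
print(svm_model)

n_sv      <- nSV(svm_model)     # number of support vectors
train_err <- error(svm_model)   # training error reported by kernlab

# Confusion matrix on the held-out 30%
pred <- predict(svm_model, newdata = test)
conf <- table(Actual = test$Species, Predicted = pred)
print(conf)
```

Swapping the kernel argument and comparing train_err with the error on the test table is one way to explore the relationship between training and testing error outside Rattle.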

Before the lab, you should read through the sections Building Classification Models and Evaluation and Deployment available in the Rattle Data Miner documentation.


Tasks

  1. Start Rattle as described in the tutorial 4 sheet.

  2. Load the CSV data set audit.csv (make sure you have CSV File selected in the Data tab, and the Header box is ticked).

  3. Click Execute to load the data into Rattle.

  4. Now make sure the variable (attribute) Adjusted is selected as Target variable, and that you sample the data (e.g. leave the sampling percentage at its value of 70). Also make sure that the variable ID is set to role Ident(ifier).

    You can select or set to Ignore other variables if you feel they are not suitable for decision tree classification (after having built a decision tree you might later want to come back to the Data tab and change your variable selection).

  5. Next you might want to explore the data set in order to again become familiar with it. Specifically, you should examine the values of the target variable Adjusted. You might also want to have a look at the actual data (which you can do on the Data tab by clicking on View Data).

  6. Now go to the Model tab and make sure the Tree type radio button is selected. As you can see, there are various parameters that can be set and modified. Please read the Rattle documentation on decision trees for more information. You can also get additional help for these parameters from R by typing into the R console: help(rpart.control).

  7. To begin with, lower the value for Complexity as far as possible (simply type 0 into this input field; it will then automatically be set to the smallest possible value).

  8. To generate a decision tree, click on Execute and inspect what is printed into the main Rattle output area. Next, click on Draw and a window with a decision tree will be shown.

  9. Compare the decision tree drawing with the Summary of the rpart model in the main Rattle output area. Each leaf node in the drawing has a coloured number (which corresponds to the leaf node number), a 0 or 1 (which is the class label from the audit data set according to the target variable Adjusted), and a percentage number (which corresponds to the accuracy of the classified training records in this leaf node).

  10. Now go to the Evaluate tab and examine the different options to evaluate the accuracy of the decision tree you just generated. Make sure the Testing data and Error matrix radio buttons are selected, and then click on Execute. You should check the Confusion matrix (and write down the four numbers for each tree you generate).

  11. Now experiment with different values for the parameters Complexity, Min Bucket, Min Split and Max Depth. Which tree gives you the best accuracy, and which one the worst? Which tree is the easiest to interpret? Which is the hardest?

  12. Also click on Rules to see the rules generated from a given tree. Which is easier to understand, the tree diagram or the rules?

  13. Next, select the different graphical measures available, and for each click on Execute. You can read more on these evaluation measures in the Rattle documentation section on Evaluation and Deployment.
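
The four numbers written down in task 10 can be turned into comparable measures with a few lines of R. This is a minimal sketch, assuming you recorded the four cells as counts (if your Rattle version shows percentages the same formulas apply). The function name confusion_summary and the counts below are hypothetical and are not results from audit.csv.

```r
# Summarise a 2x2 confusion matrix given as four counts.
#   tn: actual 0, predicted 0     fp: actual 0, predicted 1
#   fn: actual 1, predicted 0     tp: actual 1, predicted 1
confusion_summary <- function(tn, fp, fn, tp) {
  total <- tn + fp + fn + tp
  c(accuracy  = (tp + tn) / total,   # proportion of all records classified correctly
    precision = tp / (tp + fp),      # of the predicted 1s, how many are really 1
    recall    = tp / (tp + fn),      # of the actual 1s, how many were found
    fpr       = fp / (fp + tn))      # of the actual 0s, how many were flagged as 1
}

# Hypothetical counts for two trees, for illustration only
tree_a <- confusion_summary(tn = 400, fp = 50, fn = 60, tp = 90)
tree_b <- confusion_summary(tn = 420, fp = 30, fn = 80, tp = 70)

# One row per tree makes the trade-offs easy to compare
print(round(rbind(tree_a, tree_b), 3))
```

Two trees with similar accuracy can differ noticeably in recall and false positive rate, which is why it is worth recording all four numbers rather than the accuracy alone.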

     

  14. Now let's look at the Support Vector Machine (SVM) classifier in Rattle.

    Go to the Model tab and make sure the SVM type radio button is selected. As you can see, there are two main parameters you can modify, one is the Kernel function (the mathematical function that is at the core of the SVM), and the other parameter is the Class Weights. Please read the Rattle documentation section on support vector machines for more information.

    Note that the current Rattle version contains a bug that results in an error/crash when the class weights are specified (Graham Williams, the developer of Rattle, is working on this). Therefore, we will not use the Class Weights parameter in this tutorial.

  15. To build an SVM, click on Execute and inspect what is printed into the main Rattle output area. How many support vectors are required (out of how many training records)?

  16. Go to the Evaluate tab and examine the confusion matrix results you get with this SVM (make sure the Testing button is activated and not the Training one). Write down the four numbers so you can compare them with the results from other SVMs you will construct later on in this tutorial.

    Is the accuracy of your SVM classifier better than that of the best decision tree you built earlier in this tutorial?

  17. Now experiment with the Kernel function, and for each SVM you build examine the resulting confusion matrix. Which one gives you the best results? Also check the Training error printed on the Model page. Is there a correlation between training and testing error?

  18. Next select the Tree classifier (as already done earlier) and re-create the best decision tree classifier you got. Once you have done this, go to the Evaluate tab and you will see that you can now also tick the Tree model box.

    Make sure both the SVM and Tree boxes are ticked, select Confusion and click on Execute. This should give you two confusion matrices each (two for the decision tree and two for the SVM). Which one is the better classifier?

  19. Next, select ROC on the Evaluate tab. Once you have clicked on Execute, a graph should pop up which contains two curves - one for the decision tree and one for the SVM. Compare these curves - again, which is the better classifier, and how do they differ?

    For more information about ROC graphs, please have a look at ROC Graphs: Notes and Practical Considerations for Researchers.

  20. Finally, let's look at the risk charts implemented in Rattle (please read the documentation provided at this previous link). Go back to the Data tab, and select the Adjustment variable (attribute) as Risk variable (make sure you click on Execute before you go back to the Model tab).

  21. Again build your 2-class classifiers (decision tree and SVM), and then go to the Evaluate tab and select Risk (make sure both Tree and SVM are ticked). Once you click on Execute you will see two risk charts pop up. Analyse them to see which of the two classifiers is better.

  22. If you have time, you might want to use different data sets, e.g. from the UCI Machine Learning repository, and explore how you can build SVMs and decision trees on them.

  23. Quit Rattle as described in tutorial 4.
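
To see what the ROC curves in task 19 are made of, the sketch below computes ROC points and the area under the curve by hand in base R. It is an illustration under stated assumptions, not the routine Rattle uses: the function names roc_points and auc are hypothetical, and the labels and scores are made-up values standing in for classifier outputs on a test set.

```r
# Compute ROC points from true 0/1 labels and classifier scores
# (a higher score means "more likely class 1").
roc_points <- function(labels, scores) {
  ord    <- order(scores, decreasing = TRUE)   # lower the threshold step by step
  labels <- labels[ord]
  scores <- scores[ord]
  tpr <- c(0, cumsum(labels == 1) / sum(labels == 1))  # true positive rate
  fpr <- c(0, cumsum(labels == 0) / sum(labels == 0))  # false positive rate
  # Records with equal scores cannot be separated by any threshold,
  # so keep only the point reached after each group of tied scores.
  keep <- c(TRUE, !duplicated(scores, fromLast = TRUE))
  data.frame(fpr = fpr[keep], tpr = tpr[keep])
}

# Area under the ROC curve by the trapezoidal rule
auc <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Made-up test labels and scores, for illustration only
labels  <- c(1, 1, 0, 1, 0, 0, 1, 0)
score_a <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2)  # many distinct scores
score_b <- c(0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.8, 0.2)  # only two distinct scores

roc_a <- roc_points(labels, score_a)
roc_b <- roc_points(labels, score_b)
auc_a <- auc(roc_a)
auc_b <- auc(roc_b)
print(c(auc_a = auc_a, auc_b = auc_b))
```

A classifier that outputs only a few distinct scores, as a small decision tree does (one score per leaf), yields a curve with few corner points, whereas a classifier with continuous scores yields a finer staircase. Plotting tpr against fpr for both data frames with plot(..., type = "l") and lines(...) gives a picture comparable to the one Rattle draws.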


Last modified: 30/05/2009, 08:39