COMP3420 - Tutorial 6 - Week 13 (1-5 June)
Classification in Rattle (supervised learning)
Draft - Expect changes and clarifications
Please e-mail
Peter Christen if you see mistakes or parts are unclear to you.
Objectives
The objectives of this third Rattle tutorial are to experiment with
the decision tree and SVM classification algorithms available in
Rattle, in order to better understand the issues involved with
these data mining techniques, and to become more familiar with
the Rattle tool.
Preliminaries
If you haven't done so yet, I suggest you create a tutorial6
directory (or folder) within your comp3420 directory.
For this lab, we will mainly use the audit.csv data set which
you have used previously in tutorial 5.
If you want to use another data set to conduct more experiments at the
end of the tutorial please do so.
The binary (2-class) supervised decision tree classifier in Rattle
is based on the R package
rpart (Recursive partitioning and regression trees). You can get
help on this package by typing the following three commands into the
R console (the terminal window where you started Rattle):
- library(rpart)
- help(rpart)
- help(rpart.control)
The support vector machine classifier in Rattle is based on the
R package
kernlab (Kernel Methods Lab), and specifically on the ksvm
class from this package. You can get help on this class by typing the
following two commands into the R console (the terminal window where
you started R and Rattle):
- library(kernlab)
- help(ksvm)
Before the lab, you should read through the sections
Building
Classification Models and
Evaluation and Deployment available in the Rattle Data Miner
documentation.
Tasks
- Start Rattle as described in the
tutorial 4 sheet.
- Load the CSV data set
audit.csv (make sure you have
CSV File selected in the Data tab, and the Header
box is ticked).
- Click Execute to load the data into Rattle.
- Now make sure the variable (attribute) Adjusted is selected as
Target variable, and that you sample the data (e.g. leave the
70 percentage sampling value as it is). Also make sure that the
variable ID is set to role Ident(ifier).
You can select or set to Ignore other variables if you feel they
are not suitable for decision tree classification (after having
built a decision tree you might later want to come back to the
Data tab and change your variable selection).
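The 70 percent sampling step above can be imitated in the R console. The following is only a sketch of the idea, not Rattle's own code: the `toy` data frame is a made-up stand-in for audit.csv, and the seed value is an arbitrary choice.

```r
# Made-up stand-in for the audit data: 100 records with an ID and a 0/1 target.
set.seed(42)  # arbitrary seed, so that the split can be reproduced
toy <- data.frame(ID = 1:100, Adjusted = rbinom(100, 1, 0.25))

# Draw 70% of the row numbers at random for training; the rest is for testing.
n <- nrow(toy)
train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

cat("training records:", nrow(train), " testing records:", nrow(test), "\n")
```

Because the testing records are never shown to the classifier while it is built, the accuracy measured on them is a fairer estimate than the accuracy on the training records.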
- Next you might want to explore the data set in order to again become
familiar with it. Specifically, you should examine the values of the
target variable Adjusted. You might also want to
have a look at the actual data (which you can do on the Data
tab by clicking on View Data).
- Now go to the Model tab and make sure the Tree type radio
button is selected. As you can see, there are various parameters that
can be set and modified. Please read the Rattle documentation on
decision trees for more information. You can also get additional help
for these parameters from R by typing into the R console:
help(rpart.control).
- At the beginning, you should lower the value for Complexity as far
as possible (simply type 0 into this input field; it will then
automatically be set to the smallest possible value).
- To generate a decision tree, click on Execute and inspect what
is printed into the main Rattle output area. Next, click on
Draw and a window with a decision tree will be shown.
- Compare the decision tree drawing with the Summary of the rpart
model in the main Rattle output area. Each leaf node
in the drawing has a coloured number (which corresponds to the leaf
node number), a 0 or 1 (which is the class label from the
audit data set according to the target variable
Adjusted), and a percentage number (which corresponds to
the accuracy of the classified training pairs in this leaf node).
- Now go to the Evaluate tab and examine the different options to
evaluate the accuracy of the decision tree you just generated. Make
sure the Testing data radio button and the Error
matrix radio buttons are selected, and then click on
Execute. You should check the Confusion matrix
(and write down the four numbers for each tree you generate).
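Once you have written down the four numbers, the usual summary measures follow from simple arithmetic. The sketch below uses made-up counts (not results from the audit data) and assumes rows are actual classes and columns are predicted classes; check the labels Rattle prints before copying your own numbers in.

```r
# Made-up counts, for illustration only.
tn <- 280  # actual 0, predicted 0
fp <- 30   # actual 0, predicted 1
fn <- 45   # actual 1, predicted 0
tp <- 45   # actual 1, predicted 1

# matrix() fills column by column: first the "predicted 0" column, then "predicted 1".
cm <- matrix(c(tn, fn, fp, tp), nrow = 2,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))
print(cm)

accuracy  <- (tp + tn) / sum(cm)   # share of all records classified correctly
precision <- tp / (tp + fp)        # share of predicted 1s that really are 1
recall    <- tp / (tp + fn)        # share of actual 1s that were found
fpr       <- fp / (fp + tn)        # share of actual 0s wrongly flagged as 1

cat("accuracy:", accuracy, " precision:", precision,
    " recall:", recall, " false positive rate:", fpr, "\n")
```

With unbalanced classes, accuracy alone can look good even when few of the class 1 records are found, so it is worth comparing recall and precision between trees as well.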
- Now experiment with different values for the parameters Complexity,
Min Bucket, Min Split and Max Depth. Which tree gives you the best
accuracy, and which the worst? Which tree is the easiest to interpret,
and which the hardest?
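The same kind of experiment can be run directly in the R console. The sketch below assumes that Rattle's Complexity, Min Split, Min Bucket and Max Depth fields correspond to the `cp`, `minsplit`, `minbucket` and `maxdepth` arguments of `rpart.control`. It uses the small `kyphosis` data set that ships with rpart as a stand-in for audit.csv, and `fit_tree` is a helper name made up for this example.

```r
library(rpart)  # also provides the kyphosis example data

# Helper (made up for this sketch): fit a classification tree with the given
# control parameters; the defaults mirror those of rpart.control.
fit_tree <- function(cp, minsplit = 20, minbucket = 7, maxdepth = 30) {
  rpart(Kyphosis ~ Age + Number + Start,
        data = kyphosis,
        method = "class",
        control = rpart.control(cp = cp, minsplit = minsplit,
                                minbucket = minbucket, maxdepth = maxdepth))
}

# A lower complexity parameter allows more splits, hence a larger tree.
for (cp in c(0.1, 0.01, 0)) {
  fit <- fit_tree(cp)
  n_leaves <- sum(fit$frame$var == "<leaf>")
  cat("cp =", cp, "-> leaf nodes:", n_leaves, "\n")
}
```

Calling `print(fit)` on any of these trees shows the same kind of textual summary that Rattle displays in its output area.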
- Also click on Rules to see the rules generated from a given
tree. What is easier to understand, the tree diagram or the rules?
- Next, select the different graphical measures available, and for each
click on Execute. You can read more on these evaluation measures
in the Rattle documentation section on
Evaluation and Deployment.
- Now let's look at the Support Vector Machine (SVM) classifier in
Rattle.
Go to the Model tab and make sure the SVM type radio
button is selected. As you can see, there are two main parameters you
can modify, one is the Kernel function (the mathematical function
that is at the core of the SVM), and the other parameter is the
Class Weights. Please read the Rattle documentation
section on
support vector machines for more information.
Note that the current Rattle version contains a bug that results
in an error/crash when the class weights are specified (Graham Williams,
the developer of Rattle, is working on this). Therefore, we will
not use the Class Weights parameter in this tutorial.
- To build an SVM, click on Execute and inspect what is printed into
the main Rattle output area. How many support vectors are
required (out of how many training records)?
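The number of support vectors can also be read off a model built in the R console. This is a minimal sketch, assuming the kernlab package is installed; a two-class subset of the built-in `iris` data stands in for audit.csv, and the kernel choice is arbitrary.

```r
library(kernlab)

# Two-class stand-in data: drop one of the three iris species.
two_class <- droplevels(subset(iris, Species != "setosa"))

# Fit a classification SVM with a radial basis function kernel.
svm_model <- ksvm(Species ~ ., data = two_class, kernel = "rbfdot")

# nSV() returns the number of support vectors of the fitted model.
cat("support vectors:", nSV(svm_model),
    "out of", nrow(two_class), "training records\n")
```

A model that needs a large share of the training records as support vectors usually has a complicated decision boundary.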
- Go to the Evaluate tab and examine the confusion matrix results
you get with this SVM (make sure the Testing button is activated
and not the Training one). Write down the four numbers so you
can compare them with the results from other SVMs you will construct
later on in this tutorial.
Is the accuracy of your SVM classifier better than that of the best
decision tree you built earlier in this tutorial?
- Now experiment with the Kernel function, and for each SVM you
build examine the resulting confusion matrix. Which one gives you the
best results? Also check the Training error printed on the
Model page. Is there a correlation between training and testing
error?
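A comparison of kernels along these lines can be scripted in the R console. The sketch below is not what Rattle runs internally; it assumes kernlab is installed, uses a two-class subset of `iris` as stand-in data with an arbitrary seed, and tries four of the kernels that `ksvm` accepts.

```r
library(kernlab)

# Stand-in data with a 70/30 split into training and testing records.
two_class <- droplevels(subset(iris, Species != "setosa"))
set.seed(42)  # arbitrary seed for a reproducible split
train_idx <- sample(nrow(two_class), size = 70)
train <- two_class[train_idx, ]
test  <- two_class[-train_idx, ]

for (k in c("rbfdot", "polydot", "vanilladot", "laplacedot")) {
  model <- ksvm(Species ~ ., data = train, kernel = k)
  # error() gives the training error; the testing error is computed by hand.
  test_error <- mean(predict(model, test) != test$Species)
  cat(sprintf("%-11s training error %.3f  testing error %.3f\n",
              k, error(model), test_error))
}
```

A kernel whose training error is much lower than its testing error is probably overfitting the training records.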
- Next select the Tree classifier (as already done earlier)
and re-create the best decision tree classifier you got. Once you have
done this, go to the Evaluate tab and you will see that you
can now also tick the Tree model box.
Make sure both the SVM and Tree boxes are ticked, select
Confusion and click on Execute. This should give you two
confusion matrices each (two for the decision tree and two for the SVM).
Which one is the better classifier?
- Next select ROC on the Evaluate tab. Once you've clicked Execute you
should see a graph pop up which contains two curves, one for the
decision tree and one for the SVM. Compare these curves: again, which is
the better classifier, and how do they differ?
For more information about ROC graphs please have a look at the
ROC Graphs: Notes and Practical Considerations for Researchers.
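To see where the points on an ROC curve come from, the following base R sketch builds one by hand. The scores and labels are made up, `roc_points` and `auc` are helper names invented for this example, and tied scores are ignored for simplicity; Rattle draws its curves with its own plotting code.

```r
# Made-up classifier scores (higher = more likely class 1) and true labels.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

# Lower the decision threshold one record at a time and track the
# true positive rate and false positive rate after each step.
roc_points <- function(scores, labels) {
  ord <- order(scores, decreasing = TRUE)
  labels <- labels[ord]
  tpr <- c(0, cumsum(labels == 1) / sum(labels == 1))
  fpr <- c(0, cumsum(labels == 0) / sum(labels == 0))
  data.frame(fpr = fpr, tpr = tpr)
}

# Area under the curve by the trapezoidal rule.
auc <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

roc <- roc_points(scores, labels)
print(roc)
cat("AUC:", auc(roc), "\n")
```

A curve that stays closer to the top left corner, and hence has a larger area under it, indicates the better ranking of records.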
- Finally, let's look at the
risk
charts implemented in Rattle (please read the documentation
provided at this previous link). Go back to the Data tab, and
select the Adjustment variable (attribute) as Risk
variable (make sure you click on Execute before you go back to the
Model tab).
- Again build your 2-class classifiers (decision tree and SVM), and then go
to the Evaluate tab and select Risk (make sure both
Tree and SVM are ticked). Once you click on Execute
you will see two risk charts pop up. Analyse them to see which of the
two classifiers is better.
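The quantities behind a risk chart can be sketched in a few lines of base R. This assumes a risk chart plots the share of class 1 cases found and the share of total risk captured against the caseload (the share of records examined, highest score first). All values below are made up and the variable names are chosen for this example only.

```r
# Made-up data: model score, 0/1 target, and the risk (adjustment amount).
score      <- c(0.95, 0.80, 0.70, 0.50, 0.30, 0.10)
adjusted   <- c(1,    1,    0,    1,    0,    0)
adjustment <- c(5000, 1200, 0,    300,  0,    0)

# Examine the records in order of decreasing score.
ord <- order(score, decreasing = TRUE)
caseload      <- seq_along(ord) / length(ord)               # share of records examined
recall        <- cumsum(adjusted[ord]) / sum(adjusted)      # share of class 1 cases found
risk_captured <- cumsum(adjustment[ord]) / sum(adjustment)  # share of total risk found

print(data.frame(caseload, recall, risk_captured))
```

The better classifier is the one whose curves rise faster, that is, the one that captures most of the risk after examining only a small share of the records.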
- If you have time, you might want to use different data sets, e.g. from the
UCI Machine Learning
repository, and explore how you can build SVMs and decision trees on
them.
- Quit Rattle as described in
tutorial 4.
Last modified: 30/05/2009, 08:39