COMP8400 - Lab 1 - Thursday 12 March
Introduction to Rattle and data exploration
Objectives
The objective of this first lab is to become familiar with the graphical
user interface of the open source data mining tool Rattle, and to conduct
data exploration on small example data sets.
Note that Rattle, like many other data mining tools,
contains a very large number of algorithms, techniques, settings and
options. In COMP8400, we will not use all of them, and you are not required
to be familiar with all of them. You are, however, encouraged to explore
these techniques and options, and to read about them in the Rattle
documentation or the R help pages.
Rattle is a freely available software tool that provides a
graphical user interface on top of the R statistical programming
language (for more on R and how to get it please see the
COMP8400
further material and resources Web page). Rattle thus
facilitates access to the many data mining and statistical functions
available in R.
The main developer of Rattle is
Graham Williams,
senior data miner at the Australian Taxation Office (ATO) in Canberra.
Rattle has been, and is being, used in data mining courses at the
Australian National University, the University of Canberra, and at Yale
University. It is also used for practical data mining at the ATO and
various other organisations. Graham will be giving a
guest lecture in COMP8400 in the last
week of the semester.
The Rattle software and its manual can be downloaded and accessed from:
http://rattle.togaware.com
The Rattle version installed in the labs is 2.4.0. Note
that the assignments assume this version as well.
If you do your assignments on your laptop or home/office computer using
a different version of Rattle, then make sure that you state in
your assignment submission which version you used.
Preliminaries
I suggest that you create a directory comp8400 in your
ANU student account on the lab machines, on your personal laptop or
desktop, or on a portable storage device such as a USB memory stick. Within
this comp8400 directory, create sub-directories named
lab1, lab2, lab3, lab4, as well as
assign1, assign2 and presentation in order to keep
your COMP8400 work nicely structured.
You should have a look at the
Rattle
Data Miner documentation. Note that both Rattle and its
documentation are under development, and not all functionality and
chapters are complete yet. Any feedback on errors, typos and other issues
is much appreciated (you can tell the course coordinator who will then contact
Graham Williams).
Preferably, you should browse through the Rattle Data Miner
documentation before the lab session. Also, if you plan to use your own
laptop in the labs, please try to install Rattle before the lab
(installation instructions are provided in the Rattle Data Miner
documentation).
If you want to dig deeper into the functionalities of Rattle
please have a look at the R statistical language it is based on.
Starting Rattle
- Start the Terminal (Konsole) program by clicking on the
Main Menu (left-most icon on the bottom menu panel), select
System, then Konsole.
- A new window will pop up with a cursor that allows text input. Type
the upper-case letter R followed by
`Enter'. This will start the R statistical
language. The prompt (character at the beginning of the line where
the cursor is) should have changed to `>'.
- Now type the following into the terminal window:
library(rattle) followed by `Enter'.
Some text on Rattle, including a copyright notice, will be
shown.
- Finally, to start Rattle please type:
rattle() followed by `Enter'. Make
sure that you type both the opening and the closing bracket!
- A new window will appear, which is the main Rattle graphical
user interface.
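The commands typed at the R prompt in the steps above can be sketched as a short script. This is only a sketch: the check with requireNamespace() and the variable name rattle_available are additions for illustration and are not part of the lab instructions, which simply ask you to type library(rattle) and then rattle().

```r
# Check whether the rattle package is installed before trying to load it.
# (rattle_available is a name chosen for this sketch.)
rattle_available <- requireNamespace("rattle", quietly = TRUE)

if (rattle_available && interactive()) {
  library(rattle)  # load the package; some text on Rattle is shown
  rattle()         # open the main Rattle window
} else {
  message("Rattle is not installed or R is not running interactively; ",
          "the graphical user interface was not started.")
}
```

On the lab machines the package is already installed, so the two calls inside the if block are all that is needed there.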
Quitting Rattle
- Click on the 'Quit' button in the main Rattle window,
and confirm with a click on Yes if you really want to quit.
Tasks
This first lab consists of working through the chapters
Interacting with Rattle,
Data,
Exploring Data, and
Transforming Data. The last three of these chapters correspond to
the Data, Explore, and Transform tabs in Rattle.
- Start R and Rattle as detailed above.
- Rattle includes a simple example data set which we will use
in this first lab. To load it, carry out the following two steps:
- Click on the Execute button.
- Click on Yes when asked: "Would you like to use the
sample weather dataset?"
The weather data set comes from the Bureau of Meteorology,
and it contains 355 records.
You can now see detailed information about this data set. Please read
through the Help menu (Data sub-menu) or the
Data chapter about the different roles and other data-related
settings. Make sure you understand the different roles before you
continue.
If you would like to look at other data sets, then click on the
Library button, then select one of the many available
data sets from the menu available next to Data Name:. You need
to confirm your selection with a click on the Execute button.
Before you continue make sure you have selected the weather
sample data set (and have confirmed your selection with a click
on Execute).
- Note: You can always go to the Log tab to see the
R code that has been generated when you clicked Execute.
This is the actual code that is run underneath Rattle.
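As a rough, hand-written illustration of the kind of R code involved in loading and inspecting a data set, consider the sketch below. It is not the code that Rattle writes to the Log tab. It assumes the airquality data set that ships with base R as a stand-in for the weather data, so that it runs without Rattle, and the variable names (ds, n_records, n_variables, var_classes) are chosen for this example only.

```r
# airquality ships with base R (package 'datasets'); it stands in for the
# weather data here so that the example runs without Rattle.
data(airquality)
ds <- airquality

n_records   <- nrow(ds)         # number of records (rows)
n_variables <- ncol(ds)         # number of variables (columns)
var_classes <- sapply(ds, class)  # data type of each variable

print(c(records = n_records, variables = n_variables))
print(var_classes)
print(head(ds))                 # first six records
```

Comparing such hand-written code with what appears in the Log tab is a useful way to learn how the Rattle interface maps onto R.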
- Now go to the Explore tab, which offers a large number of
possibilities to explore the loaded data set. The chapter on
Exploring Data in the Rattle online documentation
provides more detailed descriptions, as does the Help menu
(Explore item).
Explore the content of the weather data set using the various
options provided. Specifically, you should learn about the
distribution of the values in the variables (attributes), as well as
the number of missing values (use Summary / Describe /
Show Missing for this).
Some specific questions you might want to explore could include:
- Which variables are strongly correlated? Which ones are strongly
negatively correlated?
- Which variable has the most skewed distribution?
- Which variable has the largest number of missing values?
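The three questions above can also be approached directly in R. The sketch below again assumes the base R airquality data set as a stand-in for the weather data. The skewness() helper is written for this example (base R has no built-in skewness function) and uses a simple moment-based definition, which may differ slightly from the value Rattle reports.

```r
# airquality (base R, package 'datasets') stands in for the weather data.
ds  <- datasets::airquality
num <- ds[sapply(ds, is.numeric)]   # keep numeric variables only

# 1. Pairwise correlations, ignoring missing values pair by pair.
cor_matrix <- cor(num, use = "pairwise.complete.obs")
lower <- cor_matrix
lower[upper.tri(lower, diag = TRUE)] <- NA   # keep each pair only once
cor_pairs <- as.data.frame(as.table(lower), stringsAsFactors = FALSE)
names(cor_pairs) <- c("var1", "var2", "r")
cor_pairs <- cor_pairs[!is.na(cor_pairs$r), ]
cor_pairs <- cor_pairs[order(cor_pairs$r), ]
print(head(cor_pairs, 3))   # most strongly negatively correlated pairs
print(tail(cor_pairs, 3))   # most strongly positively correlated pairs

# 2. Moment-based skewness (helper defined for this sketch).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
skew <- sapply(num, skewness)
print(sort(abs(skew), decreasing = TRUE))
most_skewed <- names(which.max(abs(skew)))

# 3. Number of missing values per variable.
missing_counts <- colSums(is.na(ds))
print(sort(missing_counts, decreasing = TRUE))
most_missing <- names(which.max(missing_counts))
```

Sorting the pairs by correlation puts the strongly negative pairs at the top and the strongly positive pairs at the bottom, so both parts of the first question can be read from one table.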
Note: If some of the result plots are too cluttered, then you
might want to go back to the Data page and set the role of
certain variables to Ignore (make sure you click Execute
to confirm the new selection).
Note: If the graphics output does not work (an empty window
appears and an error message is printed in the terminal window), please
go to Settings and un-tick the Use Cairo Graphics Device
option.
Note: The options Latticist and GGobi are not
available in this installation of Rattle. Both are external
graphical data exploration tools.
- Next, go to the Transform tab to impute values for missing
data. See the chapter on
Transforming Data in the Rattle online documentation
for explanations about the many different transformation functions
available.
In order to perform a data transformation, you need to highlight
one or more variables (use Shift-click to highlight several
variables), and then click on Execute to do the actual
transformation.
As you will see, Rattle will generate a new variable and append
it at the end of the list (scroll down to the bottom of the list). The
new variable will have a name based on the
transformation performed, concatenated with the original variable name.
If you now go back to the Data page you will notice that
Rattle has set the role of the variables you just transformed to
Ignore.
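The idea behind such an imputation can be sketched in plain R as follows. This is an illustration under assumptions, not Rattle's own code: it uses the base R airquality data set as a stand-in, replaces missing values with the variable's mean, and the prefix IMPUTED_MEAN and the helper impute_mean() are hypothetical names chosen for this example (Rattle uses its own naming scheme).

```r
# airquality (base R, package 'datasets') stands in for the weather data.
ds <- datasets::airquality

# Replace missing values with the mean of the observed values
# (helper defined for this sketch).
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Build the new variable name from the transformation and the original
# variable name, then append the new variable at the end of the data set.
new_name <- paste("IMPUTED_MEAN", "Ozone", sep = "_")
ds[[new_name]] <- impute_mean(ds$Ozone)

print(names(ds))                                # new variable is listed last
print(colSums(is.na(ds[c("Ozone", new_name)])))  # original keeps its NAs
```

Keeping the original variable and adding a new one, as Rattle does, means the transformation can always be compared against, or undone in favour of, the untouched data.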
- Time permitting, select another data set from the Library on the
Data page, and explore it.
Alternatively, you can download a data set from one of the data
collections given on the COMP8400 further
material and resources Web site, for example from the
UCI Machine Learning
repository.
- At the end, quit Rattle as detailed above. Make sure you
properly log out of your account before you leave the lab.
Last modified: 3/04/2009, 13:13