The Australian National University
ANU College of Engineering and Computer Science
Department of Computer Science

COMP8400 - Lab 1 - Thursday 12 March

Introduction to Rattle and data exploration

Objectives

The objective of this first lab is to become familiar with the graphical user interface of the open source data mining tool Rattle, and to conduct data exploration on small example data sets.

Note that Rattle, like many other data mining tools, contains a very large number of algorithms, techniques, settings and options. In COMP8400 we will not use all of them, and you are not required to be familiar with all of them. You are, however, encouraged to explore these techniques and options, and to read about them in the Rattle documentation or the R help pages.

Rattle is a freely available software tool that provides a graphical user interface on top of the R statistical programming language (for more on R and how to get it, please see the COMP8400 further material and resources Web page). Rattle thus facilitates access to the many data mining and statistical functions of R.

The main developer of Rattle is Graham Williams, senior data miner at the Australian Taxation Office (ATO) in Canberra. Rattle has been, and is being, used in data mining courses at the Australian National University, the University of Canberra, and Yale University. It is also used for practical data mining at the ATO and various other organisations. Graham will be giving a guest lecture in COMP8400 in the last week of the semester.

The Rattle software and its manual can be downloaded and accessed from:

http://rattle.togaware.com

The Rattle version installed in the labs is 2.4.0. This is also the version assumed for the assignments. If you do your assignments on your laptop or home/office computer using a different version of Rattle, make sure that you state in your assignment submission which version you used.


Preliminaries

I suggest that you create a directory comp8400 in your ANU student account on the lab machines, on your personal laptop or desktop, or on a portable storage device such as a USB memory stick. Within this comp8400 directory, create sub-directories named lab1, lab2, lab3, lab4, as well as assign1, assign2 and presentation, in order to have your COMP8400 work nicely structured.

You should have a look at the Rattle Data Miner documentation. Note that both Rattle and its documentation are under development, and not all functionality and chapters are complete yet. Any feedback on errors, typos and other issues is much appreciated (you can tell the course coordinator, who will then contact Graham Williams).

Preferably, browse through the Rattle Data Miner documentation before the lab session. Also, if you plan to use your own laptop in the labs, please try to install Rattle before the lab (install instructions are provided in the Rattle Data Miner documentation).

If you want to dig deeper into the functionality of Rattle, please have a look at the R statistical language on which it is based.


Starting Rattle

  1. Start the Terminal (Konsole) program by clicking on the Main Menu (left-most icon on the bottom menu panel), then selecting System, then Konsole.

  2. A new window will pop up with a cursor that allows text input. Type the upper-case letter R followed by `Enter'. This will start the R statistical language. The prompt (the character at the beginning of the line where the cursor is) should change to `>'.

  3. Now type the following into the terminal window: library(rattle) followed by `Enter'. Some text on Rattle, including a copyright notice, will be shown.

  4. Finally, to start Rattle please type: rattle() followed by `Enter'. Make sure that you type both the opening and the closing bracket!

  5. A new window will appear, which is the main Rattle graphical user interface.
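
Steps 2 to 4 can also be collected into a few lines of R. The following is only a minimal sketch and not part of the official lab instructions: the helper name start_rattle is hypothetical, and the requireNamespace() check is added purely so that the script reports a message instead of failing when the rattle package is not installed.

```r
# Hypothetical helper: attach the rattle package and open its GUI.
start_rattle <- function() {
  # requireNamespace() returns FALSE (instead of raising an error)
  # when the package is not installed.
  if (!requireNamespace("rattle", quietly = TRUE)) {
    message("The rattle package is not installed; see http://rattle.togaware.com")
    return(invisible(FALSE))
  }
  library(rattle)  # step 3: attach the package
  rattle()         # step 4: open the main Rattle window
  invisible(TRUE)
}

# Only open the GUI when R is used interactively, as in the lab.
if (interactive()) start_rattle()
```

Typing the two commands library(rattle) and rattle() by hand, as described in the steps above, has exactly the same effect on the lab machines.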

Quitting Rattle

  1. Click on the 'Quit' button in the main Rattle window, and confirm with a click on Yes if you really want to quit.


Tasks

This first lab consists of working through the chapters Interacting with Rattle, Data, Exploring Data, and Transforming Data of the Rattle documentation. The last three of these chapters correspond to the Data, Explore, and Transform tabs in Rattle.

  1. Start R and Rattle as detailed above.

  2. Rattle includes a simple example data set which we will use in this first lab. To load it, carry out the following two steps:
    1. Click on the Execute button.
    2. Click on Yes when asked: "Would you like to use the sample weather dataset?"

    The weather data set comes from the Bureau of Meteorology, and it contains 355 records.

    You can now see detailed information about this data set. Please read through the Help menu (Data sub-menu) or the Data chapter about the different roles and other data related settings. Make sure you understand the different roles before you continue.

    If you would like to look at other data sets, then click on the Library button, then select one of the many available data sets from the menu available next to Data Name:. You need to confirm your selection with a click on the Execute button.

    Before you continue make sure you have selected the weather sample data set (and have confirmed your selection with a click on Execute).

  3. Note: You can always go to the Log tab to see the R code that has been generated when you clicked Execute. This is the actual code that is run underneath Rattle.
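
The code shown in the Log tab is ordinary R that could also be typed at the R prompt. As a rough illustration of what loading and inspecting a data set looks like in plain R, consider the sketch below. It is not the code Rattle generates. It also assumes the airquality data set from R's built-in datasets package as a stand-in for the weather data set, simply because airquality is available in every R installation.

```r
# List the data sets shipped with the base 'datasets' package. The Library
# button in Rattle offers a comparable list of data sets to choose from.
available <- data(package = "datasets")$results[, "Item"]
print(head(available))

# Load one data set by name. 'airquality' is only a stand-in for the
# weather data set used in this lab (assumption made for this example).
data(list = "airquality", package = "datasets")

print(dim(airquality))   # number of records and number of variables
str(airquality)          # type and first few values of each variable

# Data type of every variable, as a named character vector.
var_types <- sapply(airquality, class)
print(var_types)
```

Comparing such hand-written code with the entries in the Log tab is a good way to learn which R functions Rattle calls on your behalf.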

  4. Now go to the Explore tab, which offers a large number of possibilities to explore the loaded data set. The chapter on Exploring Data in the Rattle online documentation provides more detailed descriptions, as does the Help menu (Explore item).

    Explore the content of the weather data set using the various options provided. Specifically, you should learn about the distribution of the values in the variables (attributes), as well as the number of missing values (use Summary / Describe / Show Missing for this).

    Some specific questions you might want to explore could include:

    • Which variables are strongly correlated? Which ones are strongly negatively correlated?
    • Which variable has the most skewed distribution?
    • Which variable has the largest number of missing values?

    Note: If some of the result plots are too cluttered, then you might want to go back to the Data page and set the role of certain variables to Ignore (make sure you click Execute to confirm the new selection).

    Note: If the graphics output does not work (an empty window appears and an error message is printed in the terminal window), please go to Settings and un-tick the Use Cairo Graphics Device option.

    Note: The options Latticist and GGobi are not available in this installation of Rattle. Both are external graphical data exploration tools.
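
The three questions listed in step 4 can also be approached directly in R. The sketch below uses base R functions only and is not what the Explore tab runs internally. It again assumes the built-in airquality data set as a stand-in for the weather data. The helpers skewness and pair_name are hypothetical names introduced for this example, and the skewness measure is the simple moment-based one.

```r
# Stand-in data: 'airquality' ships with R and contains missing values.
df <- datasets::airquality

# Number of missing values per variable, largest first.
missing_counts <- sort(colSums(is.na(df)), decreasing = TRUE)
print(missing_counts)

# Pairwise correlations, using for each pair only the records in which
# both variables are present.
corr <- cor(df, use = "pairwise.complete.obs")
print(round(corr, 2))

# Keep only the upper triangle so that each pair is considered once and
# the diagonal (correlation of a variable with itself) is excluded.
corr_pairs <- corr
corr_pairs[!upper.tri(corr_pairs)] <- NA
strongest_pos <- which(corr_pairs == max(corr_pairs, na.rm = TRUE), arr.ind = TRUE)
strongest_neg <- which(corr_pairs == min(corr_pairs, na.rm = TRUE), arr.ind = TRUE)

# Hypothetical helper: turn a row/column index into "variable / variable".
pair_name <- function(idx) {
  paste(rownames(corr)[idx[1, "row"]], colnames(corr)[idx[1, "col"]], sep = " / ")
}
cat("Most positively correlated pair:", pair_name(strongest_pos), "\n")
cat("Most negatively correlated pair:", pair_name(strongest_neg), "\n")

# Hypothetical helper: moment-based skewness, ignoring missing values.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
skew <- sapply(df, skewness)
most_skewed <- names(which.max(abs(skew)))
cat("Most skewed variable:", most_skewed, "\n")
```

Values of skewness near zero indicate a roughly symmetric distribution, while large positive or negative values indicate a long tail to the right or left, respectively.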

  5. Next, go to the Transform tab to impute values for missing data. See the chapter on Transforming Data in the Rattle online documentation for explanations about the many different transformation functions available.

    In order to perform a data transformation, you need to highlight one or more variables (use Shift-click to highlight several variables), and then click on Execute to do the actual transformation.

    As you will see, Rattle generates a new variable and appends it at the end of the list (scroll down to the bottom of the list). The name of the new variable is based on the transformation performed, concatenated with the original variable name.

    If you now go back to the Data page you will notice that Rattle has set the role of the variables you just transformed to Ignore.
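
The idea behind this kind of transformation, keeping the original variable and appending an imputed copy under a new name, can be sketched in a few lines of R. This is an illustration only, not Rattle's implementation: the helper impute_mean and the prefix "IMEAN_" are made up for this example, and the built-in airquality data set is assumed as a stand-in for the weather data.

```r
# Stand-in data: 'airquality' ships with R and contains missing values.
df <- datasets::airquality

# Hypothetical helper: replace the missing values of one variable by the
# mean of its observed values, and store the result in a NEW column.
# The original column is left untouched. The prefix is made up here.
impute_mean <- function(data, variable, prefix = "IMEAN_") {
  values <- data[[variable]]
  values[is.na(values)] <- mean(values, na.rm = TRUE)
  data[[paste0(prefix, variable)]] <- values
  data
}

df <- impute_mean(df, "Ozone")

print(names(df))                   # the new variable is appended at the end
print(sum(is.na(df$Ozone)))        # missing values remaining in the original
print(sum(is.na(df$IMEAN_Ozone)))  # missing values remaining in the imputed copy
```

In plain R you would then leave the original column out of any later analysis yourself, which is what Rattle does for you by setting its role to Ignore.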

  6. Time permitting, select another data set from the Library on the Data page, and explore it.

    Alternatively, you can download a data set from one of the data collections given on the COMP8400 further material and resources Web site, for example from the UCI Machine Learning repository.

  7. At the end, quit Rattle as detailed above. Make sure you properly log out of your account before you leave the lab.


Last modified: 3/04/2009, 13:13