Marking guide for COMP8400 assignment 1, 2009
=============================================

General comments:

Please follow exactly the questions asked in the assignment
specifications, as otherwise it is very hard to mark an assignment
accurately! Please use the section headings as provided in the
assignment specifications.


***** Part 1: 6 marks, 1 per question. *****

In the following I provide examples of what I think should have been
covered by your answers. Obviously, there is no exact right or wrong
answer when writing such a report. My marking is based on my
assessment of what has been covered by a student's answers.

1 Describe in more detail what kind of data such a national institute
  of sports would collect (data types, sources, formats, volume, etc).

A large variety of data from many different sources will be collected:

- Data about athletes (such as their physical and medical details,
  information about their nutrition, past injuries, etc.)
- Data about athletes' performance in training (like measurements of
  times, weights, heights, etc. depending upon their sport, as well as
  body measurements - pulse rate, blood glucose levels, blood
  pressure, etc). Additionally, if relevant, data about training
  venues and environmental conditions.
- Data from competitions (including TV footage, results, maybe even
  environmental factors - temperature, humidity, etc. during a
  competition).
- Data about matches for team sports (again video footage, number of
  goals, scores, who was playing when, etc.)
- Reports, newspaper and online articles about elite sports, as well
  as scientific articles from medical sports journals (PDF, HTML or
  text files).

Data types include numerical data (like measurements from sensors and
timers), textual data (reports about athletes from coaches, or medical
and nutritional reports, maybe even athletes' personal diaries where
they write down self-reflections about their performances),
time-series data (heart rates, blood pressure, temperatures, etc.),
and multimedia data (like video footage of training as well as of
competitions such as matches).

The number of athletes (at the AIS) is in the hundreds, so the volume
of data will be accordingly large. Time-series data (potentially
recorded automatically using sensors) will be large (for example heart
rates for marathon runners). Multimedia data will also be very large
in volume (gigabytes).

The sources of this data are also very varied, including hand-written
reports, automatically collected sensor data, video recordings (from
public television as well as staff), external data from competitions
(official results), and environmental data. One can also imagine staff
crawling the Web to find information about elite sports in other
countries in order to find out what they are doing.

Note: A 'relational database' is not really a data source (who/what
has put the data into this database?). Data sources can be sensors,
handwritten forms (like medical forms, or notes by athletes and
coaches), video (from training, from competitions, or from external
sources such as TV), etc.

2 How would you design a data warehouse for this national institute of
  sports? What would the main dimensions be? And what kind of data
  would be stored in this data warehouse?

Given the large variety of data being collected - much of it very
specific to a given sports discipline - it would not be good advice to
design a data warehouse with a single data cube. Several data cubes
should be used instead (which might have one dimension in common -
namely details about athletes). A fact constellation schema would be
the most appropriate design approach.
The 'athletes' dimension would include all their detailed information
(such as personal, physical, medical, nutritional details). Based on
the different sports disciplines, it might be possible to share
certain dimensions between, for example, team sports (soccer,
football, basketball, volleyball, etc.), athletics, swimming, etc. The
main differences will be between individual and team sports.

3 What kind of data pre-processing and data integration issues will
  you have to consider when designing such a data warehouse? What kind
  of external data do you think would be of interest and should be
  integrated into such a data warehouse?

Given that data is being collected from different sources on a
continuous basis, many data pre-processing tasks will need to be
applied:

- Free-format textual information (medical and nutritional reports)
  has to be scanned, cleaned, and standardised.
- Data from sensors might include outliers (wrong measurements) and
  missing data values (e.g. malfunctioning of a sensor), or it can be
  recorded in different units.
- Integrating data from different sources can be complicated, for
  example if data is collected at different points in time (e.g. time
  series of heart rate and blood glucose level recorded at different
  intervals).
- External data might be recorded in different units and need to be
  converted (for example, temperature data recorded during an event in
  the USA will be in degrees Fahrenheit).
- External data (already mentioned above) will include data from
  competitions, TV, newspapers and magazines, Web sites, and other
  sports organisations (for example, a swimming organisation might
  provide a database with results from junior competitions).

4 What kind of questions do you think the sports researchers and
  athletes of this national institute of sports would like to have
  addressed by data mining? What could the outcomes of such data
  mining projects be?

Examples include:

- What types of athletes (physical build, endurance, etc.)
  are more likely to be successful in certain types of sports?
- What are the characteristics of young athletes that make them more
  likely to become successful elite athletes in certain disciplines?
- How can training facilities be utilised more efficiently?
- How do environmental effects influence performance - and can we use
  environmental effects to increase performance (like high-altitude
  camps)?
- What training procedures can increase success in competitions?
- What strategies in team matches (like when to substitute players)
  result in wins?
- What is the best coaching strategy?

Note: Questions like "How many athletes have won gold medals last
year?" are not really data mining questions, but can be answered by
traditional database queries (SQL).

Outcomes would be:

- Hopefully more gold medals at the next Olympics (and all other
  similar competitions), and more wins for our teams.
- Improved training efficiency and improved training facilities.
- Reduced costs for training and facilities (without compromising
  success).
- Improved recruitment and discovery of young talents (or potential
  next generation elite sports women and men).

5 What challenges will you most likely face as a data miner at this
  national institute of sports?

Data size and complexity (especially different data types - free text,
video, etc.) is certainly a major technical challenge, i.e. how can
information from many different sources be integrated and mined? Data
quality, especially with regard to missing values and outliers (in
time-series data), is another major challenge. Data ownership (by a
certain discipline) might make it hard to conduct a project (e.g. a
coach might be reluctant to provide information about the way he is
coaching, as this is his secret of success). Further challenges
include: How to convince people (and athletes) that data mining can be
useful and improve success? And also: Can the results of a data mining
project (possibly unexpected or contradictory) be trusted?
And how can results of data mining projects be implemented, for
example to improve training procedures? (These are some of the
non-technical challenges.)

6 Are there any legal/regulatory requirements that might limit the use
  of data mining at this national institute of sports?

A lot of the data about athletes (like their medical details) is
highly confidential and needs to be protected (for example by strictly
limiting access). Results of data mining projects can also be highly
confidential, as they might give a country a crucial advantage over
other nations; these results must therefore be strictly protected.
Staff must be highly trustworthy - imagine somebody selling all data
or results/outcomes to another country (sports espionage).

Data mining results might also discriminate against certain types of
athletes. For example, if a data mining result comes up with a rule
like "no runner shorter than 1.60 metres has ever won an Olympic gold
medal", does this mean nobody with a height below 1.60 metres will be
admitted to the national institute of sports?

----------------------------------------------------------------------

***** Part 2: 2 marks. *****

0.5 mark for each correctly calculated distance (0.25 deduction each
if, for example, the numerical values were wrong but the formula was
correct, or the other way round).

----------------------------------------------------------------------

***** Part 3: 7 marks. *****

Option A: Conduct a clustering project using Rattle.

One mark per question.

The report should contain the following information:

1. This should include: (a) the name of the data set, (b) where you
   got it from (the name of the data repository and the URL to a page
   where the data set is described or from where it is available), and
   (c) a description of the data set that must include the number of
   records and attributes (variables) it contains, and details about
   these attributes (their names and types, for example numerical,
   categorical, ordinal, etc.)

   - Parts (a) and (b) were easy, I think everybody got them right.
   - For part (c), a table listing the attributes with their names and
     types (numerical, categorical - ordinal, nominal, etc) would have
     been the best way to answer. Additionally, a sentence or two
     describing in your own words what this data set is all about
     would have been nice.
   - Some students lost a part of this mark because they did not give
     the number of attributes or the number of records in their data
     set, or they did not describe all attributes, or did not describe
     them completely (e.g. not their type), or described them wrongly.

2. A description of the data exploration steps you have done using the
   data mining tool, and what you found out about the data quality of
   this data set, for example number of missing values, out of range
   values, distribution of values (mean values, minimum and maximum,
   histograms, etc.).

   - Ideally a table with minimum, maximum and mean values for
     numerical attributes; and number of categories for categorical
     attributes.
   - Also in this table the number of missing values in the attributes
     should have been given.
   - Some graphs (histograms, etc.) for some attributes would have
     been good.
   - Additional: Correlation plots, and any observation of unusual
     distributions (skewness, kurtosis) in the data should have been
     reported.
   - Please don't write things like: "Then I clicked on the `Explore'
     tab, then on `Describe', and then on `Execute'..." A report
     should concentrate mainly on the results of your data
     exploration, describing steps and results together, for example:
     "An exploration of basic statistics, as summarised in Table 1,
     showed that attribute X had an unexpected distribution of
     values..."

3. A description of the data cleaning and transformation steps you
   might have done (or not - in which case describe why no
   transformation was needed).

   - Important: most students did NOT normalise the numerical
     attributes used for clustering - they lost 0.5 mark for this!
     Un-normalised attributes, for example income (0..1,000,000) and
     age (1..100), will result in distorted clustering results,
     because the attribute with the larger range dominates the
     distance calculations.

4. A description of attribute selection and/or feature construction
   you have done, and the reasoning behind this.

   - Some students did not describe WHY they selected certain
     attributes for clustering, or they selected some attributes
     because they THOUGHT the selected attributes would describe what
     they EXPECTED in the data set - by doing this you limit the
     analysis of a data set substantially.

5. A description of the clustering approach taken, which has to
   include (a) the clustering technique (or techniques) used, (b) the
   reason why you chose this technique(s), and (c) the description of
   the parameter values you have chosen (for example, if you use
   k-means provide the number of clusters k and distance measure you
   have chosen, and why).

   - I expected that both k-means and hierarchical clustering had been
     used (if one had not been used, an explanation why not was
     required).
   - Many students only used k-means and did not write why.
   - Some students also only chose k=X (a certain number), without
     explaining why. Or they set k=X because they BELIEVED there were
     X clusters in the data. This is not the aim - clustering should
     find groupings in the data that are not obvious.
   - Some students did not describe at all the parameters they used in
     their clustering.

6. A description of the clusters you have found in the data set, and
   if these clusters are sensible, i.e. correspond to what you would
   have expected in such a data set.

   - The results should have included numerical cluster centroids and
     ideally also a data plot.
   - A description of the clusters found should have been included,
     and not simply the resulting data plot and/or cluster centroids.

7. Finally, a general description of your clustering project,
   including problems and difficulties encountered, things you have
   learned, and steps you would possibly do differently the next time.

   - Here students lost marks if they didn't really describe what
     problems they encountered, what they have learned, or what they
     would do differently next time. I also expected students to
     explain the difficulties they encountered, and what they learned,
     in some detail (answers such as "it was difficult to use Rattle"
     or "I learned more about clustering" were not really good
     answers).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Option B: Program a simple clustering algorithm (like k-means or PAM)
          in a programming language of your choice (preferably Python,
          C, C++ or Java), and test and evaluate it on a publicly
          available data set.

Two (2) marks for part 1 (program listing), and one (1) mark for each
of the other parts.

The report should contain the following information:

1. The program listing, which should be nicely formatted and contain
   enough comments to understand what the program is doing.

   - Here I looked for:
     - Good and meaningful comments that explained all parts of the
       program (not just a comment at the beginning of a function)
     - Proper structuring of the program listing, such as empty lines
       to separate program blocks, indentation for loops, no
       wrap-around of lines, etc.
     - Proper structure of the program: a main function/method that
       called routines for data input, main clustering algorithm, and
       output of results; and a main clustering algorithm again made
       of several sub-routines
     - Proper naming of functions, methods and variables.
     - No global variables, proper passing of parameters to functions.
     - The program should not be hardcoded for certain attributes or
       attribute numbers - a more general program will be more elegant
       in most cases.
     - Efficiency of the program (e.g. no allocation/deallocation of
       objects when they change cluster membership).
   - I also expected that the main core routines were listed within
     the given 5 page limit (this included e.g. distance calculation).

2. A description of how the program works (i.e.
   what the functions or blocks in your program are doing).

   - A description of the main parts of the program was required here.
     I expected a proper textual description, not just a list of the
     functions and their parameters.

3. A description of how you tested your program.

   - Here I wanted to see some description of how the CORRECTNESS was
     tested, for example unit testing, small data sets with known
     record values/clusters, etc.
   - Most students did not describe how they tested the correctness of
     their program!! For example, use unit testing, and/or small
     example data sets where the outcomes of a clustering can be
     verified. An example data set could be:

       0.0, 0.0, 0.0, 0.0
       0.1, 0.1, 0.1, 0.1
       0.2, 0.2, 0.2, 0.2
       1.0, 1.0, 1.0, 1.0
       0.9, 0.9, 0.9, 0.9
       0.8, 0.8, 0.8, 0.8

     With k=2, your program should generate two clusters and put three
     records into each cluster.

4. The name of the data set and where you got it from (including the
   name of the data repository and the URL to a page where the data
   set is described or from where it is available).

   - Easy - I think everybody got a mark here. It would have been nice
     to also have a description of the attributes (type, distribution,
     etc.), as well as the size of the data set (number of records and
     number of attributes).

5. The output produced by your program on the data set you used.

   - The program should at least have printed the number of clusters,
     the number of iterations it took to converge, and the cluster
     centroids.

6. A description of your observations when running your program on
   this data set, for example the number of iterations and total time
   it took, amount of memory used, etc. Please also describe on what
   computer (operating system, amount of memory, CPU type and speed,
   etc.) you developed and ran your program.

   - Most students provided this information, but some did not specify
     whether the time reported was for the total run or for just one
     iteration.
     Most also did not describe their observations when running their
     program using different numbers of clusters (i.e. scalability).

----------------------------------------------------------------------
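The correctness check suggested under Option B, point 3, can be sketched in a few lines of Python. This is a minimal illustration and not a model solution: the function names (`min_max_normalise`, `euclidean`, `assign`, `update`, `k_means`) are invented for this example, Euclidean distance is assumed, and the initial centroids are fixed to the first and fourth records so that the run is repeatable (k-means implementations usually choose them at random). The normalisation helper illustrates the point made under Option A, point 3, about rescaling attributes such as income and age before clustering.

```python
import math


def min_max_normalise(records):
    """Rescale every attribute (column) to the range [0, 1]."""
    columns = list(zip(*records))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(record, lows, highs)
        ]
        for record in records
    ]


def euclidean(a, b):
    """Euclidean distance between two records of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(records, centroids):
    """Return, for each record, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: euclidean(record, centroids[c]))
        for record in records
    ]


def update(records, labels, centroids):
    """Recompute each centroid as the mean of the records assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [r for r, label in zip(records, labels) if label == c]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(list(old))  # an empty cluster keeps its centroid
    return new_centroids


def k_means(records, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        new_labels = assign(records, centroids)
        if new_labels == labels:
            break  # converged: no record changed cluster
        labels = new_labels
        centroids = update(records, labels, centroids)
    return labels, centroids, iteration


# The small test data set from the marking guide (six records, four attributes).
records = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.2, 0.2],
    [1.0, 1.0, 1.0, 1.0],
    [0.9, 0.9, 0.9, 0.9],
    [0.8, 0.8, 0.8, 0.8],
]

# Assumption: fixed initial centroids (first and fourth record), k = 2.
labels, centroids, iterations = k_means(records, [records[0], records[3]])
sizes = [labels.count(c) for c in range(2)]
print("cluster sizes:", sizes)      # cluster sizes: [3, 3]
print("iterations:", iterations)    # iterations: 2
print("centroids:", centroids)

# Normalisation example: age and income end up on the same [0, 1] scale.
print(min_max_normalise([[20, 30000], [60, 90000], [40, 60000]]))
```

Because the expected grouping is known in advance (three records near 0 and three near 1), the cluster sizes and centroids printed by the sketch can be checked by hand, which is the kind of verification the marking guide asks for. The same check could be wrapped in a unit test and repeated with other initial centroids.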