Marking guide for COMP8400 assignment 2, 2009
=============================================

General comments:

Please answer exactly the questions asked in the assignment
specification; otherwise it is very hard to mark accurately!

***** Part 1: 6 marks *****

1 mark: Correct data set generated.

1 mark: Correct and complete length 2 frequent item-sets.

1 mark: Correct and complete length 3 frequent item-sets.

1 mark: Correct rules generated and their correct support values.

1 mark: Correct confidence values for these rules.

1 mark: Correct lift values for these rules.

-0.5 mark for small errors like missing item-sets, wrong counts, etc.

----------------------------------------------------------------------

***** Part 2: 3 marks *****

1 mark: Correct confusion matrix, including 'Total' number and
        normalised confusion matrix (percentages).

1 mark: Correct accuracy and error rate percentage numbers.

1 mark: Correct specificity, precision and recall percentage numbers.

No penalties for small rounding errors in parts 2 and 3.

----------------------------------------------------------------------

***** Part 3 A: 6 marks *****

1 mark: For executive summary, if 10 lines of text or less, and if it
        addresses the classification project conducted, including a
        summary of the data sets and the classification outcomes.
        -0.5 if the name or size (number of records) of the data set
        is not given.

0.5 mark: Data set description, including name, source, number of
        records, attributes in data, description of attributes, etc.

1 mark: Data exploration steps, data quality description, including
        missing values, distribution of values, plots, etc. (a
        correlation plot would have been nice). Also, if only a
        subset of attributes is shown in plots you need to explain
        why you have chosen these attributes and not others.

1 mark: Data cleaning and transformation steps performed, including
        feature construction if necessary.
        Converting numerical into categorical variables was not
        necessary, but considering how to handle records with missing
        values was important. Normalisation is (usually) not necessary
        for classification, and neither is categorisation of numerical
        variables. Also, start a classification by using all (or most)
        variables (unless some are very correlated with each other),
        and don't manually select a sub-set (unless you have a very
        large number of variables).

1 mark: Description of classification approaches, including
        classifiers chosen, their parameters, etc. One paragraph per
        classifier. Here you needed to explain why you chose your
        parameter settings. You lost marks if you did not explain why
        a certain parameter value/setting was used.

1 mark: Presentation of classifier results, should include confusion
        matrix (both raw numbers and percentages), and visualisation
        of the results, e.g. ROC, precision-recall curves, etc. The
        best would have been to show one ROC plot containing all three
        classifier results (easier to compare than three separate
        plots).
        (0.5 mark for confusion matrix, 0.5 for graphs)

0.5 mark: General description of your classification project,
        including problems, things learned, etc.

----------------------------------------------------------------------

***** Part 3 B: 6 marks *****

* two marks (2) for the program listing, its readability, structure,
  commenting, etc.
  -0.5 if the Canberra distance function did not check for division
       by zero in the formula!
  -0.25 for bad variable / method / function naming.
  -0.25 for lots of code in the main function / method and not in
        separate functions / methods.
  -0.25 for missing or unclear comments.

* half a mark (0.5) for your description of how your program works.

* half a mark (0.5) for your description of how you tested your
  program.
  -0.25 if testing is not properly explained, or only limited testing
        is described.

* one mark (1) each for correctly classifying all the test records in
  the three test input files.
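
For reference, the division-by-zero check mentioned for the Canberra
distance could look like the sketch below. It is only an illustration
and not the expected solution: the function name canberra_distance is
hypothetical, and skipping a term whose denominator is zero (both
values are 0) is an assumed, commonly used convention.

```python
def canberra_distance(p, q):
    """Canberra distance between two equal-length numeric vectors.

    Computes the sum over i of |p_i - q_i| / (|p_i| + |q_i|).
    """
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")

    total = 0.0
    for a, b in zip(p, q):
        denominator = abs(a) + abs(b)
        if denominator == 0:
            # Both values are zero, so the term would be 0/0.
            # Assumed convention: such a term contributes nothing.
            continue
        total += abs(a - b) / denominator
    return total
```

For example, canberra_distance([1.0, 0.0], [3.0, 0.0]) gives 0.5: the
first attribute contributes 2/4 and the second is skipped instead of
raising a ZeroDivisionError.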