Although machine learning methods perform best with large, exhaustive data sets, most data sets in the scientific domain are incomplete. Missing data occurs when individual experiments fail during data collection, or when the results become corrupted. Restrictions based on the cost of samples or access to equipment can also inhibit researchers' ability to gather all the data that is needed. The result is that scientific data sets can be flawed, and can contain selection or evaluation biases. Rather than dropping samples from an already small data set, or dropping attributes that could significantly impact the results, it may be preferable to “fill in the blanks” using imputation or interpolation. There are a number of ways to impute missing variables or to interpolate to synthesize missing samples, and some sophisticated methods learn the underlying distributions and can do both. At this stage it is unclear which methods are best suited to which situations, such as different amounts or different types of missing data. In this project you will test the efficacy of different imputation and interpolation methods, such as deletion, MMM (mean, median and mode), univariate prediction and multivariable prediction, under a range of different conditions. Multivariable prediction can include generative adversarial networks (GANs), variational autoencoders (VAEs), self-supervised learning (SSL) and semi-supervised label propagators. Based on your results you will develop a protocol for selecting the best way to fix a damaged data set, and assess the impact of including different levels of synthetic data as opposed to removing different levels of observed data. A case study data set will be provided.
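As a rough illustration of the kind of comparison involved, the sketch below hides values completely at random in a synthetic two-column data set, then scores mean imputation, median imputation and a simple univariate regression prediction against the known true values. It also counts how many rows deletion would keep. Everything here is an assumption made for illustration: the synthetic data, the function names (`make_data`, `mask_mcar`, `compare`), the missingness rates and the RMSE metric are not part of the project specification, and the provided case study data set may call for different choices.

```python
import random
import statistics


def make_data(n, seed=0):
    """Synthetic (x, y) pairs with a linear relationship plus noise (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 2.0 * x + rng.gauss(0.0, 0.3)
        rows.append((x, y))
    return rows


def mask_mcar(rows, rate, seed=1):
    """Hide y completely at random with probability `rate`; hidden values become None."""
    rng = random.Random(seed)
    return [(x, None if rng.random() < rate else y) for x, y in rows]


def impute_constant(masked, stat):
    """Fill missing y with a single statistic (mean or median) of the observed y."""
    fill = stat([y for _, y in masked if y is not None])
    return [(x, fill if y is None else y) for x, y in masked]


def impute_regression(masked):
    """Fill missing y with a least-squares prediction from x, fitted on complete rows."""
    obs = [(x, y) for x, y in masked if y is not None]
    mean_x = statistics.fmean(x for x, _ in obs)
    mean_y = statistics.fmean(y for _, y in obs)
    sxx = sum((x - mean_x) ** 2 for x, _ in obs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in obs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [(x, intercept + slope * x if y is None else y) for x, y in masked]


def rmse_on_missing(truth, masked, imputed):
    """Root-mean-square error over the entries that were hidden."""
    errs = [
        (t[1] - i[1]) ** 2
        for t, m, i in zip(truth, masked, imputed)
        if m[1] is None
    ]
    return (sum(errs) / len(errs)) ** 0.5 if errs else 0.0


def compare(rates=(0.1, 0.3, 0.5), n=500):
    """Score each strategy at several missingness rates."""
    truth = make_data(n)
    results = {}
    for rate in rates:
        masked = mask_mcar(truth, rate)
        results[rate] = {
            "mean": rmse_on_missing(
                truth, masked, impute_constant(masked, statistics.fmean)),
            "median": rmse_on_missing(
                truth, masked, impute_constant(masked, statistics.median)),
            "regression": rmse_on_missing(
                truth, masked, impute_regression(masked)),
            # Deletion imputes nothing; it only shrinks the data set.
            "rows_kept_by_deletion": sum(1 for _, y in masked if y is not None),
        }
    return results


if __name__ == "__main__":
    for rate, scores in compare().items():
        print(rate, scores)
```

This sketch only covers values missing completely at random in one column. Other missingness patterns, categorical attributes (where the mode applies) and the learned generative methods would each need their own masking and scoring steps.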
Compare simple and advanced imputation and interpolation methods for data with different levels and distributions of missing variables.
Python programming and experience in data science and machine learning are essential (such as COMP3720, COMP4660, COMP4670, COMP6670, COMP8420). Familiarity with platforms such as scikit-learn, PyTorch, TensorFlow and Keras is desirable.
This is a 24cp project.
machine learning, materials informatics, data science, interpolation