Student research opportunities
Scalable knowledge discovery from very large environmental data
CSIRO PhD top-up of $15,000 per year available for application
Project Code: CECS_846
This project is available at the following levels:
Keywords:
Scalable, data mining, spatial/temporal modelling
Supervisors:
Dr Warren Jin, Professor Tom Gedeon
Outline:
Advances in technology mean that massive amounts of data are now collected at large numbers of spatial locations over long time periods in the environmental and geophysical sciences. For example, soil brightness, which can be retrieved to indicate daily surface soil moisture, is observed over the globe by remote sensors such as ASCAT aboard the EUMETSAT METOP satellite. Similarly, the online SILO database holds about 120 years of continuous daily weather records (up to 16 climate variables) from around 3800 Bureau of Meteorology stations across Australia and is updated in near real time.
This project aims to develop computationally efficient models and techniques to discover useful knowledge hidden in very large geo-referenced time series. The spatial and/or temporal dependency within very large spatio-temporal data simply overwhelms traditional learning techniques, including maximum likelihood estimation, Bayesian methods, Markov chain Monte Carlo, and best linear unbiased prediction. In particular, these techniques must invert very high-dimensional covariance matrices and calculate their determinants, and both operations have cubic time complexity in the matrix dimension.
One possible direction is to develop distributed versions of these techniques based on modern high performance computation platforms such as cloud computation or high performance computers. Alternatively, computational efficiency can be achieved by trading accuracy for speed. However, data aggregation and other dimension reduction techniques may lose local accuracy, while imposing block diagonal or other sparse matrix approximations may obscure large-scale dependencies. Care must therefore be taken to choose approximation mechanisms that retain acceptable accuracy.
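The trade-off described above can be illustrated with a minimal sketch. It is not part of the project itself: the exponential covariance function, the one-dimensional locations, the simulated data, and all function names (exp_covariance, gaussian_loglik, block_diagonal_loglik) are assumptions chosen only for illustration. The exact Gaussian log-likelihood needs a Cholesky factorisation of the full n x n covariance matrix, which costs on the order of n^3 operations, while a block diagonal approximation with blocks of size b costs on the order of n b^2 operations but ignores all dependence between blocks.

```python
import numpy as np

def exp_covariance(coords, length_scale=1.0, nugget=1e-6):
    """Exponential covariance between all pairs of 1-D locations (illustrative choice)."""
    dists = np.abs(coords[:, None] - coords[None, :])
    return np.exp(-dists / length_scale) + nugget * np.eye(len(coords))

def gaussian_loglik(y, cov):
    """Exact zero-mean Gaussian log-likelihood; the Cholesky step is O(n^3)."""
    n = len(y)
    chol = np.linalg.cholesky(cov)            # cov = chol @ chol.T
    alpha = np.linalg.solve(chol, y)          # alpha @ alpha == y.T @ inv(cov) @ y
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (alpha @ alpha + log_det + n * np.log(2.0 * np.pi))

def block_diagonal_loglik(y, coords, block_size, length_scale=1.0):
    """Approximation that treats consecutive blocks of locations as independent."""
    total = 0.0
    for start in range(0, len(y), block_size):
        stop = start + block_size
        cov_block = exp_covariance(coords[start:stop], length_scale)
        total += gaussian_loglik(y[start:stop], cov_block)
    return total

# Simulated data at 400 sorted locations (hypothetical, for demonstration only).
rng = np.random.default_rng(0)
coords = np.sort(rng.uniform(0.0, 100.0, size=400))
cov = exp_covariance(coords, length_scale=5.0)
y = np.linalg.cholesky(cov) @ rng.standard_normal(400)

exact = gaussian_loglik(y, cov)
approx = block_diagonal_loglik(y, coords, block_size=50, length_scale=5.0)
print(f"exact: {exact:.2f}, block diagonal: {approx:.2f}")
```

The gap between the two printed values reflects the dependence across block boundaries that the approximation discards. Setting block_size equal to the number of locations recovers the exact value, at the full cubic cost.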
Goals of this project
The project will develop scalable computation and learning techniques, and supporting software, for very large spatio-temporal data, with the aim of contributing to some national environmental challenges.
Requirements/Prerequisites
- Applicants are expected to have a major in computer science or applied statistics/mathematics.
- A strong background in statistical computation or machine learning is preferred.
- Excellent programming skills (C/C++, R) are preferred.
- High performance computation experience is preferred.
Student Gain
A student working on this project can expect
- to learn state-of-the-art scalable data mining techniques;
- to be involved in developing cutting-edge techniques to handle very large environmental data sets while working with a research group delivering great science and innovative solutions for Australian society and the economy;
- to be able to apply for a supplementary PhD scholarship from CSIRO.
Background Literature
- Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, 2011
- Cressie, N., T. Shi, and E. L. Kang (2010), Fixed Rank Filtering for Spatio-Temporal Data, Journal of Computational and Graphical Statistics, 19(3), 724-745, DOI 10.1198/jcgs.2010.09051.
- Harvey Miller and Jiawei Han (eds.), Geographical Data Mining and Knowledge Discovery, 2nd edition, Taylor & Francis, 2009.
- Porcu et al. (eds.), Advances and Challenges in Space-time Modelling of Natural Events. Springer, 2012.




