Text mining is increasingly used in real-world applications such as classifying patents, discovering complicated gene-biomedical relationships, and visualising how a novel evolves. Semantic text mining analyses documents by their meaning rather than by the particular words or terminology they use. Recently, our research team has developed several novel topic models, such as the Segmented Topic Model and the Adaptive Topic Model. These models connect the main topics of a whole document with the topics of its segments, such as sentences. They can generate substantially better topics to describe a document than the state-of-the-art technique Latent Dirichlet Allocation (LDA). In that sense, both models can capture the meaning of documents.
This project aims to implement efficient, or develop effective, semantic text mining techniques for large document corpora based on topic models. The underlying idea is that, in the semantic space spanned by these topics, segments, documents and corpora can be compared directly. The project can further explore some interesting research challenges, such as the robustness of the proposed techniques, generating concise descriptions of different document classes, and reaching a good trade-off between efficiency and effectiveness.
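As a rough illustration of comparing items in a shared topic space, the sketch below measures the Hellinger distance between topic proportion vectors. Everything in it is assumed for illustration: the three-topic vectors are made up, the helper names are hypothetical, and averaging segment vectors is only a crude stand-in for the document-level topics that a model such as the Segmented Topic Model would infer.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no overlapping support.
    """
    if len(p) != len(q):
        raise ValueError("topic vectors must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def normalise(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def average_topics(segment_topics):
    """Average segment topic vectors into one document-level vector.

    Assumption: a simple average, NOT how a segmented topic model
    actually derives document topics.
    """
    n_topics = len(segment_topics[0])
    summed = [sum(seg[k] for seg in segment_topics) for k in range(n_topics)]
    return normalise(summed)


# Hypothetical topic proportions over three topics.
doc_a_segments = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
doc_b = [0.1, 0.2, 0.7]

doc_a = average_topics(doc_a_segments)

# A segment compared with its own document, then two different documents.
print("segment vs. own document:", hellinger(doc_a_segments[0], doc_a))
print("document A vs. document B:", hellinger(doc_a, doc_b))
```

Because segments, documents and (by further aggregation) corpora are all expressed as vectors over the same topics, one distance function covers every pairing. Hellinger distance is used here only as one common choice for probability vectors; other measures could be swapped in.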
- Applicants are expected to have a major in information technology or computer engineering, with excellent programming skills (especially C/C++) and a good understanding of data mining;
- or a strong background in applied mathematics or statistics, preferably with a strong grounding in algorithms and statistics, plus good programming knowledge.
- Huidong Jin, Lijiu Zhang and Lan Du, "Semantic Title Evaluation and Recommendation Based on Topic Models." PAKDD. Gold Coast, Australia, 14-17 April 2013
- Lan Du, Wray L. Buntine, Huidong Jin: A segmented topic model based on the two-parameter Poisson-Dirichlet process. Machine Learning 81(1): 5-19 (2010)
- Du, Lan, Wray Buntine, Huidong Jin, and Changyou Chen. "Sequential latent Dirichlet allocation." Knowledge and information systems 31, no. 3 (2012): 475-503.
- Lan Du, Wray Buntine and Huidong Jin, "Modelling Sequential Text with an Adaptive Topic Model." EMNLP-CoNLL 2012. pp 535-545, Jeju Island, Korea, July 2012.
- Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003) 993-1022
- C.C. Aggarwal and C.X. Zhai (eds.), Mining Text Data, DOI 10.1007/978-1-4614-3223-4_5, 2012
- CSIRO (www.csiro.au), Australia's national science agency, is one of the largest and most diverse research agencies in the world. It operates large multi-disciplinary research teams. By doing a project with CSIRO you will have access to world-class facilities and be able to work alongside CSIRO scientists while enjoying generous personal development and learning opportunities.
A student working on this project can expect:
- to learn state-of-the-art text mining techniques;
- to be involved in developing cutting-edge document summarisation and title evaluation techniques;
- to gain experience in solving real-world challenges while working with a research group delivering great science and innovative solutions for Australian society and the economy.