Finding key sentences from documents

Temporary Supervisor

Dr Huidong Jin

Description

Automatic summarisation is the creation, by a computer program, of a shortened version of a document that still contains its most important points. Information overload has made access to coherent, well-formed summaries vital, and as the volume of available data has grown, so has interest in automatic summarisation; search engines such as Google are one example of summarisation technology in use. Document summarisation techniques fall into two main categories: abstraction and extraction. Abstraction paraphrases sections of the source document and therefore depends strongly on natural language generation, which is itself a growing field. Extraction simply copies into the summary the material the system deems most important (for example, key clauses, sentences or paragraphs). This project will focus on extraction.

Recently, our research team has developed a novel topic model, called the Segmented Topic Model. It can model the connections between the main topics of a whole document and the topics of its segments, such as sentences. It generates substantially better topics for describing a document than the state-of-the-art technique Latent Dirichlet Allocation (LDA).
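
To make the extraction setting concrete, here is a minimal sketch of document-level topic inference using plain LDA via the gensim library, not the Segmented Topic Model itself; the toy corpus, tokenisation and number of topics are illustrative placeholders only.

```python
# Minimal sketch: topic inference with standard LDA via gensim.
# The toy corpus, tokenisation and num_topics are illustrative only;
# the Segmented Topic Model would additionally tie the topics of a whole
# document to the topics of its segments (e.g. sentences).
from gensim import corpora, models

documents = [
    "topic models describe documents as mixtures of latent topics",
    "extraction copies the most important sentences into the summary",
    "abstraction paraphrases the source document with natural language generation",
]
texts = [doc.lower().split() for doc in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=10, random_state=0)

# Topic proportions of the first document in the corpus.
print(lda.get_document_topics(corpus[0], minimum_probability=0.0))
```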

Goals

This project aims to implement efficient, and develop effective, key-sentence discovery techniques for large document corpora based on the Segmented Topic Model. The underlying idea is that, in the semantic space spanned by the model's topics, the sentences that express the most information about a document should be chosen as its key sentences. For effective document summarisation, we may have to develop new sentence-level topic models. The project can further explore interesting research challenges such as the robustness of the proposed technique, the appropriate number of key sentences, fast ranking techniques for different sentences, and so on.
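
One simple way to instantiate this ranking idea is sketched below, assuming topic proportions for the document and for each of its sentences have already been inferred (the vectors here are placeholder values): score each sentence by the cosine similarity between its topic vector and the document's topic vector, and keep the top-k sentences as the extractive summary. This is only an illustrative baseline, not the project's actual ranking technique.

```python
# Illustrative sketch: rank sentences in topic space by cosine similarity
# to the document's topic distribution and keep the top-k as key sentences.
import numpy as np

def rank_sentences(doc_topics, sentence_topics, k=2):
    """Return the indices (in original order) of the k sentences whose
    topic distributions are closest to the document's distribution."""
    doc = np.asarray(doc_topics, dtype=float)
    scores = []
    for sent in sentence_topics:
        sent = np.asarray(sent, dtype=float)
        scores.append(sent @ doc / (np.linalg.norm(sent) * np.linalg.norm(doc)))
    top = np.argsort(scores)[::-1][:k]
    return sorted(int(i) for i in top)

# Placeholder topic proportions (3 topics) for one document and its sentences.
doc_topics = [0.6, 0.3, 0.1]
sentence_topics = [
    [0.7, 0.2, 0.1],   # close to the document's dominant topics
    [0.1, 0.1, 0.8],   # largely off-topic sentence
    [0.5, 0.4, 0.1],
]
print(rank_sentences(doc_topics, sentence_topics, k=2))  # [0, 2]
```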

Requirements

  • Applicants are expected to have a major in information technology or computer engineering, with excellent programming skills (especially C/C++) and a good understanding of data mining;
  • or a strong background in applied mathematics/statistics, preferably also in algorithms, plus good programming knowledge.

Background Literature

  • Lan Du, Wray L. Buntine and Huidong Jin, “A segmented topic model based on the two-parameter Poisson-Dirichlet process”, Machine Learning 81(1): 5-19, 2010.
  • Rachit Arora and Balaraman Ravindran, “Latent Dirichlet Allocation and Singular Value Decomposition Based Multi-Document Summarization”, ICDM'08, pages 713-718, 2008.
  • David M. Blei, Andrew Y. Ng and Michael I. Jordan, “Latent Dirichlet allocation”, Journal of Machine Learning Research 3: 993-1022, 2003.

CSIRO (www.csiro.au), Australia’s national science agency, is one of the largest and most diverse research agencies in the world and operates large multi-disciplinary research teams. By doing a project with CSIRO you will have access to world-class facilities and be able to work alongside CSIRO scientists while enjoying generous personal development and learning opportunities.

Gain

A student working in this project can expect

  • to learn state-of-the-art text mining techniques;
  • to be involved in developing cutting-edge document summarisation techniques;
  • to gain experience in solving real-world challenges while working with a research group delivering great science and innovative solutions for Australian society and the economy.
