The Australian National University

Document Analysis COMP4650

Learning outcomes

More information may be available for enrolled students on the course website on Wattle

Upon successful completion of the course, the student will have an understanding of the role documents play in business and community, and the various digital resources available for document analysis. Moreover, the student will have the background theory and practical knowledge necessary to plan and execute a basic document analysis project. The student will be able to:

  1. Understand the basic requirements that digital libraries and business processes place on documents. Obtain documents from various sources and transform them into a common XML or RDF format, drawing on a working knowledge of SAX and XPath.
  2. Understand the genres of documents available from the internet, such as RSS feeds, social networks, blogs, wikis, and archives, and the role they play in the internet ecosystem. Understand the linguistic and semantic resources available from the internet and the so-called "web of data", such as dictionaries, repositories, and ontologies.
  3. Understand basic probabilistic theories of language and document structure, and the basic algorithms and software available for them, and be able to use some common libraries for natural language processing to perform basic analysis tasks.
  4. Understand basic probabilistic theories of information retrieval, and be able to index a document collection for use in an information retrieval system. Understand basic theories and algorithms for large-scale named-entity matching and standardization of names within a collection.
  5. Understand basic probabilistic theories of classification, clustering, and document feature "engineering", and be able to perform automated classification.
