Exploiting Syntactic, Semantic and Lexical Regularities in Language Modeling

Dr Shaojun Wang (Dept of Computing Science, University of Alberta, Canada)

CSL SEMINAR SERIES

DATE: 2005-08-29
TIME: 14:00:00 - 14:45:00
LOCATION: RSISE Seminar Room, ground floor, building 115, cnr. North and Daley Roads, ANU

ABSTRACT:
Language modeling -- accurately calculating the probability of naturally occurring word sequences in human language -- lies at the heart of some of the most exciting developments in computer science, such as speech recognition, machine translation, information retrieval and bioinformatics. I will present two pieces of my research on statistical language modeling, each of which simultaneously incorporates several aspects of natural language, such as local word interaction, syntactic structure and semantic document information.
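
As a concrete illustration of what "calculating the probability of a word sequence" means, here is a minimal Python sketch of a trigram model evaluated by perplexity, the measure reported in the results below. It is not the speaker's method: the add-one smoothing, the padding symbols `<s>` and `</s>`, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

def pad(words):
    # Two start symbols give the first word a full trigram context.
    return ["<s>", "<s>"] + list(words) + ["</s>"]

def train_trigram(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for words in sentences:
        p = pad(words)
        vocab.update(p[2:])  # every predictable token, including </s>
        for i in range(2, len(p)):
            tri[(p[i - 2], p[i - 1], p[i])] += 1
            ctx[(p[i - 2], p[i - 1])] += 1
    return tri, ctx, vocab

def trigram_prob(w, u, v, tri, ctx, vocab):
    """P(w | u, v) with add-one smoothing (an illustrative assumption)."""
    return (tri[(u, v, w)] + 1) / (ctx[(u, v)] + len(vocab))

def perplexity(sentences, tri, ctx, vocab):
    """exp of the negative average log-probability per predicted token."""
    log_prob, n = 0.0, 0
    for words in sentences:
        p = pad(words)
        for i in range(2, len(p)):
            log_prob += math.log(
                trigram_prob(p[i], p[i - 2], p[i - 1], tri, ctx, vocab))
            n += 1
    return math.exp(-log_prob / n)

train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
tri, ctx, vocab = train_trigram(train)
print(perplexity(train, tri, ctx, vocab))
print(perplexity([["sat", "the", "cat"]], tri, ctx, vocab))
```

The sentence probability is the product of the conditional word probabilities (the chain rule under a second-order Markov assumption); lower perplexity means the model assigns higher probability to the text.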

The first piece of work is based on a new machine learning technique we have proposed -- the latent maximum entropy principle -- which allows relationships over hidden features to be effectively captured in a unified model. Our work extends previous research on maximum entropy methods for language modeling, which only allow observed features to be modeled. The ability to conveniently incorporate hidden variables allows us to extend the expressiveness of language models while reducing the need to pre-process the data to obtain explicitly observed features. We then use these techniques to combine two standard forms of language models: local lexical models (trigram models) and global document-level semantic models (probabilistic latent semantic analysis, PLSA).

The second piece of work is aimed at encoding syntactic structure into a semantic n-gram language model with a tractable parameter estimation algorithm. We propose a directed Markov random field (MRF) model that combines n-gram models, PCFGs and PLSA. The composite directed MRF model has a potentially exponential number of loops and becomes a context-sensitive grammar; nevertheless, we are able to estimate its parameters in cubic time using an efficient modified EM method, the generalized inside-outside algorithm, which extends the inside-outside algorithm to incorporate the effects of the n-gram and PLSA language models.
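
For readers unfamiliar with the inside-outside algorithm, the following Python sketch shows only the standard inside pass for a PCFG in Chomsky normal form. It is not the generalized algorithm described in the talk and omits the n-gram and PLSA components entirely. The grammar encoding (`lexical`, `binary`) and the toy grammar are assumptions made for illustration.

```python
from collections import defaultdict

def inside_probability(words, lexical, binary, start="S"):
    """Total probability that `start` derives `words` under a CNF PCFG.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    """
    n = len(words)
    # chart[(i, j)][A] = probability that A derives words[i:j]
    chart = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1)][a] += p
    # Loops over span start, span end and split point: cubic in n.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                left, right = chart[(i, k)], chart[(k, j)]
                if not left or not right:
                    continue
                for (a, b, c), p in binary.items():
                    if b in left and c in right:
                        chart[(i, j)][a] += p * left[b] * right[c]
    return chart[(0, n)].get(start, 0.0)

# Toy grammar (hypothetical): S -> NP VP, VP -> V NP, plus three lexical rules.
lexical = {("NP", "she"): 0.5, ("NP", "fish"): 0.5, ("V", "eats"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
print(inside_probability(["she", "eats", "fish"], lexical, binary))
```

The chart sums over all parses of each span, which is what keeps the cost cubic in sentence length even when the number of parses grows exponentially.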

Our experimental results on the Wall Street Journal corpus show that both approaches induce significant reductions in perplexity over current state-of-the-art techniques.

BIO:
Shaojun Wang received the BE and ME degrees from Tsinghua University and the MS degree in mathematics and the PhD degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign. His primary research interest is machine learning and computational intelligence. He is especially interested in developing machine learning methods to solve artificial intelligence problems that arise in human-machine interaction such as text and natural language processing, speech processing and recognition, bioinformatics and computational biology, information retrieval and data mining, and Web-based systems. He is also interested in the underlying statistical learning theory.

Updated: 29 August 2005