Learning multi-word terminology using Bayesian word segmentation
David Newman (University of California Irvine)
NICTA SML SEMINAR
DATE: 2012-08-09
TIME: 11:15:00 - 12:00:00
LOCATION: NICTA - 7 London Circuit
ABSTRACT:
Automatically extracting terminology and index terms from scientific literature is useful for a variety of digital library, indexing and search applications. This task is non-trivial, complicated by domain-specific terminology and a steady introduction of new terminology. Correctly identifying nested terminology is both interesting and challenging (e.g. "belief propagation" vs. "loopy belief propagation" vs. "loopy belief propagation convergence"). We present a Dirichlet Process model of word segmentation where multi-word segments are either retrieved from a cache or newly generated, and the DP concentration parameter controls the number of multi-word segment types. We show how this DP-segmentation model can be used with part-of-speech regular expressions to successfully extract nested terminology, outperforming previous methods for solving this problem.
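The cache-or-generate behaviour described in the abstract can be illustrated with a Chinese restaurant process simulation, which is a standard way to draw from a Dirichlet Process. This is a minimal sketch and not the speaker's model: the toy vocabulary, the uniform base distribution, and the names `base_segment`, `draw_segment` and `simulate` are all assumptions made for illustration. With n segments already drawn, a new segment is generated with probability alpha / (n + alpha); otherwise a cached segment is reused in proportion to its count.

```python
import random
from collections import Counter

# Toy vocabulary; purely illustrative, not taken from the talk.
VOCAB = ["loopy", "belief", "propagation", "convergence", "graph", "model"]


def base_segment(rng):
    """Hypothetical base distribution: a random segment of 1 to 3 words."""
    length = rng.randint(1, 3)
    return " ".join(rng.choice(VOCAB) for _ in range(length))


def draw_segment(cache, alpha, rng):
    """Draw one segment, either newly generated or retrieved from the cache."""
    n = sum(cache.values())
    # With probability alpha / (n + alpha), generate from the base distribution.
    # When the cache is empty (n == 0) this probability is 1.
    if rng.random() < alpha / (n + alpha):
        segment = base_segment(rng)
    else:
        # Otherwise reuse a cached segment type, weighted by its count.
        types = list(cache)
        weights = [cache[t] for t in types]
        segment = rng.choices(types, weights=weights, k=1)[0]
    cache[segment] += 1
    return segment


def simulate(num_draws, alpha, seed=0):
    """Run the process and return the cache of segment counts."""
    rng = random.Random(seed)
    cache = Counter()
    for _ in range(num_draws):
        draw_segment(cache, alpha, rng)
    return cache


if __name__ == "__main__":
    for alpha in (0.1, 1.0, 50.0):
        cache = simulate(500, alpha)
        print(f"alpha={alpha}: {len(cache)} segment types in 500 draws")
```

Running the script with increasing values of alpha should show the number of distinct segment types growing, which is the sense in which the concentration parameter controls the number of multi-word segment types.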


