Exploring non-parametric approaches in text analysis - topic modelling in tweets
Kar Wai Lim
CS HDR MONITORING
DATE: 2012-11-02
TIME: 14:40:00 - 15:10:00
LOCATION: NICTA - 7 London Circuit
ABSTRACT:
I will present work done in a few areas, along with my PhD work:

i) Relevance vs. Diversity Trade-Off
MMR (Maximal Marginal Relevance) is an ad hoc diversification algorithm that has been used heavily in Information Retrieval (IR) for over 12 years. MMR performs well despite having no theoretical justification. In this work, we explicitly derive the mathematical relation between MMR and an optimisation objective named 'expected n-call@k', and show that the latter has the same form as MMR.

ii) Angry Birds Project
This project aims to integrate different areas of Artificial Intelligence (AI) to solve challenging problems; solving the Angry Birds game is the first step towards that. I will briefly discuss the trajectory module and show some demos.

iii) Split-merge LDA
Traditional topic models such as LDA suffer from the problem of choosing the number of topics (analogous to the number of clusters in clustering). Various methods and models have been proposed to solve this problem, but other issues arise; for instance, using HDP results in duplicate or very similar topics (clusters). We propose an extension to LDA with two operators on topics, splitting and merging, that aim to eliminate the need to choose the number of topics and to clean up duplicate topics.

iv) Exploring tagged documents (e.g., Twitter)
Standard LDA performs badly on Twitter due to extremely short documents and heavy noise. Existing extensions of topic models still suffer from various issues. I propose an author-tag topic model that utilises both author and tag information on tweets. This model also works for other tagged data such as publications (PubMed). I will discuss various potential applications of this model.
My talk will be quite high level, so it should not be difficult to understand; come along and listen on a lazy* Friday afternoon.
* This is a subjective statement.
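
For readers unfamiliar with MMR, mentioned under i) in the abstract, the following is a minimal sketch of the standard greedy MMR re-ranking rule. It is not the derivation presented in the talk. The function names, the use of cosine similarity over plain vectors, and the default trade-off weight `lam` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr(query, docs, k, lam=0.5):
    """Greedily pick up to k document indices, trading relevance against redundancy.

    Each step selects the remaining document that maximises
        lam * sim(query, doc) - (1 - lam) * max(sim(doc, s) for s already selected).
    lam = 1.0 ranks purely by relevance; smaller values favour diversity.
    """
    selected = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, docs[i])
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): two near-duplicate documents and one dissimilar one.
query = [1.0, 0.2]
docs = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]]
print(mmr(query, docs, k=2, lam=0.5))
print(mmr(query, docs, k=2, lam=1.0))
```

With `lam=0.5` the second pick skips the near-duplicate in favour of the dissimilar document, whereas `lam=1.0` falls back to a pure relevance ranking.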
BIO:


