Probabilistic Topic Models of Text and Users

David Blei
Associate Professor, Computer Science
Princeton University
Location: Wachman 1015D
Date: Friday, November 22, 2013 - 14:00
Probabilistic topic models provide a suite of tools for analyzing large document collections. Topic modeling algorithms discover the latent themes that underlie the documents and identify how each document exhibits those themes. Topic modeling can be used to help explore, summarize, and form predictions about documents. Traditional topic modeling algorithms take a document collection as input and analyze its texts to estimate the collection's latent thematic structure. However, for many collections, there is an additional type of data: how people use the documents. For example, consider readers clicking on articles on a newspaper website or scientists placing articles in their personal libraries. User behavior data about documents is critical to building recommendation systems and gives new ways of understanding how a collection is implicitly organized.
 
In this talk, I will review the basics of topic modeling and describe our recent research on collaborative topic models, which simultaneously analyze texts and corresponding user behavior data. We studied collaborative topic models on a large collection of 80,000 scientists' libraries and the 250,000 abstracts of the corresponding articles. With this analysis, we can build recommendation systems that point scientists to articles they will like and, further, organize the scientific literature according to the discovered patterns of readership. As examples, we can identify articles that are important within a field and articles that transcend disciplinary boundaries. More broadly, topic modeling is a case study in the large field of applied probabilistic modeling. Finally, I will survey some recent advances in this field. I will show how modern probabilistic modeling gives data scientists a rich language for expressing statistical assumptions and scalable algorithms for uncovering hidden patterns in massive data.