Analyzing the World's Languages

Ryan McDonald
Google, Inc.
Wachman 447
Wednesday, May 1, 2013 - 11:00
Building computer programs that can analyze and distill human language is the goal of natural language processing (NLP). In the past decade, significant advances have been made thanks to the availability of large training resources coupled with scalable machine learning algorithms. This progress can be seen in the improved quality of search engines as well as in successes such as Google Translate, IBM's Watson and Apple's Siri. Despite these achievements, NLP technologies remain limited in a number of respects. One particularly glaring shortcoming is that most NLP technologies are robust only for English, largely because training resources are scarce for the long tail of the world's languages. Since English speakers make up only a fraction of the world's population, this is problematic. In this talk, I will describe recent advances in building multilingual language technologies. In particular, I will focus on algorithms that learn from weak constraints derived from multilingual knowledge sources that already exist today. These constraints can serve as partial supervision in structured learning to construct analyzers for a diverse set of languages. The resulting system significantly advances the state of the art in multilingual syntactic and semantic analysis. I will conclude with thoughts on the future of such technologies, particularly with respect to their application to downstream tasks such as automatic translation and knowledge curation.
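To give a flavor of the idea of weak constraints as partial supervision, the toy sketch below (not the talk's actual system; the dictionary, words, and scores are invented for illustration) shows how a tag dictionary derived from an existing multilingual resource can constrain a tagger: each word may only receive a tag the dictionary allows, even when an unsupervised model would prefer otherwise.

```python
# Hypothetical dictionary mapping words to their allowed part-of-speech
# tags, as might be extracted from a multilingual resource such as
# Wiktionary. Entries and tags here are invented for illustration.
TAG_DICTIONARY = {
    "la":   {"DET"},
    "casa": {"NOUN", "VERB"},
}
ALL_TAGS = {"DET", "NOUN", "VERB", "ADJ"}

def constrained_tag(sentence, score):
    """Tag each word with its highest-scoring tag, restricted to the
    tags the dictionary allows; unknown words fall back to the full
    tagset. This is the weak-constraint idea in its simplest form."""
    tagged = []
    for word in sentence:
        allowed = TAG_DICTIONARY.get(word, ALL_TAGS)
        best = max(allowed, key=lambda tag: score(word, tag))
        tagged.append((word, best))
    return tagged

# A deliberately bad stand-in model that prefers VERB for every word:
def dummy_score(word, tag):
    return 2.0 if tag == "VERB" else 1.0

# The dictionary constraint forces "la" to DET despite the model's
# preference, while "casa" keeps its ambiguity resolved by the model.
print(constrained_tag(["la", "casa"], dummy_score))
```

In the actual research setting, such constraints are folded into structured learning over whole sentences rather than applied word by word, but the principle is the same: existing multilingual knowledge narrows the space of analyses the learner must consider.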