Big Data Challenges and Opportunities: Three Case Studies

Xiaohua Tony Hu
Professor, Founding Director of the Data Mining and Bioinformatics Lab
College of Computing and Informatics, Drexel University
SERC 306
Monday, February 5, 2018 - 11:00
In recent years, "Big Data" has become a ubiquitous term. Big Data is transforming science, engineering, medicine, healthcare, finance, business, and ultimately society itself. In this talk, I will discuss the challenges, opportunities, and applications of big data through three case studies with extensive experimental evaluations.
(1) Data analysis and visualization for microbiome data: Microbiome datasets often comprise several representations, or views, such as metabolic pathways, taxonomic assignments, and gene families, which provide complementary information for understanding microbial communities. Data integration methods based on non-negative matrix factorization (NMF) combine these views into a comprehensive picture of a given microbiome study. We present a novel variant of NMF, Laplacian regularized joint non-negative matrix factorization (LJ-NMF), for integrating functional and phylogenetic profiles from the Human Microbiome Project (HMP), and a multiple maps t-SNE regularization method for visualizing non-metric relationships in microbiome data.
(2) Video popularity prediction by sentiment propagation via implicit network: Video popularity prediction is important in many real applications, such as recommendation systems and investment consulting. However, four constraints limit the usability of most existing work. First, most feature-oriented models are inadequate in the social media environment, because many videos are published with no specific content features, such as a strong cast or a famous script. Second, many studies assume a linear correlation between view counts from early and later days, but this does not hold in every scenario. Third, numerous works consider only view counts and disregard the associated sentiments, even though it is public opinion that directly drives a video's eventual success or failure. Fourth, many related approaches rely on a network topology, but such topologies are unavailable in many applications. We propose a Dual Sentimental Hawkes Process (DSHP) to address all of these problems. DSHP is innovative in three ways: (i) it drops the "linear correlation" assumption and uses a Hawkes process instead; (ii) it reveals deeper factors that affect a video's popularity; and (iii) it is topology-free.
(3) Question-based text summary: In this research, we use text summarization to help people quickly grasp the main idea of a document before they read the details. In contrast with existing works, which mainly use declarative sentences to summarize a text document, we aim to use a few questions as the summary. In this way, readers learn which questions a given document can address and can decide to read further if they have similar questions in mind. We develop a two-stage approach consisting of question selection and question diversification. The question selection component aims to find a set of candidate questions that are relevant to a text document, which in turn can be treated as an answer to those questions. Specifically, we explore two lines of approaches developed for traditional text summarization, extractive and abstractive, to achieve the goals of relevancy and answerability, respectively. This study opens up a new direction at the intersection of information retrieval and natural language processing.
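As background for case study (2), the sketch below shows the conditional intensity of a standard univariate Hawkes process with an exponential kernel, which illustrates the self-exciting behavior that a linear early-to-late view count model cannot capture. It is not the DSHP model from the talk: the function name, the parameters mu, alpha, and beta, and the sample event times are all illustrative assumptions, and sentiment is not modeled here.

```python
import math


def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Intensity of a univariate Hawkes process with an exponential kernel.

    lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))

    mu, alpha, and beta are assumed illustrative values, not fitted parameters.
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i))
        for t_i in event_times
        if t_i < t  # only events strictly before t contribute
    )
    return mu + excitation


# Hypothetical event times (for example, moments at which a video was shared).
events = [1.0, 1.5, 4.0]

for t in (0.5, 2.0, 4.1, 10.0):
    print(f"t={t:4.1f}  intensity={hawkes_intensity(t, events):.4f}")
```

Each past event raises the intensity by alpha, and that boost decays at rate beta, so bursts of activity make further activity more likely for a while before the rate returns to the baseline mu.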