Two Problems in Data Science on Graphs: Steam Summarization and Signal Processing

Nick Duffield
Professor, Department of Electrical and Computer Engineering
Texas A&M University
SERC 306
Thursday, February 22, 2018 - 11:00
This talk describes research in two data-driven problems related to graphs and networks. The first problem concerns sampling massive graph data streams. Sampling is a powerful approach to reduce Big Data to small data, relieving storage and enabling faster query response when an approximate answer suffices. The focus of this research is a cost-based formulation for data reduction that allows flexible expression of query goals, and weights sampling to minimize information loss for a given space constraint. In the graph setting, Graph Priority Sampling (GPS) adaptively reflects the importance of incoming and stored edges for topological queries such as subgraph counting. For billion-scale graphs, GPS accurately estimates triangle and wedge counts with < 1% error, while storing a small fraction comprising less than 0.01% of the total edges in the graph, with average update times of a few microseconds per edge.

The second problem concerns predicting congestion on road networks from travel time data. This is challenging due both to the complex spatial-temporal dependencies within the data, and the computational cost of prediction over the entire network. Our approach to this problem uses Graph Signal Processing (GSP) methods based on features comprising clusters of elevated travel times. Learning from data collected from a 4,675 edge highway network over a year, we find that GSP outperforms standard methods such as PCA and non-graphical time series prediction. The talk also touches on applications of Data Science to network resilience, software defined networking, and hydrology, and challenges for interdisciplinary research and education in Data Science.