Streaming Architecture for Continuous Entity Linking in Social Media


Supported by NSF Award #1546480. $1,099,399. January 2016 - December 2018.
BIGDATA: Collaborative Research: F: Streaming Architecture for Continuous Entity Linking in Social Media

Overview


A large fraction of the ever-growing internet content is found in social media such as (micro)blogs. Users turn to it to form and share opinions about events, people, election preferences, and product and brand recommendations. This situation provides opportunities to create added layers of data mining and analysis regarding users' views on developing events, products, services, or government actions; at the same time, it raises challenges for Entity Linking (EL) in social media.

EL is the task of linking an extracted mention to a specific definition of the entity. The definition of an entity is usually a pointer to a Web page that defines the entity. Information extraction from social media generally faces many challenges: message volume, message speed (Twitter alone generates over 500 million messages per day), variety, free-form language, lack of context, large reference variation, and language diversity. Hashtags are an essential part of the ethos of social networks. They are used to denote brands, events, people, social rallies, etc. The hashtag disambiguation problem is to detect synonymous hashtags and recognize the polysemic ones. For example, the hashtag '#BHaram' refers to the entity 'Boko Haram', defined at the Wikipedia page en.wikipedia.org/wiki/Boko_Haram or at the National Counterterrorism Center Web page www.nctc.gov/site/groups/boko_haram.html.

The purpose of this project is to perform EL in social media. This work will benefit multiple segments of society that rely on applications using data from microblog systems, such as targeted monitoring of Twitter and Facebook to collect and understand users' opinions about a recent product or a world event; data aggregation (e.g., reviews about products and services); and data mining for early crisis detection and response as well as national security. This project is one more step towards addressing the government's latest initiative of fighting crime using big data.
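The hashtag disambiguation problem described above can be illustrated with a minimal Python sketch. It is not the project's method: the `CANDIDATES` table, the function names, and every entry other than '#BHaram' are hypothetical, chosen only to show what "synonymous" and "polysemic" mean once each hashtag has a set of candidate entity pages.

```python
from collections import defaultdict

# Hypothetical toy table: normalized hashtag -> candidate entity pages.
# Only the '#BHaram' -> Boko_Haram pairing comes from the text above;
# the other entries are illustrative assumptions.
CANDIDATES = {
    "#bharam": ["en.wikipedia.org/wiki/Boko_Haram"],
    "#bokoharam": ["en.wikipedia.org/wiki/Boko_Haram"],
    "#jaguar": [
        "en.wikipedia.org/wiki/Jaguar",
        "en.wikipedia.org/wiki/Jaguar_Cars",
    ],
}


def synonymous_groups(candidates):
    """Group hashtags that each resolve to the same single entity page."""
    groups = defaultdict(list)
    for tag, pages in candidates.items():
        if len(pages) == 1:  # unambiguous hashtag
            groups[pages[0]].append(tag)
    # Keep only entities reached by more than one hashtag.
    return {page: sorted(tags) for page, tags in groups.items() if len(tags) > 1}


def polysemic_tags(candidates):
    """Return hashtags that have more than one candidate entity page."""
    return sorted(tag for tag, pages in candidates.items() if len(pages) > 1)


if __name__ == "__main__":
    print(synonymous_groups(CANDIDATES))
    print(polysemic_tags(CANDIDATES))
```

In this toy setting the candidate sets are given up front; in practice, producing and ranking those candidates from noisy, context-poor messages is the hard part of the task.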

The goals of this project are to research algorithms that detect, in near real time, the pieces of text in messages that reference entities; that find Web pages describing those entities; and that link entity references to Web pages and across microblog systems, so that together a broader, more complete characterization of each entity can be generated automatically. The proposed approaches are based on innovative techniques that include: incremental, iterative message analysis; smart indexing techniques with live updates to support fast incremental entity reference detection; computationally light soft-clustering of messages to improve entity reference detection; and fast incremental K-partite graph clustering. The resulting artifacts (e.g., software tools) will be made available to benefit researchers in academia and industry. Distribution of free, open-source software implementing the techniques developed will enhance existing research infrastructure. The project will support and train at least three PhD students, and will involve undergraduate students in research at Temple University and Binghamton University. The project web site (http://cis.temple.edu/~edragut/projects/nimel.htm) includes more information on the project, software, datasets, educational materials, and publications.
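As a rough illustration of what "indexing with live updates" for entity reference detection could look like, the sketch below keeps an in-memory dictionary of known surface forms that can be extended while messages are being processed. The `MentionIndex` class, its methods, and the entity identifier are hypothetical names introduced here; the sketch assumes whitespace tokenization and exact matching, and is not the project's indexing technique.

```python
class MentionIndex:
    """Toy in-memory index from surface forms to entity identifiers."""

    def __init__(self):
        self._forms = {}   # tuple of lowercase tokens -> entity id
        self._max_len = 0  # longest surface form, in tokens

    def add(self, surface_form, entity_id):
        """Live update: register a new surface form at any time."""
        tokens = tuple(surface_form.lower().split())
        if not tokens:
            return
        self._forms[tokens] = entity_id
        self._max_len = max(self._max_len, len(tokens))

    def detect(self, message):
        """Return (surface form, entity id) pairs, longest match first."""
        tokens = message.lower().split()
        found = []
        i = 0
        while i < len(tokens):
            matched = False
            # Try the longest possible span starting at position i.
            for n in range(min(self._max_len, len(tokens) - i), 0, -1):
                key = tuple(tokens[i:i + n])
                if key in self._forms:
                    found.append((" ".join(key), self._forms[key]))
                    i += n
                    matched = True
                    break
            if not matched:
                i += 1
        return found


if __name__ == "__main__":
    index = MentionIndex()
    message = "Reports mention Boko Haram again"
    print(index.detect(message))            # nothing is indexed yet
    index.add("Boko Haram", "Boko_Haram")   # live update
    print(index.detect(message))            # the mention is now found
```

The point of the design is that `add` is cheap and takes effect immediately, so a message that yields nothing on a first pass can be re-examined after the index learns a new surface form.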


Data



Team

Faculty

Graduate Students

  • [Temple University] Lihong He
  • [Temple University] Andrew Schneider
  • [Temple University] Chen Shen
  • [Temple University] Shanshan Zhang
  • [Temple University] Chao Han
  • [Binghamton University] Satadisha Bhowmick
  • [Binghamton University] Nebi Aydin

Undergraduate Students

  • [Temple University] John Male
  • [Temple University] Aidan Shea

Publications

  • [TKDE19] Yongquan Dong, Eduard C. Dragut, Weiyi Meng: [PDF] [BibTex]

    Normalization of Duplicate Records from Multiple Sources. IEEE Trans. Knowl. Data Eng. 31(4): 769-782 (2019)

  • [KDD19] Shanshan Zhang, Lihong He, Eduard C. Dragut, Slobodan Vucetic: [PDF] [BibTex]

    How to Invest my Time: Lessons from Human-in-the-Loop Entity Extraction. KDD 2019: 2305-2313

  • [EMNLP18] Shanshan Zhang, Lihong He, Slobodan Vucetic, Eduard C. Dragut: [PDF] [BibTex]

    Regular Expression Guided Entity Mention Mining from Noisy Web Data. EMNLP 2018: 1991-2000

  • [WWW18] Andrew T. Schneider, Arjun Mukherjee, Eduard C. Dragut: [PDF] [BibTex]

    Leveraging Social Media Signals for Record Linkage. WWW 2018:

  • [ACML18] Chen Shen, Yuhong Guo: [PDF] [BibTex]

    Unsupervised Heterogeneous Domain Adaptation with Sparse Feature Transformation. Proceedings of Machine Learning Research (ACML), v.93, 2018.

  • [InfSyst17] Jing Yuan, Lihong He, Eduard C. Dragut, Weiyi Meng, Clement T. Yu: [PDF] [BibTex]

    Result Merging for Structured Queries on the Deep Web with Active Relevance Weight Estimation. Inf. Syst. 64: 93-103 (2017)

  • [EMNLP17] Fan Yang, Arjun Mukherjee, Eduard Constantin Dragut: [PDF] [BibTex]

    Satirical News Detection and Analysis using Attention Mechanism and Linguistic Features. EMNLP 2017: 1979-1989

  • [BigData16] Joseph Jupin, Justin Y. Shi, Eduard C. Dragut: [PDF] [BibTex]

    PSH: A probabilistic signature hash method with hash neighborhood candidate generation for fast edit-distance string comparison on big data. BigData 2016: 122-127

  • [ICDE16] El Kindi Rezig, Eduard C. Dragut, Mourad Ouzzani, Ahmed K. Elmagarmid, Walid G. Aref: [PDF] [BibTex]

    ORLF: A flexible framework for online record linkage and fusion. ICDE 2016: 1378-1381

  • [VLDB15] Qingyuan Liu, Eduard C. Dragut, Arjun Mukherjee, Weiyi Meng: [PDF] [BibTex]

    FLORIN - A System to Support (Near) Real-Time Applications on User Generated Content on Daily News. PVLDB 8(12): 1944-1947 (2015)

Research Experience for Undergraduates

  • Shela Wu, John Male, Eduard C. Dragut:

    Spatial-temporal campus crime pattern mining from historical alert messages. ICNC 2017: 778-782

  • Aidan Patrick Shea:

    A Visual System for Mining Crime Mining across College Campuses. SIGMOD/PODS-SRC. Pages: 1-3. 2017