A large fraction of the ever-growing internet content is found in social media such as (micro)blogs. Users access it to both form and share their opinions about events and people, election preferences, and product and brand recommendations. This situation provides opportunities to create added layers of data mining and analysis regarding users' views on developing events, products, services, or government actions; at the same time, it raises challenges for Entity Linking (EL) in social media. EL is the task of linking an extracted mention to a specific definition of the entity. The definition of an entity is usually a pointer to a Web page that defines the entity. Information extraction from social media generally faces many challenging issues due to message volume, message speed (Twitter alone generates over 500 million messages per day), variety, free-form language, lack of context, large reference variation, and language diversity. Hashtags are an essential part of the ethos of social networks. They are used to denote brands, events, people, social rallies, etc. The hashtag disambiguation problem is to detect synonymous hashtags and recognize the polysemic ones. For example, the hashtag '#BHaram' refers to the entity 'Boko Haram', defined at the Wikipedia page en.wikipedia.org/wiki/Boko_Haram or at the National Counterterrorism Center Web page www.nctc.gov/site/groups/boko_haram.html. The purpose of this project is to perform EL in social media. This work will benefit multiple segments of society that rely on applications using data from microblog systems, such as targeted monitoring of Twitter and Facebook to collect and understand users' opinions about a recent product or a world event; data aggregation (e.g., reviews about products and services); and data mining for early crisis detection and response as well as national security. This project is one more step towards addressing the government's latest initiative of fighting crime using big data.
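To make the hashtag linking task concrete, the '#BHaram' example above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the project's method: the two-entry `KNOWLEDGE_BASE`, the helper names `split_hashtag`, `matches`, and `link_hashtag`, and the prefix-matching heuristic are all hypothetical.

```python
import re

# Hypothetical toy knowledge base: entity name -> Web page that defines it.
KNOWLEDGE_BASE = {
    "Boko Haram": "en.wikipedia.org/wiki/Boko_Haram",
    "Barack Obama": "en.wikipedia.org/wiki/Barack_Obama",
}

def split_hashtag(tag):
    """Split a hashtag into lowercase tokens at capital letters and digits."""
    body = tag.lstrip("#")
    tokens = re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+", body)
    return [t.lower() for t in tokens]

def matches(tag_tokens, entity_name):
    """Assumed heuristic: the hashtag spells the name without spaces, or
    each hashtag token is a prefix of the corresponding name token."""
    name_tokens = entity_name.lower().split()
    if "".join(tag_tokens) == "".join(name_tokens):
        return True
    return len(tag_tokens) == len(name_tokens) and all(
        n.startswith(t) for t, n in zip(tag_tokens, name_tokens)
    )

def link_hashtag(tag):
    """Return the Web pages of all knowledge-base entities the hashtag may denote."""
    tokens = split_hashtag(tag)
    return sorted(url for name, url in KNOWLEDGE_BASE.items() if matches(tokens, name))

print(link_hashtag("#BHaram"))
print(link_hashtag("#bokoharam"))
```

Under this heuristic, '#BHaram' and '#bokoharam' resolve to the same page, which is the synonymy half of the disambiguation problem. A hashtag that returns more than one page would be a candidate polysemic hashtag, and resolving it would require the message context that this sketch ignores.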
The goals of this project are to research algorithms that detect, in near real time, the pieces of text in messages that reference entities and the Web pages that describe entities, and that link entity references to Web pages and across microblog systems so that, together, a broader, more complete characterization of each entity can be automatically generated. The proposed approaches are based on innovative techniques that include: incremental, iterative message analysis; smart indexing techniques with live updates to support fast incremental entity reference detection; computationally light soft-clustering of messages to improve entity reference detection; and fast incremental K-partite graph clustering. The resulting artifacts (e.g., software tools) will be made available to benefit researchers in academia and industry. Distribution of free, open-source software for implementing the techniques developed will enhance existing research infrastructure. The project will support and train at least three PhD students, as well as involve undergraduate students in research at Temple University and Binghamton University. The project web site (http://cis.temple.edu/~edragut/projects/nimel.htm) includes more information on the project, software, datasets, educational materials, and publications.
Satadisha Saha Bhowmick, Eduard Dragut, and Weiyi Meng. IEEE Trans. Knowl. Data Eng. (2021)
Segmentation of Tweets with URLs and its Applications to Sentiment Analysis. The 35th AAAI Conference on Artificial Intelligence (AAAI'21): 12480-12488 (2021)
Normalization of Duplicate Records from Multiple Sources. IEEE Trans. Knowl. Data Eng. 31(4): 769-782 (2019)
How to Invest my Time: Lessons from Human-in-the-Loop Entity Extraction. KDD 2019: 2305-2313
Regular Expression Guided Entity Mention Mining from Noisy Web Data. EMNLP 2018: 1991-2000
Leveraging Social Media Signals for Record Linkage. WWW 2018:
Unsupervised Heterogeneous Domain Adaptation with Sparse Feature Transformation. Proceedings of Machine Learning Research (ACML), v.93, 2018.
Result Merging for Structured Queries on the Deep Web with Active Relevance Weight Estimation. Inf. Syst. 64: 93-103 (2017)
Satirical News Detection and Analysis using Attention Mechanism and Linguistic Features. EMNLP 2017: 1979-1989
PSH: A probabilistic signature hash method with hash neighborhood candidate generation for fast edit-distance string comparison on big data. BigData 2016: 122-127
ORLF: A flexible framework for online record linkage and fusion. ICDE 2016: 1378-1381
FLORIN - A System to Support (Near) Real-Time Applications on User Generated Content on Daily News. PVLDB 8(12): 1944-1947 (2015)
Spatial-temporal campus crime pattern mining from historical alert messages. ICNC 2017: 778-782
A Visual System for Mining Crime Mining across College Campuses. SIGMOD/PODS-SRC. Pages: 1-3. 2017