Entity resolution on-demand

Simonini, Giovanni; Zecchini, Luca; Bergamaschi, Sonia; Naumann, Felix

doi:10.14778/3523210.3523226

Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, determining which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously growing data. Similarly, when querying data lakes, we want to transform data on-demand and return the results in a timely manner, a fundamental requirement of ELT (Extract-Load-Transform) pipelines. We propose BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, following an ORDER BY predicate. Thus, it inherently supports top-k and stop-and-resume execution. For a wide range of applications, a significant amount of resources can be saved. We exhaustively evaluate and show the efficacy of BrewER on four real-world datasets.
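
The idea of cleaning one entity at a time in ORDER BY order can be illustrated with a small Python sketch. This is not BrewER's actual algorithm, only a simplified illustration under stated assumptions: the query orders by a numeric attribute in descending order, conflicting values are merged with MAX, and a prior blocking step has already produced a symmetric map of candidate matches. The names `resolve_on_demand`, `candidates`, `is_match`, and `same_prefix`, as well as the record layout, are hypothetical and chosen only for this example.

```python
import heapq
from itertools import islice
from typing import Callable, Iterator

Record = dict  # hypothetical layout: {"id": int, "name": str, "price": float}


def resolve_on_demand(
    records: list[Record],
    candidates: dict[int, set[int]],   # assumed symmetric output of a blocking step
    is_match: Callable[[Record, Record], bool],
    order_key: str,
    where: Callable[[Record], bool] = lambda entity: True,
) -> Iterator[Record]:
    """Lazily yield merged entities in descending order of MAX(order_key)."""
    by_id = {r["id"]: r for r in records}
    # Max-heap via negated keys; the id breaks ties deterministically.
    heap = [(-r[order_key], r["id"]) for r in records]
    heapq.heapify(heap)
    consumed: set[int] = set()

    while heap:
        _, seed_id = heapq.heappop(heap)
        if seed_id in consumed:
            continue  # already merged into an entity handled earlier
        # Clean only this entity: follow matching candidate pairs transitively.
        cluster, frontier = {seed_id}, [seed_id]
        while frontier:
            current = frontier.pop()
            for other in candidates.get(current, ()):
                if other in cluster or other in consumed:
                    continue
                if is_match(by_id[current], by_id[other]):
                    cluster.add(other)
                    frontier.append(other)
        consumed |= cluster
        members = [by_id[i] for i in sorted(cluster)]
        # The seed has the largest remaining value, so it is also the
        # cluster maximum and the entity can be emitted right away.
        entity = {
            "ids": [m["id"] for m in members],
            "name": max((m["name"] for m in members), key=len),
            order_key: max(m[order_key] for m in members),
        }
        if where(entity):
            yield entity


def same_prefix(a: Record, b: Record) -> bool:
    """Toy matcher: records match if their first two name tokens agree."""
    return a["name"].lower().split()[:2] == b["name"].lower().split()[:2]


records = [
    {"id": 1, "name": "Canon EOS 5D", "price": 1200.0},
    {"id": 2, "name": "canon eos 5d mark", "price": 1350.0},
    {"id": 3, "name": "Nikon D750", "price": 1100.0},
    {"id": 4, "name": "nikon d750 body", "price": 1000.0},
    {"id": 5, "name": "Sony A7", "price": 900.0},
]
candidates = {1: {2}, 2: {1}, 3: {4}, 4: {3}}

# Top-2 query: only the two highest-priced entities are ever resolved.
top2 = list(islice(resolve_on_demand(records, candidates, same_prefix, "price"), 2))
```

Because the function is a generator, a caller can stop after k results or keep the generator object and resume later with `next()`; records belonging to entities that are never requested (here, record 5) are never compared.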

Metadata
Author details:	Giovanni Simonini, Luca Zecchini, Sonia Bergamaschi, Felix Naumann
DOI:	https://doi.org/10.14778/3523210.3523226
ISSN:	2150-8097
Title of parent work (English):	Proceedings of the VLDB Endowment
Publisher:	Association for Computing Machinery
Place of publishing:	New York
Publication type:	Article
Language:	English
Date of first publication:	2022/03/01
Publication year:	2022
Release date:	2024/08/02
Volume:	15
Issue:	7
Number of pages:	13
First page:	1506
Last Page:	1518
Organizational units:	Affiliated Institutes / Hasso-Plattner-Institut für Digital Engineering gGmbH
DDC classification:	0 Computer science, information and general works / 00 Computer science, knowledge and systems / 000 Computer science, information and general works
Peer review:	Refereed
Publishing method:	Open Access / Hybrid Open-Access
License:	CC-BY-NC-ND - Attribution, NonCommercial, NoDerivatives 4.0 International