TY - JOUR
A1 - Bleifuss, Tobias
A1 - Bornemann, Leon
A1 - Johnson, Theodore
A1 - Kalashnikov, Dmitri
A1 - Naumann, Felix
A1 - Srivastava, Divesh
T1 - Exploring Change
BT - a new dimension of data analytics
JF - Proceedings of the VLDB Endowment
N2 - Data and metadata in datasets experience many different kinds of change. Values are inserted, deleted, or updated; rows appear and disappear; columns are added or repurposed; and so on. In such a dynamic situation, users have many questions related to changes in the dataset, for instance, which parts of the data are trustworthy and which are not. Users will wonder: How many changes have there been in the recent minutes, days, or years? What kind of changes were made at which points in time? How dirty is the data? Is data cleansing required? The fact that data changed can hint at different hidden processes or agendas: a frequently crowd-updated city name may be controversial; a person whose name has recently been changed may be the target of vandalism; and so on. We show various use cases that benefit from recognizing and exploring such change. We envision a system and methods to interactively explore such change, addressing the variability dimension of big data challenges. To this end, we propose a model to capture change and the process of exploring dynamic data to identify salient changes. We provide exploration primitives along with motivational examples and measures for the volatility of data. We identify technical challenges that need to be addressed to make our vision a reality, and propose directions of future work for the data management community.
Y1 - 2018
U6 - https://doi.org/10.14778/3282495.3282496
SN - 2150-8097
VL - 12
IS - 2
SP - 85
EP - 98
PB - Association for Computing Machinery
CY - New York
ER -

TY - JOUR
A1 - Koumarelas, Ioannis
A1 - Kroschk, Axel
A1 - Mosley, Clifford
A1 - Naumann, Felix
T1 - Experience: Enhancing address matching with geocoding and similarity measure selection
JF - Journal of Data and Information Quality
N2 - Given a query record, record matching is the problem of finding database records that represent the same real-world object. In the easiest scenario, a database record is completely identical to the query. In most cases, however, problems do arise, for instance, as a result of data errors, data integrated from multiple sources, or data received from restrictive form fields. These problems are usually difficult, because they require a variety of actions, including field segmentation, decoding of values, and similarity comparisons, each requiring some domain knowledge. In this article, we study the problem of matching records that contain address information, including attributes such as Street-address and City. To facilitate this matching process, we propose a domain-specific procedure to, first, enrich each record with a more complete representation of the address information through geocoding and reverse geocoding and, second, select the best similarity measure for each address attribute, which ultimately helps the classifier achieve the best F-measure. We report on our experience in selecting geocoding services and discovering similarity measures for a concrete but common industry use case.
KW - Address matching
KW - record linkage
KW - duplicate detection
KW - similarity measures
KW - conditional functional dependencies
KW - address normalization
KW - address parsing
KW - geocoding
KW - geographic information systems
KW - random forest
Y1 - 2018
U6 - https://doi.org/10.1145/3232852
SN - 1936-1955
VL - 10
IS - 2
SP - 1
EP - 16
PB - Association for Computing Machinery
CY - New York
ER -

TY - JOUR
A1 - Berti-Equille, Laure
A1 - Harmouch, Hazar
A1 - Naumann, Felix
A1 - Novelli, Noel
A1 - Thirumuruganathan, Saravanan
T1 - Discovery of genuine functional dependencies from relational data with missing values
JF - Proceedings of the VLDB Endowment
N2 - Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. When using existing FD discovery algorithms, some genuine FDs may not be detected because of missing values, while some non-genuine FDs may be discovered even though they hold only under a certain NULL semantics for the missing values. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This score can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method over various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.
Y1 - 2018
U6 - https://doi.org/10.14778/3204028.3204032
SN - 2150-8097
VL - 11
IS - 8
SP - 880
EP - 892
PB - Association for Computing Machinery
CY - New York
ER -

TY - GEN
A1 - Loster, Michael
A1 - Naumann, Felix
A1 - Ehmueller, Jan
A1 - Feldmann, Benjamin
T1 - CurEx
BT - a system for extracting, curating, and exploring domain-specific knowledge graphs from text
T2 - Proceedings of the 27th ACM International Conference on Information and Knowledge Management
N2 - The integration of diverse structured and unstructured information sources into a unified, domain-specific knowledge base is an important task in many areas. A well-maintained knowledge base enables data analysis in complex scenarios, such as risk analysis in the financial sector or investigating large data leaks, such as the Paradise or Panama Papers. Both the creation of such knowledge bases and their continuous maintenance and curation involve many complex tasks and considerable manual effort. With CurEx, we present a modular system that allows structured and unstructured data sources to be integrated into a domain-specific knowledge base. In particular, we (i) enable the incremental improvement of each individual integration component; (ii) enable the selective generation of multiple knowledge graphs from the information contained in the knowledge base; and (iii) provide two distinct user interfaces tailored to the needs of data engineers and end-users, respectively. The former has curation capabilities and controls the integration process, whereas the latter focuses on the exploration of the generated knowledge graph.
Y1 - 2018
SN - 978-1-4503-6014-2
U6 - https://doi.org/10.1145/3269206.3269229
SP - 1883
EP - 1886
PB - Association for Computing Machinery
CY - New York
ER -