TY - JOUR
A1 - Hameed, Mazhar
A1 - Naumann, Felix
T1 - Data Preparation
BT - a survey of commercial tools
JF - SIGMOD record
N2 - Raw data are often messy: they follow different encodings, records are not well structured, values do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data management systems. The act of obtaining information from raw data relies on some data preparation process. Data preparation is integral to advanced data analysis and data management, not only for data science but for any data-driven applications. Existing data preparation tools are operational and useful, but there is still room for improvement and optimization. With increasing data volume and its messy nature, the demand for prepared data increases day by day.
To cater to this demand, companies and researchers are developing techniques and tools for data preparation. To better understand the available data preparation systems, we have conducted a survey to investigate (1) prominent data preparation tools, (2) distinctive tool features, (3) the need for preliminary data processing even for these tools, and (4) features and abilities that are still lacking. We conclude with an argument in support of automatic and intelligent data preparation beyond traditional and simplistic techniques.
KW - data quality
KW - data cleaning
KW - data wrangling
Y1 - 2020
U6 - https://doi.org/10.1145/3444831.3444835
SN - 0163-5808
SN - 1943-5835
VL - 49
IS - 3
SP - 18
EP - 29
PB - Association for Computing Machinery
CY - New York
ER -

TY - JOUR
A1 - Loster, Michael
A1 - Koumarelas, Ioannis
A1 - Naumann, Felix
T1 - Knowledge transfer for entity resolution with siamese neural networks
JF - ACM journal of data and information quality
N2 - The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity (duplicates) into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is time-consuming and requires extensive domain expertise.
We propose a deep Siamese neural network, capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. With the properties of deep learning methods, we are able to eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluated our method on multiple datasets and compared our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset. In addition, we show that knowledge transfer is not only feasible, but in our experiments led to an improvement in F-measure of up to +4.7 percent.
KW - Entity resolution
KW - duplicate detection
KW - transfer learning
KW - neural networks
KW - metric learning
KW - similarity learning
KW - data quality
Y1 - 2021
U6 - https://doi.org/10.1145/3410157
SN - 1936-1955
SN - 1936-1963
VL - 13
IS - 1
PB - Association for Computing Machinery
CY - New York
ER -