TY - JOUR A1 - Momtazi, Saeedeh A1 - Naumann, Felix T1 - Topic modeling for expert finding using latent Dirichlet allocation JF - Wiley interdisciplinary reviews : Data mining and knowledge discovery N2 - The task of expert finding is to rank the experts in the search space given a field of expertise as an input query. In this paper, we propose a topic modeling approach for this task. The proposed model uses latent Dirichlet allocation (LDA) to induce probabilistic topics. In the first step of our algorithm, the main topics of a document collection are extracted using LDA. The extracted topics present the connection between expert candidates and user queries. In the second step, the topics are used as a bridge to find the probability of selecting each candidate for a given query. The candidates are then ranked based on these probabilities. The experimental results on the Text REtrieval Conference (TREC) Enterprise track for 2005 and 2006 show that the proposed topic-based approach outperforms the state-of-the-art profile- and document-based models, which use information retrieval methods to rank experts. Moreover, we present the superiority of the proposed topic-based approach to the improved document-based expert finding systems, which consider additional information such as local context, candidate prior, and query expansion. Y1 - 2013 U6 - https://doi.org/10.1002/widm.1102 SN - 1942-4787 VL - 3 IS - 5 SP - 346 EP - 353 PB - Wiley CY - San Fransisco ER - TY - JOUR A1 - Draisbach, Uwe A1 - Christen, Peter A1 - Naumann, Felix T1 - Transforming pairwise duplicates to entity clusters for high-quality duplicate detection JF - ACM Journal of Data and Information Quality N2 - Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records within a cluster are sufficiently similar to be classified as duplicate. Thus, one of many subsequent clustering algorithms can further improve the result.
We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, and many other clustering algorithms, focus on the edge weights, instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available. KW - Record linkage KW - data matching KW - entity resolution KW - deduplication KW - clustering Y1 - 2019 U6 - https://doi.org/10.1145/3352591 SN - 1936-1955 SN - 1936-1963 VL - 12 IS - 1 SP - 1 EP - 30 PB - Association for Computing Machinery CY - New York ER - TY - BOOK A1 - Albrecht, Alexander A1 - Naumann, Felix T1 - Understanding cryptic schemata in large extract-transform-load systems N2 - Extract-Transform-Load (ETL) tools are used for the creation, maintenance, and evolution of data warehouses, data marts, and operational data stores. ETL workflows populate those systems with data from various data sources by specifying and executing a DAG of transformations. Over time, hundreds of individual workflows evolve as new sources and new requirements are integrated into the system. The maintenance and evolution of large-scale ETL systems requires much time and manual effort. A key problem is to understand the meaning of unfamiliar attribute labels in source and target databases and ETL transformations. Hard-to-understand attribute labels lead to frustration and time spent to develop and understand ETL workflows. We present a schema decryption technique to support ETL developers in understanding cryptic schemata of sources, targets, and ETL transformations. For a given ETL system, our recommender-like approach leverages the large number of mapped attribute labels in existing ETL workflows to produce good and meaningful decryptions. In this way we are able to decrypt attribute labels consisting of a number of unfamiliar few-letter abbreviations, such as UNP_PEN_INT, which we can decrypt to UNPAID_PENALTY_INTEREST. We evaluate our schema decryption approach on three real-world repositories of ETL workflows and show that our approach is able to suggest high-quality decryptions for cryptic attribute labels in a given schema. N2 - Extract-Transform-Load (ETL) Tools werden häufig beim Erstellen, der Wartung und der Weiterentwicklung von Data Warehouses, Data Marts und operationalen Datenbanken verwendet. ETL Workflows befüllen diese Systeme mit Daten aus vielen unterschiedlichen Quellsystemen. Ein ETL Workflow besteht aus mehreren Transformationsschritten, die einen DAG-strukturierter Graphen bilden. Mit der Zeit entstehen hunderte individueller ETL Workflows, da neue Datenquellen integriert oder neue Anforderungen umgesetzt werden müssen. Die Wartung und Weiterentwicklung von großen ETL Systemen benötigt viel Zeit und manuelle Arbeit. Ein zentrales Problem ist dabei das Verständnis unbekannter Attributnamen in Quell- und Zieldatenbanken und ETL Transformationen. Schwer verständliche Attributnamen führen zu Frustration und hohen Zeitaufwänden bei der Entwicklung und dem Verständnis von ETL Workflows. Wir präsentieren eine Schema Decryption Technik, die ETL Entwicklern das Verständnis kryptischer Schemata in Quell- und Zieldatenbanken und ETL Transformationen erleichtert. Unser Ansatz berücksichtigt für ein gegebenes ETL System die Vielzahl verknüpfter Attributnamen in den existierenden ETL Workflows. So werden gute und aussagekräftige "Decryptions" gefunden und wir sind in der Lage Attributnamen, die aus unbekannten Abkürzungen bestehen, zu "decrypten". So wird z.B. für den Attributenamen UNP_PEN_INT als Decryption UNPAIN_PENALTY_INTEREST vorgeschlagen. Unser Schema Decryption Ansatz wurde für drei ETL-Repositories evaluiert und es zeigte sich, dass unser Ansatz qualitativ hochwertige Decryptions für kryptische Attributnamen vorschlägt. T3 - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 60 KW - Extract-Transform-Load (ETL) KW - Data Warehouse KW - Datenintegration KW - Extract-Transform-Load (ETL) KW - Data Warehouse KW - Data Integration Y1 - 2012 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-61257 SN - 978-3-86956-201-8 PB - Universitätsverlag Potsdam CY - Potsdam ER -