TY  - BOOK
A1  - Draisbach, Uwe
A1  - Naumann, Felix
A1  - Szott, Sascha
A1  - Wonneberg, Oliver
T1  - Adaptive windows for duplicate detection
N2  - Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity. This task is difficult because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records, and (ii) data sets might be so large that a pairwise comparison of all records is infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare record pairs only within each partition. One well-known approach of this kind is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data, comparing only records that appear within the same window. We propose several variations of SNM that have in common a varying window size and advancement. The general intuition behind such adaptive windows is that there might be regions of high similarity, suggesting a larger window size, and regions of lower similarity, suggesting a smaller window size. We propose and thoroughly evaluate several adaptation strategies, some of which are provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
N2  - Duplicate detection is the task of finding multiple records that represent the same real-world object. This task is not trivial, because (i) the records may differ slightly, so that similarity measures are needed for a pairwise comparison, and (ii) the volume of data makes a complete pairwise comparison impossible. To solve the second problem, various algorithms exist that partition the data and perform comparisons only within the partitions. One of these algorithms is the Sorted Neighborhood Method (SNM), which sorts the data by a key and then slides a window over the sorted data. Comparisons are performed only within this window. We describe several variations of the Sorted Neighborhood Method that are based on varying window sizes. These approaches rest on the intuition that the sorted records contain regions of higher and lower similarity, for which correspondingly larger or smaller window sizes are appropriate. We describe and evaluate several adaptation strategies, some of which are provably more efficient than the original Sorted Neighborhood Method (same result with fewer comparisons).
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 49 
KW  - Informationssysteme
KW  - Datenqualität
KW  - Datenintegration
KW  - Duplikaterkennung
KW  - Duplicate Detection
KW  - Data Quality
KW  - Data Integration
KW  - Information Systems
Y1  - 2012
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-53007
SN  - 978-3-86956-143-1
SN  - 1613-5652
SN  - 2191-1665
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - BOOK
A1  - Bauckmann, Jana
A1  - Abedjan, Ziawasch
A1  - Leser, Ulf
A1  - Müller, Heiko
A1  - Naumann, Felix
T1  - Covering or complete? : Discovering conditional inclusion dependencies
N2  - Data dependencies, or integrity constraints, are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. In recent years, conditional dependencies have been introduced to analyze and improve data quality. In short, a conditional dependency is a dependency with a limited scope defined by conditions over one or more attributes. Only the matching part of the instance must adhere to the dependency. In this paper we focus on conditional inclusion dependencies (CINDs). We generalize the definition of CINDs, distinguishing covering and completeness conditions. We present a new use case for such CINDs, showing their value for solving complex data quality tasks. Further, we define quality measures for conditions inspired by precision and recall. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. Our algorithms automatically choose not only the condition values but also the condition attributes. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case.
N2  - Data dependencies (such as integrity constraints) are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. In recent years, conditional dependencies have been introduced to analyze and improve the quality of data. A conditional dependency is a dependency with a limited scope that is defined by conditions over one or more attributes. In this report we consider conditional inclusion dependencies (CINDs). We generalize the definition of CINDs by distinguishing covering and completeness conditions. We present a use case for such CINDs that shows their value in solving complex data quality problems. In addition, we define quality measures for conditions based on recall and precision. We present efficient algorithms that find covering and completeness conditions within given thresholds. Our algorithms not only choose the condition values but also find the condition attributes automatically. Finally, we show that our approach efficiently delivers meaningful and helpful results for the presented use case.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 62 
KW  - Datenabhängigkeiten
KW  - Bedingte Inklusionsabhängigkeiten
KW  - Erkennen von Meta-Daten
KW  - Linked Open Data
KW  - Link-Entdeckung
KW  - Assoziationsregeln
KW  - Data Dependency
KW  - Conditional Inclusion Dependency
KW  - Metadata Discovery
KW  - Linked Open Data
KW  - Link Discovery
KW  - Association Rule Mining
Y1  - 2012
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-62089
SN  - 978-3-86956-212-4
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - JOUR
A1  - Abramowski, Attila
A1  - Acero, F.
A1  - Aharonian, Felix A.
A1  - Akhperjanian, A. G.
A1  - Anton, Gisela
A1  - Balzer, Arnim
A1  - Barnacka, Anna
A1  - de Almeida, U. Barres
A1  - Becherini, Yvonne
A1  - Becker, J.
A1  - Behera, B.
A1  - Bernlöhr, K.
A1  - Birsin, E.
A1  - Biteau, Jonathan
A1  - Bochow, A.
A1  - Boisson, Catherine
A1  - Bolmont, J.
A1  - Bordas, Pol
A1  - Brucker, J.
A1  - Brun, Francois
A1  - Brun, Pierre
A1  - Bulik, Tomasz
A1  - Buesching, I.
A1  - Carrigan, Svenja
A1  - Casanova, Sabrina
A1  - Cerruti, M.
A1  - Chadwick, Paula M.
A1  - Charbonnier, A.
A1  - Chaves, Ryan C. G.
A1  - Cheesebrough, A.
A1  - Clapson, A. C.
A1  - Coignet, G.
A1  - Cologna, Gabriele
A1  - Conrad, Jan
A1  - Dalton, M.
A1  - Daniel, M. K.
A1  - Davids, I. D.
A1  - Degrange, B.
A1  - Deil, C.
A1  - Dickinson, H. J.
A1  - Djannati-Ataï, A.
A1  - Domainko, W.
A1  - Drury, L. O'C.
A1  - Dubus, G.
A1  - Dutson, K.
A1  - Dyks, J.
A1  - Dyrda, M.
A1  - Egberts, Kathrin
A1  - Eger, P.
A1  - Espigat, P.
A1  - Fallon, L.
A1  - Farnier, C.
A1  - Fegan, S.
A1  - Feinstein, F.
A1  - Fernandes, M. V.
A1  - Fiasson, A.
A1  - Fontaine, G.
A1  - Foerster, A.
A1  - Fuessling, M.
A1  - Gallant, Y. A.
A1  - Gast, H.
A1  - Gerard, L.
A1  - Gerbig, D.
A1  - Giebels, B.
A1  - Glicenstein, J. F.
A1  - Glueck, B.
A1  - Goret, P.
A1  - Goering, D.
A1  - Haeffner, S.
A1  - Hague, J. D.
A1  - Hampf, D.
A1  - Hauser, M.
A1  - Heinz, S.
A1  - Heinzelmann, G.
A1  - Henri, G.
A1  - Hermann, G.
A1  - Hinton, James Anthony
A1  - Hoffmann, A.
A1  - Hofmann, W.
A1  - Hofverberg, P.
A1  - Holler, M.
A1  - Horns, D.
A1  - Jacholkowska, A.
A1  - de Jager, O. C.
A1  - Jahn, C.
A1  - Jamrozy, M.
A1  - Jung, I.
A1  - Kastendieck, M. A.
A1  - Katarzynski, K.
A1  - Katz, U.
A1  - Kaufmann, S.
A1  - Keogh, D.
A1  - Khangulyan, D.
A1  - Khelifi, B.
A1  - Klochkov, D.
A1  - Kluzniak, W.
A1  - Kneiske, T.
A1  - Komin, Nu.
A1  - Kosack, K.
A1  - Kossakowski, R.
A1  - Laffon, H.
A1  - Lamanna, G.
A1  - Lennarz, D.
A1  - Lohse, T.
A1  - Lopatin, A.
A1  - Lu, C. -C.
A1  - Marandon, V.
A1  - Marcowith, Alexandre
A1  - Masbou, J.
A1  - Maurin, D.
A1  - Maxted, N.
A1  - Mayer, M.
A1  - McComb, T. J. L.
A1  - Medina, M. C.
A1  - Mehault, J.
A1  - Moderski, R.
A1  - Moulin, Emmanuel
A1  - Naumann, C. L.
A1  - Naumann-Godo, M.
A1  - de Naurois, M.
A1  - Nedbal, D.
A1  - Nekrassov, D.
A1  - Nguyen, N.
A1  - Nicholas, B.
A1  - Niemiec, J.
A1  - Nolan, S. J.
A1  - Ohm, S.
A1  - Wilhelmi, E. de Ona
A1  - Opitz, B.
A1  - Ostrowski, M.
A1  - Oya, I.
A1  - Panter, M.
A1  - Arribas, M. Paz
A1  - Pedaletti, G.
A1  - Pelletier, G.
A1  - Petrucci, P. -O.
A1  - Pita, S.
A1  - Puehlhofer, G.
A1  - Punch, M.
A1  - Quirrenbach, A.
A1  - Raue, M.
A1  - Rayner, S. M.
A1  - Reimer, A.
A1  - Reimer, O.
A1  - Renaud, M.
A1  - de los Reyes, R.
A1  - Rieger, F.
A1  - Ripken, J.
A1  - Rob, L.
A1  - Rosier-Lees, S.
A1  - Rowell, G.
A1  - Rudak, B.
A1  - Rulten, C. B.
A1  - Ruppel, J.
A1  - Sahakian, V.
A1  - Sanchez, David M.
A1  - Santangelo, Andrea
A1  - Schlickeiser, R.
A1  - Schoeck, F. M.
A1  - Schulz, A.
A1  - Schwanke, U.
A1  - Schwarzburg, S.
A1  - Schwemmer, S.
A1  - Sheidaei, F.
A1  - Skilton, J. L.
A1  - Sol, H.
A1  - Spengler, G.
A1  - Stawarz, L.
A1  - Steenkamp, R.
A1  - Stegmann, Christian
A1  - Stinzing, F.
A1  - Stycz, K.
A1  - Sushch, Iurii
A1  - Szostek, A.
A1  - Tavernet, J. -P.
A1  - Terrier, R.
A1  - Tluczykont, M.
A1  - Valerius, K.
A1  - van Eldik, C.
A1  - Vasileiadis, G.
A1  - Venter, C.
A1  - Vialle, J. P.
A1  - Viana, A.
A1  - Vincent, P.
A1  - Voelk, H. J.
A1  - Volpe, F.
A1  - Vorobiov, S.
A1  - Vorster, M.
A1  - Wagner, S. J.
A1  - Ward, M.
A1  - White, R.
A1  - Wierzcholska, A.
A1  - Zacharias, M.
A1  - Zajczyk, A.
A1  - Zdziarski, A. A.
A1  - Zech, Alraune
A1  - Zechlin, H. -S.
A1  - Aleksic, J.
A1  - Antonelli, L. A.
A1  - Antoranz, P.
A1  - Backes, Michael
A1  - Barrio, J. A.
A1  - Bastieri, D.
A1  - Becerra Gonzalez, J.
A1  - Bednarek, W.
A1  - Berdyugin, A.
A1  - Berger, K.
A1  - Bernardini, E.
A1  - Biland, A.
A1  - Blanch Bigas, O.
A1  - Bock, R. K.
A1  - Boller, A.
A1  - Bonnoli, G.
A1  - Tridon, D. Borla
A1  - Braun, I.
A1  - Bretz, T.
A1  - Canellas, A.
A1  - Carmona, E.
A1  - Carosi, A.
A1  - Colin, P.
A1  - Colombo, E.
A1  - Contreras, J. L.
A1  - Cortina, J.
A1  - Cossio, L.
A1  - Covino, S.
A1  - Dazzi, F.
A1  - De Angelis, A.
A1  - De Cea del Pozo, E.
A1  - De Lotto, B.
A1  - Delgado Mendez, C.
A1  - Diago Ortega, A.
A1  - Doert, M.
A1  - Dominguez, A.
A1  - Prester, Dijana Dominis
A1  - Dorner, D.
A1  - Doro, M.
A1  - Elsaesser, D.
A1  - Ferenc, D.
A1  - Fonseca, M. V.
A1  - Font, L.
A1  - Fruck, C.
A1  - Garcia Lopez, R. J.
A1  - Garczarczyk, M.
A1  - Garrido, D.
A1  - Giavitto, G.
A1  - Godinovic, N.
A1  - Hadasch, D.
A1  - Haefner, D.
A1  - Herrero, A.
A1  - Hildebrand, D.
A1  - Hoehne-Moench, D.
A1  - Hose, J.
A1  - Hrupec, D.
A1  - Huber, B.
A1  - Jogler, T.
A1  - Klepser, S.
A1  - Kraehenbuehl, T.
A1  - Krause, J.
A1  - La Barbera, A.
A1  - Lelas, D.
A1  - Leonardo, E.
A1  - Lindfors, E.
A1  - Lombardi, S.
A1  - Lopez, M.
A1  - Lorenz, E.
A1  - Makariev, M.
A1  - Maneva, G.
A1  - Mankuzhiyil, N.
A1  - Mannheim, K.
A1  - Maraschi, L.
A1  - Mariotti, M.
A1  - Martinez, M.
A1  - Mazin, D.
A1  - Meucci, M.
A1  - Miranda, J. M.
A1  - Mirzoyan, R.
A1  - Miyamoto, H.
A1  - Moldon, J.
A1  - Moralejo, A.
A1  - Munar, P.
A1  - Nieto, D.
A1  - Nilsson, K.
A1  - Orito, R.
A1  - Oya, I.
A1  - Paneque, D.
A1  - Paoletti, R.
A1  - Pardo, S.
A1  - Paredes, J. M.
A1  - Partini, S.
A1  - Pasanen, M.
A1  - Pauss, F.
A1  - Perez-Torres, M. A.
A1  - Persic, M.
A1  - Peruzzo, L.
A1  - Pilia, M.
A1  - Pochon, J.
A1  - Prada, F.
A1  - Moroni, P. G. Prada
A1  - Prandini, E.
A1  - Puljak, I.
A1  - Reichardt, I.
A1  - Reinthal, R.
A1  - Rhode, W.
A1  - Ribo, M.
A1  - Rico, J.
A1  - Ruegamer, S.
A1  - Saggion, A.
A1  - Saito, K.
A1  - Saito, T. Y.
A1  - Salvati, M.
A1  - Satalecka, K.
A1  - Scalzotto, V.
A1  - Scapin, V.
A1  - Schultz, C.
A1  - Schweizer, T.
A1  - Shayduk, M.
A1  - Shore, S. N.
A1  - Sillanpaa, A.
A1  - Sitarek, J.
A1  - Sobczynska, D.
A1  - Spanier, F.
A1  - Spiro, S.
A1  - Stamerra, A.
A1  - Steinke, B.
A1  - Storz, J.
A1  - Strah, N.
A1  - Suric, T.
A1  - Takalo, L.
A1  - Takami, H.
A1  - Tavecchio, F.
A1  - Temnikov, P.
A1  - Terzic, T.
A1  - Tescaro, D.
A1  - Teshima, M.
A1  - Thom, M.
A1  - Tibolla, O.
A1  - Torres, D. F.
A1  - Treves, A.
A1  - Vankov, H.
A1  - Vogler, P.
A1  - Wagner, R. M.
A1  - Weitzel, Q.
A1  - Zabalza, V.
A1  - Zandanel, F.
A1  - Zanin, R.
A1  - Arlen, T.
A1  - Aune, T.
A1  - Beilicke, M.
A1  - Benbow, W.
A1  - Bouvier, A.
A1  - Bradbury, S. M.
A1  - Buckley, J. H.
A1  - Bugaev, V.
A1  - Byrum, K.
A1  - Cannon, A.
A1  - Cesarini, A.
A1  - Ciupik, L.
A1  - Connolly, M. P.
A1  - Cui, W.
A1  - Dickherber, R.
A1  - Duke, C.
A1  - Errando, M.
A1  - Falcone, A.
A1  - Finley, J. P.
A1  - Finnegan, G.
A1  - Fortson, L.
A1  - Furniss, A.
A1  - Galante, N.
A1  - Gall, D.
A1  - Godambe, S.
A1  - Griffin, S.
A1  - Grube, J.
A1  - Gyuk, G.
A1  - Hanna, D.
A1  - Holder, J.
A1  - Huan, H.
A1  - Hui, C. M.
A1  - Kaaret, P.
A1  - Karlsson, N.
A1  - Kertzman, M.
A1  - Khassen, Y.
A1  - Kieda, D.
A1  - Krawczynski, H.
A1  - Krennrich, F.
A1  - Lang, M. J.
A1  - LeBohec, S.
A1  - Maier, G.
A1  - McArthur, S.
A1  - McCann, A.
A1  - Moriarty, P.
A1  - Mukherjee, R.
A1  - Nunez, P. D.
A1  - Ong, R. A.
A1  - Orr, M.
A1  - Otte, A. N.
A1  - Park, N.
A1  - Perkins, J. S.
A1  - Pichel, A.
A1  - Pohl, Martin
A1  - Prokoph, H.
A1  - Ragan, K.
A1  - Reyes, L. C.
A1  - Reynolds, P. T.
A1  - Roache, E.
A1  - Rose, H. J.
A1  - Ruppel, J.
A1  - Schroedter, M.
A1  - Sembroski, G. H.
A1  - Sentuerk, G. D.
A1  - Telezhinsky, Igor O.
A1  - Tesic, G.
A1  - Theiling, M.
A1  - Thibadeau, S.
A1  - Varlotta, A.
A1  - Vassiliev, V. V.
A1  - Vivier, M.
A1  - Wakely, S. P.
A1  - Weekes, T. C.
A1  - Williams, D. A.
A1  - Zitzer, B.
A1  - de Almeida, U. Barres
A1  - Cara, M.
A1  - Casadio, C.
A1  - Cheung, C. C.
A1  - McConville, W.
A1  - Davies, F.
A1  - Doi, A.
A1  - Giovannini, G.
A1  - Giroletti, M.
A1  - Hada, K.
A1  - Hardee, P.
A1  - Harris, D. E.
A1  - Junor, W.
A1  - Kino, M.
A1  - Lee, N. P.
A1  - Ly, C.
A1  - Madrid, J.
A1  - Massaro, F.
A1  - Mundell, C. G.
A1  - Nagai, H.
A1  - Perlman, E. S.
A1  - Steele, I. A.
A1  - Walker, R. C.
A1  - Wood, D. L.
T1  - The 2010 very high energy gamma-ray flare and 10 years of multi-wavelength observations of M 87
JF  - The astrophysical journal : an international review of spectroscopy and astronomical physics
N2  - The giant radio galaxy M 87 with its proximity (16 Mpc), famous jet, and very massive black hole ($(3\text{-}6) \times 10^{9}\,M_{\odot}$) provides a unique opportunity to investigate the origin of very high energy (VHE; E > 100 GeV) gamma-ray emission generated in relativistic outflows and the surroundings of supermassive black holes. M 87 has been established as a VHE gamma-ray emitter since 2006. The VHE gamma-ray emission displays strong variability on timescales as short as a day. In this paper, results from a joint VHE monitoring campaign on M 87 by the MAGIC and VERITAS instruments in 2010 are reported. During the campaign, a flare at VHE was detected triggering further observations at VHE (H.E.S.S.), X-rays (Chandra), and radio (43 GHz Very Long Baseline Array, VLBA). The excellent sampling of the VHE gamma-ray light curve enables one to derive a precise temporal characterization of the flare: the single, isolated flare is well described by a two-sided exponential function with significantly different flux rise and decay times of $\tau_{d}^{\mathrm{rise}} = (1.69 \pm 0.30)$ days and $\tau_{d}^{\mathrm{decay}} = (0.611 \pm 0.080)$ days, respectively. While the overall variability pattern of the 2010 flare appears somewhat different from that of previous VHE flares in 2005 and 2008, they share very similar timescales ($\sim$day), peak fluxes ($\Phi_{>0.35\,\mathrm{TeV}} \simeq (1\text{-}3) \times 10^{-11}$ photons cm$^{-2}$ s$^{-1}$), and VHE spectra. VLBA radio observations of 43 GHz of the inner jet regions indicate no enhanced flux in 2010 in contrast to observations in 2008, where an increase of the radio flux of the innermost core regions coincided with a VHE flare. On the other hand, Chandra X-ray observations taken $\sim$3 days after the peak of the VHE gamma-ray emission reveal an enhanced flux from the core (flux increased by factor $\sim$2; variability timescale <2 days).
The long-term (2001-2010) multi-wavelength (MWL) light curve of M 87, spanning from radio to VHE and including data from Hubble Space Telescope, Liverpool Telescope, Very Large Array, and European VLBI Network, is used to further investigate the origin of the VHE gamma-ray emission. No unique, common MWL signature of the three VHE flares has been identified. In the outer kiloparsec jet region, in particular in HST-1, no enhanced MWL activity was detected in 2008 and 2010, disfavoring it as the origin of the VHE flares during these years. Shortly after two of the three flares (2008 and 2010), the X-ray core was observed to be at a higher flux level than its characteristic range (determined from more than 60 monitoring observations: 2002-2009). In 2005, the strong flux dominance of HST-1 could have suppressed the detection of such a feature. Published models for VHE gamma-ray emission from M 87 are reviewed in the light of the new data.
KW  - galaxies: active
KW  - galaxies: individual (M 87)
KW  - galaxies: jets
KW  - galaxies: nuclei
KW  - gamma rays: galaxies
KW  - radiation mechanisms: non-thermal
Y1  - 2012
U6  - https://doi.org/10.1088/0004-637X/746/2/151
SN  - 0004-637X
VL  - 746
IS  - 2
PB  - IOP Publ. Ltd.
CY  - Bristol
ER  - 
TY  - BOOK
A1  - Albrecht, Alexander
A1  - Naumann, Felix
T1  - Understanding cryptic schemata in large extract-transform-load systems
N2  - Extract-Transform-Load (ETL) tools are used for the creation, maintenance, and evolution of data warehouses, data marts, and operational data stores. ETL workflows populate those systems with data from various data sources by specifying and executing a DAG of transformations. Over time, hundreds of individual workflows evolve as new sources and new requirements are integrated into the system. The maintenance and evolution of large-scale ETL systems require much time and manual effort. A key problem is to understand the meaning of unfamiliar attribute labels in source and target databases and ETL transformations. Hard-to-understand attribute labels lead to frustration and to extra time spent developing and understanding ETL workflows. We present a schema decryption technique to support ETL developers in understanding cryptic schemata of sources, targets, and ETL transformations. For a given ETL system, our recommender-like approach leverages the large number of mapped attribute labels in existing ETL workflows to produce good and meaningful decryptions. In this way we are able to decrypt attribute labels consisting of a number of unfamiliar few-letter abbreviations, such as UNP_PEN_INT, which we can decrypt to UNPAID_PENALTY_INTEREST. We evaluate our schema decryption approach on three real-world repositories of ETL workflows and show that our approach is able to suggest high-quality decryptions for cryptic attribute labels in a given schema.
N2  - Extract-Transform-Load (ETL) tools are frequently used for the creation, maintenance, and evolution of data warehouses, data marts, and operational databases. ETL workflows populate these systems with data from many different source systems. An ETL workflow consists of several transformation steps that form a DAG-structured graph. Over time, hundreds of individual ETL workflows emerge as new data sources are integrated or new requirements have to be implemented. The maintenance and evolution of large ETL systems require much time and manual work. A central problem is understanding unfamiliar attribute names in source and target databases and in ETL transformations. Hard-to-understand attribute names lead to frustration and to a high expenditure of time when developing and understanding ETL workflows. We present a schema decryption technique that helps ETL developers understand cryptic schemata in source and target databases and in ETL transformations. For a given ETL system, our approach takes into account the large number of mapped attribute names in the existing ETL workflows. In this way, good and meaningful "decryptions" are found, and we are able to "decrypt" attribute names that consist of unfamiliar abbreviations. For example, for the attribute name UNP_PEN_INT, the decryption UNPAID_PENALTY_INTEREST is suggested. Our schema decryption approach was evaluated on three ETL repositories, and it was shown that our approach suggests high-quality decryptions for cryptic attribute names.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 60 
KW  - Extract-Transform-Load (ETL)
KW  - Data Warehouse
KW  - Datenintegration
KW  - Extract-Transform-Load (ETL)
KW  - Data Warehouse
KW  - Data Integration
Y1  - 2012
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-61257
SN  - 978-3-86956-201-8
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  -