Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: not all records within a cluster are sufficiently similar to one another to be classified as duplicates. Thus, one of many possible subsequent clustering algorithms can further improve the result.

We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, like many other clustering algorithms, focuses on the edge weights instead. For evaluation, in contrast to related work, we experiment on true real-world datasets and, in addition, examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify the best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
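To make the consistency problem concrete, the following minimal Python sketch partitions a pairwise-duplicate graph into cliques, so that every record in a cluster matches every other record. It illustrates the general idea only, not the EMCC algorithm evaluated in the paper; all names and the toy data are invented.

def clique_partition(records, is_duplicate):
    """Greedily grow clusters, adding a record only if it matches *all*
    current members; plain transitive closure would merge on any match."""
    unassigned = set(records)
    clusters = []
    while unassigned:
        cluster = {unassigned.pop()}
        for r in sorted(unassigned):
            if all(is_duplicate(r, c) for c in cluster):
                cluster.add(r)
        unassigned -= cluster
        clusters.append(cluster)
    return clusters

# Toy pairwise classifier: a~b and b~c are duplicates, a~c is not.
pairs = {frozenset(("a", "b")), frozenset(("b", "c"))}
dup = lambda x, y: frozenset((x, y)) in pairs

# Transitive closure yields the inconsistent cluster {a, b, c};
# clique partitioning returns consistent clusters, e.g. [{'a', 'b'}, {'c'}].
print(clique_partition(["a", "b", "c"], dup))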
Data analytics are moving beyond the limits of a single data processing platform. A cross-platform query optimizer is necessary to enable applications to run their tasks over multiple platforms efficiently and in a platform-agnostic manner. For the optimizer to be effective, it must consider data movement costs across different data processing platforms. In this paper, we present the graph-based data movement strategy used by RHEEM, our open-source cross-platform system. In particular, we (i) model the data movement problem as a new graph problem, which we prove to be NP-hard, and (ii) propose a novel graph exploration algorithm, which allows RHEEM to discover multiple hidden opportunities for cross-platform data processing.
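As a rough, hedged illustration of the graph view (not RHEEM's actual algorithm; the general problem with multiple target platforms is NP-hard, as the paper proves), the single-source, single-target special case reduces to a cheapest-path search over conversion channels. All formats and costs below are invented.

import heapq

def cheapest_conversion(graph, source, target):
    """graph: {format: [(neighbor_format, movement_cost), ...]};
    Dijkstra over conversion channels between platform-specific formats."""
    queue, seen = [(0, source, [source])], set()
    while queue:
        cost, fmt, path = heapq.heappop(queue)
        if fmt == target:
            return cost, path
        if fmt in seen:
            continue
        seen.add(fmt)
        for nxt, c in graph.get(fmt, []):
            if nxt not in seen:
                heapq.heappush(queue, (cost + c, nxt, path + [nxt]))
    return None

channels = {
    "spark_rdd":   [("hdfs_file", 3), ("java_stream", 4)],
    "hdfs_file":   [("postgres_table", 4), ("java_stream", 2)],
    "java_stream": [("postgres_table", 1)],
}
print(cheapest_conversion(channels, "spark_rdd", "postgres_table"))
# -> (5, ['spark_rdd', 'java_stream', 'postgres_table'])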
Exploring Change
(2018)
Data and metadata in datasets experience many different kinds of change. Values are inserted, deleted, or updated; rows appear and disappear; columns are added or repurposed; and so on. In such a dynamic situation, users might have many questions related to changes in the dataset, for instance, which parts of the data are trustworthy and which are not. Users will wonder: How many changes have there been in the recent minutes, days, or years? What kinds of changes were made at which points in time? How dirty is the data? Is data cleansing required? The fact that data changed can hint at different hidden processes or agendas: a frequently crowd-updated city name may be controversial; a person whose name has been recently changed may be the target of vandalism; and so on. We show various use cases that benefit from recognizing and exploring such change. We envision a system and methods to interactively explore such change, addressing the variability dimension of big data challenges. To this end, we propose a model to capture change and the process of exploring dynamic data to identify salient changes. We provide exploration primitives along with motivational examples and measures for the volatility of data. We identify technical challenges that need to be addressed to make our vision a reality, and propose directions of future work for the data management community.
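As one hedged example of what a volatility measure could look like (the paper's own model and measures are not reproduced here; the field names and window choice below are assumptions), consider counting the distinct time windows in which each cell was changed:

from collections import defaultdict

def volatility(change_log, window=86400):
    """change_log: iterable of (timestamp, row_id, column) change events.
    Returns, per (row_id, column) cell, the number of distinct windows
    (default: days) in which that cell was changed."""
    buckets = defaultdict(set)
    for ts, row, col in change_log:
        buckets[(row, col)].add(ts // window)
    return {cell: len(windows) for cell, windows in buckets.items()}

# A city name edited on three different days is more volatile (and
# perhaps more controversial) than one edited only once.
log = [(0, "berlin", "name"), (90000, "berlin", "name"),
       (200000, "berlin", "name"), (100, "paris", "name")]
print(volatility(log))  # {('berlin', 'name'): 3, ('paris', 'name'): 1}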
Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. With existing FD discovery algorithms, some genuine FDs cannot be detected, precisely because of missing values, while some non-genuine FDs are discovered even though they hold only because the missing values are interpreted under a certain NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This score can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method over various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.
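The following sketch illustrates why NULL semantics make FD validity ambiguous; it is a minimal illustration, not the paper's genuineness score, and the relation and attribute names are invented:

def fd_holds(rows, lhs, rhs, null_equals_null=True):
    """rows: list of dicts. Returns False iff two rows agree on all lhs
    attributes but differ on rhs. With null_equals_null=False, a row
    with a NULL on the lhs never violates (NULL = 'unknown value')."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if not null_equals_null and None in key:
            continue  # unknown values cannot be proven equal
        if key in seen and seen[key] != row[rhs]:
            return False
        seen[key] = row[rhs]
    return True

rows = [{"zip": None, "city": "Berlin"},
        {"zip": None, "city": "Potsdam"},
        {"zip": "14482", "city": "Potsdam"}]
# zip -> city is violated if NULL = NULL, but valid if NULL means unknown:
print(fd_holds(rows, ["zip"], "city", null_equals_null=True))   # False
print(fd_holds(rows, ["zip"], "city", null_equals_null=False))  # True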
Given a query record, record matching is the problem of finding database records that represent the same real-world object. In the simplest scenario, a database record is completely identical to the query. In most cases, however, problems arise, for instance, as a result of data errors, data integrated from multiple sources, or data received from restrictive form fields. These problems are usually difficult, because they require a variety of actions, including field segmentation, decoding of values, and similarity comparisons, each requiring some domain knowledge. In this article, we study the problem of matching records that contain address information, including attributes such as Street-address and City. To facilitate this matching process, we propose a domain-specific procedure to, first, enrich each record with a more complete representation of the address information through geocoding and reverse geocoding and, second, select the best similarity measure for each address attribute, which finally helps the classifier achieve the best F-measure. We report on our experience in selecting geocoding services and discovering similarity measures for a concrete but common industry use case.
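A minimal sketch of the per-attribute selection idea: for each address attribute, keep the similarity measure whose thresholded decisions achieve the highest F-measure on labeled pairs. The measures, the fixed threshold, and the toy data are assumptions for illustration, not the paper's actual setup:

from difflib import SequenceMatcher

MEASURES = {
    "exact":   lambda a, b: float(a == b),
    "edit":    lambda a, b: SequenceMatcher(None, a, b).ratio(),
    "jaccard": lambda a, b: (len(set(a.split()) & set(b.split()))
                             / max(1, len(set(a.split()) | set(b.split())))),
}

def f_measure(predictions, labels):
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(l and not p for p, l in zip(predictions, labels))
    return 2 * tp / max(1, 2 * tp + fp + fn)

def best_measure(pairs, labels, threshold=0.5):
    """pairs: list of (value_a, value_b) for one attribute, e.g. City."""
    scores = {name: f_measure([m(a, b) >= threshold for a, b in pairs], labels)
              for name, m in MEASURES.items()}
    return max(scores, key=scores.get)

city_pairs = [("Berlin", "berlin "), ("Potsdam", "Potsdam"), ("Bonn", "Köln")]
print(best_measure(city_pairs, [True, True, False]))  # -> 'edit'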
CurEx
(2018)
The integration of diverse structured and unstructured information sources into a unified, domain-specific knowledge base is an important task in many areas. A well-maintained knowledge base enables data analysis in complex scenarios, such as risk analysis in the financial sector or the investigation of large data leaks, such as the Paradise or Panama Papers. Both the creation of such knowledge bases and their continuous maintenance and curation involve many complex tasks and considerable manual effort. With CurEx, we present a modular system that allows structured and unstructured data sources to be integrated into a domain-specific knowledge base. In particular, we (i) enable the incremental improvement of each individual integration component; (ii) enable the selective generation of multiple knowledge graphs from the information contained in the knowledge base; and (iii) provide two distinct user interfaces tailored to the needs of data engineers and end users, respectively. The former has curation capabilities and controls the integration process, whereas the latter focuses on the exploration of the generated knowledge graph.
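A hedged sketch of the modular-pipeline idea behind such a system (the component names and interfaces below are invented; CurEx's real APIs are not described in this abstract):

def integrate(doc, pipeline):
    """Run a document through replaceable integration components; any
    stage can be swapped or improved independently."""
    for component in pipeline:
        doc = component(doc)
    return doc

def extract_entities(doc):  # placeholder extraction component
    doc["entities"] = [w for w in doc["text"].split() if w.istitle()]
    return doc

def link_entities(doc):     # placeholder linking component
    doc["links"] = {e: "kb:" + e.lower() for e in doc["entities"]}
    return doc

doc = integrate({"text": "Mossack Fonseca operated in Panama"},
                [extract_entities, link_entities])

# Selective knowledge-graph generation: keep only entities of interest.
graph = {e: kb for e, kb in doc["links"].items() if e != "Panama"}
print(graph)  # {'Mossack': 'kb:mossack', 'Fonseca': 'kb:fonseca'}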
The HDI 2014 conference on informatics education in higher education (Hochschuldidaktik der Informatik, HDI), held in Freiburg, was once again organized by the section Informatik und Ausbildung / Didaktik der Informatik (IAD) of the Gesellschaft für Informatik e. V. (GI). It serves informatics lecturers in higher-education degree programs as a forum for information and exchange about new didactic approaches and about educational-policy topics in higher education from the disciplinary perspective of informatics.
HDI 2014 was already the sixth edition of the HDI. Its dedicated motto, "Gestalten und Meistern von Übergängen" (shaping and mastering transitions), was chosen to place particular emphasis on the transitions from school to university, from bachelor's to master's programs, from studies to doctoral research, and from studies to the world of work.
The gamma-ray spectrum of the low-frequency-peaked BL Lac (LBL) object AP Librae is studied, following the discovery of very-high-energy (VHE; E > 100 GeV) gamma-ray emission up to the TeV range by the H.E.S.S. experiment. This makes AP Librae one of the few VHE emitters of the LBL type. The measured spectrum yields a flux of (8.8 ± 1.5 (stat) ± 1.8 (sys)) × 10⁻¹² cm⁻² s⁻¹ above 130 GeV and a spectral index of Γ = 2.65 ± 0.19 (stat) ± 0.20 (sys). This study also makes use of Fermi-LAT observations in the high-energy (HE; E > 100 MeV) range, providing the longest continuous light curve (5 years) ever published on this source. The source underwent a flaring event between MJD 56306 and 56376 in the HE range, with a flux increase by a factor of 3.5 in the 14-day-bin light curve and no significant variation in spectral shape with respect to the low-flux state. While the H.E.S.S. and (low-state) Fermi-LAT fluxes are in good agreement where they overlap, a spectral curvature between the steep VHE spectrum and the Fermi-LAT spectrum is observed. The maximum of the gamma-ray emission in the spectral energy distribution is located below the GeV energy range.
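For orientation, the quoted integral flux and photon index are tied together by the standard power-law form of VHE spectra (a textbook relation, not a result specific to this paper), in LaTeX notation:

\frac{dN}{dE} = N_0 \left(\frac{E}{E_0}\right)^{-\Gamma},
\qquad
F(>E_{\mathrm{th}}) = \int_{E_{\mathrm{th}}}^{\infty} \frac{dN}{dE}\, dE
  = \frac{N_0 E_0}{\Gamma - 1} \left(\frac{E_{\mathrm{th}}}{E_0}\right)^{1-\Gamma}
\quad (\Gamma > 1),

so with Γ = 2.65 and F(>130 GeV) = 8.8 × 10⁻¹² cm⁻² s⁻¹, the normalization N₀ follows directly.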
Aims. Previous observations with the High Energy Stereoscopic System (H.E.S.S.) have revealed an extended very-high-energy (VHE; E > 100 GeV) gamma-ray source, HESS J1834-087, coincident with the supernova remnant (SNR) W41. The origin of the gamma-ray emission was investigated in more detail with the H.E.S.S. array and the Large Area Telescope (LAT) onboard the Fermi Gamma-ray Space Telescope.
Methods. The gamma-ray data, comprising 61 hours of observations with H.E.S.S. and four years of observations with the Fermi LAT, were analyzed, covering over five decades in energy, from 1.8 GeV up to 30 TeV. The morphology and spectrum of the TeV and GeV sources were studied, and multiwavelength data were used to investigate the origin of the gamma-ray emission toward W41.
Results. The TeV source can be modeled with a sum of two components: one point-like and one significantly extended (σ_TeV = 0.17° ± 0.01°), both centered on SNR W41 and exhibiting spectra described by a power law with index Γ_TeV ≃ 2.6. The GeV source detected with the Fermi LAT is extended (σ_GeV = 0.15° ± 0.03°) and morphologically matches the VHE emission. Its spectrum can be described by a power-law model with an index Γ_GeV = 2.15 ± 0.12 and smoothly joins the spectrum of the whole TeV source. A break appears in the gamma-ray spectra around 100 GeV. No pulsations were found in the GeV range.
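A minimal way to write down a spectrum with such a break is a generic broken power law (a sketch of the described shape, not the paper's actual fit function), in LaTeX:

\frac{dN}{dE} = N_0
\begin{cases}
  (E/E_b)^{-\Gamma_{\mathrm{GeV}}}, & E \le E_b,\\
  (E/E_b)^{-\Gamma_{\mathrm{TeV}}}, & E > E_b,
\end{cases}

with E_b ≈ 100 GeV, Γ_GeV = 2.15, and Γ_TeV ≈ 2.6; both branches equal N_0 at E_b, so the two spectra join continuously there.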
Conclusions. Two main scenarios are proposed to explain the observed emission: a pulsar wind nebula (PWN) or the interaction of SNR W41 with an associated molecular cloud. X-ray observations suggest the presence of a point-like source (a pulsar candidate) near the center of the remnant and of nonthermal diffuse X-ray emission that could arise from the possibly associated PWN. The PWN scenario is supported by the positions of the TeV and GeV sources, which are compatible with the putative pulsar. However, the spectral energy distribution from radio to gamma rays is reproduced by a one-zone leptonic model only if an excess of low-energy electrons is injected following a Maxwellian distribution by a pulsar with a high spin-down power (> 10³⁷ erg s⁻¹). This additional low-energy component is not needed if we consider the point-like TeV source to be unrelated to the extended GeV and TeV sources. The interacting-SNR scenario is supported by the spatial coincidence between the gamma-ray sources, the detection of OH (1720 MHz) maser lines, and the hadronic modeling.