Introducing the CTA concept
(2013)
The Cherenkov Telescope Array (CTA) is a new observatory for very high-energy (VHE) gamma rays. CTA has ambitious science goals, which require full-sky coverage, a sensitivity improved by about an order of magnitude over existing VHE gamma-ray observatories, coverage of about four decades in energy, from a few tens of GeV to above 100 TeV, and enhanced angular and energy resolutions. An international collaboration has formed with more than 1000 members from 27 countries in Europe, Asia, Africa and North and South America. In 2010 the CTA Consortium completed a Design Study and started a three-year Preparatory Phase, which will lead to production readiness of CTA in 2014. In this paper we introduce the science goals and the concept of CTA, and provide an overview of the project.
HESS observations of the binary system PSR B1259-63/LS 2883 around the 2010/2011 periastron passage
(2013)
Aims. We present very high energy (VHE; E > 100 GeV) data from the gamma-ray binary system PSR B1259-63/LS 2883, taken around its periastron passage on 15 December 2010 with the High Energy Stereoscopic System (H.E.S.S.) array of Cherenkov telescopes. We aim to search for a possible TeV counterpart of the GeV flare detected by the Fermi LAT. In addition, we aim to study the current periastron passage in the context of previous observations taken at similar orbital phases, testing the repetitive behaviour of the source.
Methods. VHE observations were conducted with H.E.S.S. from 9 to 16 January 2011. The total dataset amounts to ~6 h of observing time. The data taken around the 2004 periastron passage were also re-analysed with the current analysis techniques in order to extend the energy spectrum above 3 TeV and to allow a full comparison of the 2004 and 2011 results.
Results. The source is detected in the 2011 data at a significance level of 11.5σ, revealing an average integral flux above 1 TeV of (1.01 ± 0.18(stat) ± 0.20(sys)) × 10^-12 cm^-2 s^-1. The differential energy spectrum follows a power-law shape with a spectral index Γ = 2.92 ± 0.30(stat) ± 0.20(sys) and a flux normalisation at 1 TeV of N_0 = (1.95 ± 0.32(stat) ± 0.39(sys)) × 10^-12 TeV^-1 cm^-2 s^-1. The measured light curve does not show any evidence for variability of the source on daily timescales. The re-analysis of the 2004 data yields results compatible with the published ones. The differential energy spectrum measured up to ~10 TeV is consistent with a power law with a spectral index Γ = 2.81 ± 0.10(stat) ± 0.20(sys) and a flux normalisation at 1 TeV of N_0 = (1.29 ± 0.08(stat) ± 0.26(sys)) × 10^-12 TeV^-1 cm^-2 s^-1.
Conclusions. The measured integral flux and the spectral shape of the 2011 data are compatible with the results obtained around previous periastron passages. The absence of variability in the H.E.S.S. data indicates that the GeV flare observed by Fermi LAT during the time period also covered by the H.E.S.S. observations originates in a physical scenario different from that of the TeV emission. Moreover, the comparison of the new results with the 2004 observations, made at a similar orbital phase, provides stronger evidence for the repetitive behaviour of the source.
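For reference, the spectral index and flux normalisation quoted above parametrize the standard power-law form of the differential photon spectrum (a general convention, not restated in the abstract); with the 2011 values this reads

```latex
\frac{\mathrm{d}N}{\mathrm{d}E} \;=\; N_0 \left(\frac{E}{1\,\mathrm{TeV}}\right)^{-\Gamma},
\qquad
\Gamma = 2.92 \pm 0.30_{\mathrm{stat}} \pm 0.20_{\mathrm{sys}},\quad
N_0 = (1.95 \pm 0.32_{\mathrm{stat}} \pm 0.39_{\mathrm{sys}}) \times 10^{-12}\ \mathrm{TeV^{-1}\,cm^{-2}\,s^{-1}}
```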
Ground-based gamma-ray astronomy has had a major breakthrough with the impressive results obtained using systems of imaging atmospheric Cherenkov telescopes. The field has a huge potential in astrophysics, particle physics and cosmology. CTA is an international initiative to build the next-generation instrument, with a factor of 5-10 improvement in sensitivity in the 100 GeV-10 TeV range and an extension to energies well below 100 GeV and above 100 TeV. CTA will consist of two arrays (one in the north, one in the south) for full-sky coverage and will be operated as an open observatory. The design of CTA is based on currently available technology. This document reports on the status and presents the major design concepts of CTA.
Context. Globular clusters (GCs) are established emitters of high-energy (HE, 100 MeV < E < 100 GeV) gamma-ray radiation, which could originate from the cumulative emission of the numerous millisecond pulsars (msPSRs) in the clusters' cores or from inverse Compton (IC) scattering of relativistic leptons accelerated in the GC environment. These stellar clusters could also constitute a new class of sources in the very-high-energy (VHE, E > 100 GeV) gamma-ray regime, judging from the recent detection of a signal from the direction of Terzan 5 with the H.E.S.S. telescope array.
Aims. To search for VHE gamma-ray sources associated with other GCs, and to put constraints on leptonic emission models, we systematically analyzed the observations towards 15 GCs taken with the H.E.S.S. array of imaging atmospheric Cherenkov telescopes.
Methods. We searched for point-like and extended VHE gamma-ray emission from each GC in our sample and also performed a stacking analysis combining the data from all GCs to investigate the hypothesis of a population of faint emitters. Assuming IC emission as the origin of the VHE gamma-ray signal from the direction of Terzan 5, we calculated the expected gamma-ray flux from each of the 15 GCs, based on their number of millisecond pulsars, their optical brightness, and the energy density of background photon fields.
Results. We did not detect significant VHE gamma-ray emission from any of the 15 GCs in either of the two analyses. Given the uncertainties of the parameter determinations, the obtained flux upper limits allow us to rule out the simple IC/msPSR scaling model for NGC 6388 and NGC 7078. The upper limits derived from the stacking analyses are factors between 2 and 50 below the flux predicted by the simple leptonic scaling model, depending on the assumed source extent and the dominant target photon fields. Therefore, Terzan 5 still remains exceptional among all GCs, as the VHE gamma-ray emission either arises from extraordinarily efficient leptonic processes, or from a recent catastrophic event, or is even unrelated to the GC itself.
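The "simple IC/msPSR scaling model" mentioned above predicts each cluster's flux by rescaling the Terzan 5 signal with the quantities listed in the abstract; a plausible reading of that scaling (the exact expression is given in the paper, so this is only an assumption-laden sketch) is

```latex
F_{\mathrm{GC}} \;\sim\; F_{\mathrm{Terzan\,5}}
\;\frac{N_{\mathrm{msPSR}}}{N_{\mathrm{msPSR}}^{\mathrm{Ter\,5}}}
\;\frac{u_{\mathrm{ph}}}{u_{\mathrm{ph}}^{\mathrm{Ter\,5}}}
\;\left(\frac{d_{\mathrm{Ter\,5}}}{d}\right)^{2}
```

where N_msPSR is the number of millisecond pulsars, u_ph the energy density of the target photon fields, and d the cluster distance.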
The quasar PKS 1510-089 (z = 0.361) was observed with the H.E.S.S. array of imaging atmospheric Cherenkov telescopes during high states in the optical and GeV bands, to search for very high energy (VHE, defined as E ≥ 0.1 TeV) emission. VHE gamma-rays were detected with a statistical significance of 9.2 standard deviations in 15.8 h of H.E.S.S. data taken during March and April 2009. A VHE integral flux of I(0.15 TeV < E < 1.0 TeV) = (1.0 ± 0.2(stat) ± 0.2(sys)) × 10^-11 cm^-2 s^-1 is measured. The best-fit power law to the VHE data has a photon index of Γ = 5.4 ± 0.7(stat) ± 0.3(sys). The GeV and optical light curves show pronounced variability during the period of the H.E.S.S. observations. However, there is insufficient evidence to claim statistically significant variability in the VHE data. Because of its relatively high redshift, the VHE flux from PKS 1510-089 should suffer considerable attenuation in intergalactic space due to the extragalactic background light (EBL). Hence, the measured gamma-ray spectrum is used to derive upper limits on the opacity due to the EBL, which are found to be comparable with limits previously derived from relatively nearby BL Lac objects. Unlike typical VHE-detected blazars, whose broadband spectra are dominated by non-thermal radiation at all wavelengths, the quasar PKS 1510-089 has a bright thermal component in the optical to UV band. Among all VHE-detected blazars, PKS 1510-089 has the most luminous broad-line region. The detection of VHE emission from this quasar indicates a low level of γ-γ absorption on the internal optical to UV photon field.
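The EBL opacity limits mentioned above rest on the standard attenuation relation between intrinsic and observed spectrum (a general relation, not specific to this paper):

```latex
F_{\mathrm{obs}}(E) \;=\; F_{\mathrm{int}}(E)\, e^{-\tau_{\gamma\gamma}(E,\,z)}, \qquad z = 0.361
```

Assuming the intrinsic spectrum cannot be arbitrarily hard, the measured spectrum then translates into an upper limit on the optical depth τ_γγ(E, z) and hence on the EBL density.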
Gamma-ray line signatures can be expected in the very-high-energy (E_γ > 100 GeV) domain due to self-annihilation or decay of dark matter (DM) particles in space. Such a signal would be readily distinguishable from astrophysical gamma-ray sources, which in most cases produce continuous spectra spanning several orders of magnitude in energy. Using data collected with the H.E.S.S. gamma-ray instrument, upper limits on line-like emission are obtained in the energy range between ~500 GeV and ~25 TeV for the central part of the Milky Way halo and for extragalactic observations, complementing recent limits obtained with the Fermi-LAT instrument at lower energies. No statistically significant signal could be found. For monochromatic gamma-ray line emission, flux limits of (2 × 10^-7 to 2 × 10^-5) m^-2 s^-1 sr^-1 and (1 × 10^-8 to 2 × 10^-6) m^-2 s^-1 sr^-1 are obtained for the central part of the Milky Way halo and for extragalactic observations, respectively. For a DM particle mass of 1 TeV, limits on the velocity-averaged DM annihilation cross section ⟨σv⟩(χχ → γγ) reach ~10^-27 cm^3 s^-1, based on the Einasto parametrization of the Galactic DM halo density profile. DOI: 10.1103/PhysRevLett.110.041301
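The step from the flux limits to limits on ⟨σv⟩ follows the standard expression for the gamma-ray line flux from self-conjugate dark matter annihilating in a halo with density profile ρ (here the Einasto parametrization); this is the generic relation, not a formula restated from the paper:

```latex
\Phi_{\gamma\gamma} \;=\; \frac{\langle\sigma v\rangle_{\chi\chi\to\gamma\gamma}}{8\pi\, m_{\chi}^{2}}
\; N_{\gamma} \int_{\Delta\Omega}\!\int_{\mathrm{l.o.s.}} \rho^{2}(r)\,\mathrm{d}l\,\mathrm{d}\Omega,
\qquad N_{\gamma} = 2,\quad E_{\gamma} = m_{\chi}
```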
The 2010 very high energy gamma-ray flare and 10 years of multi-wavelength observations of M 87
(2012)
The giant radio galaxy M 87, with its proximity (16 Mpc), famous jet, and very massive black hole ((3-6) × 10^9 M_⊙), provides a unique opportunity to investigate the origin of very high energy (VHE; E > 100 GeV) gamma-ray emission generated in relativistic outflows and in the surroundings of supermassive black holes. M 87 has been established as a VHE gamma-ray emitter since 2006. The VHE gamma-ray emission displays strong variability on timescales as short as a day. In this paper, results from a joint VHE monitoring campaign on M 87 by the MAGIC and VERITAS instruments in 2010 are reported. During the campaign, a flare at VHE was detected, triggering further observations at VHE (H.E.S.S.), in X-rays (Chandra), and in radio (43 GHz Very Long Baseline Array, VLBA). The excellent sampling of the VHE gamma-ray light curve enables a precise temporal characterization of the flare: the single, isolated flare is well described by a two-sided exponential function with significantly different flux rise and decay times of τ_rise = (1.69 ± 0.30) days and τ_decay = (0.611 ± 0.080) days, respectively. While the overall variability pattern of the 2010 flare appears somewhat different from that of previous VHE flares in 2005 and 2008, they share very similar timescales (~1 day), peak fluxes (Φ(>0.35 TeV) ≈ (1-3) × 10^-11 photons cm^-2 s^-1), and VHE spectra. VLBA radio observations at 43 GHz of the inner jet regions indicate no enhanced flux in 2010, in contrast to observations in 2008, where an increase of the radio flux of the innermost core regions coincided with a VHE flare. On the other hand, Chandra X-ray observations taken ~3 days after the peak of the VHE gamma-ray emission reveal an enhanced flux from the core (flux increased by a factor of ~2; variability timescale <2 days). The long-term (2001-2010) multi-wavelength (MWL) light curve of M 87, spanning from radio to VHE and including data from the Hubble Space Telescope, the Liverpool Telescope, the Very Large Array, and the European VLBI Network, is used to further investigate the origin of the VHE gamma-ray emission. No unique, common MWL signature of the three VHE flares has been identified. In the outer kiloparsec jet region, in particular in HST-1, no enhanced MWL activity was detected in 2008 and 2010, disfavoring it as the origin of the VHE flares during these years. Shortly after two of the three flares (2008 and 2010), the X-ray core was observed to be at a higher flux level than its characteristic range (determined from more than 60 monitoring observations: 2002-2009). In 2005, the strong flux dominance of HST-1 could have suppressed the detection of such a feature. Published models for VHE gamma-ray emission from M 87 are reviewed in the light of the new data.
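The "two-sided exponential function" used to characterize the flare is the usual asymmetric flare profile (the paper's parametrization may include an additional constant baseline; this is the generic form):

```latex
F(t) \;=\; F_{0}\times
\begin{cases}
\exp\!\big(\,(t-t_{0})/\tau_{\mathrm{rise}}\big), & t < t_{0},\\[4pt]
\exp\!\big(-(t-t_{0})/\tau_{\mathrm{decay}}\big), & t \ge t_{0},
\end{cases}
\qquad
\tau_{\mathrm{rise}} \approx 1.69\ \mathrm{d},\;\; \tau_{\mathrm{decay}} \approx 0.61\ \mathrm{d}
```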
Exploring Change
(2018)
Data and metadata in datasets experience many different kinds of change. Values are inserted, deleted or updated; rows appear and disappear; columns are added or repurposed, etc. In such a dynamic situation, users might have many questions related to changes in the dataset, for instance: Which parts of the data are trustworthy and which are not? How many changes have there been in the recent minutes, days or years? What kinds of changes were made at which points in time? How dirty is the data? Is data cleansing required? The fact that data changed can hint at different hidden processes or agendas: a frequently crowd-updated city name may be controversial; a person whose name has been recently changed may be the target of vandalism; and so on. We show various use cases that benefit from recognizing and exploring such change. We envision a system and methods to interactively explore such change, addressing the variability dimension of big data challenges. To this end, we propose a model to capture change and the process of exploring dynamic data to identify salient changes. We provide exploration primitives along with motivational examples and measures for the volatility of data. We identify technical challenges that need to be addressed to make our vision a reality, and propose directions of future work for the data management community.
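The abstract refers to measures for the volatility of data without defining them here; the following minimal Python sketch illustrates one plausible such measure (the function name, the snapshot representation, and the exact definition are hypothetical, not taken from the paper): the fraction of row transitions in which a given column changed between consecutive snapshots.

```python
from typing import Dict, Hashable, List

Snapshot = Dict[Hashable, Dict[str, object]]  # row id -> {column -> value}

def cell_volatility(snapshots: List[Snapshot], column: str) -> float:
    """Fraction of (row, snapshot-transition) pairs in which `column` changed.

    A deliberately simplistic volatility measure: 0.0 means the column never
    changed between consecutive snapshots, 1.0 means it changed everywhere.
    """
    changed = total = 0
    for old, new in zip(snapshots, snapshots[1:]):
        for row_id in old.keys() & new.keys():  # rows present in both snapshots
            total += 1
            if old[row_id].get(column) != new[row_id].get(column):
                changed += 1
    return changed / total if total else 0.0

# Example: a city name that is updated in one of two transitions -> volatility 0.5
s1 = {1: {"city": "Potsdam"}}
s2 = {1: {"city": "Berlin"}}
s3 = {1: {"city": "Berlin"}}
print(cell_volatility([s1, s2, s3], "city"))  # 0.5
```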
Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. When using existing FD discovery algorithms, some genuine FDs cannot be detected precisely because of missing values, while some non-genuine FDs may be discovered even though they hold only because of missing values under a certain NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This score can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method on various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.
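To make the NULL-semantics issue concrete, here is a small illustrative check (not the paper's genuineness score, and the toy relation is made up): the same functional dependency city → zip is valid if NULLs are treated as pairwise distinct unknowns, but is violated as soon as NULL = NULL is assumed.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple

Row = Tuple[Optional[str], ...]

def agree(u: Optional[str], v: Optional[str], null_eq_null: bool) -> bool:
    """Do two cell values agree under the chosen NULL semantics?"""
    if u is None or v is None:
        return null_eq_null and u is None and v is None
    return u == v

def fd_holds(rows: Sequence[Row], lhs: Sequence[int], rhs: int, null_eq_null: bool) -> bool:
    """X -> A holds if no two rows agree on all of X but disagree on A."""
    for r, s in combinations(rows, 2):
        if all(agree(r[i], s[i], null_eq_null) for i in lhs) and not agree(r[rhs], s[rhs], null_eq_null):
            return False
    return True

rows = [("Potsdam", "14469"), (None, "14469"), (None, "10115")]
# FD city -> zip: valid if NULLs are pairwise distinct, violated if NULL = NULL.
print(fd_holds(rows, lhs=[0], rhs=1, null_eq_null=False))  # True
print(fd_holds(rows, lhs=[0], rhs=1, null_eq_null=True))   # False
```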
MDedup
(2020)
Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of the same real-world entities in (relational) datasets. Existing duplicate detection approaches are effective, but they are also hard to parameterize or require large amounts of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific, if not dataset-specific, which is a problem whenever a new dataset needs to be cleaned.
For this reason, we propose a novel, rule-based and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a model that selects MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting step. Our experiments show that this approach reaches up to 94% F-measure and 100% precision on our evaluation datasets, which are good numbers considering that the system requires no domain- or dataset-specific configuration.
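Read operationally, a matching dependency used as a duplicate-detection rule says: if two records are sufficiently similar on certain attributes, classify them as a match. A minimal sketch, with attribute names, thresholds, and the similarity function chosen purely for illustration (these are not the MDs selected by MDedup):

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """A simple string similarity in [0, 1]; any other measure could be plugged in."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A hypothetical MD: name similar >= 0.9 AND city similar >= 0.8  =>  same entity.
MD = [("name", 0.9), ("city", 0.8)]

def matches(r1: dict, r2: dict, md=MD) -> bool:
    return all(sim(r1[attr], r2[attr]) >= theta for attr, theta in md)

a = {"name": "Jon Smith",  "city": "Potsdam"}
b = {"name": "John Smith", "city": "Potsdam"}
print(matches(a, b))  # True for this pair under the hypothetical thresholds
```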
Data errors represent a major issue in most application workflows. Before any important task can take place, a certain level of data quality has to be guaranteed by eliminating the different kinds of errors that may appear in the data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations in a systematic way prior to performing duplicate detection. Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints for domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and determines the redundant ones based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
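The second phase described above can be pictured as a simple leave-one-out loop over the remaining preparations; the sketch below is a simplified illustration in which `auc_pr` stands for running duplicate detection on the gold-standard sample with the given preparations applied (the function names and toy scores are hypothetical, not the paper's implementation):

```python
from typing import Callable, List, Set

def leave_one_out_selection(preparations: List[str],
                            auc_pr: Callable[[Set[str]], float]) -> Set[str]:
    """Iteratively drop a preparation if removing it does not hurt AUC-PR."""
    selected = set(preparations)
    improved = True
    while improved:
        improved = False
        baseline = auc_pr(selected)
        for prep in sorted(selected):
            if auc_pr(selected - {prep}) >= baseline:  # preparation is redundant
                selected.remove(prep)
                improved = True
                break
    return selected

# Toy example with a fabricated scoring function (purely illustrative):
scores = {frozenset(): 0.70, frozenset({"trim"}): 0.80,
          frozenset({"normalize"}): 0.70, frozenset({"trim", "normalize"}): 0.80}
print(leave_one_out_selection(["trim", "normalize"],
                              lambda s: scores[frozenset(s)]))  # {'trim'}
```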
Duplicate detection is the task of identifying all groups of records within a dataset that represent the same real-world entity. This task is difficult, because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records, and (ii) datasets might be so large that a pairwise comparison of all records is infeasible. To tackle the second problem, many algorithms have been suggested that partition the dataset and compare all record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data, comparing only records that appear within the same window. We propose several variations of SNM that have in common a varying window size and advancement. The general intuition of such adaptive windows is that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. We propose and thoroughly evaluate several adaptation strategies, some of which are provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
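For context, the classic fixed-window Sorted Neighborhood Method that the adaptive variants generalize can be sketched as follows (the sorting key, window size, and records are illustrative):

```python
from itertools import combinations
from typing import Callable, Iterable, List, Tuple

def sorted_neighborhood(records: List[dict], key: Callable[[dict], str],
                        window: int) -> Iterable[Tuple[dict, dict]]:
    """Classic SNM: sort by a key, then emit only pairs that fall into the same
    sliding window of fixed size `window` (the adaptive variants grow or shrink
    this window based on the similarity of neighboring records)."""
    ordered = sorted(records, key=key)
    seen = set()
    for start in range(len(ordered)):
        block = ordered[start:start + window]
        for a, b in combinations(block, 2):
            pair = (id(a), id(b))
            if pair not in seen:
                seen.add(pair)
                yield a, b

records = [{"name": "Smith, John"}, {"name": "Smith, Jon"}, {"name": "Zhang, Wei"}]
for a, b in sorted_neighborhood(records, key=lambda r: r["name"], window=2):
    print(a["name"], "<->", b["name"])  # only neighboring records are compared
```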
Given a query record, record matching is the problem of finding database records that represent the same real-world object. In the simplest scenario, a database record is completely identical to the query. In most cases, however, problems do arise, for instance as a result of data errors, data integrated from multiple sources, or data received through restrictive form fields. These problems are usually difficult, because they require a variety of actions, including field segmentation, decoding of values, and similarity comparisons, each requiring some domain knowledge. In this article, we study the problem of matching records that contain address information, including attributes such as Street-address and City. To facilitate this matching process, we propose a domain-specific procedure that, first, enriches each record with a more complete representation of the address information through geocoding and reverse geocoding and, second, selects the best similarity measure for each address attribute, which finally helps the classifier achieve the best F-measure. We report on our experience in selecting geocoding services and discovering similarity measures for a concrete but common industry use case.
CurEx
(2018)
The integration of diverse structured and unstructured information sources into a unified, domain-specific knowledge base is an important task in many areas. A well-maintained knowledge base enables data analysis in complex scenarios, such as risk analysis in the financial sector or investigating large data leaks, such as the Paradise or Panama Papers. Both the creation of such knowledge bases and their continuous maintenance and curation involve many complex tasks and considerable manual effort. With CurEx, we present a modular system that allows structured and unstructured data sources to be integrated into a domain-specific knowledge base. In particular, we (i) enable the incremental improvement of each individual integration component; (ii) enable the selective generation of multiple knowledge graphs from the information contained in the knowledge base; and (iii) provide two distinct user interfaces tailored to the needs of data engineers and end users, respectively. The former has curation capabilities and controls the integration process, whereas the latter focuses on the exploration of the generated knowledge graph.
Data Preparation
(2020)
Raw data are often messy: they follow different encodings, records are not well structured, values do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data management systems. Obtaining information from raw data therefore relies on some data preparation process. Data preparation is integral to advanced data analysis and data management, not only for data science but for any data-driven application. Existing data preparation tools are operational and useful, but there is still room for improvement and optimization. With increasing data volumes and their messy nature, the demand for prepared data grows steadily. To cater to this demand, companies and researchers are developing techniques and tools for data preparation. To better understand the available data preparation systems, we have conducted a survey to investigate (1) prominent data preparation tools, (2) distinctive tool features, (3) the need for preliminary data processing even for these tools, and (4) features and abilities that are still lacking. We conclude with an argument in support of automatic and intelligent data preparation beyond traditional and simplistic techniques.
Primary keys (PKs) and foreign keys (FKs) are important elements of relational schemata in various applications, such as query optimization and data integration. However, in many cases, these constraints are unknown or not documented. Detecting them manually is time-consuming and even infeasible in large-scale datasets. We study the problem of discovering primary keys and foreign keys automatically and propose an algorithm to detect both, namely Holistic Primary Key and Foreign Key Detection (HoPF). PKs and FKs are subsets of the sets of unique column combinations (UCCs) and inclusion dependencies (INDs), respectively, for which efficient discovery algorithms are known. Using score functions, our approach is able to effectively extract the true PKs and FKs from the vast sets of valid UCCs and INDs. Several pruning rules are employed to speed up the procedure. We evaluate precision and recall on three benchmarks and two real-world datasets. The results show that our method is able to retrieve on average 88% of all primary keys, and 91% of all foreign keys. We compare the performance of HoPF with two baseline approaches that both assume the existence of primary keys.
Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They are a generalization of functional dependencies (FDs), matching similar rather than identical values. Because their discovery is very difficult, existing profiling algorithms either find only small subsets of all MDs or are limited to small datasets.
We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on those results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB.
This paper shows that the law, in subtle ways, may set hitherto unrecognized incentives for the adoption of explainable machine learning applications. In doing so, we make two novel contributions. First, on the legal side, we show that to avoid liability, professional actors, such as doctors and managers, may soon be legally compelled to use explainable ML models. We argue that the importance of explainability reaches far beyond data protection law, and crucially influences questions of contractual and tort liability for the use of ML models. To this effect, we conduct two legal case studies, in medical and corporate merger applications of ML. As a second contribution, we discuss the (legally required) trade-off between accuracy and explainability and demonstrate the effect in a technical case study in the context of spam classification.
Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: not all records within a cluster are sufficiently similar to be classified as duplicates. Thus, one of many subsequent clustering algorithms can further improve the result. We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, like many other clustering algorithms, focuses on the edge weights instead. For the evaluation, in contrast to related work, we experiment on real-world datasets and, in addition, examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify the best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
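The inconsistency mentioned above arises because pairwise match decisions are closed transitively. The following union-find sketch (illustrative only, not one of the paper's algorithms) shows how a cluster can end up containing records that were never classified as duplicates of each other:

```python
class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

# Pairwise classifier output: A~B and B~C were labelled duplicates, A~C was not.
duplicate_pairs = [("A", "B"), ("B", "C")]

uf = UnionFind()
for x, y in duplicate_pairs:
    uf.union(x, y)

clusters = {}
for record in {"A", "B", "C"}:
    clusters.setdefault(uf.find(record), set()).add(record)
print(list(clusters.values()))  # [{'A', 'B', 'C'}] although A and C never matched
```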
Roughly every third Wikipedia article contains an infobox - a table that displays important facts about the subject in attribute-value form. The schema of an infobox, i.e., the attributes that can be expressed for a concept, is defined by an infobox template. Often, authors do not specify all template attributes, resulting in incomplete infoboxes. With iPopulator, we introduce a system that automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text. In contrast to prior work, iPopulator detects and exploits the structure of attribute values for independently extracting value parts. We have tested iPopulator on the entire set of infobox templates and provide a detailed analysis of its effectiveness. For instance, we achieve an average extraction precision of 91% for 1,727 distinct infobox template attributes.