Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity. This task is difficult because (i) representations may differ slightly, so a similarity measure must be defined to compare pairs of records, and (ii) data sets can be so large that a pairwise comparison of all records is infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare record pairs only within each partition. One well-known approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then slides a window over the data, comparing only records that appear within the same window. We propose several variations of SNM that have in common a varying window size and advancement. The general intuition behind such adaptive windows is that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. We propose and thoroughly evaluate several adaptation strategies, some of which are provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
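The idea of an adaptive window can be sketched in a few lines. The following is a minimal illustration, not the paper's actual algorithm: one plausible strategy grows the window as long as the records at its boundary still look similar, so that regions of high similarity are compared in larger windows. The key and similarity functions are placeholders supplied by the caller.

```python
from itertools import combinations

def adaptive_snm(records, key, similar, min_window=2):
    """Sorted Neighborhood with a simple adaptive window (illustrative
    sketch): sort by a key, then grow the window at each position while
    the pair at the window boundary is still similar."""
    ordered = sorted(records, key=key)
    duplicates = set()
    n = len(ordered)
    i = 0
    while i < n - 1:
        w = min_window
        # grow the window while the boundary pair still looks similar,
        # exploiting regions of high similarity
        while i + w < n and similar(ordered[i + w - 1], ordered[i + w]):
            w += 1
        # compare all record pairs inside the current window
        for a, b in combinations(ordered[i:i + w], 2):
            if similar(a, b):
                duplicates.add((a, b))
        i += 1  # advance one record at a time
    return duplicates
```

With a fixed window, the pairs straddling a window boundary would be missed or require redundant comparisons; the adaptive boundary test avoids both in densely similar regions.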
Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the well-known GORDIAN algorithm and "Apriori-based" algorithms are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution, HCA-GORDIAN, combines the advantages of GORDIAN and our new algorithm HCA, and it significantly outperforms all previous work in many situations.
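The Apriori-style, bottom-up search that such algorithms build on can be sketched as follows (a simplified illustration, not the optimized HCA algorithm itself): candidates are checked level by level, and any candidate containing an already-found unique combination is pruned, since every superset of a unique combination is trivially unique.

```python
def unique_column_combinations(rows, columns):
    """Breadth-first (Apriori-style) discovery of minimal unique column
    combinations over a list of dict-shaped rows (illustrative sketch)."""
    def is_unique(combo):
        projection = [tuple(row[c] for c in combo) for row in rows]
        return len(set(projection)) == len(projection)

    minimal_uccs = []
    level = [(c,) for c in columns]          # start with single columns
    while level:
        next_level = set()
        for combo in level:
            # prune: supersets of a known unique combination are not minimal
            if any(set(u) <= set(combo) for u in minimal_uccs):
                continue
            if is_unique(combo):
                minimal_uccs.append(combo)
            else:
                # extend non-unique combinations by one column (ordered to
                # avoid generating the same candidate twice)
                for c in columns:
                    if c > combo[-1]:
                        next_level.add(combo + (c,))
        level = sorted(next_level)
    return minimal_uccs
```

The brute-force alternative would test all 2^n column subsets; the level-wise pruning is what makes candidate generation tractable on realistic schemas.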
This vision article outlines the main building blocks of what we term AI Compliance, an effort to bridge two complementary research areas: computer science and the law. The goal of such research is to model, measure, and affect the quality of AI artifacts, such as data, models, and applications, and thereby facilitate adherence to legal standards.
Axionlike particles (ALPs) are hypothetical light (sub-eV) bosons predicted in some extensions of the Standard Model of particle physics. In astrophysical environments comprising high-energy gamma rays and turbulent magnetic fields, the existence of ALPs can modify the energy spectrum of the gamma rays for a sufficiently large coupling between ALPs and photons. This modification would take the form of an irregular behavior of the energy spectrum in a limited energy range. Data from the H.E.S.S. observations of the distant BL Lac object PKS 2155-304 (z = 0.116) are used to derive upper limits at the 95% C.L. on the strength of the ALP coupling to photons, g_γa < 2.1 × 10⁻¹¹ GeV⁻¹ for an ALP mass between 15 and 60 neV. The results depend on assumptions on the magnetic field around the source, which are chosen conservatively. The derived constraints apply to both light pseudoscalar and scalar bosons that couple to the electromagnetic field.
Data dependencies, or integrity constraints, are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. In recent years, conditional dependencies have been introduced to analyze and improve data quality. In short, a conditional dependency is a dependency with a limited scope defined by conditions over one or more attributes; only the matching part of the instance must adhere to the dependency. In this paper we focus on conditional inclusion dependencies (CINDs). We generalize the definition of CINDs, distinguishing covering and completeness conditions. We present a new use case for such CINDs, showing their value for solving complex data quality tasks. Further, we define quality measures for conditions inspired by precision and recall. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. Our algorithms choose not only the condition values but also the condition attributes automatically. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case.
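The precision/recall analogy for condition quality can be made concrete with a small sketch. The function names and exact measure definitions below are illustrative assumptions, not the paper's formal definitions: a "covering" score asks how many tuples selected by the condition actually satisfy the inclusion, and a "completeness" score asks how many inclusion-satisfying tuples the condition captures.

```python
def condition_quality(tuples, condition, included):
    """Precision/recall-style quality of a CIND condition (illustrative).
    `condition(t)` restricts the scope of the dependency;
    `included(t)` tests whether t's value appears in the referenced relation."""
    selected = [t for t in tuples if condition(t)]
    valid = [t for t in tuples if included(t)]
    # covering (precision-like): selected tuples that satisfy the inclusion
    covering = (sum(1 for t in selected if included(t)) / len(selected)
                if selected else 1.0)
    # completeness (recall-like): inclusion-satisfying tuples that are selected
    completeness = (sum(1 for t in valid if condition(t)) / len(valid)
                    if valid else 1.0)
    return covering, completeness
```

A discovery algorithm would then search for condition attributes and values whose scores exceed the given quality thresholds.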
CurEx
(2018)
The integration of diverse structured and unstructured information sources into a unified, domain-specific knowledge base is an important task in many areas. A well-maintained knowledge base enables data analysis in complex scenarios, such as risk analysis in the financial sector or investigating large data leaks, such as the Paradise or Panama papers. Both the creation of such knowledge bases and their continuous maintenance and curation involve many complex tasks and considerable manual effort. With CurEx, we present a modular system that allows structured and unstructured data sources to be integrated into a domain-specific knowledge base. In particular, we (i) enable the incremental improvement of each individual integration component; (ii) enable the selective generation of multiple knowledge graphs from the information contained in the knowledge base; and (iii) provide two distinct user interfaces tailored to the needs of data engineers and end-users, respectively. The former has curation capabilities and controls the integration process, whereas the latter focuses on the exploration of the generated knowledge graph.
Effective query optimization is a core feature of any database management system. While most query optimization techniques make use of simple metadata, such as cardinalities and other basic statistics, other optimization techniques are based on more advanced metadata, including data dependencies, such as functional, uniqueness, order, or inclusion dependencies. This survey provides an overview, intuitive descriptions, and classifications of query optimization and execution strategies that are enabled by data dependencies. We consider the most popular types of data dependencies and focus on optimization strategies that target relational database queries. The survey helps database vendors identify optimization opportunities, and DBMS researchers find related work and open research questions.
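One classic instance of dependency-enabled optimization is easy to sketch: if the columns of a SELECT DISTINCT contain a unique column combination, the result cannot have duplicates and the DISTINCT operator is redundant. The helper below is a hypothetical illustration of such a rewrite check, not code from the survey.

```python
def can_drop_distinct(distinct_columns, unique_column_combinations):
    """Return True if DISTINCT over `distinct_columns` is redundant:
    some known unique column combination is contained in the projected
    columns, so the projection is already duplicate-free."""
    return any(set(ucc) <= set(distinct_columns)
               for ucc in unique_column_combinations)
```

Analogous checks apply to the other dependency types, e.g. order dependencies can remove sort operators, and inclusion dependencies can eliminate redundant joins.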
Data Preparation
(2020)
Raw data are often messy: they follow different encodings, records are not well structured, values do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data management systems. The act of obtaining information from raw data relies on some data preparation process. Data preparation is integral to advanced data analysis and data management, not only for data science but for any data-driven applications. Existing data preparation tools are operational and useful, but there is still room for improvement and optimization. With increasing data volume and its messy nature, the demand for prepared data increases day by day. To cater to this demand, companies and researchers are developing techniques and tools for data preparation. To better understand the available data preparation systems, we have conducted a survey to investigate (1) prominent data preparation tools, (2) distinctive tool features, (3) the need for preliminary data processing even for these tools, and (4) features and abilities that are still lacking. We conclude with an argument in support of automatic and intelligent data preparation beyond traditional and simplistic techniques.
Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection. Our process workflow can be summarized as follows: it begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints for domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
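The second phase, leave-one-out pruning, can be sketched as a greedy loop (a simplified illustration under assumed interfaces; the `auc_pr` scoring callback and the `tolerance` parameter are placeholders, not the paper's exact procedure): repeatedly drop any preparation whose removal does not hurt the AUC-PR score, keeping only the preparations that actually contribute.

```python
def prune_preparations(preparations, auc_pr, tolerance=0.0):
    """Greedy leave-one-out pruning of data preparations (illustrative).
    `auc_pr(preps)` evaluates duplicate detection with the given list of
    preparations applied and returns the area under the PR curve."""
    selected = list(preparations)
    baseline = auc_pr(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for prep in list(selected):
            remaining = [p for p in selected if p != prep]
            score = auc_pr(remaining)
            # drop the preparation if the score does not degrade
            if score >= baseline - tolerance:
                selected, baseline = remaining, score
                improved = True
                break
    return selected
```

Each call to `auc_pr` implies a full matching run on the gold-standard sample, which is why the first phase's similarity-based filtering of the candidate preparations matters.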