Any system at play in a data-driven project has a fundamental requirement: the ability to load data. The de facto standard format to distribute and consume raw data is CSV. Yet the plain-text, flexible nature of this format often makes such files difficult to parse and load correctly, requiring cumbersome data preparation steps. We propose a benchmark to assess the robustness of systems in loading data from non-standard CSV formats and with structural inconsistencies. First, we formalize a model to describe the issues that affect real-world files and use it to derive a systematic "pollution" process to generate dialects for any given grammar. Our benchmark leverages the pollution framework for the CSV format. To guide pollution, we surveyed thousands of real-world, publicly available CSV files, recording the problems we encountered. We demonstrate the applicability of our benchmark by testing and scoring 16 different systems: popular CSV parsing frameworks, relational database tools, spreadsheet systems, and a data visualization tool.
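The idea of polluting a clean file and scoring how well a system loads the result can be illustrated with a minimal sketch. It is not the benchmark's actual pollution model or scoring scheme: the sample table, the four dialect variations, and the helper names `render`, `load_with_defaults`, and `score` are all assumptions made for illustration, and the "system under test" is simply Python's standard `csv` reader with default settings.

```python
import csv
import io

# Hypothetical ground-truth table (an assumption for this sketch).
ROWS = [["id", "name", "note"], ["1", "Ada", "likes a,b"], ["2", "Bob", "plain"]]

# Hypothetical pollutions: each one changes a single dialect parameter.
POLLUTIONS = [
    {},                    # standard dialect: comma-separated, double-quote quoting
    {"delimiter": ";"},    # semicolon-separated
    {"delimiter": "\t"},   # tab-separated
    {"quotechar": "'"},    # single-quote quoting
]

def render(rows, delimiter=",", quotechar='"'):
    """Serialize rows as text in the dialect given by delimiter and quotechar."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, quotechar=quotechar,
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

def load_with_defaults(text):
    """System under test: the csv module's reader with its default settings."""
    return list(csv.reader(io.StringIO(text)))

def score(rows, pollutions):
    """Fraction of polluted files that are loaded back exactly as the original rows."""
    ok = sum(load_with_defaults(render(rows, **p)) == rows for p in pollutions)
    return ok / len(pollutions)

if __name__ == "__main__":
    print(score(ROWS, POLLUTIONS))
```

In this toy setup only the standard dialect round-trips, because the default reader neither detects the changed delimiter nor recognizes single quotes as quoting characters. A fuller evaluation would presumably cover many more pollution types and grade partial success rather than exact equality.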