The search result changed since you submitted your search request. Documents might be displayed in a different sort order.
  • search hit 2 of 428
Back to Result List

Pollock: A Data Loading Benchmark

  • Any system at play in a data-driven project has a fundamental requirement: the ability to load data. The de-facto standard format to distribute and consume raw data is CSV. Yet, the plain text and flexible nature of this format make such files often difficult to parse and correctly load their content, requiring cumbersome data preparation steps. We propose a benchmark to assess the robustness of systems in loading data from non-standard CSV formats and with structural inconsistencies. First, we formalize a model to describe the issues that affect real-world files and use it to derive a systematic lpollutionz process to generate dialects for any given grammar. Our benchmark leverages the pollution framework for the csv format. To guide pollution, we have surveyed thousands of real-world, publicly available csv files, recording the problems we encountered. We demonstrate the applicability of our benchmark by testing and scoring 16 different systems: popular csv parsing frameworks, relational database tools, spreadsheet systems, and aAny system at play in a data-driven project has a fundamental requirement: the ability to load data. The de-facto standard format to distribute and consume raw data is CSV. Yet, the plain text and flexible nature of this format make such files often difficult to parse and correctly load their content, requiring cumbersome data preparation steps. We propose a benchmark to assess the robustness of systems in loading data from non-standard CSV formats and with structural inconsistencies. First, we formalize a model to describe the issues that affect real-world files and use it to derive a systematic lpollutionz process to generate dialects for any given grammar. Our benchmark leverages the pollution framework for the csv format. To guide pollution, we have surveyed thousands of real-world, publicly available csv files, recording the problems we encountered. We demonstrate the applicability of our benchmark by testing and scoring 16 different systems: popular csv parsing frameworks, relational database tools, spreadsheet systems, and a data visualization tool.show moreshow less

Export metadata

Additional Services

Search Google Scholar Statistics
Metadaten
Author details:Gerardo VitaglianoORCiDGND, Mazhar HameedORCiD, Lan JiangORCiDGND, Lucas Reisener, Eugene Wu, Felix NaumannORCiDGND
DOI:https://doi.org/10.14778/3594512.3594518
ISSN:2150-8097
Title of parent work (English):Proceedings of the VLDB Endowment
Publisher:Association for Computing Machinery
Place of publishing:New York
Publication type:Article
Language:English
Date of first publication:2023/04/01
Publication year:2023
Release date:2024/06/25
Volume:16
Issue:8
Number of pages:13
First page:1870
Last Page:1882
Funding institution:HPI research school on Data Science and Engineering
Organizational units:Digital Engineering Fakultät / Hasso-Plattner-Institut für Digital Engineering GmbH
DDC classification:0 Informatik, Informationswissenschaft, allgemeine Werke / 00 Informatik, Wissen, Systeme / 004 Datenverarbeitung; Informatik
Peer review:Referiert
Accept ✔
This website uses technically necessary session cookies. By continuing to use the website, you agree to this. You can find our privacy policy here.