Pollock: A Data Loading Benchmark

Vitagliano, Gerardo; Hameed, Mazhar; Jiang, Lan; Reisener, Lucas; Wu, Eugene; Naumann, Felix

doi:10.14778/3594512.3594518

Any system at play in a data-driven project has a fundamental requirement: the ability to load data. The de facto standard format to distribute and consume raw data is CSV. Yet, the plain-text and flexible nature of this format often makes such files difficult to parse and their content difficult to load correctly, requiring cumbersome data preparation steps. We propose a benchmark to assess the robustness of systems in loading data from non-standard CSV formats and with structural inconsistencies. First, we formalize a model to describe the issues that affect real-world files and use it to derive a systematic "pollution" process to generate dialects for any given grammar. Our benchmark leverages the pollution framework for the CSV format. To guide pollution, we have surveyed thousands of real-world, publicly available CSV files, recording the problems we encountered. We demonstrate the applicability of our benchmark by testing and scoring 16 different systems: popular CSV parsing frameworks, relational database tools, spreadsheet systems, and a data visualization tool.
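
The idea of polluting a clean file and then scoring how well a system loads it can be illustrated with a minimal sketch. This is not the authors' implementation: the sample rows, the single delimiter-swap pollution, the helper names (`write_standard`, `pollute_delimiter`, `load_with_defaults`, `cell_score`), and the cell-level score are all assumptions made for illustration, and the system under test is simply Python's standard `csv` reader with default settings.

```python
import csv
import io

# Hypothetical clean table, used only for this illustration.
ROWS = [
    ["id", "name", "comment"],
    ["1", "Ada", "likes, commas"],
    ["2", "Bob", 'says "hi"'],
]


def write_standard(rows):
    """Serialize rows with the csv module's default dialect (comma-delimited)."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


def pollute_delimiter(rows, delimiter=";"):
    """Serialize the same rows in a non-standard dialect by swapping the delimiter."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter).writerows(rows)
    return buf.getvalue()


def load_with_defaults(text):
    """Load text with default parser settings, as a system under test might."""
    return list(csv.reader(io.StringIO(text)))


def cell_score(expected, loaded):
    """Fraction of expected cells found at the same row and column (assumed metric)."""
    total = sum(len(row) for row in expected)
    hits = 0
    for i, row in enumerate(expected):
        for j, cell in enumerate(row):
            if i < len(loaded) and j < len(loaded[i]) and loaded[i][j] == cell:
                hits += 1
    return hits / total


clean_score = cell_score(ROWS, load_with_defaults(write_standard(ROWS)))
polluted_score = cell_score(ROWS, load_with_defaults(pollute_delimiter(ROWS)))
print(f"standard file: {clean_score:.2f}, polluted file: {polluted_score:.2f}")
```

A parser that loads the standard file perfectly can lose cells once the delimiter changes, which is the kind of robustness gap a loading benchmark is meant to expose. The benchmark described above covers more kinds of pollution than this single example.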

Metadata
Author details:	Gerardo Vitagliano, Mazhar Hameed, Lan Jiang, Lucas Reisener, Eugene Wu, Felix Naumann
DOI:	https://doi.org/10.14778/3594512.3594518
ISSN:	2150-8097
Title of parent work (English):	Proceedings of the VLDB Endowment
Publisher:	Association for Computing Machinery
Place of publishing:	New York
Publication type:	Article
Language:	English
Date of first publication:	2023/04/01
Publication year:	2023
Release date:	2024/06/25
Volume:	16
Issue:	8
Number of pages:	13
First page:	1870
Last Page:	1882
Funding institution:	HPI research school on Data Science and Engineering
Organizational units:	Digital Engineering Fakultät / Hasso-Plattner-Institut für Digital Engineering GmbH
DDC classification:	0 Computer science, information and general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Peer review:	Refereed
