Refine
Has Fulltext
- yes (2)
Year of publication
- 2024 (2)
Document Type
- Doctoral Thesis (2) (remove)
Language
- English (2) (remove)
Is part of the Bibliography
- yes (2)
Keywords
- Datenverwaltung (2) (remove)
Institute
Efficiently managing large state is a key challenge for data management systems. Traditionally, state is split into fast but volatile state in memory for processing and persistent but slow state on secondary storage for durability. Persistent memory (PMem), as a new technology in the storage hierarchy, blurs the lines between these states by offering both byte-addressability and low latency like DRAM as well persistence like secondary storage. These characteristics have the potential to cause a major performance shift in database systems.
Driven by the potential impact that PMem has on data management systems, in this thesis we explore their use of PMem. We first evaluate the performance of real PMem hardware in the form of Intel Optane in a wide range of setups. To this end, we propose PerMA-Bench, a configurable benchmark framework that allows users to evaluate the performance of customizable database-related PMem access. Based on experimental results obtained with PerMA-Bench, we discuss findings and identify general and implementation-specific aspects that influence PMem performance and should be considered in future work to improve PMem-aware designs. We then propose Viper, a hybrid PMem-DRAM key-value store. Based on PMem-aware access patterns, we show how to leverage PMem and DRAM efficiently to design a key database component. Our evaluation shows that Viper outperforms existing key-value stores by 4–18x for inserts while offering full data persistence and achieving similar or better lookup performance. Next, we show which changes must be made to integrate PMem components into larger systems. By the example of stream processing engines, we highlight limitations of current designs and propose a prototype engine that overcomes these limitations. This allows our prototype to fully leverage PMem's performance for its internal state management. Finally, in light of Optane's discontinuation, we discuss how insights from PMem research can be transferred to future multi-tier memory setups by the example of Compute Express Link (CXL).
Overall, we show that PMem offers high performance for state management, bridging the gap between fast but volatile DRAM and persistent but slow secondary storage. Although Optane was discontinued, new memory technologies are continuously emerging in various forms and we outline how novel designs for them can build on insights from existing PMem research.
Data preparation stands as a cornerstone in the landscape of data science workflows, commanding a significant portion—approximately 80%—of a data scientist's time. The extensive time consumption in data preparation is primarily attributed to the intricate challenge faced by data scientists in devising tailored solutions for downstream tasks. This complexity is further magnified by the inadequate availability of metadata, the often ad-hoc nature of preparation tasks, and the necessity for data scientists to grapple with a diverse range of sophisticated tools, each presenting its unique intricacies and demands for proficiency.
Previous research in data management has traditionally concentrated on preparing the content within columns and rows of a relational table, addressing tasks, such as string disambiguation, date standardization, or numeric value normalization, commonly referred to as data cleaning. This focus assumes a perfectly structured input table. Consequently, the mentioned data cleaning tasks can be effectively applied only after the table has been successfully loaded into the respective data cleaning environment, typically in the later stages of the data processing pipeline.
While current data cleaning tools are well-suited for relational tables, extensive data repositories frequently contain data stored in plain text files, such as CSV files, due to their adaptable standard. Consequently, these files often exhibit tables with a flexible layout of rows and columns, lacking a relational structure. This flexibility often results in data being distributed across cells in arbitrary positions, typically guided by user-specified formatting guidelines.
Effectively extracting and leveraging these tables in subsequent processing stages necessitates accurate parsing. This thesis emphasizes what we define as the “structure” of a data file—the fundamental characters within a file essential for parsing and comprehending its content. Concentrating on the initial stages of the data preprocessing pipeline, this thesis addresses two crucial aspects: comprehending the structural layout of a table within a raw data file and automatically identifying and rectifying any structural issues that might hinder its parsing. Although these issues may not directly impact the table's content, they pose significant challenges in parsing the table within the file.
Our initial contribution comprises an extensive survey of commercially available data preparation tools. This survey thoroughly examines their distinct features, the lacking features, and the necessity for preliminary data processing despite these tools. The primary goal is to elucidate the current state-of-the-art in data preparation systems while identifying areas for enhancement. Furthermore, the survey explores the encountered challenges in data preprocessing, emphasizing opportunities for future research and improvement.
Next, we propose a novel data preparation pipeline designed for detecting and correcting structural errors. The aim of this pipeline is to assist users at the initial preprocessing stage by ensuring the correct loading of their data into their preferred systems. Our approach begins by introducing SURAGH, an unsupervised system that utilizes a pattern-based method to identify dominant patterns within a file, independent of external information, such as data types, row structures, or schemata. By identifying deviations from the dominant pattern, it detects ill-formed rows. Subsequently, our structure correction system, TASHEEH, gathers the identified ill-formed rows along with dominant patterns and employs a novel pattern transformation algebra to automatically rectify errors. Our pipeline serves as an end-to-end solution, transforming a structurally broken CSV file into a well-formatted one, usually suitable for seamless loading.
Finally, we introduce MORPHER, a user-friendly GUI integrating the functionalities of both SURAGH and TASHEEH. This interface empowers users to access the pipeline's features through visual elements. Our extensive experiments demonstrate the effectiveness of our data preparation systems, requiring no user involvement. Both SURAGH and TASHEEH outperform existing state-of-the-art methods significantly in both precision and recall.