The data quality of real-world datasets needs to be constantly monitored and maintained to allow organizations and individuals to reliably use their data. Data integration projects in particular suffer from poor initial data quality and, as a consequence, consume more effort and money. Commercial products and research prototypes for data cleansing and integration help users to improve the quality of individual and combined datasets. They can be divided into either standalone systems or database management system (DBMS) extensions. On the one hand, standalone systems do not interact well with DBMSs and require time-consuming data imports and exports. On the other hand, DBMS extensions are often limited by the underlying system and do not cover the full set of data cleansing and integration tasks.
We overcome both limitations by implementing a concise set of five data cleansing and integration operators on the parallel data analytics platform Stratosphere. We define the semantics of the operators, present their parallel implementation, and devise optimization techniques for individual operators and combinations thereof. Users specify declarative queries in our query language METEOR with our new operators to improve the data quality of individual datasets or integrate them into larger datasets. By integrating the data cleansing operators into the higher level language layer of Stratosphere, users can easily combine cleansing operators with operators from other domains, such as information extraction, into complex data flows. Through a generic description of the operators, the Stratosphere optimizer reorders operators even from different domains to find better query plans.
As a case study, we reimplemented a part of the large Open Government Data integration project GovWILD with our new operators and show that our queries run significantly faster than the original GovWILD queries, which rely on relational operators. Evaluation reveals that our operators exhibit good scalability on up to 100 cores, so that even larger inputs can be efficiently processed by scaling out to more machines. Finally, our scripts are considerably shorter than the original GovWILD scripts, which results in better maintainability of the scripts.
This work introduces concepts and corresponding tool support to enable a complementary approach in dealing with recovery. Programmers need to recover a development state, or a part thereof, when previously made changes reveal undesired implications. However, when the need arises suddenly and unexpectedly, recovery often involves expensive and tedious work. To avoid tedious work, literature recommends keeping away from unexpected recovery demands by following a structured and disciplined approach, which consists of the application of various best practices including working only on one thing at a time, performing small steps, as well as making proper use of versioning and testing tools. However, the attempt to avoid unexpected recovery is both time-consuming and error-prone. On the one hand, it requires disproportionate effort to minimize the risk of unexpected situations. On the other hand, applying recommended practices selectively, which saves time, can hardly avoid recovery. In addition, the constant need for foresight and self-control has unfavorable implications. It is exhausting and impedes creative problem solving. This work proposes to make recovery fast and easy and introduces corresponding support called CoExist. Such dedicated support turns situations of unanticipated recovery from tedious experiences into pleasant ones. It makes recovery fast and easy to accomplish, even if explicit commits are unavailable or tests have been ignored for some time. When mistakes and unexpected insights are no longer associated with tedious corrective actions, programmers are encouraged to change source code as a means to reason about it, as opposed to making changes only after structuring and evaluating them mentally. This work further reports on an implementation of the proposed tool support in the Squeak/Smalltalk development environment. The development of the tools has been accompanied by regular performance and usability tests.
In addition, this work investigates whether the proposed tools affect programmers’ performance. In a controlled lab study, 22 participants improved the design of two different applications. Using a repeated measurement setup, the study examined the effect of providing CoExist on programming performance. The result of analyzing 88 hours of programming suggests that built-in recovery support as provided with CoExist has a positive effect on programming performance in explorative programming tasks.
In the field of disk-based parallel database management systems, there exists a great variety of solutions based on a shared-storage or a shared-nothing architecture. In contrast, main memory-based parallel database management systems are dominated solely by the shared-nothing approach as it preserves the in-memory performance advantage by processing data locally on each server. We argue that this unilateral development is going to cease due to the combination of the following three trends: a) Today's network technology features remote direct memory access (RDMA) and narrows the performance gap between accessing main memory inside a server and that of a remote server to, and even below, a single order of magnitude. b) Modern storage systems scale gracefully, are elastic, and provide high availability. c) A modern storage system such as Stanford's RAMCloud even keeps all data resident in main memory. Exploiting these characteristics in the context of a main-memory parallel database management system is desirable. The advent of RDMA-enabled network technology makes the creation of a parallel main memory DBMS based on a shared-storage approach feasible.
This thesis describes building a columnar database on shared main memory-based storage. The thesis discusses the resulting architecture (Part I), the implications on query processing (Part II), and presents an evaluation of the resulting solution in terms of performance, high availability, and elasticity (Part III).
In our architecture, we use Stanford's RAMCloud as shared storage, and the self-designed and developed in-memory AnalyticsDB as relational query processor on top. AnalyticsDB encapsulates data access and operator execution via an interface which allows seamless switching between local and remote main memory, while RAMCloud provides not only storage capacity, but also processing power. Combining both aspects allows pushing down the execution of database operators into the storage system. We describe how the columnar data processed by AnalyticsDB is mapped to RAMCloud's key-value data model and how the performance advantages of columnar data storage can be preserved.
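As an illustration of what mapping columnar data onto a key-value model can look like, the following is a minimal sketch and not the mapping used by AnalyticsDB. It assumes each column is split into fixed-size chunks and each chunk is stored under a key built from table name, column name, and chunk index. A plain Python dict stands in for the key-value store; the key layout, the function names, and the chunk size are all hypothetical.

```python
def column_to_kv(table, column, values, chunk_size):
    """Split one column into chunks and return them as key-value pairs.

    Hypothetical key layout: "<table>/<column>/<chunk index>".
    """
    store = {}
    for start in range(0, len(values), chunk_size):
        key = f"{table}/{column}/{start // chunk_size}"
        store[key] = values[start:start + chunk_size]
    return store


def scan_column(store, table, column):
    """Yield the values of one column in their original order."""
    prefix = f"{table}/{column}/"
    chunk_ids = sorted(
        int(key[len(prefix):]) for key in store if key.startswith(prefix)
    )
    for chunk_id in chunk_ids:
        # Values of one column stay contiguous inside a chunk, so a scan
        # touches only the keys of the column it needs.
        yield from store[prefix + str(chunk_id)]


store = column_to_kv("orders", "price", [10, 20, 30, 40, 50], chunk_size=2)
prices = list(scan_column(store, "orders", "price"))
```

Under this assumed layout, a scan fetches only the chunks of the requested column, which is one way the locality benefit of columnar storage could carry over to a key-value store.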
The combination of fast network technology and the possibility to execute database operators in the storage system opens the discussion for site selection. We construct a system model that allows the estimation of operator execution costs in terms of network transfer, data processed in memory, and wall time. This can be used for database operators that work on one relation at a time - such as a scan or materialize operation - to discuss the site selection problem (data pull vs. operator push). Since a database query translates to the execution of several database operators, it is possible that the optimal site selection varies per operator. For the execution of a database operator that works on two (or more) relations at a time, such as a join, the system model is enriched by additional factors such as the chosen algorithm (e.g. Grace- vs. Distributed Block Nested Loop Join vs. Cyclo-Join), the data partitioning of the respective relations, and their overlapping as well as the allowed resource allocation.
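To make the site-selection trade-off concrete, the following is a minimal sketch and not the system model of the thesis. It assumes a scan over a single column: with data pull, the whole column crosses the network and is scanned on the query-processing node; with operator push, the column is scanned inside the storage node and only the qualifying values cross the network. All parameter names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScanCostInputs:
    column_bytes: float              # size of the scanned column
    selectivity: float               # fraction of values that qualify (0..1)
    network_bytes_per_s: float       # network throughput
    local_scan_bytes_per_s: float    # scan speed on the query-processing node
    storage_scan_bytes_per_s: float  # scan speed inside the storage node


def data_pull_seconds(c):
    """Ship the full column, then scan it locally."""
    transfer = c.column_bytes / c.network_bytes_per_s
    scan = c.column_bytes / c.local_scan_bytes_per_s
    return transfer + scan


def operator_push_seconds(c):
    """Scan inside the storage node, then ship only the qualifying values."""
    scan = c.column_bytes / c.storage_scan_bytes_per_s
    transfer = c.column_bytes * c.selectivity / c.network_bytes_per_s
    return scan + transfer


def choose_site(c):
    """Pick the cheaper strategy for this one operator."""
    if operator_push_seconds(c) < data_pull_seconds(c):
        return "operator push"
    return "data pull"


selective_scan = ScanCostInputs(1e9, 0.01, 1e9, 4e9, 2e9)
decision = choose_site(selective_scan)
```

In this simplified model, a highly selective scan favors operator push because little data crosses the network, whereas a scan that returns most of the column on a slow storage node favors data pull. Because the decision is made per operator, different operators of the same query can end up at different sites.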
We present an evaluation on a cluster with 60 nodes where all nodes are connected via RDMA-enabled network equipment. We show that query processing performance is about 2.4x slower if everything is done via the data pull operator execution strategy (i.e. RAMCloud is being used only for data access) and about 27% slower if operator execution is also supported inside RAMCloud (in comparison to operating only on main memory inside a server without any network communication at all). The fast crash recovery feature of RAMCloud can be leveraged to provide high availability, e.g. a server crash during query execution only delays the query response for about one second. Our solution is elastic in a way that it can adapt to changing workloads a) within seconds, b) without interruption of the ongoing query processing, and c) without manual intervention.
Concepts and techniques for integration, analysis and visualization of massive 3D point clouds
(2014)
Remote sensing methods, such as LiDAR and image-based photogrammetry, are established approaches for capturing the physical world. Professional and low-cost scanning devices are capable of generating dense 3D point clouds. Typically, these 3D point clouds are preprocessed by GIS and are then used as input data in a variety of applications such as urban planning, environmental monitoring, disaster management, and simulation. The availability of area-wide 3D point clouds will drastically increase in the future due to the availability of novel capturing methods (e.g., driver assistance systems) and low-cost scanning devices. Applications, systems, and workflows will therefore face large collections of redundant, up-to-date 3D point clouds and have to cope with massive amounts of data. Hence, approaches are required that will efficiently integrate, update, manage, analyze, and visualize 3D point clouds. In this paper, we define requirements for a system infrastructure that enables the integration of 3D point clouds from heterogeneous capturing devices and different timestamps. Change detection and update strategies for 3D point clouds are presented that reduce storage requirements and offer new insights for analysis purposes. We also present an approach that attributes 3D point clouds with semantic information (e.g., object class category information), which enables more effective data processing, analysis, and visualization. Out-of-core real-time rendering techniques then allow for an interactive exploration of the entire 3D point cloud and the corresponding analysis results. Web-based visualization services are utilized to make 3D point clouds available to a large community. The proposed concepts and techniques are designed to establish 3D point clouds as base datasets, as well as rendering primitives for analysis and visualization tasks, which allow operations to be performed directly on the point data. 
Finally, we evaluate the presented system, report on its applications, and discuss further research challenges.
Software maintenance encompasses any changes made to a software system after its initial deployment and is thereby one of the key phases in the typical software-engineering lifecycle. In software maintenance, we primarily need to understand structural and behavioral aspects, which are difficult to obtain, e.g., by code reading. Software analysis is therefore a vital tool for maintaining these systems: It provides the means, preferably automated, to extract and evaluate information from their artifacts such as software structure, runtime behavior, and related processes. However, such analysis typically results in massive raw data, so that even experienced engineers face difficulties directly examining, assessing, and understanding these data. Among other things, they require tools with which to explore the data if no clear question can be formulated beforehand. For this, software analysis and visualization provide their users with powerful interactive means. These enable the automation of tasks and, particularly, the acquisition of valuable and actionable insights into the raw data. For instance, one means for exploring runtime behavior is trace visualization. This thesis aims at extending and improving the tool set for visual software analysis by concentrating on several open challenges in the fields of dynamic and static analysis of software systems. This work develops a series of concepts and tools for the exploratory visualization of the respective data to support users in finding and retrieving information on the system artifacts concerned. This is a difficult task, due to the lack of appropriate visualization metaphors; in particular, the visualization of complex runtime behavior poses various questions and challenges of both a technical and conceptual nature.
This work focuses on a set of visualization techniques for visually representing control-flow related aspects of software traces from shared-memory software systems: A trace-visualization concept based on icicle plots aids in understanding both single-threaded and multi-threaded runtime behavior on the function level. The concept’s extensibility further allows the visualization and analysis of specific aspects of multi-threading such as synchronization, the correlation of such traces with data from static software analysis, and a comparison between traces. Moreover, complementary techniques for simultaneously analyzing system structures and the evolution of related attributes are proposed. These aim at facilitating long-term planning of software architecture and supporting management decisions in software projects by extensions to the circular-bundle-view technique: An extension to 3-dimensional space allows for the use of additional variables simultaneously; interaction techniques allow for the modification of structures in a visual manner. The concepts and techniques presented here are generic and, as such, can be applied beyond software analysis for the visualization of similarly structured data. The techniques' practicability is demonstrated by several qualitative studies using subject data from industry-scale software systems. The studies provide initial evidence that the techniques' application yields useful insights into the subject data and its interrelationships in several scenarios.
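The basic idea of an icicle plot for a function-level trace can be sketched as follows. This is a generic illustration, not the tool described in the thesis: it assumes a single thread whose trace is a properly nested list of enter and exit events, and it turns every call into a rectangle whose horizontal extent is the call's time interval and whose row is its call depth. The event format and the function name are hypothetical.

```python
def icicle_rectangles(events):
    """Convert a nested enter/exit trace into icicle-plot rectangles.

    events: list of (timestamp, kind, function_name) with kind being
            "enter" or "exit"; assumed to be properly nested.
    Returns a list of (function_name, start, end, depth) tuples, one per
    call, in the order in which the calls finish.
    """
    stack = []       # currently open calls as (function_name, start)
    rectangles = []
    for timestamp, kind, name in events:
        if kind == "enter":
            stack.append((name, timestamp))
        else:
            open_name, start = stack.pop()
            # After the pop, the stack height equals the depth of this call.
            rectangles.append((open_name, start, timestamp, len(stack)))
    return rectangles


trace = [
    (0, "enter", "main"),
    (1, "enter", "parse"),
    (3, "exit", "parse"),
    (4, "enter", "render"),
    (5, "exit", "render"),
    (6, "exit", "main"),
]
layout = icicle_rectangles(trace)
```

A renderer would draw each tuple as one box, so callees appear directly beneath their caller and the width of a box reflects how long the call took.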
Modern 3D geovisualization systems (3DGeoVSs) are complex and evolving systems that are required to be adaptable and leverage distributed resources, including massive geodata. This article focuses on 3DGeoVSs built based on the principles of service-oriented architectures, standards and image-based representations (SSI) to address practically relevant challenges and potentials. Such systems facilitate resource sharing and agile and efficient system construction and change in an interoperable manner, while exploiting images as efficient, decoupled and interoperable representations. The software architecture of a 3DGeoVS and its underlying visualization model have strong effects on the system's quality attributes and support various system life cycle activities. This article contributes a software reference architecture (SRA) for 3DGeoVSs based on SSI that can be used to design, describe and analyze concrete software architectures with the intended primary benefit of an increase in effectiveness and efficiency in such activities. The SRA integrates existing, proven technology and novel contributions in a unique manner. As the foundation for the SRA, we propose the generalized visualization pipeline model that generalizes and overcomes expressiveness limitations of the prevalent visualization pipeline model. To facilitate exploiting image-based representations (IReps), the SRA integrates approaches for the representation, provisioning and styling of and interaction with IReps. Five applications of the SRA provide proofs of concept for the general applicability and utility of the SRA. A qualitative evaluation indicates the overall suitability of the SRA, its applications and the general approach of building 3DGeoVSs based on SSI.
A growing number of enterprises use complex event processing for monitoring and controlling their operations, while business process models are used to document working procedures. In this work, we propose a comprehensive method for complex event processing optimization using business process models. Our proposed method is based on the extraction of behavioral constraints that are used, in turn, to rewrite patterns for event detection, and select and transform execution plans. We offer a set of rewriting rules that is shown to be complete with respect to the all, seq, and any patterns. The effectiveness of our method is demonstrated in an experimental evaluation with a large number of processes from an insurance company. We illustrate that the proposed optimization leads to significant savings in query processing. By integrating the optimization in state-of-the-art systems for event pattern matching, we demonstrate that these savings materialize in different technical infrastructures and can be combined with existing optimization techniques.
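How a behavioral constraint can simplify an event pattern may be sketched as follows. The two rules shown are illustrative assumptions, not the rule set of the paper: if the process model guarantees that one event type always precedes another, an all pattern over the two can be replaced by the cheaper seq pattern; if the two event types never occur in the same process instance, an all or seq pattern over them can never match. Patterns are modeled as plain tuples, and all names are hypothetical.

```python
def rewrite(pattern, strict_order, exclusive):
    """Rewrite a binary event pattern using behavioral constraints.

    pattern:      (operator, a, b) with operator in {"all", "seq", "any"}
    strict_order: set of (x, y) pairs meaning x always occurs before y
    exclusive:    set of frozenset({x, y}) meaning x and y never co-occur
    Returns the rewritten pattern, or None if it can never match.
    """
    operator, a, b = pattern

    # Assumed rule 1: both events are required but never co-occur.
    if operator in ("all", "seq") and frozenset((a, b)) in exclusive:
        return None

    # Assumed rule 2: the order is already fixed by the process model,
    # so the order-agnostic "all" can become an ordered "seq".
    if operator == "all":
        if (a, b) in strict_order:
            return ("seq", a, b)
        if (b, a) in strict_order:
            return ("seq", b, a)

    return pattern


order = {("claim_received", "claim_checked")}
never_together = {frozenset(("claim_approved", "claim_rejected"))}
optimized = rewrite(("all", "claim_checked", "claim_received"), order, never_together)
```

Here `optimized` becomes a seq pattern with `claim_received` first, while an all pattern over `claim_approved` and `claim_rejected` would be dropped as unsatisfiable.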
Virtual 3D city models serve as integration platforms for complex geospatial and georeferenced information and as a medium for effective communication of spatial information. In order to explore these information spaces, navigation techniques for controlling the virtual camera are required to facilitate wayfinding and movement. However, navigation is not a trivial task and many available navigation techniques do not support users effectively and efficiently with their respective skills and tasks. In this article, we present an assisting, constrained navigation technique for multiscale virtual 3D city models that is based on three basic principles: users point to navigate, users are led by suggestions, and the exploitation of semantic, multiscale, hierarchical structurings of city models. The technique particularly supports users with low navigation and virtual camera control skills but is also valuable for experienced users. It supports exploration, search, inspection, and presentation tasks, is easy to learn and use, supports orientation, is efficient, and yields effective view properties. In particular, the technique is suitable for interactive kiosks and mobile devices with a touch display and low computing resources and for use in mobile situations where users only have restricted resources for operating the application. We demonstrate the validity of the proposed navigation technique by presenting an implementation and evaluation results. The implementation is based on service-oriented architectures, standards, and image-based representations and allows exploring massive virtual 3D city models particularly on mobile devices with limited computing resources. Results of a user study comparing the proposed navigation technique with standard techniques suggest that the proposed technique provides the targeted properties, and that it is more advantageous to novice than to expert users.
Text displayed in a video is an essential part of the high-level semantic information of the video content. Therefore, video text can be used as a valuable source for automated video indexing in digital video libraries. In this paper, we propose a workflow for video text detection and recognition. In the text detection stage, we have developed a fast localization-verification scheme, in which an edge-based multi-scale text detector first identifies potential text candidates with high recall rate. Then, detected candidate text lines are refined by using an image entropy-based filter. Finally, Stroke Width Transform (SWT)- and Support Vector Machine (SVM)-based verification procedures are applied to eliminate the false alarms. For text recognition, we have developed a novel skeleton-based binarization method in order to separate text from complex backgrounds to make it processable for standard OCR (Optical Character Recognition) software. Operability and accuracy of the proposed text detection and binarization methods have been evaluated by using publicly available test data sets.
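The entropy-based refinement step can be illustrated with a minimal sketch. It is not the filter from the paper: it assumes each candidate region is given as a flat list of grayscale values, computes the Shannon entropy of the intensity histogram, and discards candidates below a threshold, on the assumption that nearly uniform regions are unlikely to contain text. The function names and the threshold are hypothetical.

```python
import math
from collections import Counter


def image_entropy(pixels):
    """Shannon entropy (in bits) of the grayscale histogram of a region."""
    total = len(pixels)
    if total == 0:
        return 0.0
    counts = Counter(pixels)
    return -sum(
        (count / total) * math.log2(count / total) for count in counts.values()
    )


def keep_candidates(regions, min_entropy):
    """Keep only candidate regions whose entropy reaches the threshold."""
    return [region for region in regions if image_entropy(region) >= min_entropy]


flat_background = [128] * 16             # a single gray level
two_level_region = [0] * 8 + [255] * 8   # dark strokes on a bright background
kept = keep_candidates([flat_background, two_level_region], min_entropy=0.5)
```

With these inputs the flat region has an entropy of 0 bits and is dropped, while the two-level region has an entropy of 1 bit and is kept for the later verification stages.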