TY  - JOUR
A1  - Teske, Daniel
T1  - Geocoder accuracy ranking
JF  - Process design for natural scientists: an agile model-driven approach
N2  - Finding an address on a map is sometimes tricky: the chosen map application may be unfamiliar with the enclosed region. There are several geocoders on the market, they have different databases and algorithms to compute the query. Consequently, the geocoding results differ in their quality. Fortunately the geocoders provide a rich set of metadata. The workflow described in this paper compares this metadata with the aim to find out which geocoder is offering the best-fitting coordinate for a given address.
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 161
EP  - 174
PB  - Springer
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Sens, Henriette
T1  - Web-Based map generalization tools put to the test: a jABC workflow
JF  - Process Design for Natural Scientists: an agile model-driven approach
N2  - Geometric generalization is a fundamental concept in the digital mapping process. An increasing amount of spatial data is provided on the web as well as a range of tools to process it. This jABC workflow is used for the automatic testing of web-based generalization services like mapshaper.org by executing its functionality, overlaying both datasets before and after the transformation and displaying them visually in a .tif file. Mostly Web Services and command line tools are used to build an environment where ESRI shapefiles can be uploaded, processed through a chosen generalization service and finally visualized in Irfanview.
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 175
EP  - 185
PB  - Springer
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Noack, Franziska
T1  - CREADED: Colored-Relief application for digital elevation data
JF  - Process design for natural scientists: an agile model-driven approach
N2  - In the geoinformatics field, remote sensing data is often used for analyzing the characteristics of the current investigation area. This includes DEMs, which are simple raster grids containing grey scales representing the respective elevation values. The project CREADED that is presented in this paper aims at making these monochrome raster images more significant and more intuitively interpretable. For this purpose, an executable interactive model for creating a colored and relief-shaded Digital Elevation Model (DEM) has been designed using the jABC framework. The process is based on standard jABC-SIBs and SIBs that provide specific GIS functions, which are available as Web services, command line tools and scripts.
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 186
EP  - 199
PB  - Springer
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Respondek, Tobias
T1  - A workflow for computing potential areas for wind turbines
JF  - Process design for natural scientists: an agile model-driven approach
N2  - This paper describes the implementation of a workflow model for service-oriented computing of potential areas for wind turbines in jABC. By implementing a re-executable model the manual effort of a multi-criteria site analysis can be reduced. The aim is to determine the shift of typical geoprocessing tools of geographic information systems (GIS) from the desktop to the web. The analysis is based on a vector data set and mainly uses web services of the “Center for Spatial Information Science and Systems” (CSISS). This paper discusses effort, benefits and problems associated with the use of the web services.
Y1  - 2014
SN  - 978-3-662-45005-5
IS  - 500
SP  - 200
EP  - 215
PB  - Springer
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Scheele, Lasse
T1  - Location analysis for placing artificial reefs
JF  - Process design for natural scientists: an agile model-driven approach
N2  - Location analyses are among the most common tasks while working with spatial data and geographic information systems. Automating the most frequently used procedures is therefore an important aspect of improving their usability. In this context, this project aims to design and implement a workflow, providing some basic tools for a location analysis. For the implementation with jABC, the workflow was applied to the problem of finding a suitable location for placing an artificial reef. For this analysis three parameters (bathymetry, slope and grain size of the ground material) were taken into account, processed, and visualized with the The Generic Mapping Tools (GMT), which were integrated into the workflow as jETI-SIBs. The implemented workflow thereby showed that the approach to combine jABC with GMT resulted in an user-centric yet user-friendly tool with high-quality cartographic outputs.
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 216
EP  - 228
PB  - Springer
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Kind, Josephine
T1  - Creation of topographic maps
JF  - Process design for natural scientists: an agile model-driven approach
N2  - Location analyses are among the most common tasks while working with spatial data and geographic information systems. Automating the most frequently used procedures is therefore an important aspect of improving their usability. In this context, this project aims to design and implement a workflow, providing some basic tools for a location analysis. For the implementation with jABC, the workflow was applied to the problem of finding a suitable location for placing an artificial reef. For this analysis three parameters (bathymetry, slope and grain size of the ground material) were taken into account, processed, and visualized with the The Generic Mapping Tools (GMT), which were integrated into the workflow as jETI-SIBs. The implemented workflow thereby showed that the approach to combine jABC with GMT resulted in an user-centric yet user-friendly tool with high-quality cartographic outputs.
Y1  - 2014
SN  - 978-3-662-45005-5
IS  - 500
SP  - 229
EP  - 238
PB  - Springer
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Holler, Robin
T1  - GraffDok - a graffiti documentation application
JF  - Process design for natural scientists: an agile model-driven approach
N2  - GraffDok is an application helping to maintain an overview over sprayed images somewhere in a city. At the time of writing it aims at vandalism rather than at beautiful photographic graffiti in an underpass. Looking at hundreds of tags and scribbles on monuments, house walls, etc. it would be interesting to not only record them in writing but even make them accessible electronically, including images.
GraffDok’s workflow is simple and only requires an EXIF-GPS-tagged photograph of a graffito. It automatically determines its location by using reverse geocoding with the given GPS-coordinates and the Gisgraphy WebService. While asking the user for some more meta data, GraffDok analyses the image in parallel with this and tries to detect fore- and background – before extracting the drawing lines and make them stand alone. The command line based tool ImageMagick is used here as well as for accessing EXIF data.
Any meta data is written to csv-files, which will stay easily accessible and can be integrated in TeX-files as well. The latter ones are converted to PDF at the end of the workflow, containing a table about all graffiti and a summary for each – including the generated characteristic graffiti pattern image.
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 239
EP  - 251
PB  - Springer
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Reso, Judith
ED  - Lambrecht, Anna-Lena
ED  - Margaria, Tiziana
T1  - Protein Classification Workflow
JF  - Process Design for Natural Scientists: an agile model-driven approach
N2  - The protein classification workflow described in this report enables users to get information about a novel protein sequence automatically. The information is derived by different bioinformatic analysis tools which calculate or predict features of a protein sequence. Also, databases are used to compare the novel sequence with known proteins.
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 65
EP  - 72
PB  - Springer Verlag
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Schulze, Gunnar
T1  - Workflow for rapid metagenome analysis
JF  - Process design for natural scientists: an agile model-driven approach
N2  - Analyses of metagenomes in life sciences present new opportunities as well as challenges to the scientific community and call for advanced computational methods and workflows. The large amount of data collected from samples via next-generation sequencing (NGS) technologies render manual approaches to sequence comparison and annotation unsuitable. Rather, fast and efficient computational pipelines are needed to provide comprehensive statistics and summaries and enable the researcher to choose appropriate tools for more specific analyses. The workflow presented here builds upon previous pipelines designed for automated clustering and annotation of raw sequence reads obtained from next-generation sequencing technologies such as 454 and Illumina. Employing specialized algorithms, the sequence reads are processed at three different levels. First, raw reads are clustered at high similarity cutoff to yield clusters which can be exported as multifasta files for further analyses. Independently, open reading frames (ORFs) are predicted from raw reads and clustered at two strictness levels to yield sets of non-redundant sequences and ORF families. Furthermore, single ORFs are annotated by performing searches against the Pfam database
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 88
EP  - 100
PB  - Springer
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Vierheller, Janine
ED  - Lambrecht, Anna-Lena
ED  - Margaria, Tiziana
T1  - Exploratory Data Analysis
JF  - Process Design for Natural Scientists: an agile model-driven approach
N2  - In bioinformatics the term exploratory data analysis refers to different methods to get an overview of large biological data sets. Hence, it helps to create a framework for further analysis and hypothesis testing. The workflow facilitates this first important step of the data analysis created by high-throughput technologies. The results are different plots showing the structure of the measurements. The goal of the workflow is the automatization of the exploratory data analysis, but also the flexibility should be guaranteed. The basic tool is the free software R.
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 110
EP  - 126
PB  - Axel Springer Verlag
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Schütt, Christine
T1  - Identification of differentially expressed genes
JF  - Process design for natural scientists: an agile model-driven approach
N2  - With the jABC it is possible to realize workflows for numerous questions in different fields. The goal of this project was to create a workflow for the identification of differentially expressed genes. This is of special interest in biology, for it gives the opportunity to get a better insight in cellular changes due to exogenous stress, diseases and so on. With the knowledge that can be derived from the differentially expressed genes in diseased tissues, it becomes possible to find new targets for treatment.
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 127
EP  - 139
PB  - Springer
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Kuntzsch, Christian
T1  - Visualization of data transfer paths
JF  - Process design for natural scientists: an agile model-driven approach
N2  - A workflow for visualizing server connections using the Google Maps API was built in the jABC. It makes use of three basic services: An XML-based IP address geolocation web service, a command line tool and the Static Maps API. The result of the workflow is an URL leading to an image file of a map, showing server connections between a client and a target host.
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 140
EP  - 148
PB  - Springer
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Hibbe, Marcel
ED  - Lambrecht, Anna-Lena
ED  - Margaria, Tiziana
T1  - Spotlocator - Guess Where the Photo Was Taken!
JF  - Process Design for Natural Scientists: an agile model-driven approach
N2  - Spotlocator is a game wherein people have to guess the spots of where photos were taken. The photos of a defined area for each game are from panoramio.com. They are published at http://spotlocator. drupalgardens.com with an ID. Everyone can guess the photo spots by sending a special tweet via Twitter that contains the hashtag #spotlocator, the guessed coordinates and the ID of the photo. An evaluation is published for all tweets. The players are informed about the distance to the real photo spots and the positions are shown on a map.
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 149
EP  - 160
PB  - Springer Verlag
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Blaese, Leif
T1  - Data mining for unidentified protein squences
JF  - Process design for natural scientists: an agile model-driven approach
N2  - Through the use of next generation sequencing (NGS) technology, a lot of newly sequenced organisms are now available. Annotating those genes is one of the most challenging tasks in sequence biology. Here, we present an automated workflow to find homologue proteins, annotate sequences according to function and create a three-dimensional model.
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 73
EP  - 87
PB  - Springer
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Lis, Monika
ED  - Lambrecht, Anna-Lena
ED  - Margaria, Tiziana
T1  - Constructing a Phylogenetic Tree
JF  - Process Design for Natural Scientists: an agile model-driven approach
N2  - In this project I constructed a workflow that takes a DNA sequence as input and provides a phylogenetic tree, consisting of the input sequence and other sequences which were found during a database search. In this phylogenetic tree the sequences are arranged depending on similarities. In bioinformatics, constructing phylogenetic trees is often used to explore the evolutionary relationships of genes or organisms and to understand the mechanisms of evolution itself.
Y1  - 2014
SN  - 978-3-662-45005-5
SN  - 1865-0929
IS  - 500
SP  - 101
EP  - 109
PB  - Springer Verlag
CY  - Berlin
ER  - 
TY  - THES
A1  - Heise, Arvid
T1  - Data cleansing and integration operators for a parallel data analytics platform
T1  - Datenreinigungs- und Integrationsoperatoren für ein
paralles Datenanalyseframework
N2  - The data quality of real-world datasets need to be constantly monitored and maintained to allow organizations and individuals to reliably use their data. Especially, data integration projects suffer from poor initial data quality and as a consequence consume more effort and money. Commercial products and research prototypes for data cleansing and integration help users to improve the quality of individual and combined datasets. They can be divided into either standalone systems or database management system (DBMS) extensions. On the one hand, standalone systems do not interact well with DBMS and require time-consuming data imports and exports. On the other hand, DBMS extensions are often limited by the underlying system and do not cover the full set of data cleansing and integration tasks.

We overcome both limitations by implementing a concise set of five data cleansing and integration operators on the parallel data analytics platform Stratosphere. We define the semantics of the operators, present their parallel implementation, and devise optimization techniques for individual operators and combinations thereof. Users specify declarative queries in our query language METEOR with our new operators to improve the data quality of individual datasets or integrate them to larger datasets. By integrating the data cleansing operators into the higher level language layer of Stratosphere, users can easily combine cleansing operators with operators from other domains, such as information extraction, to complex data flows. Through a generic description of the operators, the Stratosphere optimizer reorders operators even from different domains to find better query plans.

As a case study, we reimplemented a part of the large Open Government Data integration project GovWILD with our new operators and show that our queries run significantly faster than the original GovWILD queries, which rely on relational operators. Evaluation reveals that our operators exhibit good scalability on up to 100 cores, so that even larger inputs can be efficiently processed by scaling out to more machines. Finally, our scripts are considerably shorter than the original GovWILD scripts, which results in better maintainability of the scripts.
N2  - Die Datenqualität von Realweltdaten muss ständig überwacht und gewartet werden, damit Organisationen und Individuen ihre Daten verlässlich nutzen können. Besonders Datenintegrationsprojekte leiden unter schlechter Datenqualität in den Quelldaten und benötigen somit mehr Zeit und Geld. Kommerzielle Produkte und Forschungsprototypen helfen Nutzern die Qualität in einzelnen und kombinierten Datensätzen zu verbessern. Die Systeme können in selbständige Systeme und Erweiterungen von bestehenden Datenbankmanagementsystemen (DBMS) unterteilt werden. Auf der einen Seite interagieren selbständige Systeme nicht gut mit DBMS und brauchen zeitaufwändigen Datenimport und -export. Auf der anderen Seite sind die DBMS Erweiterungen häufig durch das unterliegende System limitiert und unterstützen nicht die gesamte Bandbreite an Datenreinigungs- und -integrationsaufgaben.

Wir überwinden beide Limitationen, indem wir eine Menge von häufig benötigten Datenreinigungs- und Datenintegrationsoperatoren direkt in der parallelen Datenanalyseplattform Stratosphere implementieren. Wir definieren die Semantik der Operatoren, präsentieren deren parallele Implementierung und entwickeln Optimierungstechniken für die einzelnen und mehrere Operatoren. Nutzer können deklarative Anfragen in unserer Anfragesprache METEOR mit unseren neuen Operatoren formulieren, um die Datenqualität von einzelnen Datensätzen zu erhöhen, oder um sie zu größeren Datensätzen zu integrieren. Durch die Integration der Operatoren in die Hochsprachenschicht von Stratosphere können Nutzer Datenreinigungsoperatoren einfach mit Operatoren aus anderen Domänen wie Informationsextraktion zu komplexen Datenflüssen kombinieren. Da Stratosphere Operatoren durch generische Beschreibungen in den Optimierer integriert werden, ist es für den Optimierer sogar möglich Operatoren unterschiedlicher Domänen zu vertauschen, um besseren Anfrageplänen zu ermitteln.

Für eine Fallstudie haben wir Teile des großen Datenintegrationsprojektes GovWILD auf Stratosphere mit den neuen Operatoren nachimplementiert und zeigen, dass unsere Anfragen signifikant schneller laufen als die originalen GovWILD Anfragen, die sich auf relationale Operatoren verlassen. Die Evaluation zeigt, dass unsere Operatoren gut auf bis zu 100 Kernen skalieren, sodass sogar größere Datensätze effizient verarbeitet werden können, indem die Anfragen auf mehr Maschinen ausgeführt werden. Schließlich sind unsere Skripte erheblich kürzer als die originalen GovWILD Skripte, was in besserer Wartbarkeit unserer Skripte resultiert.
KW  - data
KW  - cleansing
KW  - holistic
KW  - parallel
KW  - map reduce
KW  - Datenreinigung
KW  - Datenintegration
KW  - ganzheitlich
KW  - parallel
KW  - map reduce
Y1  - 2014
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-77100
ER  - 
TY  - THES
A1  - Voland, Patrick
T1  - Webbasierte Visualisierung von Extended Floating Car Data (XFCD)
T1  - Web-based visualisation of Extended Floating Car Data (XFCD)
BT  - Ein Ansatz zur raumzeitlichen Visualisierung und technischen Implementierung mit Open Source Software unter spezieller Betrachtung des Umwelt- und Verkehrsmonitoring
BT  - An approach for spatio-temporal visualisation and implementation with open-source software under special emphasis of environment and traffic monitoring
N2  - Moderne Kraftfahrzeuge verfügen über eine Vielzahl an Sensoren, welche für einen reibungslosen technischen Betrieb benötigt werden. Hierzu zählen neben fahrzeugspezifischen Sensoren (wie z.B. Motordrehzahl und Fahrzeuggeschwindigkeit) auch umweltspezifische Sensoren (wie z.B. Luftdruck und Umgebungstemperatur). Durch die zunehmende technische Vernetzung wird es möglich, diese Daten der Kraftfahrzeugelektronik aus dem Fahrzeug heraus für die verschiedensten Zwecke zu verwenden. 
Die vorliegende Arbeit soll einen Beitrag dazu leisten, diese neue Art an massenhaften Daten im Sinne des Konzepts der „Extended Floating Car Data“ (XFCD) als Geoinformationen nutzbar zu machen und diese für raumzeitliche Visualisierungen (zur visuellen Analyse) anwenden zu können. In diesem Zusammenhang wird speziell die Perspektive des Umwelt- und Verkehrsmonitoring betrachtet, wobei die Anforderungen und Potentiale mit Hilfe von Experteninterviews untersucht werden. Es stellt sich die Frage, welche Daten durch die Kraftfahrzeugelektronik geliefert und wie diese möglichst automatisiert erfasst, verarbeitet, visualisiert und öffentlich bereitgestellt werden können. Neben theoretischen und technischen Grundlagen zur Datenerfassung und -nutzung liegt der Fokus auf den Methoden der kartographischen Visualisierung. Dabei soll der Frage nachgegangenen werden, ob eine technische Implementierung ausschließlich unter Verwendung von Open Source Software möglich ist. Das Ziel der Arbeit bildet ein zweigliedriger Ansatz, welcher zum einen die Visualisierung für ein exemplarisch gewähltes Anwendungsszenario und zum anderen die prototypische Implementierung von der Datenerfassung im Fahrzeug unter Verwendung der gesetzlich vorgeschriebenen „On Board Diagnose“-Schnittstelle und einem Smartphone-gestützten Ablauf bis zur webbasierten Visualisierung umfasst.
KW  - spatio-temporal sensor data
KW  - open source software
KW  - automotive electronics
KW  - geovisualization
Y1  - 2017
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-96751
ER  - 
TY  - THES
A1  - Lorey, Johannes
T1  - What's in a query : analyzing, predicting, and managing linked data access
T1  - Was ist in einer Anfrage : Analyse, Vorhersage und Verwaltung von Zugriffen auf Linked Data
N2  - The term Linked Data refers to connected information sources comprising structured data about a wide range of topics and for a multitude of applications. In recent years, the conceptional and technical foundations of Linked Data have been formalized and refined. To this end, well-known technologies have been established, such as the Resource Description Framework (RDF) as a Linked Data model or the SPARQL Protocol and RDF Query Language (SPARQL) for retrieving this information. Whereas most research has been conducted in the area of generating and publishing Linked Data, this thesis presents novel approaches for improved management. In particular, we illustrate new methods for analyzing and processing SPARQL queries. Here, we present two algorithms suitable for identifying structural relationships between these queries. Both algorithms are applied to a large number of real-world requests to evaluate the performance of the approaches and the quality of their results. Based on this, we introduce different strategies enabling optimized access of Linked Data sources. We demonstrate how the presented approach facilitates effective utilization of SPARQL endpoints by prefetching results relevant for multiple subsequent requests. Furthermore, we contribute a set of metrics for determining technical characteristics of such knowledge bases. To this end, we devise practical heuristics and validate them through thorough analysis of real-world data sources. We discuss the findings and evaluate their impact on utilizing the endpoints. Moreover, we detail the adoption of a scalable infrastructure for improving Linked Data discovery and consumption. As we outline in an exemplary use case, this platform is eligible both for processing and provisioning the corresponding information.
N2  - Unter dem Begriff Linked Data werden untereinander vernetzte Datenbestände verstanden, die große Mengen an strukturierten Informationen für verschiedene Anwendungsgebiete enthalten. In den letzten Jahren wurden die konzeptionellen und technischen Grundlagen für die Veröffentlichung von Linked Data gelegt und verfeinert. Zu diesem Zweck wurden eine Reihe von Technologien eingeführt, darunter das Resource Description Framework (RDF) als Datenmodell für Linked Data und das SPARQL Protocol and RDF Query Language (SPARQL) zum Abfragen dieser Informationen. Während bisher hauptsächlich die Erzeugung und Bereitstellung von Linked Data Forschungsgegenstand war, präsentiert die vorliegende Arbeit neuartige Verfahren zur besseren Nutzbarmachung. Insbesondere werden dafür Methoden zur Analyse und Verarbeitung von SPARQL-Anfragen entwickelt. Zunächst werden daher zwei Algorithmen vorgestellt, die die strukturelle Ähnlichkeit solcher Anfragen bestimmen. Beide Algorithmen werden auf eine große Anzahl von authentischen Anfragen angewandt, um sowohl die Güte der Ansätze als auch die ihrer Resultate zu untersuchen. Darauf aufbauend werden verschiedene Strategien erläutert, mittels derer optimiert auf Quellen von Linked Data zugegriffen werden kann. Es wird gezeigt, wie die dabei entwickelte Methode zur effektiven Verwendung von SPARQL-Endpunkten beiträgt, indem relevante Ergebnisse für mehrere nachfolgende Anfragen vorgeladen werden. Weiterhin werden in dieser Arbeit eine Reihe von Metriken eingeführt, die eine Einschätzung der technischen Eigenschaften solcher Endpunkte erlauben. Hierfür werden praxisrelevante Heuristiken entwickelt, die anschließend ausführlich mit Hilfe von konkreten Datenquellen analysiert werden. Die dabei gewonnenen Erkenntnisse werden erörtert und in Hinblick auf die Verwendung der Endpunkte interpretiert. Des Weiteren wird der Einsatz einer skalierbaren Plattform vorgestellt, die die Entdeckung und Nutzung von Beständen an Linked Data erleichtert. Diese Plattform dient dabei sowohl zur Verarbeitung als auch zur Verfügbarstellung der zugehörigen Information, wie in einem exemplarischen Anwendungsfall erläutert wird.
KW  - Vernetzte Daten
KW  - SPARQL
KW  - RDF
KW  - Anfragepaare
KW  - Informationsvorhaltung
KW  - linked data
KW  - SPARQL
KW  - RDF
KW  - query matching
KW  - prefetching
Y1  - 2014
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-72312
ER  - 
TY  - THES
A1  - Steinert, Bastian
T1  - Built-in recovery support for explorative programming
T1  - Eingebaute Unterstützung für Wiederherstellungsbedürfnisse für unstrukturierte ergebnisoffene Programmieraufgaben
BT  - preserving immediate access to static and dynamic information of intermediate development states
BT  - Erhaltung des unmittelbaren Zugriffs auf statische und dynamische Informationen von Entwicklungszwischenständen
N2  - This work introduces concepts and corresponding tool support to enable a complementary approach in dealing with recovery. Programmers need to recover a development state, or a part thereof, when previously made changes reveal undesired implications. However, when the need arises suddenly and unexpectedly, recovery often involves expensive and tedious work. To avoid tedious work, literature recommends keeping away from unexpected recovery demands by following a structured and disciplined approach, which consists of the application of various best practices including working only on one thing at a time, performing small steps, as well as making proper use of versioning and testing tools. However, the attempt to avoid unexpected recovery is both time-consuming and error-prone. On the one hand, it requires disproportionate effort to minimize the risk of unexpected situations. On the other hand, applying recommended practices selectively, which saves time, can hardly avoid recovery. In addition, the constant need for foresight and self-control has unfavorable implications. It is exhaustive and impedes creative problem solving. This work proposes to make recovery fast and easy and introduces corresponding support called CoExist. Such dedicated support turns situations of unanticipated recovery from tedious experiences into pleasant ones. It makes recovery fast and easy to accomplish, even if explicit commits are unavailable or tests have been ignored for some time. When mistakes and unexpected insights are no longer associated with tedious corrective actions, programmers are encouraged to change source code as a means to reason about it, as opposed to making changes only after structuring and evaluating them mentally. This work further reports on an implementation of the proposed tool support in the Squeak/Smalltalk development environment. The development of the tools has been accompanied by regular performance and usability tests. In addition, this work investigates whether the proposed tools affect programmers’ performance. In a controlled lab study, 22 participants improved the design of two different applications. Using a repeated measurement setup, the study examined the effect of providing CoExist on programming performance. The result of analyzing 88 hours of programming suggests that built-in recovery support as provided with CoExist positively has a positive effect on programming performance in explorative programming tasks.
N2  - Diese Arbeit präsentiert Konzepte und die zugehörige Werkzeugunterstützung um einen komplementären Umgang mit Wiederherstellungsbedürfnissen zu ermöglichen. Programmierer haben Bedarf zur Wiederherstellung eines früheren Entwicklungszustandes oder Teils davon, wenn ihre Änderungen ungewünschte Implikationen aufzeigen. Wenn dieser Bedarf plötzlich und unerwartet auftritt, dann ist die notwendige Wiederherstellungsarbeit häufig mühsam und aufwendig. Zur Vermeidung mühsamer Arbeit empfiehlt die Literatur die Vermeidung von unerwarteten Wiederherstellungsbedürfnissen durch einen strukturierten und disziplinierten Programmieransatz, welcher die Verwendung verschiedener bewährter Praktiken vorsieht. Diese Praktiken sind zum Beispiel: nur an einer Sache gleichzeitig zu arbeiten, immer nur kleine Schritte auszuführen, aber auch der sachgemäße Einsatz von Versionskontroll- und Testwerkzeugen. Jedoch ist der Versuch des Abwendens unerwarteter Wiederherstellungsbedürfnisse sowohl zeitintensiv als auch fehleranfällig. Einerseits erfordert es unverhältnismäßig hohen Aufwand, das Risiko des Eintretens unerwarteter Situationen auf ein Minimum zu reduzieren. Andererseits ist eine zeitsparende selektive Ausführung der empfohlenen Praktiken kaum hinreichend, um Wiederherstellungssituationen zu vermeiden. Zudem bringt die ständige Notwendigkeit an Voraussicht und Selbstkontrolle Nachteile mit sich. Dies ist ermüdend und erschwert das kreative Problemlösen. Diese Arbeit schlägt vor, Wiederherstellungsaufgaben zu vereinfachen und beschleunigen, und stellt entsprechende Werkzeugunterstützung namens CoExist vor. Solche zielgerichtete Werkzeugunterstützung macht aus unvorhergesehenen mühsamen Wiederherstellungssituationen eine konstruktive Erfahrung. Damit ist Wiederherstellung auch dann leicht und schnell durchzuführen, wenn explizit gespeicherte Zwischenstände fehlen oder die Tests für einige Zeit ignoriert wurden. Wenn Fehler und unerwartete Ein- sichten nicht länger mit mühsamen Schadensersatz verbunden sind, fühlen sich Programmierer eher dazu ermutig, Quelltext zu ändern, um dabei darüber zu reflektieren, und nehmen nicht erst dann Änderungen vor, wenn sie diese gedanklich strukturiert und evaluiert haben. Diese Arbeit berichtet weiterhin von einer Implementierung der vorgeschlagenen Werkzeugunterstützung in der Squeak/Smalltalk Entwicklungsumgebung. Regelmäßige Tests von Laufzeitverhalten und Benutzbarkeit begleiteten die Entwicklung. Zudem prüft die Arbeit, ob sich die Verwendung der vorgeschlagenen Werkzeuge auf die Leistung der Programmierer auswirkt. In einem kontrollierten Experiment, verbesserten 22 Teilnehmer den Aufbau von zwei verschiedenen Anwendungen. Unter der Verwendung einer Versuchsanordnung mit wiederholter Messung, ermittelte die Studie die Auswirkung von CoExist auf die Programmierleistung. Das Ergebnis der Analyse von 88 Programmierstunden deutet darauf hin, dass sich eingebaute Werkzeugunterstützung für Wiederherstellung, wie sie mit CoExist bereitgestellt wird, positiv bei der Bearbeitung von unstrukturierten ergebnisoffenen Programmieraufgaben auswirkt.
KW  - Softwaretechnik
KW  - Entwicklungswerkzeuge
KW  - Versionierung
KW  - Testen
KW  - software engineering
KW  - development tools
KW  - versioning
KW  - testing
Y1  - 2014
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-71305
ER  - 
TY  - THES
A1  - Tinnefeld, Christian
T1  - Building a columnar database on shared main memory-based storage
BT  - database operator placement in a shared main memory-based storage system that supports data access and code execution
N2  - In the field of disk-based parallel database management systems exists a great variety of solutions based on a shared-storage or a shared-nothing architecture. In contrast, main memory-based parallel database management systems are dominated solely by the shared-nothing approach as it preserves the in-memory performance advantage by processing data locally on each server. We argue that this unilateral development is going to cease due to the combination of the following three trends: a) Nowadays network technology features remote direct memory access (RDMA) and narrows the performance gap between accessing main memory inside a server and of a remote server to and even below a single order of magnitude. b) Modern storage systems scale gracefully, are elastic, and provide high-availability. c) A modern storage system such as Stanford's RAMCloud even keeps all data resident in main memory. Exploiting these characteristics in the context of a main-memory parallel database management system is desirable. The advent of RDMA-enabled network technology makes the creation of a parallel main memory DBMS based on a shared-storage approach feasible.

This thesis describes building a columnar database on shared main memory-based storage. The thesis discusses the resulting architecture (Part I), the implications on query processing (Part II), and presents an evaluation of the resulting solution in terms of performance, high-availability, and elasticity (Part III).

In our architecture, we use Stanford's RAMCloud as shared-storage, and the self-designed and developed in-memory AnalyticsDB as relational query processor on top. AnalyticsDB encapsulates data access and operator execution via an interface which allows seamless switching between local and remote main memory, while RAMCloud provides not only storage capacity, but also processing power. Combining both aspects allows pushing-down the execution of database operators into the storage system. We describe how the columnar data processed by AnalyticsDB is mapped to RAMCloud's key-value data model and how the performance advantages of columnar data storage can be preserved.

The combination of fast network technology and the possibility to execute database operators in the storage system opens the discussion for site selection. We construct a system model that allows the estimation of operator execution costs in terms of network transfer, data processed in memory, and wall time. This can be used for database operators that work on one relation at a time - such as a scan or materialize operation - to discuss the site selection problem (data pull vs. operator push). Since a database query translates to the execution of several database operators, it is possible that the optimal site selection varies per operator. For the execution of a database operator that works on two (or more) relations at a time, such as a join, the system model is enriched by additional factors such as the chosen algorithm (e.g. Grace- vs. Distributed Block Nested Loop Join vs. Cyclo-Join), the data partitioning of the respective relations, and their overlapping as well as the allowed resource allocation.

We present an evaluation on a cluster with 60 nodes where all nodes are connected via RDMA-enabled network equipment. We show that query processing performance is about 2.4x slower if everything is done via the data pull operator execution strategy (i.e. RAMCloud is being used only for data access) and about 27% slower if operator execution is also supported inside RAMCloud (in comparison to operating only on main memory inside a server without any network communication at all). The fast-crash recovery feature of RAMCloud can be leveraged to provide high-availability, e.g. a server crash during query execution only delays the query response for about one second. Our solution is elastic in a way that it can adapt to changing workloads a) within seconds, b) without interruption of the ongoing query processing, and c) without manual intervention.
N2  - Diese Arbeit beschreibt die Erstellung einer spalten-orientierten Datenbank auf einem geteilten, Hauptspeicher-basierenden Speichersystem. Motiviert wird diese Arbeit durch drei Faktoren. Erstens ist moderne Netzwerktechnologie mit “Remote Direct Memory Access” (RDMA) ausgestattet. Dies reduziert den Unterschied hinsichtlich Latenz und Durchsatz zwischen dem Speicherzugriff innerhalb eines Rechners und auf einen entfernten Rechner auf eine Größenordnung. Zweitens skalieren moderne Speichersysteme, sind elastisch und hochverfügbar. Drittens hält ein modernes Speichersystem wie Stanford's RAMCloud alle Daten im Hauptspeicher vor. Diese Eigenschaften im Kontext einer spalten-orientierten Datenbank zu nutzen ist erstrebenswert. Die Arbeit ist in drei Teile untergliedert. Der erste Teile beschreibt die Architektur einer spalten-orientierten Datenbank auf einem geteilten, Hauptspeicher-basierenden Speichersystem. Hierbei werden die im Rahmen dieser Arbeit entworfene und entwickelte Datenbank AnalyticsDB sowie Stanford's RAMCloud verwendet. Die Architektur beschreibt wie Datenzugriff und Operatorausführung gekapselt werden um nahtlos zwischen lokalem und entfernten Hauptspeicher wechseln zu können. Weiterhin wird die Ablage der nach einem relationalen Schema formatierten Daten von AnalyticsDB in RAMCloud behandelt, welches mit einem Schlüssel-Wertpaar Datenmodell operiert. Der zweite Teil fokussiert auf die Implikationen bei der Abarbeitung von Datenbankanfragen. Hier steht die Diskussion im Vordergrund wo (entweder in AnalyticsDB oder in RAMCloud) und mit welcher Parametrisierung einzelne Datenbankoperationen ausgeführt werden. Dafür werden passende Kostenmodelle vorgestellt, welche die Abbildung von Datenbankoperationen ermöglichen, die auf einer oder mehreren Relationen arbeiten. Der dritte Teil der Arbeit präsentiert eine Evaluierung auf einem Verbund von 60 Rechnern hinsichtlich der Leistungsfähigkeit, der Hochverfügbarkeit und der Elastizität vom System.
T2  - Die Erstellung einer spaltenorientierten Datenbank auf einem verteilten, Hauptspeicher-basierenden Speichersystem
KW  - computer science
KW  - database technology
KW  - main memory computing
KW  - cloud computing
KW  - verteilte Datenbanken
KW  - Hauptspeicher Technologie
KW  - virtualisierte IT-Infrastruktur
Y1  - 2014
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-72063
ER  -