Institut für Informatik und Computational Science
Refine
Year of publication
Document Type
- Article (534)
- Doctoral Thesis (126)
- Monograph/Edited Volume (82)
- Other (26)
- Postprint (10)
- Conference Proceeding (5)
- Part of a Book (4)
- Preprint (4)
- Bachelor Thesis (1)
- Habilitation Thesis (1)
Language
- English (794) (remove)
Is part of the Bibliography
- yes (794) (remove)
Keywords
- answer set programming (12)
- Answer Set Programming (10)
- Answer set programming (10)
- Machine Learning (7)
- Maschinelles Lernen (7)
- Antwortmengenprogrammierung (5)
- Internet of Things (4)
- MQTT (4)
- machine learning (4)
- security (4)
Institute
- Institut für Informatik und Computational Science (794)
- Hasso-Plattner-Institut für Digital Engineering gGmbH (16)
- Extern (4)
- Institut für Physik und Astronomie (2)
- Department Erziehungswissenschaft (1)
- Department Linguistik (1)
- Institut für Biochemie und Biologie (1)
- Potsdam Institute for Climate Impact Research (PIK) e. V. (1)
- Wirtschaftswissenschaften (1)
Business process management
(2006)
Synchronization
(2006)
With the next generation Internet protocol IPv6 at the horizon, it is time to think about how applications can migrate to IPv6. Web traffic is currently one of the most important applications in the Internet. The increasing popularity of dynamically generated content on the World Wide Web, has created the need for fast web servers. Server clustering together with server load balancing has emerged as a promising technique to build scalable web servers. The paper gives a short overview over the new features of IPv6 and different server load balancing technologies. Further, we present and evaluate Loaded, an user-space server load balancer for IPv4 and IPv6 based on Linux.
An Extended Query language for action languages (and its application to aggregates and preferences)
(2006)
In this article, we consider high-dimensional data which contains a low-dimensional non-Gaussian structure contaminated with Gaussian noise and propose a new linear method to identify the non-Gaussian subspace. Our method NGCA (Non-Gaussian Component Analysis) is based on a very general semi-parametric framework and has a theoretical guarantee that the estimation error of finding the non-Gaussian components tends to zero at a parametric rate. NGCA can be used not only as preprocessing for ICA, but also for extracting and visualizing more general structures like clusters. A numerical study demonstrates the usefulness of our method
We give a new view on building content clusters from page pair models. We measure the heuristic importance within every two pages by computing the distance of their accessed positions in usage sessions. We also compare our page pair models with the classical pair models used in information theories and natural language processing, and give different evaluation methods to build the reasonable content communities. And we finally interpret the advantages and disadvantages of our models from detailed experiment results
Two common data representations are mostly used in intelligent data analysis, namely the vectorial and the pairwise representation. Pairwise data which satisfy the restrictive conditions of Euclidean spaces can be faithfully translated into a Euclidean vectorial representation by embedding. Non-metric pairwise data with violations of symmetry, reflexivity or triangle inequality pose a substantial conceptual problem for pattern recognition since the amount of predictive structural information beyond what can be measured by embeddings is unclear. We show by systematic modeling of non-Euclidean pairwise data that there exists metric violations which can carry valuable problem specific information. Furthermore, Euclidean and non-metric data can be unified on the level of structural information contained in the data. Stable component analysis selects linear subspaces which are particularly insensitive to data fluctuations. Experimental results from different domains support our pattern recognition strategy.
Most information systems log events (e.g., transaction logs, audit traits) to audit and monitor the processes they support. At the same time, many of these processes have been explicitly modeled. For example, SAP R/3 logs events in transaction logs and there are EPCs (Event-driven Process Chains) describing the so-called reference models. These reference models describe how the system should be used. The coexistence of event logs and process models raises an interesting question: "Does the event log conform to the process model and vice versa?". This paper demonstrates that there is not a simple answer to this question. To tackle the problem, we distinguish two dimensions of conformance: fitness (the event log may be the result of the process modeled) and appropriateness (the model is a likely candidate from a structural and behavioral point of view). Different metrics have been defined and a Conformance Checker has been implemented within the ProM Framework
The emergence of drug resistance remains one of the most challenging issues in the treatment of HIV-1 infection. The extreme replication dynamics of HIV facilitates its escape from the selective pressure exerted by the human immune system and by the applied combination drug therapy. This article reviews computational methods whose combined use can support the design of optimal antiretroviral therapies based on viral genotypic and phenotypic data. Genotypic assays are based on the analysis of mutations associated with reduced drug susceptibility, but are difficult to interpret due to the numerous mutations and mutational patterns that confer drug resistance. Phenotypic resistance or susceptibility can be experimentally evaluated by measuring the inhibition of the viral replication in cell culture assays. However, this procedure is expensive and time consuming
When decomposing single trial electroencephalography it is a challenge to incorporate prior physiological knowledge. Here, we develop a method that uses prior information about the phase-locking property of event-related potentials in a regularization framework to bias a blind source separation algorithm toward an improved separation of single-trial phase-locked responses in terms of an increased signal-to-noise ratio. In particular, we suggest a transformation of the data, using weighted average of the single trial and trial-averaged response, that redirects the focus of source separation methods onto the subspace of event-related potentials. The practical benefit with respect to an improved separation of such components from ongoing background activity and extraneous noise is first illustrated on artificial data and finally verified in a real-world application of extracting single-trial somatosensory evoked potentials from multichannel EEG-recordings
The Berlin Brain-Computer Interface (BBCI) project develops a noninvasive BCI system whose key features are 1) the use of well-established motor competences as control paradigms, 2) high-dimensional features from 128-channel electroencephalogram (EEG), and 3) advanced machine learning techniques. As reported earlier, our experiments demonstrate that very high information transfer rates can be achieved using the readiness potential (RP) when predicting the laterality of upcoming left-versus right-hand movements in healthy subjects. A more recent study showed that the RP similarily accompanies phantom movements in arm amputees, but the signal strength decreases with longer loss of the limb. In a complementary approach, oscillatory features are used to discriminate imagined movements (left hand versus right hand versus foot). In a recent feedback study with six healthy subjects with no or very little experience with BCI control, three subjects achieved an information transfer rate above 35 bits per minute (bpm), and further two subjects above 24 and 15 bpm, while one subject could not achieve any BCI control. These results are encouraging for an EEG-based BCI system in untrained subjects that is independent of peripheral nervous system activity and does not rely on evoked potentials even when compared to results with very well-trained subjects operating other BCI systems
Non-stationarities are ubiquitous in EEG signals. They are especially apparent in the use of EEG-based brain- computer interfaces (BCIs): (a) in the differences between the initial calibration measurement and the online operation of a BCI, or (b) caused by changes in the subject's brain processes during an experiment (e.g. due to fatigue, change of task involvement, etc). In this paper, we quantify for the first time such systematic evidence of statistical differences in data recorded during offline and online sessions. Furthermore, we propose novel techniques of investigating and visualizing data distributions, which are particularly useful for the analysis of (non-) stationarities. Our study shows that the brain signals used for control can change substantially from the offline calibration sessions to online control, and also within a single session. In addition to this general characterization of the signals, we propose several adaptive classification schemes and study their performance on data recorded during online experiments. An encouraging result of our study is that surprisingly simple adaptive methods in combination with an offline feature selection scheme can significantly increase BCI performance
GeneFisher-P
(2007)
Design Issues in the Implementation of MPI2 One Sided Communication in Ethernet based Networks
(2007)
In current research, one sided communication of the MPI2 standard is pushed as a promising technique [6, 7, 10, 18]. But measurements of applications and MPI2 primitives show a different picture [17]. In this paper we analyze de sign issues of MPI2 one sided communication and its im plementations. We focus on asynchronous communication for parallel applications in Ethernet cluster environments. Further, one sided communication is compared to two sided communication. This paper will prove that the key problem to performance is not only the implementation of MPI2 one sided communication - it is the design.
Proceedings of the 2nd International Workshop on e-learning and Virtual and Remote Laboratories
(2008)
The introduction of columnar in-memory databases, along with hardware evolution, has made the execution of transactional and analytical enterprise application workloads on a single system both feasible and viable. Yet, we argue that executing analytical aggregate queries directly on the transactional data can decrease the overall system performance. Despite the aggregation capabilities of columnar in-memory databases, the direct access to records of a materialized aggregate is always more efficient than aggregating on the fly. The traditional approach to materialized aggregates, however, introduces significant overhead in terms of materialized view selection, maintenance, and exploitation. When this overhead is handled by the application, it increases the application complexity, and can slow down the transactional throughput of inserts, updates, and deletes.
In this thesis, we motivate, propose, and evaluate the aggregate cache, a materialized aggregate engine in the main-delta architecture of a columnar in-memory database that provides efficient means to handle costly aggregate queries of enterprise applications. For our design, we leverage the specifics of the main-delta architecture that separates a table into a main and delta partition. The central concept is to only cache the partial aggregate query result as defined on the main partition of a table, because the main partition is relatively stable as records are only inserted into the delta partition. We contribute by proposing incremental aggregate maintenance and query compensation techniques for mixed workloads of enterprise applications. In addition, we introduce aggregate profit metrics that increase the likelihood of persisting the most profitable aggregates in the aggregate cache.
Query compensation and maintenance of materialized aggregates based on joins of multiple tables is expensive due to the partitioned tables in the main-delta architecture. Our analysis of enterprise applications has revealed several data schema and workload patterns. This includes the observation that transactional data is persisted in header and item tables, whereas in many cases, the insertion of related header and item records is executed in a single database transaction. We contribute by proposing an approach to transport these application object semantics to the database system and optimize the query processing using the aggregate cache by applying partition pruning and predicate pushdown techniques.
For the experimental evaluation, we propose the FICO benchmark that is based on data from a productive ERP system with extracted mixed workloads. Our evaluation reveals that the aggregate cache can accelerate the execution of aggregate queries up to a factor of 60 whereas the speedup highly depends on the number of aggregated records in the main and delta partitions. In mixed workloads, the proposed aggregate maintenance and query compensation techniques perform up to an order of magnitude better than traditional materialized aggregate maintenance approaches. The introduced aggregate profit metrics outperform existing costbased metrics by up to 20%. Lastly, the join pruning and predicate pushdown techniques can accelerate query execution in the aggregate cache in the presence of multiple partitioned tables by up to an order of magnitude.
Services that operate over the Internet are under constant threat of being exposed to fraudulent use. Maintaining good user experience for legitimate users often requires the classification of entities as malicious or legitimate in order to initiate countermeasures. As an example, inbound email spam filters decide for spam or non-spam. They can base their decision on both the content of each email as well as on features that summarize prior emails received from the sending server. In general, discriminative classification methods learn to distinguish positive from negative entities. Each decision for a label may be based on features of the entity and related entities. When labels of related entities have strong interdependencies---as can be assumed e.g. for emails being delivered by the same user---classification decisions should not be made independently and dependencies should be modeled in the decision function. This thesis addresses the formulation of discriminative classification problems that are tailored for the specific demands of the following three Internet security applications. Theoretical and algorithmic solutions are devised to protect an email service against flooding of user inboxes, to mitigate abusive usage of outbound email servers, and to protect web servers against distributed denial of service attacks.
In the application of filtering an inbound email stream for unsolicited emails, utilizing features that go beyond each individual email's content can be valuable. Information about each sending mail server can be aggregated over time and may help in identifying unwanted emails. However, while this information will be available to the deployed email filter, some parts of the training data that are compiled by third party providers may not contain this information. The missing features have to be estimated at training time in order to learn a classification model. In this thesis an algorithm is derived that learns a decision function that integrates over a distribution of values for each missing entry. The distribution of missing values is a free parameter that is optimized to learn an optimal decision function.
The outbound stream of emails of an email service provider can be separated by the customer IDs that ask for delivery. All emails that are sent by the same ID in the same period of time are related, both in content and in label. Hijacked customer accounts may send batches of unsolicited emails to other email providers, which in turn might blacklist the sender's email servers after detection of incoming spam emails. The risk of being blocked from further delivery depends on the rate of outgoing unwanted emails and the duration of high spam sending rates. An optimization problem is developed that minimizes the expected cost for the email provider by learning a decision function that assigns a limit on the sending rate to customers based on the each customer's email stream.
Identifying attacking IPs during HTTP-level DDoS attacks allows to block those IPs from further accessing the web servers. DDoS attacks are usually carried out by infected clients that are members of the same botnet and show similar traffic patterns. HTTP-level attacks aim at exhausting one or more resources of the web server infrastructure, such as CPU time. If the joint set of attackers cannot increase resource usage close to the maximum capacity, no effect will be experienced by legitimate users of hosted web sites. However, if the additional load raises the computational burden towards the critical range, user experience will degrade until service may be unavailable altogether. As the loss of missing one attacker depends on block decisions for other attackers---if most other attackers are detected, not blocking one client will likely not be harmful---a structured output model has to be learned. In this thesis an algorithm is developed that learns a structured prediction decoder that searches the space of label assignments, guided by a policy.
Each model is evaluated on real-world data and is compared to reference methods. The results show that modeling each classification problem according to the specific demands of the task improves performance over solutions that do not consider the constraints inherent to an application.
Personal fabrication tools, such as 3D printers, are on the way of enabling a future in which non-technical users will be able to create custom objects. However, while the hardware is there, the current interaction model behind existing design tools is not suitable for non-technical users. Today, 3D printers are operated by fabricating the object in one go, which tends to take overnight due to the slow 3D printing technology. Consequently, the current interaction model requires users to think carefully before printing as every mistake may imply another overnight print. Planning every step ahead, however, is not feasible for non-technical users as they lack the experience to reason about the consequences of their design decisions.
In this dissertation, we propose changing the interaction model around personal fabrication tools to better serve this user group. We draw inspiration from personal computing and argue that the evolution of personal fabrication may resemble the evolution of personal computing: Computing started with machines that executed a program in one go before returning the result to the user. By decreasing the interaction unit to single requests, turn-taking systems such as the command line evolved, which provided users with feedback after every input. Finally, with the introduction of direct-manipulation interfaces, users continuously interacted with a program receiving feedback about every action in real-time. In this dissertation, we explore whether these interaction concepts can be applied to personal fabrication as well.
We start with fabricating an object in one go and investigate how to tighten the feedback-cycle on an object-level: We contribute a method called low-fidelity fabrication, which saves up to 90% fabrication time by creating objects as fast low-fidelity previews, which are sufficient to evaluate key design aspects. Depending on what is currently being tested, we propose different conversions that enable users to focus on different parts: faBrickator allows for a modular design in the early stages of prototyping; when users move on WirePrint allows quickly testing an object's shape, while Platener allows testing an object's technical function. We present an interactive editor for each technique and explain the underlying conversion algorithms.
By interacting on smaller units, such as a single element of an object, we explore what it means to transition from systems that fabricate objects in one go to turn-taking systems. We start with a 2D system called constructable: Users draw with a laser pointer onto the workpiece inside a laser cutter. The drawing is captured with an overhead camera. As soon as the the user finishes drawing an element, such as a line, the constructable system beautifies the path and cuts it--resulting in physical output after every editing step. We extend constructable towards 3D editing by developing a novel laser-cutting technique for 3D objects called LaserOrigami that works by heating up the workpiece with the defocused laser until the material becomes compliant and bends down under gravity. While constructable and LaserOrigami allow for fast physical feedback, the interaction is still best described as turn-taking since it consists of two discrete steps: users first create an input and afterwards the system provides physical output.
By decreasing the interaction unit even further to a single feature, we can achieve real-time physical feedback: Input by the user and output by the fabrication device are so tightly coupled that no visible lag exists. This allows us to explore what it means to transition from turn-taking interfaces, which only allow exploring one option at a time, to direct manipulation interfaces with real-time physical feedback, which allow users to explore the entire space of options continuously with a single interaction. We present a system called FormFab, which allows for such direct control. FormFab is based on the same principle as LaserOrigami: It uses a workpiece that when warmed up becomes compliant and can be reshaped. However, FormFab achieves the reshaping not based on gravity, but through a pneumatic system that users can control interactively. As users interact, they see the shape change in real-time.
We conclude this dissertation by extrapolating the current evolution into a future in which large numbers of people use the new technology to create objects. We see two additional challenges on the horizon: sustainability and intellectual property. We investigate sustainability by demonstrating how to print less and instead patch physical objects. We explore questions around intellectual property with a system called Scotty that transfers objects without creating duplicates, thereby preserving the designer's copyright.
Computer Security deals with the detection and mitigation of threats to computer networks, data, and computing hardware. This
thesis addresses the following two computer security problems: email spam campaign and malware detection.
Email spam campaigns can easily be generated using popular dissemination tools by specifying simple grammars that serve as message templates. A grammar is disseminated to nodes of a bot net, the nodes create messages by instantiating the grammar at random. Email spam campaigns can encompass huge data volumes and therefore pose a threat to the stability of the infrastructure of email service providers that have to store them. Malware -software that serves a malicious purpose- is affecting web servers, client computers via active content, and client computers through executable files. Without the help of malware detection systems it would be easy for malware creators to collect sensitive information or to infiltrate computers.
The detection of threats -such as email-spam messages, phishing messages, or malware- is an adversarial and therefore intrinsically
difficult problem. Threats vary greatly and evolve over time. The detection of threats based on manually-designed rules is therefore
difficult and requires a constant engineering effort. Machine-learning is a research area that revolves around the analysis of data and the discovery of patterns that describe aspects of the data. Discriminative learning methods extract prediction models from data that are optimized to predict a target attribute as accurately as possible. Machine-learning methods hold the promise of automatically identifying patterns that robustly and accurately detect threats. This thesis focuses on the design and analysis of discriminative learning methods for the two computer-security problems under investigation: email-campaign and malware detection.
The first part of this thesis addresses email-campaign detection. We focus on regular expressions as a syntactic framework, because regular expressions are intuitively comprehensible by security engineers and administrators, and they can be applied as a detection mechanism in an extremely efficient manner. In this setting, a prediction model is provided with exemplary messages from an email-spam campaign. The prediction model has to generate a regular expression that reveals the syntactic pattern that underlies the entire campaign, and that a security engineers finds comprehensible and feels confident enough to use the expression to blacklist further messages at the email server. We model this problem as two-stage learning problem with structured input and output spaces which can be solved using standard cutting plane methods. Therefore we develop an appropriate loss function, and derive a decoder for the resulting optimization problem.
The second part of this thesis deals with the problem of predicting whether a given JavaScript or PHP file is malicious or benign. Recent malware analysis techniques use static or dynamic features, or both. In fully dynamic analysis, the software or script is executed and observed for malicious behavior in a sandbox environment. By contrast, static analysis is based on features that can be extracted directly from the program file. In order to bypass static detection mechanisms, code obfuscation techniques are used to spread a malicious program file in many different syntactic variants. Deobfuscating the code before applying a static classifier can be subjected to mostly static code analysis and can overcome the problem of obfuscated malicious code, but on the other hand increases the computational costs of malware detection by an order of magnitude. In this thesis we present a cascaded architecture in which a classifier first performs a static analysis of the original code and -based on the outcome of this first classification step- the code may be deobfuscated and classified again. We explore several types of features including token $n$-grams, orthogonal sparse bigrams, subroutine-hashings, and syntax-tree features and study the robustness of detection methods and feature types against the evolution of malware over time. The developed tool scans very large file collections quickly and accurately.
Each model is evaluated on real-world data and compared to reference methods. Our approach of inferring regular expressions to filter emails belonging to an email spam campaigns leads to models with a high true-positive rate at a very low false-positive rate that is an order of magnitude lower than that of a commercial content-based filter. Our presented system -REx-SVMshort- is being used by a commercial email service provider and complements content-based and IP-address based filtering.
Our cascaded malware detection system is evaluated on a high-quality data set of almost 400,000 conspicuous PHP files and a collection of more than 1,00,000 JavaScript files. From our case study we can conclude that our system can quickly and accurately process large data collections at a low false-positive rate.
Geospatial data has become a natural part of a growing number of information systems and services in the economy, society, and people's personal lives. In particular, virtual 3D city and landscape models constitute valuable information sources within a wide variety of applications such as urban planning, navigation, tourist information, and disaster management. Today, these models are often visualized in detail to provide realistic imagery. However, a photorealistic rendering does not automatically lead to high image quality, with respect to an effective information transfer, which requires important or prioritized information to be interactively highlighted in a context-dependent manner.
Approaches in non-photorealistic renderings particularly consider a user's task and camera perspective when attempting optimal expression, recognition, and communication of important or prioritized information. However, the design and implementation of non-photorealistic rendering techniques for 3D geospatial data pose a number of challenges, especially when inherently complex geometry, appearance, and thematic data must be processed interactively. Hence, a promising technical foundation is established by the programmable and parallel computing architecture of graphics processing units.
This thesis proposes non-photorealistic rendering techniques that enable both the computation and selection of the abstraction level of 3D geospatial model contents according to user interaction and dynamically changing thematic information. To achieve this goal, the techniques integrate with hardware-accelerated rendering pipelines using shader technologies of graphics processing units for real-time image synthesis. The techniques employ principles of artistic rendering, cartographic generalization, and 3D semiotics—unlike photorealistic rendering—to synthesize illustrative renditions of geospatial feature type entities such as water surfaces, buildings, and infrastructure networks. In addition, this thesis contributes a generic system that enables to integrate different graphic styles—photorealistic and non-photorealistic—and provide their seamless transition according to user tasks, camera view, and image resolution.
Evaluations of the proposed techniques have demonstrated their significance to the field of geospatial information visualization including topics such as spatial perception, cognition, and mapping. In addition, the applications in illustrative and focus+context visualization have reflected their potential impact on optimizing the information transfer regarding factors such as cognitive load, integration of non-realistic information, visualization of uncertainty, and visualization on small displays.
The main objective of this dissertation is to analyse prerequisites, expectations, apprehensions, and attitudes of students studying computer science, who are willing to gain a bachelor degree. The research will also investigate in the students’ learning style according to the Felder-Silverman model. These investigations fall in the attempt to make an impact on reducing the “dropout”/shrinkage rate among students, and to suggest a better learning environment.
The first investigation starts with a survey that has been made at the computer science department at the University of Baghdad to investigate the attitudes of computer science students in an environment dominated by women, showing the differences in attitudes between male and female students in different study years. Students are accepted to university studies via a centrally controlled admission procedure depending mainly on their final score at school. This leads to a high percentage of students studying subjects they do not want. Our analysis shows that 75% of the female students do not regret studying computer science although it was not their first choice. And according to statistics over previous years, women manage to succeed in their study and often graduate on top of their class. We finish with a comparison of attitudes between the freshman students of two different cultures and two different university enrolment procedures (University of Baghdad, in Iraq, and the University of Potsdam, in Germany) both with opposite gender majority.
The second step of investigation took place at the department of computer science at the University of Potsdam in Germany and analyzes the learning styles of students studying the three major fields of study offered by the department (computer science, business informatics, and computer science teaching). Investigating the differences in learning styles between the students of those study fields who usually take some joint courses is important to be aware of which changes are necessary to be adopted in the teaching methods to address those different students. It was a two stage study using two questionnaires; the main one is based on the Index of Learning Styles Questionnaire of B. A. Solomon and R. M. Felder, and the second questionnaire was an investigation on the students’ attitudes towards the findings of their personal first questionnaire. Our analysis shows differences in the preferences of learning style between male and female students of the different study fields, as well as differences between students with the different specialties (computer science, business informatics, and computer science teaching).
The third investigation looks closely into the difficulties, issues, apprehensions and expectations of freshman students studying computer science. The study took place at the computer science department at the University of Potsdam with a volunteer sample of students. The goal is to determine and discuss the difficulties and issues that they are facing in their study that may lead them to think in dropping-out, changing the study field, or changing the university. The research continued with the same sample of students (with business informatics students being the majority) through more than three semesters. Difficulties and issues during the study were documented, as well as students’ attitudes, apprehensions, and expectations. Some of the professors and lecturers opinions and solutions to some students’ problems were also documented. Many participants had apprehensions and difficulties, especially towards informatics subjects. Some business informatics participants began to think of changing the university, in particular when they reached their third semester, others thought about changing their field of study. Till the end of this research, most of the participants continued in their studies (the study they have started with or the new study they have changed to) without leaving the higher education system.
In this project I constructed a workflow that takes a DNA sequence as input and provides a phylogenetic tree, consisting of the input sequence and other sequences which were found during a database search. In this phylogenetic tree the sequences are arranged depending on similarities. In bioinformatics, constructing phylogenetic trees is often used to explore the evolutionary relationships of genes or organisms and to understand the mechanisms of evolution itself.
Spotlocator is a game wherein people have to guess the spots of where photos were taken. The photos of a defined area for each game are from panoramio.com. They are published at http://spotlocator. drupalgardens.com with an ID. Everyone can guess the photo spots by sending a special tweet via Twitter that contains the hashtag #spotlocator, the guessed coordinates and the ID of the photo. An evaluation is published for all tweets. The players are informed about the distance to the real photo spots and the positions are shown on a map.
Exploratory Data Analysis
(2014)
In bioinformatics the term exploratory data analysis refers to different methods to get an overview of large biological data sets. Hence, it helps to create a framework for further analysis and hypothesis testing. The workflow facilitates this first important step of the data analysis created by high-throughput technologies. The results are different plots showing the structure of the measurements. The goal of the workflow is the automatization of the exploratory data analysis, but also the flexibility should be guaranteed. The basic tool is the free software R.
The protein classification workflow described in this report enables users to get information about a novel protein sequence automatically. The information is derived by different bioinformatic analysis tools which calculate or predict features of a protein sequence. Also, databases are used to compare the novel sequence with known proteins.
Lessons Learned
(2014)
This chapter summarizes the experience and the lessons we learned concerning the application of the jABC as a framework for design and execution of scientific workflows. It reports experiences from the domain modeling (especially service integration) and workflow design phases and evaluates the resulting models statistically with respect to the SIB library and hierarchy levels.
The Course's SIB Libraries
(2014)
This chapter gives a detailed description of the service framework underlying all the example projects that form the foundation of this book. It describes the different SIB libraries that we made available for the course “Process modeling in the natural sciences” to provide the functionality that was required for the envisaged applications. The students used these SIB libraries to realize their projects.
A major part of the scientific experiments that are carried out today requires thorough computational support. While database and algorithm providers face the problem of bundling resources to create and sustain powerful computation nodes, the users have to deal with combining sets of (remote) services into specific data analysis and transformation processes. Today’s attention to “big data” amplifies the issues of size, heterogeneity, and process-level diversity/integration. In the last decade, especially workflow-based approaches to deal with these processes have enjoyed great popularity. This book concerns a particularly agile and model-driven approach to manage scientific workflows that is based on the XMDD paradigm. In this chapter we explain the scope and purpose of the book, briefly describe the concepts and technologies of the XMDD paradigm, explain the principal differences to related approaches, and outline the structure of the book.
We summarize here the main characteristics and features of the jABC framework, used in the case studies as a graphical tool for modeling scientific processes and workflows. As a comprehensive environment for service-oriented modeling and design according to the XMDD (eXtreme Model-Driven Design) paradigm, the jABC offers much more than the pure modeling capability. Associated technologies and plugins provide in fact means for a rich variety of supporting functionality, such as remote service integration, taxonomical service classification, model execution, model verification, model synthesis, and model compilation. We describe here in short both the essential jABC features and the service integration philosophy followed in the environment. In our work over the last years we have seen that this kind of service definition and provisioning platform has the potential to become a core technology in interdisciplinary service orchestration and technology transfer: Domain experts, like scientists not specially trained in computer science, directly define complex service orchestrations as process models and use efficient and complex domain-specific tools in a simple and intuitive way.
Software-as-a-Service (SaaS) offers several advantages to both service providers and users. Service providers can benefit from the reduction of Total Cost of Ownership (TCO), better scalability, and better resource utilization. On the other hand, users can use the service anywhere and anytime, and minimize upfront investment by following the pay-as-you-go model. Despite the benefits of SaaS, users still have concerns about the security and privacy of their data. Due to the nature of SaaS and the Cloud in general, the data and the computation are beyond the users' control, and hence data security becomes a vital factor in this new paradigm. Furthermore, in multi-tenant SaaS applications, the tenants become more concerned about the confidentiality of their data since several tenants are co-located onto a shared infrastructure.
To address those concerns, we start protecting the data from the provisioning process by controlling how tenants are being placed in the infrastructure. We present a resource allocation algorithm designed to minimize the risk of co-resident tenants called SecPlace. It enables the SaaS provider to control the resource (i.e., database instance) allocation process while taking into account the security of tenants as a requirement.
Due to the design principles of the multi-tenancy model, tenants follow some degree of sharing on both application and infrastructure levels. Thus, strong security-isolation should be present. Therefore, we develop SignedQuery, a technique that prevents one tenant from accessing others' data. We use the Signing Concept to create a signature that is used to sign the tenant's request, then the server can verifies the signature and recognizes the requesting tenant, and hence ensures that the data to be accessed is belonging to the legitimate tenant.
Finally, Data confidentiality remains a critical concern due to the fact that data in the Cloud is out of users' premises, and hence beyond their control. Cryptography is increasingly proposed as a potential approach to address such a challenge. Therefore, we present SecureDB, a system designed to run SQL-based applications over an encrypted database. SecureDB captures the schema design and analyzes it to understand the internal structure of the data (i.e., relationships between the tables and their attributes). Moreover, we determine the appropriate partialhomomorphic encryption scheme for each attribute where computation is possible even when the data is encrypted.
To evaluate our work, we conduct extensive experiments with di↵erent settings. The main use case in our work is a popular open source HRM application, called OrangeHRM. The results show that our multi-layered approach is practical, provides enhanced security and isolation among tenants, and have a moderate complexity in terms of processing encrypted data.
ProtoSense
(2015)
Physical computing covers the design and realization of interactive
objects and installations and allows students to develop concrete,
tangible products of the real world that arise from the learners’
imagination. This way, constructionist learning is raised to a level that
enables students to gain haptic experience and thereby concretizes the
virtual. In this paper the defining characteristics of physical computing
are described. Key competences to be gained with physical computing
will be identified.
KEYCIT 2014
(2015)
In our rapidly changing world it is increasingly important not only to be an expert in a chosen field of study but also to be able to respond to developments, master new approaches to solving problems, and fulfil changing requirements in the modern world and in the job market. In response to these needs key competencies in understanding, developing and using new digital technologies are being brought into focus in school and university programmes. The IFIP TC3 conference "KEYCIT – Key Competences in Informatics and ICT (KEYCIT 2014)" was held at the University of Potsdam in Germany from July 1st to 4th, 2014 and addressed the combination of key competencies, Informatics and ICT in detail. The conference was organized into strands focusing on secondary education, university education and teacher education (organized by IFIP WGs 3.1 and 3.3) and provided a forum to present and to discuss research, case studies, positions, and national perspectives in this field.
Through the use of next generation sequencing (NGS) technology, a lot of newly sequenced organisms are now available. Annotating those genes is one of the most challenging tasks in sequence biology. Here, we present an automated workflow to find homologue proteins, annotate sequences according to function and create a three-dimensional model.
With the jABC it is possible to realize workflows for numerous questions in different fields. The goal of this project was to create a workflow for the identification of differentially expressed genes. This is of special interest in biology, for it gives the opportunity to get a better insight in cellular changes due to exogenous stress, diseases and so on. With the knowledge that can be derived from the differentially expressed genes in diseased tissues, it becomes possible to find new targets for treatment.
A workflow for visualizing server connections using the Google Maps API was built in the jABC. It makes use of three basic services: An XML-based IP address geolocation web service, a command line tool and the Static Maps API. The result of the workflow is an URL leading to an image file of a map, showing server connections between a client and a target host.
Geocoder accuracy ranking
(2014)
Finding an address on a map is sometimes tricky: the chosen map application may be unfamiliar with the enclosed region. There are several geocoders on the market, they have different databases and algorithms to compute the query. Consequently, the geocoding results differ in their quality. Fortunately the geocoders provide a rich set of metadata. The workflow described in this paper compares this metadata with the aim to find out which geocoder is offering the best-fitting coordinate for a given address.
Analyses of metagenomes in life sciences present new opportunities as well as challenges to the scientific community and call for advanced computational methods and workflows. The large amount of data collected from samples via next-generation sequencing (NGS) technologies render manual approaches to sequence comparison and annotation unsuitable. Rather, fast and efficient computational pipelines are needed to provide comprehensive statistics and summaries and enable the researcher to choose appropriate tools for more specific analyses. The workflow presented here builds upon previous pipelines designed for automated clustering and annotation of raw sequence reads obtained from next-generation sequencing technologies such as 454 and Illumina. Employing specialized algorithms, the sequence reads are processed at three different levels. First, raw reads are clustered at high similarity cutoff to yield clusters which can be exported as multifasta files for further analyses. Independently, open reading frames (ORFs) are predicted from raw reads and clustered at two strictness levels to yield sets of non-redundant sequences and ORF families. Furthermore, single ORFs are annotated by performing searches against the Pfam database
This book presents an agile and model-driven approach to manage scientific workflows. The approach is based on the Extreme Model Driven Design (XMDD) paradigm and aims at simplifying and automating the complex data analysis processes carried out by scientists in their day-to-day work. Besides documenting the impact the workflow modeling might have on the work of natural scientists, this book serves three major purposes: 1. It acts as a primer for practitioners who are interested to learn how to think in terms of services and workflows when facing domain-specific scientific processes. 2. It provides interesting material for readers already familiar with this kind of tools, because it introduces systematically both the technologies used in each case study and the basic concepts behind them. 3. As the addressed thematic field becomes increasingly relevant for lectures in both computer science and experimental sciences, it also provides helpful material for teachers that plan similar courses.
Geometric generalization is a fundamental concept in the digital mapping process. An increasing amount of spatial data is provided on the web as well as a range of tools to process it. This jABC workflow is used for the automatic testing of web-based generalization services like mapshaper.org by executing its functionality, overlaying both datasets before and after the transformation and displaying them visually in a .tif file. Mostly Web Services and command line tools are used to build an environment where ESRI shapefiles can be uploaded, processed through a chosen generalization service and finally visualized in Irfanview.
In the geoinformatics field, remote sensing data is often used for analyzing the characteristics of the current investigation area. This includes DEMs, which are simple raster grids containing grey scales representing the respective elevation values. The project CREADED that is presented in this paper aims at making these monochrome raster images more significant and more intuitively interpretable. For this purpose, an executable interactive model for creating a colored and relief-shaded Digital Elevation Model (DEM) has been designed using the jABC framework. The process is based on standard jABC-SIBs and SIBs that provide specific GIS functions, which are available as Web services, command line tools and scripts.
This paper describes the implementation of a workflow model for service-oriented computing of potential areas for wind turbines in jABC. By implementing a re-executable model the manual effort of a multi-criteria site analysis can be reduced. The aim is to determine the shift of typical geoprocessing tools of geographic information systems (GIS) from the desktop to the web. The analysis is based on a vector data set and mainly uses web services of the “Center for Spatial Information Science and Systems” (CSISS). This paper discusses effort, benefits and problems associated with the use of the web services.
Location analyses are among the most common tasks while working with spatial data and geographic information systems. Automating the most frequently used procedures is therefore an important aspect of improving their usability. In this context, this project aims to design and implement a workflow, providing some basic tools for a location analysis. For the implementation with jABC, the workflow was applied to the problem of finding a suitable location for placing an artificial reef. For this analysis three parameters (bathymetry, slope and grain size of the ground material) were taken into account, processed, and visualized with the The Generic Mapping Tools (GMT), which were integrated into the workflow as jETI-SIBs. The implemented workflow thereby showed that the approach to combine jABC with GMT resulted in an user-centric yet user-friendly tool with high-quality cartographic outputs.
GraffDok is an application helping to maintain an overview over sprayed images somewhere in a city. At the time of writing it aims at vandalism rather than at beautiful photographic graffiti in an underpass. Looking at hundreds of tags and scribbles on monuments, house walls, etc. it would be interesting to not only record them in writing but even make them accessible electronically, including images.
GraffDok’s workflow is simple and only requires an EXIF-GPS-tagged photograph of a graffito. It automatically determines its location by using reverse geocoding with the given GPS-coordinates and the Gisgraphy WebService. While asking the user for some more meta data, GraffDok analyses the image in parallel with this and tries to detect fore- and background – before extracting the drawing lines and make them stand alone. The command line based tool ImageMagick is used here as well as for accessing EXIF data.
Any meta data is written to csv-files, which will stay easily accessible and can be integrated in TeX-files as well. The latter ones are converted to PDF at the end of the workflow, containing a table about all graffiti and a summary for each – including the generated characteristic graffiti pattern image.
The in-memory revolution
(2015)
This book describes the next generation of business applications enabled by SAP's in-memory database, SAP HANA. In particular, the authors show the substantial changes introduced in S4/HANA by switching to SAP HANA. Using numerous examples and use cases from the authors' wealth of real-world experience, it illustrates the quantum leap in performance made possible by the new technology. The book is written by two of the most prominent actors in the area of business application systems: Hasso Plattner, co-founder of SAP and inaugurator of the Hasso Plattner Institute at the University of Potsdam, and Bernd Leukert, member of the Executive Board and the Global Managing Board of SAP. This clearly structured, highly illustrated book takes an exciting new technology and presents the practicality and success of first mover applications.
Boolean constraint solving technology has made tremendous progress over the last decade, leading to industrial-strength solvers, for example, in the areas of answer set programming (ASP), the constraint satisfaction problem (CSP), propositional satisfiability (SAT) and satisfiability of quantified Boolean formulas (QBF). However, in all these areas, there exist multiple solving strategies that work well on different applications; no strategy dominates all other strategies. Therefore, no individual solver shows robust state-of-the-art performance in all kinds of applications. Additionally, the question arises how to choose a well-performing solving strategy for a given application; this is a challenging question even for solver and domain experts. One way to address this issue is the use of portfolio solvers, that is, a set of different solvers or solver configurations. We present three new automatic portfolio methods: (i) automatic construction of parallel portfolio solvers (ACPP) via algorithm configuration,(ii) solving the $NP$-hard problem of finding effective algorithm schedules with Answer Set Programming (aspeed), and (iii) a flexible algorithm selection framework (claspfolio2) allowing for fair comparison of different selection approaches. All three methods show improved performance and robustness in comparison to individual solvers on heterogeneous instance sets from many different applications. Since parallel solvers are important to effectively solve hard problems on parallel computation systems (e.g., multi-core processors), we extend all three approaches to be effectively applicable in parallel settings. We conducted extensive experimental studies different instance sets from ASP, CSP, MAXSAT, Operation Research (OR), SAT and QBF that indicate an improvement in the state-of-the-art solving heterogeneous instance sets. Last but not least, from our experimental studies, we deduce practical advice regarding the question when to apply which of our methods.
Deciphering the functioning of biological networks is one of the central tasks in systems biology. In particular, signal transduction networks are crucial for the understanding of the cellular response to external and internal perturbations. Importantly, in order to cope with the complexity of these networks, mathematical and computational modeling is required. We propose a computational modeling framework in order to achieve more robust discoveries in the context of logical signaling networks. More precisely, we focus on modeling the response of logical signaling networks by means of automated reasoning using Answer Set Programming (ASP). ASP provides a declarative language for modeling various knowledge representation and reasoning problems. Moreover, available ASP solvers provide several reasoning modes for assessing the multitude of answer sets. Therefore, leveraging its rich modeling language and its highly efficient solving capacities, we use ASP to address three challenging problems in the context of logical signaling networks: learning of (Boolean) logical networks, experimental design, and identification of intervention strategies. Overall, the contribution of this thesis is three-fold. Firstly, we introduce a mathematical framework for characterizing and reasoning on the response of logical signaling networks. Secondly, we contribute to a growing list of successful applications of ASP in systems biology. Thirdly, we present a software providing a complete pipeline for automated reasoning on the response of logical signaling networks.