Institut für Informatik und Computational Science
Refine
Year of publication
Document Type
- Doctoral Thesis (206) (remove)
Keywords
- Maschinelles Lernen (7)
- Antwortmengenprogrammierung (6)
- Machine Learning (6)
- Modellierung (5)
- answer set programming (4)
- Answer Set Programming (3)
- Ontologie (3)
- Semantic Web (3)
- machine learning (3)
- Algorithmen (2)
Boolean constraint solving technology has made tremendous progress over the last decade, leading to industrial-strength solvers, for example, in the areas of answer set programming (ASP), the constraint satisfaction problem (CSP), propositional satisfiability (SAT) and satisfiability of quantified Boolean formulas (QBF). However, in all these areas, there exist multiple solving strategies that work well on different applications; no strategy dominates all other strategies. Therefore, no individual solver shows robust state-of-the-art performance in all kinds of applications. Additionally, the question arises how to choose a well-performing solving strategy for a given application; this is a challenging question even for solver and domain experts. One way to address this issue is the use of portfolio solvers, that is, a set of different solvers or solver configurations. We present three new automatic portfolio methods: (i) automatic construction of parallel portfolio solvers (ACPP) via algorithm configuration,(ii) solving the $NP$-hard problem of finding effective algorithm schedules with Answer Set Programming (aspeed), and (iii) a flexible algorithm selection framework (claspfolio2) allowing for fair comparison of different selection approaches. All three methods show improved performance and robustness in comparison to individual solvers on heterogeneous instance sets from many different applications. Since parallel solvers are important to effectively solve hard problems on parallel computation systems (e.g., multi-core processors), we extend all three approaches to be effectively applicable in parallel settings. We conducted extensive experimental studies different instance sets from ASP, CSP, MAXSAT, Operation Research (OR), SAT and QBF that indicate an improvement in the state-of-the-art solving heterogeneous instance sets. Last but not least, from our experimental studies, we deduce practical advice regarding the question when to apply which of our methods.
The introduction of columnar in-memory databases, along with hardware evolution, has made the execution of transactional and analytical enterprise application workloads on a single system both feasible and viable. Yet, we argue that executing analytical aggregate queries directly on the transactional data can decrease the overall system performance. Despite the aggregation capabilities of columnar in-memory databases, the direct access to records of a materialized aggregate is always more efficient than aggregating on the fly. The traditional approach to materialized aggregates, however, introduces significant overhead in terms of materialized view selection, maintenance, and exploitation. When this overhead is handled by the application, it increases the application complexity, and can slow down the transactional throughput of inserts, updates, and deletes.
In this thesis, we motivate, propose, and evaluate the aggregate cache, a materialized aggregate engine in the main-delta architecture of a columnar in-memory database that provides efficient means to handle costly aggregate queries of enterprise applications. For our design, we leverage the specifics of the main-delta architecture that separates a table into a main and delta partition. The central concept is to only cache the partial aggregate query result as defined on the main partition of a table, because the main partition is relatively stable as records are only inserted into the delta partition. We contribute by proposing incremental aggregate maintenance and query compensation techniques for mixed workloads of enterprise applications. In addition, we introduce aggregate profit metrics that increase the likelihood of persisting the most profitable aggregates in the aggregate cache.
Query compensation and maintenance of materialized aggregates based on joins of multiple tables is expensive due to the partitioned tables in the main-delta architecture. Our analysis of enterprise applications has revealed several data schema and workload patterns. This includes the observation that transactional data is persisted in header and item tables, whereas in many cases, the insertion of related header and item records is executed in a single database transaction. We contribute by proposing an approach to transport these application object semantics to the database system and optimize the query processing using the aggregate cache by applying partition pruning and predicate pushdown techniques.
For the experimental evaluation, we propose the FICO benchmark that is based on data from a productive ERP system with extracted mixed workloads. Our evaluation reveals that the aggregate cache can accelerate the execution of aggregate queries up to a factor of 60 whereas the speedup highly depends on the number of aggregated records in the main and delta partitions. In mixed workloads, the proposed aggregate maintenance and query compensation techniques perform up to an order of magnitude better than traditional materialized aggregate maintenance approaches. The introduced aggregate profit metrics outperform existing costbased metrics by up to 20%. Lastly, the join pruning and predicate pushdown techniques can accelerate query execution in the aggregate cache in the presence of multiple partitioned tables by up to an order of magnitude.
In the last decades, there was a notable progress in solving the well-known Boolean satisfiability (Sat) problem, which can be witnessed by powerful Sat solvers. One of the reasons why these solvers are so fast are structural properties of instances that are utilized by the solver’s interna. This thesis deals with the well-studied structural property treewidth, which measures the closeness of an instance to being a tree. In fact, there are many problems parameterized by treewidth that are solvable in polynomial time in the instance size when parameterized by treewidth.
In this work, we study advanced treewidth-based methods and tools for problems in knowledge representation and reasoning (KR). Thereby, we provide means to establish precise runtime results (upper bounds) for canonical problems relevant to KR. Then, we present a new type of problem reduction, which we call decomposition-guided (DG) that
allows us to precisely monitor the treewidth when reducing from one problem to another problem. This new reduction type will be the basis for a long-open lower bound result for quantified Boolean formulas and allows us to design a new methodology for establishing runtime lower bounds for problems parameterized by treewidth.
Finally, despite these lower bounds, we provide an efficient implementation of algorithms that adhere to treewidth. Our approach finds suitable abstractions of instances, which are subsequently refined in a recursive fashion, and it uses Sat solvers for solving subproblems. It turns out that our resulting solver is quite competitive for two canonical counting problems related to Sat.
The field of machine learning studies algorithms that infer predictive models from data. Predictive models are applicable for many practical tasks such as spam filtering, face and handwritten digit recognition, and personalized product recommendation. In general, they are used to predict a target label for a given data instance. In order to make an informed decision about the deployment of a predictive model, it is crucial to know the model’s approximate performance. To evaluate performance, a set of labeled test instances is required that is drawn from the distribution the model will be exposed to at application time. In many practical scenarios, unlabeled test instances are readily available, but the process of labeling them can be a time- and cost-intensive task and may involve a human expert. This thesis addresses the problem of evaluating a given predictive model accurately with minimal labeling effort. We study an active model evaluation process that selects certain instances of the data according to an instrumental sampling distribution and queries their labels. We derive sampling distributions that minimize estimation error with respect to different performance measures such as error rate, mean squared error, and F-measures. An analysis of the distribution that governs the estimator leads to confidence intervals, which indicate how precise the error estimation is. Labeling costs may vary across different instances depending on certain characteristics of the data. For instance, documents differ in their length, comprehensibility, and technical requirements; these attributes affect the time a human labeler needs to judge relevance or to assign topics. To address this, the sampling distribution is extended to incorporate instance-specific costs. We empirically study conditions under which the active evaluation processes are more accurate than a standard estimate that draws equally many instances from the test distribution. We also address the problem of comparing the risks of two predictive models. The standard approach would be to draw instances according to the test distribution, label the selected instances, and apply statistical tests to identify significant differences. Drawing instances according to an instrumental distribution affects the power of a statistical test. We derive a sampling procedure that maximizes test power when used to select instances, and thereby minimizes the likelihood of choosing the inferior model. Furthermore, we investigate the task of comparing several alternative models; the objective of an evaluation could be to rank the models according to the risk that they incur or to identify the model with lowest risk. An experimental study shows that the active procedure leads to higher test power than the standard test in many application domains. Finally, we study the problem of evaluating the performance of ranking functions, which are used for example for web search. In practice, ranking performance is estimated by applying a given ranking model to a representative set of test queries and manually assessing the relevance of all retrieved items for each query. We apply the concepts of active evaluation and active comparison to ranking functions and derive optimal sampling distributions for the commonly used performance measures Discounted Cumulative Gain and Expected Reciprocal Rank. Experiments on web search engine data illustrate significant reductions in labeling costs.
This thesis discusses challenges in IT security education, points out a gap between e-learning and practical education, and presents a work to fill the gap. E-learning is a flexible and personalized alternative to traditional education. Nonetheless, existing e-learning systems for IT security education have difficulties in delivering hands-on experience because of the lack of proximity. Laboratory environments and practical exercises are indispensable instruction tools to IT security education, but security education in conventional computer laboratories poses particular problems such as immobility as well as high creation and maintenance costs. Hence, there is a need to effectively transform security laboratories and practical exercises into e-learning forms. In this thesis, we introduce the Tele-Lab IT-Security architecture that allows students not only to learn IT security principles, but also to gain hands-on security experience by exercises in an online laboratory environment. In this architecture, virtual machines are used to provide safe user work environments instead of real computers. Thus, traditional laboratory environments can be cloned onto the Internet by software, which increases accessibility to laboratory resources and greatly reduces investment and maintenance costs. Under the Tele-Lab IT-Security framework, a set of technical solutions is also proposed to provide effective functionalities, reliability, security, and performance. The virtual machines with appropriate resource allocation, software installation, and system configurations are used to build lightweight security laboratories on a hosting computer. Reliability and availability of laboratory platforms are covered by a virtual machine management framework. This management framework provides necessary monitoring and administration services to detect and recover critical failures of virtual machines at run time. Considering the risk that virtual machines can be misused for compromising production networks, we present a security management solution to prevent the misuse of laboratory resources by security isolation at the system and network levels. This work is an attempt to bridge the gap between e-learning/tele-teaching and practical IT security education. It is not to substitute conventional teaching in laboratories but to add practical features to e-learning. This thesis demonstrates the possibility to implement hands-on security laboratories on the Internet reliably, securely, and economically.
As a result of CMOS scaling, radiation-induced Single-Event Effects (SEEs) in electronic circuits became a critical reliability issue for modern Integrated Circuits (ICs) operating under harsh radiation conditions. SEEs can be triggered in combinational or sequential logic by the impact of high-energy particles, leading to destructive or non-destructive faults, resulting in data corruption or even system failure. Typically, the SEE mitigation methods are deployed statically in processing architectures based on the worst-case radiation conditions, which is most of the time unnecessary and results in a resource overhead. Moreover, the space radiation conditions are dynamically changing, especially during Solar Particle Events (SPEs). The intensity of space radiation can differ over five orders of magnitude within a few hours or days, resulting in several orders of magnitude fault probability variation in ICs during SPEs. This thesis introduces a comprehensive approach for designing a self-adaptive fault resilient multiprocessing system to overcome the static mitigation overhead issue. This work mainly addresses the following topics: (1) Design of on-chip radiation particle monitor for real-time radiation environment detection, (2) Investigation of space environment predictor, as support for solar particle events forecast, (3) Dynamic mode configuration in the resilient multiprocessing system. Therefore, according to detected and predicted in-flight space radiation conditions, the target system can be configured to use no mitigation or low-overhead mitigation during non-critical periods of time. The redundant resources can be used to improve system performance or save power. On the other hand, during increased radiation activity periods, such as SPEs, the mitigation methods can be dynamically configured appropriately depending on the real-time space radiation environment, resulting in higher system reliability. Thus, a dynamic trade-off in the target system between reliability, performance and power consumption in real-time can be achieved. All results of this work are evaluated in a highly reliable quad-core multiprocessing system that allows the self-adaptive setting of optimal radiation mitigation mechanisms during run-time. Proposed methods can serve as a basis for establishing a comprehensive self-adaptive resilient system design process. Successful implementation of the proposed design in the quad-core multiprocessor shows its application perspective also in the other designs.
Quantified Boolean formulas (QBFs) play an important role in theoretical computer science. QBF extends propositional logic in such a way that many advanced forms of reasoning can be easily formulated and evaluated. In this dissertation we present our ZQSAT, which is an algorithm for evaluating quantified Boolean formulas. ZQSAT is based on ZBDD: Zero-Suppressed Binary Decision Diagram , which is a variant of BDD, and an adopted version of the DPLL algorithm. It has been implemented in C using the CUDD: Colorado University Decision Diagram package. The capability of ZBDDs in storing sets of subsets efficiently enabled us to store the clauses of a QBF very compactly and let us to embed the notion of memoization to the DPLL algorithm. These points led us to implement the search algorithm in such a way that we could store and reuse the results of all previously solved subformulas with a little overheads. ZQSAT can solve some sets of standard QBF benchmark problems (known to be hard for DPLL based algorithms) faster than the best existing solvers. In addition to prenex-CNF, ZQSAT accepts prenex-NNF formulas. We show and prove how this capability can be exponentially beneficial.