Refine
Document Type
- Article (2)
- Doctoral Thesis (1)
Language
- English (3)
Keywords
- Databases (3) (remove)
Given a large set of records in a database and a query record, similarity search aims to find all records sufficiently similar to the query record. To solve this problem, two main aspects need to be considered: First, to perform effective search, the set of relevant records is defined using a similarity measure. Second, an efficient access method is to be found that performs only few database accesses and comparisons using the similarity measure. This thesis solves both aspects with an emphasis on the latter. In the first part of this thesis, a frequency-aware similarity measure is introduced. Compared record pairs are partitioned according to frequencies of attribute values. For each partition, a different similarity measure is created: machine learning techniques combine a set of base similarity measures into an overall similarity measure. After that, a similarity index for string attributes is proposed, the State Set Index (SSI), which is based on a trie (prefix tree) that is interpreted as a nondeterministic finite automaton. For processing range queries, the notion of query plans is introduced in this thesis to describe which similarity indexes to access and which thresholds to apply. The query result should be as complete as possible under some cost threshold. Two query planning variants are introduced: (1) Static planning selects a plan at compile time that is used for all queries. (2) Query-specific planning selects a different plan for each query. For answering top-k queries, the Bulk Sorted Access Algorithm (BSA) is introduced, which retrieves large chunks of records from the similarity indexes using fixed thresholds, and which focuses its efforts on records that are ranked high in more than one attribute and thus promising candidates. The described components form a complete similarity search system. Based on prototypical implementations, this thesis shows comparative evaluation results for all proposed approaches on different real-world data sets, one of which is a large person data set from a German credit rating agency.
Stochastic modeling is a common practice for modeling uncertainty in hydrogeology. In stochastic modeling, aquifer properties are characterized by their probability density functions (PDFs). The Bayesian approach for inverse modeling is often used to assimilate information from field measurements collected at a site into properties’ posterior PDFs. This necessitates the definition of a prior PDF, characterizing the knowledge of hydrological properties before undertaking any investigation at the site, and usually coming from previous studies at similar sites. In this paper, we introduce a Bayesian hierarchical algorithm capable of assimilating various information–like point measurements, bounds and moments–into a single, informative PDF that we call ex-situ prior. This informative PDF summarizes the ex-situ information available about a hydrogeological parameter at a site of interest, which can then be used as a prior PDF in future studies at the site. We demonstrate the behavior of the algorithm on several synthetic case studies, compare it to other methods described in the literature, and illustrate the approach by applying it to a public open-access hydrogeological dataset.
Teaching Data Management
(2015)
Data management is a central topic in computer science as
well as in computer science education. Within the last years, this topic is
changing tremendously, as its impact on daily life becomes increasingly
visible. Nowadays, everyone not only needs to manage data of various
kinds, but also continuously generates large amounts of data. In
addition, Big Data and data analysis are intensively discussed in public
dialogue because of their influences on society. For the understanding of
such discussions and for being able to participate in them, fundamental
knowledge on data management is necessary. Especially, being aware
of the threats accompanying the ability to analyze large amounts of
data in nearly real-time becomes increasingly important. This raises the
question, which key competencies are necessary for daily dealings with
data and data management.
In this paper, we will first point out the importance of data management
and of Big Data in daily life. On this basis, we will analyze which are
the key competencies everyone needs concerning data management to
be able to handle data in a proper way in daily life. Afterwards, we will
discuss the impact of these changes in data management on computer
science education and in particular database education.