000 Computer science, information science, general works
Sustained glacier melt in the Himalayas has gradually spawned more than 5,000 glacier lakes that are dammed by potentially unstable moraines. When such dams break, glacier lake outburst floods (GLOFs) can cause catastrophic societal and geomorphic impacts. We present a robust probabilistic estimate of average GLOF return periods in the Himalayan region, drawing on 5.4 billion simulations. We find that the 100-y outburst flood has an average volume of 33.5 (+3.7/−3.7) × 10⁶ m³ (posterior mean and 95% highest density interval [HDI]) with a peak discharge of 15,600 (+2,000/−1,800) m³·s⁻¹. Our estimated GLOF hazard is tied to the rate of historic lake outbursts and the number of present lakes, both of which are highest in the Eastern Himalayas. There, the estimated 100-y GLOF discharge (~14,500 m³·s⁻¹) is more than 3 times that of the adjacent Nyainqentanglha Mountains, and at least an order of magnitude higher than in the Hindu Kush, Karakoram, and Western Himalayas. The GLOF hazard may increase in these regions, which currently have large glaciers but few lakes, if projected future ice loss generates more unstable moraine-dammed lakes than we recognize today. Flood peaks from GLOFs mostly attenuate within Himalayan headwaters, but can rival monsoon-fed discharges in major rivers hundreds to thousands of kilometers downstream. Projections of future hazard from meteorological floods need to account for the extreme runoffs during lake outbursts, given the increasing trends in population, infrastructure, and hydropower projects in Himalayan headwaters.
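To illustrate the return-period logic behind such estimates, a toy Monte Carlo computation reads the T-year event off as the (1 − 1/T) quantile of simulated annual-maximum peak discharges; the distribution and parameters below are invented for illustration and do not reproduce the paper's model or its 5.4 billion simulations:

```python
import numpy as np

# Toy sketch of a return-period estimate: the lognormal distribution, its
# parameters, and the sample size are invented for illustration only.
rng = np.random.default_rng(seed=1)
annual_max_discharge = rng.lognormal(mean=7.5, sigma=1.1, size=1_000_000)  # m^3/s

T = 100  # return period in years
q_T = np.quantile(annual_max_discharge, 1 - 1 / T)  # T-year peak discharge
lo, hi = np.quantile(annual_max_discharge, [0.025, 0.975])
print(f"{T}-y peak discharge (synthetic): {q_T:,.0f} m^3/s")
print(f"central 95% interval: [{lo:,.0f}, {hi:,.0f}] m^3/s")
```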
In the smallest grammar problem, we are given a word w and we want to compute a preferably small context-free grammar G for the singleton language {w} (where the size of a grammar is the sum of the sizes of its rules, and the size of a rule is measured by the length of its right side). It is known that, for unbounded alphabets, the decision variant of this problem is NP-hard and the optimisation variant does not allow a polynomial-time approximation scheme, unless P = NP. We settle the long-standing open problem of whether these hardness results also hold for the more realistic case of a constant-size alphabet. More precisely, it is shown that the smallest grammar problem remains NP-complete (and its optimisation version is APX-hard) even if the alphabet is fixed and has size at least 17. The corresponding reduction is robust in the sense that it also works for an alternative size measure of grammars that is commonly used in the literature (i.e., a size measure that also takes the number of rules into account), and it also allows us to conclude that even computing the number of rules required by a smallest grammar is a hard problem. On the other hand, if the number of nonterminals (or, equivalently, the number of rules) is bounded by a constant, then the smallest grammar problem can be solved in polynomial time, which is shown by encoding it as a problem on graphs with interval structure. However, treating the number of rules as a parameter (in terms of parameterised complexity) yields W[1]-hardness. Furthermore, we present an O(3^|w|) exact exponential-time algorithm based on dynamic programming. These three main questions are also investigated for 1-level grammars, i.e., grammars in which only the start rule contains nonterminals on the right side, thus investigating the impact of the "hierarchical depth" of grammars on the complexity of the smallest grammar problem. In this regard, we obtain similar, but slightly stronger, results for 1-level grammars.
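As a concrete toy example of the size measure (the grammar and word below are our illustrative choices, not taken from the paper), a straight-line grammar for w = "abab" can be checked and measured in a few lines:

```python
# Toy illustration of the grammar size measure: the size of a rule is the
# length of its right-hand side, and the grammar size is the sum over all
# rules. The grammar below derives exactly the word "abab".
def expand(symbol, rules):
    """Recursively expand a symbol of a straight-line grammar."""
    if symbol not in rules:  # terminal symbol
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])

rules = {"S": ["A", "A"], "A": ["a", "b"]}  # S -> AA, A -> ab
print(expand("S", rules))                   # abab
size = sum(len(rhs) for rhs in rules.values())
print(size)  # 4; the alternative measure adds the number of rules: 4 + 2 = 6
```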
MDedup
(2020)
Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of the same real-world entities in (relational) datasets. Existing duplicate detection approaches are effective, but they are also hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific, if not dataset-specific, which is a problem when a new dataset needs to be cleaned.
For this reason, we propose a novel, rule-based and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a model that selects MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting step. Our experiments show that this approach reaches up to 94% F-measure and 100% precision on our evaluation datasets, which are good numbers considering that the system does not require domain or target data-specific configuration.
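To picture how a single MD acts as a duplicate detection rule, consider the following hedged sketch; the similarity function, attributes, thresholds, and records are illustrative assumptions, not the MDs discovered or selected by MDedup:

```python
from difflib import SequenceMatcher

# Sketch of one matching dependency (MD) used as a rule: if two records are
# sufficiently similar on all listed attributes, declare them duplicates.
def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def md_rule(r1: dict, r2: dict, attrs_thresholds) -> bool:
    return all(sim(r1[a], r2[a]) >= t for a, t in attrs_thresholds)

records = [
    {"name": "Jon Smith",  "city": "Berlin"},
    {"name": "John Smith", "city": "Berlin"},
    {"name": "Jane Doe",   "city": "Potsdam"},
]
rule = [("name", 0.8), ("city", 0.9)]  # illustrative thresholds
pairs = [(i, j) for i in range(len(records)) for j in range(i + 1, len(records))
         if md_rule(records[i], records[j], rule)]
print(pairs)  # [(0, 1)]
```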
The notion of coherence relations is quite widely accepted in general, but concrete proposals differ considerably on the questions of how they should be motivated, which relations are to be assumed, and how they should be defined. This paper takes a "bottom-up" perspective by assessing the contribution made by linguistic signals (connectives), using insights from the relevant literature as well as verification by practical text annotation. We work primarily with the German language here and focus on the realm of contrast. Thus, we suggest a new inventory of contrastive connective functions and discuss their relationship to contrastive coherence relations that have been proposed in earlier work.
Recurrent generative adversarial network for learning imbalanced medical image semantic segmentation
(2020)
We propose a new recurrent generative adversarial architecture named RNN-GAN to mitigate the imbalanced-data problem in medical image semantic segmentation, where the number of pixels belonging to the desired object is significantly lower than the number belonging to the background. A model trained with imbalanced data tends to be biased towards healthy data, which is undesirable in clinical applications, and the outputs predicted by such networks have high precision but low recall. To mitigate the impact of imbalanced training data, we train the RNN-GAN with the proposed complementary segmentation masks in addition to the ordinary segmentation masks. The RNN-GAN consists of two components: a generator and a discriminator. The generator is trained on sequences of medical images to learn the corresponding segmentation label map plus the proposed complementary label, both at the pixel level, while the discriminator is trained to distinguish whether a segmentation map comes from the ground truth or from the generator network. In both the generator and the discriminator, units are replaced with bidirectional LSTM units to enhance temporal consistency and to capture inter- and intra-slice representations of the features. We show evidence that the proposed framework is applicable to different types of medical images of varied sizes. In our experiments on the ACDC-2017, HVSMR-2016, and LiTS-2017 benchmarks we find consistently improved results, demonstrating the efficacy of our approach.
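The complementary-mask idea itself is simple enough to show in a few lines; the sketch below uses a tiny synthetic mask (real inputs would be image slices) and is not taken from the paper's code:

```python
import numpy as np

# Next to the ordinary segmentation mask, a complementary mask marks the
# (majority) background class explicitly, so the training loss sees both.
mask = np.array([[0, 0, 1],
                 [0, 1, 1],
                 [0, 0, 0]], dtype=np.float32)   # 1 = object pixel
complementary_mask = 1.0 - mask                  # 1 = background pixel

# Both targets sum to an all-ones map, pixel by pixel.
assert np.allclose(mask + complementary_mask, 1.0)
print(complementary_mask)
```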
Data Preparation
(2020)
Raw data are often messy: they follow different encodings, records are not well structured, values do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data management systems. Obtaining information from raw data relies on some data preparation process. Data preparation is integral to advanced data analysis and data management, not only for data science but for any data-driven application. Existing data preparation tools are operational and useful, but there is still room for improvement and optimization. With increasing data volume and its messy nature, the demand for prepared data grows day by day. To cater to this demand, companies and researchers are developing techniques and tools for data preparation. To better understand the available data preparation systems, we have conducted a survey to investigate (1) prominent data preparation tools, (2) distinctive tool features, (3) the need for preliminary data processing even with these tools, and (4) features and abilities that are still lacking. We conclude with an argument in support of automatic and intelligent data preparation beyond traditional and simplistic techniques.
Primary keys (PKs) and foreign keys (FKs) are important elements of relational schemata in various applications, such as query optimization and data integration. However, in many cases, these constraints are unknown or not documented. Detecting them manually is time-consuming and even infeasible in large-scale datasets. We study the problem of discovering primary keys and foreign keys automatically and propose an algorithm to detect both, namely Holistic Primary Key and Foreign Key Detection (HoPF). PKs and FKs are subsets of the sets of unique column combinations (UCCs) and inclusion dependencies (INDs), respectively, for which efficient discovery algorithms are known. Using score functions, our approach is able to effectively extract the true PKs and FKs from the vast sets of valid UCCs and INDs. Several pruning rules are employed to speed up the procedure. We evaluate precision and recall on three benchmarks and two real-world datasets. The results show that our method is able to retrieve on average 88% of all primary keys, and 91% of all foreign keys. We compare the performance of HoPF with two baseline approaches that both assume the existence of primary keys.
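The selection step can be pictured with a small score function over unique column combinations. The sketch below is in the spirit of, but not identical to, HoPF's score functions; the features, weights, and example schema are our own assumptions:

```python
# Illustrative scoring of primary-key candidates drawn from the set of
# valid unique column combinations (UCCs). Features and weights are assumed.
def pk_score(ucc: tuple, table_columns: list) -> float:
    length_score = 1.0 / len(ucc)                        # prefer short keys
    position_score = 1.0 - min(table_columns.index(c)
                               for c in ucc) / len(table_columns)
    name_score = 1.0 if any("id" in c.lower() for c in ucc) else 0.0
    return 0.5 * length_score + 0.3 * position_score + 0.2 * name_score

columns = ["customer_id", "name", "email", "city"]
uccs = [("customer_id",), ("email",), ("name", "city")]
best = max(uccs, key=lambda u: pk_score(u, columns))
print(best)  # ('customer_id',)
```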
This article discusses the technical implementation of a web portal and an SQL database (Output.UP) that largely automate the manual recording and analysis of scholarly publications for the Universitätsbibliothek Potsdam. Particular attention is paid to metadata imports via the APIs of ORCID, Crossref, and Unpaywall. After the test phase is completed, Output.UP will be made available in a Git repository for reuse.
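As a flavor of what such an import looks like, here is a minimal sketch of a Crossref lookup (the article names the ORCID, Crossref, and Unpaywall APIs; the function name and placeholder DOI below are our own):

```python
import requests

# Minimal sketch: fetch publication metadata for one DOI from the public
# Crossref REST API. The DOI below is a placeholder and must be replaced
# with a real one; error handling is reduced to the essentials.
def fetch_crossref(doi: str) -> dict:
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    resp.raise_for_status()
    return resp.json()["message"]

meta = fetch_crossref("10.1000/placeholder-doi")  # placeholder DOI
print(meta.get("title"), meta.get("author"))
```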
In many businesses, firms sell different types of products that exhibit mutual substitution effects in demand. Computing effective pricing strategies is challenging, as the sales probabilities of each of a firm's products can also be affected by the prices of potential substitutes. In this paper, we analyze stochastic dynamic multi-product pricing models for the sale of perishable goods. To circumvent the limitations of time-consuming optimal solutions for highly complex models, we propose different relaxation techniques, which make it possible to reduce the size of critical model components, such as the state space, the action space, or the set of potential sales events. Our heuristics decrease the size of those components by forming corresponding clusters and using subsets of representative elements. Using numerical examples, we verify that our heuristics make it possible to dramatically reduce the computation time while still obtaining close-to-optimal expected profits. Further, we show that our heuristics are (i) flexible, (ii) scalable, and (iii) arbitrarily combinable in a mutually supportive way.
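One way to picture the clustering-based relaxation is to shrink a large action space to a few representative price vectors; the price grid, the cluster count, and the use of k-means below are illustrative assumptions, not the paper's exact heuristics:

```python
import numpy as np
from sklearn.cluster import KMeans

# Reduce a large action space (price vectors for two substitutable
# products) to a small set of cluster representatives.
prices = np.array([(p1, p2) for p1 in np.arange(5, 50, 1.0)
                            for p2 in np.arange(5, 50, 1.0)])
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(prices)
representatives = kmeans.cluster_centers_   # reduced action space

print(len(prices), "->", len(representatives), "candidate price vectors")
```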
Selecting initial points, determining the number of clusters, and finding proper cluster centers are still the main challenges in clustering. In this paper, we suggest a genetic-algorithm-based method that searches several solution spaces simultaneously. The solution spaces are population groups consisting of elements with a similar structure. Elements in a group have the same size, while elements in different groups are of different sizes. The proposed algorithm processes the population in groups of chromosomes with one gene, two genes, up to k genes. These genes hold corresponding information about the cluster centers. In the proposed method, the crossover and mutation operators can accept parents of different sizes; this can lead to versatility in the population and information transfer among sub-populations. We implemented the proposed method and evaluated its performance on several random datasets as well as the Ruspini dataset. The experimental results show that the proposed method can effectively determine the appropriate number of clusters and recognize their centers. Overall, this research implies that using a heterogeneous population in the genetic algorithm can lead to better results.
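A crossover operator that accepts parents of different sizes can be sketched as follows; the one-dimensional centers and the sampling rule are illustrative assumptions, not the paper's exact operator:

```python
import random

# Each child inherits a random subset of cluster centers from both parents,
# so the number of centers can vary across offspring.
def crossover(parent_a: list, parent_b: list, rng: random.Random):
    pool = parent_a + parent_b
    k = rng.randint(min(len(parent_a), len(parent_b)),
                    max(len(parent_a), len(parent_b)))
    return rng.sample(pool, k)   # child with k cluster centers

rng = random.Random(0)
parent_a = [1.0, 5.5]                # chromosome with 2 cluster centers
parent_b = [0.8, 4.9, 9.7, 14.2]     # chromosome with 4 cluster centers
child = crossover(parent_a, parent_b, rng)
print(len(child), child)
```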