Refine
Has Fulltext
- yes (7) (remove)
Year of publication
- 2020 (7) (remove)
Document Type
- Doctoral Thesis (2)
- Postprint (2)
- Monograph/Edited Volume (1)
- Habilitation Thesis (1)
- Working Paper (1)
Language
- English (7)
Is part of the Bibliography
- yes (7)
Keywords
- machine learning (7) (remove)
Comment sections of online news platforms are an essential space to express opinions and discuss political topics. However, the misuse by spammers, haters, and trolls raises doubts about whether the benefits justify the costs of the time-consuming content moderation. As a consequence, many platforms limited or even shut down comment sections completely. In this thesis, we present deep learning approaches for comment classification, recommendation, and prediction to foster respectful and engaging online discussions. The main focus is on two kinds of comments: toxic comments, which make readers leave a discussion, and engaging comments, which make readers join a discussion. First, we discourage and remove toxic comments, e.g., insults or threats. To this end, we present a semi-automatic comment moderation process, which is based on fine-grained text classification models and supports moderators. Our experiments demonstrate that data augmentation, transfer learning, and ensemble learning allow training robust classifiers even on small datasets. To establish trust in the machine-learned models, we reveal which input features are decisive for their output with attribution-based explanation methods. Second, we encourage and highlight engaging comments, e.g., serious questions or factual statements. We automatically identify the most engaging comments, so that readers need not scroll through thousands of comments to find them. The model training process builds on upvotes and replies as a measure of reader engagement. We also identify comments that address the article authors or are otherwise relevant to them to support interactions between journalists and their readership. Taking into account the readers' interests, we further provide personalized recommendations of discussions that align with their favored topics or involve frequent co-commenters. Our models outperform multiple baselines and recent related work in experiments on comment datasets from different platforms.
Research synthesis on simple yet general hypotheses and ideas is challenging in scientific disciplines studying highly context-dependent systems such as medical, social, and biological sciences. This study shows that machine learning, equation-free statistical modeling of artificial intelligence, is a promising synthesis tool for discovering novel patterns and the source of controversy in a general hypothesis. We apply a decision tree algorithm, assuming that evidence from various contexts can be adequately integrated in a hierarchically nested structure. As a case study, we analyzed 163 articles that studied a prominent hypothesis in invasion biology, the enemy release hypothesis. We explored if any of the nine attributes that classify each study can differentiate conclusions as classification problem. Results corroborated that machine learning can be useful for research synthesis, as the algorithm could detect patterns that had been already focused in previous narrative reviews. Compared with the previous synthesis study that assessed the same evidence collection based on experts' judgement, the algorithm has newly proposed that the studies focusing on Asian regions mostly supported the hypothesis, suggesting that more detailed investigations in these regions can enhance our understanding of the hypothesis. We suggest that machine learning algorithms can be a promising synthesis tool especially where studies (a) reformulate a general hypothesis from different perspectives, (b) use different methods or variables, or (c) report insufficient information for conducting meta-analyses.
Rivers have always flooded their floodplains. Over 2.5 billion people worldwide have been affected by flooding in recent decades. The economic damage is also considerable, averaging 100 billion US dollars per year. There is no doubt that damage and other negative effects of floods can be avoided. However, this has a price: financially and politically. Costs and benefits can be estimated through risk assessments. Questions about the location and frequency of floods, about the objects that could be affected and their vulnerability are of importance for flood risk managers, insurance companies and politicians. Thus, both variables and factors from the fields of hydrology and sociol-economics play a role with multi-layered connections. One example are dikes along a river, which on the one hand contain floods, but on the other hand, by narrowing the natural floodplains, accelerate the flood discharge and increase the danger of flooding for the residents downstream. Such larger connections must be included in the assessment of flood risk. However, in current procedures this is accompanied by simplifying assumptions. Risk assessments are therefore fuzzy and associated with uncertainties.
This thesis investigates the benefits and possibilities of new data sources for improving flood risk assessment. New methods and models are developed, which take the mentioned interrelations better into account and also quantify the existing uncertainties of the model results, and thus enable statements about the reliability of risk estimates. For this purpose, data on flood events from various sources are collected and evaluated. This includes precipitation and flow records at measuring stations as well as for instance images from social media, which can help to delineate the flooded areas and estimate flood damage with location information. Machine learning methods have been successfully used to recognize and understand correlations between floods and impacts from a wide range of data and to develop improved models.
Risk models help to develop and evaluate strategies to reduce flood risk. These tools also provide advanced insights into the interplay of various factors and on the expected consequences of flooding. This work shows progress in terms of an improved assessment of flood risks by using diverse data from different sources with innovative methods as well as by the further development of models. Flood risk is variable due to economic and climatic changes, and other drivers of risk. In order to keep the knowledge about flood risks up-to-date, robust, efficient and adaptable methods as proposed in this thesis are of increasing importance.
The “HPI Future SOC Lab” is a cooperation of the Hasso Plattner Institute (HPI) and industry partners. Its mission is to enable and promote exchange and interaction between the research community and the industry partners.
The HPI Future SOC Lab provides researchers with free of charge access to a complete infrastructure of state of the art hard and software. This infrastructure includes components, which might be too expensive for an ordinary research environment, such as servers with up to 64 cores and 2 TB main memory. The offerings address researchers particularly from but not limited to the areas of computer science and business information systems. Main areas of research include cloud computing, parallelization, and In-Memory technologies.
This technical report presents results of research projects executed in 2017. Selected projects have presented their results on April 25th and November 15th 2017 at the Future SOC Lab Day events.
Successfully completing any data science project demands careful consideration across its whole process. Although the focus is often put on later phases of the process, in practice, experts spend more time in earlier phases, preparing data, to make them consistent with the systems' requirements or to improve their models' accuracies. Duplicate detection is typically applied during the data cleaning phase, which is dedicated to removing data inconsistencies and improving the overall quality and usability of data. While data cleaning involves a plethora of approaches to perform specific operations, such as schema alignment and data normalization, the task of detecting and removing duplicate records is particularly challenging. Duplicates arise when multiple records representing the same entities exist in a database. Due to numerous reasons, spanning from simple typographical errors to different schemas and formats of integrated databases. Keeping a database free of duplicates is crucial for most use-cases, as their existence causes false negatives and false positives when matching queries against it. These two data quality issues have negative implications for tasks, such as hotel booking, where users may erroneously select a wrong hotel, or parcel delivery, where a parcel can get delivered to the wrong address. Identifying the variety of possible data issues to eliminate duplicates demands sophisticated approaches.
While research in duplicate detection is well-established and covers different aspects of both efficiency and effectiveness, our work in this thesis focuses on the latter. We propose novel approaches to improve data quality before duplicate detection takes place and apply the latter in datasets even when prior labeling is not available. Our experiments show that improving data quality upfront can increase duplicate classification results by up to 19%. To this end, we propose two novel pipelines that select and apply generic as well as address-specific data preparation steps with the purpose of maximizing the success of duplicate detection. Generic data preparation, such as the removal of special characters, can be applied to any relation with alphanumeric attributes. When applied, data preparation steps are selected only for attributes where there are positive effects on pair similarities, which indirectly affect classification, or on classification directly. Our work on addresses is twofold; first, we consider more domain-specific approaches to improve the quality of values, and, second, we experiment with known and modified versions of similarity measures to select the most appropriate per address attribute, e.g., city or country.
To facilitate duplicate detection in applications where gold standard annotations are not available and obtaining them is not possible or too expensive, we propose MDedup. MDedup is a novel, rule-based, and fully automatic duplicate detection approach that is based on matching dependencies. These dependencies can be used to detect duplicates and can be discovered using state-of-the-art algorithms efficiently and without any prior labeling. MDedup uses two pipelines to first train on datasets with known labels, learning to identify useful matching dependencies, and then be applied on unseen datasets, regardless of any existing gold standard. Finally, our work is accompanied by open source code to enable repeatability of our research results and application of our approaches to other datasets.
Numerous studies investigate which sanctioning institutions prevent cartel formation but little is known as to how these sanctions work. We contribute to understanding the inner workings of cartels by studying experimentally the effect of sanctioning institutions on firms’ communication. Using machine learning to organize the chat communication into topics, we find that firms are significantly less likely to communicate explicitly about price fixing when sanctioning institutions are present. At the same time, average prices are lower when communication is less explicit. A mediation analysis suggests that sanctions are effective in hindering cartel formation not only because they introduce a risk of being fined but also by reducing the prevalence of explicit price communication.
Machine learning (ML) algorithms are being increasingly used in Earth and Environmental modeling studies owing to the ever-increasing availability of diverse data sets and computational resources as well as advancement in ML algorithms. Despite advances in their predictive accuracy, the usefulness of ML algorithms for inference remains elusive. In this study, we employ two popular ML algorithms, artificial neural networks and random forest, to analyze a large data set of flood events across Germany with the goals to analyze their predictive accuracy and their usability to provide insights to hydrologic system functioning. The results of the ML algorithms are contrasted against a parametric approach based on multiple linear regression. For analysis, we employ a model-agnostic framework named Permuted Feature Importance to derive the influence of models' predictors. This allows us to compare the results of different algorithms for the first time in the context of hydrology. Our main findings are that (1) the ML models achieve higher prediction accuracy than linear regression, (2) the results reflect basic hydrological principles, but (3) further inference is hindered by the heterogeneity of results across algorithms. Thus, we conclude that the problem of equifinality as known from classical hydrological modeling also exists for ML and severely hampers its potential for inference. To account for the observed problems, we propose that when employing ML for inference, this should be made by using multiple algorithms and multiple methods, of which the latter should be embedded in a cross-validation routine.