TY - JOUR A1 - Kühn, Daniela A1 - Hainzl, Sebastian A1 - Dahm, Torsten A1 - Richter, Gudrun A1 - Vera Rodriguez, Ismael T1 - A review of source models to further the understanding of the seismicity of the Groningen field JF - Netherlands journal of geosciences : NJG N2 - The occurrence of felt earthquakes due to gas production in Groningen has initiated numerous studies and model attempts to understand and quantify induced seismicity in this region. The available models span the full range from fully deterministic models to purely empirical and stochastic models. In this article, we summarise the most important model approaches, describing their main achievements and limitations. In addition, we discuss remaining open questions and potential future directions of development. KW - deterministic KW - empirical KW - hybrid KW - machine learning KW - seismicity model Y1 - 2022 U6 - https://doi.org/10.1017/njg.2022.7 SN - 0016-7746 SN - 1573-9708 VL - 101 PB - Cambridge Univ. Press CY - Cambridge ER - TY - THES A1 - Chen, Junchao T1 - A self-adaptive resilient method for implementing and managing the high-reliability processing system T1 - Eine selbstadaptive belastbare Methode zum Implementieren und Verwalten von hochzuverlässigen Verarbeitungssystemen N2 - As a result of CMOS scaling, radiation-induced Single-Event Effects (SEEs) in electronic circuits have become a critical reliability issue for modern Integrated Circuits (ICs) operating under harsh radiation conditions. SEEs can be triggered in combinational or sequential logic by the impact of high-energy particles, leading to destructive or non-destructive faults and resulting in data corruption or even system failure. Typically, SEE mitigation methods are deployed statically in processing architectures based on worst-case radiation conditions, which is unnecessary most of the time and results in a resource overhead.
Moreover, space radiation conditions change dynamically, especially during Solar Particle Events (SPEs). The intensity of space radiation can vary by over five orders of magnitude within a few hours or days, causing the fault probability in ICs to vary by several orders of magnitude during SPEs. This thesis introduces a comprehensive approach for designing a self-adaptive fault-resilient multiprocessing system to overcome the static mitigation overhead issue. This work mainly addresses the following topics: (1) Design of an on-chip radiation particle monitor for real-time radiation environment detection, (2) Investigation of a space environment predictor to support solar particle event forecasting, (3) Dynamic mode configuration in the resilient multiprocessing system. Based on the detected and predicted in-flight space radiation conditions, the target system can be configured to use no mitigation or low-overhead mitigation during non-critical periods. The redundant resources can then be used to improve system performance or save power. During periods of increased radiation activity, such as SPEs, the mitigation methods can be dynamically configured as appropriate for the real-time space radiation environment, resulting in higher system reliability. Thus, a dynamic real-time trade-off between reliability, performance and power consumption can be achieved in the target system. All results of this work are evaluated in a highly reliable quad-core multiprocessing system that allows the self-adaptive setting of optimal radiation mitigation mechanisms during run-time. The proposed methods can serve as a basis for establishing a comprehensive self-adaptive resilient system design process. The successful implementation of the proposed design in the quad-core multiprocessor demonstrates its applicability to other designs as well.
N2 - Infolge der CMOS-Skalierung wurden strahleninduzierte Einzelereignis-Effekte (SEEs) in elektronischen Schaltungen zu einem kritischen Zuverlässigkeitsproblem für moderne integrierte Schaltungen (ICs), die unter rauen Strahlungsbedingungen arbeiten. SEEs können in der kombinatorischen oder sequentiellen Logik durch den Aufprall hochenergetischer Teilchen ausgelöst werden, was zu destruktiven oder nicht-destruktiven Fehlern und damit zu Datenverfälschungen oder sogar Systemausfällen führt. Normalerweise werden die Methoden zur Abschwächung von SEEs statisch in Verarbeitungsarchitekturen auf der Grundlage der ungünstigsten Strahlungsbedingungen eingesetzt, was in den meisten Fällen unnötig ist und zu einem Ressourcen-Overhead führt. Darüber hinaus ändern sich die Strahlungsbedingungen im Weltraum dynamisch, insbesondere während Solar Particle Events (SPEs). Die Intensität der Weltraumstrahlung kann sich innerhalb weniger Stunden oder Tage um mehr als fünf Größenordnungen ändern, was zu einer Variation der Fehlerwahrscheinlichkeit in ICs während SPEs um mehrere Größenordnungen führt. In dieser Arbeit wird ein umfassender Ansatz für den Entwurf eines selbstanpassenden, fehlerresistenten Multiprozessorsystems vorgestellt, um das Problem des statischen Mitigation-Overheads zu überwinden. Diese Arbeit befasst sich hauptsächlich mit den folgenden Themen: (1) Entwurf eines On-Chip-Strahlungsteilchenmonitors zur Echtzeit-Erkennung von Strahlungsumgebungen, (2) Untersuchung von Weltraumumgebungsprognosen zur Unterstützung der Vorhersage von solaren Teilchenereignissen, (3) Konfiguration des dynamischen Modus in einem belastbaren Multiprozessorsystem. Daher kann das Zielsystem je nach den erkannten und vorhergesagten Strahlungsbedingungen während des Fluges so konfiguriert werden, dass es während unkritischer Zeiträume keine oder nur eine geringe Strahlungsminderung vornimmt.
Die redundanten Ressourcen können genutzt werden, um die Systemleistung zu verbessern oder Energie zu sparen. In Zeiten erhöhter Strahlungsaktivität, wie z. B. während SPEs, können die Abschwächungsmethoden dynamisch und in Abhängigkeit von der Echtzeit-Strahlungsumgebung im Weltraum konfiguriert werden, was zu einer höheren Systemzuverlässigkeit führt. Auf diese Weise kann im Zielsystem ein dynamischer Kompromiss zwischen Zuverlässigkeit, Leistung und Stromverbrauch in Echtzeit erreicht werden. Alle Ergebnisse dieser Arbeit wurden in einem hochzuverlässigen Quad-Core-Multiprozessorsystem evaluiert, das die selbstanpassende Einstellung optimaler Strahlungsschutzmechanismen während der Laufzeit ermöglicht. Die vorgeschlagenen Methoden können als Grundlage für die Entwicklung eines umfassenden, selbstanpassenden und belastbaren Systementwurfsprozesses dienen. Die erfolgreiche Implementierung des vorgeschlagenen Entwurfs in einem Quad-Core-Multiprozessor zeigt, dass er auch für andere Entwürfe geeignet ist. KW - single event upset KW - solar particle event KW - machine learning KW - self-adaptive multiprocessing system KW - maschinelles Lernen KW - selbstanpassendes Multiprozessorsystem KW - strahleninduzierte Einzelereignis-Effekte KW - Sonnenteilchen-Ereignis Y1 - 2023 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-583139 ER - TY - THES A1 - Flöter, André T1 - Analyzing biological expression data based on decision tree induction T1 - Analyse biologischer Expressionsdaten mit Hilfe von Entscheidungsbauminduktion N2 - Modern biological analysis techniques supply scientists with various forms of data. One category of such data are the so-called "expression data". These data indicate the quantities of biochemical compounds present in tissue samples. Nowadays, expression data can be generated at high speed. This in turn leads to amounts of data no longer analysable by classical statistical techniques.
Systems biology is the new field that focuses on the modelling of this information. At present, various methods are used for this purpose. One superordinate class of these methods is machine learning. Methods of this kind had, until recently, predominantly been used for classification and prediction tasks. This neglected a powerful secondary benefit: the ability to induce interpretable models. Obtaining such models from data has become a key issue within systems biology. Numerous approaches have been proposed and intensively discussed. This thesis focuses on the examination and exploitation of one basic technique: decision trees. The concept of comparing sets of decision trees is developed. This method offers the possibility of identifying significant thresholds in continuous or discrete valued attributes through their corresponding set of decision trees. Finding significant thresholds in attributes is a means of identifying states in living organisms. Knowing about states is an invaluable clue to the understanding of dynamic processes in organisms. Applied to metabolite concentration data, the proposed method was able to identify states which were not found with conventional techniques for threshold extraction. A second approach exploits the structure of sets of decision trees for the discovery of combinatorial dependencies between attributes. Previous work on this issue has focused either on expensive computational methods or the interpretation of single decision trees, a very limited exploitation of the data. This has led to incomplete or unstable results. That is why a new method is developed that uses sets of decision trees to overcome these limitations. Both the introduced methods are available as software tools. They can be applied consecutively or separately. That way they make up a package of analytical tools that usefully supplement existing methods.
By means of these tools, the newly introduced methods were able to confirm existing knowledge and to suggest interesting and new relationships between metabolites. N2 - Neuere biologische Analysetechniken liefern Forschern verschiedenste Arten von Daten. Eine Art dieser Daten sind die so genannten "Expressionsdaten". Sie geben die Konzentrationen biochemischer Inhaltsstoffe in Gewebeproben an. Neuerdings können Expressionsdaten sehr schnell erzeugt werden. Das führt wiederum zu so großen Datenmengen, dass sie nicht mehr mit klassischen statistischen Verfahren analysiert werden können. "Systems biology" ist eine neue Disziplin, die sich mit der Modellierung solcher Information befasst. Zur Zeit werden dazu verschiedenste Methoden benutzt. Eine Superklasse dieser Methoden ist das maschinelle Lernen. Dieses wurde bis vor kurzem ausschließlich zum Klassifizieren und zum Vorhersagen genutzt. Dabei wurde eine wichtige zweite Eigenschaft vernachlässigt, nämlich die Möglichkeit zum Erlernen von interpretierbaren Modellen. Die Erstellung solcher Modelle hat mittlerweile eine Schlüsselrolle in der "Systems biology" erlangt. Es sind bereits zahlreiche Methoden dazu vorgeschlagen und diskutiert worden. Die vorliegende Arbeit befasst sich mit der Untersuchung und Nutzung einer ganz grundlegenden Technik: den Entscheidungsbäumen. Zunächst wird ein Konzept zum Vergleich von Baummengen entwickelt, welches das Erkennen bedeutsamer Schwellwerte in reellwertigen Daten anhand ihrer zugehörigen Entscheidungswälder ermöglicht. Das Erkennen solcher Schwellwerte dient dem Verständnis von dynamischen Abläufen in lebenden Organismen. Bei der Anwendung dieser Technik auf metabolische Konzentrationsdaten wurden bereits Zustände erkannt, die nicht mit herkömmlichen Techniken entdeckt werden konnten. Ein zweiter Ansatz befasst sich mit der Auswertung der Struktur von Entscheidungswäldern zur Entdeckung von kombinatorischen Abhängigkeiten zwischen Attributen.
Bisherige Arbeiten hierzu befassten sich vornehmlich mit rechenintensiven Verfahren oder mit einzelnen Entscheidungsbäumen, eine sehr eingeschränkte Ausbeutung der Daten. Das führte dann entweder zu unvollständigen oder instabilen Ergebnissen. Darum wird hier eine Methode entwickelt, die Mengen von Entscheidungsbäumen nutzt, um diese Beschränkungen zu überwinden. Beide vorgestellten Verfahren gibt es als Werkzeuge für den Computer, die entweder hintereinander oder einzeln verwendet werden können. Auf diese Weise stellen sie eine sinnvolle Ergänzung zu vorhandenen Analysewerkzeugen dar. Mit Hilfe der bereitgestellten Software war es möglich, bekanntes Wissen zu bestätigen und interessante neue Zusammenhänge im Stoffwechsel von Pflanzen aufzuzeigen. KW - Molekulare Bioinformatik KW - Maschinelles Lernen KW - Entscheidungsbäume KW - machine learning KW - decision trees KW - computational biology Y1 - 2005 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-6416 ER - TY - JOUR A1 - Ghafarian, Fatemeh A1 - Wieland, Ralf A1 - Lüttschwager, Dietmar A1 - Nendel, Claas T1 - Application of extreme gradient boosting and Shapley Additive explanations to predict temperature regimes inside forests from standard open-field meteorological data JF - Environmental modelling & software with environment data news N2 - Forest microclimate can buffer biotic responses to summer heat waves, which are expected to become more extreme under climate warming. Prediction of forest microclimate is limited because meteorological observation standards seldom include situations inside forests. We use eXtreme Gradient Boosting - a Machine Learning technique - to predict the microclimate of forest sites in Brandenburg, Germany, using seasonal data comprising weather features. The analysis was amended by applying a SHapley Additive explanation to show the interaction effect of variables and individualised feature attributions.
We evaluate model performance in comparison to artificial neural networks, random forest, support vector machine, and multi-linear regression. After implementing a feature selection, an ensemble approach was applied to combine individual models for each forest and improve robustness over a given single prediction model. The resulting model can be applied to translate climate change scenarios into temperatures inside forests to assess temperature-related ecosystem services provided by forests. KW - cooling effect KW - machine learning KW - ensemble method KW - ecosystem services Y1 - 2022 U6 - https://doi.org/10.1016/j.envsoft.2022.105466 SN - 1364-8152 SN - 1873-6726 VL - 156 PB - Elsevier CY - Oxford ER - TY - THES A1 - Brill, Fabio Alexander T1 - Applications of machine learning and open geospatial data in flood risk modelling N2 - Der technologische Fortschritt erlaubt es, zunehmend komplexe Vorhersagemodelle auf Basis immer größerer Datensätze zu produzieren. Für das Risikomanagement von Naturgefahren sind eine Vielzahl von Modellen als Entscheidungsgrundlage notwendig, z.B. in der Auswertung von Beobachtungsdaten, für die Vorhersage von Gefahrenszenarien, oder zur statistischen Abschätzung der zu erwartenden Schäden. Es stellt sich also die Frage, inwiefern moderne Modellierungsansätze wie das maschinelle Lernen oder Data-Mining in diesem Themenbereich sinnvoll eingesetzt werden können. Zusätzlich ist im Hinblick auf die Datenverfügbarkeit und -zugänglichkeit ein Trend zur Öffnung (open data) zu beobachten. Thema dieser Arbeit ist daher, die Möglichkeiten und Grenzen des maschinellen Lernens und frei verfügbarer Geodaten auf dem Gebiet der Hochwasserrisikomodellierung im weiteren Sinne zu untersuchen. Da dieses übergeordnete Thema sehr breit ist, werden einzelne relevante Aspekte herausgearbeitet und detailliert betrachtet. Eine prominente Datenquelle im Bereich Hochwasser ist die satellitenbasierte Kartierung von Überflutungsflächen, die z.B. 
über den Copernicus Service der Europäischen Union frei zur Verfügung gestellt werden. Große Hoffnungen werden in der wissenschaftlichen Literatur in diese Produkte gesetzt, sowohl für die akute Unterstützung der Einsatzkräfte im Katastrophenfall, als auch in der Modellierung mittels hydrodynamischer Modelle oder zur Schadensabschätzung. Daher wurde ein Fokus in dieser Arbeit auf die Untersuchung dieser Flutmasken gelegt. Aus der Beobachtung, dass die Qualität dieser Produkte in bewaldeten und urbanen Gebieten unzureichend ist, wurde ein Verfahren zur nachträglichen Verbesserung mittels maschinellem Lernen entwickelt. Das Verfahren basiert auf einem Klassifikationsalgorithmus, der nur Trainingsdaten von einer vorherzusagenden Klasse benötigt, im konkreten Fall also Daten von Überflutungsflächen, nicht jedoch von der negativen Klasse (trockene Gebiete). Die Anwendung für Hurricane Harvey in Houston zeigt großes Potenzial der Methode, abhängig von der Qualität der ursprünglichen Flutmaske. Anschließend wird anhand einer prozessbasierten Modellkette untersucht, welchen Einfluss implementierte physikalische Prozessdetails auf das vorhergesagte statistische Risiko haben. Es wird anschaulich gezeigt, was eine Risikostudie basierend auf etablierten Modellen leisten kann. Solche Modellketten sind allerdings bereits für Flusshochwasser sehr komplex und für zusammengesetzte oder kaskadierende Ereignisse mit Starkregen, Sturzfluten und weiteren Prozessen kaum vorhanden. Im vierten Kapitel dieser Arbeit wird daher getestet, ob maschinelles Lernen auf Basis von vollständigen Schadensdaten einen direkteren Weg zur Schadensmodellierung ermöglicht, der die explizite Konzeption einer solchen Modellkette umgeht. Dazu wird ein staatlich erhobener Datensatz der geschädigten Gebäude während des schweren El-Niño-Ereignisses 2017 in Peru verwendet. In diesem Kontext werden auch die Möglichkeiten des Data-Mining zur Extraktion von Prozessverständnis ausgelotet.
Es kann gezeigt werden, dass diverse frei verfügbare Geodaten nützliche Informationen für die Gefahren- und Schadensmodellierung von komplexen Flutereignissen liefern, z.B. satellitenbasierte Regenmessungen, topographische und hydrographische Information, kartierte Siedlungsflächen sowie Indikatoren aus Spektraldaten. Zudem zeigen sich Erkenntnisse zu den Schädigungsprozessen, die im Wesentlichen mit den vorherigen Erwartungen in Einklang stehen. Die maximale Regenintensität wirkt beispielsweise in Städten und steilen Schluchten stärker schädigend, während die Niederschlagssumme in tiefliegenden Flussgebieten und bewaldeten Regionen als aussagekräftiger befunden wurde. Ländliche Gebiete in Peru weisen in der präsentierten Studie eine höhere Vulnerabilität als die Stadtgebiete auf. Jedoch werden auch die grundsätzlichen Grenzen der Methodik und die Abhängigkeit von spezifischen Datensätzen und Algorithmen offenkundig. In der übergreifenden Diskussion werden schließlich die verschiedenen Methoden – prozessbasierte Modellierung, prädiktives maschinelles Lernen und Data-Mining – mit Blick auf die Gesamtfragestellungen evaluiert. Im Bereich der Gefahrenbeobachtung scheint eine Fokussierung auf neue Algorithmen sinnvoll. Im Bereich der Gefahrenmodellierung, insbesondere für Flusshochwasser, wird eher die Verbesserung von physikalischen Modellen oder die Integration von prozessbasierten und statistischen Verfahren angeraten. In der Schadensmodellierung fehlen nach wie vor die großen repräsentativen Datensätze, die für eine breite Anwendung von maschinellem Lernen Voraussetzung sind. Daher ist die Verbesserung der Datengrundlage im Bereich der Schäden derzeit als wichtiger einzustufen als die Auswahl der Algorithmen. N2 - Technological progress allows for producing ever more complex predictive models on the basis of increasingly big datasets. For risk management of natural hazards, a multitude of models is needed as a basis for decision-making, e.g.
in the evaluation of observational data, for the prediction of hazard scenarios, or for statistical estimates of expected damage. The question arises how modern modelling approaches like machine learning or data-mining can be meaningfully deployed in this thematic field. In addition, with respect to data availability and accessibility, the trend is towards open data. The topic of this thesis is therefore to investigate the possibilities and limitations of machine learning and open geospatial data in the field of flood risk modelling in the broad sense. As this overarching topic is broad in scope, individual relevant aspects are identified and inspected in detail. A prominent data source in the flood context is satellite-based mapping of inundated areas, for example made openly available by the Copernicus service of the European Union. Great expectations are directed towards these products in the scientific literature, both for acute support of relief forces during emergency response action, and for modelling via hydrodynamic models or for damage estimation. Therefore, a focus of this work was set on evaluating these flood masks. From the observation that the quality of these products is insufficient in forested and built-up areas, a procedure for subsequent improvement via machine learning was developed. This procedure is based on a classification algorithm that only requires training data from a particular class to be predicted, in this specific case data of flooded areas, but not of the negative class (dry areas). The application to Hurricane Harvey in Houston shows the high potential of this method, depending on the quality of the initial flood mask. Next, it is investigated how much the predicted statistical risk from a process-based model chain depends on implemented physical process details. This demonstrates what a risk study based on established models can deliver.
Such model chains are already quite complex even for fluvial flooding, however, and are hardly available for compound or cascading events comprising torrential rainfall, flash floods, and other processes. In the fourth chapter of this thesis it is therefore tested whether machine learning based on comprehensive damage data can offer a more direct path towards damage modelling that avoids the explicit design of such a model chain. For that purpose, a state-collected dataset of buildings damaged during the severe El Niño event of 2017 in Peru is used. In this context, the possibilities of data-mining for extracting process knowledge are explored as well. It can be shown that various openly available geodata sources contain useful information for flood hazard and damage modelling for complex events, e.g. satellite-based rainfall measurements, topographic and hydrographic information, mapped settlement areas, as well as indicators from spectral data. Further, insights on damaging processes are discovered, which are mainly in line with prior expectations. The maximum intensity of rainfall, for example, has a stronger damaging effect in cities and steep canyons, while the rainfall sum was found to be more informative in low-lying river catchments and forested areas. In the presented study, rural areas of Peru exhibited higher vulnerability than urban areas. However, the general limitations of the methods and the dependence on specific datasets and algorithms also become obvious. In the overarching discussion, the different methods – process-based modelling, predictive machine learning, and data-mining – are evaluated with respect to the overall research questions. In the case of hazard observation, a focus on novel algorithms seems sensible for future research. In the subtopic of hazard modelling, especially for river floods, the improvement of physical models and the integration of process-based and statistical procedures are suggested.
For damage modelling, the large and representative datasets necessary for the broad application of machine learning are still lacking. Therefore, the improvement of the data basis in the field of damage is currently regarded as more important than the selection of algorithms. KW - flood risk KW - machine learning KW - open data KW - damage modelling KW - data-mining KW - Schadensmodellierung KW - Data-Mining KW - Hochwasserrisiko KW - maschinelles Lernen KW - offene Daten Y1 - 2022 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-555943 ER - TY - JOUR A1 - Brandes, Stefanie A1 - Sicks, Florian A1 - Berger, Anne T1 - Behaviour classification on giraffes (Giraffa camelopardalis) using machine learning algorithms on triaxial acceleration data of two commonly used GPS devices and its possible application for their management and conservation JF - Sensors N2 - Averting today's loss of biodiversity and ecosystem services can be achieved through conservation efforts, especially of keystone species. Giraffes (Giraffa camelopardalis) play an important role in sustaining Africa's ecosystems, but have been classified as 'vulnerable' on the IUCN Red List since 2016. Monitoring an animal's behavior in the wild helps to develop and assess their conservation management. One mechanism for remote tracking of wildlife behavior is to attach accelerometers to animals to record their body movement. We tested two different commercially available high-resolution accelerometers, e-obs and Africa Wildlife Tracking (AWT), attached to the top of the heads of three captive giraffes and analyzed the accuracy of automatic behavior classifications, focused on the Random Forests algorithm.
For both accelerometers, behaviors with less variety in head and neck movements could be better predicted (i.e., feeding above eye level, mean prediction accuracy e-obs/AWT: 97.6%/99.7%; drinking: 96.7%/97.0%) than those with a higher variety of body postures (such as standing: 90.7-91.0%/75.2-76.7%; rumination: 89.6-91.6%/53.5-86.5%). Nonetheless, both devices come with limitations, and the AWT in particular needs technological adaptations before it can be applied to animals in the wild. Judging by the prediction results, however, both are promising accelerometers for the behavioral classification of giraffes. Therefore, these devices, when applied to free-ranging animals in combination with GPS tracking, can contribute greatly to the conservation of giraffes. KW - giraffe KW - triaxial acceleration KW - machine learning KW - random forests KW - behavior classification KW - giraffe conservation Y1 - 2021 U6 - https://doi.org/10.3390/s21062229 SN - 1424-8220 VL - 21 IS - 6 PB - MDPI CY - Basel ER - TY - JOUR A1 - Hampf, Anna A1 - Nendel, Claas A1 - Strey, Simone A1 - Strey, Robert T1 - Biotic yield losses in the Southern Amazon, Brazil BT - making use of smartphone-assisted plant disease diagnosis data JF - Frontiers in plant science : FPLS N2 - Pathogens and animal pests (P&A) are a major threat to global food security as they directly affect the quantity and quality of food. The Southern Amazon, Brazil's largest domestic region for soybean, maize and cotton production, is particularly vulnerable to the outbreak of P&A due to its (sub)tropical climate and intensive farming systems. However, little is known about the spatial distribution of P&A and the related yield losses. Machine learning approaches for the automated recognition of plant diseases can help to overcome this research gap.
The main objectives of this study are to (1) evaluate the performance of Convolutional Neural Networks (ConvNets) in classifying P&A, (2) map the spatial distribution of P&A in the Southern Amazon, and (3) quantify perceived yield and economic losses for the main soybean and maize P&A. The objectives were addressed by making use of data collected with the smartphone application Plantix. The core of the app's functioning is the automated recognition of plant diseases via ConvNets. Data on expected yield losses were gathered through a short survey included in an "expert" version of the application, which was distributed among agronomists. Between 2016 and 2020, Plantix users collected approximately 78,000 georeferenced P&A images in the Southern Amazon. The study results indicate a high performance of the trained ConvNets in classifying 420 different crop-disease combinations. Spatial distribution maps and expert-based yield loss estimates indicate that maize rust, bacterial stalk rot and the fall armyworm are among the most severe maize P&A, whereas soybean is mainly affected by P&A like anthracnose, downy mildew, frogeye leaf spot, stink bugs and brown spot. Perceived soybean and maize yield losses amount to 12 and 16%, respectively, resulting in annual yield losses of approximately 3.75 million tonnes for each crop and economic losses of US$2 billion for both crops together. The high level of accuracy of the trained ConvNets, when paired with the widespread use that follows from a citizen-science approach, results in a data source that will shed new light on yield loss estimates, e.g., for the analysis of yield gaps and the development of measures to minimise them.
KW - plant pathology KW - animal pests KW - pathogens KW - machine learning KW - digital KW - image processing KW - disease diagnosis KW - crowdsourcing KW - crop losses Y1 - 2021 U6 - https://doi.org/10.3389/fpls.2021.621168 SN - 1664-462X VL - 12 PB - Frontiers Media CY - Lausanne ER - TY - JOUR A1 - Frommhold, Martin A1 - Heim, Arend A1 - Barabanov, Mikhail A1 - Maier, Franziska A1 - Mühle, Ralf-Udo A1 - Smirenski, Sergei M. A1 - Heim, Wieland T1 - Breeding habitat and nest-site selection by an obligatory "nest-cleptoparasite", the Amur Falcon Falco amurensis JF - Ecology and evolution N2 - The selection of a nest site is crucial for successful reproduction of birds. Animals which re-use or occupy nest sites constructed by other species often have limited choice. Little is known about the criteria by which nest-stealing species choose suitable nesting sites and habitats. Here, we analyze breeding-site selection of an obligatory "nest-cleptoparasite", the Amur Falcon Falco amurensis. We collected data on nest sites at Muraviovka Park in the Russian Far East, where the species breeds exclusively in nests of the Eurasian Magpie Pica pica. We sampled 117 Eurasian Magpie nests, 38 of which were occupied by Amur Falcons. Nest-specific variables were assessed, and a recently developed habitat classification map was used to derive landscape metrics. We found that Amur Falcons chose a wide range of nesting sites, but significantly preferred nests with a domed roof. Breeding pairs of Eurasian Hobby Falco subbuteo and Eurasian Magpie were often found to breed near the nest at about the same distance as neighboring Amur Falcon pairs. Additionally, the occurrence of the species was positively associated with bare soil cover, forest cover, and shrub patches within their home range and negatively with the distance to wetlands. Areas of wetlands and fallow land might be used for foraging since Amur Falcons mostly depend on an insect diet.
Additionally, we found that rarely burned habitats were preferred. Overall, the effect of landscape variables on the choice of actual nest sites appeared to be rather small. We used different classification methods to predict the probability of occurrence, of which the Random Forest method showed the highest accuracy. The areas determined as suitable habitat showed a high concordance with the actual nest locations. We conclude that Amur Falcons prefer to occupy newly built (domed) nests to ensure high nest quality, as well as nests surrounded by available feeding habitats. KW - cleptoparasitism KW - fire KW - habitat use KW - machine learning KW - magpie KW - nest-site selection KW - random forest Y1 - 2019 U6 - https://doi.org/10.1002/ece3.5878 SN - 2045-7758 VL - 9 IS - 24 SP - 14430 EP - 14441 PB - Wiley CY - Hoboken ER - TY - JOUR A1 - Schmidt, Lennart A1 - Hesse, Falk A1 - Attinger, Sabine A1 - Kumar, Rohini T1 - Challenges in applying machine learning models for hydrological inference BT - a case study for flooding events across Germany JF - Water resources research N2 - Machine learning (ML) algorithms are being increasingly used in Earth and Environmental modeling studies owing to the ever-increasing availability of diverse data sets and computational resources as well as advancements in ML algorithms. Despite advances in their predictive accuracy, the usefulness of ML algorithms for inference remains elusive. In this study, we employ two popular ML algorithms, artificial neural networks and random forest, to analyze a large data set of flood events across Germany with the goal of analyzing their predictive accuracy and their usability to provide insights into hydrologic system functioning. The results of the ML algorithms are contrasted against a parametric approach based on multiple linear regression. For analysis, we employ a model-agnostic framework named Permuted Feature Importance to derive the influence of the models' predictors.
This allows us to compare the results of different algorithms for the first time in the context of hydrology. Our main findings are that (1) the ML models achieve higher prediction accuracy than linear regression, (2) the results reflect basic hydrological principles, but (3) further inference is hindered by the heterogeneity of results across algorithms. Thus, we conclude that the problem of equifinality as known from classical hydrological modeling also exists for ML and severely hampers its potential for inference. To account for the observed problems, we propose that when employing ML for inference, this should be done using multiple algorithms and multiple methods, of which the latter should be embedded in a cross-validation routine. KW - machine learning KW - inference KW - floods Y1 - 2020 U6 - https://doi.org/10.1029/2019WR025924 SN - 0043-1397 SN - 1944-7973 VL - 56 IS - 5 PB - American Geophysical Union CY - Washington ER - TY - GEN A1 - Schmidt, Lennart A1 - Heße, Falk A1 - Attinger, Sabine A1 - Kumar, Rohini T1 - Challenges in applying machine learning models for hydrological inference: a case study for flooding events across Germany T2 - Postprints der Universität Potsdam : Mathematisch-Naturwissenschaftliche Reihe N2 - Machine learning (ML) algorithms are being increasingly used in Earth and Environmental modeling studies owing to the ever-increasing availability of diverse data sets and computational resources as well as advancement in ML algorithms. Despite advances in their predictive accuracy, the usefulness of ML algorithms for inference remains elusive. In this study, we employ two popular ML algorithms, artificial neural networks and random forest, to analyze a large data set of flood events across Germany with the goals of analyzing their predictive accuracy and their usability for providing insights into hydrologic system functioning. 
The results of the ML algorithms are contrasted against a parametric approach based on multiple linear regression. For analysis, we employ a model-agnostic framework named Permuted Feature Importance to derive the influence of models' predictors. This allows us to compare the results of different algorithms for the first time in the context of hydrology. Our main findings are that (1) the ML models achieve higher prediction accuracy than linear regression, (2) the results reflect basic hydrological principles, but (3) further inference is hindered by the heterogeneity of results across algorithms. Thus, we conclude that the problem of equifinality as known from classical hydrological modeling also exists for ML and severely hampers its potential for inference. To account for the observed problems, we propose that when employing ML for inference, this should be done using multiple algorithms and multiple methods, of which the latter should be embedded in a cross-validation routine. T3 - Zweitveröffentlichungen der Universität Potsdam : Mathematisch-Naturwissenschaftliche Reihe - 1193 KW - machine learning KW - inference KW - floods Y1 - 2019 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-523843 SN - 1866-8372 IS - 5 ER - TY - JOUR A1 - Schmidt, Lennart A1 - Heße, Falk A1 - Attinger, Sabine A1 - Kumar, Rohini T1 - Challenges in applying machine learning models for hydrological inference: a case study for flooding events across Germany JF - Water Resources Research N2 - Machine learning (ML) algorithms are being increasingly used in Earth and Environmental modeling studies owing to the ever-increasing availability of diverse data sets and computational resources as well as advancement in ML algorithms. Despite advances in their predictive accuracy, the usefulness of ML algorithms for inference remains elusive. 
In this study, we employ two popular ML algorithms, artificial neural networks and random forest, to analyze a large data set of flood events across Germany with the goals of analyzing their predictive accuracy and their usability for providing insights into hydrologic system functioning. The results of the ML algorithms are contrasted against a parametric approach based on multiple linear regression. For analysis, we employ a model-agnostic framework named Permuted Feature Importance to derive the influence of models' predictors. This allows us to compare the results of different algorithms for the first time in the context of hydrology. Our main findings are that (1) the ML models achieve higher prediction accuracy than linear regression, (2) the results reflect basic hydrological principles, but (3) further inference is hindered by the heterogeneity of results across algorithms. Thus, we conclude that the problem of equifinality as known from classical hydrological modeling also exists for ML and severely hampers its potential for inference. To account for the observed problems, we propose that when employing ML for inference, this should be done using multiple algorithms and multiple methods, of which the latter should be embedded in a cross-validation routine. KW - machine learning KW - inference KW - floods Y1 - 2019 VL - 56 IS - 5 PB - John Wiley & Sons, Inc. CY - New Jersey ER - TY - RPRT A1 - Andres, Maximilian A1 - Bruttel, Lisa Verena A1 - Friedrichsen, Jana T1 - Choosing between explicit cartel formation and tacit collusion – An experiment T2 - CEPA Discussion Papers N2 - Numerous studies investigate which sanctioning institutions prevent cartel formation but little is known as to how these sanctions work. We contribute to understanding the inner workings of cartels by studying experimentally the effect of sanctioning institutions on firms’ communication. 
Using machine learning to organize the chat communication into topics, we find that firms are significantly less likely to communicate explicitly about price fixing when sanctioning institutions are present. At the same time, average prices are lower when communication is less explicit. A mediation analysis suggests that sanctions are effective in hindering cartel formation not only because they introduce a risk of being fined but also by reducing the prevalence of explicit price communication. T3 - CEPA Discussion Papers - 19 KW - cartel KW - collusion KW - communication KW - machine learning KW - experiment Y1 - 2020 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-473885 SN - 2628-653X IS - 19 ER - TY - GEN A1 - Ayzel, Georgy A1 - Izhitskiy, Alexander T1 - Climate change impact assessment on freshwater inflow into the Small Aral Sea T2 - Postprints der Universität Potsdam : Mathematisch-Naturwissenschaftliche Reihe N2 - During the last few decades, the rapid separation of the Small Aral Sea from the isolated basin has changed its hydrological and ecological conditions tremendously. In the present study, we developed and validated the hybrid model for the Syr Darya River basin based on a combination of state-of-the-art hydrological and machine learning models. Climate change impact on freshwater inflow into the Small Aral Sea for the projection period 2007–2099 has been quantified based on the developed hybrid model and bias corrected and downscaled meteorological projections simulated by four General Circulation Models (GCM) for each of three Representative Concentration Pathway scenarios (RCP). The developed hybrid model reliably simulates freshwater inflow for the historical period with a Nash–Sutcliffe efficiency of 0.72 and a Kling–Gupta efficiency of 0.77. 
Results of the climate change impact assessment showed that the freshwater inflow projections produced by different GCMs are misleading by providing contradictory results for the projection period. However, we identified that the relative runoff changes are expected to be more pronounced in the case of more aggressive RCP scenarios. The simulated projections of freshwater inflow provide a basis for further assessment of climate change impacts on hydrological and ecological conditions of the Small Aral Sea in the 21st Century. T3 - Zweitveröffentlichungen der Universität Potsdam : Mathematisch-Naturwissenschaftliche Reihe - 1071 KW - Small Aral Sea KW - hydrology KW - climate change KW - modeling KW - machine learning Y1 - 2021 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-472794 SN - 1866-8372 IS - 1071 ER - TY - JOUR A1 - Ayzel, Georgy A1 - Izhitskiy, Alexander T1 - Climate Change Impact Assessment on Freshwater Inflow into the Small Aral Sea JF - Water N2 - During the last few decades, the rapid separation of the Small Aral Sea from the isolated basin has changed its hydrological and ecological conditions tremendously. In the present study, we developed and validated the hybrid model for the Syr Darya River basin based on a combination of state-of-the-art hydrological and machine learning models. Climate change impact on freshwater inflow into the Small Aral Sea for the projection period 2007-2099 has been quantified based on the developed hybrid model and bias corrected and downscaled meteorological projections simulated by four General Circulation Models (GCM) for each of three Representative Concentration Pathway scenarios (RCP). The developed hybrid model reliably simulates freshwater inflow for the historical period with a Nash-Sutcliffe efficiency of 0.72 and a Kling-Gupta efficiency of 0.77. 
Results of the climate change impact assessment showed that the freshwater inflow projections produced by different GCMs are misleading by providing contradictory results for the projection period. However, we identified that the relative runoff changes are expected to be more pronounced in the case of more aggressive RCP scenarios. The simulated projections of freshwater inflow provide a basis for further assessment of climate change impacts on hydrological and ecological conditions of the Small Aral Sea in the 21st Century. KW - Small Aral Sea KW - hydrology KW - climate change KW - modeling KW - machine learning Y1 - 2019 U6 - https://doi.org/10.3390/w11112377 SN - 2073-4441 VL - 11 IS - 11 PB - MDPI CY - Basel ER - TY - RPRT A1 - Andres, Maximilian A1 - Bruttel, Lisa T1 - Communicating Cartel Intentions T2 - CEPA Discussion Papers N2 - While the economic harm of cartels is caused by their price-increasing effect, sanctioning by courts rather targets at the underlying process of firms reaching a price-fixing agreement. This paper provides experimental evidence on the question whether such sanctioning meets the economic target, i.e., whether evidence of a collusive meeting of the firms and of the content of their communication reliably predicts subsequent prices. We find that already the mere mutual agreement to meet predicts a strong increase in prices. Conversely, express distancing from communication completely nullifies its otherwise price-increasing effect. Using machine learning, we show that communication only increases prices if it is very explicit about how the cartel plans to behave. 
T3 - CEPA Discussion Papers - 77 KW - cartel KW - collusion KW - communication KW - machine learning KW - experiment Y1 - 2024 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-638469 SN - 2628-653X IS - 77 ER - TY - JOUR A1 - Wulff, Peter A1 - Buschhüter, David A1 - Westphal, Andrea A1 - Nowak, Anna A1 - Becker, Lisa A1 - Robalino, Hugo A1 - Stede, Manfred A1 - Borowski, Andreas T1 - Computer-based classification of preservice physics teachers’ written reflections JF - Journal of science education and technology N2 - Reflecting in written form on one's teaching enactments has been considered a facilitator for teachers' professional growth in university-based preservice teacher education. Writing a structured reflection can be facilitated through external feedback. However, researchers noted that feedback in preservice teacher education often relies on holistic, rather than more content-based, analytic feedback because educators oftentimes lack resources (e.g., time) to provide more analytic feedback. To overcome this impediment to feedback for written reflection, advances in computer technology can be of use. Hence, this study sought to utilize techniques of natural language processing and machine learning to train a computer-based classifier that classifies preservice physics teachers' written reflections on their teaching enactments in a German university teacher education program. To do so, a reflection model was adapted to physics education. It was then tested to what extent the computer-based classifier could accurately classify the elements of the reflection model in segments of preservice physics teachers' written reflections. 
Multinomial logistic regression using word count as a predictor was found to yield acceptable average human-computer agreement (F1-score on held-out test dataset of 0.56) so that it might fuel further development towards an automated feedback tool that supplements existing holistic feedback for written reflections with data-based, analytic feedback. KW - reflection KW - teacher professional development KW - natural language processing KW - machine learning Y1 - 2020 U6 - https://doi.org/10.1007/s10956-020-09865-1 SN - 1059-0145 SN - 1573-1839 VL - 30 IS - 1 SP - 1 EP - 15 PB - Springer CY - Dordrecht ER - TY - JOUR A1 - Levy, Jessica A1 - Mussack, Dominic A1 - Brunner, Martin A1 - Keller, Ulrich A1 - Cardoso-Leite, Pedro A1 - Fischbach, Antoine T1 - Contrasting classical and machine learning approaches in the estimation of value-added scores in large-scale educational data JF - Frontiers in psychology N2 - There is no consensus on which statistical model estimates school value-added (VA) most accurately. To date, the two most common statistical models used for the calculation of VA scores are two classical methods: linear regression and multilevel models. These models have the advantage of being relatively transparent and thus understandable for most researchers and practitioners. However, these statistical models are bound to certain assumptions (e.g., linearity) that might limit their prediction accuracy. Machine learning methods, which have yielded spectacular results in numerous fields, may be a valuable alternative to these classical models. Although big data is not new in general, it is relatively new in the realm of social sciences and education. New types of data require new data analytical approaches. Such techniques have already evolved in fields with a long tradition in crunching big data (e.g., gene technology). 
The objective of the present paper is to competently apply these "imported" techniques to education data, more precisely VA scores, and assess when and how they can extend or replace the classical psychometrics toolbox. The different models include linear and non-linear methods and extend classical models with the most commonly used machine learning methods (i.e., random forest, neural networks, support vector machines, and boosting). We used representative data of 3,026 students in 153 schools who took part in the standardized achievement tests of the Luxembourg School Monitoring Program in grades 1 and 3. Multilevel models outperformed classical linear and polynomial regressions, as well as different machine learning models. However, it could be observed that across all schools, school VA scores from different model types correlated highly. Yet, the percentage of disagreements as compared to multilevel models was not trivial and real-life implications for individual schools may still be dramatic depending on the model type used. Implications of these results and possible ethical concerns regarding the use of machine learning methods for decision-making in education are discussed. KW - value-added modeling KW - school effectiveness KW - machine learning KW - model KW - comparison KW - longitudinal data Y1 - 2020 U6 - https://doi.org/10.3389/fpsyg.2020.02190 SN - 1664-1078 VL - 11 PB - Frontiers Research Foundation CY - Lausanne ER - TY - THES A1 - Koumarelas, Ioannis T1 - Data preparation and domain-agnostic duplicate detection N2 - Successfully completing any data science project demands careful consideration across its whole process. Although the focus is often put on later phases of the process, in practice, experts spend more time in earlier phases, preparing data, to make them consistent with the systems' requirements or to improve their models' accuracies. 
Duplicate detection is typically applied during the data cleaning phase, which is dedicated to removing data inconsistencies and improving the overall quality and usability of data. While data cleaning involves a plethora of approaches to perform specific operations, such as schema alignment and data normalization, the task of detecting and removing duplicate records is particularly challenging. Duplicates arise when multiple records representing the same entities exist in a database, due to numerous reasons spanning from simple typographical errors to different schemas and formats of integrated databases. Keeping a database free of duplicates is crucial for most use-cases, as their existence causes false negatives and false positives when matching queries against it. These two data quality issues have negative implications for tasks such as hotel booking, where users may erroneously select a wrong hotel, or parcel delivery, where a parcel can get delivered to the wrong address. Identifying the variety of possible data issues to eliminate duplicates demands sophisticated approaches. While research in duplicate detection is well-established and covers different aspects of both efficiency and effectiveness, our work in this thesis focuses on the latter. We propose novel approaches to improve data quality before duplicate detection takes place and apply the latter in datasets even when prior labeling is not available. Our experiments show that improving data quality upfront can increase duplicate classification results by up to 19%. To this end, we propose two novel pipelines that select and apply generic as well as address-specific data preparation steps with the purpose of maximizing the success of duplicate detection. Generic data preparation, such as the removal of special characters, can be applied to any relation with alphanumeric attributes. 
When applied, data preparation steps are selected only for attributes where there are positive effects on pair similarities, which indirectly affect classification, or on classification directly. Our work on addresses is twofold; first, we consider more domain-specific approaches to improve the quality of values, and, second, we experiment with known and modified versions of similarity measures to select the most appropriate per address attribute, e.g., city or country. To facilitate duplicate detection in applications where gold standard annotations are not available and obtaining them is not possible or too expensive, we propose MDedup. MDedup is a novel, rule-based, and fully automatic duplicate detection approach that is based on matching dependencies. These dependencies can be used to detect duplicates and can be discovered using state-of-the-art algorithms efficiently and without any prior labeling. MDedup uses two pipelines to first train on datasets with known labels, learning to identify useful matching dependencies, and then be applied on unseen datasets, regardless of any existing gold standard. Finally, our work is accompanied by open source code to enable repeatability of our research results and application of our approaches to other datasets. N2 - Die erfolgreiche Durchführung eines datenwissenschaftlichen Projekts erfordert eine Reihe sorgfältiger Abwägungen, die während des gesamten Prozessesverlaufs zu treffen sind. Obwohl sich der Schwerpunkt oft auf spätere Prozessphasen konzentriert, verbringen Experten in der Praxis jedoch einen Großteil ihrer Zeit in frühen Projektphasen in denen sie Daten aufbereiten, um sie mit den Anforderungen vorhandener Systeme in Einklang zu bringen oder die Genauigkeit ihrer Modelle zu verbessern. Die Duplikaterkennung wird üblicherweise während der Datenbereinigungsphase durchgeführt, sie dient der Beseitigung von Dateninkonsistenzen und somit der Verbesserung von Gesamtqualität und Benutzerfreundlichkeit der Daten. 
Während die Datenbereinigung eine Vielzahl von Ansätzen zur Durchführung spezifischer Operationen wie etwa dem Schema-Abgleich und der Datennormalisierung umfasst, stellt die Identifizierung und Entfernung doppelter Datensätze eine besondere Herausforderung dar. Dabei entstehen Duplikate, wenn mehrere Datensätze, welche die gleichen Entitäten repräsentieren, in einer Datenbank vorhanden sind. Die Gründe dafür sind vielfältig und reichen von einfachen Schreibfehlern bis hin zu unterschiedlichen Schemata und Formaten integrierter Datenbanken. Eine Datenbank duplikatfrei zu halten, ist für die meisten Anwendungsfälle von entscheidender Bedeutung, da ihre Existenz zu falschen Negativ- und Falsch-Positiv-Abfragen führt. So können sich derartige Datenqualitätsprobleme negativ auf Aufgaben wie beispielsweise Hotelbuchungen oder Paketzustellungen auswirken, was letztlich dazu führen kann, dass Benutzer ein falsches Hotel buchen, oder Pakete an eine falsche Adresse geliefert werden. Um ein breites Spektrum potenzieller Datenprobleme zu identifizieren, deren Lösung die Beseitigung von Duplikaten erleichtert, sind eine Reihe ausgefeilter Ansätze erforderlich. Obgleich der Forschungsbereich der Duplikaterkennung mit der Untersuchung verschiedenster Effizienz und Effektivitätsaspekte bereits gut etabliert ist, konzentriert sich diese Arbeit auf letztgenannte Aspekte. Wir schlagen neue Ansätze zur Verbesserung der Datenqualität vor, die vor der Duplikaterkennung erfolgen, und wenden letztere auf Datensätze an, selbst wenn diese über keine im Vorfeld erstellten Annotationen verfügen. Unsere Experimente zeigen, dass durch eine im Vorfeld verbesserte Datenqualität die Ergebnisse der sich anschließenden Duplikatklassifizierung um bis zu 19% verbessert werden können. Zu diesem Zweck schlagen wir zwei neuartige Pipelines vor, die sowohl generische als auch adressspezifische Datenaufbereitungsschritte auswählen und anwenden, um den Erfolg der Duplikaterkennung zu maximieren. 
Die generische Datenaufbereitung, wie z.B. die Entfernung von Sonderzeichen, kann auf jede Relation mit alphanumerischen Attributen angewendet werden. Bei entsprechender Anwendung werden Datenaufbereitungsschritte nur für Attribute ausgewählt, bei denen sich positive Auswirkungen auf Paarähnlichkeiten ergeben, welche sich direkt oder indirekt auf die Klassifizierung auswirken. Unsere Arbeit an Adressen umfasst zwei Aspekte: erstens betrachten wir mehr domänenspezifische Ansätze zur Verbesserung der Adressqualität, zweitens experimentieren wir mit bekannten und modifizierten Versionen verschiedener Ähnlichkeitsmaße, um infolgedessen das am besten geeignete Ähnlichkeitsmaß für jedes Adressattribut, z.B. Stadt oder Land, zu bestimmen. Um die Erkennung von Duplikaten bei Anwendungen zu erleichtern, in denen Goldstandard-Annotationen nicht zur Verfügung stehen und deren Beschaffung aus Kostengründen nicht möglich ist, schlagen wir MDedup vor. MDedup ist ein neuartiger, regelbasierter und vollautomatischer Ansatz zur Dublikaterkennung, der auf Matching Dependencies beruht. Diese Abhängigkeiten können zur Erkennung von Duplikaten genutzt und mit Hilfe modernster Algorithmen effizient ohne vorhergehenden Annotationsaufwand entdeckt werden. MDedup verwendet zwei Pipelines, um zunächst auf annotierten Datensätzen zu trainieren, wobei die Identifizierung nützlicher Matching-Abhängigkeiten erlernt wird, welche dann unabhängig von einem bestehenden Goldstandard auf ungesehenen Datensätzen angewendet werden können. Schließlich stellen wir den im Rahmen dieser Arbeit entstehenden Quellcode zur Verfügung, wodurch sowohl die Wiederholbarkeit unserer Forschungsergebnisse als auch die Anwendung unserer Ansätze auf anderen Datensätzen gewährleistet werden soll. 
T2 - Datenaufbereitung und domänenagnostische Duplikaterkennung KW - duplicate detection KW - data cleaning KW - entity resolution KW - record linkage KW - data preparation KW - data matching KW - address normalization KW - machine learning KW - matching dependencies KW - Adressnormalisierung KW - Datenbereinigung KW - Datenabgleich KW - Datenaufbereitung KW - Duplikaterkennung KW - Entitätsauflösung KW - Maschinelles Lernen KW - Abgleich von Abhängigkeiten KW - Datensatzverknüpfung Y1 - 2020 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-489131 ER - TY - JOUR A1 - Baumgart, Lene A1 - Boos, Pauline A1 - Eckstein, Bernd T1 - Datafication and algorithmic contingency BT - how agile organisations deal with technical systems JF - Work organisation, labour & globalisation N2 - In the context of persistent images of self-perpetuated technologies, we discuss the interplay of digital technologies and organisational dynamics against the backdrop of systems theory. Building on the case of an international corporation that, during an agile reorganisation, introduced an AI-based personnel management platform, we show how technical systems produce a form of algorithmic contingency that subsequently leads to the emergence of formal and informal interaction systems. Using the concept of datafication, we explain how these interactions are barriers to the self-perpetuation of data-based decision-making, making it possible to take into consideration further decision factors and complementing the output of the platform. The research was carried out within the scope of the research project ‘Organisational Implications of Digitalisation: The Development of (Post-)Bureaucratic Organisational Structures in the Context of Digital Transformation’ funded by the German Research Foundation (DFG). 
KW - digitalisation KW - datafication KW - organisation KW - agile KW - technical system KW - systems theory KW - interaction KW - algorithmic contingency KW - machine learning KW - platform Y1 - 2023 U6 - https://doi.org/10.13169/workorgalaboglob.17.1.0061 SN - 1745-641X SN - 1745-6428 VL - 17 IS - 1 SP - 61 EP - 73 PB - Pluto Journals CY - London ER - TY - THES A1 - Hoang, Yen T1 - De novo binning strategy to analyze and visualize multi-dimensional cytometric data T1 - De novo Binning-Ansatz zur Untersuchung und Visualisierung von multidimensionalen Zytrometriedaten BT - engineering of combinatorial variables for supervised learning approaches N2 - Since half a century, cytometry has been a major scientific discipline in the field of cytomics - the study of system’s biology at single cell level. It enables the investigation of physiological processes, functional characteristics and rare events with proteins by analysing multiple parameters on an individual cell basis. In the last decade, mass cytometry has been established which increased the parallel measurement to up to 50 proteins. This has shifted the analysis strategy from conventional consecutive manual gates towards multi-dimensional data processing. Novel algorithms have been developed to tackle these high-dimensional protein combinations in the data. They are mainly based on clustering or non-linear dimension reduction techniques, or both, often combined with an upstream downsampling procedure. However, these tools have obstacles either in comprehensible interpretability, reproducibility, computational complexity or in comparability between samples and groups. 
To address this bottleneck, a reproducible, semi-automated cytometric data mining workflow PRI (pattern recognition of immune cells) is proposed which combines three main steps: i) data preparation and storage; ii) bin-based combinatorial variable engineering of three protein markers, the so-called triploTs, and subsequent sectioning of these triploTs in four parts; and iii) deployment of a data-driven supervised learning algorithm, the cross-validated elastic-net regularized logistic regression, with these triploT sections as input variables. As a result, the variables selected by the models, which potentially have discriminative value, are ranked by their prevalence. The purpose is to significantly facilitate the identification of meaningful subpopulations, namely those that best distinguish between two groups. The proposed workflow PRI is exemplified by a recently published public mass cytometry data set. The authors found a T cell subpopulation which is discriminative between effective and ineffective treatment of breast carcinomas in mice. With PRI, that subpopulation was not only validated, but was further narrowed down as a particular Th1 cell population. Moreover, additional insights into combinatorial protein expressions are revealed in a traceable manner. An essential element in the workflow is the reproducible variable engineering. These variables serve as the basis for a clearly interpretable visualization, for a structured variable exploration and as input layers in neural network constructs. PRI facilitates the determination of marker levels in a semi-continuous manner. Jointly with the combinatorial display, it allows a straightforward observation of correlating patterns, and thus of the dominantly expressed markers and cell hierarchies. Furthermore, it enables the identification and complex characterization of discriminating subpopulations due to its reproducible and pseudo-multi-parametric pattern presentation. 
This endorses its applicability as a tool for unbiased investigations on cell subsets within multi-dimensional cytometric data sets. N2 - Massen- und Durchflusszytometrie-Messungen ermöglichen die detaillierte Einteilung von Zellgruppen nach Eigenschaften vor allem in der Diagnostik und in der Grundlagenforschung anhand der Erfassung von biologischen Informationen auf Einzelzellebene. Sie unterstützen die detaillierte Analyse von komplexen, zellulären Zusammenhängen, um physiologische und pathophysiologische Prozesse zu erkennen, und funktionelle oder krankheitsspezifische Charakteristika und rare Zellgruppen genauer zu spezifizieren und zu extrahieren. In den letzten Jahren haben zytometrische Technologien einen enormen Innovationssprung erfahren, sodass heutzutage bis zu 50 Proteine pro Zelle parallel gemessen werden können. Und das mit einem Durchsatz von Hunderttausenden bis mehreren Millionen von Zellen aus einer Probe. Bei der Zunahme der Messparameter steigen jedoch die Dimensionen der kombinierten Parameter exponentiell, sodass eine komplexe Kombinatorik entsteht, die mit konventionellen, manuellen Untersuchungen von bi-axialen Diagrammen nicht mehr durchführbar ist. Derzeit gibt es schon viele neue Datenanalyse-Ansätze, die vorrangig auf Cluster- bzw. Dimensionsreduktionstechniken basieren und meist mit einem vorgeschalteten Downsampling in Kombination eingesetzt werden. Diese Tools produzieren aber komplexe Ergebnisse, die größtenteils nicht reproduzierbar sind oder Proben- und Gruppenvergleiche erschweren. Um dieses Problem anzugehen, wurde in dieser Dissertation ein reproduzierbarer, halbautomatisierter Datenanalyse-Workflow namens PRI entwickelt, was für pattern recognition of immune cells (Mustererkennung von Immunzellen) steht. 
Dieser Workflow ist in drei Hauptteile untergliedert: die Datenvorbereitung und -Ablage; die Entwicklung innovativer, bin-basierter Merkmale von drei kombinierten Parametern namens TriploTs und deren weiterführende Einteilung in vier gleich große TriploT-Areale; und die Anwendung von einem maschinellen Lernansatz basierend auf der Information von diesen Arealen. Als Ergebnis bekommt man eine Selektion der Areale, die am häufigsten von den überwachten Modellen ausgewählt wurden. Dies soll dem Wissenschaftler entscheidend dabei helfen, Zellpopulationen zu identifizieren, die am besten zwischen zwei Gruppen unterscheiden. Der vorgestellte Workflow PRI ist exemplarisch an einem kürzlich veröffentlichten Massenzytometrie-Datensatz validiert worden. Die von den Originalautoren hervorgehobene Zellpopulation konnte nicht nur identifiziert werden, sondern sogar wesentlich weiter spezifiziert werden. Außerdem wurden weitere Erkenntnisse von relevanten, kombinatorischen Proteinexpressionen festgestellt. Die Entwicklung der reproduzierbaren TriploTs führt dazu, dass sie als Basis für verständliche und leicht interpretierbare Visualisierungen, für eine strukturierte Erforschung der Daten mithilfe der Selektion der Areale, und für neuronale Netzwerkkonstrukte genutzt werden können. PRI ermöglicht eine optimierte, semi-kontinuierliche Bestimmung der Expressionsstufen, die die Identifizierung von dominant vorherrschenden und diskriminierenden Proteinen in Zellsubpopulationen wesentlich erleichtert. Darüber hinaus erlaubt es die intuitive Erfassung von korrelierenden Mustern durch die innovative, reproduzierbare Darstellung der Proteinkombinationen und hilft bei der Erforschung von Zellsubpopulationen. 
KW - machine learning KW - feature engineering KW - maschinelles Lernen KW - Feature Engineering Y1 - 2019 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-443078 ER - TY - JOUR A1 - Ayzel, Georgy T1 - Deep neural networks in hydrology BT - the new generation of universal and efficient models BT - новое поколение универсальных и эффективных моделей JF - Vestnik of Saint Petersburg University. Earth Sciences N2 - For around a decade, deep learning - the sub-field of machine learning that refers to artificial neural networks comprised of many computational layers - has been modifying the landscape of statistical model development in many research areas, such as image classification, machine translation, and speech recognition. Geoscientific disciplines in general, and the field of hydrology in particular, do not stand aside from this movement. Recently, the proliferation of modern deep learning-based techniques and methods has been actively gaining popularity for solving a wide range of hydrological problems: modeling and forecasting of river runoff, hydrological model parameter regionalization, assessment of available water resources, and identification of the main drivers of the recent change in water balance components. This growing popularity of deep neural networks is primarily due to their high universality and efficiency. These qualities, together with the rapidly growing amount of accumulated environmental information, as well as the increasing availability of computing facilities and resources, allow us to speak about deep neural networks as a new generation of mathematical models designed, if not to replace existing solutions, then to significantly enrich the field of geophysical process modeling. This paper provides a brief overview of the current state of the field of development and application of deep neural networks in hydrology. 
The study also provides a qualitative long-term forecast for the development of deep learning technology in addressing hydrological modeling challenges, based on the Gartner Hype Cycle, which describes in general terms the life cycle of modern technologies. N2 - В течение последнего десятилетия глубокое обучение - область машинного обучения, относящаяся к искусственным нейронным сетям, состоящим из множества вычислительных слоев, - изменяет ландшафт развития статистических моделей во многих областях исследований, таких как классификация изображений, машинный перевод, распознавание речи. Географические науки, а также входящая в их состав область исследования гидрологии суши, не стоят в стороне от этого движения. В последнее время применение современных технологий и методов глубокого обучения активно набирает популярность для решения широкого спектра гидрологических задач: моделирования и прогнозирования речного стока, районирования модельных параметров, оценки располагаемых водных ресурсов, идентификации факторов, влияющих на современные изменения водного режима. Такой рост популярности глубоких нейронных сетей продиктован прежде всего их высокой универсальностью и эффективностью. Представленные качества в совокупности с быстрорастущим количеством накопленной информации о состоянии окружающей среды, а также ростом доступности вычислительных средств и ресурсов, позволяют говорить о глубоких нейронных сетях как о новом поколении математических моделей, призванных если не заменить существующие решения, то значительно обогатить область моделирования геофизических процессов. В данной работе представлен краткий обзор текущего состояния области разработки и применения глубоких нейронных сетей в гидрологии. 
Также в работе предложен качественный долгосрочный прогноз развития технологии глубокого обучения для решения задач гидрологического моделирования на основе использования «кривой ажиотажа Гартнера», в общих чертах описывающей жизненный цикл современных технологий. T2 - Глубокие нейронные сети в гидрологии KW - deep neural networks KW - deep learning KW - machine learning KW - hydrology KW - modeling KW - глубокие нейронные сети KW - глубокое обучение KW - машинное обучение KW - гидрология KW - моделирование Y1 - 2021 U6 - https://doi.org/10.21638/spbu07.2021.101 SN - 2541-9668 SN - 2587-585X VL - 66 IS - 1 SP - 5 EP - 18 PB - Univ. Press CY - St. Petersburg ER - TY - CHAP A1 - Panzer, Marcel A1 - Bender, Benedict A1 - Gronau, Norbert T1 - Deep reinforcement learning in production planning and control BT - A systematic literature review T2 - Proceedings of the Conference on Production Systems and Logistics N2 - Increasingly fast development cycles and individualized products pose major challenges for today's smart production systems in times of Industry 4.0. The systems must be flexible and continuously adapt to changing conditions while still guaranteeing high throughputs and robustness against external disruptions. Deep reinforcement learning (RL) algorithms, which have already achieved impressive success with Google DeepMind's AlphaGo, are increasingly transferred to production systems to meet related requirements. Unlike supervised and unsupervised machine learning techniques, deep RL algorithms learn from recently collected sensor and process data in direct interaction with the environment and are able to make decisions in real-time. As such, deep RL algorithms seem promising given their potential to provide decision support in complex environments, such as production systems, and simultaneously adapt to changing circumstances. While different use-cases for deep RL have emerged, a structured overview and integration of findings on their application are missing. 
To address this gap, this contribution provides a systematic literature review of existing deep RL applications in the field of production planning and control as well as production logistics. From a performance perspective, it became evident that deep RL can significantly outperform heuristics and provides superior solutions for various industrial use-cases. Nevertheless, safety and reliability concerns must be overcome before widespread use of deep RL is possible, which presumes more intensive testing of deep RL in real-world applications beyond the intensive simulations already under way. KW - deep reinforcement learning KW - machine learning KW - production planning KW - production control KW - systematic literature review Y1 - 2021 U6 - https://doi.org/10.15488/11238 SN - 2701-6277 SP - 535 EP - 545 PB - publish-Ing. CY - Hannover ER - TY - GEN A1 - Panzer, Marcel A1 - Bender, Benedict A1 - Gronau, Norbert T1 - Deep reinforcement learning in production planning and control BT - A systematic literature review T2 - Zweitveröffentlichungen der Universität Potsdam : Wirtschafts- und Sozialwissenschaftliche Reihe N2 - Increasingly fast development cycles and individualized products pose major challenges for today's smart production systems in times of Industry 4.0. The systems must be flexible and continuously adapt to changing conditions while still guaranteeing high throughputs and robustness against external disruptions. Deep reinforcement learning (RL) algorithms, which have already achieved impressive success with Google DeepMind's AlphaGo, are increasingly transferred to production systems to meet related requirements. Unlike supervised and unsupervised machine learning techniques, deep RL algorithms learn from recently collected sensor and process data in direct interaction with the environment and are able to make decisions in real-time. 
As such, deep RL algorithms seem promising given their potential to provide decision support in complex environments, such as production systems, and simultaneously adapt to changing circumstances. While different use-cases for deep RL have emerged, a structured overview and integration of findings on their application are missing. To address this gap, this contribution provides a systematic literature review of existing deep RL applications in the field of production planning and control as well as production logistics. From a performance perspective, it became evident that deep RL can significantly outperform heuristics and provides superior solutions for various industrial use-cases. Nevertheless, safety and reliability concerns must be overcome before widespread use of deep RL is possible, which presumes more intensive testing of deep RL in real-world applications beyond the intensive simulations already under way. T3 - Zweitveröffentlichungen der Universität Potsdam : Wirtschafts- und Sozialwissenschaftliche Reihe - 198 KW - deep reinforcement learning KW - machine learning KW - production planning KW - production control KW - systematic literature review Y1 - 2021 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-605722 SN - 2701-6277 SN - 1867-5808 ER - TY - THES A1 - Rezaei, Mina T1 - Deep representation learning from imbalanced medical imaging N2 - Medical imaging plays an important role in disease diagnosis, treatment planning, and clinical monitoring. One of the major challenges in medical image analysis is imbalanced training data, in which the class of interest is much rarer than the other classes. Canonical machine learning algorithms assume that the number of samples from different classes in the training dataset is roughly similar, or balanced. Training a machine learning model on an imbalanced dataset can introduce unique challenges to the learning problem. 
A model learned from imbalanced training data is biased towards the high-frequency samples. The predicted results of such networks have low sensitivity and high precision. In medical applications, the cost of misclassifying the minority class can be much higher than the cost of misclassifying the majority class. For example, the risk of not detecting a tumor could be much higher than the risk of referring a healthy subject to a doctor. This Ph.D. thesis introduces several deep learning-based approaches for handling class-imbalance problems in multi-task learning, such as disease classification and semantic segmentation. At the data level, the objective is to balance the data distribution through re-sampling the data space: we propose novel approaches to correct the internal bias towards low-frequency samples. These approaches include patient-wise batch sampling, complementary labels, and supervised and unsupervised minority oversampling using generative adversarial networks. At the algorithm level, on the other hand, we modify the learning algorithm to alleviate the bias towards majority classes. In this regard, we propose different generative adversarial networks for cost-sensitive learning, ensemble learning, and mutual learning to deal with highly imbalanced imaging data. We show evidence that the proposed approaches are applicable to different types of medical images of varied sizes on different applications of routine clinical tasks, such as disease classification and semantic segmentation. Our various implemented algorithms have shown outstanding results on different medical imaging challenges. N2 - Medizinische Bildanalyse spielt eine wichtige Rolle bei der Diagnose von Krankheiten, der Behandlungsplanung und der klinischen Überwachung. Eines der großen Probleme in der medizinischen Bildanalyse ist das Vorhandensein von nicht ausbalancierten Trainingsdaten, bei denen die Anzahl der Datenpunkte der Zielklasse in der Unterzahl ist. 
Die Aussagen eines Modells, welches auf einem unbalancierten Datensatz trainiert wurde, tendieren dazu, Datenpunkte in die Klasse mit der Mehrzahl an Trainingsdaten einzuordnen. Die Aussagen eines solchen Modells haben eine geringe Sensitivität, aber hohe Genauigkeit. Im medizinischen Anwendungsbereich kann die Einordnung eines Datenpunktes in eine falsche Klasse schwerwiegende Folgen mit sich bringen. Die Nichterkennung eines Tumors beispielsweise birgt ein viel höheres Risiko für einen Patienten, als wenn ein gesunder Patient zum Arzt geschickt wird. Das Problem des Lernens unter Nutzung von nicht ausbalancierten Trainingsdaten wird erst seit Kurzem bei der Klassifizierung von Krankheiten, der Entdeckung von Tumoren und bei der Segmentierung von Tumoren untersucht. In der Literatur wird hier zwischen zwei verschiedenen Ansätzen unterschieden: datenbasierte und algorithmische Ansätze. Die vorliegende Arbeit behandelt das Lernen unter Nutzung von unbalancierten medizinischen Bilddatensätzen mittels datenbasierter und algorithmischer Ansätze. Bei den datenbasierten Ansätzen ist es unser Ziel, die Datenverteilung durch gezieltes Nutzen der vorliegenden Datenbasis auszubalancieren. Dazu schlagen wir neuartige Ansätze vor, um eine ausgeglichene Einordnung der Daten aus seltenen Klassen vornehmen zu können. Diese Ansätze sind unter anderem synthesize minority class sampling, patient-wise batch normalization, und die Erstellung von komplementären Labels unter Nutzung von generative adversarial networks. Auf der Seite der algorithmischen Ansätze verändern wir den Trainingsalgorithmus, um die Tendenz in Richtung der Klasse mit der Mehrzahl an Trainingsdaten zu verringern. Dafür schlagen wir verschiedene Algorithmen im Bereich des kostenintensiven Lernens, Ensemble-Lernens und des gemeinsamen Lernens vor, um mit stark unbalancierten Trainingsdaten umgehen zu können. 
Wir zeigen, dass unsere vorgeschlagenen Ansätze für verschiedenste Typen von medizinischen Bildern, mit variierender Größe, auf verschiedene Anwendungen im klinischen Alltag, z. B. Krankheitsklassifizierung oder semantische Segmentierung, anwendbar sind. Weiterhin haben unsere Algorithmen hervorragende Ergebnisse bei unterschiedlichen Wettbewerben zur medizinischen Bildanalyse gezeigt. KW - machine learning KW - deep learning KW - computer vision KW - imbalanced learning KW - medical image analysis KW - Maschinenlernen KW - tiefes Lernen KW - unbalancierter Datensatz KW - Computervision KW - medizinische Bildanalyse Y1 - 2019 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-442759 ER - TY - THES A1 - Lilienkamp, Henning T1 - Enhanced computational approaches for data-driven characterization of earthquake ground motion and rapid earthquake impact assessment T1 - Fortgeschrittene Berechnungsansätze für die datengestützte Charakterisierung von Erdbeben-Bodenbewegungen und die schnelle Einschätzung von Erdbebenauswirkungen N2 - Rapidly growing seismic and macroseismic databases and simplified access to advanced machine learning methods have in recent years opened up vast opportunities to address challenges in engineering and strong motion seismology from novel, data-centric perspectives. In this thesis, I explore the opportunities of such perspectives for the tasks of ground motion modeling and rapid earthquake impact assessment, tasks with major implications for long-term earthquake disaster mitigation. In my first study, I utilize the rich strong motion database from the Kanto basin, Japan, and apply the U-Net artificial neural network architecture to develop a deep learning based ground motion model. The operational prototype provides statistical estimates of expected ground shaking, given descriptions of a specific earthquake source, wave propagation paths, and geophysical site conditions. 
The U-Net interprets ground motion data in its spatial context, potentially taking into account, for example, the geological properties in the vicinity of observation sites. Predictions of ground motion intensity are thereby calibrated to individual observation sites and earthquake locations. The second study addresses the explicit incorporation of rupture forward directivity into ground motion modeling. Incorporation of this phenomenon, causing strong, pulse-like ground shaking in the vicinity of earthquake sources, is usually associated with an intolerable increase in computational demand during probabilistic seismic hazard analysis (PSHA) calculations. I suggest an approach in which I utilize an artificial neural network to efficiently approximate the average, directivity-related adjustment to ground motion predictions for earthquake ruptures from the 2022 New Zealand National Seismic Hazard Model. The practical implementation in an actual PSHA calculation demonstrates the efficiency and operational readiness of my model. In a follow-up study, I present a proof of concept for an alternative strategy in which I target generalization to ruptures other than those from the New Zealand National Seismic Hazard Model. In the third study, I address the usability of pseudo-intensity reports obtained from macroseismic observations by non-expert citizens for rapid impact assessment. I demonstrate that the statistical properties of pseudo-intensity collections describing the intensity of shaking are correlated with the societal impact of earthquakes. In a second step, I develop a probabilistic model that, within minutes of an event, quantifies the probability of an earthquake causing considerable societal impact. 
Under certain conditions, such a quick and preliminary method might be useful to support decision makers in their efforts to organize auxiliary measures for earthquake disaster response while results from more elaborate impact assessment frameworks are not yet available. The application of machine learning methods to datasets that only partially reveal characteristics of Big Data qualifies the majority of results obtained in this thesis as explorative insights rather than ready-to-use solutions to real-world problems. The practical usefulness of this work will be better assessed in the future by applying the approaches developed to growing and increasingly complex data sets. N2 - Das rapide Wachstum seismischer und makroseismischer Datenbanken und der vereinfachte Zugang zu fortschrittlichen Methoden aus dem Bereich des maschinellen Lernens haben in den letzten Jahren die datenfokussierte Betrachtung von Fragestellungen in der Seismologie ermöglicht. In dieser Arbeit erforsche ich das Potenzial solcher Betrachtungsweisen im Hinblick auf die Modellierung erdbebenbedingter Bodenerschütterungen und die rasche Einschätzung von gesellschaftlichen Erdbebenauswirkungen, Disziplinen von erheblicher Bedeutung für den langfristigen Erdbebenkatastrophenschutz in seismisch aktiven Regionen. In meiner ersten Studie nutze ich die Vielzahl an Bodenbewegungsdaten aus der Kanto Region in Japan, sowie eine spezielle neuronale Netzwerkarchitektur (U-Net), um ein Bodenbewegungsmodell zu entwickeln. Der einsatzbereite Prototyp liefert auf Basis der Charakterisierung von Erdbebenherden, Wellenausbreitungspfaden und Bodenbeschaffenheiten statistische Schätzungen der zu erwartenden Bodenerschütterungen. Das U-Net interpretiert Bodenbewegungsdaten im räumlichen Kontext, sodass etwa die geologischen Beschaffenheiten in der Umgebung von Messstationen mit einbezogen werden können. Auch die absoluten Koordinaten von Erdbebenherden und Messstationen werden berücksichtigt. 
Die zweite Studie behandelt die explizite Berücksichtigung richtungsabhängiger Verstärkungseffekte in der Bodenbewegungsmodellierung. Obwohl solche Effekte starke, impulsartige Erschütterungen in der Nähe von Erdbebenherden erzeugen, die eine erhebliche seismische Beanspruchung von Gebäuden darstellen, wird deren explizite Modellierung in der seismischen Gefährdungsabschätzung aufgrund des nicht vertretbaren Rechenaufwandes ausgelassen. Mit meinem auf einem neuronalen Netzwerk basierenden Ansatz schlage ich eine Methode vor, um dieses Vorhaben effizient für Erdbebenszenarien aus dem neuseeländischen seismischen Gefährdungsmodell für 2022 (NSHM) umzusetzen. Die Implementierung in einer seismischen Gefährdungsrechnung unterstreicht die Praktikabilität meines Modells. In einer anschließenden Machbarkeitsstudie untersuche ich einen alternativen Ansatz, der auf die Anwendbarkeit auf beliebige Erdbebenszenarien abzielt. Die abschließende dritte Studie befasst sich mit dem potenziellen Nutzen der von makroseismischen Beobachtungen abgeleiteten Pseudo-Erschütterungsintensitäten für die rasche Abschätzung von gesellschaftlichen Erdbebenauswirkungen. Ich zeige, dass sich aus den Merkmalen solcher Daten Schlussfolgerungen über die gesellschaftlichen Folgen eines Erdbebens ableiten lassen. Basierend darauf formuliere ich ein statistisches Modell, welches innerhalb weniger Minuten nach einem Erdbeben die Wahrscheinlichkeit für das Auftreten beachtlicher gesellschaftlicher Auswirkungen liefert. Ich komme zu dem Schluss, dass ein solches Modell unter bestimmten Bedingungen hilfreich sein könnte, um EntscheidungsträgerInnen in ihren Bestrebungen, Hilfsmaßnahmen zu organisieren, zu unterstützen. Die Anwendung von Methoden des maschinellen Lernens auf Datensätze, die sich nur begrenzt als Big Data charakterisieren lassen, qualifiziert die Mehrheit der Ergebnisse dieser Arbeit als explorative Einblicke und weniger als einsatzbereite Lösungen für praktische Fragestellungen. 
Der praktische Nutzen dieser Arbeit wird sich erst in Zukunft an der Anwendung der erarbeiteten Ansätze auf wachsende und zunehmend komplexe Datensätze final abschätzen lassen. KW - seismology KW - machine learning KW - deep learning KW - ground motion modeling KW - seismic hazard KW - rapid earthquake impact assessment KW - geophysics KW - Deep Learning KW - Geophysik KW - Bodenbewegungsmodellierung KW - maschinelles Lernen KW - schnelle Einschätzung von Erdbebenauswirkungen KW - seismische Gefährdung KW - Seismologie Y1 - 2024 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-631954 ER - TY - RPRT A1 - Andres, Maximilian T1 - Equilibrium selection in infinitely repeated games with communication T2 - CEPA Discussion Papers N2 - The present paper proposes a novel approach for equilibrium selection in the infinitely repeated prisoner’s dilemma where players can communicate before choosing their strategies. This approach yields a critical discount factor that makes different predictions for cooperation than the usually considered sub-game perfect or risk dominance critical discount factors. In laboratory experiments, we find that our factor is useful for predicting cooperation. For payoff changes where the usually considered factors and our factor make different predictions, the observed cooperation is consistent with the predictions based on our factor. 
T3 - CEPA Discussion Papers - 75 KW - cooperation KW - communication KW - infinitely repeated game KW - machine learning Y1 - 2024 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-631800 SN - 2628-653X IS - 75 ER - TY - JOUR A1 - Steinberg, Andreas A1 - Vasyura-Bathke, Hannes A1 - Gaebler, Peter Jost A1 - Ohrnberger, Matthias A1 - Ceranna, Lars T1 - Estimation of seismic moment tensors using variational inference machine learning JF - Journal of geophysical research : Solid earth N2 - We present an approach for rapidly estimating full moment tensors of earthquakes and their parameter uncertainties based on short time windows of recorded seismic waveform data by considering deep learning of Bayesian Neural Networks (BNNs). The individual neural networks are trained on synthetic seismic waveform data and corresponding known earthquake moment-tensor parameters. A monitoring volume has been predefined to form a three-dimensional grid of locations and to train a BNN for each grid point. Variational inference on several of these networks allows us to consider several sources of error and how they affect the estimated full moment-tensor parameters and their uncertainties. In particular, we demonstrate how estimated parameter distributions are affected by uncertainties in the earthquake centroid location in space and time as well as in the assumed Earth structure model. We apply our approach as a proof of concept on seismic waveform recordings of aftershocks of the Ridgecrest 2019 earthquake with moment magnitudes ranging from Mw 2.7 to Mw 5.5. Overall, good agreement has been achieved between inferred parameter ensembles and independently estimated parameters using classical methods. Our developed approach is fast and robust, and therefore suitable for downstream analyses that need rapid estimates of the source mechanism for a large number of earthquakes. 
KW - seismology KW - machine learning KW - earthquake source KW - moment tensor KW - full waveform Y1 - 2021 U6 - https://doi.org/10.1029/2021JB022685 SN - 2169-9313 SN - 2169-9356 VL - 126 IS - 10 PB - American Geophysical Union CY - Washington ER - TY - JOUR A1 - Kappattanavar, Arpita Mallikarjuna A1 - Hecker, Pascal A1 - Moontaha, Sidratul A1 - Steckhan, Nico A1 - Arnrich, Bert T1 - Food choices after cognitive load BT - an affective computing approach JF - Sensors N2 - Psychology and nutritional science research has highlighted the impact of negative emotions and cognitive load on calorie consumption behaviour using subjective questionnaires. Isolated studies in other domains objectively assess cognitive load without considering its effects on eating behaviour. This study aims to explore the potential for developing an integrated eating behaviour assistant system that incorporates cognitive load factors. Two experimental sessions were conducted using custom-developed experimentation software to induce different stimuli. During these sessions, we collected 30 h of physiological data, food consumption data, and affective state questionnaires to automatically detect cognitive load and analyse its effect on food choice. Utilising grid search optimisation and leave-one-subject-out cross-validation, a support vector machine model achieved a mean classification accuracy of 85.12% for the two cognitive load tasks using eight relevant features. Statistical analysis was performed on calorie consumption and questionnaire data. Furthermore, 75% of the subjects with higher negative affect significantly increased consumption of specific foods after high-cognitive-load tasks. These findings offer insights into the intricate relationship between cognitive load, affective states, and food choice, paving the way for an eating behaviour assistant system to manage food choices during cognitive load. 
Future research should enhance system capabilities and explore real-world applications. KW - cognitive load KW - eating behaviour KW - machine learning KW - physiological signals KW - photoplethysmography KW - electrodermal activity KW - sensors Y1 - 2023 U6 - https://doi.org/10.3390/s23146597 SN - 1424-8220 VL - 23 IS - 14 PB - MDPI CY - Basel ER - TY - JOUR A1 - Döllner, Jürgen Roland Friedrich T1 - Geospatial artificial intelligence BT - potentials of machine learning for 3D point clouds and geospatial digital twins JF - Journal of photogrammetry, remote sensing and geoinformation science : PFG : Photogrammetrie, Fernerkundung, Geoinformation N2 - Artificial intelligence (AI) is fundamentally changing the way IT solutions are implemented and operated across all application domains, including the geospatial domain. This contribution outlines AI-based techniques for 3D point clouds and geospatial digital twins as generic components of geospatial AI. First, we briefly reflect on the term "AI" and outline technology developments needed to apply AI to IT solutions, seen from a software engineering perspective. Next, we characterize 3D point clouds as a key category of geodata and their role in creating the basis for geospatial digital twins; we explain the feasibility of machine learning (ML) and deep learning (DL) approaches for 3D point clouds. In particular, we argue that 3D point clouds can be seen as a corpus with similar properties as natural language corpora and formulate a "Naturalness Hypothesis" for 3D point clouds. In the main part, we introduce a workflow for interpreting 3D point clouds based on ML/DL approaches that derive domain-specific and application-specific semantics for 3D point clouds without having to create explicit spatial 3D models or explicit rule sets. 
Finally, examples show how ML/DL enables us to efficiently build and maintain base data for geospatial digital twins such as virtual 3D city models, indoor models, or building information models. N2 - Georäumliche Künstliche Intelligenz: Potentiale des Maschinellen Lernens für 3D-Punktwolken und georäumliche digitale Zwillinge. Künstliche Intelligenz (KI) verändert grundlegend die Art und Weise, wie IT-Lösungen in allen Anwendungsbereichen, einschließlich dem Geoinformationsbereich, implementiert und betrieben werden. In diesem Beitrag stellen wir KI-basierte Techniken für 3D-Punktwolken als einen Baustein der georäumlichen KI vor. Zunächst werden kurz der Begriff "KI" und die technologischen Entwicklungen skizziert, die für die Anwendung von KI auf IT-Lösungen aus der Sicht der Softwaretechnik erforderlich sind. Als nächstes charakterisieren wir 3D-Punktwolken als Schlüsselkategorie von Geodaten und ihre Rolle für den Aufbau von räumlichen digitalen Zwillingen; wir erläutern die Machbarkeit der Ansätze für Maschinelles Lernen (ML) und Deep Learning (DL) in Bezug auf 3D-Punktwolken. Insbesondere argumentieren wir, dass 3D-Punktwolken als Korpus mit ähnlichen Eigenschaften wie natürlichsprachliche Korpora gesehen werden können und formulieren eine "Natürlichkeitshypothese" für 3D-Punktwolken. Im Hauptteil stellen wir einen Workflow zur Interpretation von 3D-Punktwolken auf der Grundlage von ML/DL-Ansätzen vor, die eine domänenspezifische und anwendungsspezifische Semantik für 3D-Punktwolken ableiten, ohne explizite räumliche 3D-Modelle oder explizite Regelsätze erstellen zu müssen. Abschließend wird an Beispielen gezeigt, wie ML/DL es ermöglichen, Basisdaten für räumliche digitale Zwillinge, wie z.B. für virtuelle 3D-Stadtmodelle, Innenraummodelle oder Gebäudeinformationsmodelle, effizient aufzubauen und zu pflegen. 
KW - geospatial artificial intelligence KW - machine learning KW - deep learning KW - 3D point clouds KW - geospatial digital twins KW - 3D city models Y1 - 2020 U6 - https://doi.org/10.1007/s41064-020-00102-3 SN - 2512-2789 SN - 2512-2819 VL - 88 IS - 1 SP - 15 EP - 24 PB - Springer International Publishing CY - Cham ER - TY - THES A1 - Sapegin, Andrey T1 - High-Speed Security Log Analytics Using Hybrid Outlier Detection N2 - The rapid development and integration of Information Technologies over the last decades influenced all areas of our life, including the business world. Yet not only are modern enterprises becoming digitalised; security and criminal threats are also moving into the digital sphere. To withstand these threats, modern companies must be aware of all activities within their computer networks. The keystone for such continuous security monitoring is a Security Information and Event Management (SIEM) system that collects and processes all security-related log messages from the entire enterprise network. However, digital transformations and technologies, such as network virtualisation and widespread usage of mobile communications, lead to a constantly increasing number of monitored devices and systems. As a result, the amount of data that has to be processed by a SIEM system is increasing rapidly. Besides that, in-depth security analysis of the captured data requires the application of rather sophisticated outlier detection algorithms that have a high computational complexity. Existing outlier detection methods often suffer from performance issues and are not directly applicable to high-speed and high-volume analysis of heterogeneous security-related events, which has become a major challenge for modern SIEM systems. This thesis provides a number of solutions for the mentioned challenges. 
First, it proposes a new SIEM system architecture for high-speed processing of security events, implementing parallel, in-memory and in-database processing principles. The proposed architecture also utilises the most efficient log format for high-speed data normalisation. Next, the thesis offers several novel high-speed outlier detection methods, including a generic Hybrid Outlier Detection method that can efficiently be used for Big Data analysis. Finally, a specialised User Behaviour Outlier Detection method is proposed for better threat detection and analysis of particular user behaviour cases. The proposed architecture and methods were evaluated in terms of both performance and accuracy, as well as compared with a classical architecture and existing algorithms. These evaluations were performed on multiple data sets, including simulated data, a well-known public intrusion detection data set, and real data from a large multinational enterprise. The evaluation results have proved the high performance and efficacy of the developed methods. All concepts proposed in this thesis were integrated into the prototype of the SIEM system, capable of high-speed analysis of Big Security Data, which makes this integrated SIEM platform highly relevant for modern enterprise security applications. N2 - In den letzten Jahrzehnten hat die schnelle Weiterentwicklung und Integration der Informationstechnologien alle Bereiche unseres Lebens beeinflusst, nicht zuletzt auch die Geschäftswelt. Aus der zunehmenden Digitalisierung des modernen Unternehmens ergeben sich jedoch auch neue digitale Sicherheitsrisiken und kriminelle Bedrohungen. Um sich vor diesen Bedrohungen zu schützen, muss das digitale Unternehmen alle Aktivitäten innerhalb seines Firmennetzes verfolgen. 
Der Schlüssel zur kontinuierlichen Überwachung aller sicherheitsrelevanten Informationen ist ein sogenanntes Security Information und Event Management (SIEM) System, das alle Meldungen innerhalb des Firmennetzwerks zentral sammelt und verarbeitet. Jedoch führen die digitale Transformation der Unternehmen sowie neue Technologien, wie die Netzwerkvirtualisierung und mobile Endgeräte, zu einer konstant steigenden Anzahl zu überwachender Geräte und Systeme. Dies wiederum hat ein kontinuierliches Wachstum der Datenmengen zur Folge, die das SIEM System verarbeiten muss. Innerhalb eines möglichst kurzen Zeitraumes muss somit eine sehr große Datenmenge (Big Data) analysiert werden, um auf Bedrohungen zeitnah reagieren zu können. Eine gründliche Analyse der sicherheitsrelevanten Aspekte der aufgezeichneten Daten erfordert den Einsatz fortgeschrittener Algorithmen der Anomalieerkennung, die eine hohe Rechenkomplexität aufweisen. Existierende Methoden der Anomalieerkennung haben oftmals Geschwindigkeitsprobleme und sind deswegen nicht anwendbar für die sehr schnelle Analyse sehr großer Mengen heterogener sicherheitsrelevanter Ereignisse. Diese Arbeit schlägt eine Reihe möglicher Lösungen für die benannten Herausforderungen vor. Zunächst wird eine neuartige SIEM Architektur vorgeschlagen, die es erlaubt, Ereignisse mit sehr hoher Geschwindigkeit zu verarbeiten. Das System basiert auf den Prinzipien der parallelen Programmierung sowie der In-Memory und In-Database Datenverarbeitung. Die vorgeschlagene Architektur verwendet außerdem das effizienteste Datenformat zur Vereinheitlichung der Daten in sehr hoher Geschwindigkeit. Des Weiteren wurden im Rahmen dieser Arbeit mehrere neuartige Hochgeschwindigkeitsverfahren zur Anomalieerkennung entwickelt. Eines ist die Hybride Anomalieerkennung (Hybrid Outlier Detection), die sehr effizient auf Big Data eingesetzt werden kann. 
Abschließend wird eine spezifische Anomalieerkennung für Nutzerverhalten (User Behaviour Outlier Detection) vorgeschlagen, die eine verbesserte Bedrohungsanalyse von spezifischen Verhaltensmustern der Benutzer erlaubt. Die entwickelte Systemarchitektur und die Algorithmen wurden sowohl im Hinblick auf Geschwindigkeit als auch auf Genauigkeit evaluiert und mit traditionellen Architekturen und existierenden Algorithmen verglichen. Die Evaluation wurde auf mehreren Datensätzen durchgeführt, unter anderem simulierten Daten, gut erforschten öffentlichen Datensätzen und echten Daten großer internationaler Konzerne. Die Resultate der Evaluation belegen die Geschwindigkeit und Effizienz der entwickelten Methoden. Alle Konzepte dieser Arbeit wurden in den Prototyp des SIEM Systems integriert, das in der Lage ist, Big Security Data mit sehr hoher Geschwindigkeit zu analysieren. Dies zeigt, dass diese integrierte SIEM Plattform eine hohe praktische Relevanz für moderne Sicherheitsanwendungen besitzt. T2 - Sicherheitsanalyse in Hochgeschwindigkeit mithilfe der Hybriden Anomalieerkennung KW - intrusion detection KW - security KW - machine learning KW - anomaly detection KW - outlier detection KW - novelty detection KW - in-memory KW - SIEM KW - IDS KW - Angriffserkennung KW - Sicherheit KW - Maschinelles Lernen KW - Anomalieerkennung KW - In-Memory KW - SIEM KW - IDS Y1 - 2018 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-426118 ER - TY - JOUR A1 - Andres, Maximilian A1 - Bruttel, Lisa A1 - Friedrichsen, Jana T1 - How communication makes the difference between a cartel and tacit collusion BT - a machine learning approach JF - European economic review N2 - This paper sheds new light on the role of communication for cartel formation. Using machine learning to evaluate free-form chat communication among firms in a laboratory experiment, we identify typical communication patterns for both explicit cartel formation and indirect attempts to collude tacitly. 
We document that firms are less likely to communicate explicitly about price fixing and more likely to use indirect messages when sanctioning institutions are present. This effect of sanctions on communication reinforces the direct cartel-deterring effect of sanctions as collusion is more difficult to reach and sustain without an explicit agreement. Indirect messages have no, or even a negative, effect on prices. KW - cartel KW - collusion KW - communication KW - machine learning KW - experiment Y1 - 2023 U6 - https://doi.org/10.1016/j.euroecorev.2022.104331 SN - 0014-2921 SN - 1873-572X VL - 152 SP - 1 EP - 18 PB - Elsevier CY - Amsterdam ER - TY - RPRT A1 - Andres, Maximilian A1 - Bruttel, Lisa Verena A1 - Friedrichsen, Jana T1 - How communication makes the difference between a cartel and tacit collusion BT - a machine learning approach T2 - CEPA Discussion Papers N2 - This paper sheds new light on the role of communication for cartel formation. Using machine learning to evaluate free-form chat communication among firms in a laboratory experiment, we identify typical communication patterns for both explicit cartel formation and indirect attempts to collude tacitly. We document that firms are less likely to communicate explicitly about price fixing and more likely to use indirect messages when sanctioning institutions are present. This effect of sanctions on communication reinforces the direct cartel-deterring effect of sanctions as collusion is more difficult to reach and sustain without an explicit agreement. Indirect messages have no, or even a negative, effect on prices. 
T3 - CEPA Discussion Papers - 53 KW - cartel KW - collusion KW - communication KW - machine learning KW - experiment Y1 - 2022 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-562234 SN - 2628-653X ER - TY - CHAP ED - Meinel, Christoph ED - Polze, Andreas ED - Oswald, Gerhard ED - Strotmann, Rolf ED - Seibold, Ulrich ED - Schulzki, Bernhard T1 - HPI Future SOC Lab BT - Proceedings 2016 N2 - The “HPI Future SOC Lab” is a cooperation of the Hasso Plattner Institute (HPI) and industrial partners. Its mission is to enable and promote exchange and interaction between the research community and the industrial partners. The HPI Future SOC Lab provides researchers with free-of-charge access to a complete infrastructure of state-of-the-art hardware and software. This infrastructure includes components, which might be too expensive for an ordinary research environment, such as servers with up to 64 cores and 2 TB main memory. The offerings address researchers particularly from but not limited to the areas of computer science and business information systems. Main areas of research include cloud computing, parallelization, and In-Memory technologies. This technical report presents results of research projects executed in 2016. Selected projects have presented their results on April 5th and November 3rd 2016 at the Future SOC Lab Day events. N2 - Das Future SOC Lab am HPI ist eine Kooperation des Hasso-Plattner-Instituts mit verschiedenen Industriepartnern. Seine Aufgabe ist die Ermöglichung und Förderung des Austausches zwischen Forschungsgemeinschaft und Industrie. Am Lab wird interessierten Wissenschaftlern eine Infrastruktur von neuester Hard- und Software kostenfrei für Forschungszwecke zur Verfügung gestellt. Dazu zählen teilweise noch nicht am Markt verfügbare Technologien, die im normalen Hochschulbereich in der Regel nicht zu finanzieren wären, bspw. Server mit bis zu 64 Cores und 2 TB Hauptspeicher. 
Diese Angebote richten sich insbesondere an Wissenschaftler in den Gebieten Informatik und Wirtschaftsinformatik. Einige der Schwerpunkte sind Cloud Computing, Parallelisierung und In-Memory Technologien. In diesem Technischen Bericht werden die Ergebnisse der Forschungsprojekte des Jahres 2016 vorgestellt. Ausgewählte Projekte stellten ihre Ergebnisse am 5. April 2016 und 3. November 2016 im Rahmen der Future SOC Lab Tag Veranstaltungen vor. KW - Future SOC Lab KW - research projects KW - multicore architectures KW - In-Memory technology KW - cloud computing KW - machine learning KW - artificial intelligence KW - Future SOC Lab KW - Forschungsprojekte KW - Multicore Architekturen KW - In-Memory Technologie KW - Cloud Computing KW - maschinelles Lernen KW - künstliche Intelligenz Y1 - 2016 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-406787 ER - TY - BOOK A1 - Zhang, Shuhao A1 - Plauth, Max A1 - Eberhardt, Felix A1 - Polze, Andreas A1 - Lehmann, Jens A1 - Sejdiu, Gezim A1 - Jabeen, Hajira A1 - Servadei, Lorenzo A1 - Möstl, Christian A1 - Bär, Florian A1 - Netzeband, André A1 - Schmidt, Rainer A1 - Knigge, Marlene A1 - Hecht, Sonja A1 - Prifti, Loina A1 - Krcmar, Helmut A1 - Sapegin, Andrey A1 - Jaeger, David A1 - Cheng, Feng A1 - Meinel, Christoph A1 - Friedrich, Tobias A1 - Rothenberger, Ralf A1 - Sutton, Andrew M. A1 - Sidorova, Julia A. A1 - Lundberg, Lars A1 - Rosander, Oliver A1 - Sköld, Lars A1 - Di Varano, Igor A1 - van der Walt, Estée A1 - Eloff, Jan H. P. A1 - Fabian, Benjamin A1 - Baumann, Annika A1 - Ermakova, Tatiana A1 - Kelkel, Stefan A1 - Choudhary, Yash A1 - Cooray, Thilini A1 - Rodríguez, Jorge A1 - Medina-Pérez, Miguel Angel A1 - Trejo, Luis A. A1 - Barrera-Animas, Ari Yair A1 - Monroy-Borja, Raúl A1 - López-Cuevas, Armando A1 - Ramírez-Márquez, José Emmanuel A1 - Grohmann, Maria A1 - Niederleithinger, Ernst A1 - Podapati, Sasidhar A1 - Schmidt, Christopher A1 - Huegle, Johannes A1 - de Oliveira, Roberto C. L. 
A1 - Soares, Fábio Mendes A1 - van Hoorn, André A1 - Neumer, Tamas A1 - Willnecker, Felix A1 - Wilhelm, Mathias A1 - Kuster, Bernhard ED - Meinel, Christoph ED - Polze, Andreas ED - Beins, Karsten ED - Strotmann, Rolf ED - Seibold, Ulrich ED - Rödszus, Kurt ED - Müller, Jürgen T1 - HPI Future SOC Lab – Proceedings 2017 T1 - HPI Future SOC Lab – Proceedings 2017 N2 - The “HPI Future SOC Lab” is a cooperation of the Hasso Plattner Institute (HPI) and industry partners. Its mission is to enable and promote exchange and interaction between the research community and the industry partners. The HPI Future SOC Lab provides researchers with free-of-charge access to a complete infrastructure of state-of-the-art hardware and software. This infrastructure includes components, which might be too expensive for an ordinary research environment, such as servers with up to 64 cores and 2 TB main memory. The offerings address researchers particularly from but not limited to the areas of computer science and business information systems. Main areas of research include cloud computing, parallelization, and In-Memory technologies. This technical report presents results of research projects executed in 2017. Selected projects have presented their results on April 25th and November 15th 2017 at the Future SOC Lab Day events. N2 - Das Future SOC Lab am HPI ist eine Kooperation des Hasso-Plattner-Instituts mit verschiedenen Industriepartnern. Seine Aufgabe ist die Ermöglichung und Förderung des Austausches zwischen Forschungsgemeinschaft und Industrie. Am Lab wird interessierten Wissenschaftlern eine Infrastruktur von neuester Hard- und Software kostenfrei für Forschungszwecke zur Verfügung gestellt. Dazu zählen teilweise noch nicht am Markt verfügbare Technologien, die im normalen Hochschulbereich in der Regel nicht zu finanzieren wären, bspw. Server mit bis zu 64 Cores und 2 TB Hauptspeicher. 
Diese Angebote richten sich insbesondere an Wissenschaftler in den Gebieten Informatik und Wirtschaftsinformatik. Einige der Schwerpunkte sind Cloud Computing, Parallelisierung und In-Memory Technologien. In diesem Technischen Bericht werden die Ergebnisse der Forschungsprojekte des Jahres 2017 vorgestellt. Ausgewählte Projekte stellten ihre Ergebnisse am 25. April und 15. November 2017 im Rahmen der Future SOC Lab Tag Veranstaltungen vor. T3 - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 130 KW - Future SOC Lab KW - research projects KW - multicore architectures KW - In-Memory technology KW - cloud computing KW - machine learning KW - artificial intelligence KW - Future SOC Lab KW - Forschungsprojekte KW - Multicore Architekturen KW - In-Memory Technologie KW - Cloud Computing KW - maschinelles Lernen KW - Künstliche Intelligenz Y1 - 2020 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-433100 SN - 978-3-86956-475-3 SN - 1613-5652 SN - 2191-1665 IS - 130 PB - Universitätsverlag Potsdam CY - Potsdam ER - TY - BOOK A1 - Rana, Kaushik A1 - Mohapatra, Durga Prasad A1 - Sidorova, Julia A1 - Lundberg, Lars A1 - Sköld, Lars A1 - Lopes Grim, Luís Fernando A1 - Sampaio Gradvohl, André Leon A1 - Cremerius, Jonas A1 - Siegert, Simon A1 - Weltzien, Anton von A1 - Baldi, Annika A1 - Klessascheck, Finn A1 - Kalancha, Svitlana A1 - Lichtenstein, Tom A1 - Shaabani, Nuhad A1 - Meinel, Christoph A1 - Friedrich, Tobias A1 - Lenzner, Pascal A1 - Schumann, David A1 - Wiese, Ingmar A1 - Sarna, Nicole A1 - Wiese, Lena A1 - Tashkandi, Araek Sami A1 - van der Walt, Estée A1 - Eloff, Jan H. P. 
A1 - Schmidt, Christopher A1 - Hügle, Johannes A1 - Horschig, Siegfried A1 - Uflacker, Matthias A1 - Najafi, Pejman A1 - Sapegin, Andrey A1 - Cheng, Feng A1 - Stojanovic, Dragan A1 - Stojnev Ilić, Aleksandra A1 - Djordjevic, Igor A1 - Stojanovic, Natalija A1 - Predic, Bratislav A1 - González-Jiménez, Mario A1 - de Lara, Juan A1 - Mischkewitz, Sven A1 - Kainz, Bernhard A1 - van Hoorn, André A1 - Ferme, Vincenzo A1 - Schulz, Henning A1 - Knigge, Marlene A1 - Hecht, Sonja A1 - Prifti, Loina A1 - Krcmar, Helmut A1 - Fabian, Benjamin A1 - Ermakova, Tatiana A1 - Kelkel, Stefan A1 - Baumann, Annika A1 - Morgenstern, Laura A1 - Plauth, Max A1 - Eberhard, Felix A1 - Wolff, Felix A1 - Polze, Andreas A1 - Cech, Tim A1 - Danz, Noel A1 - Noack, Nele Sina A1 - Pirl, Lukas A1 - Beilharz, Jossekin Jakob A1 - De Oliveira, Roberto C. L. A1 - Soares, Fábio Mendes A1 - Juiz, Carlos A1 - Bermejo, Belen A1 - Mühle, Alexander A1 - Grüner, Andreas A1 - Saxena, Vageesh A1 - Gayvoronskaya, Tatiana A1 - Weyand, Christopher A1 - Krause, Mirko A1 - Frank, Markus A1 - Bischoff, Sebastian A1 - Behrens, Freya A1 - Rückin, Julius A1 - Ziegler, Adrian A1 - Vogel, Thomas A1 - Tran, Chinh A1 - Moser, Irene A1 - Grunske, Lars A1 - Szárnyas, Gábor A1 - Marton, József A1 - Maginecz, János A1 - Varró, Dániel A1 - Antal, János Benjamin ED - Meinel, Christoph ED - Polze, Andreas ED - Beins, Karsten ED - Strotmann, Rolf ED - Seibold, Ulrich ED - Rödszus, Kurt ED - Müller, Jürgen T1 - HPI Future SOC Lab – Proceedings 2018 N2 - The “HPI Future SOC Lab” is a cooperation of the Hasso Plattner Institute (HPI) and industry partners. Its mission is to enable and promote exchange and interaction between the research community and the industry partners. The HPI Future SOC Lab provides researchers with free-of-charge access to a complete infrastructure of state-of-the-art hardware and software. 
This infrastructure includes components, which might be too expensive for an ordinary research environment, such as servers with up to 64 cores and 2 TB main memory. The offerings address researchers particularly from but not limited to the areas of computer science and business information systems. Main areas of research include cloud computing, parallelization, and In-Memory technologies. This technical report presents results of research projects executed in 2018. Selected projects have presented their results on April 17th and November 14th 2018 at the Future SOC Lab Day events. N2 - Das Future SOC Lab am HPI ist eine Kooperation des Hasso-Plattner-Instituts mit verschiedenen Industriepartnern. Seine Aufgabe ist die Ermöglichung und Förderung des Austausches zwischen Forschungsgemeinschaft und Industrie. Am Lab wird interessierten Wissenschaftler:innen eine Infrastruktur von neuester Hard- und Software kostenfrei für Forschungszwecke zur Verfügung gestellt. Dazu zählen Systeme, die im normalen Hochschulbereich in der Regel nicht zu finanzieren wären, bspw. Server mit bis zu 64 Cores und 2 TB Hauptspeicher. Diese Angebote richten sich insbesondere an Wissenschaftler:innen in den Gebieten Informatik und Wirtschaftsinformatik. Einige der Schwerpunkte sind Cloud Computing, Parallelisierung und In-Memory Technologien. In diesem Technischen Bericht werden die Ergebnisse der Forschungsprojekte des Jahres 2018 vorgestellt. Ausgewählte Projekte stellten ihre Ergebnisse am 17. April und 14. November 2018 im Rahmen des Future SOC Lab Tags vor. 
T3 - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 151 KW - Future SOC Lab KW - research projects KW - multicore architectures KW - in-memory technology KW - cloud computing KW - machine learning KW - artificial intelligence KW - Future SOC Lab KW - Forschungsprojekte KW - Multicore Architekturen KW - In-Memory Technologie KW - Cloud Computing KW - maschinelles Lernen KW - künstliche Intelligenz Y1 - 2022 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-563712 SN - 978-3-86956-547-7 SN - 1613-5652 SN - 2191-1665 IS - 151 PB - Universitätsverlag Potsdam CY - Potsdam ER - TY - BOOK A1 - Kuban, Robert A1 - Rotta, Randolf A1 - Nolte, Jörg A1 - Chromik, Jonas A1 - Beilharz, Jossekin Jakob A1 - Pirl, Lukas A1 - Friedrich, Tobias A1 - Lenzner, Pascal A1 - Weyand, Christopher A1 - Juiz, Carlos A1 - Bermejo, Belen A1 - Sauer, Joao A1 - Coelh, Leandro dos Santos A1 - Najafi, Pejman A1 - Pünter, Wenzel A1 - Cheng, Feng A1 - Meinel, Christoph A1 - Sidorova, Julia A1 - Lundberg, Lars A1 - Vogel, Thomas A1 - Tran, Chinh A1 - Moser, Irene A1 - Grunske, Lars A1 - Elsaid, Mohamed Esameldin Mohamed A1 - Abbas, Hazem M. A1 - Rula, Anisa A1 - Sejdiu, Gezim A1 - Maurino, Andrea A1 - Schmidt, Christopher A1 - Hügle, Johannes A1 - Uflacker, Matthias A1 - Nozza, Debora A1 - Messina, Enza A1 - Hoorn, André van A1 - Frank, Markus A1 - Schulz, Henning A1 - Alhosseini Almodarresi Yasin, Seyed Ali A1 - Nowicki, Marek A1 - Muite, Benson K. A1 - Boysan, Mehmet Can A1 - Bianchi, Federico A1 - Cremaschi, Marco A1 - Moussa, Rim A1 - Abdel-Karim, Benjamin M. 
A1 - Pfeuffer, Nicolas A1 - Hinz, Oliver A1 - Plauth, Max A1 - Polze, Andreas A1 - Huo, Da A1 - Melo, Gerard de A1 - Mendes Soares, Fábio A1 - Oliveira, Roberto Célio Limão de A1 - Benson, Lawrence A1 - Paul, Fabian A1 - Werling, Christian A1 - Windheuser, Fabian A1 - Stojanovic, Dragan A1 - Djordjevic, Igor A1 - Stojanovic, Natalija A1 - Stojnev Ilic, Aleksandra A1 - Weidmann, Vera A1 - Lowitzki, Leon A1 - Wagner, Markus A1 - Ifa, Abdessatar Ben A1 - Arlos, Patrik A1 - Megia, Ana A1 - Vendrell, Joan A1 - Pfitzner, Bjarne A1 - Redondo, Alberto A1 - Ríos Insua, David A1 - Albert, Justin Amadeus A1 - Zhou, Lin A1 - Arnrich, Bert A1 - Szabó, Ildikó A1 - Fodor, Szabina A1 - Ternai, Katalin A1 - Bhowmik, Rajarshi A1 - Campero Durand, Gabriel A1 - Shevchenko, Pavlo A1 - Malysheva, Milena A1 - Prymak, Ivan A1 - Saake, Gunter ED - Meinel, Christoph ED - Polze, Andreas ED - Beins, Karsten ED - Strotmann, Rolf ED - Seibold, Ulrich ED - Rödszus, Kurt ED - Müller, Jürgen T1 - HPI Future SOC Lab – Proceedings 2019 N2 - The “HPI Future SOC Lab” is a cooperation of the Hasso Plattner Institute (HPI) and industry partners. Its mission is to enable and promote exchange and interaction between the research community and the industry partners. The HPI Future SOC Lab provides researchers with free-of-charge access to a complete infrastructure of state-of-the-art hardware and software. This infrastructure includes components, which might be too expensive for an ordinary research environment, such as servers with up to 64 cores and 2 TB main memory. The offerings address researchers particularly from but not limited to the areas of computer science and business information systems. Main areas of research include cloud computing, parallelization, and In-Memory technologies. This technical report presents results of research projects executed in 2019. Selected projects have presented their results on April 9th and November 12th 2019 at the Future SOC Lab Day events. 
N2 - Das Future SOC Lab am HPI ist eine Kooperation des Hasso-Plattner-Instituts mit verschiedenen Industriepartnern. Seine Aufgabe ist die Ermöglichung und Förderung des Austausches zwischen Forschungsgemeinschaft und Industrie. Am Lab wird interessierten Wissenschaftlern eine Infrastruktur von neuester Hard- und Software kostenfrei für Forschungszwecke zur Verfügung gestellt. Dazu zählen teilweise noch nicht am Markt verfügbare Technologien, die im normalen Hochschulbereich in der Regel nicht zu finanzieren wären, bspw. Server mit bis zu 64 Cores und 2 TB Hauptspeicher. Diese Angebote richten sich insbesondere an Wissenschaftler in den Gebieten Informatik und Wirtschaftsinformatik. Einige der Schwerpunkte sind Cloud Computing, Parallelisierung und In-Memory Technologien. In diesem Technischen Bericht werden die Ergebnisse der Forschungsprojekte des Jahres 2019 vorgestellt. Ausgewählte Projekte stellten ihre Ergebnisse am 09. April und 12. November 2019 im Rahmen des Future SOC Lab Tags vor. T3 - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 158 KW - Future SOC Lab KW - research projects KW - multicore architectures KW - in-memory technology KW - cloud computing KW - machine learning KW - artificial intelligence KW - Future SOC Lab KW - Forschungsprojekte KW - Multicore Architekturen KW - In-Memory Technologie KW - Cloud Computing KW - maschinelles Lernen KW - künstliche Intelligenz Y1 - 2023 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-597915 SN - 978-3-86956-564-4 SN - 1613-5652 SN - 2191-1665 IS - 158 PB - Universitätsverlag Potsdam CY - Potsdam ER - TY - BOOK A1 - Weber, Benedikt T1 - Human pose estimation for decubitus prophylaxis T1 - Verwendung von Posenabschätzung zur Dekubitusprophylaxe N2 - Decubitus is one of the most relevant diseases in nursing and the most expensive to treat. It is caused by sustained pressure on tissue, so it particularly affects bed-bound patients. 
This work lays a foundation for pressure mattress-based decubitus prophylaxis by implementing a solution to the single-frame 2D Human Pose Estimation problem. For this, methods of Deep Learning are employed. Two approaches are examined: a coarse-to-fine Convolutional Neural Network for direct regression of joint coordinates and a U-Net for the derivation of probability distribution heatmaps. We conclude that training our models on a combined dataset of the publicly available Bodies at Rest and SLP data yields the best results. Furthermore, various preprocessing techniques are investigated, and a hyperparameter optimization is performed to discover an improved model architecture. Another finding indicates that the heatmap-based approach outperforms direct regression. This model achieves a mean per-joint position error of 9.11 cm for the Bodies at Rest data and 7.43 cm for the SLP data. We find that it generalizes well on data from mattresses other than those seen during training but has difficulties detecting the arms correctly. Additionally, we give a brief overview of the medical data annotation tool annoto, which we developed in the bachelor project, and conclude that the Scrum framework and agile practices enhanced our development workflow. N2 - Dekubitus ist eine der relevantesten Krankheiten in der Krankenpflege und die kostspieligste in der Behandlung. Sie wird durch anhaltenden Druck auf Gewebe verursacht, betrifft also insbesondere bettlägerige Patienten. Diese Arbeit legt eine Grundlage für druckmatratzenbasierte Dekubitusprophylaxe, indem eine Lösung für das Einzelbild-2D-Posenabschätzungsproblem implementiert wird. Dafür werden Methoden des tiefen Lernens verwendet. Zwei Ansätze, basierend auf einem Gefalteten Neuronalen grob-zu-fein Netzwerk zur direkten Regression der Gelenkkoordinaten und auf einem U-Netzwerk zur Ableitung von Wahrscheinlichkeitsverteilungsbildern, werden untersucht. 
Wir schlussfolgern, dass das Training unserer Modelle auf einem kombinierten Datensatz, bestehend aus den frei verfügbaren Bodies at Rest und SLP Daten, die besten Ergebnisse liefert. Weiterhin werden diverse Vorverarbeitungsverfahren untersucht und eine Hyperparameteroptimierung zum Finden einer verbesserten Modellarchitektur durchgeführt. Der wahrscheinlichkeitsverteilungsbasierte Ansatz übertrifft die direkte Regression. Dieses Modell erreicht einen durchschnittlichen Pro-Gelenk-Positionsfehler von 9,11 cm auf den Bodies at Rest und von 7,43 cm auf den SLP Daten. Wir sehen, dass es gut auf Daten anderer als der im Training verwendeten Matratzen funktioniert, aber Schwierigkeiten mit der korrekten Erkennung der Arme hat. Weiterhin geben wir eine kurze Übersicht des medizinischen Datenannotationstools annoto, welches wir im Zusammenhang mit dem Bachelorprojekt entwickelt haben, und schlussfolgern außerdem, dass Scrum und agile Praktiken unseren Entwicklungsprozess verbessert haben. T3 - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 153 KW - machine learning KW - deep learning KW - convolutional neural networks KW - pose estimation KW - decubitus KW - telemedicine KW - maschinelles Lernen KW - tiefes Lernen KW - gefaltete neuronale Netze KW - Posenabschätzung KW - Dekubitus KW - Telemedizin Y1 - 2023 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-567196 SN - 978-3-86956-551-4 SN - 1613-5652 SN - 2191-1665 IS - 153 PB - Universitätsverlag Potsdam CY - Potsdam ER - TY - THES A1 - Schröter, Kai T1 - Improved flood risk assessment BT - new data sources and methods for flood risk modelling N2 - Rivers have always flooded their floodplains. Over 2.5 billion people worldwide have been affected by flooding in recent decades. The economic damage is also considerable, averaging 100 billion US dollars per year. There is no doubt that damage and other negative effects of floods can be avoided. 
However, this has a price: financially and politically. Costs and benefits can be estimated through risk assessments. Questions about the location and frequency of floods, about the objects that could be affected, and about their vulnerability are of importance for flood risk managers, insurance companies and politicians. Thus, variables and factors from both hydrology and socio-economics play a role, with multi-layered interconnections. One example is dikes along a river, which on the one hand contain floods, but on the other hand, by narrowing the natural floodplains, accelerate the flood discharge and increase the danger of flooding for the residents downstream. Such larger connections must be included in the assessment of flood risk. However, in current procedures this is accompanied by simplifying assumptions. Risk assessments are therefore fuzzy and associated with uncertainties. This thesis investigates the benefits and possibilities of new data sources for improving flood risk assessment. New methods and models are developed, which take the mentioned interrelations better into account and also quantify the existing uncertainties of the model results, and thus enable statements about the reliability of risk estimates. For this purpose, data on flood events from various sources are collected and evaluated. This includes precipitation and flow records at measuring stations as well as, for instance, images from social media, which can help to delineate the flooded areas and estimate flood damage with location information. Machine learning methods have been successfully used to recognize and understand correlations between floods and impacts from a wide range of data and to develop improved models. Risk models help to develop and evaluate strategies to reduce flood risk. These tools also provide advanced insights into the interplay of various factors and into the expected consequences of flooding. 
This work shows progress in terms of an improved assessment of flood risks by using diverse data from different sources with innovative methods as well as by the further development of models. Flood risk is variable due to economic and climatic changes, and other drivers of risk. In order to keep the knowledge about flood risks up to date, robust, efficient and adaptable methods, as proposed in this thesis, are of increasing importance. N2 - Flüsse haben seit jeher ihre Auen überflutet. In den vergangenen Jahrzehnten waren weltweit über 2,5 Milliarden Menschen durch Hochwasser betroffen. Auch der ökonomische Schaden ist mit durchschnittlich 100 Milliarden US Dollar pro Jahr erheblich. Zweifelsohne können Schäden und andere negative Auswirkungen von Hochwasser vermieden werden. Allerdings hat dies einen Preis: finanziell und politisch. Kosten und Nutzen lassen sich durch Risikobewertungen abschätzen. Dabei werden in der Wasserwirtschaft, von Versicherungen und der Politik Fragen nach dem Ort und der Häufigkeit von Überflutungen, nach den Dingen, die betroffen sein könnten und deren Anfälligkeit untersucht. Somit spielen sowohl Größen und Faktoren aus den Bereichen der Hydrologie und Sozioökonomie mit vielschichtigen Zusammenhängen eine Rolle. Ein anschauliches Beispiel sind Deiche entlang eines Flusses, die einerseits in ihrem Abschnitt Überflutungen eindämmen, andererseits aber durch die Einengung der natürlichen Vorländer den Hochwasserabfluss beschleunigen und die Gefährdung für die Anlieger flussab verschärfen. Solche größeren Zusammenhänge müssen in der Bewertung des Hochwasserrisikos einbezogen werden. In derzeit gängigen Verfahren geht dies mit vereinfachenden Annahmen einher. Risikoabschätzungen sind daher unscharf und mit Unsicherheiten verbunden. Diese Arbeit untersucht den Nutzen und die Möglichkeiten neuer Datensätze für eine Verbesserung der Hochwasserrisikoabschätzung. 
Es werden neue Methoden und Modelle entwickelt, die die angesprochenen Zusammenhänge stärker berücksichtigen und auch die bestehenden Unsicherheiten der Modellergebnisse beziffern und somit die Verlässlichkeit der getroffenen Aussagen einordnen lassen. Dafür werden Daten zu Hochwasserereignissen aus verschiedenen Quellen erfasst und ausgewertet. Dazu zählen neben Niederschlags- und Durchflussaufzeichnungen an Messstationen beispielsweise auch Bilder aus sozialen Medien, die mit Ortsangaben und Bildinhalten helfen können, die Überflutungsflächen abzugrenzen und Hochwasserschäden zu schätzen. Verfahren des Maschinellen Lernens wurden erfolgreich eingesetzt, um aus vielfältigen Daten Zusammenhänge zwischen Hochwasser und Auswirkungen zu erkennen, besser zu verstehen und verbesserte Modelle zu entwickeln. Solche Risikomodelle helfen bei der Entwicklung und Bewertung von Strategien zur Minderung des Hochwasserrisikos. Diese Werkzeuge ermöglichen darüber hinaus Einblicke in das Zusammenspiel verschiedener Faktoren sowie Aussagen zu den zu erwartenden Folgen auch von Hochwassern, die das bisher bekannte Ausmaß übersteigen. Diese Arbeit verzeichnet Fortschritte in Bezug auf eine verbesserte Bewertung von Hochwasserrisiken durch die Nutzung vielfältiger Daten aus unterschiedlichen Quellen mit innovativen Verfahren sowie der Weiterentwicklung von Modellen. Das Hochwasserrisiko unterliegt durch wirtschaftliche Entwicklungen und klimatische Veränderungen einem steten Wandel. Um das Wissen über Risiken aktuell zu halten, sind robuste, leistungs- und anpassungsfähige Verfahren, wie sie in dieser Arbeit vorgestellt werden, von zunehmender Bedeutung. 
T2 - Verbesserte Hochwasserrisikobewertung: Neue Datenquellen und Methoden für die Risikomodellierung KW - flood KW - risk KW - vulnerability KW - machine learning KW - uncertainty KW - Hochwasser KW - Risiko KW - Vulnerabilität KW - Maschinelles Lernen KW - Unsicherheiten Y1 - 2020 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-480240 ER - TY - GEN A1 - Konak, Orhan A1 - Wegner, Pit A1 - Arnrich, Bert T1 - IMU-Based Movement Trajectory Heatmaps for Human Activity Recognition T2 - Postprints der Universität Potsdam : Reihe der Digital Engineering Fakultät N2 - Recent trends in ubiquitous computing have led to a proliferation of studies that focus on human activity recognition (HAR) utilizing inertial sensor data that consist of acceleration, orientation and angular velocity. However, the performances of such approaches are limited by the amount of annotated training data, especially in fields where annotating data is highly time-consuming and requires specialized professionals, such as in healthcare. In image classification, this limitation has been mitigated by powerful oversampling techniques such as data augmentation. Using this technique, this work evaluates to what extent transforming inertial sensor data into movement trajectories and into 2D heatmap images can be advantageous for HAR when data are scarce. A convolutional long short-term memory (ConvLSTM) network that incorporates spatiotemporal correlations was used to classify the heatmap images. Evaluation was carried out on Deep Inertial Poser (DIP), a known dataset composed of inertial sensor data. The results obtained suggest that for datasets with large numbers of subjects, using state-of-the-art methods remains the best alternative. However, a performance advantage was achieved for small datasets, which is usually the case in healthcare. 
Moreover, movement trajectories provide a visual representation of human activities, which can help researchers to better interpret and analyze motion patterns. T3 - Zweitveröffentlichungen der Universität Potsdam : Reihe der Digital Engineering Fakultät - 4 KW - human activity recognition KW - image processing KW - machine learning KW - sensor data Y1 - 2021 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-487799 IS - 4 ER - TY - JOUR A1 - Konak, Orhan A1 - Wegner, Pit A1 - Arnrich, Bert T1 - IMU-Based Movement Trajectory Heatmaps for Human Activity Recognition JF - Sensors N2 - Recent trends in ubiquitous computing have led to a proliferation of studies that focus on human activity recognition (HAR) utilizing inertial sensor data that consist of acceleration, orientation and angular velocity. However, the performances of such approaches are limited by the amount of annotated training data, especially in fields where annotating data is highly time-consuming and requires specialized professionals, such as in healthcare. In image classification, this limitation has been mitigated by powerful oversampling techniques such as data augmentation. Using this technique, this work evaluates to what extent transforming inertial sensor data into movement trajectories and into 2D heatmap images can be advantageous for HAR when data are scarce. A convolutional long short-term memory (ConvLSTM) network that incorporates spatiotemporal correlations was used to classify the heatmap images. Evaluation was carried out on Deep Inertial Poser (DIP), a known dataset composed of inertial sensor data. The results obtained suggest that for datasets with large numbers of subjects, using state-of-the-art methods remains the best alternative. However, a performance advantage was achieved for small datasets, which is usually the case in healthcare. 
Moreover, movement trajectories provide a visual representation of human activities, which can help researchers to better interpret and analyze motion patterns. KW - human activity recognition KW - image processing KW - machine learning KW - sensor data Y1 - 2020 U6 - https://doi.org/10.3390/s20247179 SN - 1424-8220 VL - 20 IS - 24 PB - MDPI CY - Basel ER - TY - JOUR A1 - Cope, Justin L. A1 - Baukmann, Hannes A. A1 - Klinger, Jörn E. A1 - Ravarani, Charles N. J. A1 - Böttinger, Erwin A1 - Konigorski, Stefan A1 - Schmidt, Marco F. T1 - Interaction-based feature selection algorithm outperforms polygenic risk score in predicting Parkinson’s Disease status JF - Frontiers in genetics N2 - Polygenic risk scores (PRS) aggregating results from genome-wide association studies are the state of the art in the prediction of susceptibility to complex traits or diseases, yet their predictive performance is limited for various reasons, not least of which is their failure to incorporate the effects of gene-gene interactions. Novel machine learning algorithms that use large amounts of data promise to find gene-gene interactions in order to build models with better predictive performance than PRS. Here, we present a data preprocessing step by using data-mining of contextual information to reduce the number of features, enabling machine learning algorithms to identify gene-gene interactions. We applied our approach to the Parkinson's Progression Markers Initiative (PPMI) dataset, an observational clinical study of 471 genotyped subjects (368 cases and 152 controls). With an AUC of 0.85 (95% CI = [0.72; 0.96]), the interaction-based prediction model outperforms the PRS (AUC of 0.58 (95% CI = [0.42; 0.81])). Furthermore, feature importance analysis of the model provided insights into the mechanism of Parkinson's disease. For instance, the model revealed an interaction of previously described drug target candidate genes TMEM175 and GAPDHP25. 
These results demonstrate that interaction-based machine learning models can improve genetic prediction models and might provide an answer to the missing heritability problem. KW - epistasis KW - machine learning KW - feature selection KW - parkinson's disease KW - PPMI (parkinson's progression markers initiative) Y1 - 2021 U6 - https://doi.org/10.3389/fgene.2021.744557 SN - 1664-8021 VL - 12 PB - Frontiers Media CY - Lausanne ER - TY - JOUR A1 - Wulff, Peter A1 - Mientus, Lukas A1 - Nowak, Anna A1 - Borowski, Andreas T1 - KI-basierte Auswertung von schriftlichen Unterrichtsreflexionen im Fach Physik und automatisierte Rückmeldung JF - PSI-Potsdam: Ergebnisbericht zu den Aktivitäten im Rahmen der Qualitätsoffensive Lehrerbildung (2019-2023) (Potsdamer Beiträge zur Lehrerbildung und Bildungsforschung ; 3) N2 - Für die Entwicklung professioneller Handlungskompetenzen angehender Lehrkräfte stellt die Unterrichtsreflexion ein wichtiges Instrument dar, um Theoriewissen und Praxiserfahrungen in Beziehung zu setzen. Die Auswertung von Unterrichtsreflexionen und eine entsprechende Rückmeldung stellt Forschende und Dozierende allerdings vor praktische wie theoretische Herausforderungen. Im Kontext der Forschung zu Künstlicher Intelligenz (KI) entwickelte Methoden bieten hier neue Potenziale. Der Beitrag stellt überblicksartig zwei Teilstudien vor, die mit Hilfe von KI-Methoden wie dem maschinellen Lernen untersuchen, inwieweit eine Auswertung von Unterrichtsreflexionen angehender Physiklehrkräfte auf Basis eines theoretisch abgeleiteten Reflexionsmodells und die automatisierte Rückmeldung hierzu möglich sind. Dabei wurden unterschiedliche Ansätze des maschinellen Lernens verwendet, um modellbasierte Klassifikation und Exploration von Themen in Unterrichtsreflexionen umzusetzen. Die Genauigkeit der Ergebnisse wurde vor allem durch sog. Große Sprachmodelle gesteigert, die auch den Transfer auf andere Standorte und Fächer ermöglichen. 
Für die fachdidaktische Forschung bedeuten sie jedoch wiederum neue Herausforderungen, wie etwa systematische Verzerrungen und Intransparenz von Entscheidungen. Dennoch empfehlen wir, die Potenziale der KI-basierten Methoden gründlicher zu erforschen und konsequent in der Praxis (etwa in Form von Webanwendungen) zu implementieren. N2 - For the development of professional competencies in pre-service teachers, reflection on teaching experiences is proposed as an important tool to link theoretical knowledge and practice. However, evaluating reflections and providing appropriate feedback poses challenges of both theoretical and practical nature to researchers and educators. Methods associated with artificial intelligence research offer new potentials to discover patterns in complex datasets like reflections, as well as to evaluate these automatically and create feedback. In this article, we provide an overview of two sub-studies that investigate, using artificial intelligence methods such as machine learning, to what extent an evaluation of reflections of pre-service physics teachers based on a theoretically derived reflection model and automated feedback are possible. Across the sub-studies, different machine learning approaches were used to implement model-based classification and exploration of topics in reflections. Large language models in particular increase the accuracy of the results and allow for transfer to other locations and disciplines. However, entirely new challenges arise for educational research in relation to large language models, such as systematic biases and lack of transparency in decisions. Despite these uncertainties, we recommend further exploring the potentials of artificial intelligence-based methods and implementing them consistently in practice (for example, in the form of web applications). 
KW - Künstliche Intelligenz KW - Maschinelles Lernen KW - Natural Language Processing KW - Reflexion KW - Professionalisierung KW - artificial intelligence KW - machine learning KW - natural language processing KW - reflexion KW - professionalization Y1 - 2023 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-616363 SN - 978-3-86956-568-2 SN - 2626-3556 SN - 2626-4722 IS - 3 SP - 103 EP - 115 PB - Universitätsverlag Potsdam CY - Potsdam ER - TY - THES A1 - Loster, Michael T1 - Knowledge base construction with machine learning methods T1 - Aufbau von Wissensbasen mit Methoden des maschinellen Lernens N2 - Modern knowledge bases contain and organize knowledge from many different topic areas. Apart from specific entity information, they also store information about their relationships amongst each other. Combining this information results in a knowledge graph that can be particularly helpful in cases where relationships are of central importance. Among other applications, modern risk assessment in the financial sector can benefit from the inherent network structure of such knowledge graphs by assessing the consequences and risks of certain events, such as corporate insolvencies or fraudulent behavior, based on the underlying network structure. As public knowledge bases often do not contain the necessary information for the analysis of such scenarios, the need arises to create and maintain dedicated domain-specific knowledge bases. This thesis investigates the process of creating domain-specific knowledge bases from structured and unstructured data sources. In particular, it addresses the topics of named entity recognition (NER), duplicate detection, and knowledge validation, which represent essential steps in the construction of knowledge bases. As such, we present a novel method for duplicate detection based on a Siamese neural network that is able to learn a dataset-specific similarity measure which is used to identify duplicates. 
Using the specialized network architecture, we design and implement a knowledge transfer between two deduplication networks, which leads to significant performance improvements and a reduction of required training data. Furthermore, we propose a named entity recognition approach that is able to identify company names by integrating external knowledge in the form of dictionaries into the training process of a conditional random field classifier. In this context, we study the effects of different dictionaries on the performance of the NER classifier. We show that both the inclusion of domain knowledge as well as the generation and use of alias names results in significant performance improvements. For the validation of knowledge represented in a knowledge base, we introduce Colt, a framework for knowledge validation based on the interactive quality assessment of logical rules. In its most expressive implementation, we combine Gaussian processes with neural networks to create Colt-GP, an interactive algorithm for learning rule models. Unlike other approaches, Colt-GP uses knowledge graph embeddings and user feedback to cope with data quality issues of knowledge bases. The learned rule model can be used to conditionally apply a rule and assess its quality. Finally, we present CurEx, a prototypical system for building domain-specific knowledge bases from structured and unstructured data sources. Its modular design is based on scalable technologies, which, in addition to processing large datasets, ensures that the modules can be easily exchanged or extended. CurEx offers multiple user interfaces, each tailored to the individual needs of a specific user group and is fully compatible with the Colt framework, which can be used as part of the system. We conduct a wide range of experiments with different datasets to determine the strengths and weaknesses of the proposed methods. To ensure the validity of our results, we compare the proposed methods with competing approaches. 
N2 - Moderne Wissensbasen enthalten und organisieren das Wissen vieler unterschiedlicher Themengebiete. So speichern sie neben bestimmten Entitätsinformationen auch Informationen über deren Beziehungen untereinander. Kombiniert man diese Informationen, ergibt sich ein Wissensgraph, der besonders in Anwendungsfällen hilfreich sein kann, in denen Entitätsbeziehungen von zentraler Bedeutung sind. Neben anderen Anwendungen kann die moderne Risikobewertung im Finanzsektor von der inhärenten Netzwerkstruktur solcher Wissensgraphen profitieren, indem Folgen und Risiken bestimmter Ereignisse, wie z.B. Unternehmensinsolvenzen oder betrügerisches Verhalten, auf Grundlage des zugrundeliegenden Netzwerks bewertet werden. Da öffentliche Wissensbasen oft nicht die notwendigen Informationen zur Analyse solcher Szenarien enthalten, entsteht die Notwendigkeit, spezielle domänenspezifische Wissensbasen zu erstellen und zu pflegen. Diese Arbeit untersucht den Erstellungsprozess von domänenspezifischen Wissensdatenbanken aus strukturierten und unstrukturierten Datenquellen. Im Speziellen befasst sie sich mit den Bereichen Named Entity Recognition (NER), Duplikaterkennung sowie Wissensvalidierung, die wesentliche Prozessschritte beim Aufbau von Wissensbasen darstellen. Wir stellen eine neuartige Methode zur Duplikaterkennung vor, die auf Siamesischen Neuronalen Netzwerken basiert und in der Lage ist, ein datensatzspezifisches Ähnlichkeitsmaß zu erlernen, welches wir zur Identifikation von Duplikaten verwenden. Unter Verwendung einer speziellen Netzwerkarchitektur entwerfen und setzen wir einen Wissenstransfer zwischen Deduplizierungsnetzwerken um, der zu erheblichen Leistungsverbesserungen und einer Reduktion der benötigten Trainingsdaten führt. 
Weiterhin schlagen wir einen Ansatz zur Erkennung benannter Entitäten (Named Entity Recognition (NER)) vor, der in der Lage ist, Firmennamen zu identifizieren, indem externes Wissen in Form von Wörterbüchern in den Trainingsprozess eines Conditional Random Field Klassifizierers integriert wird. In diesem Zusammenhang untersuchen wir die Auswirkungen verschiedener Wörterbücher auf die Leistungsfähigkeit des NER-Klassifikators und zeigen, dass sowohl die Einbeziehung von Domänenwissen als auch die Generierung und Verwendung von Alias-Namen zu einer signifikanten Leistungssteigerung führt. Zur Validierung der in einer Wissensbasis enthaltenen Fakten stellen wir mit COLT ein Framework zur Wissensvalidierung vor, das auf der interaktiven Qualitätsbewertung von logischen Regeln basiert. In seiner ausdrucksstärksten Implementierung kombinieren wir Gauß'sche Prozesse mit neuronalen Netzen, um so COLT-GP, einen interaktiven Algorithmus zum Erlernen von Regelmodellen, zu erzeugen. Im Gegensatz zu anderen Ansätzen verwendet COLT-GP Knowledge Graph Embeddings und Nutzer-Feedback, um Datenqualitätsprobleme des zugrunde liegenden Wissensgraphen zu behandeln. Das von COLT-GP erlernte Regelmodell kann sowohl zur bedingten Anwendung einer Regel als auch zur Bewertung ihrer Qualität verwendet werden. Schließlich stellen wir mit CurEx ein prototypisches System zum Aufbau domänenspezifischer Wissensbasen aus strukturierten und unstrukturierten Datenquellen vor. Sein modularer Aufbau basiert auf skalierbaren Technologien, die neben der Verarbeitung großer Datenmengen auch die einfache Austausch- und Erweiterbarkeit einzelner Module gewährleisten. CurEx bietet mehrere Benutzeroberflächen, die jeweils auf die individuellen Bedürfnisse bestimmter Benutzergruppen zugeschnitten sind. Darüber hinaus ist es vollständig kompatibel zum COLT-Framework, das als Teil des Systems verwendet werden kann. 
Wir führen eine Vielzahl von Experimenten mit unterschiedlichen Datensätzen durch, um die Stärken und Schwächen der vorgeschlagenen Methoden zu ermitteln. Zudem vergleichen wir die vorgeschlagenen Methoden mit konkurrierenden Ansätzen, um die Validität unserer Ergebnisse sicherzustellen. KW - machine learning KW - deep kernel learning KW - knowledge base construction KW - knowledge base KW - knowledge graph KW - deduplication KW - siamese neural networks KW - duplicate detection KW - entity resolution KW - transfer learning KW - knowledge transfer KW - entity linking KW - knowledge validation KW - logic rules KW - named entity recognition KW - curex KW - Curex KW - Deduplikation KW - Deep Kernel Learning KW - Duplikaterkennung KW - Entitätsverknüpfung KW - Entitätsauflösung KW - Wissensbasis KW - Konstruktion von Wissensbasen KW - Wissensgraph KW - Wissenstransfer KW - Wissensvalidierung KW - logische Regeln KW - maschinelles Lernen KW - named entity recognition KW - Siamesische Neuronale Netzwerke KW - Transferlernen Y1 - 2021 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-501459 ER - TY - THES A1 - Najafi, Pejman T1 - Leveraging data science & engineering for advanced security operations T1 - Der Einsatz von Data Science & Engineering für fortschrittliche Security Operations N2 - The Security Operations Center (SOC) represents a specialized unit responsible for managing security within enterprises. To aid in its responsibilities, the SOC relies heavily on a Security Information and Event Management (SIEM) system that functions as a centralized repository for all security-related data, providing a comprehensive view of the organization's security posture. Due to the ability to offer such insights, SIEMS are considered indispensable tools facilitating SOC functions, such as monitoring, threat detection, and incident response. Despite advancements in big data architectures and analytics, most SIEMs fall short of keeping pace. 
Architecturally, they function merely as log search engines, lacking the support for distributed large-scale analytics. Analytically, they rely on rule-based correlation, neglecting the adoption of more advanced data science and machine learning techniques. This thesis first proposes a blueprint for next-generation SIEM systems that emphasize distributed processing and multi-layered storage to enable data mining at a big data scale. Next, with the architectural support, it introduces two data mining approaches for advanced threat detection as part of SOC operations. First, a novel graph mining technique that formulates threat detection within the SIEM system as a large-scale graph mining and inference problem, built on the principles of guilt-by-association and exempt-by-reputation. The approach entails the construction of a Heterogeneous Information Network (HIN) that models shared characteristics and associations among entities extracted from SIEM-related events/logs. Thereon, a novel graph-based inference algorithm is used to infer a node's maliciousness score based on its associations with other entities in the HIN. Second, an innovative outlier detection technique that imitates a SOC analyst's reasoning process to find anomalies/outliers. The approach emphasizes explainability and simplicity, achieved by combining the output of simple context-aware univariate submodels that calculate an outlier score for each entry. Both approaches were tested in academic and real-world settings, demonstrating high performance when compared to other algorithms as well as practicality alongside a large enterprise's SIEM system. This thesis establishes the foundation for next-generation SIEM systems that can enhance today's SOCs and facilitate the transition from human-centric to data-driven security operations. N2 - In einem Security Operations Center (SOC) werden alle sicherheitsrelevanten Prozesse, Daten und Personen einer Organisation zusammengefasst. 
Das Herzstück des SOCs ist ein Security Information and Event Management (SIEM)-System, welches als zentraler Speicher aller sicherheitsrelevanten Daten fungiert und einen Überblick über die Sicherheitslage einer Organisation geben kann. SIEM-Systeme sind unverzichtbare Werkzeuge für viele SOC-Funktionen wie Monitoring, Threat Detection und Incident Response. Trotz der Fortschritte bei Big-Data-Architekturen und -Analysen können die meisten SIEMs nicht mithalten. Sie fungieren nur als Protokollsuchmaschine und unterstützen kein verteiltes Data Mining und Machine Learning. In dieser Arbeit wird zunächst eine Blaupause für die nächste Generation von SIEM-Systemen vorgestellt, welche Daten verteilt verarbeitet und in mehreren Schichten speichert, um auch Data Mining im großen Stil zu ermöglichen. Zudem werden zwei Data-Mining-Ansätze vorgeschlagen, mit denen auch anspruchsvolle Bedrohungen erkannt werden können. Der erste Ansatz ist eine neue Graph-Mining-Technik, bei der SIEM-Daten als Graph strukturiert werden und Reputationsinferenz mithilfe der Prinzipien guilt-by-association (Kontaktschuld) und exempt-by-reputation (Reputationsbefreiung) implementiert wird. Der Ansatz nutzt ein heterogenes Informationsnetzwerk (HIN), welches gemeinsame Eigenschaften und Assoziationen zwischen Entitäten aus Event Logs verknüpft. Des Weiteren ermöglicht ein neuer Inferenzalgorithmus die Bestimmung der Schädlichkeit eines Kontos anhand seiner Verbindungen zu anderen Entitäten im HIN. Der zweite Ansatz ist eine innovative Methode zur Erkennung von Ausreißern, die den Entscheidungsprozess eines SOC-Analysten imitiert. Diese Methode ist besonders einfach und interpretierbar, da sie einzelne univariate Teilmodelle kombiniert, die sich jeweils auf eine kontextualisierte Eigenschaft einer Entität beziehen. Beide Ansätze wurden sowohl akademisch als auch in der Praxis getestet und haben im Vergleich mit anderen Methoden auch in großen Unternehmen eine hohe Qualität bewiesen. 
Diese Arbeit bildet die Grundlage für die nächste Generation von SIEM-Systemen, welche den Übergang von einer personalzentrischen zu einer datenzentrischen Perspektive auf SOCs ermöglichen. KW - cybersecurity KW - endpoint security KW - threat detection KW - intrusion detection KW - apt KW - advanced threats KW - advanced persistent threat KW - zero-day KW - security analytics KW - data-driven KW - data mining KW - data science KW - anomaly detection KW - outlier detection KW - graph mining KW - graph inference KW - machine learning KW - Advanced Persistent Threats KW - fortschrittliche Angriffe KW - Anomalieerkennung KW - APT KW - Cyber-Sicherheit KW - Data-Mining KW - Data-Science KW - datengetrieben KW - Endpunktsicherheit KW - Graphableitung KW - Graph-Mining KW - Einbruchserkennung KW - Machine-Learning KW - Ausreißererkennung KW - Sicherheitsanalyse KW - Bedrohungserkennung KW - 0-day Y1 - 2023 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-612257 ER - TY - JOUR A1 - Tong, Hao A1 - Nikoloski, Zoran T1 - Machine learning approaches for crop improvement BT - leveraging phenotypic and genotypic big data JF - Journal of plant physiology : biochemistry, physiology, molecular biology and biotechnology of plants N2 - Highly efficient and accurate selection of elite genotypes can lead to dramatic shortening of the breeding cycle in major crops relevant for sustaining present demands for food, feed, and fuel. In contrast to classical approaches that emphasize the need for resource-intensive phenotyping at all stages of artificial selection, genomic selection dramatically reduces the need for phenotyping. Genomic selection relies on advances in machine learning and the availability of genotyping data to predict agronomically relevant phenotypic traits. Here we provide a systematic review of machine learning approaches applied for genomic selection of single and multiple traits in major crops in the past decade. 
We emphasize the need to gather data on intermediate phenotypes, e.g. metabolite, protein, and gene expression levels, along with developments of modeling techniques that can lead to further improvements of genomic selection. In addition, we provide a critical view of factors that affect genomic selection, with attention to transferability of models between different environments. Finally, we highlight the future aspects of integrating high-throughput molecular phenotypic data from omics technologies with biological networks for crop improvement. KW - genomic selection KW - genomic prediction KW - machine learning KW - multiple KW - traits KW - multi-omics KW - GxE interaction Y1 - 2020 U6 - https://doi.org/10.1016/j.jplph.2020.153354 SN - 0176-1617 SN - 1618-1328 VL - 257 PB - Elsevier CY - München ER - TY - JOUR A1 - Vaid, Akhil A1 - Somani, Sulaiman A1 - Russak, Adam J. A1 - De Freitas, Jessica K. A1 - Chaudhry, Fayzan F. A1 - Paranjpe, Ishan A1 - Johnson, Kipp W. A1 - Lee, Samuel J. A1 - Miotto, Riccardo A1 - Richter, Felix A1 - Zhao, Shan A1 - Beckmann, Noam D. A1 - Naik, Nidhi A1 - Kia, Arash A1 - Timsina, Prem A1 - Lala, Anuradha A1 - Paranjpe, Manish A1 - Golden, Eddye A1 - Danieletto, Matteo A1 - Singh, Manbir A1 - Meyer, Dara A1 - O'Reilly, Paul F. A1 - Huckins, Laura A1 - Kovatch, Patricia A1 - Finkelstein, Joseph A1 - Freeman, Robert M. A1 - Argulian, Edgar A1 - Kasarskis, Andrew A1 - Percha, Bethany A1 - Aberg, Judith A. A1 - Bagiella, Emilia A1 - Horowitz, Carol R. A1 - Murphy, Barbara A1 - Nestler, Eric J. A1 - Schadt, Eric E. A1 - Cho, Judy H. A1 - Cordon-Cardo, Carlos A1 - Fuster, Valentin A1 - Charney, Dennis S. A1 - Reich, David L. A1 - Böttinger, Erwin A1 - Levin, Matthew A. A1 - Narula, Jagat A1 - Fayad, Zahi A. A1 - Just, Allan C. A1 - Charney, Alexander W. A1 - Nadkarni, Girish N. A1 - Glicksberg, Benjamin S. 
T1 - Machine learning to predict mortality and critical events in a cohort of patients with COVID-19 in New York City: model development and validation JF - Journal of medical internet research : international scientific journal for medical research, information and communication on the internet ; JMIR N2 - Background: COVID-19 has infected millions of people worldwide and is responsible for several hundred thousand fatalities. The COVID-19 pandemic has necessitated thoughtful resource allocation and early identification of high-risk patients. However, effective methods to meet these needs are lacking. Objective: The aims of this study were to analyze the electronic health records (EHRs) of patients who tested positive for COVID-19 and were admitted to hospitals in the Mount Sinai Health System in New York City; to develop machine learning models for making predictions about the hospital course of the patients over clinically meaningful time horizons based on patient characteristics at admission; and to assess the performance of these models at multiple hospitals and time points. Methods: We used Extreme Gradient Boosting (XGBoost) and baseline comparator models to predict in-hospital mortality and critical events at time windows of 3, 5, 7, and 10 days from admission. Our study population included harmonized EHR data from five hospitals in New York City for 4098 COVID-19-positive patients admitted from March 15 to May 22, 2020. The models were first trained on patients from a single hospital (n=1514) before or on May 1, externally validated on patients from four other hospitals (n=2201) before or on May 1, and prospectively validated on all patients after May 1 (n=383). Finally, we established model interpretability to identify and rank variables that drive model predictions. 
Results: Upon cross-validation, the XGBoost classifier outperformed baseline models, with an area under the receiver operating characteristic curve (AUC-ROC) for mortality of 0.89 at 3 days, 0.85 at 5 and 7 days, and 0.84 at 10 days. XGBoost also performed well for critical event prediction, with an AUC-ROC of 0.80 at 3 days, 0.79 at 5 days, 0.80 at 7 days, and 0.81 at 10 days. In external validation, XGBoost achieved an AUC-ROC of 0.88 at 3 days, 0.86 at 5 days, 0.86 at 7 days, and 0.84 at 10 days for mortality prediction. Similarly, the unimputed XGBoost model achieved an AUC-ROC of 0.78 at 3 days, 0.79 at 5 days, 0.80 at 7 days, and 0.81 at 10 days. Trends in performance on prospective validation sets were similar. At 7 days, acute kidney injury on admission, elevated LDH, tachypnea, and hyperglycemia were the strongest drivers of critical event prediction, while higher age, anion gap, and C-reactive protein were the strongest drivers of mortality prediction. Conclusions: We externally and prospectively trained and validated machine learning models for mortality and critical events for patients with COVID-19 at different time horizons. These models identified at-risk patients and uncovered underlying relationships that predicted outcomes. KW - machine learning KW - COVID-19 KW - electronic health record KW - TRIPOD KW - clinical KW - informatics KW - prediction KW - mortality KW - EHR KW - cohort KW - hospital KW - performance Y1 - 2020 U6 - https://doi.org/10.2196/24018 SN - 1439-4456 SN - 1438-8871 VL - 22 IS - 11 PB - Healthcare World CY - Richmond, Va. ER - TY - JOUR A1 - Ryo, Masahiro A1 - Jeschke, Jonathan M. A1 - Rillig, Matthias C. 
A1 - Heger, Tina T1 - Machine learning with the hierarchy-of-hypotheses (HoH) approach discovers novel pattern in studies on biological invasions JF - Research synthesis methods N2 - Research synthesis on simple yet general hypotheses and ideas is challenging in scientific disciplines studying highly context-dependent systems such as medical, social, and biological sciences. This study shows that machine learning, equation-free statistical modeling of artificial intelligence, is a promising synthesis tool for discovering novel patterns and the source of controversy in a general hypothesis. We apply a decision tree algorithm, assuming that evidence from various contexts can be adequately integrated in a hierarchically nested structure. As a case study, we analyzed 163 articles that studied a prominent hypothesis in invasion biology, the enemy release hypothesis. We explored whether any of the nine attributes that classify each study can differentiate conclusions as a classification problem. Results corroborated that machine learning can be useful for research synthesis, as the algorithm could detect patterns that had already been the focus of previous narrative reviews. Compared with the previous synthesis study that assessed the same evidence collection based on experts' judgement, the algorithm newly proposed that the studies focusing on Asian regions mostly supported the hypothesis, suggesting that more detailed investigations in these regions can enhance our understanding of the hypothesis. We suggest that machine learning algorithms can be a promising synthesis tool especially where studies (a) reformulate a general hypothesis from different perspectives, (b) use different methods or variables, or (c) report insufficient information for conducting meta-analyses. 
KW - artificial intelligence KW - hierarchy-of-hypotheses approach KW - machine learning KW - meta-analysis KW - synthesis KW - systematic review Y1 - 2019 U6 - https://doi.org/10.1002/jrsm.1363 SN - 1759-2879 SN - 1759-2887 VL - 11 IS - 1 SP - 66 EP - 73 PB - Wiley CY - Hoboken ER - TY - GEN A1 - Ryo, Masahiro A1 - Jeschke, Jonathan M. A1 - Rillig, Matthias C. A1 - Heger, Tina T1 - Machine learning with the hierarchy-of-hypotheses (HoH) approach discovers novel pattern in studies on biological invasions T2 - Postprints der Universität Potsdam : Mathematisch-Naturwissenschaftliche Reihe N2 - Research synthesis on simple yet general hypotheses and ideas is challenging in scientific disciplines studying highly context-dependent systems such as medical, social, and biological sciences. This study shows that machine learning, equation-free statistical modeling of artificial intelligence, is a promising synthesis tool for discovering novel patterns and the source of controversy in a general hypothesis. We apply a decision tree algorithm, assuming that evidence from various contexts can be adequately integrated in a hierarchically nested structure. As a case study, we analyzed 163 articles that studied a prominent hypothesis in invasion biology, the enemy release hypothesis. We explored whether any of the nine attributes that classify each study can differentiate conclusions as a classification problem. Results corroborated that machine learning can be useful for research synthesis, as the algorithm could detect patterns that had already been the focus of previous narrative reviews. Compared with the previous synthesis study that assessed the same evidence collection based on experts' judgement, the algorithm newly proposed that the studies focusing on Asian regions mostly supported the hypothesis, suggesting that more detailed investigations in these regions can enhance our understanding of the hypothesis. 
We suggest that machine learning algorithms can be a promising synthesis tool, especially where studies (a) reformulate a general hypothesis from different perspectives, (b) use different methods or variables, or (c) report insufficient information for conducting meta-analyses. T3 - Zweitveröffentlichungen der Universität Potsdam : Mathematisch-Naturwissenschaftliche Reihe - 1171 KW - artificial intelligence KW - hierarchy-of-hypotheses approach KW - machine learning KW - meta-analysis KW - synthesis KW - systematic review Y1 - 2021 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-517643 SN - 1866-8372 IS - 1171 SP - 66 EP - 73 ER - TY - JOUR A1 - Shams, Boshra A1 - Wang, Ziqian A1 - Roine, Timo A1 - Aydogan, Dogu Baran A1 - Vajkoczy, Peter A1 - Lippert, Christoph A1 - Picht, Thomas A1 - Fekonja, Lucius Samo T1 - Machine learning-based prediction of motor status in glioma patients using diffusion MRI metrics along the corticospinal tract JF - Brain communications N2 - Shams et al. report that glioma patients' motor status is predicted accurately by diffusion MRI metrics along the corticospinal tract based on a support vector machine method, reaching an overall accuracy of 77%. They show that these metrics are more effective than demographic and clinical variables. Along-tract statistics enable white matter characterization using various diffusion MRI metrics. These diffusion models reveal detailed insights into white matter microstructural changes with development, pathology and function. Here, we aim at assessing the clinical utility of diffusion MRI metrics along the corticospinal tract, investigating whether motor glioma patients can be classified with respect to their motor status. We retrospectively included 116 brain tumour patients suffering from either left or right supratentorial, unilateral World Health Organization Grades II, III and IV gliomas, with a mean age of 53.51 +/- 16.32 years. 
Around 37% of patients presented with preoperative motor function deficits according to the Medical Research Council scale. In the group-level comparison, the highest non-overlapping diffusion MRI differences were detected in the superior portion of the tracts' profiles: fractional anisotropy and fibre density decrease, while apparent diffusion coefficient, axial diffusivity and radial diffusivity increase. To predict motor deficits, we developed a method based on a support vector machine using histogram-based features of diffusion MRI tract profiles (e.g. mean, standard deviation, kurtosis and skewness), following a recursive feature elimination method. Our model achieved high performance (74% sensitivity, 75% specificity, 74% overall accuracy and 77% area under the curve). We found that apparent diffusion coefficient, fractional anisotropy and radial diffusivity contributed more than the other features to the model. Incorporating patient demographics and clinical features such as age, tumour World Health Organization grade, tumour location, gender and resting motor threshold did not affect the model's performance, revealing that these features were not as effective as the microstructural measures. These results shed light on the potential patterns of tumour-related microstructural white matter changes in the prediction of functional deficits. KW - machine learning KW - support vector machine KW - tractography KW - diffusion MRI KW - corticospinal tract Y1 - 2022 U6 - https://doi.org/10.1093/braincomms/fcac141 SN - 2632-1297 VL - 4 IS - 3 PB - Oxford University Press CY - Oxford ER - TY - INPR A1 - Prasse, Paul A1 - Gruben, Gerrit A1 - Machlika, Lukas A1 - Pevny, Tomas A1 - Sofka, Michal A1 - Scheffer, Tobias T1 - Malware Detection by HTTPS Traffic Analysis N2 - In order to evade detection by network-traffic analysis, a growing proportion of malware uses the encrypted HTTPS protocol. 
We explore the problem of detecting malware on client computers based on HTTPS traffic analysis. In this setting, malware has to be detected based on the host IP address, ports, timestamp, and data-volume information of the TCP/IP packets that are sent and received by all the applications on the client. We develop a scalable protocol that allows us to collect network flows of known malicious and benign applications as training data, and derive a malware-detection method based on neural networks and sequence classification. We study the method's ability to detect both known and previously unknown malware in a large-scale empirical study. KW - machine learning KW - computer security Y1 - 2017 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-100942 ER -