TY  - JOUR
A1  - Gonschorek, Julia
A1  - Langer, Anja
A1  - Bernhardt, Benjamin
A1  - Raebiger, Caroline
T1  - Big Data in the Field of Civil Security Research: Approaches for the Visual Preprocessing of Fire Brigade Operations
JF  - Science
N2  - This article gives insight in a running dissertation at the University in Potsdam. Point of discussion is the spatial and temporal distribution of emergencies of German fire brigades that have not sufficiently been scientifically examined. The challenge is seen in Big Data: enormous amounts of data that exist now (or can be collected in the future) and whose variables are linked to one another. These analyses and visualizations can form a basis for strategic, operational and tactical planning, as well as prevention measures. The user-centered (geo-) visualization of fire brigade data accessible to the general public is a scientific contribution to the research topic 'geovisual analytics and geographical profiling'. It may supplement antiquated methods such as the so-called pinmaps as well as the areas of engagement that are freehand constructions in GIS. Considering police work, there are already numerous scientific projects, publications, and software solutions designed to meet the specific requirements of Crime Analysis and Crime Mapping. By adapting and extending these methods and techniques, civil security research can be tailored to the needs of fire departments. In this paper, a selection of appropriate visualization methods will be presented and discussed.
KW  - Big Data
KW  - Civil Security
KW  - Explorative (Data-) Analysis
KW  - Geovisual Analytics
KW  - Visualization
Y1  - 2016
U6  - https://doi.org/10.4018/IJAEIS.2016010104
SN  - 1947-3192
SN  - 1947-3206
VL  - 7
SP  - 54
EP  - 64
PB  - IGI Global
CY  - Hershey
ER  - 
TY  - JOUR
A1  - Voland, Patrick
A1  - Asche, Hartmut
T1  - Processing and Visualizing Floating Car Data for Human-Centered Traffic and Environment Applications: A Transdisciplinary Approach
JF  - International journal of agricultural and environmental information systems : an official publication of the Information Resources Management Association
N2  - In the era of the Internet of Things and Big Data modern cars have become mobile electronic systems or computers on wheels. Car sensors record a multitude of car and traffic related data as well as environmental parameters outside the vehicle. The data recorded are spatio-temporal by nature (floating car data) and can thus be classified as geodata. Their geospatial potential is, however, not fully exploited so far. In this paper, we present an approach to collect, process and visualize floating car data for traffic-and environment-related applications. It is demonstrated that cartographic visualization, in particular, is as effective means to make the enormous stocks of machine-recorded data available to human perception, exploration and analysis.
KW  - Automotive Electronics
KW  - Big Data
KW  - Geoinformation Science
KW  - Geovisualization
KW  - Process Modelling
KW  - SpatioTemporal Sensor Data
Y1  - 2017
U6  - https://doi.org/10.4018/IJAEIS.2017040103
SN  - 1947-3192
SN  - 1947-3206
VL  - 8
SP  - 32
EP  - 49
PB  - IGI Global
CY  - Hershey
ER  - 
TY  - THES
A1  - Richter, Rico
T1  - Concepts and techniques for processing and rendering of massive 3D point clouds
T1  - Konzepte und Techniken für die Verarbeitung und das Rendering von Massiven 3D-Punktwolken
N2  - Remote sensing technology, such as airborne, mobile, or terrestrial laser scanning, and photogrammetric techniques, are fundamental approaches for efficient, automatic creation of digital representations of spatial environments. For example, they allow us to generate 3D point clouds of landscapes, cities, infrastructure networks, and sites. As essential and universal category of geodata, 3D point clouds are used and processed by a growing number of applications, services, and systems such as in the domains of urban planning, landscape architecture, environmental monitoring, disaster management, virtual geographic environments as well as for spatial analysis and simulation.
While the acquisition processes for 3D point clouds become more and more reliable and widely-used, applications and systems are faced with more and more 3D point cloud data. In addition, 3D point clouds, by their very nature, are raw data, i.e., they do not contain any structural or semantics information. Many processing strategies common to GIS such as deriving polygon-based 3D models generally do not scale for billions of points. GIS typically reduce data density and precision of 3D point clouds to cope with the sheer amount of data, but that results in a significant loss of valuable information at the same time.
This thesis proposes concepts and techniques designed to efficiently store and process massive 3D point clouds. To this end, object-class segmentation approaches are presented to attribute semantics to 3D point clouds, used, for example, to identify building, vegetation, and ground structures and, thus, to enable processing, analyzing, and visualizing 3D point clouds in a more effective and efficient way. Similarly, change detection and updating strategies for 3D point clouds are introduced that allow for reducing storage requirements and incrementally updating 3D point cloud databases. In addition, this thesis presents out-of-core, real-time rendering techniques used to interactively explore 3D point clouds and related analysis results. All techniques have been implemented based on specialized spatial data structures, out-of-core algorithms, and GPU-based processing schemas to cope with massive 3D point clouds having billions of points.  
All proposed techniques have been evaluated and demonstrated their applicability to the field of geospatial applications and systems, in particular for tasks such as classification, processing, and visualization. Case studies for 3D point clouds of entire cities with up to 80 billion points show that the presented approaches open up new ways to manage and apply large-scale, dense, and time-variant 3D point clouds as required by a rapidly growing number of applications and systems.
N2  - Fernerkundungstechnologien wie luftgestütztes, mobiles oder terrestrisches Laserscanning und photogrammetrische Techniken sind grundlegende Ansätze für die effiziente, automatische Erstellung von digitalen Repräsentationen räumlicher Umgebungen. Sie ermöglichen uns zum Beispiel die Erzeugung von 3D-Punktwolken für Landschaften, Städte, Infrastrukturnetze und Standorte. 3D-Punktwolken werden als wesentliche und universelle Kategorie von Geodaten von einer wachsenden Anzahl an Anwendungen, Diensten und Systemen genutzt und verarbeitet, zum Beispiel in den Bereichen Stadtplanung, Landschaftsarchitektur, Umweltüberwachung, Katastrophenmanagement, virtuelle geographische Umgebungen sowie zur räumlichen Analyse und Simulation.
Da die Erfassungsprozesse für 3D-Punktwolken immer zuverlässiger und verbreiteter werden, sehen sich Anwendungen und Systeme mit immer größeren 3D-Punktwolken-Daten konfrontiert. Darüber hinaus enthalten 3D-Punktwolken als Rohdaten von ihrer Art her keine strukturellen oder semantischen Informationen. Viele GIS-übliche Verarbeitungsstrategien, wie die Ableitung polygonaler 3D-Modelle, skalieren in der Regel nicht für Milliarden von Punkten. GIS reduzieren typischerweise die Datendichte und Genauigkeit von 3D-Punktwolken, um mit der immensen Datenmenge umgehen zu können, was aber zugleich zu einem signifikanten Verlust wertvoller Informationen führt.
Diese Arbeit präsentiert Konzepte und Techniken, die entwickelt wurden, um massive 3D-Punktwolken effizient zu speichern und zu verarbeiten. Hierzu werden Ansätze für die Objektklassen-Segmentierung vorgestellt, um 3D-Punktwolken mit Semantik anzureichern; so lassen sich beispielsweise Gebäude-, Vegetations- und Bodenstrukturen identifizieren, wodurch die Verarbeitung, Analyse und Visualisierung von 3D-Punktwolken effektiver und effizienter durchführbar werden. Ebenso werden Änderungserkennungs- und Aktualisierungsstrategien für 3D-Punktwolken vorgestellt, mit denen Speicheranforderungen reduziert und Datenbanken für 3D-Punktwolken inkrementell aktualisiert werden können. Des Weiteren beschreibt diese Arbeit Out-of-Core Echtzeit-Rendering-Techniken zur interaktiven Exploration von 3D-Punktwolken und zugehöriger Analyseergebnisse. Alle Techniken wurden mit Hilfe spezialisierter räumlicher Datenstrukturen, Out-of-Core-Algorithmen und GPU-basierter Verarbeitungs-schemata implementiert, um massiven 3D-Punktwolken mit Milliarden von Punkten gerecht werden zu können.
Alle vorgestellten Techniken wurden evaluiert und die Anwendbarkeit für Anwendungen und Systeme, die mit raumbezogenen Daten arbeiten, wurde insbesondere für Aufgaben wie Klassifizierung, Verarbeitung und Visualisierung demonstriert. Fallstudien für 3D-Punktwolken von ganzen Städten mit bis zu 80 Milliarden Punkten zeigen, dass die vorgestellten Ansätze neue Wege zur Verwaltung und Verwendung von großflächigen, dichten und zeitvarianten 3D-Punktwolken eröffnen, die von einer wachsenden Anzahl an Anwendungen und Systemen benötigt werden.
KW  - 3D point clouds
KW  - 3D-Punktwolken
KW  - real-time rendering
KW  - Echtzeit-Rendering
KW  - 3D visualization
KW  - 3D-Visualisierung
KW  - classification
KW  - Klassifizierung
KW  - change detection
KW  - Veränderungsanalyse
KW  - LiDAR
KW  - LiDAR
KW  - remote sensing
KW  - Fernerkundung
KW  - mobile mapping
KW  - Mobile-Mapping
KW  - Big Data
KW  - Big Data
KW  - GPU
KW  - GPU
KW  - laserscanning
KW  - Laserscanning
Y1  - 2018
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-423304
ER  - 
TY  - THES
A1  - Jaeger, David
T1  - Enabling Big Data security analytics for advanced network attack detection
T1  - Ermöglichung von Big Data Sicherheitsanalysen für erweiterte Angriffserkennung in Netzwerken
N2  - The last years have shown an increasing sophistication of attacks against enterprises. Traditional security solutions like firewalls, anti-virus systems and generally Intrusion Detection Systems (IDSs) are no longer sufficient to protect an enterprise against these advanced attacks. One popular approach to tackle this issue is to collect and analyze events generated across the IT landscape of an enterprise. This task is achieved by the utilization of Security Information and Event Management (SIEM) systems. However, the majority of the currently existing SIEM solutions is not capable of handling the massive volume of data and the diversity of event representations. Even if these solutions can collect the data at a central place, they are neither able to extract all relevant information from the events nor correlate events across various sources. Hence, only rather simple attacks are detected, whereas complex attacks, consisting of multiple stages, remain undetected. Undoubtedly, security operators of large enterprises are faced with a typical Big Data problem.

In this thesis, we propose and implement a prototypical SIEM system named Real-Time Event Analysis and Monitoring System (REAMS) that addresses the Big Data challenges of event data with common paradigms, such as data normalization, multi-threading, in-memory storage, and distributed processing. In particular, a mostly stream-based event processing workflow is proposed that collects, normalizes, persists and analyzes events in near real-time. In this regard, we have made various contributions in the SIEM context. First, we propose a high-performance normalization algorithm that is highly parallelized across threads and distributed across nodes. Second, we are persisting into an in-memory database for fast querying and correlation in the context of attack detection. Third, we propose various analysis layers, such as anomaly- and signature-based detection, that run on top of the normalized and correlated events. As a result, we demonstrate our capabilities to detect previously known as well as unknown attack patterns. Lastly, we have investigated the integration of cyber threat intelligence (CTI) into the analytical process, for instance, for correlating monitored user accounts with previously collected public identity leaks to identify possible compromised user accounts.

In summary, we show that a SIEM system can indeed monitor a large enterprise environment with a massive load of incoming events. As a result, complex attacks spanning across the whole network can be uncovered and mitigated, which is an advancement in comparison to existing SIEM systems on the market.
N2  - Die letzten Jahre haben gezeigt, dass die Komplexität von Angriffen auf Unternehmensnetzwerke stetig zunimmt. Herkömmliche Sicherheitslösungen, wie Firewalls, Antivirus-Programme oder generell Intrusion Detection Systeme (IDS), sind nicht mehr ausreichend, um Unternehmen vor solch ausgefeilten Angriffen zu schützen. Ein verbreiteter Lösungsansatz für dieses Problem ist das Sammeln und Analysieren von Ereignissen innerhalb des betroffenen Unternehmensnetzwerks mittels Security Information and Event Management (SIEM) Systemen. Die Mehrheit der derzeitigen SIEM-Lösungen auf dem Markt ist allerdings nicht in er Lage, das riesige Datenvolumen und die Vielfalt der Ereignisdarstellungen zu bewältigen. Auch wenn diese Lösungen die Daten an einem zentralen Ort sammeln können, können sie weder alle relevanten Informationen aus den Ereignissen extrahieren noch diese über verschiedene Quellen hinweg korrelieren. Aktuell werden daher nur relativ einfache Angriffe erkannt, während komplexe mehrstufige Angriffe unentdeckt bleiben. Zweifellos stehen Sicherheitsverantwortliche großer Unternehmen einem typischen Big Data-Problem gegenüber.

In dieser Arbeit wird ein prototypisches SIEM-System vorgeschlagen und implementiert, welches den Big Data-Anforderungen von Ereignisdaten mit gängigen Paradigmen, wie Datennormalisierung, Multithreading, In-Memory/Speicherung und verteilter Verarbeitung begegnet. Insbesondere wird ein größtenteils stream-basierter Workflow für die Ereignisverarbeitung vorgeschlagen, der Ereignisse in nahezu Echtzeit erfasst, normalisiert, persistiert und analysiert. In diesem Zusammenhang haben wir verschiedene Beiträge im SIEM-Kontext geleistet. Erstens schlagen wir einen Algorithmus für die Hochleistungsnormalisierung vor, der, über Threads hinweg, hochgradig parallelisiert und auf Knoten verteilt ist. Zweitens persistieren wir in eine In-Memory-Datenbank, um im Rahmen der Angriffserkennung eine schnelle Abfrage und Korrelation von Ereignissen zu ermöglichen. Drittens schlagen wir verschiedene Analyseansätze, wie beispielsweise die anomalie- und musterbasierte Erkennung, vor, die auf normalisierten und korrelierten Ereignissen basieren. Damit können wir bereits bekannte als auch bisher unbekannte Arten von Angriffen erkennen. Zuletzt haben wir die Integration von sogenannter Cyber Threat Intelligence (CTI) in den Analyseprozess untersucht. Als Beispiel erfassen wir veröffentlichte Identitätsdiebstähle von großen Dienstanbietern, um Nutzerkonten zu identifizieren, die möglicherweise in nächster Zeit durch den Missbrauch verloren gegangener Zugangsdaten kompromittiert werden könnten.

Zusammenfassend zeigen wir, dass ein SIEM-System tatsächlich ein großes Unternehmensnetzwerk mit einer massiven Menge an eingehenden Ereignissen überwachen kann. Dadurch können komplexe Angriffe, die sich über das gesamte Netzwerk erstrecken, aufgedeckt und abgewehrt werden. Dies ist ein Fortschritt gegenüber den auf dem Markt vorhandenen SIEM-Systemen.
KW  - intrusion detection
KW  - Angriffserkennung
KW  - network security
KW  - Netzwerksicherheit
KW  - Big Data
KW  - Big Data
KW  - event normalization
KW  - Ereignisnormalisierung
KW  - SIEM
KW  - SIEM
KW  - IDS
KW  - IDS
KW  - multi-step attack
KW  - mehrstufiger Angriff
Y1  - 2018
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-435713
ER  - 
TY  - THES
A1  - Shaabani, Nuhad
T1  - On discovering and incrementally updating inclusion dependencies
N2  - In today's world, many applications produce large amounts of data at an enormous rate. Analyzing such datasets for metadata is indispensable for effectively understanding, storing, querying, manipulating, and mining them. Metadata summarizes technical properties of a dataset which rang from basic statistics to complex structures describing data dependencies. One type of dependencies is inclusion dependency (IND), which expresses subset-relationships between attributes of datasets. Therefore, inclusion dependencies are important for many data management applications in terms of data integration, query optimization, schema redesign, or integrity checking. So, the discovery of inclusion dependencies in unknown or legacy datasets is at the core of any data profiling effort.
	
For exhaustively detecting all INDs in large datasets, we developed S-indd++, a new algorithm that eliminates the shortcomings of existing IND-detection algorithms and significantly outperforms them. S-indd++ is based on a novel concept for the attribute clustering for efficiently deriving INDs. Inferring INDs from our attribute clustering eliminates all redundant operations caused by other algorithms. S-indd++ is also based on a novel partitioning strategy that enables discording a large number of candidates in early phases of the discovering process. Moreover, S-indd++ does not require to fit a partition into the main memory--this is a highly appreciable property in the face of ever-growing datasets. S-indd++ reduces up to 50% of the runtime of the state-of-the-art approach.
	
None of the approach for discovering INDs is appropriate for the application on dynamic datasets; they can not update the INDs after an update of the dataset without reprocessing it entirely. To this end, we developed the first approach for incrementally updating INDs in frequently changing datasets. We achieved that by reducing the problem of incrementally updating INDs to the incrementally updating the attribute clustering from which all INDs are efficiently derivable. We realized the update of the clusters by designing new operations to be applied to the clusters after every data update. The incremental update of INDs reduces the time of the complete rediscovery by up to 99.999%.   
	
All existing algorithms for discovering n-ary INDs are based on the principle of candidate generation--they generate candidates and test their validity in the given data instance. The major disadvantage of this technique is the exponentially growing number of database accesses in terms of SQL queries required for validation. We devised Mind2, the first approach for discovering n-ary INDs without candidate generation. Mind2 is based on a new mathematical framework developed in this thesis for computing the maximum INDs from which all other n-ary INDs are derivable. The experiments showed that Mind2 is significantly more scalable and effective than hypergraph-based algorithms.
N2  - Viele Anwendungen produzieren mit schnellem Tempo große Datenmengen. Die Profilierung solcher Datenmengen nach ihren Metadaten ist unabdingbar für ihre effektive Verwaltung und ihre Analyse. Metadaten fassen technische Eigenschaften einer Datenmenge zusammen, welche von einfachen Statistiken bis komplexe und Datenabhängigkeiten beschreibende Strukturen umfassen. Eine Form solcher Abhängigkeiten sind Inklusionsabhängigkeiten (INDs), die Teilmengenbeziehungen zwischen Attributen der Datenmengen ausdrücken. Dies macht INDs wichtig für viele Anwendungen wie Datenintegration, Anfragenoptimierung,  Schemaentwurf und Integritätsprüfung. Somit ist die Entdeckung von INDs in unbekannten Datenmengen eine zentrale Aufgabe der Datenprofilierung.  
	
Ich entwickelte einen neuen Algorithmus namens S-indd++ für die IND-Entdeckung in großen Datenmengen. S-indd++ beseitigt die Defizite existierender Algorithmen für die IND-Entdeckung und somit ist er performanter. S-indd++ berechnet INDs sehr effizient basierend auf einem neuen Clustering der Attribute.  S-indd++ wendet auch eine neue Partitionierungsmethode an, die das Verwerfen einer großen Anzahl von Kandidaten in früheren Phasen des Entdeckungsprozesses ermöglicht. Außerdem setzt S-indd++ nicht voraus, dass eine Datenpartition komplett in den Hauptspeicher passen muss. S-indd++ reduziert die Laufzeit der IND-Entdeckung um bis 50 %.

Keiner der IND-Entdeckungsalgorithmen ist geeignet für die Anwendung auf dynamischen Daten. Zu diesem Zweck entwickelte ich das erste Verfahren für das inkrementelle Update von INDs in häufig geänderten Daten. Ich erreichte dies bei der Reduzierung des Problems des inkrementellen Updates von INDs auf dem inkrementellen Update des Attribute-Clustering, von dem INDs effizient ableitbar sind. Ich realisierte das Update der Cluster beim Entwurf von neuen Operationen, die auf den Clustern nach jedem Update der Daten angewendet werden. Das inkrementelle Update von INDs reduziert die Zeit der statischen IND-Entdeckung um bis 99,999 %.
	
Alle vorhandenen Algorithmen für die n-ary-IND-Entdeckung basieren auf dem Prinzip der Kandidatengenerierung. Der Hauptnachteil dieser Methode ist die exponentiell wachsende Anzahl der SQL-Anfragen, die für die Validierung der Kandidaten nötig sind. Zu diesem Zweck entwickelte ich Mind2, den ersten Algorithmus für n-ary-IND-Entdeckung ohne Kandidatengenerierung. Mind2 basiert auf  einem neuen mathematischen Framework für die Berechnung der maximalen INDs, von denen alle anderen n-ary-INDs ableitbar sind. Die Experimente zeigten, dass Mind2 wesentlich skalierbarer und leistungsfähiger ist als die auf Hypergraphen basierenden Algorithmen.
T2  - Beitrag zur Entdeckung und inkrementellen Aktualisierung von Inklusionsabhängigkeiten
KW  - Inclusion Dependency
KW  - Data Profiling
KW  - Data Mining
KW  - Algorithms
KW  - Inclusion Dependency Discovery
KW  - Incrementally Inclusion Dependencies Discovery
KW  - Metadata Discovery
KW  - S-indd++
KW  - Mind2
KW  - Change Data Capture
KW  - Incremental Discovery
KW  - Big Data
KW  - Data Integration
KW  - Foreign Keys
KW  - Dynamic Data
KW  - Foreign Keys Discovery
KW  - Data Profiling
KW  - Data Mining
KW  - Algorithmen
KW  - Inklusionsabhängigkeiten
KW  - Inklusionsabhängigkeiten Entdeckung
KW  - Datenintegration
KW  - Metadaten Entdeckung
Y1  - 2020
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-471862
ER  - 
TY  - JOUR
A1  - Caruccio, Loredana
A1  - Deufemia, Vincenzo
A1  - Naumann, Felix
A1  - Polese, Giuseppe
T1  - Discovering relaxed functional dependencies based on multi-attribute dominance
JF  - IEEE transactions on knowledge and data engineering
N2  - With the advent of big data and data lakes, data are often integrated from multiple sources. Such integrated data are often of poor quality, due to inconsistencies, errors, and so forth. One way to check the quality of data is to infer functional dependencies (fds). However, in many modern applications it might be necessary to extract properties and relationships that are not captured through fds, due to the necessity to admit exceptions, or to consider similarity rather than equality of data values. Relaxed fds (rfds) have been introduced to meet these needs, but their discovery from data adds further complexity to an already complex problem, also due to the necessity of specifying similarity and validity thresholds. We propose Domino, a new discovery algorithm for rfds that exploits the concept of dominance in order to derive similarity thresholds of attribute values while inferring rfds. An experimental evaluation on real datasets demonstrates the discovery performance and the effectiveness of the proposed algorithm.
KW  - Complexity theory
KW  - Approximation algorithms
KW  - Big Data
KW  - Distributed
KW  - databases
KW  - Semantics
KW  - Lakes
KW  - Functional dependencies
KW  - data profiling
KW  - data cleansing
Y1  - 2020
U6  - https://doi.org/10.1109/TKDE.2020.2967722
SN  - 1041-4347
SN  - 1558-2191
VL  - 33
IS  - 9
SP  - 3212
EP  - 3228
PB  - Institute of Electrical and Electronics Engineers
CY  - New York, NY
ER  -