Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: not all records within a cluster are sufficiently similar to be classified as duplicates. Thus, one of many subsequent clustering algorithms can further improve the result. We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, like many other clustering algorithms, focuses on the edge weights instead. For evaluation, in contrast to related work, we experiment on true real-world datasets and, in addition, examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify the best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. In particular, EMCC outperforms Markov Clustering in the precision of its results and has the additional advantage that it can be used in scenarios where edge weights are not available.
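The inconsistency described in this abstract can be illustrated with a minimal Python sketch. It is not EMCC or any other algorithm from the article; the record ids, the function names `transitive_clusters` and `is_consistent`, and the sample pairs are assumptions made purely for illustration. The sketch builds transitive clusters from pairwise duplicate decisions and then checks whether every pair inside a cluster was actually classified as a duplicate.

```python
from itertools import combinations


def transitive_clusters(records, duplicate_pairs):
    """Group records into connected components of the duplicate-pair graph."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())


def is_consistent(cluster, duplicate_pairs):
    """True if every pair of records in the cluster was classified as duplicate."""
    pairs = {frozenset(p) for p in duplicate_pairs}
    return all(frozenset((a, b)) in pairs for a, b in combinations(cluster, 2))


# Hypothetical input: r1~r2 and r2~r3 were matched, but r1~r3 was not.
records = ["r1", "r2", "r3", "r4"]
duplicate_pairs = [("r1", "r2"), ("r2", "r3")]

clusters = transitive_clusters(records, duplicate_pairs)
for cluster in clusters:
    print(sorted(cluster), is_consistent(cluster, duplicate_pairs))
```

Here the transitive closure puts r1, r2, and r3 into one cluster even though r1 and r3 were never matched directly, which is the kind of cluster a subsequent clustering step would need to revisit.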
The structure and organization of river networks have been used for decades to investigate the influence of climate and tectonics on landscapes. The majority of these studies either analyze rivers in profile view by extracting channel steepness or calculate planform metrics such as drainage density. However, these techniques rely on the assumption of homogeneity: that intrinsic and external factors are spatially or temporally invariant over the measured profile. This assumption is violated for the majority of Earth's landscapes, where variations in uplift rate, rock strength, climate, and geomorphic process are almost ubiquitous. We propose a method for classifying river profiles to identify landscape regions with similar characteristics by adapting hierarchical clustering algorithms developed for time series data. We first test our clustering on two landscape evolution scenarios and find that we can successfully cluster regions with different erodibility and detect the transient response to sudden base level fall. We then test our method in two real landscapes: first in Bitterroot National Forest, Idaho, where we demonstrate that our method can detect transient incision waves and the topographic signature of fluvial and debris flow process regimes; and second, on Santa Cruz Island, California, where our technique identifies spatial patterns in lithology not detectable through normalized channel steepness analysis. By calculating channel steepness separately for each cluster, our method allows the extraction of more reliable steepness metrics than if calculated for the landscape as a whole. These examples demonstrate the method's ability to disentangle fluvial morphology in complex lithological and tectonic settings.
In the era of social networks, the Internet of Things, and location-based services, many online services produce huge amounts of data that contain valuable objective information, such as geographic coordinates and timestamps. Combining these parameters with a textual parameter poses a challenge for the discovery of geospatiotemporal knowledge. Meeting this challenge requires efficient methods for clustering and pattern mining in spatial, temporal, and textual spaces.
In this thesis, we address the challenge of providing methods and frameworks for geospatiotemporal data analytics. As an initial step, we address the challenges of geospatial data processing: data gathering, normalization, geolocation, and storage. This initial step is the basis for tackling the next challenge, geospatial clustering. The first step of this challenge is to design a method for online clustering of georeferenced data. This algorithm can be used as a server-side clustering algorithm for online maps that visualize massive georeferenced data. As the second step, we develop an extension of this method that additionally considers the temporal aspect of the data. For that, we propose a density- and intensity-based geospatiotemporal clustering algorithm with a fixed distance and time radius.
Each version of the clustering algorithm has its own use case that we show in the thesis.
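To make the idea of a fixed distance and time radius concrete, the following Python sketch shows one way a spatiotemporal neighborhood query could look. It is not the thesis's algorithm: the function names `haversine_m` and `neighbors`, the `(latitude, longitude, timestamp)` tuple layout, the radii, and the sample points are all assumptions made for illustration, and a density-based method would build its clusters on top of such neighbor counts.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def neighbors(points, i, dist_radius_m, time_radius_s):
    """Indices of points within both the distance and the time radius of point i."""
    lat, lon, t = points[i]
    return [
        j
        for j, (lat2, lon2, t2) in enumerate(points)
        if j != i
        and abs(t2 - t) <= time_radius_s
        and haversine_m(lat, lon, lat2, lon2) <= dist_radius_m
    ]


# Made-up sample data: (latitude, longitude, timestamp in seconds).
points = [
    (52.3906, 13.0645, 0),
    (52.3910, 13.0650, 600),    # close in space and time to point 0
    (52.3908, 13.0647, 86400),  # close in space, but a day later
    (52.5200, 13.4050, 300),    # close in time, but far away
]

print(neighbors(points, 0, dist_radius_m=200.0, time_radius_s=3600))
```

With these assumed radii, only the second point counts as a neighbor of the first: the third is excluded by the time radius and the fourth by the distance radius.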
In the next chapter of the thesis, we look at spatiotemporal analytics from the perspective of the sequential rule mining challenge. We design and implement a framework that transforms data into textual geospatiotemporal data, that is, data that contain geographic coordinates, time, and textual parameters. In this way, we address the challenge of applying pattern and rule mining algorithms in geospatiotemporal space. As an applied use case study, we propose spatiotemporal crime analytics: the discovery of spatiotemporal patterns of crimes in publicly available crime data.
We dedicate the second part of the thesis to applications and use case studies. We design and implement an application that uses the proposed clustering algorithms to discover knowledge in data. Together with the application, we propose use case studies for the analysis of georeferenced data in terms of situational and public safety awareness.