TY - JOUR A1 - Padash, Amin A1 - Aghion, Erez A1 - Schulz, Alexander A1 - Barkai, Eli A1 - Chechkin, Aleksei V. A1 - Metzler, Ralf A1 - Kantz, Holger T1 - Local equilibrium properties of ultraslow diffusion in the Sinai model JF - New journal of physics N2 - We perform numerical studies of a thermally driven, overdamped particle in a random quenched force field, known as the Sinai model. We compare the unbounded motion on an infinite 1-dimensional domain to the motion in bounded domains with reflecting boundaries and show that the unbounded motion is at every time close to the equilibrium state of a finite system of growing size. This is due to time scale separation: inside wells of the random potential, there is relatively fast equilibration, while the motion across major potential barriers is ultraslow. The quantities we study are the time dependent mean squared displacement, the time dependent mean energy of an ensemble of particles, and the time dependent entropy of the probability distribution. Using a very fast numerical algorithm, we can explore times up to 10^17 steps and thereby also study finite-time crossover phenomena. KW - Sinai diffusion KW - clustering KW - local equilibrium Y1 - 2022 U6 - https://doi.org/10.1088/1367-2630/ac7df8 SN - 1367-2630 VL - 24 IS - 7 PB - IOP Publishing CY - Bristol ER - TY - JOUR A1 - Draisbach, Uwe A1 - Christen, Peter A1 - Naumann, Felix T1 - Transforming pairwise duplicates to entity clusters for high-quality duplicate detection JF - ACM Journal of Data and Information Quality N2 - Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records within a cluster are sufficiently similar to be classified as duplicate. Thus, one of many subsequent clustering algorithms can further improve the result.
We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, and many other clustering algorithms, focus on the edge weights instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available. KW - Record linkage KW - data matching KW - entity resolution KW - deduplication KW - clustering Y1 - 2019 U6 - https://doi.org/10.1145/3352591 SN - 1936-1955 SN - 1936-1963 VL - 12 IS - 1 SP - 1 EP - 30 PB - Association for Computing Machinery CY - New York ER - TY - JOUR A1 - Clubb, Fiona J. A1 - Bookhagen, Bodo A1 - Rheinwalt, Aljoscha T1 - Clustering river profiles to classify geomorphic domains JF - Journal of geophysical research : Earth surface N2 - The structure and organization of river networks have been used for decades to investigate the influence of climate and tectonics on landscapes. The majority of these studies either analyze rivers in profile view by extracting channel steepness or calculate planform metrics such as drainage density. However, these techniques rely on the assumption of homogeneity: that intrinsic and external factors are spatially or temporally invariant over the measured profile.
This assumption is violated for the majority of Earth's landscapes, where variations in uplift rate, rock strength, climate, and geomorphic process are almost ubiquitous. We propose a method for classifying river profiles to identify landscape regions with similar characteristics by adapting hierarchical clustering algorithms developed for time series data. We first test our clustering on two landscape evolution scenarios and find that we can successfully cluster regions with different erodibility and detect the transient response to sudden base level fall. We then test our method in two real landscapes: first in Bitterroot National Forest, Idaho, where we demonstrate that our method can detect transient incision waves and the topographic signature of fluvial and debris flow process regimes; and second, on Santa Cruz Island, California, where our technique identifies spatial patterns in lithology not detectable through normalized channel steepness analysis. By calculating channel steepness separately for each cluster, our method allows the extraction of more reliable steepness metrics than if calculated for the landscape as a whole. These examples demonstrate the method's ability to disentangle fluvial morphology in complex lithological and tectonic settings. 
KW - clustering KW - river networks KW - topographic analysis KW - landscape evolution modeling Y1 - 2019 U6 - https://doi.org/10.1029/2019JF005025 SN - 2169-9003 SN - 2169-9011 VL - 124 IS - 6 SP - 1417 EP - 1439 PB - American Geophysical Union CY - Hoboken ER - TY - JOUR A1 - Cesca, Simone A1 - Sen, Ali Tolga A1 - Dahm, Torsten T1 - Seismicity monitoring by cluster analysis of moment tensors JF - Geophysical journal international N2 - We suggest a new clustering approach to classify focal mechanisms from large moment tensor catalogues, with the purpose of automatically identifying families of earthquakes with similar source geometry, recognizing the orientation of the most active faults, and detecting temporal variations of the rupture processes. The approach differs from waveform similarity methods in that clusters are detected even if they occur at large spatial distances. This approach is particularly helpful for analysing large moment tensor catalogues, as in microseismicity applications, where manual analysis and classification are not feasible. A flexible algorithm is proposed here: it can handle different metrics, norms, and focal mechanism representations. In particular, the method can handle full moment tensor or constrained source model catalogues, for which different metrics are suggested. The method can account for variable uncertainties of different moment tensor components. We verify the method with synthetic catalogues. An application to real data from mining-induced seismicity illustrates possible applications of the method and demonstrates the cluster detection and event classification performance with different moment tensor catalogues. Results prove that the main earthquake source types occur on spatially separated faults, and that temporal changes in the number and characterization of focal mechanism clusters are detected. We suggest that moment tensor clustering can help assess time dependent hazard in mines.
KW - Persistence KW - memory KW - correlations KW - clustering KW - Earthquake source observations Y1 - 2014 U6 - https://doi.org/10.1093/gji/ggt492 SN - 0956-540X SN - 1365-246X VL - 196 IS - 3 SP - 1813 EP - 1826 PB - Oxford Univ. Press CY - Oxford ER - TY - JOUR A1 - Feher, Kristen A1 - Whelan, James A1 - Müller, Samuel T1 - Exploring multicollinearity using a random matrix theory approach JF - Statistical applications in genetics and molecular biology N2 - Clustering of gene expression data is often done with the latent aim of dimension reduction, by finding groups of genes that have a common response to potentially unknown stimuli. However, what is poorly understood to date is the behaviour of a low dimensional signal embedded in high dimensions. This paper introduces a multicollinear model which is based on random matrix theory results, and shows potential for the characterisation of a gene cluster's correlation matrix. This model projects a one dimensional signal into many dimensions and is based on the spiked covariance model, but rather characterises the behaviour of the corresponding correlation matrix. The eigenspectrum of the correlation matrix is empirically examined by simulation, under the addition of noise to the original signal. The simulation results are then used to propose a dimension estimation procedure of clusters from data. Moreover, the simulation results warn against considering pairwise correlations in isolation, as the model provides a mechanism whereby a pair of genes with 'low' correlation may simply be due to the interaction of high dimension and noise. Instead, collective information about all the variables is given by the eigenspectrum. 
KW - random matrix theory KW - clustering KW - dimension reduction KW - inverse correlation estimation Y1 - 2012 U6 - https://doi.org/10.1515/1544-6115.1668 SN - 1544-6115 VL - 11 IS - 3 PB - De Gruyter CY - Berlin ER - TY - JOUR A1 - Feher, Kristen A1 - Whelan, James A1 - Müller, Samuel T1 - Assessing modularity using a random matrix theory approach JF - Statistical applications in genetics and molecular biology N2 - Random matrix theory (RMT) is well suited to describing the emergent properties of systems with complex interactions amongst their constituents through their eigenvalue spectra. Some RMT results are applied to the problem of clustering high dimensional biological data with complex dependence structure amongst the variables. It will be shown that a gene relevance or correlation network can be constructed by choosing a correlation threshold in a principled way, such that it corresponds to a block diagonal structure in the correlation matrix, if such a structure exists. The structure is then found using community detection algorithms, but with parameter choice guided by RMT predictions. The resulting clustering is compared to a variety of hierarchical clustering outputs and is found to be the most generalised result, in that it captures all the features found by the other considered methods. KW - random matrix theory KW - clustering KW - modularity Y1 - 2011 U6 - https://doi.org/10.2202/1544-6115.1667 SN - 2194-6302 SN - 1544-6115 VL - 10 IS - 1 PB - De Gruyter CY - Berlin ER -