TY  - JOUR
A1  - Padash, Amin
A1  - Aghion, Erez
A1  - Schulz, Alexander
A1  - Barkai, Eli
A1  - Chechkin, Aleksei V.
A1  - Metzler, Ralf
A1  - Kantz, Holger
T1  - Local equilibrium properties of ultraslow diffusion in the Sinai model
JF  - New journal of physics
N2  - We perform numerical studies of a thermally driven, overdamped particle in a random quenched force field, known as the Sinai model. We compare the unbounded motion on an infinite 1-dimensional domain to the motion in bounded domains with reflecting boundaries and show that the unbounded motion is at every time close to the equilibrium state of a finite system of growing size. This is due to time scale separation: inside wells of the random potential, there is relatively fast equilibration, while the motion across major potential barriers is ultraslow. Quantities studied by us are the time dependent mean squared displacement, the time dependent mean energy of an ensemble of particles, and the time dependent entropy of the probability distribution. Using a very fast numerical algorithm, we can explore times up to 10^17 steps and thereby also study finite-time crossover phenomena.
KW  - Sinai diffusion
KW  - clustering
KW  - local equilibrium
Y1  - 2022
U6  - https://doi.org/10.1088/1367-2630/ac7df8
SN  - 1367-2630
VL  - 24
IS  - 7
PB  - IOP Publishing
CY  - Bristol
ER  - 
TY  - JOUR
A1  - Draisbach, Uwe
A1  - Christen, Peter
A1  - Naumann, Felix
T1  - Transforming pairwise duplicates to entity clusters for high-quality duplicate detection
JF  - ACM Journal of Data and Information Quality
N2  - Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records within a cluster are sufficiently similar to be classified as duplicate. Thus, one of many subsequent clustering algorithms can further improve the result. We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, and many other clustering algorithms, focus on the edge weights, instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
KW  - Record linkage
KW  - data matching
KW  - entity resolution
KW  - deduplication
KW  - clustering
Y1  - 2019
U6  - https://doi.org/10.1145/3352591
SN  - 1936-1955
SN  - 1936-1963
VL  - 12
IS  - 1
SP  - 1
EP  - 30
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Clubb, Fiona J.
A1  - Bookhagen, Bodo
A1  - Rheinwalt, Aljoscha
T1  - Clustering river profiles to classify geomorphic domains
JF  - Journal of geophysical research : Earth surface
N2  - The structure and organization of river networks have been used for decades to investigate the influence of climate and tectonics on landscapes. The majority of these studies either analyze rivers in profile view by extracting channel steepness or calculate planform metrics such as drainage density. However, these techniques rely on the assumption of homogeneity: that intrinsic and external factors are spatially or temporally invariant over the measured profile. This assumption is violated for the majority of Earth's landscapes, where variations in uplift rate, rock strength, climate, and geomorphic process are almost ubiquitous. We propose a method for classifying river profiles to identify landscape regions with similar characteristics by adapting hierarchical clustering algorithms developed for time series data. We first test our clustering on two landscape evolution scenarios and find that we can successfully cluster regions with different erodibility and detect the transient response to sudden base level fall. We then test our method in two real landscapes: first in Bitterroot National Forest, Idaho, where we demonstrate that our method can detect transient incision waves and the topographic signature of fluvial and debris flow process regimes; and second, on Santa Cruz Island, California, where our technique identifies spatial patterns in lithology not detectable through normalized channel steepness analysis. By calculating channel steepness separately for each cluster, our method allows the extraction of more reliable steepness metrics than if calculated for the landscape as a whole. These examples demonstrate the method's ability to disentangle fluvial morphology in complex lithological and tectonic settings.
KW  - clustering
KW  - river networks
KW  - topographic analysis
KW  - landscape evolution modeling
Y1  - 2019
U6  - https://doi.org/10.1029/2019JF005025
SN  - 2169-9003
SN  - 2169-9011
VL  - 124
IS  - 6
SP  - 1417
EP  - 1439
PB  - American Geophysical Union
CY  - Hoboken
ER  - 
TY  - JOUR
A1  - Cesca, Simone
A1  - Sen, Ali Tolga
A1  - Dahm, Torsten
T1  - Seismicity monitoring by cluster analysis of moment tensors
JF  - Geophysical journal international
N2  - We suggest a new clustering approach to classify focal mechanisms from large moment tensor catalogues, with the purpose of automatically identifying families of earthquakes with similar source geometry, recognizing the orientation of the most active faults, and detecting temporal variations of the rupture processes. The approach differs from waveform similarity methods in that clusters are detected even if they occur at large spatial distances. This approach is particularly helpful to analyse large moment tensor catalogues, as in microseismicity applications, where a manual analysis and classification is not feasible. A flexible algorithm is here proposed: it can handle different metrics, norms, and focal mechanism representations. In particular, the method can handle full moment tensor or constrained source model catalogues, for which different metrics are suggested. The method can account for variable uncertainties of different moment tensor components. We verify the method with synthetic catalogues. An application to real data from mining induced seismicity illustrates possible applications of the method and demonstrates the cluster detection and event classification performance with different moment tensor catalogues. Results prove that main earthquake source types occur on spatially separated faults, and that temporal changes in the number and characterization of focal mechanism clusters are detected. We suggest that moment tensor clustering can help assess time dependent hazard in mines.
KW  - Persistence
KW  - memory
KW  - correlations
KW  - clustering
KW  - Earthquake source observations
Y1  - 2014
U6  - https://doi.org/10.1093/gji/ggt492
SN  - 0956-540X
SN  - 1365-246X
VL  - 196
IS  - 3
SP  - 1813
EP  - 1826
PB  - Oxford Univ. Press
CY  - Oxford
ER  - 
TY  - JOUR
A1  - Feher, Kristen
A1  - Whelan, James
A1  - Müller, Samuel
T1  - Exploring multicollinearity using a random matrix theory approach
JF  - Statistical applications in genetics and molecular biology
N2  - Clustering of gene expression data is often done with the latent aim of dimension reduction, by finding groups of genes that have a common response to potentially unknown stimuli. However, what is poorly understood to date is the behaviour of a low dimensional signal embedded in high dimensions. This paper introduces a multicollinear model which is based on random matrix theory results, and shows potential for the characterisation of a gene cluster's correlation matrix. This model projects a one dimensional signal into many dimensions and is based on the spiked covariance model, but rather characterises the behaviour of the corresponding correlation matrix. The eigenspectrum of the correlation matrix is empirically examined by simulation, under the addition of noise to the original signal. The simulation results are then used to propose a dimension estimation procedure of clusters from data. Moreover, the simulation results warn against considering pairwise correlations in isolation, as the model provides a mechanism whereby a pair of genes with 'low' correlation may simply be due to the interaction of high dimension and noise. Instead, collective information about all the variables is given by the eigenspectrum.
KW  - random matrix theory
KW  - clustering
KW  - dimension reduction
KW  - inverse correlation estimation
Y1  - 2012
U6  - https://doi.org/10.1515/1544-6115.1668
SN  - 1544-6115
VL  - 11
IS  - 3
PB  - De Gruyter
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Feher, Kristen
A1  - Whelan, James
A1  - Müller, Samuel
T1  - Assessing modularity using a random matrix theory approach
JF  - Statistical applications in genetics and molecular biology
N2  - Random matrix theory (RMT) is well suited to describing the emergent properties of systems with complex interactions amongst their constituents through their eigenvalue spectrums. Some RMT results are applied to the problem of clustering high dimensional biological data with complex dependence structure amongst the variables. It will be shown that a gene relevance or correlation network can be constructed by choosing a correlation threshold in a principled way, such that it corresponds to a block diagonal structure in the correlation matrix, if such a structure exists. The structure is then found using community detection algorithms, but with parameter choice guided by RMT predictions. The resulting clustering is compared to a variety of hierarchical clustering outputs and is found to be the most generalised result, in that it captures all the features found by the other considered methods.
KW  - random matrix theory
KW  - clustering
KW  - modularity
Y1  - 2011
U6  - https://doi.org/10.2202/1544-6115.1667
SN  - 2194-6302
SN  - 1544-6115
VL  - 10
IS  - 1
PB  - De Gruyter
CY  - Berlin
ER  -