publish.UP Search

Exploring one-sided communication and synchronization on a non-cache-coherent many-core architecture (2017)

Ongoing many-core design aims at core counts where cache coherence becomes a serious challenge. Therefore, this paper discusses how one-sided communication and the required process synchronization can be realized on a non-cache-coherent many-core CPU. The Intel Single-chip Cloud Computer serves as an exemplary hardware architecture. The presented approach is based on software-managed cache coherence for MPI one-sided communication. The prototype implementation delivers PUT performance up to 5 times faster than the default message-based approach and reduces the communication costs of the NAS Parallel Benchmarks 3-D fast Fourier transform by a factor of 5. Further, the paper derives conclusions for future non-cache-coherent architectures.

Exploiting gigabit ethernet capacity for cluster applications (2002)

Ciaccio, Giuseppe ; Ehlert, Marco ; Schnor, Bettina

In this paper we report on the recently completed porting of GAMMA to the Netgear GA621 Gigabit Ethernet adapter, and provide a comparison among GAMMA, MPI/GAMMA, TCP/IP, and MPICH/TCP, based on the Netgear GA621 and the older Netgear GA620 network adapters and using different device drivers, in a Gigabit Ethernet cluster of PCs running Linux 2.4. GAMMA (the Genoa Active Message MAchine) is a lightweight messaging system based on an Active Message-like paradigm, originally designed for efficient exploitation of Fast Ethernet interconnects. The comparison includes a simple latency/bandwidth evaluation of the messaging systems on both adapters, as well as performance comparisons based on the NAS NPB and an end-user fluid dynamics application called Modular Ocean Model (MOM). The analysis of results provides useful hints concerning the efficient use of Gigabit Ethernet with clusters of PCs. In particular, it emerges that GAMMA on the GA621 adapter, with a combination of low end-to-end latency (8.5 µs) and high throughput (118.4 MByte/s), provides a high-performing, cost-effective alternative to proprietary high-speed networks, e.g. Myrinet, for a wide range of cluster computing applications.
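To illustrate how the two reported figures interact, the following minimal sketch plugs them into a simple linear cost model (fixed latency plus size divided by throughput). The model itself, the function names, and the reading of 1 MByte as 10**6 bytes are assumptions made for illustration; the paper's own measurements are not reproduced here.

```python
# Figures quoted in the abstract for GAMMA on the GA621 adapter.
LATENCY_S = 8.5e-6            # end-to-end latency in seconds
THROUGHPUT_BYTES_S = 118.4e6  # assumption: 1 MByte = 10**6 bytes


def transfer_time(size_bytes, latency_s=LATENCY_S,
                  throughput_bytes_s=THROUGHPUT_BYTES_S):
    """Estimated one-way time under the assumed linear model."""
    return latency_s + size_bytes / throughput_bytes_s


def effective_throughput(size_bytes, latency_s=LATENCY_S,
                         throughput_bytes_s=THROUGHPUT_BYTES_S):
    """Bytes per second actually achieved for a message of the given size."""
    return size_bytes / transfer_time(size_bytes, latency_s, throughput_bytes_s)


def half_power_size(latency_s=LATENCY_S,
                    throughput_bytes_s=THROUGHPUT_BYTES_S):
    """Message size at which the model reaches half the peak throughput."""
    return latency_s * throughput_bytes_s


if __name__ == "__main__":
    n_half = half_power_size()
    print(f"half-power message size: {n_half:.1f} bytes")
    for size in (64, 1024, 65536):
        mb_s = effective_throughput(size) / 1e6
        print(f"{size:6d} bytes -> {mb_s:6.1f} MByte/s")
```

Under this model, messages of about 1 KB already reach half the peak throughput, which is one way to read the claim that low latency and high throughput together benefit a wide range of applications.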

POET (v0.1): speedup of many-core parallel reactive transport simulations with fast DHT lookups (2021)

De Lucia, Marco ; Kühn, Michael ; Lindemann, Alexander ; Lübke, Max ; Schnor, Bettina

Coupled reactive transport simulations are extremely demanding in terms of required computational power, which hampers their application and leads to coarsened and oversimplified domains. The chemical sub-process represents the major bottleneck: its acceleration is an urgent challenge which gathers increasing interdisciplinary interest along with pressing requirements for subsurface utilization such as spent nuclear fuel storage, geothermal energy and CO2 storage. In this context we developed POET (POtsdam rEactive Transport), a research parallel reactive transport simulator integrating algorithmic improvements which decisively speed up coupled simulations. In particular, POET is designed with a master/worker architecture, which ensures computational efficiency in both multicore and cluster compute environments. POET does not rely on contiguous grid partitions for the parallelization of chemistry but forms work packages composed of grid cells distant from each other. Such scattering prevents particularly expensive geochemical simulations, usually concentrated in the vicinity of a reactive front, from generating load imbalance between the available CPUs (central processing units), as is often the case with classical partitions. Furthermore, POET leverages an original implementation of the distributed hash table (DHT) mechanism to cache the results of geochemical simulations for further reuse in subsequent time steps during the coupled simulation. The caching is hence particularly advantageous for initially chemically homogeneous simulations and for smooth reaction fronts. We tune the rounding employed in the DHT on a 2D benchmark to validate the caching approach, and we evaluate the performance gain of POET's master/worker architecture and the DHT speedup on a 3D benchmark comprising around 650 000 grid elements. The runtime for 200 coupling iterations, corresponding to 960 simulation days, was reduced from about 24 h on 11 workers to 29 min on 719 workers. Activating the DHT reduces the runtime further to 2 h and 8 min, respectively. Only with these kinds of reduced hardware requirements and computational costs is it possible to realistically perform the long-term complex reactive transport simulations, as well as perform the uncertainty analyses required by pressing societal challenges connected with subsurface utilization.
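The two ideas named in this abstract, scattered work packages and a cache keyed on rounded inputs, can be sketched as follows. This is a single-process illustration only: the names `scatter_work_packages`, `rounded_key` and `ChemistryCache` are hypothetical, a plain dictionary stands in for the distributed hash table, and rounding to a fixed number of decimal places is an assumed scheme rather than POET's actual one.

```python
from typing import Callable, Dict, List, Sequence, Tuple

State = Tuple[float, ...]


def scatter_work_packages(n_cells: int, n_packages: int) -> List[List[int]]:
    """Assign cell i to package i % n_packages, so that neighbouring
    cells (e.g. along a reaction front) end up in different packages."""
    return [list(range(p, n_cells, n_packages)) for p in range(n_packages)]


def rounded_key(state: Sequence[float], digits: int) -> State:
    """Round every input value so that nearly identical states share a key."""
    return tuple(round(x, digits) for x in state)


class ChemistryCache:
    """Local stand-in for the DHT: maps rounded inputs to stored results."""

    def __init__(self, simulate: Callable[[Sequence[float]], State], digits: int):
        self.simulate = simulate
        self.digits = digits
        self.table: Dict[State, State] = {}
        self.hits = 0
        self.misses = 0

    def lookup_or_simulate(self, state: Sequence[float]) -> State:
        key = rounded_key(state, self.digits)
        if key in self.table:
            self.hits += 1
            return self.table[key]
        self.misses += 1
        result = self.simulate(state)  # the expensive geochemical call
        self.table[key] = result
        return result


if __name__ == "__main__":
    print(scatter_work_packages(7, 3))
    cache = ChemistryCache(lambda s: tuple(2 * x for x in s), digits=3)
    cache.lookup_or_simulate((1.0001, 2.0))
    cache.lookup_or_simulate((1.0002, 2.0))  # same rounded key, served from cache
    print(cache.hits, cache.misses)
```

The number of retained digits trades accuracy for hit rate: coarser rounding produces more cache hits but returns results computed for slightly different inputs, which is why the abstract describes tuning the rounding on a benchmark.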

PCG-Agreement Dokument (2004)

Feider, Henryk ; Schnor, Bettina

Gridmake : the missing link for compilation in the Grid (2003)

Feider, Henryk ; Schnor, Bettina ; Dramlitsch, Thomas

In order to take full advantage of Grid environments, applications need to be able to run on various heterogeneous platforms. Distributed runs across several clusters or supercomputers, for example, require matching binaries at each site. Thus, at some stage, each Grid-enabled application needs to be recompiled for every platform. Up to now, creating matching binaries on different platforms was a manual, sequential, slow, and very error-prone process. Developers had to log into each machine, transfer source code, check consistency and recompile if necessary. This cumbersome procedure is surely one reason for the (still existing) lack of production Grid computing. Gridmake, a tool to automate and speed up this procedure, is presented in this paper.

Loaded: Server Load Balancing for IPv6 (2006)

Friedrich, Sven ; Krahmer, Sebastian ; Schneidenbach, Lars ; Schnor, Bettina

With the next-generation Internet protocol IPv6 on the horizon, it is time to think about how applications can migrate to IPv6. Web traffic is currently one of the most important applications on the Internet. The increasing popularity of dynamically generated content on the World Wide Web has created the need for fast web servers. Server clustering together with server load balancing has emerged as a promising technique to build scalable web servers. The paper gives a short overview of the new features of IPv6 and different server load balancing technologies. Further, we present and evaluate Loaded, a user-space server load balancer for IPv4 and IPv6 based on Linux.
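As a rough illustration of what a load balancer's dispatch decision looks like, the sketch below picks the backend with the fewest active connections from a pool that mixes IPv4 and IPv6 addresses. The least-connections policy, the class names and the documentation-range addresses are assumptions chosen for the example; the abstract does not state which policy Loaded uses.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Backend:
    address: str      # normalized IPv4 or IPv6 address
    active: int = 0   # connections currently assigned to this backend


class LeastConnectionsBalancer:
    """Hypothetical dispatcher: always choose the least-loaded backend."""

    def __init__(self, addresses: Iterable[str]):
        # ipaddress.ip_address validates both IPv4 and IPv6 literals.
        self.backends: List[Backend] = [
            Backend(str(ipaddress.ip_address(a))) for a in addresses
        ]
        if not self.backends:
            raise ValueError("at least one backend is required")

    def pick(self) -> str:
        """Return the address of the chosen backend and count the connection."""
        chosen = min(self.backends, key=lambda b: b.active)  # ties: list order
        chosen.active += 1
        return chosen.address

    def release(self, address: str) -> None:
        """Mark one connection to the given backend as finished."""
        normalized = str(ipaddress.ip_address(address))
        for backend in self.backends:
            if backend.address == normalized and backend.active > 0:
                backend.active -= 1
                return
        raise LookupError(f"no active connection for {address}")


if __name__ == "__main__":
    lb = LeastConnectionsBalancer(["192.0.2.10", "2001:db8::10"])
    print(lb.pick(), lb.pick())
    lb.release("192.0.2.10")
    print(lb.pick())
```

A real balancer would also forward the traffic and monitor backend health; this sketch covers only the selection step.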

Loaded : Server Load Balancing for IPv6 (2004)

Friedrich, Sven ; Krahmer, Sebastian ; Schneidenbach, Lars ; Schnor, Bettina

SLIBNet : Server Load Balancing for InfiniBand Networks (2005)

Friedrich, Sven ; Schneidenbach, Lars ; Schnor, Bettina

Today, InfiniBand is an evolving high-speed interconnect technology used to build high performance computing clusters that achieve top 10 rankings in the current top 500 of the worldwide fastest supercomputers. Network interfaces (called host channel adapters) provide transport layer services over connections and datagrams in a reliable or unreliable manner. Additionally, InfiniBand supports remote direct memory access (RDMA) primitives that allow for one-sided communication. Using server load balancing together with a high performance cluster makes it possible to build a fast, scalable, and reliable service infrastructure. We have designed and implemented a scalable load balancer for InfiniBand clusters called SLIBNet. Our investigations show that the InfiniBand architecture offers features which perfectly support load balancing. We want to thank Megware Computer GmbH for providing us with an InfiniBand switch to realize a server load balancing testbed.

Grid Security for Fault Tolerant Grid Applications (2006)

Hallama, Nicole ; Luckow, André ; Schnor, Bettina

Fine-grained Security Management in a Service-oriented Grid Architecture (2007)

Hoheisel, A. ; Müller, S. ; Schnor, Bettina
