Refine
Year of publication
- 2017 (2) (remove)
Document Type
- Article (1)
- Doctoral Thesis (1)
Language
- English (2)
Is part of the Bibliography
- yes (2)
Keywords
- one-sided communication (2)
- MPI (1)
- Message Passing Interface (1)
- Middleware (1)
- Prozesssynchronisierung (1)
- Software-basierte Cache-Kohärenz (1)
- einseitige Kommunikation (1)
- middleware (1)
- parallel programming (1)
- parallele Programmierung (1)
Institute
Exploring one-sided communication and synchronization on a non-cache-coherent many-core architecture
(2017)
The ongoing many-core design aims at core counts where cache coherence becomes a serious challenge. Therefore, this paper discusses how one-sided communication and the required process synchronization can be realized on a non-cache-coherent many-core CPU. The Intel Single-chip Cloud Computer serves as an exemplary hardware architecture. The presented approach is based on software-managed cache coherence for MPI one-sided communication. The prototype implementation delivers a PUT performance of up to 5 times faster than the default message-based approach and reveals a reduction of the communication costs for the NAS Parallel Benchmarks 3-D fast Fourier Transform by a factor of 5. Further, the paper derives conclusions for future non-cache-coherent architectures.
Contemporary multi-core processors are parallel systems that also provide shared memory for programs running on them. Both the increasing number of cores in so-called many-core systems and the still growing computational power of the cores demand for memory systems that are able to deliver high bandwidths. Caches are essential components to satisfy this requirement. Nevertheless, hardware-based cache coherence in many-core chips faces practical limits to provide both coherence and high memory bandwidths. In addition, a shift away from global coherence can be observed. As a result, alternative architectures and suitable programming models need to be investigated.
This thesis focuses on fast communication for non-cache-coherent many-core architectures. Experiments are conducted on the Single-Chip Cloud Computer (SCC), a non-cache-coherent many-core processor with 48 mesh-connected cores. Although originally designed for message passing, the results of this thesis show that shared memory can be efficiently used for one-sided communication on this kind of architecture. One-sided communication enables data exchanges between processes where the receiver is not required to know the details of the performed communication. In the notion of the Message Passing Interface (MPI) standard, this type of communication allows to access memory of remote processes. In order to support this communication scheme on non-cache-coherent architectures, both an efficient process synchronization and a communication scheme with software-managed cache coherence are designed and investigated.
The process synchronization realizes the concept of the general active target synchronization scheme from the MPI standard. An existing classification of implementation approaches is extended and used to identify an appropriate class for the non-cache-coherent shared memory platform. Based on this classification, existing implementations are surveyed in order to find beneficial concepts, which are then used to design a lightweight synchronization protocol for the SCC that uses shared memory and uncached memory accesses. The proposed scheme is not prone to process skew and also enables direct communication as soon as both communication partners are ready. Experimental results show very good scaling properties and up to five times lower synchronization latency compared to a tuned message-based MPI implementation for the SCC.
For the communication, SCOSCo, a shared memory approach with software-managed cache coherence, is presented. According requirements for the coherence that fulfill MPI's separate memory model are formulated, and a lightweight implementation exploiting SCC hard- and software features is developed. Despite a discovered malfunction in the SCC's memory subsystem, the experimental evaluation of the design reveals up to five times better bandwidths and nearly four times lower latencies in micro-benchmarks compared to the SCC-tuned but message-based MPI library. For application benchmarks, like a parallel 3D fast Fourier transform, the runtime share of communication can be reduced by a factor of up to five. In addition, this thesis postulates beneficial hardware concepts that would support software-managed coherence for one-sided communication on future non-cache-coherent architectures where coherence might be only available in local subdomains but not on a global processor level.