Date of Completion

12-14-2018

Embargo Period

12-14-2018

Keywords

Multicore, Synchronization, Moving Compute to Data, Explicit Messaging, Cache Coherence

Major Advisor

Omer Khan

Associate Advisor

John Chandy

Associate Advisor

Marten van Dijk

Field of Study

Electrical Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

Single-chip multicore processors are now prevalent, and processors with hundreds of cores are being proposed and explored by both academia and industry. Shared memory cache coherence is the state-of-the-art technology that enables synchronization and communication between cores on these processors. However, synchronizing cores on shared data through hardware cache coherence suffers from instruction retries and cache-line ping-pong overheads, which prevents performance from scaling as core counts on a chip increase.

This thesis proposes a novel moving computation to data (MC) model to overcome this synchronization bottleneck in shared memory multicore processors at the 1000-core scale. The proposed MC model pins shared data to dedicated cores, called service cores. Worker cores explicitly request that critical code sections be executed at the service cores. This prevents cache lines from bouncing between cores and enables data locality optimization. The MC model relies on auxiliary in-hardware explicit messaging to deliver critical section requests, enabling efficient fine-grained blocking and non-blocking communication between cores. To demonstrate the effectiveness of the proposed model, workloads with a wide range of synchronization requirements are implemented from the graph analytics, machine learning, and database domains. The model is then prototyped and exhaustively evaluated on a 72-core machine, the Tilera Tile-Gx72 multicore platform, which incorporates in-hardware core-to-core messaging as an auxiliary capability alongside the shared memory cache coherence paradigm. Since the Tile-Gx72 machine has only 72 cores, it is used for evaluation at 8 to 64 cores. For further analysis at higher core counts, a simulated RISC-V multicore environment is built, and the performance and dynamic energy scaling advantages of the MC model are evaluated against various baseline synchronization models at up to 1024 cores.
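
To make the execution model concrete, below is a minimal software sketch of the moving-compute-to-data idea, not the thesis's hardware mechanism: shared data is owned exclusively by one "service" thread, and "worker" threads send explicit request messages instead of locking the data themselves. The ServiceCore type, its send/run methods, and the mutex-guarded queue (standing in for in-hardware core-to-core messaging) are all hypothetical names introduced here for illustration.

#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical software analogue of the MC model: shared state lives only on
// the service thread; workers send explicit requests rather than acquiring
// locks on the data. A real MC design would use hardware messaging instead
// of this mutex-guarded queue.
struct ServiceCore {
    long shared_counter = 0;                     // data "pinned" to this core
    std::queue<std::function<void()>> requests;  // incoming critical sections
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Worker side: fire off a critical section request (non-blocking style;
    // the worker continues without waiting for the result).
    void send(std::function<void()> critical_section) {
        {
            std::lock_guard<std::mutex> lk(m);
            requests.push(std::move(critical_section));
        }
        cv.notify_one();
    }

    // Service side: execute requests one at a time. The shared data never
    // ping-pongs between caches because only this thread ever touches it.
    void run() {
        std::unique_lock<std::mutex> lk(m);
        while (!done || !requests.empty()) {
            cv.wait(lk, [&] { return done || !requests.empty(); });
            while (!requests.empty()) {
                auto req = std::move(requests.front());
                requests.pop();
                lk.unlock();
                req();  // critical section runs at the service core
                lk.lock();
            }
        }
    }

    void stop() {
        {
            std::lock_guard<std::mutex> lk(m);
            done = true;
        }
        cv.notify_one();
    }
};

int main() {
    ServiceCore svc;
    std::thread service([&] { svc.run(); });

    std::vector<std::thread> workers;
    for (int w = 0; w < 4; ++w)
        workers.emplace_back([&] {
            for (int i = 0; i < 1000; ++i)
                svc.send([&] { ++svc.shared_counter; });  // no lock on the data
        });

    for (auto& t : workers) t.join();
    svc.stop();
    service.join();
    std::cout << svc.shared_counter << "\n";  // prints 4000
}

In this sketch the service thread serializes critical sections by construction, so workers never retry instructions or contend on the counter's cache line; the analogous hardware design replaces the queue with the explicit messaging fabric described in the abstract.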
