Optimising MPI Rank Placement at Runtime
The multi-core hardware era has delivered nodes with ever larger numbers of cores in each generation of HPC systems. A brief review of Cray architectures from HECToR to ARCHER shows a move from two to potentially forty-eight CPUs per node.
In the majority of applications, this results in more than one MPI rank running on the same node (even with hybrid MPI/OpenMP applications), and the ordering of the ranks on the nodes is left to the system default. However, the ordering of ranks on the Cray XC30 nodes can be modified easily, without any changes to the application. This allows users to exploit the memory shared between ranks on the same node to improve overall application performance.
For codes with a predictable communication pattern, the layout of ranks on nodes can be reordered to maximise the amount of intra-node communication via the shared memory system, which has higher bandwidth and lower latency than inter-node communication over the network. This can significantly improve communication (and thus application) performance, and it can be done entirely through environment variables and a single additional input file. In many cases the Cray Performance Analysis Toolkit (CrayPAT) can detect communication patterns and provide an optimised layout for future runs. This tech-forum aims to cover:
- How awareness of the properties of multi-core nodes can be exploited to improve application performance, even for non-threaded or hybrid applications.
- How to use the Cray-supplied rank-reordering techniques via environment variables.
- How to generate customised rank reorder mappings for Cartesian communication patterns.
- How to use CrayPAT to automatically detect and optimise your communication patterns.
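As an illustration of the custom-mapping idea for a Cartesian pattern, the following is a minimal Python sketch, not the Cray tooling itself. It assumes a non-periodic 2D grid with row-major rank numbering (the numbering MPI Cartesian topologies use), 24 ranks per node, and nearest-neighbour communication only. The function names are hypothetical. To our understanding, Cray MPICH reads a custom ordering from a file named MPICH_RANK_ORDER in the working directory when MPICH_RANK_REORDER_METHOD=3 is set, with ranks listed as comma-separated values; check the local MPI documentation before relying on that format.

```python
def tiled_rank_order(nx, ny, tx, ty):
    """List the ranks of an nx-by-ny row-major grid tile by tile.

    Each tx-by-ty tile is meant to fill exactly one node, so that most
    nearest neighbours of a rank end up on the same node.
    """
    if nx % tx or ny % ty:
        raise ValueError("tile dimensions must divide the grid dimensions")
    order = []
    for bx in range(0, nx, tx):            # tile origin in x
        for by in range(0, ny, ty):        # tile origin in y
            for x in range(bx, bx + tx):
                for y in range(by, by + ty):
                    order.append(x * ny + y)   # row-major rank of (x, y)
    return order


def off_node_pairs(order, nx, ny, ranks_per_node):
    """Count nearest-neighbour pairs whose two ranks sit on different nodes.

    Consecutive groups of ranks_per_node entries in `order` share a node.
    """
    node_of = {rank: i // ranks_per_node for i, rank in enumerate(order)}
    count = 0
    for x in range(nx):
        for y in range(ny):
            r = x * ny + y
            if x + 1 < nx and node_of[r] != node_of[(x + 1) * ny + y]:
                count += 1
            if y + 1 < ny and node_of[r] != node_of[x * ny + y + 1]:
                count += 1
    return count


def format_rank_order(order, ranks_per_node):
    """Render the ordering as text: one comma-separated line per node."""
    lines = [
        ",".join(str(r) for r in order[i:i + ranks_per_node])
        for i in range(0, len(order), ranks_per_node)
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    nx, ny, per_node = 24, 24, 24              # 576 ranks on 24 nodes (assumed sizes)
    default = list(range(nx * ny))             # ranks packed onto nodes in order
    tiled = tiled_rank_order(nx, ny, 6, 4)     # one 6x4 tile per node
    print(off_node_pairs(default, nx, ny, per_node))   # 552
    print(off_node_pairs(tiled, nx, ny, per_node))     # 192
    print(format_rank_order(tiled, per_node).splitlines()[0])
```

In this toy case the tiled layout cuts the number of neighbour pairs that cross the network from 552 to 192. The text returned by format_rank_order could be written to the rank order file. In practice, the Cray-supplied tools discussed in the forum are likely the better route, since they also account for message sizes and measured communication patterns.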