Python on HPC
The Cray Centre of Excellence (CoE) for ARCHER has worked on a project on ARCHER to explore the use of Python in HPC. This was prompted by conversations with users which highlighted two issues:
- People are using additional Python packages beyond those shipped with the standard Python distributions that come with Linux.
- The startup time associated with Python applications can be high, especially for applications which are run at scale. This is not a Cray-specific issue, but rather a generic Python issue when Python is used in large-scale applications.
To help address these issues, the CoE has instigated a project to explore possible solutions. The plan is to investigate a variety of approaches and assess them on a number of criteria (portability, cost, ease of use, performance). The CoE is working closely with EPCC as the Service Provider and CSE Provider in this project. A new project on ARCHER has been created, of which Jason Beech-Brandt is the PI, and within this project evaluations have already started on a number of technologies.
As an example, developers at Cray have been working on a tool called DLFM, which aims to help with the scalability issue associated with Python applications and has already done so for users of the NERSC systems. DLFM has been installed on ARCHER and tested with a workload which mimics that used by the Fluidity users. The speedups associated with this approach can be seen in the table below.
Nodes | Total Job Time (DLFM) | Total Job Time (As-is) |
---|---|---|
2 | 8 | 26 |
4 | 9 | 16 |
8 | 8 | 15 |
16 | 10 | 16 |
32 | 10 | 23 |
64 | 14 | 37 |
128 | 12 | 68 |
256 | 15 | 140 |
512 | 16 | 286 |
1024 | 23 | 561 |
2048 | 34 | 1067 |
As these timings show, the tool delivers a very significant speedup by reducing job startup time, and the gap between the two columns widens as the node count grows. The tool has been provided to the users of the Fluidity code so that they can assess the performance benefit on their real application workload. The DLFM tool has been installed on ARCHER as a user module by the CSE team.
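To make the size of the improvement concrete, the ratio of the as-is time to the DLFM time can be computed for each node count. The short sketch below does this using the figures copied from the table; the names `timings` and `speedup` are illustrative only, and since the table does not state the time units, the ratios are left unitless.

```python
# Timings copied from the table: node count -> (DLFM time, as-is time).
# The table does not state the time unit, so only ratios are computed.
timings = {
    2: (8, 26),
    4: (9, 16),
    8: (8, 15),
    16: (10, 16),
    32: (10, 23),
    64: (14, 37),
    128: (12, 68),
    256: (15, 140),
    512: (16, 286),
    1024: (23, 561),
    2048: (34, 1067),
}


def speedup(dlfm_time, as_is_time):
    """Return how many times faster the DLFM run is than the as-is run."""
    return as_is_time / dlfm_time


# Speedup for every node count in the table.
speedups = {nodes: speedup(dlfm, as_is) for nodes, (dlfm, as_is) in timings.items()}

for nodes, ratio in speedups.items():
    print(f"{nodes:>5} nodes: {ratio:5.1f}x")
```

On these figures the ratio is smallest at 16 nodes (16/10 = 1.6) and largest at 2048 nodes (1067/34, roughly 31).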
We've also had contact from a user of the GPAW code, which uses Python and runs on ARCHER. We have pointed them to the installation of DLFM so that they can try it out and report back on their experiences. We are very interested in hearing from other users of Python (or of dynamically linked applications, which can display similar issues) who can provide additional test cases for DLFM.
In addition, the Cray CoE for ARCHER has made contact with the developers at Continuum Analytics, who develop the Anaconda package, which aims to simplify the packaging and distribution of Python and associated modules. The CoE has completed an initial port of the Anaconda package and passed it on to the CSE team, and it is now available on ARCHER as a user module.