Hardware
The ARCHER facility is based around a Cray XC30 supercomputer that provides the central computational resource. This 4920-node supercomputer is supported by a number of additional components, including high-performance parallel filesystems, pre- and post-processing facilities, external login nodes, and a large, resilient, long-term data facility.
Cray XC30 Supercomputer
The Cray XC30 architecture consists of compute nodes, connected together by the Aries interconnect, and service nodes that act as interfaces to the filesystems and to the external login nodes, and as job launcher nodes.
Compute Nodes
ARCHER compute nodes contain two 2.7 GHz, 12-core Intel Xeon E5-2697 v2 (Ivy Bridge) series processors. Each of the cores in these processors can support 2 hardware threads (Hyperthreads). Within the node, the two processors are connected by two QuickPath Interconnect (QPI) links.
Standard compute nodes on ARCHER have 64 GB of memory shared between the two processors. There are a smaller number of high-memory nodes with 128 GB of memory shared between the two processors. The memory is arranged in a non-uniform memory access (NUMA) form: each 12-core processor is a single NUMA region with local memory of 32 GB (or 64 GB for high-memory nodes). Access to the local memory by cores within a NUMA region has lower latency than accessing memory on the other NUMA region.
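As an illustration of the two NUMA regions described above, the following minimal C sketch (not part of the ARCHER documentation; the availability of libnuma on the node is an assumption) queries the NUMA layout a process sees. On a standard compute node it would be expected to report two regions of roughly 32 GB each.

    /* Minimal sketch, assuming libnuma is installed.
     * Compile with something like:  cc numa_info.c -lnuma */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma reports that NUMA is not available\n");
            return 1;
        }

        /* On a standard ARCHER compute node this should report 2 regions. */
        int nregions = numa_num_configured_nodes();
        printf("NUMA regions: %d\n", nregions);

        for (int n = 0; n < nregions; n++) {
            long long freep;
            long long total = numa_node_size64(n, &freep);  /* bytes of local memory */
            printf("  region %d: %.1f GB total, %.1f GB free\n",
                   n, total / 1e9, freep / 1e9);
        }
        return 0;
    }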
There are 4544 standard memory nodes (12 groups, 109,056 cores) and 376 high memory nodes (1 group, 9,024 cores) on ARCHER giving a total of 4920 compute nodes (13 groups, 118,080 cores). (See the "Aries Interconnect" section below for the definition of a group.)
At zero load the compute nodes on ARCHER draw approximately 400 kW of power and at full load they draw approximately 1200 kW of power.
Aries Interconnect
The Cray Aries interconnect links all compute nodes in a Dragonfly topology. In the Dragonfly topology, 4 compute nodes are connected to each Aries router; 188 nodes are grouped into a cabinet; and two cabinets make up a group. The interconnect consists of 2D all-to-all electrical connections between all nodes in a group, with groups connected to each other by all-to-all optical connections. The number of optical connections between groups can be varied according to the requirements of the system. ARCHER has 144 optical links per group, giving a peak bisection bandwidth of over 19,013 GB/s across the whole system. The MPI latency on Aries is approximately 1.3 μs, with an additional 100 ns of latency when communicating over the optical links.
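The point-to-point MPI latency quoted above can be estimated with a simple ping-pong test. The C sketch below is illustrative only (it is not an official ARCHER benchmark); it assumes two MPI ranks placed on different nodes, and the iteration count is an arbitrary choice.

    /* Illustrative ping-pong latency estimate, run with two ranks on
     * different nodes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        const int iters = 10000;   /* iteration count chosen arbitrarily */
        char byte = 0;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("one-way latency estimate: %.2f us\n",
                   (t1 - t0) * 1e6 / (2.0 * iters));   /* half the round-trip time */

        MPI_Finalize();
        return 0;
    }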
Service Nodes
ARCHER service nodes contain a single 2.7 GHz, 12-core Intel Xeon E5-2697 (Ivy Bridge) series processor.
The service nodes provide a number of different functions on ARCHER: PBS job launcher (MOM) nodes, LNET routers (for connecting to the filesystems) and others.
The /home, /work and RDF filesystems are mounted on the login nodes.
The /home and /work filesystems are mounted on the MOM nodes. This means that the RDF filesystems are not directly available within parallel job scripts. (You can, however, access the postprocessing nodes from within a job script and they have access to the RDF filesystems.)
External Login Nodes (esLogin)
ARCHER esLogin nodes are external to the main XC30 system and are therefore available for use even when the compute portion of the facility is out of service. They have access to all the filesystems on ARCHER and also have the full Cray application development environment installed. This means that data can be accessed, programs compiled, and pre- and post-processing performed when the compute nodes are unavailable.
The eight ARCHER esLogin nodes each contain two 2.6 GHz, 8-core Intel Xeon E5-2650 v2 (Ivy Bridge) series processors.
Filesystems and Data Infrastructure
There are three filesystem classes available on ARCHER: the high-performance, parallel /work filesystems, the NFS /home filesystems, and the Research Data Facility (RDF) filesystems for long-term data storage.
The ARCHER filesystems collectively draw approximately 40 kW of power and the RDF draws approximately 120 kW of power.
/work filesystems
/work is a collection of high-performance, parallel Lustre filesystems. Each project is assigned space on a particular Lustre partition (fs2, fs3 or fs4) with the assignments chosen to balance the load across the available infrastructure. /work should be used for reading and writing during simulations.
There is a total of 4.4 PB available in /work on ARCHER with:
- fs2 having 1.4 PB
- fs3 having 1.4 PB
- fs4 having 1.6 PB
/work filesystems are not backed up in any way.
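As noted above, /work is intended for reading and writing during simulations. The minimal MPI-IO sketch below illustrates the kind of parallel write /work is designed for: each rank writes its own contiguous block of a single shared file. The file name and block size are assumptions made for illustration; in practice the file would live under a project's directory on fs2, fs3 or fs4.

    /* Minimal MPI-IO sketch: every rank writes one contiguous block of a
     * shared file.  File name and block size are illustrative assumptions;
     * the file should be created in the project's /work directory. */
    #include <mpi.h>
    #include <stdlib.h>

    #define BLOCK_DOUBLES (1 << 20)   /* 8 MB of doubles per rank (assumed size) */

    int main(int argc, char *argv[])
    {
        int rank;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *buf = malloc(BLOCK_DOUBLES * sizeof(double));
        for (int i = 0; i < BLOCK_DOUBLES; i++)
            buf[i] = (double)rank;

        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes at its own offset; the collective call lets the
         * MPI-IO layer aggregate requests before they reach the Lustre servers. */
        MPI_Offset offset = (MPI_Offset)rank * BLOCK_DOUBLES * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, BLOCK_DOUBLES, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }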
Further technical details:
For a description of the terms associated with Lustre file systems, please see the description on Wikipedia.
The OSS counts for each of the file systems:
- fs2: 12
- fs3: 12
- fs4: 14
Each OSS is a Seagate Sonexion 1600 OSS controller module with one Intel Xeon E5-2648L CPU @ 1.80 GHz and 32 GB of memory.
Each OSS hosts 4 OSTs, with 10 discs per OST, giving 40 discs per OSS. The 10 discs in each OST are arranged in RAID6 (8+2). Each OSS also contains a number of hot spares, plus SSDs for RAID and ext3 journaling.
Each disc is a 4 TB Seagate ST4000NM0023 (Constellation ES.3, 3.5", SAS 6 Gb/s, 7,200 rpm).
There is one MDS and one backup MDS per file system.
Each MDS is a Seagate Sonexion 1600 MDS controller module with two Intel Xeon E5-2680 CPUs @ 2.70 GHz.
Each of the 3 MDTs comprises 14 discs in RAID10. Each disc is a 600 GB Seagate ST9600205SS (Enterprise Performance 10K, 2.5", SAS 6 Gb/s, 10,000 rpm).
Routing: Each client accesses the three file systems via 18 LNet router nodes internal to the XC30. Each file system is attached to 10, 10 or 14 router nodes, and some router nodes serve more than one path. The routing is complex, involving overlapping primary and secondary paths; the rule that affects performance, however, is that the primary LNet path is configured so that all clients access 3 OSS nodes via 2 LNet router nodes. Each MDS is accessed from the XC30 via 2 LNet router nodes.
/home filesystems
The /home filesystem is an NFS filesystem and is available on the esLogin and service nodes. The /home filesystem is backed up, first to a second set of hard disks, and then to tape. This is the filesystem to use for critical files such as source code, makefiles or other build scripts, binary libraries and executables obtained from third parties, and small permanent datasets.
There is a total of 218 TB available to users in /home.
It should be noted that the /home filesystem is not designed, and does not have the capacity, to act as a long term archive for large sets of results. Users should use the RDF facility (see below); transfer such data to systems at their own institutions; or use suitable facilities elsewhere.
RDF filesystems
The Research Data Facility (RDF) consists of 20.29 PB of usable disk, with additional backup tape capacity. The RDF is external to the national services and is designed for long-term data storage. The RDF currently has 3 filesystems mounted on the ARCHER login, PP and service nodes:
- /general
- /epsrc
- /nerc
The disk storage is based on four DDN 10K storage arrays populated with 3 TB 7,200 rpm near-line SAS HDDs. Metadata storage is based on two IBM DS3524s populated with 300 GB 10,000 rpm SAS HDDs. The backup capability is managed via Tivoli Storage Manager, based on an IBM TS3500 tape library with 12 drives.
Pre- and Post-Processing Facilities
The two ARCHER postprocessing (PP) nodes each contain four 2.0 GHz, 10-core Intel Xeon E7-4850 (Westmere) series processors, giving a total of 40 physical cores (80 hyperthreads) available on each node. Each PP node has 1 TB of physical memory available.
/home, /work and the RDF are all mounted on the PP nodes.
The PP nodes are available to users by two methods:
- Via the serial queues: see the description in the Post Processing Jobs section of the User Guide.
- Via direct interactive access: as described in the Interactive access to Post Processing nodes section of the User Guide.