Platform for Scientific Computing

WR Cluster Hardware


Overview

The cluster consists of the systems listed in the configuration table below, with a total of 1,508 CPU hardware threads plus more than 25,000 graphics processor cores.
Configuration

Key Hardware Parameters

system | wr0 | wr3 | wr4 | wr5 | wr28-wr42 | wr20-wr27 | wr10-wr19 | wr6 | wr7 | wr9 | wr8 | wr2 | wr1
processor | Intel Xeon E5-2697 v2 (Ivy Bridge EP) | Intel Xeon Phi 7250 (Knights Landing) | Intel Xeon E5-4657L v2 (Ivy Bridge EP) | Intel Xeon E5-2697 v3 (Haswell EP) | Intel Xeon E5-2680 v3 (Haswell EP) | Intel Xeon E5-2670 (Sandy Bridge EP) | AMD Opteron 8378 (Shanghai) | Intel Xeon E5-2643 v2 (Ivy Bridge EP) | Intel Xeon E5504 (Nehalem EP) | AMD Opteron 6272 (Interlagos) | AMD Opteron 6168 (Magny Cours) | Intel Xeon E5530 (Nehalem EP) | Intel Xeon E5-2643 v2 (Ivy Bridge EP)
cores per processor | 12 (+2*HT) | 68 (+4*HT) | 12 (+2*HT) | 14 (+2*HT) | 12 (+2*HT) | 8 (+2*HT) | 4 | 6 (+2*HT) | 4 | 16 | 12 | 4 (+2*HT) | 6 (+2*HT)
processors per node | 2 | 1 | 4 | 2 | 2 | 2 | 4 | 1 | 2 | 4 | 4 | 2 | 2
threads per node (ppn=...) | 48 | 272 | 96 | 56 | 48 | 32 | 16 | 12 | 8 | 64 | 48 | 16 | 24
clock speed [GHz] | 2.7 (3.5 with TurboBoost) | 1.4 (1.6 with TurboBoost) | 2.4 (2.9 with TurboBoost) | 2.6 (3.6 with TurboBoost) | 2.5 (3.3 with TurboBoost) | 2.6 (3.3 with TurboBoost) | 2.4 | 3.5 (3.8 with TurboBoost) | 2.0 | 2.1 (3.0 with TurboBoost) | 1.9 | 2.4 (2.66 with TurboBoost) | 3.5 (3.8 with TurboBoost)
L1 (per core) / L2 (per core) / L3 (shared) data cache size | 32K / 256K / 30M | 32K / 512K / 16G (MCDRAM) | 32K / 256K / 30M | 32K / 256K / 35M | 32K / 256K / 30M | 32K / 256K / 20M | 64K / 512K / 6M | 32K / 256K / 25M | 32K / 256K / 4M | 16K / 1024K / 12M | 64K / 512K / 12M | 32K / 256K / 8M | 32K / 256K / 8M
accelerator | - | - | - | Nvidia Tesla K80 | - | Nvidia Tesla K20m | - | Intel Xeon Phi 5110P | Nvidia Tesla M2050 | - | - | - | -
main memory [GB] | 256 | 96 | 768 | 128 | 128 | 128 | 16 | 64 | 12 | 128 | 32 | 12 | 128
local disk [GB] | 6,480 | 500 | 500 | 256 | 500 | 500 | 128 | 240 | 500 | 1,000 | 1,000 | 8,000 | 80,000
latency L1 / L2 / L3 / main memory [cycles] (measured worst case) | 3 / 9 / 38 / 176 | 4 / 17 / ? / 249 | 3 / 10 / 42 / 209 | 3 / 9 / 42 / 208 | 3 / 9 / 37 / 229 | 3 / 10 / 37 / 212 | 3 / 19 / 88 / 402 | 4 / 12 / 47 / 221 | 4 / 10 / 49 / 182 | 4 / 18 / 35 / 150 | 3 / 17 / 40 / 115 | 4 / 9 / 44 / 193 | 4 / 12 / 47 / 288
stream memory bandwidth [GB/s] | 83.9 | ? | 76.6 | 106.6 | 114.2 | 73.0 | 23.6 | 43.7 | 19.5 | 111.9 | 101.5 | 30.7 | 30.7
theor. GFlops core / node (w/o TurboBoost) | 21.6 / 518.4 | 44.8 / 3046.4 | 19.2 / 921.6 | 41.6 / 1164.8 | 40 / 960 | 20.8 / 332.8 | 9.6 / 153.6 | 28.0 / 168.0 | 8.0 / 64.0 | 16.8 / 537.6 | 7.6 / 364.8 | 9.6 / 76.8 | 28.0 / 336.0
DGEMM GFlops core / node | 25.8 / 488.3 | ? / ? | 21.4 / 769.6 | 47.5 / 952.9 | 45.4 / 756.0 | 24.8 / 317.2 | 8.9 / 127.7 | 27.0 / 158.7 | 7.7 / 50.3 | 14.5 / 356.3 | 7.1 / 245.4 | 10.2 / 34.7 | 27.0 / 158.7
job queue | wr0 | wr3 | wr4 | wr5 | hpc2 | hpc1 | mpi | wr6 | wr7 | wr9 | wr8 | - | -
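
The stream memory bandwidth figures are measured with the STREAM benchmark. The following is a minimal triad-style sketch of such a measurement (illustration only, not the official benchmark; assumes gcc with OpenMP):

  /* Minimal STREAM-triad-style bandwidth sketch (illustration only;
     the table values come from the official STREAM benchmark).
     Compile e.g.: gcc -O3 -fopenmp triad.c -o triad */
  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  #define N (64 * 1024 * 1024)   /* three double arrays, ~1.5 GB total */

  int main(void) {
      double *a = malloc(N * sizeof(double));
      double *b = malloc(N * sizeof(double));
      double *c = malloc(N * sizeof(double));
      const double scalar = 3.0;

      /* first-touch initialization so pages are spread over the NUMA nodes */
      #pragma omp parallel for
      for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

      double t = omp_get_wtime();
      #pragma omp parallel for
      for (long i = 0; i < N; i++)
          a[i] = b[i] + scalar * c[i];   /* triad: 2 loads + 1 store per element */
      t = omp_get_wtime() - t;

      /* 3 arrays x 8 bytes move per element */
      printf("triad bandwidth: %.1f GB/s\n", 3.0 * 8.0 * N / t / 1e9);
      free(a); free(b); free(c);
      return 0;
  }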


Server

Front-End Node (wr0)

This is the front-end node and acts as the central access point for the whole cluster.

Fileserver (wr1)

This is the file server and backup storage for other systems.

Backup System (wr2)

This server acts as a backup system.


Cluster Nodes WR-I

The WR-I cluster has 15 cluster nodes wr28-wr42 based on the 6018R-MT barebone from Supermicro; the per-node specification is listed in the key hardware parameters table above.


Cluster Nodes WR-II

The WR-II cluster has 8 cluster nodes wr20-wr27 based on the 2027GR-TRTF barebone from Supermicro; the per-node specification is listed in the key hardware parameters table above. The accumulated peak performance of all WR-II nodes without the GPUs is 2.662 TeraFlops (128 cores, each with 20.8 GigaFlops peak performance). The HPL benchmark delivers more than 1.6 TeraFlops for 8 cluster nodes (without GPUs) and n=326000.
The nodes use the Intel Xeon E5-2670 (Sandy Bridge microarchitecture) processor.
Each cluster node has one Nvidia Tesla K20m GPU accelerator (Kepler architecture). The deviceQuery listing below shows the technical details and hardware restrictions of this card:

Device 0: "Tesla K20m"
  CUDA Driver Version / Runtime Version          5.0 / 5.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 4800 MBytes (5032706048 bytes)
  (13) Multiprocessors x (192) CUDA Cores/MP:    2496 CUDA Cores
  GPU Clock rate:                                706 MHz (0.71 GHz)
  Memory Clock rate:                             2600 Mhz
  Memory Bus Width:                              320-bit
  L2 Cache Size:                                 1310720 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536),3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048,2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device
       simultaneously) >
The HPL benchmark delivers more than 2.5 TeraFlops for 8 cluster nodes with GPUs and n=244000.
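
The listing above is the output of the CUDA deviceQuery sample. The same information can be queried programmatically with the CUDA runtime API; a minimal sketch:

  /* Minimal device-property query with the CUDA runtime API (a sketch of
     what the deviceQuery sample prints above).
     Compile e.g.: nvcc query.c -o query */
  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void) {
      int n = 0;
      cudaGetDeviceCount(&n);
      for (int dev = 0; dev < n; dev++) {
          struct cudaDeviceProp p;
          cudaGetDeviceProperties(&p, dev);
          printf("Device %d: \"%s\"\n", dev, p.name);
          printf("  CUDA capability:       %d.%d\n", p.major, p.minor);
          printf("  Global memory:         %zu MBytes\n",
                 p.totalGlobalMem / (1024 * 1024));
          printf("  Multiprocessors:       %d\n", p.multiProcessorCount);
          printf("  GPU clock rate:        %.0f MHz\n", p.clockRate * 1e-3);
          printf("  Max threads per block: %d\n", p.maxThreadsPerBlock);
          printf("  Max grid size:         %d x %d x %d\n",
                 p.maxGridSize[0], p.maxGridSize[1], p.maxGridSize[2]);
      }
      return 0;
  }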


Cluster Nodes WR-III

The WR-III cluster has 10 cluster nodes wr10-wr19 based on the 2U TX46 barebone from Tyan; the per-node specification is listed in the key hardware parameters table above. Accumulated peak performance for all WR-III cluster nodes is 1.536 TeraFlops (160 cores, each with 9.6 GigaFlops; each Shanghai core can compute 4 64-bit floating-point operations per cycle). The HPL benchmark delivers up to 1.06 TeraFlops.
Each cluster node has a single-port Infiniband DDR PCIe host channel adapter Mellanox MHGS18-XTC. The host adapter is connected to a Flextronics Infiniband switch over copper. The peak bandwidth between two nodes is 20 Gbit/s in each direction. With appropriate MPI operations the small-message latency is approx. 3.8 usec, and the maximum data rate can reach up to 1.3 GB/s (approx. 10 Gbit/s) in each direction.
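
Figures like the quoted latency and data rate are typically obtained with an MPI ping-pong measurement between two nodes; a minimal sketch (any MPI implementation):

  /* Minimal MPI ping-pong sketch for latency/bandwidth between two ranks
     (the quoted figures were measured with comparable MPI benchmarks).
     Compile e.g.: mpicc pingpong.c -o pingpong ; run with 2 ranks on 2 nodes.
     For the small-message latency figure, use a tiny message size instead. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int reps = 1000, bytes = 1024 * 1024;   /* 1 MB messages */
      char *buf = malloc(bytes);

      MPI_Barrier(MPI_COMM_WORLD);
      double t = MPI_Wtime();
      for (int i = 0; i < reps; i++) {
          if (rank == 0) {
              MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      t = MPI_Wtime() - t;

      if (rank == 0) {
          double rtt = t / reps;               /* average round-trip time */
          printf("half round trip: %.2f usec\n", rtt / 2 * 1e6);
          printf("bandwidth:       %.2f GB/s\n", 2.0 * bytes / rtt / 1e9);
      }
      free(buf);
      MPI_Finalize();
      return 0;
  }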

Each cluster node also has a Sparkle SX96GT512D3L-NM video adapter.


Shared Memory Parallel Systems

Big Memory / Shared Memory Parallelism (wr4)

For shared-memory jobs with a demand for large main memory and/or a high degree of parallelism, the many-core shared-memory server wr4 is available, based on the Supermicro 8027T-TRF+ barebone.
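
A single program on wr4 can use up to 96 hardware threads. A minimal OpenMP sketch of such a shared-memory job (illustration only; the thread count is taken from the table above):

  /* Minimal OpenMP sketch for a shared-memory node such as wr4.
     Compile e.g.: gcc -O3 -fopenmp sum.c -o sum
     Run e.g.:     OMP_NUM_THREADS=96 ./sum */
  #include <stdio.h>
  #include <omp.h>

  #define N 100000000L

  int main(void) {
      double sum = 0.0;
      /* all threads work on shared data; the reduction combines partial sums */
      #pragma omp parallel for reduction(+:sum)
      for (long i = 0; i < N; i++)
          sum += 1.0 / (double)(i + 1);   /* harmonic series, just as work */
      printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
      return 0;
  }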

Big Memory / Integer Shared Memory Parallelism (wr9)

For shared-memory jobs with a demand for large main memory and/or a high degree of parallelism, the many-core shared-memory server wr9 is available.

Floating Point Shared Memory Parallelism (wr8)

For shared-memory jobs with a demand for many cores / floating-point units, the many-core shared-memory server wr8 is available.


Accelerator

Manycore Computing (wr3)

The node wr3 has an Intel Xeon Phi 7250 (Knights Landing) standalone processor.
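
The theoretical peak in the hardware table follows from cores x clock rate x floating-point operations per cycle; each Knights Landing core has two AVX-512 vector units, each able to complete a fused multiply-add on 8 doubles per cycle. A small calculation sketch:

  /* Sketch: theoretical DP peak of the Xeon Phi 7250 node, from
     cores x clock x flops/cycle (numbers as in the table above). */
  #include <stdio.h>

  int main(void) {
      const int cores = 68;
      const double clock_ghz = 1.4;     /* base clock, without TurboBoost */
      /* 2 AVX-512 VPUs x 8 doubles x 2 flops (FMA) = 32 DP flops per cycle */
      const int flops_per_cycle = 2 * 8 * 2;
      double per_core = clock_ghz * flops_per_cycle;      /* 44.8 GFlops */
      printf("peak: %.1f GFlops/core, %.1f GFlops/node\n",
             per_core, per_core * cores);                 /* ~3046 GFlops */
      return 0;
  }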

Manycore Computing (wr6)

For many-core computing there is the node wr6, equipped with an Intel Xeon Phi 5110P (called MIC, for Many Integrated Core). Our Intel MIC is configured as a coprocessor, i.e., a program runs on the host system wr6 and offloads parts of its work to the MIC, e.g., using OpenMP/OpenACC; a sketch follows below. There is no shared file system available on the MIC. If you would like to run native programs on the MIC or do any other fancy things, contact us. We are running MPSS 3.x.
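
A minimal offload sketch, assuming the Intel compiler with its Language Extensions for Offload (OpenMP 4.0 target directives work similarly; illustration only):

  /* Minimal offload sketch for the Xeon Phi coprocessor on wr6
     (assumes the Intel compiler; illustration only).
     Compile e.g.: icc -O2 -openmp offload.c -o offload */
  #include <stdio.h>

  #define N 1000000

  int main(void) {
      static float a[N], b[N];   /* statically sized, so lengths are known */
      for (int i = 0; i < N; i++) a[i] = (float)i;

      /* run this block on the MIC; a is copied in, b is copied back out */
      #pragma offload target(mic) in(a) out(b)
      {
          #pragma omp parallel for
          for (int i = 0; i < N; i++)
              b[i] = 2.0f * a[i];
      }

      printf("b[42] = %f\n", b[42]);
      return 0;
  }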

GPU Computing (wr5)

For batch GPU computing there is the node wr5, equipped with an Nvidia Tesla K80. The deviceQuery excerpt below shows the technical details and hardware restrictions of this card (each of the two GPU chips is a CUDA device of its own):

  CUDA Capability Major/Minor version number:    3.7
  Total amount of global memory:                 12288 MBytes (12884705280 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Clock rate:                                824 MHz (0.82 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes

GPU Computing (wr7)

For interactive GPU development there is the node wr7, equipped with an Nvidia Tesla M2050. The deviceQuery excerpt below shows the technical details and hardware restrictions of this card:

  Total amount of global memory:                 2817982464 bytes
  Multiprocessors x Cores/MP = Cores:            14 (MP) x 32 (Cores/MP) = 448 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
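
One practical consequence of these limits: with at most 1024 threads per block and 65535 blocks per grid dimension, large 1D problems must be folded into a 2D grid on this card. A host-side sketch of the launch arithmetic (illustration only):

  /* Sketch: choosing kernel launch dimensions within the M2050 limits
     listed above (max 1024 threads/block, max 65535 blocks per grid
     dimension). Illustration only. */
  #include <stdio.h>

  int main(void) {
      const long n = 200 * 1000 * 1000;            /* elements to process */
      const int threads = 256;                     /* <= 1024 per block */
      long blocks = (n + threads - 1) / threads;   /* 781250 blocks needed */

      /* 781250 > 65535, so a 1D grid is not allowed on this card:
         fold the blocks into a 2D grid instead */
      int gx = 65535;
      int gy = (int)((blocks + gx - 1) / gx);      /* here: 12 */
      printf("grid %d x %d, %d threads per block\n", gx, gy, threads);
      /* the kernel then recomputes its global index as
         i = ((long)blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x
             + threadIdx.x   and checks i < n */
      return 0;
  }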


Networks

Each cluster node has at least two network connections. Two central IBM G8264 switches connect all nodes at 10 Gb/s or 1 Gb/s; the switches are linked over 40 Gb/s ports.