Platform for Scientific Computing

WR Cluster Hardware



Overview

The whole cluster consists of the systems listed in the configuration below, with a total of approx. 5,200 CPU hardware threads plus approx. 48,000 graphics processor cores.
Configuration

Key Hardware Parameters

system: wr0 | wr50-wr99 | wr28-wr42 | wr20-wr27 | wr12, wr16-wr19 | wr15 | wr14 | wr13 | wr43 | wr5 | wr1 | wr2 | wr3 | wr4
processor: Intel Xeon Gold 6150 (Skylake SP) | Intel Xeon Gold 6130 (Skylake EP) | Intel Xeon E5-2680 v3 (Haswell EP) | Intel Xeon E5-2670 (Sandy Bridge EP) | Intel Xeon Gold 6130 (Skylake EP) | Intel Xeon Gold 6130 (Skylake EP) | Intel Xeon E5-2697 v3 (Haswell EP) | Intel Xeon Phi 7250 (Knights Landing) | Intel Xeon E5-4657L v2 (Ivy Bridge EP) | Intel Xeon Gold 6128 (Skylake SP) | Intel Xeon E5-2643 v2 (Ivy Bridge EP) | Intel Xeon Bronze (Skylake EP) | Intel Xeon E5-2697 v2 (Ivy Bridge EP) | Intel Xeon E5530 (Nehalem EP)
cores per processor: 18 (+2*HT) | 16 (+2*HT) | 12 (+2*HT) | 8 (+2*HT) | 16 (+2*HT) | 16 (+2*HT) | 14 (+2*HT) | 68 (+4*HT) | 12 (+2*HT) | 6 (+2*HT) | 6 (+2*HT) | 6 (+2*HT) | 12 (+2*HT) | 4 (+2*HT)
processors per node: 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 4 | 2 | 2 | 1 | 2 | 2
hw threads per node: 72 | 64 | 48 | 32 | 64 | 64 | 56 | 272 | 96 | 24 | 24 | 12 | 48 | 16
clock speed [GHz]: 2.7 (3.7 with TurboBoost) | 2.1 (3.7 with TurboBoost) | 2.5 (3.3 with TurboBoost) | 2.6 (3.3 with TurboBoost) | 2.1 (3.7 with TurboBoost) | 2.1 (3.7 with TurboBoost) | 2.6 (3.6 with TurboBoost) | 1.4 (1.6 with TurboBoost) | 2.4 (2.9 with TurboBoost) | 3.5 (3.8 with TurboBoost) | 3.4 (3.7 with TurboBoost) | 1.7 | 2.7 (3.5 with TurboBoost) | 2.4 (2.66 with TurboBoost)
L1 / L2 (per core) / L3 (shared) data cache size: 32K / 1M / 24.75M | 32K / 1M / 22M | 32K / 256K / 30M | 32K / 256K / 20M | 32K / 1M / 22M | 32K / 1M / 22M | 32K / 256K / 35M | 32K / 512K / 16G (MCDRAM) | 32K / 256K / 30M | 32K / 256K / 8M | 32K / 1M / 19.25M | 32K / 256K / 8.25M | 32K / 256K / 30M | 32K / 256K / 8M
accelerator: - | - | - | Nvidia Tesla K20m | Nvidia Tesla V100 | 4x Nvidia Tesla V100 | Nvidia Tesla K80 | - | - | - | - | - | - | -
main memory [GB]: 192 | 192 | 128 | 128 | 192 | 192 | 128 | 96 | 768 | 192 | 128 | 96 | 256 | 12
local disk [GB]: 9,600 | 500 | 500 | 500 | 500 | 500 | 256 | 500 | 500 | 240,000 | 112,000 | 53,000 | 6,480 | 8,000
avg. latency L1 / L2 / L3 / main memory [cycles]: 4 / 9 / 38 / 119 | 4 / 9 / 41 / 133 | 4 / 9 / 28 / 144 | 4 / 7 / 18 / 97 | 4 / 9 / 39 / 134 | 4 / 9 / 39 / 131 | 4 / 9 / 27 / 98 | 4 / 10 / - / 102 | 4 / 10 / 22 / 85 | 4 / 7 / 31 / 93 | 4 / 8 / 18 / 85 | 4 / 9 / 30 / 69 | 4 / 8 / 20 / 83 | 4 / 8 / 18 / 63
stream memory bandwidth core / node [GB/s]: 13.4 / 167.0 | 13.2 / 174.9 | 19.4 / 116.7 | 9.9 / 71.5 | 13.2 / 159.3 | - / - | 20.1 / 111.4 | 16.4 / 343.9 (MCDRAM) | 6.2 / 76.8 | 13.5 / 98.4 | 10.0 / 83.4 | 9.9 / 42.0 | 8.7 / 86.1 | 9.8 / 25.5
measured DGEMM GFlops core / node: 100.1 / 2,490.6 | 103.4 / 1,841.3 | 46.0 / 809.9 | 25.0 / 333.3 | 100.6 / 1,838.4 | 101.5 / 1,844.2 | 48.3 / 1,022.3 | 37.6 / 1,468.4 | 21.5 / 672.9 | 84.6 / 960.4 | 27.1 / 315.6 | 11.9 / 72.2 | 26.3 / 495.8 | 6.0 / 43.3
job queue: - | any, hpc, hpc3 | any, hpc, hpc2 | any, hpc, hpc1 | any, gpu | gpu, gpu4 | any, wr14 | wr13 | wr43 | - | - | - | - | -
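
The parameters above can be cross-checked on any node at run time. The following minimal C sketch uses the POSIX sysconf interface; the _SC_LEVEL*_CACHE_SIZE constants are glibc extensions, so their availability is an assumption about the installed C library:

#include <stdio.h>
#include <unistd.h>

/* Print a few of the hardware parameters listed in the table above,
 * as seen by the operating system on the node the program runs on. */
int main(void)
{
    long hw_threads = sysconf(_SC_NPROCESSORS_ONLN);    /* online hardware threads */
    long l1d        = sysconf(_SC_LEVEL1_DCACHE_SIZE);  /* glibc extension */
    long l2         = sysconf(_SC_LEVEL2_CACHE_SIZE);   /* glibc extension */
    long l3         = sysconf(_SC_LEVEL3_CACHE_SIZE);   /* glibc extension */
    long page       = sysconf(_SC_PAGESIZE);
    long pages      = sysconf(_SC_PHYS_PAGES);

    printf("hardware threads : %ld\n", hw_threads);
    printf("L1d / L2 / L3    : %ld / %ld / %ld bytes\n", l1d, l2, l3);
    printf("main memory      : %.1f GB\n", (double)page * pages / 1e9);
    return 0;
}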


Server

Front-End Node (wr0)

This is the front-end node and acts as the central access point for the whole cluster. The processor parameters are listed under wr0 in the Key Hardware Parameters table above.

Fileserver (wr5)

This is the /scratch file server for other systems.

Backup Server (wr2)

Some filesystems are regularly backed up to this server.


Standby Server

These systems act as a hot standby for critical servers.

Fileserver Standby (wr1)

Server Standby (wr3)

The processor parameters are listed under wr3 in the Key Hardware Parameters table above.

Server Standby (wr4)

The processor parameters are listed under wr4 in the table above.


Cluster Nodes WR-I

The WR-I cluster has 50 cluster nodes wr50-wr99 based on the PowerEdge C6420 barebone from Dell, four of which are grouped in one PowerEdge C6400 chassis. The per-node specification is given under wr50-wr99 in the Key Hardware Parameters table above.


Cluster Nodes WR-II

This cluster has 15 cluster nodes wr28-wr42 based on the 6018R-MT barebone from Supermicro; the per-node specification is given under wr28-wr42 in the table above.


Cluster Nodes WR-III

This cluster has 8 cluster nodes wr20-wr27 based on the 2027GR-TRTF barebone from Supermicro; the per-node specification is given under wr20-wr27 in the table above. Accumulated peak performance for all cluster III nodes without the GPUs is 2.662 TeraFlops (256 cores, each with 10.4 GigaFlops peak performance); a short sketch of this arithmetic follows below. The HPL benchmark delivers more than 1.6 TeraFlops for 8 cluster nodes (without GPUs) and n=326000.
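
The quoted peak figure can be reproduced from the parameters in the table above. A minimal C sketch that follows the counting used in the text (32 hardware threads per node at 2.6 GHz and 4 flops per cycle per hardware thread, i.e. 8 double-precision AVX flops per cycle per physical core):

#include <stdio.h>

/* Reproduce the quoted peak performance of the 8 WR-III nodes
 * (Intel Xeon E5-2670, 2.6 GHz, without the GPUs). */
int main(void)
{
    const int    nodes             = 8;
    const int    hw_threads_node   = 32;        /* see Key Hardware Parameters table */
    const double gflops_per_thread = 2.6 * 4.0; /* 2.6 GHz * 4 flops/cycle/thread
                                                   = 8 AVX DP flops/cycle per core  */

    double peak_gflops = nodes * hw_threads_node * gflops_per_thread;
    printf("peak performance: %.3f TFlops\n", peak_gflops / 1000.0);      /* ~2.662 */
    printf("HPL result of 1.6 TFlops is about %.0f %% of peak\n",
           1600.0 / peak_gflops * 100.0);
    return 0;
}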
Technical details for the Intel E5-2670 (Sandy Bridge microarchitecture) processor are also listed in the table above.
Each cluster node has one Nvidia Tesla K20m GPU accelerator (Kepler architecture). Technical details and the hardware limits of this card are shown in the deviceQuery output below; a sketch that queries these limits at run time follows at the end of this subsection.

Device 0: "Tesla K20m"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 5063 MBytes (5308743680 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Max Clock rate:                            706 MHz (0.71 GHz)
  Memory Clock rate:                             2600 Mhz
  Memory Bus Width:                              320-bit
  L2 Cache Size:                                 1310720 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
The HPL benchmark delivers more than 2.5 TeraFlops for 8 cluster nodes with GPUs and n=244000.
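
Instead of hard-coding the limits from the listing above, a program can query them at run time. A minimal sketch using the CUDA runtime API (compile with nvcc; error handling omitted; the 1024-thread block size is only an example):

#include <stdio.h>
#include <cuda_runtime.h>

/* Query the limits shown in the deviceQuery listing above and check a
 * planned kernel configuration against them before launching. */
int main(void)
{
    struct cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          /* device 0: the Tesla K20m */

    printf("%s: %d SMs, %zu MB global memory\n",
           prop.name, prop.multiProcessorCount,
           prop.totalGlobalMem / (1024 * 1024));
    printf("max threads per block: %d, shared memory per block: %zu bytes\n",
           prop.maxThreadsPerBlock, prop.sharedMemPerBlock);

    int planned_block_size = 1024;              /* example configuration */
    if (planned_block_size > prop.maxThreadsPerBlock)
        printf("block size %d exceeds the device limit\n", planned_block_size);
    return 0;
}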


Shared Memory Parallel Systems

Big Memory / Shared Memory Parallelism (wr43)

For shared-memory jobs with a demand for large main memory and/or a high degree of parallelism, this many-core shared-memory server, based on the Supermicro 8027T-TRF+ barebone, is available. The processor parameters are listed under wr43 in the table above; a minimal OpenMP usage sketch follows below.
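
As an illustration of how such a node is typically used, here is a minimal OpenMP sketch in C that initializes and reduces a large array with all available hardware threads. The thread count and the array size are examples, not requirements (compile e.g. with -fopenmp):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Minimal shared-memory example: initialize and sum a large array in
 * parallel. Run e.g. with OMP_NUM_THREADS=96 on wr43. */
int main(void)
{
    const size_t n = (size_t)1 << 30;        /* 1 Gi doubles = 8 GB, small for 768 GB */
    double *a = malloc(n * sizeof(double));
    if (!a) return 1;

    double sum = 0.0;
    #pragma omp parallel
    {
        /* first-touch initialization: each thread touches its own part,
         * so pages are allocated near the socket that will use them */
        #pragma omp for
        for (size_t i = 0; i < n; i++)
            a[i] = 1.0;

        #pragma omp for reduction(+:sum)
        for (size_t i = 0; i < n; i++)
            sum += a[i];
    }
    printf("threads available: %d, sum = %.0f\n", omp_get_max_threads(), sum);
    free(a);
    return 0;
}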


Accelerator

GPU Computing (wr12, wr16-wr19)

Each of these nodes has one Nvidia Tesla V100 (PCIe). Technical details and the hardware limits of this card are shown in the deviceQuery output below:

Device 0: "Tesla V100-PCIE-16GB"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 16160 MBytes (16945512448 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1380 MHz (1.38 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 7 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 59 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

GPU Computing (wr15)

This node has four Nvidia Tesla V100 SXM2 GPUs connected with NVLink, visible as four separate CUDA devices. Technical details and the hardware limits of these cards are shown in the deviceQuery output below; a sketch that checks direct peer-to-peer access between the four devices follows the listing.

Device 0: "Tesla V100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 16160 MBytes (16945512448 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1530 MHz (1.53 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 5 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 26 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
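
Because the four V100 devices on this node are coupled with NVLink, device-to-device transfers can bypass the host. A minimal sketch using the CUDA runtime API (error handling omitted) that checks which device pairs allow direct peer access:

#include <stdio.h>
#include <cuda_runtime.h>

/* List all CUDA devices on the node and check which pairs support
 * direct peer-to-peer access (over NVLink on wr15). */
int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);                 /* expected: 4 on wr15 */
    printf("CUDA devices: %d\n", ndev);

    for (int i = 0; i < ndev; i++) {
        for (int j = 0; j < ndev; j++) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("device %d -> device %d: peer access %s\n",
                   i, j, can ? "possible" : "not possible");
        }
    }
    return 0;
}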

GPU Computing (wr14)

For batch GPU computing this node is equipped with an Nvidia Tesla K80. Technical details and the hardware limits are shown in the deviceQuery output below; note that each of the two GPU chips on the card is a separate CUDA device of its own. A short sketch that addresses both devices follows the listing.

Device 0: "Tesla K80"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    3.7
  Total amount of global memory:                 12207 MBytes (12799574016 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Max Clock rate:                            824 MHz (0.82 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 132 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
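
Since the K80 appears as two independent CUDA devices, a program has to select each chip explicitly to use the whole card. A minimal sketch using the CUDA runtime API (error handling omitted; memory values are reported per chip):

#include <stdio.h>
#include <cuda_runtime.h>

/* The Tesla K80 shows up as two CUDA devices; iterate over both and
 * report the free/total memory of each chip. */
int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);                 /* expected: 2 on wr14 */

    for (int d = 0; d < ndev; d++) {
        size_t free_b = 0, total_b = 0;
        cudaSetDevice(d);                      /* all following calls target chip d */
        cudaMemGetInfo(&free_b, &total_b);
        printf("device %d: %zu of %zu MB free\n",
               d, free_b / (1024 * 1024), total_b / (1024 * 1024));
    }
    return 0;
}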

Manycore Computing (wr13)

This node has a standalone Intel Xeon Phi 7250 (Knights Landing) processor. The processor parameters are listed under wr13 in the table above; a sketch showing how to place data in the 16 GB on-package MCDRAM follows below.
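
The 16 GB of on-package MCDRAM can be used as a separate high-bandwidth memory pool. One common way to place an array there is the hbwmalloc interface of the memkind library; the sketch below assumes that memkind is installed on wr13 and that the MCDRAM is exposed as a separate NUMA node (both are assumptions, not statements about the actual configuration). Link with -lmemkind.

#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind high-bandwidth-memory interface (assumed installed) */

/* Allocate an array in MCDRAM if it is available as a separate memory
 * kind, otherwise fall back to ordinary DDR4 memory. */
int main(void)
{
    const size_t n = 100 * 1000 * 1000;    /* 100 M doubles = 0.8 GB, fits in 16 GB MCDRAM */
    int in_hbw = (hbw_check_available() == 0);   /* 0 means MCDRAM is usable */
    double *a = in_hbw ? hbw_malloc(n * sizeof(double))
                       : malloc(n * sizeof(double));
    if (!a) return 1;

    for (size_t i = 0; i < n; i++)         /* bandwidth-bound initialization */
        a[i] = (double)i;
    printf("allocated in %s, a[n-1] = %.0f\n", in_hbw ? "MCDRAM" : "DDR4", a[n - 1]);

    if (in_hbw) hbw_free(a); else free(a);
    return 0;
}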


Networks

All nodes are connected to one or both of two networks. Two central IBM G8264 switches connect most nodes via Ethernet with 10 Gb/s or 1 Gb/s; the two switches are linked with two 40 Gb/s ports. The Omni-Path network is realized with two 48-port Omni-Path switches with 3:1 blocking.
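
Multi-node jobs communicate over these networks (Omni-Path where available, otherwise Ethernet). The following C sketch assumes an MPI implementation is available on the cluster, which is an assumption about the software environment rather than part of the hardware description; it measures a simple round-trip between two ranks:

#include <stdio.h>
#include <mpi.h>

#define N (1 << 20)          /* 1 Mi doubles = 8 MB per message */

/* Ping-pong between rank 0 and rank 1: a rough check of the effective
 * bandwidth of whichever network MPI uses between the two nodes the
 * ranks were placed on. */
static double buf[N];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {                            /* need two ranks for a round-trip */
        if (rank == 0) printf("run with at least 2 ranks\n");
        MPI_Finalize();
        return 0;
    }

    const int reps = 100;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int r = 0; r < reps; r++) {
        if (rank == 0) {
            MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("effective bandwidth: %.2f GB/s\n",
               2.0 * reps * N * sizeof(double) / (t1 - t0) / 1e9);

    MPI_Finalize();
    return 0;
}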