Platform for Scientific Computing

WR Cluster Hardware


Overview

The whole cluster consists of the systems described below, with a total of approx. 10,500 CPU hardware threads plus approx. 200,000 graphics processor cores.
Configuration

Key Hardware Parameters

wr0
  processor:           2 x AMD EPYC 9374F (Genoa, 4th Gen), 32 cores (2-way HT) per processor
  hw threads per node: 128
  clock speed [GHz]:   3.85 (4.3 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 1M / 256M
  accelerator:         -
  main memory [GB]:    384
  local disk [GB]:     63,360
  avg. latency L1 / L2 / L3 / main memory [cycles]: 5 / 16 / 44 / 245
  stream memory bandwidth core / node [GB/s]: 42.2 / 388.5
  measured DGEMM GFlops core / node: 52.5 / 3,724.5
  job queue:           -

wr50-wr74
  processor:           2 x AMD EPYC 9554 (Genoa, 4th Gen), 64 cores (2-way HT) per processor
  hw threads per node: 128 (HT disabled)
  clock speed [GHz]:   3.1 (3.75 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 1M / 256M
  accelerator:         -
  main memory [GB]:    384
  local disk [GB]:     900
  avg. latency L1 / L2 / L3 / main memory [cycles]: 3 / 9 / 25 / 142
  stream memory bandwidth core / node [GB/s]: 60.6 / 750.0
  measured DGEMM GFlops core / node: 58.2 / 5,943.6
  job queue:           any,hpc,hpc1

wr75-wr106
  processor:           2 x Intel Xeon Gold 6130F (Skylake), 16 cores (2-way HT) per processor
  hw threads per node: 64
  clock speed [GHz]:   2.1 (3.7 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 1M / 22M
  accelerator:         -
  main memory [GB]:    192
  local disk [GB]:     500
  avg. latency L1 / L2 / L3 / main memory [cycles]: 2 / 5 / 19 / 63
  stream memory bandwidth core / node [GB/s]: 13.2 / 180.3
  measured DGEMM GFlops core / node: 100.2 / 1,827.2
  job queue:           any,hpc,hpc3

wr20-wr25
  processor:           2 x AMD EPYC 7543 (Milan, 3rd Gen), 32 cores (2-way HT) per processor
  hw threads per node: 128
  clock speed [GHz]:   2.8 (3.7 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 1M / 256M
  accelerator:         4 x Nvidia A100
  main memory [GB]:    512
  local disk [GB]:     900
  avg. latency L1 / L2 / L3 / main memory [cycles]: 2 / 9 / 30 / 210
  stream memory bandwidth core / node [GB/s]: 50.6 / 345.4
  measured DGEMM GFlops core / node: 56.4 / 2,347.4
  job queue:           gpu,gpu4

wr14
  processor:           2 x Intel Xeon Gold 6130 (Skylake), 16 cores (2-way HT) per processor
  hw threads per node: 64
  clock speed [GHz]:   2.1 (3.7 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 1M / 22M
  accelerator:         4 x Nvidia V100
  main memory [GB]:    192
  local disk [GB]:     1,600
  avg. latency L1 / L2 / L3 / main memory [cycles]: 2 / 5 / 19 / 63
  stream memory bandwidth core / node [GB/s]: 13.2 / 179.7
  measured DGEMM GFlops core / node: 100.9 / 1,817.2
  job queue:           gpu

wr15-wr19
  processor:           2 x Intel Xeon Gold 6130 (Skylake), 16 cores (2-way HT) per processor
  hw threads per node: 64
  clock speed [GHz]:   2.1 (3.7 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 1M / 22M
  accelerator:         Nvidia V100
  main memory [GB]:    192
  local disk [GB]:     500
  avg. latency L1 / L2 / L3 / main memory [cycles]: 2 / 5 / 19 / 63
  stream memory bandwidth core / node [GB/s]: 13.2 / 179.7
  measured DGEMM GFlops core / node: 100.9 / 1,817.2
  job queue:           any,gpu

wr44
  processor:           2 x AMD EPYC 7702 (Rome, 2nd Gen), 64 cores (2-way HT) per processor
  hw threads per node: 256
  clock speed [GHz]:   2.0 (3.35 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 512K / 256M
  accelerator:         -
  main memory [GB]:    1024
  local disk [GB]:     13,300
  avg. latency L1 / L2 / L3 / main memory [cycles]: 4 / 7 / 16 / 114
  stream memory bandwidth core / node [GB/s]: 38.8 / 278.1
  measured DGEMM GFlops core / node: 50.0 / 2,177.9
  job queue:           any,wr44

wr43
  processor:           4 x Intel Xeon E5-4657L v2 (Ivy Bridge), 12 cores (2-way HT) per processor
  hw threads per node: 96
  clock speed [GHz]:   2.4 (2.9 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 256K / 30M
  accelerator:         -
  main memory [GB]:    768
  local disk [GB]:     500
  avg. latency L1 / L2 / L3 / main memory [cycles]: 4 / 8 / 22 / 84
  stream memory bandwidth core / node [GB/s]: 6.2 / 76.8
  measured DGEMM GFlops core / node: 21.5 / 685.8
  job queue:           any,wr43

wr4
  processor:           2 x AMD EPYC 75F3 (Milan, 3rd Gen), 32 cores (2-way HT) per processor
  hw threads per node: 128
  clock speed [GHz]:   2.95 (4.0 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 256K / 8M
  accelerator:         -
  main memory [GB]:    512
  local disk [GB]:     300,000
  avg. latency L1 / L2 / L3 / main memory [cycles]: 3 / 9 / 35 / 94
  stream memory bandwidth core / node [GB/s]: 46.6 / 225.5
  measured DGEMM GFlops core / node: 22.5 / 1,406.4
  job queue:           -

wr5
  processor:           2 x Intel Xeon Gold 6128 (Skylake), 6 cores (2-way HT) per processor
  hw threads per node: 24
  clock speed [GHz]:   3.4 (3.7 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 512K / 256M
  accelerator:         -
  main memory [GB]:    192
  local disk [GB]:     240,000
  avg. latency L1 / L2 / L3 / main memory [cycles]: 4 / 9 / 42 / 146
  stream memory bandwidth core / node [GB/s]: 13.5 / 142.9
  measured DGEMM GFlops core / node: 84.7 / 960.4
  job queue:           -

wr3
  processor:           2 x Intel Xeon Gold 6150 (Skylake), 18 cores (2-way HT) per processor
  hw threads per node: 72
  clock speed [GHz]:   2.7 (3.7 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 1M / 19.25M
  accelerator:         -
  main memory [GB]:    128
  local disk [GB]:     9,600
  avg. latency L1 / L2 / L3 / main memory [cycles]: 4 / 9 / 40 / 128
  stream memory bandwidth core / node [GB/s]: 14.1 / 167.0
  measured DGEMM GFlops core / node: 101.1 / 2,490.6
  job queue:           -

wr1
  processor:           2 x Intel Xeon E5-2643 v2 (Ivy Bridge), 6 cores (2-way HT) per processor
  hw threads per node: 24
  clock speed [GHz]:   3.4 (3.7 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 1M / 19.25M
  accelerator:         -
  main memory [GB]:    256
  local disk [GB]:     112,000
  avg. latency L1 / L2 / L3 / main memory [cycles]: 4 / 8 / 18 / 85
  stream memory bandwidth core / node [GB/s]: 10.0 / 83.4
  measured DGEMM GFlops core / node: 27.1 / 315.6
  job queue:           -

wr2
  processor:           1 x Intel Xeon Bronze (Skylake EP), 6 cores (2-way HT) per processor
  hw threads per node: 12
  clock speed [GHz]:   1.7
  data caches (L1 / L2 per core, L3 shared): 32K / 256K / 8.25M
  accelerator:         -
  main memory [GB]:    128
  local disk [GB]:     53,000
  avg. latency L1 / L2 / L3 / main memory [cycles]: 4 / 9 / 30 / 69
  stream memory bandwidth core / node [GB/s]: 9.9 / 42.0
  measured DGEMM GFlops core / node: 11.9 / 72.2
  job queue:           -

wr6
  processor:           2 x Intel Xeon E5-2697 v3 (Haswell EP), 14 cores (2-way HT) per processor
  hw threads per node: 56
  clock speed [GHz]:   2.6 (3.6 with TurboBoost)
  data caches (L1 / L2 per core, L3 shared): 32K / 256K / 35M
  accelerator:         Nvidia K80
  main memory [GB]:    96
  local disk [GB]:     256
  avg. latency L1 / L2 / L3 / main memory [cycles]: 4 / 9 / 27 / 98
  stream memory bandwidth core / node [GB/s]: 48.3 / 111.4
  measured DGEMM GFlops core / node: 48.3 / 1,022.3
  job queue:           -
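
The stream memory bandwidth values above are measured with a STREAM-style benchmark. As an illustration of what this figure means, below is a minimal sketch of a STREAM-triad-style loop; it is not the official STREAM benchmark, and the array size and repetition count are chosen only for illustration (the arrays must be far larger than the L3 cache to measure main memory).

// triad.cpp -- minimal STREAM-triad-style bandwidth sketch (illustrative,
// not the official STREAM benchmark). Compile e.g. with:
//   g++ -O3 -fopenmp triad.cpp -o triad
#include <cstdio>
#include <cstdlib>
#include <omp.h>

int main(void) {
    const size_t n = 80UL * 1000 * 1000;         // 3 x 640 MB, far beyond L3
    double *a = (double *)malloc(n * sizeof(double));
    double *b = (double *)malloc(n * sizeof(double));
    double *c = (double *)malloc(n * sizeof(double));
    const double scalar = 3.0;

    #pragma omp parallel for                     // parallel first touch + init
    for (size_t i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double best = 1e30;
    for (int k = 0; k < 10; k++) {               // keep the best of 10 runs
        double t = omp_get_wtime();
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];         // triad: 2 loads, 1 store
        t = omp_get_wtime() - t;
        if (t < best) best = t;
    }
    // the triad moves 3 arrays x 8 bytes per element per iteration
    printf("triad bandwidth: %.1f GB/s\n", 3.0 * 8.0 * n / 1e9 / best);
    free(a); free(b); free(c);
    return 0;
}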


Server

SSD-based Fileserver (wr4)

This is the fileserver for large data sets; it hosts the filesystem /work.

HD-based Backup Fileserver (wr5)

This server acts as a backup server for critical data.

Network Gateway System (wr2)

This server acts as a gateway between the Infiniband and Omni-Path networks.


Hot Standby Server

These systems act as a hot standby for critical servers.

Standby Backup Server (wr1)

This server is a standby system for backup.

Hot-Standby Server Node (wr3)

This system is the hot standby for the front-end node, the central access point for the whole cluster.

Test System (wr6)

This older server acts as an internal test system. It also has an older Nvidia K80 GPU.


CPU Nodes

The CPU part of the cluster consists of two groups: 25 systems (wr50-wr74) based on newer AMD processors, and 32 systems (wr75-wr106) based on older Intel processors.

AMD-based CPU nodes

The cluster nodes wr50-wr74 are based on the Gigabyte barebone R183-Z90 rev. AAD1 with an MZ93-FS0 mainboard. The specification for each cluster node is listed under Key Hardware Parameters above.

Intel-based CPU nodes

The cluster nodes wr75-wr106 are based on the PowerEdge C6420 barebone from Dell, grouped four to a PowerEdge C6400 chassis. The specification for each cluster node is listed under Key Hardware Parameters above.


Shared Memory Parallel Systems

Big Memory / Shared Memory Parallelism (wr44)

For shared memory jobs that demand large main memory, a high degree of parallelism, and/or high local I/O bandwidth, this many-core shared memory server is available. It is based on the Gigabyte barebone R182-Z92-00 with an MZ92-FS0-00 mainboard.
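
As an illustration of the kind of shared memory parallelism this node targets, here is a minimal OpenMP sketch; the file name and the toy workload are made up for illustration.

// sum.cpp -- minimal shared-memory parallelism sketch for a many-core node
// such as wr44 (up to 256 hw threads). Compile e.g. with:
//   g++ -O3 -fopenmp sum.cpp -o sum
#include <cstdio>
#include <omp.h>

int main(void) {
    const long n = 100000000L;
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)  // work split across all threads
    for (long i = 1; i <= n; i++)
        sum += 1.0 / (double)i;                  // partial harmonic sum
    printf("threads: %d, sum: %.6f\n", omp_get_max_threads(), sum);
    return 0;
}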

Big Memory / Shared Memory Parallelism (wr43)

For shared memory jobs that demand large main memory and/or a high degree of parallelism, this many-core shared memory server is available, based on the Supermicro barebone 8027T-TRF+.


Accelerator

GPU Computing (wr20-wr25)

These 6 nodes each have 4 Nvidia A100 SXM4 GPUs connected with NVLink. The deviceQuery output below lists technical details and the hardware restrictions that apply when using these cards (4 devices per node):

Device 0: "NVIDIA A100-SXM4-80GB"
  CUDA Driver Version / Runtime Version          11.6 / 11.6
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 81070 MBytes (85007794176 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1593 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 5 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 193 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
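
The listing above is the output of the CUDA deviceQuery sample. A minimal sketch that retrieves a few of the same properties through the standard CUDA runtime call cudaGetDeviceProperties (the file name is made up):

// devinfo.cu -- minimal deviceQuery-style sketch using the CUDA runtime API.
// Compile with: nvcc devinfo.cu -o devinfo
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA devices found\n");
        return 1;
    }
    for (int d = 0; d < count; d++) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("Device %d: \"%s\"\n", d, p.name);
        printf("  CUDA capability:          %d.%d\n", p.major, p.minor);
        printf("  Global memory:            %zu MBytes\n", p.totalGlobalMem >> 20);
        printf("  Multiprocessors:          %d\n", p.multiProcessorCount);
        printf("  Max threads per block:    %d\n", p.maxThreadsPerBlock);
        printf("  Shared memory per block:  %zu bytes\n", p.sharedMemPerBlock);
    }
    return 0;
}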

GPU Computing (wr14)

This node has 4 Nvidia Tesla V100 SXM2 GPUs connected with NVLink. The deviceQuery output below lists technical details and the hardware restrictions that apply when using these cards (4 devices in total):

Device 0: "Tesla V100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 16160 MBytes (16945512448 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1530 MHz (1.53 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 5 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 26 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
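
Because the four GPUs in this node are coupled with NVLink, direct peer-to-peer access between devices is available. A minimal sketch that checks which device pairs support it, using the standard CUDA runtime call cudaDeviceCanAccessPeer (the file name is made up):

// p2p.cu -- checks which GPU pairs support direct peer-to-peer access
// (via NVLink on this node). Compile with: nvcc p2p.cu -o p2p
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        for (int j = 0; j < count; j++) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);   // 1 if i can access j directly
            printf("GPU %d -> GPU %d: peer access %s\n",
                   i, j, ok ? "supported" : "not supported");
        }
    }
    return 0;
}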

GPU Computing (wr15-wr19)

These nodes each have one Nvidia Tesla V100 PCIe card. The deviceQuery output below lists technical details and the hardware restrictions that apply when using this card:

Device 0: "Tesla V100-PCIE-16GB"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 16160 MBytes (16945512448 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1380 MHz (1.38 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 7 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 59 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
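
The restrictions above (e.g. at most 1024 threads per block and the maximum grid dimensions) constrain every kernel launch on this card. A minimal sketch of a vector addition whose launch configuration stays within these limits (illustrative; the file and kernel names are made up):

// vadd.cu -- vector-add sketch with a launch configuration within the
// limits listed above. Compile with: nvcc vadd.cu -o vadd
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];               // guard: grid may overshoot n
}

int main(void) {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));    // unified memory keeps the sketch short
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                           // <= 1024 threads per block
    int blocks = (n + threads - 1) / threads;    // <= 2147483647 blocks in x
    vadd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %.1f\n", c[0]);               // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}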


Networks

There are three networks, and every node is connected to at least two of them. For Ethernet connectivity, two central IBM G8264 switches connect most nodes at 10 Gb/s or 1 Gb/s; the two switches are linked with two 40 Gb/s ports.
The Infiniband network is realized with three 40-port 200 Gb/s Infiniband switches.
The Omni-Path network is realized with two 48-port Omni-Path switches with a 3:1 blocking factor.