Platform for Scientific Computing

WR Cluster Hardware


Overview

In total, the cluster provides approx. 5,000 CPU hardware threads plus approx. 200,000 graphics processor cores.

Configuration

Key Hardware Parameters

| system | wr0 | wr50-wr99 | wr20-wr25 | wr12, wr16-wr19 | wr15 | wr14 | wr44 | wr43 | wr4 | wr5 | wr1 | wr2 | wr3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| processor | Intel Xeon Gold 6150 (Skylake SP) | Intel Xeon Gold 6130 (Skylake EP) | AMD EPYC 7543 (Milan) | Intel Xeon Gold 6130 (Skylake EP) | Intel Xeon Gold 6130 (Skylake EP) | Intel Xeon E5-2697 v3 (Haswell EP) | AMD EPYC 7702 (Rome) | Intel Xeon E5-4657L v2 (Ivy Bridge EP) | AMD EPYC 75F3 (Milan) | Intel Xeon Gold 6128 (Skylake SP) | Intel Xeon E5-2643 v2 (Ivy Bridge EP) | Intel Xeon Bronze (Skylake EP) | Intel Xeon E5-2697 v2 (Ivy Bridge EP) |
| cores per processor | 18 (+2*HT) | 16 (+2*HT) | 32 (+2*HT) | 16 (+2*HT) | 16 (+2*HT) | 14 (+2*HT) | 64 (+2*HT) | 12 (+2*HT) | 32 (+2*HT) | 6 (+2*HT) | 6 (+2*HT) | 6 (+2*HT) | 12 (+2*HT) |
| processors per node | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 4 | 2 | 2 | 2 | 1 | 2 |
| hw threads per node | 72 | 64 | 128 | 64 | 64 | 56 | 256 | 96 | 128 | 24 | 24 | 12 | 48 |
| clock speed [GHz] | 2.7 (3.7 with TurboBoost) | 2.1 (3.7 with TurboBoost) | 2.8 (3.7 with TurboBoost) | 2.1 (3.7 with TurboBoost) | 2.1 (3.7 with TurboBoost) | 2.6 (3.6 with TurboBoost) | 2.0 (3.35 with TurboBoost) | 2.4 (2.9 with TurboBoost) | 2.95 (4.0 with TurboBoost) | 3.5 (3.8 with TurboBoost) | 3.4 (3.7 with TurboBoost) | 1.7 | 2.7 (3.5 with TurboBoost) |
| data cache size L1 (per core) / L2 (per core) / L3 (shared) | 32K / 1M / 24.75M | 32K / 1M / 22M | 32K / 512K / 256M | 32K / 1M / 22M | 32K / 1M / 22M | 32K / 256K / 35M | 32K / 512K / 256M | 32K / 256K / 30M | 96K / 512K / 256M | 32K / 256K / 8M | 32K / 1M / 19.25M | 32K / 256K / 8.25M | 32K / 256K / 30M |
| accelerator | - | - | 4xNvidia A100 | Nvidia Tesla V100 | 4xNvidia Tesla V100 | Nvidia Tesla K80 | - | - | - | - | - | - | - |
| main memory [GB] | 192 | 192 | 512 | 192 | 192 | 128 | 1024 | 768 | 512 | 192 | 128 | 96 | 256 |
| local disk [GB] | 9,600 | 500 | 3,800 | 500 | 500 | 256 | 13,300 | 500 | 300,000 | 240,000 | 112,000 | 53,000 | 6,480 |
| avg. latency L1 / L2 / L3 / main memory [cycles] | 4 / 9 / 38 / 119 | 4 / 9 / 41 / 133 | 4 / 12 / 35 / 101 | 4 / 9 / 39 / 134 | 4 / 9 / 39 / 131 | 4 / 9 / 27 / 98 | 4 / 12 / 27 / 100 | 4 / 10 / 22 / 85 | 4 / 12 / 35 / 94 | 4 / 7 / 29 / 97 | 4 / 8 / 18 / 85 | 4 / 9 / 30 / 69 | 4 / 8 / 20 / 83 |
| stream memory bandwidth core / node [GB/s] | 14.1 / 167.0 | 14.0 / 174.9 | 50.3 / 347.2 | 13.2 / 159.3 | 13.2 / 159.3 | 20.1 / 111.4 | 28.7 / 278.1 | 6.2 / 76.8 | 27.8 / 225.5 | 13.5 / 98.4 | 10.0 / 83.4 | 9.9 / 42.0 | 8.7 / 86.1 |
| measured DGEMM GFlops core / node | 100.1 / 2,490.6 | 103.4 / 1,897.4 | 55.5 / 2,104.8 | 100.6 / 1,838.4 | 101.5 / 1,844.2 | 48.3 / 1,022.3 | 22.5 / 2,171.9 | 21.5 / 672.9 | 22.5 / 1,406.4 | 84.6 / 960.4 | 27.1 / 315.6 | 11.9 / 72.2 | 26.3 / 495.8 |

job queue: wr50-wr99: any, hpc, hpc3; GPU nodes wr20-wr25, wr12, wr16-wr19 and wr15: gpu, gpu4; wr14: any, wr14; wr44: any, wr44; wr43: any, wr43; wr0, wr4, wr5, wr1, wr2, wr3: -
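
The "hw threads per node" values follow from the two rows above them: cores per processor times processors per node, doubled for the two hardware threads per core. The following minimal Python sketch cross-checks that row and the overview figure of approx. 5,000 CPU hardware threads. The per-node values are copied from the table; the node counts are assumptions derived from the name ranges (for example, wr50-wr99 is taken as 50 nodes), and the dictionary layout and function names are illustrative only.

```python
# (number of nodes, cores per processor, processors per node)
# Core and processor counts are copied from the table; node counts are
# assumptions derived from the name ranges.
SYSTEMS = {
    "wr0": (1, 18, 2),
    "wr50-wr99": (50, 16, 2),
    "wr20-wr25": (6, 32, 2),
    "wr12, wr16-wr19": (5, 16, 2),
    "wr15": (1, 16, 2),
    "wr14": (1, 14, 2),
    "wr44": (1, 64, 2),
    "wr43": (1, 12, 4),
    "wr4": (1, 32, 2),
    "wr5": (1, 6, 2),
    "wr1": (1, 6, 2),
    "wr2": (1, 6, 1),
    "wr3": (1, 12, 2),
}

THREADS_PER_CORE = 2  # the "+2*HT" annotation in the table


def hw_threads_per_node(cores_per_processor, processors_per_node):
    """Hardware threads of one node."""
    return cores_per_processor * processors_per_node * THREADS_PER_CORE


def total_hw_threads(systems=SYSTEMS):
    """Hardware threads summed over all nodes of all systems."""
    return sum(
        nodes * hw_threads_per_node(cores, processors)
        for nodes, cores, processors in systems.values()
    )


if __name__ == "__main__":
    for name, (nodes, cores, processors) in SYSTEMS.items():
        print(f"{name}: {hw_threads_per_node(cores, processors)} hw threads per node")
    print("total hw threads:", total_hw_threads())
```

Under these assumptions the sum is 5,068 hardware threads, which agrees with the rounded overview figure.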


Server

Front-End Node (wr0)

This is the front-end node, which acts as the central access point for the whole cluster. Technical details for the processor are listed under Key Hardware Parameters (system wr0).

SSD Fileserver (wr4)

Currently in test mode for GPU nodes.

Hard Disk Fileserver (wr5)

This is the /scratch file server for other systems.

IO Server (wr2)

This is currently used as a test system.

Backup Server (wr1)

Some file systems are backed up regularly.


Hot Standby Server

These systems act as a hot standby for critical servers.

Server Standby (wr3)

Technical details for the processor are listed under Key Hardware Parameters (system wr3).


Cluster Nodes

The CPU cluster has 50 cluster nodes (wr50-wr99) based on the Dell PowerEdge C6420 barebone, with 4 nodes grouped in each PowerEdge C6400 chassis. The specification of each cluster node, including technical details for the processor, is listed under Key Hardware Parameters (system wr50-wr99).


Shared Memory Parallel Systems

Big Memory / Shared Memory Parallelism (wr44)

For shared memory jobs that demand large main memory, high parallelism and/or high local I/O throughput, this many-core shared memory server is available. It is based on the Gigabyte R182-Z92 barebone. Technical details for each processor are listed under Key Hardware Parameters (system wr44).

Big Memory / Shared Memory Parallelism (wr43)

For shared memory jobs that demand large main memory and/or high parallelism, this many-core shared memory server is available. It is based on the Supermicro 8027T-TRF+ barebone. Technical details for the processors are listed under Key Hardware Parameters (system wr43).


Accelerator

GPU Computing (wr20-wr25)

These 6 nodes each have 4 Nvidia A100 SXM4 GPUs connected with NVLink. The following listing shows technical details and hardware restrictions of these cards for one device (4 devices per node in total):

Device 0: "NVIDIA A100-SXM4-80GB"
  CUDA Driver Version / Runtime Version          11.6 / 11.6
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 81070 MBytes (85007794176 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1593 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 5 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 193 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
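
Several values in this listing are related by simple arithmetic, and two commonly quoted peak figures can be derived from it. The following Python sketch shows the calculation. It assumes double-data-rate memory (two transfers per memory clock) and one fused multiply-add (counted as 2 floating-point operations) per CUDA core per cycle; both are common rules of thumb and are not stated in the listing. The function names are illustrative, and the results are theoretical peaks, not measurements.

```python
def cuda_cores(multiprocessors, cores_per_multiprocessor):
    """Total CUDA cores of one device."""
    return multiprocessors * cores_per_multiprocessor


def peak_memory_bandwidth_gb_s(memory_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Theoretical memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 is an assumption (double data rate).
    """
    bytes_per_transfer = bus_width_bits / 8
    return memory_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


def peak_fp32_tflops(cores, gpu_clock_mhz, flops_per_core_per_cycle=2):
    """Theoretical single-precision peak in TFLOPS.

    flops_per_core_per_cycle=2 is an assumption (one fused multiply-add).
    """
    return cores * gpu_clock_mhz * 1e6 * flops_per_core_per_cycle / 1e12


if __name__ == "__main__":
    # Values copied from the A100 listing above.
    cores = cuda_cores(108, 64)
    print("CUDA cores:", cores)
    print("peak memory bandwidth [GB/s]:", peak_memory_bandwidth_gb_s(1593, 5120))
    print("peak FP32 [TFLOPS]:", peak_fp32_tflops(cores, 1410))
```

For the A100 values this gives 6,912 CUDA cores (matching the listing), about 2,039 GB/s memory bandwidth and about 19.5 TFLOPS single precision per device.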

GPU Computing (wr15)

This node has 4 Nvidia Tesla V100 SXM2 GPUs connected with NVLink. The following listing shows technical details and hardware restrictions of these cards for one device (4 devices in total):

Device 0: "Tesla V100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 16160 MBytes (16945512448 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1530 MHz (1.53 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 5 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 26 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
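
The block and grid limits in this listing restrict which kernel launch configurations are valid. Note that the per-dimension block limits (1024, 1024, 64) apply in addition to the limit of 1024 threads per block, so a block may not use the maximum in every dimension at once. The following Python sketch checks a configuration against these limits. The constants are copied from the listing; the function name is hypothetical and not part of any CUDA API.

```python
# Limits copied from the device listing above.
MAX_THREADS_PER_BLOCK = 1024
MAX_BLOCK_DIM = (1024, 1024, 64)
MAX_GRID_DIM = (2147483647, 65535, 65535)


def launch_config_ok(grid, block):
    """Return True if the (x, y, z) grid and block sizes respect the limits."""
    grid, block = tuple(grid), tuple(block)
    if len(grid) != 3 or len(block) != 3:
        return False
    # Every dimension must be at least 1.
    if any(d < 1 for d in grid + block):
        return False
    # Per-dimension limits.
    if any(b > limit for b, limit in zip(block, MAX_BLOCK_DIM)):
        return False
    if any(g > limit for g, limit in zip(grid, MAX_GRID_DIM)):
        return False
    # The product of the block dimensions is limited as well.
    threads = block[0] * block[1] * block[2]
    return threads <= MAX_THREADS_PER_BLOCK


if __name__ == "__main__":
    print(launch_config_ok((4096, 1, 1), (32, 32, 1)))   # 1024 threads per block
    print(launch_config_ok((4096, 1, 1), (32, 32, 2)))   # 2048 threads per block
    print(launch_config_ok((1, 65536, 1), (256, 1, 1)))  # grid y exceeds 65535
```

The first configuration is accepted; the other two are rejected because they exceed the threads-per-block limit and the grid limit in y, respectively.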

GPU Computing (wr12, wr16-wr19)

Each of these nodes has an Nvidia Tesla V100 PCIe GPU. The following listing shows technical details and hardware restrictions of this card:

Device 0: "Tesla V100-PCIE-16GB"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 16160 MBytes (16945512448 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1380 MHz (1.38 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 7 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 59 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

GPU Computing (wr14)

For batch GPU computing, this node is equipped with an Nvidia Tesla K80. The following listing shows technical details and hardware restrictions of this card (each of the two GPU chips is a CUDA device of its own):

Device 0: "Tesla K80"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    3.7
  Total amount of global memory:                 12207 MBytes (12799574016 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Max Clock rate:                            824 MHz (0.82 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 132 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
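
The overview figure of approx. 200,000 graphics processor cores can be cross-checked from the CUDA core counts in the device listings. The following Python sketch does this. The cores per device are copied from the listings, and the devices per node follow the text (4 per node for the A100 and V100 SXM2 systems, 1 for the V100 PCIe nodes, 2 for the K80). The node counts for wr12, wr16-wr19 (5) and wr20-wr25 (6) are derived from the name ranges, and the data layout is illustrative only.

```python
# (number of nodes, CUDA devices per node, CUDA cores per device)
GPU_SYSTEMS = {
    "wr20-wr25 (A100 SXM4)": (6, 4, 6912),
    "wr15 (V100 SXM2)": (1, 4, 5120),
    "wr12, wr16-wr19 (V100 PCIe)": (5, 1, 5120),
    "wr14 (K80, two GPU chips)": (1, 2, 2496),
}


def cores_per_system(nodes, devices_per_node, cores_per_device):
    """CUDA cores of all devices in all nodes of one system group."""
    return nodes * devices_per_node * cores_per_device


def total_gpu_cores(systems=GPU_SYSTEMS):
    """CUDA cores summed over all GPU systems."""
    return sum(cores_per_system(*values) for values in systems.values())


if __name__ == "__main__":
    for name, values in GPU_SYSTEMS.items():
        print(f"{name}: {cores_per_system(*values)} CUDA cores")
    print("total CUDA cores:", total_gpu_cores())
```

Under these assumptions the total is 216,960 CUDA cores, consistent with the rounded figure in the overview.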


Networks

There are three networks, and every node is connected to at least two of them.
For Ethernet connectivity, two central IBM G8264 switches connect most Ethernet nodes with 10 Gb/s or 1 Gb/s. The two switches are linked with two 40 Gb/s ports.
The Infiniband network is realized with two 40-port 200 Gb/s Infiniband switches.
The Omni-Path network is realized with two 48-port Omni-Path switches with 3:1 blocking.
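
To illustrate what these figures imply, the following Python sketch computes the aggregate bandwidth of the Ethernet inter-switch link and a possible port split for a 3:1 blocking ratio. It assumes that "3:1 blocking" means three node-facing ports for every inter-switch port, that all ports run at the same speed and that all 48 ports are in use; the actual cabling is not described here. The function names are illustrative.

```python
def aggregate_link_gbps(links, gbps_per_link):
    """Aggregate bandwidth of several parallel links in Gb/s."""
    return links * gbps_per_link


def split_ports(total_ports, down_ratio, up_ratio):
    """Split the ports of a switch into (node-facing, inter-switch) ports.

    Assumes the blocking ratio down_ratio:up_ratio is realized purely by
    the port counts and that every port of the switch is used.
    """
    parts = down_ratio + up_ratio
    if total_ports % parts != 0:
        raise ValueError("ports cannot be split evenly for this ratio")
    share = total_ports // parts
    return share * down_ratio, share * up_ratio


if __name__ == "__main__":
    print("Ethernet inter-switch link [Gb/s]:", aggregate_link_gbps(2, 40))
    down, up = split_ports(48, 3, 1)
    print(f"Omni-Path ports per switch: {down} node-facing, {up} inter-switch")
```

Under these assumptions, the Ethernet switches are coupled with 80 Gb/s, and each Omni-Path switch would use 36 ports for nodes and 12 ports for links to the other switch.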