one front-end node wr0 for interactive access and administration, as well as several servers
a sub-cluster WR-I with 50 nodes wr50-wr99, each with two 16-core, 2-way hyperthreaded Intel Xeon Gold 6130 processors (in total 3,200-way parallelism), connected with Omni-Path
a sub-cluster WR-II with 15 nodes wr28-wr42 with 12-core, 2-way hyperthreaded Intel Xeon E5-2680 v3 processors (in total 624-way parallelism), connected with 10 Gb Ethernet
a sub-cluster WR-III with 8 nodes wr20-wr27, each with two 8-core, 2-way hyperthreaded Intel Xeon E5-2670 processors (in total 256-way parallelism) and an Nvidia K20m GPU (2,496 GPU cores), connected with 10 Gb Ethernet
L3 cache: 30 MB, 20-way set associative, write-back, inclusive cache
4 memory channels per processor, theoretical memory bandwidth 59.7 GB/s per processor
peak performance: 21.6 GigaFlops per processor core thread
Cluster Nodes WR-I
The WR-I cluster has 50 cluster nodes wr50-wr99, based on the Dell PowerEdge C6420 barebone; four nodes are grouped in one PowerEdge C6400 chassis.
The specification for each cluster node is:
Intel Xeon Gold 6130 and Intel Xeon Gold 6130F processors at 2.1 GHz, in total 64 hardware threads per node
192 GB DDR4-2666 memory
480 GB SSD
100 Gb/s Intel Omni-Path, attached through the Xeon Gold 6130F's integrated fabric
Some technical details for the processor:
L1 data cache: 32 KB, 8-way set associative, write-back, 64 bytes/line
Cluster Nodes WR-III
The specification for each cluster node is:
2 Intel Xeon E5-2670 processors at 2.6 GHz (TurboBoost up to 3.3 GHz)
128 GB memory
120 GB Intel 520 SSD
10 Gb/s Ethernet Intel Server Adapter X520-SR1
Accumulated peak performance for all WR-III cluster nodes without the GPUs is 2.662 TeraFlops (256 hardware threads, each with 10.4 GigaFlops peak performance). The HPL benchmark delivers more than 1.6 TeraFlops for 8 cluster nodes (without GPUs) and n = 326,000.
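The accumulated peak value can be checked with a little arithmetic, assuming 8 double-precision flops per cycle per core (AVX on Sandy Bridge):

8 nodes × 2 processors × 8 cores × 2.6 GHz × 8 flops/cycle = 2662.4 GFlops ≈ 2.662 TeraFlops

Divided by the 256 hardware threads this gives the quoted 10.4 GigaFlops per thread (two hyperthreads share one core's execution units, so each thread contributes half of a core's 20.8 GigaFlops peak).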
Some technical details for the Intel E5-2670 (Sandy Bridge microarchitecture) processor:
L1 data cache per core: 32 KB, 8-way set associative, write-back, 64 bytes/line
L2 unified cache per core: 256 KB, 8-way set associative, write-back, 64 bytes/line, exclusive cache
L3 shared unified cache per processor: 20 MB, 20-way set associative, write-back, 64 bytes/line, inclusive cache
There are several hardware restrictions when using this card:
Device 0: "Tesla K20m"
CUDA Driver Version / Runtime Version 9.2 / 9.2
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 5063 MBytes (5308743680 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Max Clock rate: 706 MHz (0.71 GHz)
Memory Clock rate: 2600 MHz
Memory Bus Width: 320-bit
L2 Cache Size: 1310720 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
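All limits in the listing above can also be queried at run time instead of being hard-coded. Below is a minimal sketch in C using the CUDA runtime API (error handling omitted for brevity); it prints the launch-relevant limits for every device in a node:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        /* the same limits as in the listings above */
        printf("Device %d: %s (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
        printf("  max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("  shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  registers per block:     %d\n", prop.regsPerBlock);
        printf("  warp size:               %d\n", prop.warpSize);
    }
    return 0;
}

A kernel launch that exceeds one of these limits fails at launch time, so checking them once at startup is cheap insurance when the same binary runs on the K20m, K80 and V100 nodes.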
The HPL benchmark delivers more than 2.5 TeraFlops for 8 cluster nodes with GPUs and n = 244,000.
Shared Memory Parallel Systems
Big Memory / Shared Memory Parallelism (wr44)
For shared-memory jobs that demand large main memory, a high degree of parallelism, and/or high local I/O bandwidth, a many-core shared-memory server based on the Gigabyte R182-Z92 barebone is available.
motherboard Gigabyte MZ92-FS0
2x AMD EPYC 7702 processors at 2 GHz (boost up to 3.35 GHz)
1 TB DDR4-2933 memory
4x Micron 9300 MAX-3DWPD (U.2, 3.2 TB) and 1x Samsung 970 EVO (M.2, 500 GB)
10 Gb/s Ethernet Intel Server Adapter X520-SR2
Some technical details for each processor:
64 cores
L1 data cache per core: 32 KB, 8-way set associative, write-back, 64 bytes/line
L2 unified cache per core: 512 KB, 8-way set associative, write-back, 64 bytes/line
L3 unified victim cache shared by all cores: 256 MB, 16-way set associative, 64 bytes/line
theoretical memory bandwidth 187.71 GB/s per processor (with our DDR4-2933 memory)
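The bandwidth figure follows from the EPYC 7702's 8 DDR4 memory channels at the installed DDR4-2933 speed, with 8 bytes per transfer and channel:

8 channels × 2933 MT/s × 8 bytes = 187.7 GB/s per processor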
Big Memory / Shared Memory Parallelism (wr43)
For shared-memory jobs that demand large main memory and/or a high degree of parallelism, a many-core shared-memory server based on the Supermicro 8027T-TRF+ barebone is available.
4-way motherboard Supermicro X9QR7-TF+
4 Intel Xeon E5-4657L processors at 2.4 GHz (TurboBoost up to 2.9 GHz)
768 GB DDR3-1866 memory
500 GB Micron MX200 SSD
10 Gb/s Ethernet Intel Server Adapter X520-SR1
Some technical details for the processors:
12 cores, 2-way hyperthreading
L1 data cache per core: 32 KB, 8-way set associative, write-back, 64 bytes/line
L2 unified cache per core: 256 KB, 8-way set associative, write-back, 64 bytes/line
L3 unified cache shared by all cores: 30 MB, 128-way set associative, write-back, 64 bytes/line
theoretical memory bandwidth 59.7 GB/s per processor
Accelerator
GPU Computing (wr12, wr16-wr19)
Each of these nodes has an Nvidia Tesla V100 PCIe GPU.
Barebone Dell PowerEdge R740
2 Intel Xeon Gold 6130 processors at 2.1 GHz, in total 64 hardware threads
Nvidia Tesla V100 PCIe GPU with 5120 CUDA cores and 640 Tensor cores
peak performance: 14 TFlops (32-bit) / 7 TFlops (64-bit) / 112 TFlops (tensor floating point)
16 GB HBM2 memory
900 GB/s bandwidth to onboard memory
system interface PCIe 3.0 x16
The GPU implements the Nvidia Volta architecture.
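The three peak numbers can be reproduced from the Volta SM configuration: 80 SMs with 64 FP32 units, 32 FP64 units and 8 tensor cores each, where one fused multiply-add counts as 2 flops and one tensor core performs 64 FMA operations per cycle (all at the 1380 MHz boost clock):

FP32: 5120 units × 2 flops × 1.38 GHz ≈ 14.1 TFlops
FP64: 2560 units × 2 flops × 1.38 GHz ≈ 7.1 TFlops
Tensor: 640 cores × 64 FMA × 2 flops × 1.38 GHz ≈ 113 TFlops

matching the quoted values within rounding.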
There are several hardware restrictions when using this card:
Device 0: "Tesla V100-PCIE-16GB"
CUDA Driver Version / Runtime Version 9.2 / 9.2
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 16160 MBytes (16945512448 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1380 MHz (1.38 GHz)
Memory Clock rate: 877 MHz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 7 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 59 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
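Note that the V100 supports compute preemption and cooperative kernel launches, which the older K20m (see above) does not. Code that relies on such features should probe for them rather than assume them; a small C sketch using cudaDeviceGetAttribute:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int dev = 0, coop = 0, preempt = 0;
    /* both attributes simply read as 0 on devices without the feature */
    cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
    cudaDeviceGetAttribute(&preempt, cudaDevAttrComputePreemptionSupported, dev);
    printf("cooperative launch: %s\n", coop ? "yes" : "no");
    printf("compute preemption: %s\n", preempt ? "yes" : "no");
    return 0;
}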
GPU Computing (wr15)
This node has 4 Nvidia Tesla V100 SXM2 GPUs connected with NVLink.
Barebone Dell PowerEdge C4140
2 Intel Xeon Gold 6130 processors at 2.1 GHz, in total 64 hardware threads
192 GB DDR4-2666 memory
480 GB SSD
4 Nvidia Tesla V100 SXM2 GPUs, each with 16 GB memory, connected by NVLink
Nvidia Tesla V100 SXM2 GPU with 5120 CUDA cores and 640 Tensor cores
peak performance: 15.7 TFlops (32-bit) / 7.8 TFlops (64-bit) / 125 TFlops (tensor floating point)
16 GB HBM2 memory
900 GB/s bandwidth to onboard memory
system interface PCIe 3.0 x16
300 GB/s NVLink interconnect bandwidth
The GPUs implement the Nvidia Volta architecture.
There are several hardware restrictions when using these cards (in total 4 devices):
Device 0: "Tesla V100-SXM2-16GB"
CUDA Driver Version / Runtime Version 9.2 / 9.2
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 16160 MBytes (16945512448 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1530 MHz (1.53 GHz)
Memory Clock rate: 877 MHz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 5 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 26 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
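With four devices in one node, data can be moved directly between the GPUs over NVLink once peer access is enabled. A minimal sketch in C (error checks omitted; the device loop mirrors the 4-GPU layout of wr15):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);              /* 4 on wr15 */
    for (int a = 0; a < count; ++a) {
        cudaSetDevice(a);                    /* enabling applies to the current device */
        for (int b = 0; b < count; ++b) {
            if (a == b) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, a, b);
            if (ok)
                cudaDeviceEnablePeerAccess(b, 0);
            printf("peer access %d -> %d: %s\n", a, b, ok ? "enabled" : "not available");
        }
    }
    return 0;
}

Afterwards, transfers with cudaMemcpyPeer() between two cards bypass host memory and can use the NVLink fabric.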
GPU Computing (wr14)
For batch GPU computing, this node is equipped with an Nvidia Tesla K80.
Barebone Supermicro SYS-2028GR-TR with X10DRG-H mainboard
2 Intel Xeon E5-2697 v3 processors at 2.6 GHz, in total 56 hardware threads
128 GB DDR4-2133 memory
Nvidia Tesla K80 with 2x GK210 GPUs and 2x12 GB memory
Some technical details for the GPU:
Nvidia Tesla K80 GPU with 2x2496 = 4992 cores
peak performance: 5.6 TFlops (32-bit) / 1.87 TFlops (64-bit), without boost clocks
0.560 GHz processor clock (boost up to 0.875 GHz)
24 GB memory (2x 12 GB; 6.25% reserved for ECC)
aggregated 480 GB/s bandwidth to onboard GDDR5 memory
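Both peak values follow from the core count and the base clock; the GK210 chip has a 1:3 double- to single-precision ratio:

4992 cores × 2 flops × 0.560 GHz ≈ 5.59 TFlops (32-bit), and about a third of that, ≈ 1.87 TFlops (64-bit)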
There are several hardware restrictions when using this card (each of the two GPU chips appears as a CUDA device of its own):
Device 0: "Tesla K80"
CUDA Driver Version / Runtime Version 9.2 / 9.2
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 12207 MBytes (12799574016 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Max Clock rate: 824 MHz (0.82 GHz)
Memory Clock rate: 2505 MHz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 132 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Manycore Computing (wr13)
This node has an Intel Xeon Phi Knights Landing 7250 (standalone) processor.
global MCDRAM: 16 GB (configurable as a last-level cache or as additional fast memory)
68 cores with 4-way hyperthreading
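When the MCDRAM is configured as additional fast memory (flat mode) it shows up as a separate NUMA node, and allocations can be placed there explicitly. One common way is the memkind library's hbw_malloc interface; a sketch in C (assumes the memkind library is installed, link with -lmemkind):

#include <stdio.h>
#include <hbwmalloc.h>   /* high-bandwidth memory interface of the memkind library */

int main(void)
{
    size_t n = 1 << 20;
    if (hbw_check_available() != 0) {   /* no MCDRAM visible, e.g. in cache mode */
        fprintf(stderr, "no high-bandwidth memory available\n");
        return 1;
    }
    double *a = hbw_malloc(n * sizeof *a);   /* buffer lands in MCDRAM */
    if (!a)
        return 1;
    for (size_t i = 0; i < n; ++i)
        a[i] = (double)i;
    printf("a[42] = %f\n", a[42]);
    hbw_free(a);                             /* hbw buffers need hbw_free */
    return 0;
}

In cache mode no code changes are needed, since the MCDRAM then acts as a transparent cache in front of the DDR4 memory; alternatively, a whole process can be bound to the MCDRAM NUMA node with numactl.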
Networks
There are two networks; every node is connected to one or both of them:
a fast interconnection network for communication in parallel applications (100 Gb/s Omni-Path)
a Gigabit Ethernet network for services
Two central IBM G8264 switches connect most Ethernet nodes with 10 Gb/s or 1 Gb/s. The two switches are linked with two 40 Gb/s ports.
The Omni-Path network is realized with two 48-port Omni-Path switches with a 3:1 blocking factor.
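A 3:1 blocking factor means three times as many node-facing ports as uplink ports per switch; a plausible split for a 48-port switch (an assumption, the exact cabling is not documented here) is:

36 node ports : 12 uplink ports = 3 : 1

so in the worst case three nodes share one uplink's bandwidth when communicating across the two switches.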