Passwords can be changed with the passwd command on wr0. It takes some time (up to several minutes) until such a change is seen by all nodes. Please be aware that account names are the same as for your university accounts, but that the accounts themselves are separate, including their passwords.
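For example, to change your cluster password:
user@wr0: passwd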
Several directories are exported from wr0 to all cluster nodes. This includes user data (e.g. $HOME = /home/username) as well as commonly used application software (e.g. /usr/local).
The /tmp directory is guaranteed to be located on a fast node-local filesystem on all nodes. Within a batch job, the environment variable $TMPDIR contains the name of a job-private fast local directory somewhere in /tmp on a node. This directory is created at job start with a job-specific name and removed at job termination. See the additional description here. If possible, use this dynamically set environment variable to write to and read from temporary files used only in one job run.
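A minimal job script sketch that uses $TMPDIR in this way (the program and file names are only placeholders):
#!/bin/bash
#SBATCH --partition=any        # partition (queue)
#SBATCH --ntasks=1             # use 1 task
#SBATCH --time=10:00           # total runtime of job allocation
# copy the input data to the fast job-private directory
cp input.dat $TMPDIR
cd $TMPDIR
# run the program on the node-local copy (placeholder names)
$SLURM_SUBMIT_DIR/my_program.exe input.dat result.dat
# copy results back before the job ends ($TMPDIR is removed at job termination)
cp result.dat $SLURM_SUBMIT_DIR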
The /scratch directory can be used for larger amounts of data that need to be available longer than a single batch job run and/or need to be shared between nodes. The /scratch directory is shared between all nodes and access to it is slow.
The /scratch2 directory is reserved for users with high I/O demands. It is shared between all nodes, has a medium capacity, and has fast access from most nodes (about 10x faster than /scratch). Get in contact with us if you have high I/O demands.
Please be aware that data on the /tmp filesystems may be deleted without any notice after a certain period of time, and that there is no backup for the scratch filesystems!
mount point | purpose | location | shared on all nodes | daily backup | capacity | access speed | default soft quota |
---|---|---|---|---|---|---|---|
/ | operating system | local | no | no | - | - | - |
/tmp | node-local temporary user data | local | no | no | small | fast | 10 GB |
/usr/local | application software | remote server | yes | yes | - | - | - |
/home | user data | remote server | yes | yes | medium | medium | 50 GB |
/scratch | user data | remote server | yes | no | large | slow | 5 TB |
/scratch2 | user data | remote server | yes | no | large | fast | on request |
We have established quotas on the file systems. Users can query their own quota with the command quota -s --show-mntpoint. The maximum number of files is by default restricted to 1 million / 2 million (soft / hard limit) files per file system. For the /scratch filesystem, the limits are 5 million / 6 million files.
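For example, to display your current quotas:
user@wr0: quota -s --show-mntpoint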
Besides the standard solutions for most users, we have individual nodes with special local I/O features. If you have special, fast, or large I/O demands, contact the system administrator to find the best individual solution.
Software environments are managed with the module command and its several subcommands. A software environment is called a module. Loading a module usually means that internally the search paths for commands, libraries etc. are extended.
For a full command reference of the module system, read
the
documentation.
- module avail: shows a list of available modules
- module whatis: shows verbose information on a module
- module load: loads a named module
- module list: shows all currently loaded modules
- module unload: removes a named module

If you omit the version, the default version of a module is loaded. Example: Instead of
user@wr0: module load gcc/10.1.0
just use
user@wr0: module load gcc
user@wr0: module avail
---------------------------------------------- /usr/local/modules/modulesfiles ----------------------------------------------
amd/default gcc/8.1.0 intel-mpi/2018 metis/5.1.0-32 pin/default
aocc/2.1.0 gcc/8.2.0 intel-mpi/2019 metis/5.1.0-64 python/2.7.15
aocc/2.2.0 gcc/9.1.0 intel-mpi/2020 mpitools/default python/2.7.15-dg
aocc/default gcc/default intel-mpi/default nvtop/default python/default
atop/2.3.0 gnuplot/5.2.3 intel-tools/2018 octave/4.4.0 python3/3.6.5
atop/2.4.0 gnuplot/default intel-tools/2019 octave/default python3/3.7.0
atop/2.5.0 hwloc/1.11.10 intel-tools/2020 ompp/0.8.5 python3/3.8.1
atop/default hwloc/1.11.11 intel-tools/default ompp/default python3/3.9.0
boost/1.72.0 hwloc/1.11.13 java/10.0.1 openmpi/gnu python3/default
cmake/3.11.1 hwloc/2.0.1 java/14.0.1 openmpi/intel sage/8.2
cuda/10.2 hwloc/2.0.4 java/default papi/5.6.0 sage/default
cuda/9.2 hwloc/2.1.0 libFHBRS/3.1 papi/5.7.0 slurm/18.08.3
cuda/default hwloc/2.2.0 libFHBRS/default papi/6.0.0 slurm/19.05.5
dinero4/4.7 hwloc/2.3.0 likwid/4.3.2 papi/default slurm/default
dinero4/default hwloc/default likwid/5.0.1 pgi/18.7 texlive/2018
ffmpeg/4.0 intel-compiler/2018 likwid/default pgi/19.4 texlive/default
ffmpeg/default intel-compiler/2019 matlab/default pgi/20.1 valgrind/3.13.0
gcc/10.1.0 intel-compiler/2020 matlab/R2018a pgi/default valgrind/default
gcc/7.3.0 intel-compiler/default matlab/R2019b pin/3.7
---------------------------------------------- /usr/share/Modules/modulefiles -----------------------------------------------
dot module-git module-info modules null use.own
----------------------------------------------------- /etc/modulefiles ------------------------------------------------------
mpi/compat-openmpi16-x86_64 mpi/mvapich2-2.0-psm-x86_64 mpi/mvapich2-2.2-x86_64 mpi/openmpi-x86_64
mpi/mpich-3.0-x86_64 mpi/mvapich2-2.0-x86_64 mpi/mvapich2-psm-x86_64
mpi/mpich-3.2-x86_64 mpi/mvapich2-2.2-psm2-x86_64 mpi/mvapich2-x86_64
mpi/mpich-x86_64 mpi/mvapich2-2.2-psm-x86_64 mpi/openmpi3-x86_64
user@wr0: module whatis gcc
gcc : GNU compiler suite version 10.1.0
# check current compiler version (system default without loading a module)
user@wr0: gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
# load default version
user@wr0: module load gcc
user@wr0: gcc --version
gcc (GCC) 10.1.0
# unload default version
user@wr0: module unload gcc
user@wr0: gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
A list of the most important software available through the module command is:
name | purpose |
---|---|
aocc | AMD compiler |
cmake | CMake system |
cuda | CUDA development and runtime environment |
gcc | GNU compiler suite |
gnuplot | plot program |
hwloc | detect hardware properties |
intel-compiler | Intel Compiler environment |
intel-mpi | Intel MPI |
intel-tools | Intel development tools |
java | Oracle Java environment |
likwid | development tools |
matlab | Matlab mathematical software with toolboxes |
metis | graph partitioning package |
octave | GNU octave |
ompp | OpenMP tool |
opencl | OpenCL |
openmpi | OpenMPI environment |
papi | Papi performance counter library |
pgi | PGI compiler suite |
python | Python 2 |
python3 | Python 3 |
sage | Mathematical software system sage |
slurm | batch system |
texlive | TeX distribution |
valgrind | Valgrind software analysis tool |
To load modules automatically at login, add the corresponding module load commands to the .bash_profile (executed once per login session) or .bashrc (executed once per shell) file in your home directory.
Example $HOME/.bashrc file:
module load intel-compiler openmpi/intel
Batch jobs are managed by Slurm and are submitted on wr0. Slurm has a command line interface and additionally an X11-based graphical interface to display certain batch system state.
To work with batch jobs, a user usually does a sequence of steps
described below.
An example job script for a sequential program /home/user/job_sequential.sh is:
#!/bin/sh
# start sequential program
./test_sequential.exe
# change directory and execute another sequential program
cd subdir
./another_program.exe
An example job script for an OpenMP program /home/user/job_openmp.sh is:
#!/bin/sh
# set the number of threads
export OMP_NUM_THREADS=16
# start OpenMP program
./test_openmp.exe
An example job script for an MPI program /home/user/job_mpi.sh is:
#!/bin/sh
# load the OpenMPI environment
module load openmpi/gnu
# start here your MPI program
mpirun ./test_mpi.exe
Resource requests are specified in the job script in lines that start with #SBATCH (which is a special form of a shell comment). In each line a certain part of the request can be specified. See the documentation of Slurm sbatch for a list of all available options. Here, only an example is given; more options are summarized later.
An example for such a resource request is:
#!/bin/bash
#SBATCH --partition=any # partition (queue)
#SBATCH --nodes=4 # number of nodes
#SBATCH --ntasks-per-node=32 # number of tasks per node
#SBATCH --mem=4G # memory per node in MB (different units with suffix K|M|G|T)
#SBATCH --time=2:00 # total runtime of job allocation (format D-HH:MM:SS; first parts optional)
#SBATCH --output=slurm.%j.out # filename for STDOUT (%N: nodename, %j: job-ID)
#SBATCH --error=slurm.%j.err # filename for STDERR
# here comes the part with the description of the computational work, for example:
# load the OpenMPI environment
module load openmpi/gnu
# start here your MPI program
mpirun ./test_mpi.exe
The meaning of the lines in this example is:
- --partition=any requests the partition named any. A partition is a class of hardware nodes; for most partitions, all nodes have the same or similar hardware properties.
- --mem=4G asks for 4 GB of main memory on each of the nodes.
- --time=2:00 asks for 2 minutes of usage of the requested resources.

Altogether, the request asks whether 4 nodes of the partition any are available with 32 cores and 4 GB memory each, for 2 minutes.
An alternative way to specify such a request is:
#!/bin/bash
#SBATCH --partition=any # partition (queue)
#SBATCH --ntasks=80 # number of tasks <---------- this is different to above
#SBATCH --mem=4G # memory per node in MB (different units with suffix K|M|G|T)
#SBATCH --time=2:00 # total runtime of job allocation (format D-HH:MM:SS; first parts optional)
#SBATCH --output=slurm.%j.out # filename for STDOUT (%N: nodename, %j: job-ID)
#SBATCH --error=slurm.%j.err # filename for STDERR
# here comes the part with the description of the computational work, for example:
# load the OpenMPI environment
module load openmpi/gnu
# start here your MPI program
mpirun ./test_mpi.exe
In this example, 80 parallel execution units are requested. This can
be fulfilled by 4 x 20-core nodes. But this request may
also be fulfilled by one node with 80 cores or 80 nodes with one core
used on each (and other cores on a node left for other jobs). This
specification gives more freedom to the batch system to find
resources. But the programming model is (usually) restricted to MPI as
a program run may be spread over several nodes.
A job is submitted to the batch system with the sbatch command, using the job script filename as an argument.
Example:
user@wr0: sbatch jobscript.sh
If the system accepts the request (i.e., no syntax error in the script
etc.) the batch system prints a job ID that may be used to refer to
this job.
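For example (the job ID shown is arbitrary):
user@wr0: sbatch jobscript.sh
Submitted batch job 4711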
Please be aware that all modules loaded in your interactive session (where you execute the sbatch command) are also loaded when your submitted batch job starts. The same batch job may therefore behave differently depending on which modules are loaded in the interactive session!
The state of your submitted jobs can be checked with the squeue command.
user@wr0: squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
55 any test2.sh user PD 0:00 2 (Resources)
56 any test3.sh user PD 0:00 2 (Priority)
54 any test1.sh user R 0:08 2 wr[50,51]
In the example the user has 3 jobs submitted that are either running
or still waiting. The column ST marks the job state (R=running, PD=waiting).
After a job has finished, the files specified for STDOUT and STDERR contain the job output, e.g.:
user@wr0: ls -l
-rw------- 1 user fb02 316 Mar 9 07:27 slurm.51.out
-rw------- 1 user fb02 11484 Mar 9 07:27 slurm.52.err
command | meaning |
---|---|
sbatch <shell-script> | submit the shell-script to the batch system |
scancel <jobid> | delete a job with the given job ID, that may be either in running or waiting state |
squeue | show the state of own jobs in queues |
sinfo [options] | show the state of partitions or nodes |
scontrol show job <jobid> | show more details for the job |
Use the sinfo command for a list of available partitions. Associated with each partition are certain policies (hardware properties, maximum number of jobs in queue, maximum runtime per job, scheduling priority, maximum physical memory, special hardware features).
A list of the most important queues is:
queue name | maximum time per job | usable memory | default virt.memory/process | nodes used |
---|---|---|---|---|
any | 72 hours | (dependent on node) | 1 GB | any node |
hpc | 72 hours | 185 GB | 1 GB | wr50-wr99 |
hpc3 | 72 hours | 185 GB | 1 GB | wr50-wr99 |
gpu | 72 hours | 185 GB | 1 GB | wr12,wr15-wr19 |
gpu4 | 72 hours | 185 GB | 1 GB | wr15 |
wr14 | 72 hours | 120 GB | 1 GB | wr14 |
wr43 | 72 hours | 750 GB | 1 GB | wr43 |
wr44 | 72 hours | 1 TB | 1 GB | wr44 |
variable name | purpose | example |
---|---|---|
$SLURM_SUBMIT_DIR | working directory where the job was submitted | /home/user/testdir |
$SLURM_JOB_ID | job ID given to the job | 65 |
$SLURM_JOB_NAME | job name given to the job | testjob |
$SLURM_JOB_NUM_NODES | number of nodes assigned to this job | 2 |
$SLURM_JOB_CPUS_PER_NODE | number of cores per node assigned to this job | 32(x5) (32 cores, on 5 nodes) |
$SLURM_JOB_NODELIST | node names of assigned nodes | wr[50,51] |
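A sketch of how these variables might be used inside a job script (the echo lines are only illustrative):
#!/bin/bash
#SBATCH --partition=any        # partition (queue)
#SBATCH --nodes=2              # number of nodes
#SBATCH --time=2:00            # total runtime of job allocation
# log some information about the allocation into the job output
echo "job $SLURM_JOB_ID ($SLURM_JOB_NAME) submitted from $SLURM_SUBMIT_DIR"
echo "running on $SLURM_JOB_NUM_NODES node(s): $SLURM_JOB_NODELIST"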
For OpenMP jobs, use #SBATCH --ntasks-per-core=1 to use only physical cores (no hyperthreading). Example:
#!/bin/bash
#SBATCH --partition=hpc3 # partition
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-core=1 # use only real cores
#SBATCH --time=2:00 # total runtime of job allocation
export OMP_NUM_THREADS=32
./test_openmp.exe
For MPI jobs, useful options are:
- #SBATCH --cpus-per-task=X to reserve X CPUs per (MPI) task
- #SBATCH --ntasks-per-node=X to spread tasks equally over nodes (X tasks per node in the example)

Example:
#!/bin/bash
#SBATCH --partition=any # partition
#SBATCH --nodes=4 # number of nodes
#SBATCH --ntasks-per-node=32 # number of cores per node
#SBATCH --time=2:00 # total runtime of job allocation
module load openmpi/gnu
mpirun ./test_mpi.exe
To request a specific node, use e.g. #SBATCH --nodelist=wr73 to ask for node wr73.

Within a batch job, the environment variable $TMPDIR is defined with the name of a temporary directory (with fast access) that should be used for fast temporary file storage within the job's scope. The directory is created at job start and deleted when the job finishes.
Example on how to use the environment variable within a program:
char *basedir = getenv("TMPDIR");   /* job-private fast temporary directory */
if (basedir != NULL)
{
    char *filename = "test.dat";
    char allname[1024];
    snprintf(allname, sizeof(allname), "%s/%s", basedir, filename);
    FILE *f = fopen(allname, "w");
    /* ... write to and read from the temporary file ... */
}
If you want information about the resource usage of your jobs, the command sstat helps with that for running jobs.
user@wr0: sstat --format=jobid,maxvmsize,MaxDiskRead 123456.batch
JobID MaxVMSize MaxDiskRead
------------ ---------- ------------
123456.batch 47654040K 39789920
where 123456
is the job number of the running job.
If you need such information for already finished jobs, use the command sacct.
Example:
user@wr0: sacct -j 123456.batch --format="jobid,CPUTime,MaxVMSize,MaxDiskRead"
JobID CPUTime MaxVMSize MaxDiskRead
------------ ---------- ---------- ------------
123456.batch 01:37:04 24173824K 828.59M
GPU nodes are in the following queues:
batch queue | number of nodes | GPU cards | with tensor cores | CUDA compute capability |
---|---|---|---|---|
gpu | 6 | Nvidia V100 | yes | 7.0 |
gpu4 | 1 | 4x Nvidia V100 | yes | 7.0 |
wr14 | 1 | Nvidia K80 | no | 3.7 |
Since CUDA 11.x, GPUs with compute capability less than 5.2 are by default no longer supported (marked as deprecated).
If you want to compile CUDA code that runs on all of our GPU platforms, either work with a CUDA version below 11 (e.g. module load cuda/10.2) or, with CUDA 11.2, compile the code explicitly for all of our platforms with:
user@wr0: nvcc -gencode arch=compute_37,code=sm_37 -gencode arch=compute_70,code=sm_70 ...
i.e. for compute capability 3.7 and 7.0.
To optimize the throughput on our GPU nodes and to minimize waiting times for all GPU users, please follow these rules: use the gpu4 partition only for jobs that really need several GPU cards, and use gpu instead of gpu4 otherwise. Before submitting larger GPU jobs, check the GPU utilization of your program on wr14, which you can access interactively, i.e. ssh wr14.
There you can start a short but representative program run. Afterwards you can get information about the GPU utilization of this program with the command:
user@wr14: nvidia-smi -q -d ACCOUNTING
which results in output like:
==============NVSMI LOG==============
Timestamp : Fri Nov 20 16:54:34 2020
Driver Version : 440.33.01
CUDA Version : 10.2
Attached GPUs : 2
GPU 00000000:84:00.0
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Accounted Processes
Process ID : 48147
GPU Utilization : 85 %
Memory Utilization : 81 %
Max memory usage : 171 MiB
Time : 6936 ms
Is Running : 0
GPU 00000000:85:00.0
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Accounted Processes : None
The lines below Process ID
give you valuable information about the GPU utilization. If more than one process is listed in the output, usually the last entry relates to the last execution on the GPU.
You can also monitor the GPU usage of your running program on a batch node with the command:
user@wr0: srun -s --jobid your-running-job-id --pty nvidia-smi
where your-running-job-id is the job ID of your running GPU program on a batch node. The output looks similar to
user@wr0> srun -s --jobid 123456 --pty nvidia-smi
...
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:01:00.0 Off | Off |
| N/A 28C P0 85W / 500W | 592MiB / 81920MiB | 85% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 386553 C ./vectorproduct.exe 534MiB |
+---------------------------------------------------------------------------------------+
which shows that the current utilization of the GPU used is 85% and that the program that runs kernels on the GPU is called ./vectorproduct.exe and uses 534 MiB of GPU memory.
For interactive development use:
- wr0 for all development where you do not need special hardware (e.g. an accelerator) or want to use MPI
- wr14 for GPU development, i.e. CUDA, OpenCL, and MPI tests with small data sets and a small number of MPI processes. Do a ssh -Y wr14 to work interactively on wr14.
Alternatively, an interactive session on a compute node can be requested through the batch system with the srun command:
srun --x11 --pty /bin/bash
with additional options possible.
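For example, to get an interactive shell on a node of the partition any for 30 minutes (the values are only examples):
user@wr0: srun --partition=any --time=30:00 --x11 --pty /bin/bash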
compiler | name | module command | documentation | safe optimization | debug option | compiler feedback | version |
---|---|---|---|---|---|---|---|
GNU C | cc / gcc | module load gcc | man gcc | -O2 | -g | -ftree-vectorizer-verbose=2 | --version |
Intel C (oneAPI) | icx | module load intel-compiler | man icx | -O2 | -g | -qopt-report | --version |
PGI C | pgcc | module load pgi | man pgcc | -O2 | -g | -Minfo=vec | --version |
GNU C++ | g++ | module load gcc | man g++ | -O2 | -g | -ftree-vectorizer-verbose=2 | --version |
Intel C++ (oneAPI) | icpx | module load intel-compiler | man icpx | -O2 | -g | -qopt-report | --version |
PGI C++ | pgc++ | module load pgi | man pgc++ | -O2 | -g | -Minfo=vec | --version |
GNU Fortran | gfortran | module load gcc | man gfortran | -O2 | -g | -ftree-vectorizer-verbose=2 | --version |
Intel Fortran | ifort | module load intel-compiler | man ifort | -O2 | -g | -vec-report=2 (or higher) | --version |
PGI Fortran | pgfortran | module load pgi | man pgfortran | -O2 | -g | -Minfo=vec | --version |
Oracle Java | javac | module load java | | -O | -g | n.a. | -version |
Examples:
cc -O2 t.c
module load intel-compiler; ifort -O2 t.f
On wr14 there is additionally the whole PGI compiler infrastructure with compilers and the profiler pgprof installed. Documentation is available under /usr/local/PGI/. The tool infrastructure can be used only on wr14. The generated code may be executed on all nodes. Exception: if you use the accelerator functionality of the PGI compiler, the code can be executed only on nodes with a GPU.
The Intel Math Kernel Library (MKL) is available after module load intel-compiler, which expands the include file search paths and library search paths accordingly. It should be used preferably on Intel-based systems, but works also on AMD systems. The library contains basic mathematical functions (BLAS, LAPACK, FFT, ...).
If you use any of the Intel compilers, just add the flag -mkl
as a
compiler and linker flag. Otherwise,
check this page
for the appropriate version and corresponding flags.
Example for Makefile:
CC = icx
CFLAGS = -mkl
LDLIBS = -mkl
By default MKL uses all available cores. You can restrict this number with the
environment variable MKL_NUM_THREADS
, e.g.
export MKL_NUM_THREADS=1
before you start an MKL-based program.
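A minimal sketch of a C program that calls an MKL BLAS routine (compiled e.g. with icx -mkl t.c; the routine cblas_ddot is part of MKL's CBLAS interface):
#include <stdio.h>
#include <mkl.h>   /* MKL header; search paths are set by module load intel-compiler */

int main(void) {
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};
    /* dot product computed by the MKL BLAS routine */
    double d = cblas_ddot(3, x, 1, y, 1);
    printf("dot product = %f\n", d);
    return 0;
}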
compiler | name | module command | documentation | version |
---|---|---|---|---|
GNU OpenMP C/C++ | gcc/g++ -fopenmp | module load gcc | man gcc | --version |
Intel OpenMP C/C++ (oneAPI) | icx/icpx -qopenmp | module load intel-compiler | man icx / man icpx | --version |
PGI OpenMP C/C++ | pgcc/pgCC -mp | module load pgi | man pgcc / man pgCC | --version |
Intel OpenMP Fortran | ifort -qopenmp | module load intel-compiler | man ifort | --version |
GNU OpenMP Fortran | gfortran -fopenmp | module load gcc | man gfortran | --version |
PGI OpenMP Fortran | pgfortran -mp | module load pgi | man pgfortran | --version |
Example: Compile and run an OpenMP C file:
module load intel-compiler
icx -qopenmp -O2 t.c
export OMP_NUM_THREADS=8
./a.out
compiler | name | module command | documentation | version |
---|---|---|---|---|
MPI C (based on gcc) | mpicc | module load openmpi/gnu | see gcc | --version |
MPI C++ (based on gcc) | mpic++ | module load openmpi/gnu | see g++ | --version |
MPI Fortran (based on gfortran) | mpif90 | module load openmpi/gnu | see gfortran | --version |
MPI C (based on Intel oneAPI icx) | mpiicx | module load openmpi/intel | see icx | --version |
MPI C++ (based on Intel oneAPI icpx) | mpiicpx | module load openmpi/intel | see icpx | --version |
MPI Fortran (based on ifort) | mpiifort | module load openmpi/intel | see ifort | --version |
Which MPI compilers are used can be influenced through the module command: with module load openmpi/gnu you can use the GNU compiler environment (gcc, g++, gfortran), and with module load openmpi/intel you can use the Intel compiler environment (icc, icpc, ifort). Be aware that even with module load openmpi/intel the MPI compiler names mpicc etc. are mapped to the GNU compilers. To use an Intel compiler you need to specify Intel's own names, i.e. mpiicx, mpiicpx, mpiifort.
All options discussed in the compiler section also apply here, e.g. optimization.
Example: Compile an MPI C file and generate optimized code:
module load openmpi/intel
mpiicx -O2 t.c
The MPI implementation we use (OpenMPI) has options to influence the communication medium used.
Within one node, MPI processes can communicate through shared memory, Omni-Path, or Ethernet with TCP/IP.
Between nodes, Omni-Path or Ethernet with TCP/IP is possible. OpenMPI usually chooses the most appropriate medium, which means you don't need to specify anything.
But if you want to choose a specific and applicable medium you may specify this
in the call to mpirun
through the --mca btl
specifier:
mpirun --mca btl communication-channels ...
where communication-channels is a comma-separated list of communication media. Possible values are: sm for shared memory, openib for Omni-Path / InfiniBand, and tcp for Ethernet. The last specifier must be self.
mpirun --mca btl tcp,self -np 4 mpi.exe
For CUDA and OpenCL development, log in to wr14 interactively (ssh wr14), as all necessary drivers are installed locally on that system. Production runs on any Tesla card should be done using the appropriate batch queues.
Use module load cuda to load the CUDA environment (specific versions can be selected as well).
Use module load opencl/nvidia
or module load
opencl/intel
to load the OpenCL environment, for Nvidia GPUs or
Intel processors, respectively.
With both modules, the standard environment variables CPATH (for include files) and LIBRARY_PATH (for libraries) are set accordingly, to be used e.g. in a makefile.
module load opencl
cc opencltest.c -lOpenCL
./a.out
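A minimal sketch of what opencltest.c could look like (it only queries the number of available OpenCL platforms):
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_uint num_platforms = 0;
    /* query how many OpenCL platforms are visible on this node */
    cl_int err = clGetPlatformIDs(0, NULL, &num_platforms);
    if (err != CL_SUCCESS) {
        printf("clGetPlatformIDs failed with error %d\n", err);
        return 1;
    }
    printf("found %u OpenCL platform(s)\n", num_platforms);
    return 0;
}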
To compile a CUDA project use the following Makefile template:
# defines
CC = cc
CUDA_CC = nvcc
LDLIBS = -lcudart
# default rules based on suffices
# C
%.o: %.c
$(CC) -c $(CFLAGS) -o $@ $<
# CUDA
%.o: %.cu
$(CUDA_CC) -c $(CUDA_CFLAGS) -o $@ $<
myprogram.exe: myprogram.o kernel.o
$(CC) -o $@ $^ $(LDLIBS)
Here the CUDA kernel and host part is in a file kernel.cu and the non-CUDA part of your program is in a file myprogram.c.
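A minimal sketch of how the two files could look (the kernel add_one and the wrapper run_kernel are only illustrative; the extern "C" wrapper makes the CUDA part callable from the plain C host part):
kernel.cu:
#include <cuda_runtime.h>

// simple CUDA kernel: add 1.0 to every vector element
__global__ void add_one(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// C-callable wrapper so the plain C host part can start the kernel
extern "C" void run_kernel(float *host_x, int n)
{
    float *dev_x;
    cudaMalloc(&dev_x, n * sizeof(float));
    cudaMemcpy(dev_x, host_x, n * sizeof(float), cudaMemcpyHostToDevice);
    add_one<<<(n + 255) / 256, 256>>>(dev_x, n);
    cudaMemcpy(host_x, dev_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_x);
}
myprogram.c:
#include <stdio.h>

/* implemented in kernel.cu */
extern void run_kernel(float *x, int n);

int main(void)
{
    float x[4] = {0.0f, 1.0f, 2.0f, 3.0f};
    run_kernel(x, 4);
    printf("%f %f %f %f\n", x[0], x[1], x[2], x[3]);
    return 0;
}
Depending on the CUDA version it may additionally be necessary to add -lstdc++ to LDLIBS when linking with the plain C compiler.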
For OpenACC programming with the PGI compiler, see /usr/local/PGI/doc for documentation.
Compile such programs interactively on wr14 only. The generated code can be executed on wr14-wr27. You can specify the compute capability as a compiler option. Important: by default the PGI compiler generates debug code that is in general very slow. If you want fast code, add the nodebug option.
Example:
module load pgi
pgcc -acc -ta=nvidia,cc3.5,nodebug openacctest.c
./a.out
where 3.5
corresponds to the compute capability of the target GPU.
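A minimal sketch of what openacctest.c could look like (vector addition with an OpenACC parallel loop; the array size is arbitrary):
#include <stdio.h>
#define N 1000000

int main(void)
{
    static float a[N], b[N], c[N];
    int i;
    for (i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0f * i;
    }
    /* this loop can be offloaded to the GPU by the OpenACC compiler */
    #pragma acc parallel loop copyin(a, b) copyout(c)
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];
    printf("c[1] = %f\n", c[1]);
    return 0;
}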
To measure the peak memory consumption of a program run, you can use
/usr/bin/time -f "%M KB" command
which prints out the peak memory consumption in kilobytes of the command execution.
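For example, to measure the sequential example program shown below:
user@wr0: /usr/bin/time -f "%M KB" ./test_sequential.exe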
Example sequential program test_sequential.c:
#include <stdio.h>
int main(int argc, char **argv) {
printf("Hello world\n");
return 0;
}
CC = cc
CFLAGS = -O
#default rules
%.o: %.c
$(CC) $(CFLAGS) -c $<
%.exe: %.o
$(CC) -o $@ $< $(LDLIBS)
default:: test_sequential.exe
#!/bin/bash
#SBATCH --output=slurm.%j.out # STDOUT
#SBATCH --error=slurm.%j.err # STDERR
#SBATCH --partition=any # partition (queue)
#SBATCH --ntasks=1 # use 1 task
#SBATCH --mem=100 # memory per node in MB (different units with suffix K|M|G|T)
#SBATCH --time=2:00 # total runtime of job allocation (format D-HH:MM:SS; first parts optional)
# start program
./test_sequential.exe
Example OpenMP program test_openmp.c:
#include <stdio.h>
#include <omp.h>
int main(int argc, char **argv) {
#pragma omp parallel
printf("I am the %d. thread of %d threads\n", omp_get_thread_num(), omp_get_num_threads());
return 0;
}
CC = gcc -fopenmp
CFLAGS = -O
#default rules
%.o: %.c
$(CC) $(CFLAGS) -c $<
%.exe: %.o
$(CC) -o $@ $< $(LDLIBS)
default:: test_openmp.exe
#!/bin/bash
#SBATCH --output=slurm.%j.out # STDOUT (%N: nodename, %j: job-ID)
#SBATCH --error=slurm.%j.err # STDERR
#SBATCH --partition=any # partition (queue)
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=32 # number of tasks per node
#SBATCH --mem=1G # memory per node in MB (different units with suffix K|M|G|T)
#SBATCH --time=2:00 # total runtime of job allocation (format D-HH:MM:SS; first parts optional)
# start program (with 24 threads, in total 32 threads were requested by the job)
export OMP_NUM_THREADS=24
./test_openmp.exe
Example MPI program test_mpi.c:
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>
int main(int argc, char **argv) {
int size, rank;
char hostname[80];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
gethostname(hostname, 80);
printf("Hello world from %d (on node %s) of size %d!\n", rank, hostname, size);
MPI_Finalize();
return 0;
}
CC = mpicc
CFLAGS = -O
#default rules
%.o: %.c
$(CC) $(CFLAGS) -c $<
%.exe: %.o
$(CC) -o $@ $< $(LDLIBS)
default:: test_mpi.exe
#!/bin/bash
#SBATCH --output=slurm.%j.out # STDOUT (%N: nodename, %j: job-ID)
#SBATCH --error=slurm.%j.err # STDERR
#SBATCH --partition=any # partition (queue)
#SBATCH --nodes=5 # number of nodes
#SBATCH --ntasks-per-node=32 # number of tasks per node
#SBATCH --mem=4G # memory per node in MB (different units with suffix K|M|G|T)
#SBATCH --time=2:00 # total runtime of job allocation (format D-HH:MM:SS; first parts optional)
module load gcc openmpi/gnu
# start program (with maximum parallelism as specified in job request, for this example 5*32=160)
mpirun ./test_mpi.exe
user@wr0: module load matlab
user@wr0: matlab

This starts the Matlab shell. If you logged in from an X server capable computer and used ssh -Y username@wr0.wr.inf.h-brs.de to log in to wr0, the graphical panel appears on your computer instead of the text panel (see here for details on X server usage).
To run Matlab in batch mode (e.g. within a job script), use:
module load matlab
matlab -nodisplay -nosplash -nodesktop -r "m-file"
where m-file is the name of your Matlab script (a file with the suffix .m).
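A minimal job script sketch for such a batch Matlab run (partition, memory, and runtime are only example values; myscript.m is a hypothetical script):
#!/bin/bash
#SBATCH --partition=any        # partition (queue)
#SBATCH --ntasks=1             # use 1 task
#SBATCH --mem=4G               # memory per node
#SBATCH --time=30:00           # total runtime of job allocation
module load matlab
# run the hypothetical Matlab script myscript.m without a GUI
matlab -nodisplay -nosplash -nodesktop -r "myscript"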
It may help to set export MALLOC_ARENA_MAX=4 before starting Matlab. This influences / restricts Matlab's memory allocation in a multithreaded environment.

To use graphical programs, log in with the ssh option -Y (or, with older ssh versions, also -X), which enables X11 tunneling through your ssh connection.
If your login path goes over multiple computers please be sure to use the -Y
option
for every intermediate host on the path.
user@another_host: ssh -Y user@wr0.wr.inf.h-brs.de

On your local computer (i.e. where the X server is running) you must allow wr0 to open a window. Execute on your local computer in a shell:
xhost +
If necessary, point the DISPLAY variable on wr0 to your local computer:
export DISPLAY=mycomputer.mydomain:0.0
Be careful with a plain xhost + (which would allow any computer to open a window on your X server). To test the setup, start xterm on wr0. A window on your local computer must pop up with a shell on wr0.