It is of the utmost importance to check how your jobs use the requested resources, both regularly during execution and after completion: we cannot stress enough that the allocation must be planned carefully. Here are some tools and tips on how to check on them.
As seen in the section Jobs management, you can obtain information on a job that has been launched using the command:
squeue -j <job_id>
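If you only need the node name, for instance to know where to connect afterwards, the squeue output can be narrowed with a format string. The following is a minimal sketch: the function names job_node and job_summary are hypothetical helpers, not tools installed on the cluster.

```shell
# job_node: print only the node list of a job.
# -h suppresses the header line, -o "%N" keeps just the NodeList column.
job_node() {
  squeue -j "$1" -h -o "%N"
}

# job_summary: print job id, state, elapsed time and node list.
job_summary() {
  squeue -j "$1" -o "%.10i %.10T %.12M %N"
}
```

For example, job_node 123456 prints the node (or nodes) of job 123456; the output is empty while the job is still pending, since no node has been assigned yet.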
Once you know on which node the job is physically running, if you requested one or more GPUs you can connect to that node and use the nvtop command to get some details about the current usage of the graphics cards.
So, let's imagine our job is running on the node called gpu5: the following commands open a shell inside the job's allocation on that node and start nvtop:
srun --pty --jobid <job_id> /bin/bash
nvtop
This will show information about your current usage of the requested GPUs. If fewer GPUs are busy than you requested, say just 1 while you requested 2 cards, then two GPUs have actually been reserved but one is idle: this is not an efficient resource allocation.
In cases like this, please consider halting your execution and launching it again with a smaller amount of resources.
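As a non-interactive complement to nvtop, the same check can be sketched with nvidia-smi, assuming it is available on the GPU nodes and that it lists only the cards assigned to your job (both are assumptions about the cluster configuration). The helper count_idle_gpus is hypothetical and simply counts the cards whose instantaneous utilization is 0%.

```shell
# count_idle_gpus: read "index, utilization" lines on stdin
# (the format produced by the nvidia-smi query shown below)
# and print how many GPUs report 0% utilization.
count_idle_gpus() {
  awk -F', *' '$2 + 0 == 0 { idle++ } END { print idle + 0 }'
}

# Intended use on the node, from the shell opened with srun:
#   nvidia-smi --query-gpu=index,utilization.gpu \
#              --format=csv,noheader,nounits | count_idle_gpus
```

A single reading of 0% does not prove that a card is unused, since the job may simply be between two computation phases: repeat the check a few times before drawing conclusions.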
At any moment during the execution of a job you can get some relevant statistics about it, among them the efficiency of its RAM usage.
A tool called myjobinfo reports the peak amount of RAM used and, from it, the memory efficiency. Once you know the node where your job is running, connect to it and launch the tool followed by the job id. Let's say our node is gpu5, our username is username and our job id is 123456:
srun --pty --jobid 123456 /bin/bash
myjobinfo 123456
The following is an excerpt of the output of the execution of the commands above:
CPUs : 4
GPUs : 1 (a40)
State : RUNNING
Submit time : 2024-11-06T00:00:00
Start time : 2024-11-06T01:00:00
Reserved walltime : 1-00:00:00
Used walltime : 03:56:00
Reserved memory : 8G/core
Max memory used : 11.69G (estimate)
Memory efficiency : 36.54%
Max disk write : 156.55G
Max disk read : 333.94G
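Let's analyse the results. The job reserved 8 GB per core on 4 cores, that is 32 GB in total, while the peak usage so far is about 11.69 GB: hence the memory efficiency of 36.54%. In other words, almost two thirds of the reserved memory have not been used so far.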
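The memory efficiency in the excerpt can be reproduced by hand as peak memory divided by total reserved memory. The sketch below assumes that the total reservation is the per-core value multiplied by the number of CPUs, as the "8G/core" notation suggests; mem_efficiency is a hypothetical helper, not part of myjobinfo.

```shell
# mem_efficiency PEAK_GB PER_CORE_GB CORES
# Print the peak memory as a percentage of the total reservation.
mem_efficiency() {
  awk -v peak="$1" -v per_core="$2" -v cores="$3" 'BEGIN {
    reserved = per_core * cores          # e.g. 8 GB x 4 cores = 32 GB
    printf "%.2f\n", 100 * peak / reserved
  }'
}

# Values taken from the excerpt above.
mem_efficiency 11.69 8 4
```

With the rounded values of the excerpt this gives about 36.53, marginally below the 36.54% shown by the tool, presumably because myjobinfo works on the unrounded peak.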
Upon job completion you might want to check some information on the resources you used. For this the sacct command can be used:
sacct -o reqmem,maxrss,averss,elapsed -j <job_id>
To limit the list of your jobs to the ones executed in a certain period of time you can add a couple of parameters:
--starttime (defaults to midnight)
--endtime (defaults to now)
A more complete command, especially useful when you need to go back through jobs launched in the past and compare their requests with their actual usage, could be the following (to be typed on a single line):
sacct --starttime=2024-11-01 -o submit,node,jobname,jobid,elapsed,reqcpus,reqmem,avevmsize,averss,exitcode,state
sacct is very similar to sstat, which reports on running jobs: the output of sstat may be "overwhelming", while here a more suitable format can be set with the -o option.
To see the full list of options and how to format the output, consult man sacct on the frontend node or the web version.
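When comparing reqmem and maxrss, keep in mind that sacct usually prints them with different unit suffixes (for instance 32G and 4194304K). The following sketch converts such values to MiB so that they can be compared directly; to_mib is a hypothetical helper, and the handling of suffixes (K, M, G, T, plus the trailing c or n that older Slurm versions append to reqmem) is an assumption about the output format of your Slurm version.

```shell
# to_mib VALUE
# Convert a sacct memory value such as 4194304K, 512M or 32G to MiB.
to_mib() {
  awk -v v="$1" 'BEGIN {
    sub(/[cn]$/, "", v)              # drop the per-core/per-node marker, if any
    unit = substr(v, length(v), 1)   # last character is the unit suffix
    num = v + 0                      # awk keeps the leading numeric part
    if (unit == "K") num /= 1024
    else if (unit == "G") num *= 1024
    else if (unit == "T") num *= 1024 * 1024
    # "M" or no suffix: the value is assumed to be in MiB already
    printf "%.0f\n", num
  }'
}
```

The values to convert can be obtained without header and with "|" as separator using sacct -n -P -o reqmem,maxrss -j <job_id>. Note that a value such as 8Gc is a per-core request: to_mib returns 8192, which still has to be multiplied by the number of cores.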
Job efficiency measures how precisely you requested the computing resources. This is a parameter you should not underestimate.
The seff command is very powerful and precise. You can find the job efficiency of a completed job by issuing:
seff <job_id>
Once you know the node where your job is running, connect to it and launch the tool followed by the job number. Let's say our node is runner-01, our username is username and our job number is 123456:
srun --pty --jobid 123456 /bin/bash
seff 123456
The following is an excerpt of the output of the execution of the commands above:
Job ID: 123456
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:48:40
CPU Efficiency: 98.68% of 00:49:19 core-walltime
Memory Utilized: 4.06 GB
Memory Efficiency: 10.39% of 39.06 GB
The above job was very good at requesting computing cores. On the other hand, 40 GB of RAM were requested (and were therefore reserved throughout the job execution!) but just over 4 GB were needed!
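To turn a seff report into a figure for the next submission, the memory efficiency can be extracted from its output and a new request derived from the memory actually used. This is only a sketch: both helpers are hypothetical, the parsing assumes the "Memory Efficiency:" line has the layout shown in the excerpt above, and the 20% safety margin is an arbitrary choice to adapt to your workload.

```shell
# mem_eff_from_seff: read seff output on stdin and print
# the memory efficiency percentage (without the % sign).
mem_eff_from_seff() {
  awk -F': ' '/^Memory Efficiency:/ { sub(/%.*/, "", $2); print $2 }'
}

# suggest_mem_gb USED_GB MARGIN_PERCENT
# Print the used memory plus a safety margin, rounded up to whole GB.
suggest_mem_gb() {
  awk -v used="$1" -v margin="$2" 'BEGIN {
    v = used * (1 + margin / 100)
    r = int(v)
    if (r < v) r++                   # round up
    print r
  }'
}
```

For the job above, seff 123456 | mem_eff_from_seff prints 10.39, and suggest_mem_gb 4.06 20 suggests requesting 5 GB instead of 40 GB.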
Info
As already stated in the Singularity basics section, the email notification generated by the END event not only provides basic information about the job that has just finished, but also includes the output of the seff command. So please include this type of event in order to better calibrate your future resource allocations: END or ALL (which includes END) both work.
Example of line to include in the slurm file:
#SBATCH --mail-type ALL
or
#SBATCH --mail-type END