As anticipated in the previous chapter, once you've written your job file you can submit it to the scheduler for execution using the sbatch command:
sbatch [options] <job_file>
But first you need to connect to the login server using your credentials. Let's say your username is deiuser and your job file is called test.slurm:
ssh deiuser@login.dei.unipd.it
sbatch test.slurm
Upon (successful) job submission, you will get a message like this:
Submitted batch job 129774
Here 129774 is the job ID. This number can be used to check the job's progress, to remove it from the execution queue, and for other operations. You can read the sbatch documentation using man sbatch from the frontend node or by visiting the sbatch web page.
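If you need the job ID inside a script, sbatch also accepts the --parsable option, which prints only the numeric ID instead of the full message (the shell variable name below is just an example):
jobid=$(sbatch --parsable test.slurm)
echo "submitted job $jobid"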
Options specified inside the job file (after the #SBATCH directives) can be overridden or modified on the command line, e.g.:
sbatch --mem 10G --job-name test10G test.slurm
The above command line will set - just for this submission - the job name to "test10G" and will request 10 gigabytes of RAM, possibly overriding what is specified inside the Slurm job file.
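For reference, a minimal test.slurm containing such directives could look like the sketch below; the resource values, the output file names and my_program are purely illustrative (%j in the file names is replaced by the job ID):
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=output_%j.txt
#SBATCH --error=errors_%j.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=00:10:00

# the commands to execute go here
./my_program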
Important
Do not underestimate the importance of correctly calibrating the resources requested by a job.
Please double-check that your CPU, RAM, and GPU requests are adequate for the code you are going to execute. Resource efficiency must be high: an oversized request can prevent other users from launching their jobs and, in the worst case, can lead to the forced termination of a job submitted with an unwise resource allocation.
Consult the chapter "How my jobs are performing" for more information.
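As a quick first check, assuming job accounting is enabled on the cluster, the sacct command reports what a finished job actually consumed, so you can compare it with what you requested (the field selection below is only an example):
sacct -j <job_id> --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,TotalCPU,State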
Once the job enters the queue you can use the squeue command to check its status:
squeue
By default, the squeue command prints the job ID, partition, job name, username, job state, elapsed time, number of nodes, and the list of allocated nodes (or the reason a job is still waiting) for all jobs queued or running within Slurm.
The -l (--long) option can be added to obtain a longer, more detailed listing:
squeue -l
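If neither the default nor the long listing fits your needs, the -o (--format) option lets you choose the columns yourself; the selection below is only an example:
squeue -o "%.10i %.12P %.20j %.8u %.2t %.10M %R"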
This will list all the jobs in the queue. Since the list can be very long, you can filter only your own jobs with the -u or --user option:
squeue [-l] -u <user_id>
Or you can check a single job by providing its job ID:
squeue -j <job_id>
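Filters can also be combined; for instance, the -t (--states) option together with -u shows only your running jobs (-t is not described above, but it is a standard squeue option):
squeue -u <user_id> -t RUNNING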
When checking the status of a job, we may want to call squeue repeatedly to watch for updates. We can accomplish this by adding the --iterate option: squeue will then be re-run every n seconds, giving a frequent, continuous update of the queue information without the need to invoke it by hand:
squeue [-l] -u <user_id> --iterate <num_seconds>
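For example, to refresh the listing of your own jobs every 30 seconds (press Ctrl-C to stop), reusing the deiuser account from the example above:
squeue -u deiuser --iterate 30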
The ST column reports the state code, which tells you the current status of a job. The most common codes are PD (pending), R (running), CG (completing), CD (completed), CA (cancelled), and F (failed).
To see the complete list of output options and command flags use man squeue from the frontend node or visit the squeue web page.
The status of jobs in a running state can be checked with:
sstat
This will show information about CPU usage, task information, node information, virtual memory and more. We can invoke the sstat command as follows:
sstat --jobs=<job_id>
The fields that are shown may be too many for our needs, so it is possible to format the output with the --format option, filtering what is presented.
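For instance, to show only the job ID, the average CPU time and the peak memory usage of the running steps (the field selection is just an illustration; note that for a plain batch job you may have to query the batch step, e.g. <job_id>.batch):
sstat --jobs=<job_id> --format=JobID,AveCPU,MaxRSS,MaxVMSize,NTasks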
To see the complete list of output statistics (e.g. min/max/avg bytes read/written, min/max/avg CPU time, min/max/avg memory usage, etc.) and command options use man sstat from the frontend node or visit the sstat web page.
It is possible to control your personal jobs through the scontrol command.
You can suspend a job:
scontrol suspend <job_id>
You can then resume it:
scontrol resume <job_id>
A job in the queue can be held, that is, it will be given the lowest priority and therefore will not start until it is released:
scontrol hold <job_id>
You can then release it:
scontrol release <job_id>
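scontrol can also print the full record of a job, which is handy for double-checking the resources that were requested and allocated (this subcommand is not described above, but it is a standard part of scontrol):
scontrol show job <job_id>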
To remove a job from the queue use the scancel command:
scancel <job_id>
Alternatively, if you want to remove all of your jobs from the queue, you can use:
scancel -u <user_id>
To cancel all your pending jobs:
scancel -t PENDING -u <user_id>
To cancel one or more jobs by their name:
scancel --name <myJobName>
To cancel multiple jobs, you can provide a space-separated list of job IDs:
scancel <job_id1> <job_id2> <job_id3>
Caution!
There will be no confirmation prompts.
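If you prefer to confirm each cancellation, scancel provides an interactive mode with the -i (--interactive) option, which asks before each matching job is cancelled:
scancel -i -u <user_id>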