Commit 3286fad

tutorials: Add a flux cancel tutorial
1 parent 93dab21 commit 3286fad

2 files changed

+239
-0
lines changed

tutorials/commands/flux-cancel.rst

Lines changed: 237 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,237 @@
1+
.. _flux-cancel:
2+
.. _flux-pkill:
3+
4+
========================
5+
How to Cancel a Flux Job
6+
========================
7+
8+
Inevitably, submitted jobs will have to be canceled for one reason or another. This tutorial
9+
will show you how.
10+
11+
----------------------------
12+
How to Cancel a Job by Jobid
13+
----------------------------
14+
15+
The basic way to cancel a job is through ``flux cancel``. All you have to do is specify
16+
the jobid on the command line. Here is a simple example after submitting a job.
17+
18+
.. code-block:: console
19+
20+
$ flux submit sleep 100
21+
ƒh35Dh5qRyq
22+
23+
$ flux jobs ƒh35Dh5qRyq
24+
JOBID USER NAME ST NTASKS NNODES TIME INFO
25+
ƒh35Dh5qRyq achu sleep R 1 1 13.33s corona174
26+
27+
$ flux cancel ƒh35Dh5qRyq
28+
29+
<snip wait a little bit>
30+
31+
$ flux jobs ƒh35Dh5qRyq
32+
JOBID USER NAME ST NTASKS NNODES TIME INFO
33+
ƒh35Dh5qRyq achu sleep CA 1 1 20.18s corona174
34+
35+
In the above example we submitted a simple job via ``flux submit`` that simply
36+
runs ``sleep``. Passing the resulting jobid to ``flux jobs`` shows that it is
37+
running (state is ``R``).
38+
39+
We cancel the job simply by passing the jobid to ``flux cancel``. After waiting
40+
a little bit, we see that the job is now canceled in ``flux jobs`` (state is ``CA``).
41+
42+
While we only passed one jobid to ``flux cancel`` in this example, multiple jobids can be
43+
passed on the command line to cancel many jobs at once.
44+
45+
Note that in this particular example we happened to know the jobid of our job. If you do
46+
not know the jobid of your job, you can always use ``flux jobs`` to see a list of all
47+
your currently active jobs.
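
If you are scripting, one option is to capture the jobid at submission time, since ``flux submit`` prints it on standard output as shown above. A minimal sketch (the ``jobid`` variable name is only an illustration):

.. code-block:: sh

   # Save the jobid printed by flux submit, then cancel that job later.
   jobid=$(flux submit sleep 100)
   flux cancel "$jobid"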
48+
49+
.. note::
50+
51+
Optionally, a message can be added to the cancellation through the ``-m`` option. For example:
52+
53+
.. code-block:: console
54+
55+
$ flux cancel -m "I ran the wrong command" f3vnSzaaB
56+
57+
This may be useful for later knowing why a job was canceled. You can see the message using
58+
``flux jobs`` and the ``endreason`` format. For example:
59+
60+
.. code-block:: console
61+
62+
$ flux jobs --format=endreason f3vnSzaaB
63+
JOBID USER NAME ST T_INACTIVE INACTIVE-REASON
64+
f3vnSzaaB achu sleep CA May11 08:52 Canceled: I ran the wrong command
65+
66+
---------------------------
67+
Canceling Many of Your Jobs
68+
---------------------------
69+
70+
When you need to cancel many or all of your jobs, you can use either the ``--all`` option with ``flux cancel``
71+
or the ``flux pkill`` command. Let's run through several examples with the ``--all`` option first.
72+
73+
``flux cancel --all`` allows you to cancel jobs without specifying jobids. By default it cancels all of your active
74+
jobs, but several options allow you to target a subset of the jobs.
75+
76+
To start off, let's create 100 jobs that will sleep indefinitely. We will use the special ``--cc`` (carbon copy)
77+
option to ``flux submit`` that will submit 100 duplicate copies of the ``sleep`` job.
78+
79+
.. code-block:: console
80+
81+
$ flux submit --cc=1-100 sleep inf
82+
<snip - many job ids printed out>
83+
84+
$ flux jobs
85+
JOBID USER NAME ST NTASKS NNODES TIME INFO
86+
ƒjTWS5m3 achu sleep S 1 - -
87+
ƒjTWS5m4 achu sleep S 1 - -
88+
ƒjTWS5m5 achu sleep S 1 - -
89+
ƒjTWS5m6 achu sleep S 1 - -
90+
<snip - there are many jobs waiting to be run>
91+
ƒjTWS5m2 achu sleep R 1 1 8.858s corona212
92+
ƒjTWS5m1 achu sleep R 1 1 8.860s corona212
93+
ƒjTUx6Um achu sleep R 1 1 8.870s corona212
94+
ƒjTUx6Uk achu sleep R 1 1 8.870s corona212
95+
ƒjTUx6Uj achu sleep R 1 1 8.870s corona212
96+
ƒjTUx6Ui achu sleep R 1 1 8.871s corona212
97+
<snip - there are many jobs running>
98+
99+
As you can see, we have a lot of jobs waiting to run (state ``S``) and a lot of running jobs (state ``R``).
100+
101+
Let's first run ``flux cancel --all`` without any other options.
102+
103+
.. code-block:: console
104+
105+
$ flux cancel --all
106+
flux-cancel: Canceled 100 jobs (0 errors)
107+
108+
$ flux jobs
109+
JOBID USER NAME ST NTASKS NNODES TIME INFO
110+
111+
As you can see, all the jobs are now canceled. ``flux jobs``
112+
confirms there are no longer any of our jobs running or waiting to run.
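
Because ``flux cancel --all`` acts on every active job you own, it can be worth previewing the effect first. The sketch below assumes your version of ``flux cancel`` supports the ``-n``/``--dry-run`` option described in the :core:man1:`flux-cancel` manpage; that option is not otherwise used in this tutorial.

.. code-block:: sh

   # Assumption: --dry-run is available in your Flux version.
   # It reports how many jobs would be canceled without canceling them.
   flux cancel --all --dry-run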
113+
114+
There are several options to filter the jobs to cancel when using the ``--all`` option. Perhaps the most commonly used
115+
option is the ``-S`` or ``--states`` option. The ``--states`` option specifies the state(s) of a job to cancel. The most
116+
common states to target are ``pending`` and ``running``. Let's resubmit our 100 jobs and see the result
117+
of trying to cancel ``pending`` vs ``running`` jobs.
118+
119+
.. code-block:: console
120+
121+
$ flux submit --cc=1-100 sleep inf
122+
<snip - many job ids printed out>
123+
124+
$ flux cancel --all --states=pending
125+
flux-cancel: Canceled 52 jobs (0 errors)
126+
127+
$ flux cancel --all --states=running
128+
flux-cancel: Canceled 48 jobs (0 errors)
129+
130+
As you can see, ``flux cancel --all --states=pending`` targeted the 52 pending jobs for cancellation and
131+
``flux cancel --all --states=running`` targeted the current 48 running jobs for cancellation.
132+
133+
--------------------------
134+
Canceling with Flux Pkill
135+
--------------------------
136+
137+
The final way to cancel a job is via ``flux pkill``. There are a number of search and filtering options available in
138+
``flux pkill`` which can be seen in the :core:man1:`flux-pkill` manpage.
139+
140+
However, there are two common ways ``flux pkill`` is used. The first is to cancel a range of jobids. The jobid range can be specified
141+
via the format ``jobid1..jobidN``.
142+
143+
It is best shown with an example.
144+
145+
.. code-block:: console
146+
147+
$ flux submit --cc=1-5 sleep inf
148+
ƒ3vEobuhH
149+
ƒ3vEobuhJ
150+
ƒ3vEobuhK
151+
ƒ3vEq5tyd
152+
ƒ3vEq5tye
153+
154+
$ flux jobs
155+
JOBID USER NAME ST NTASKS NNODES TIME INFO
156+
ƒ3vEq5tye achu sleep R 1 1 14.23s corona212
157+
ƒ3vEq5tyd achu sleep R 1 1 14.23s corona212
158+
ƒ3vEobuhK achu sleep R 1 1 14.23s corona212
159+
ƒ3vEobuhJ achu sleep R 1 1 14.23s corona212
160+
ƒ3vEobuhH achu sleep R 1 1 14.23s corona212
161+
162+
Similar to before, we've submitted some sleep jobs. We see all five of the sleep jobs are
163+
running (state ``R``) in the ``flux jobs`` output.
164+
165+
We can tell ``flux pkill`` to cancel the set of 5 jobs by specifying the first and last jobid of this range.
166+
167+
.. code-block:: console
168+
169+
$ flux pkill ƒ3vEobuhH..ƒ3vEq5tye
170+
flux-pkill: INFO: Canceled 5 jobs
171+
172+
$ flux jobs
173+
JOBID USER NAME ST NTASKS NNODES TIME INFO
174+
175+
As you can see ``flux pkill`` canceled the five jobs in the range.
176+
177+
The other common way ``flux pkill`` is used is to cancel jobs with matching job names. For example, you may
178+
submit several different types of jobs and give them different types of names to describe their function. ``flux pkill``
179+
can be used to match on the job names and cancel only the ones that match.
180+
181+
Let's submit several jobs and give them specific names using the ``--job-name`` option.
182+
183+
.. code-block:: console
184+
185+
$ flux submit --job-name=foo sleep inf
186+
ƒ6KjHNcxP
187+
188+
$ flux submit --job-name=foobar sleep inf
189+
ƒ6Limcmju
190+
191+
$ flux submit --job-name=boo sleep inf
192+
ƒ6NCaXCmV
193+
194+
$ flux submit --job-name=baz sleep inf
195+
ƒ6PjZG6jq
196+
197+
$ flux jobs
198+
JOBID USER NAME ST NTASKS NNODES TIME INFO
199+
ƒ6PjZG6jq achu baz R 1 1 38.06s corona212
200+
ƒ6NCaXCmV achu boo R 1 1 41.54s corona212
201+
ƒ6Limcmju achu foobar R 1 1 44.9s corona212
202+
ƒ6KjHNcxP achu foo R 1 1 47.15s corona212
203+
204+
205+
We've submitted four jobs, giving them the job names "foo", "foobar", "boo", and "baz".
206+
207+
Let's cancel the job "boo" via ``flux pkill``.
208+
209+
.. code-block:: console
210+
211+
$ flux pkill boo
212+
flux-pkill: INFO: Canceled 1 job
213+
214+
$ flux jobs
215+
JOBID USER NAME ST NTASKS NNODES TIME INFO
216+
ƒ6PjZG6jq achu baz R 1 1 2.856m corona212
217+
ƒ6Limcmju achu foobar R 1 1 2.97m corona212
218+
ƒ6KjHNcxP achu foo R 1 1 3.008m corona212
219+
220+
As you can see, ``flux pkill`` canceled just one job, the one assigned the name "boo".
221+
222+
``flux pkill`` will actually search for all jobs matching the supplied name, so what would happen if we asked ``flux pkill``
223+
to cancel jobs with the matching name "foo"?
224+
225+
.. code-block:: console
226+
227+
$ flux pkill foo
228+
flux-pkill: INFO: Canceled 2 jobs
229+
230+
$ flux jobs
231+
JOBID USER NAME ST NTASKS NNODES TIME INFO
232+
ƒ6PjZG6jq achu baz R 1 1 4.626m corona212
233+
234+
As you can see, it didn't cancel just 1 job; it canceled 2 jobs: the job "foo" and the job "foobar".
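
To check which jobs a name would match before canceling anything, the companion ``flux pgrep`` command can be run with the same expression. This is a sketch that assumes ``flux pgrep`` is available in your Flux version and accepts the same expressions as ``flux pkill``; see the :core:man1:`flux-pkill` manpage for details.

.. code-block:: sh

   # Assumption: flux pgrep lists the jobids that flux pkill would target.
   flux pgrep foo

   # Cancel only after confirming the list looks right.
   flux pkill foo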
235+
236+
And that's it! If you have any questions, please
237+
`let us know <https://github.com/flux-framework/flux-docs/issues>`_.

tutorials/commands/index.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ Welcome to the Command Tutorials! These tutorials should help you to map specifi
77
with your use case, and then see detailed usage.
88

99
- ``flux submit/flux run`` (:ref:`flux-submit`): "Submit a job in a Flux instance"
10+
- ``flux cancel/flux cancelall/flux pkill`` (:ref:`flux-cancel`): "Cancel a job you submitted"
1011
- ``flux proxy`` (:ref:`ssh-across-clusters`): "Send commands to a Flux instance across clusters using ssh"
1112

1213
This section is currently 🚧️ under construction 🚧️, so please come back later to see more command tutorials!
@@ -17,4 +18,5 @@ This section is currently 🚧️ under construction 🚧️, so please come bac
1718
:caption: Command Tutorials
1819

1920
flux-submit
21+
flux-cancel
2022
ssh-across-clusters
