Commit d8d7134

John Fonner authored and committed
jobs and other edits
1 parent 9324ca5 commit d8d7134

4 files changed, +119 -5 lines changed

README.md

+4 -4

```diff
@@ -16,11 +16,11 @@ This class provides beginners and intermediate users with information on using r
 |--------------|--------------------------------------|
 | 9:00 - 9:20 | [Overview of TACC and the user portal](tacc/00-usingTACC.md) |
 | 9:20 - 9:40 | [Shell basics](shell/00-intro.md) |
-| 9:40 - 10:00 | [File movement and management](shell/01-filedir.md) |
+| 9:40 - 10:00 | [File movement](shell/01-filedir.md) and [management](shell/02-create.md) |
 |10:00 - 10:20 | [Piping commands and looping](shell/03-pipefilter.md) |
-|10:20 - 10:40 | Coffee break |
-|10:40 - 11:00 | Life sciences software and modules |
-|11:00 - 11:20 | Job submission and management |
+|10:20 - 10:40 | Coffee break |
+|10:40 - 11:00 | [Life sciences software and modules](shell/05-modules.md) |
+|11:00 - 11:20 | [Job submission and management](shell/06-jobs.md) |
 |11:20 - 11:40 | Customizing environment settings |
 |11:40 - 12:00 | Catch-up time and questions |
 
```

shell/05-modules.md

+1 -1

```diff
@@ -1,5 +1,5 @@
 Using Software Modules
-----------------------
+======================
 
 ---
 
```
shell/06-jobs.md

+108 (new file)

Job Submission and Management
=============================
---

#### Objectives
* Learn what information the scheduler needs to run a job
* Submit a job on Lonestar
* Follow your job through the execution stages
* Examine the output

---

For this session, we will look at submitting jobs on Lonestar, which uses the SGE scheduler. Other TACC systems, like Stampede, use the SLURM scheduler, which has a different syntax, but the concepts are the same. The TACC user guides always have information on interacting with the scheduler, and they are available here:

* Lonestar: [https://www.tacc.utexas.edu/user-services/user-guides/lonestar-user-guide#running](https://www.tacc.utexas.edu/user-services/user-guides/lonestar-user-guide#running)
* Stampede: [https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running](https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running)
* Maverick: [https://www.tacc.utexas.edu/user-services/user-guides/maverick-user-guide#running](https://www.tacc.utexas.edu/user-services/user-guides/maverick-user-guide#running)
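
Since we will only use SGE here, a rough side-by-side sketch may help if you later move to a SLURM system like Stampede (only a few common directives are shown; consult the user guides above for the authoritative, system-specific lists):

```
# SGE (Lonestar)             # SLURM (Stampede)          # meaning
#$ -N my_job_name            #SBATCH -J my_job_name      # job name
#$ -q normal                 #SBATCH -p normal           # queue / partition
#$ -l h_rt=24:00:00          #SBATCH -t 24:00:00         # max runtime
#$ -o output.$JOB_ID         #SBATCH -o output.%j        # "stdout" output file
```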
### Using a Shared Computing Cluster

When you log in to TACC, you arrive at a "login node" that is shared by many users. If you try running resource-intensive software on a login node, you will be very disappointed: it will run slowly since other users are also using the node, you will make the node less responsive for those other users, and if you don't stop the software quickly, the system administrator will shut it down for you and disable your account. Instead, software should run on dedicated "compute nodes" that you can reserve by interacting with the scheduler. The goal is always to be a responsible user who respects other users while also getting all the computing resources you need to be productive. Let's start by looking at the queue:

```
$ showq -l
ACTIVE JOBS--------------------------
JOBID    JOBNAME    USERNAME    STATE    CORE   HOST   QUEUE    ACCOUNT    REMAINING   STARTTIME
=============================================================================================================
...
WAITING JOBS WITH JOB DEPENDENCIES---
...
UNSCHEDULED JOBS---------------------
...
Total jobs: 205   Active Jobs: 168   Waiting Jobs: 35   Dep/Unsched Jobs: 2
```

The ```showq``` command will give you quite a bit of information about all the jobs currently running on the compute nodes of the system. In this case, we used the ```-l``` flag to get the "long" format, but normally you can leave it off. Let's look at the information this shows us:

* JOBID - a sequential number assigned to every job
* JOBNAME - a name provided by the user for convenience
* USERNAME - the name of the user who launched the job
* STATE - whether the job is currently running, waiting, waiting because of a dependency, or unscheduled. Jobs shown as unscheduled are usually finished, and the system is just cleaning up the node.
* CORE - the number of physical CPU cores reserved for the job. It is up to the user and the software to use those CPU cores efficiently.
* HOST - how many "hosts" or "nodes" are being used by this job
* QUEUE - the queue each job was submitted to. What queues do you see in the list? Each queue uses a different pool of nodes, which may have different hardware specifications. For example, the "largemem" queue has nodes with 1,000 GB of memory.
* ACCOUNT - every job is "charged" to an account. Your accounts are listed when you login.
* REMAINING - the maximum amount of time the job has left to complete. For waiting jobs, this column instead shows:
  * WCLIMIT - the amount of wall clock time requested by the user. If a job is still running after this much time, it is killed, the node is cleaned up, and the node is made ready for another job.
* STARTTIME - when the job started running. For waiting jobs, this field instead shows the QUEUETIME, when the job was submitted to the queue.

A lot of this information was supplied by the user to the scheduler. We'll see how to do this in the next section.

### Creating a Job Submission Script

It's usually easiest to start with a template when creating a job submission script. TACC keeps some examples under the /share/doc/ directory, and there are also examples in the user guides (links at the start of this document). Here is an example for Lonestar:

```
#!/bin/bash
#$ -V                        # pass your current environment variables to the job
#$ -cwd                      # start the job in the directory it was submitted from
#$ -pe 1way 12               # parallel environment and number of cores to reserve
#$ -N my_job_name            # change this to whatever name you want (but no spaces!)
#$ -o output.$JOB_ID         # name of the "stdout" output file
#$ -e error.$JOB_ID          # name of the "stderr" output file
#$ -q normal                 # "normal" for production jobs, "development" for test jobs
#$ -A <YourAllocationHere>   # allocation name; not required if you only have one allocation
#$ -l h_rt=24:00:00          # max runtime; 24:00:00 = 24 hours. You can't run longer than 48 hours.
##$ -M <YourEmail>           # email address (optional; take out the extra "#" to use this feature)
##$ -m e                     # email at end of job ("-m be" would email at the beginning as well)

# make sure we have the modules loaded that we expect:
module list

# run commands here:
myHost=$(hostname)
echo "This job ran on host $myHost at $(date)"
time sleep 120
echo "Job finished at $(date)"
```

Copy the text into your own job submission script on Lonestar and edit it using your favorite text editor. There are multiple ways to do this, depending on your preferences. For this example, we'll show how to do it in the ```nano``` editor.

```
$ nano job_script.sge
```

That command will open nano. You should be able to copy and paste the text from the example above, and you can move around with the arrow keys to make edits. When you're ready to save, Ctrl+X (shown as ^X in the menu) will close the editor, and nano will ask if you want to save your changes.
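
With the script saved, submit it to the scheduler. On SGE systems like Lonestar, the submission command is ```qsub``` (the filename below assumes you used ```job_script.sge``` in the nano step):

```
$ qsub job_script.sge
```

The scheduler will respond with the job ID it assigned, which you can use to identify your job in the queue.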

### Checking Job Status

Time to see if your job ran.

```
$ showq -u
```

We used the ```-u``` argument to show only your jobs (rather than the long list from before). Pressing the up arrow recalls your last command, which makes it easy to repeat ```showq -u``` to check on your job. After a couple of minutes, you should see the job move to the running state, then go to "unscheduled" for a few seconds, and then disappear from the list. Of course, if you used the email option, you can also just wait for the email.
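
If you would rather not re-run the command by hand, one option (assuming the standard Linux ```watch``` utility, which is present on most login nodes) is to repeat it automatically:

```
$ watch -n 15 showq -u
```

This re-runs ```showq -u``` every 15 seconds; press Ctrl+C to exit.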

### Examining the Output

Linux command line software will usually write to files and/or print to the screen. When you run a command on a compute node in "batch" mode, writing to files works like normal. All the nodes can see your $HOME, $WORK, and $SCRATCH filesystems. But what about data printed to the screen?

It is captured in the output files you specified. If you followed the template, you'll end up with something similar to output.12345 and error.12345. Error messages that would have been printed to the screen go to the error.* file, and normal output is redirected to output.*. What information is there other than the output from your commands?
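
For example, with a hypothetical job ID of 12345 (yours will differ), you could inspect the files like this:

```
$ ls output.* error.*
error.12345  output.12345
$ cat output.12345    # the echo lines from the script end up here
$ cat error.12345     # anything the job sent to "stderr" ends up here
```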

## Challenges
* How would you join the "stdout" and "stderr" files into one? (Hint: you can do this in the #$ section of your job submission script)
* Submit a job to a different queue.
* Do all the queues have the same wall clock time limits? Which queue has the shortest runtime limit? Why?
* Do all the queues have the same CPU core limits? Which queue lets you use the most cores? Why?

tacc/00-usingTACC.md

+6
```diff
@@ -36,4 +36,10 @@ To use TACC systems, three things are required:
 3. an active **Allocation** - an "allocation" is a quota of computing time or storage space devoted to a project. A project can have allocations on several different systems to serve their research goals. Allocations are usually active for a year, and principal investigators or their delegates can renew allocations each year for more computing resources.
 
 
+Once you have all three of those things, you should be able to log in to a compute system (whichever ones your Allocation gives you access to) through SSH. Mac users will likely use their "Terminal" to do this, while Windows users can use "Git Bash" or another third-party application. You should use the same username and password that you use in the Portal.
 
+## Challenges
+
+* Log in to the portal and check what projects you are involved in (there should at least be this training project)
+* What allocations does that project have?
+* SSH into Lonestar and look over the "welcome" text
```
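
The SSH step added above can be done from any terminal. As a rough sketch (the exact hostname is listed in the system's user guide, and ```username``` is a placeholder for your TACC account name):

```
$ ssh username@lonestar.tacc.utexas.edu
```

You will be prompted for the same password you use in the Portal, and after the "welcome" text you will land on a login node.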
