Commit 2636d44

Use V100 GPUs
1 parent f3ac55d commit 2636d44

9 files changed

+142
-105
lines changed

.dockerignore

+3
@@ -1,2 +1,5 @@
 **/.ipynb_checkpoints
 **/.git
+build_docker.sh
+Dockerfile
+local/

01_hello_sky/01_hello_sky.ipynb

+49-36
@@ -28,16 +28,16 @@
2828
}
2929
},
3030
"source": [
31-
"SkyPilot is a framework for running machine learning workloads on any cloud.\n",
31+
"SkyPilot is a framework for easily running machine learning workloads on any cloud. \n",
3232
"\n",
33-
"SkyPilot makes it easy to use multiple clouds and reduce your cloud costs.\n",
33+
"Use the clouds **easily** and **cost-effectively**, without needing cloud infra expertise.\n",
3434
"\n",
35-
"_Ease of use & productivity_\n",
35+
"_Ease of use_\n",
3636
"* **Run existing projects on the cloud** with zero code changes\n",
37-
"* **Easily manage jobs** across multiple clusters\n",
38-
"* **Automatic fail-over** to find scarce resources (GPUs) across regions and clouds\n",
39-
"* **Store datasets on the cloud** and access them like you would on a local file system \n",
40-
"* **No cloud lock-in** – seamlessly run your code across different cloud providers (AWS, Azure or GCP)\n",
37+
"* Use a **unified interface** to run on any cloud, without vendor lock-in (currently AWS, Azure, GCP)\n",
38+
"* **Queue jobs** on one or multiple clusters\n",
39+
"* **Automatic failover** to find scarce resources (GPUs) across regions and clouds\n",
40+
"* **Use datasets on the cloud** like you would on a local file system \n",
4141
"\n",
4242
"_Cost saving_\n",
4343
"* Run jobs on **spot instances** with **automatic recovery** from preemptions\n",
@@ -57,7 +57,8 @@
5757
"1. Understand the basic SkyPilot YAML interface (`setup`, `run`).\n",
5858
"2. Run a hello world task on a cloud of your choice.\n",
5959
"3. SSH into your cluster for debugging and development.\n",
60-
"4. Terminate the cluster and understand the cluster lifecycle."
60+
"4. Terminate the cluster and understand the cluster lifecycle.\n",
61+
"5. Run your task seamlessly across different clouds."
6162
]
6263
},
6364
{
@@ -75,8 +76,8 @@
7576
"\n",
7677
"There are points in these notebooks where you may need to edit files outside the notebook and open a terminal to run some commands. These points will be highlighted with **two icons**:\n",
7778
"\n",
78-
"### 📝 - Edit an external file\n",
79-
"### 💻 - Run commands in an interactive terminal window\n",
79+
"### <span style=\"color:green\">[DIY]</span> 📝 - Edit an external file\n",
80+
"### <span style=\"color:green\">[DIY]</span> 💻 - Run commands in an interactive terminal window\n",
8081
"\n",
8182
"Use these icons as a hint to know when to switch away from the current notebook and edit a file or open a terminal.\n",
8283
"\n",
@@ -154,8 +155,8 @@
154155
"cell_type": "markdown",
155156
"metadata": {},
156157
"source": [
157-
"## 📝 Edit `example.yaml` to echo \"Hello SkyPilot\" \n",
158-
"Go ahead and open example.yaml and edit the run field to echo \"Hello SkyPilot\"."
158+
"## <span style=\"color:green\">[DIY]</span> 📝 Edit `example.yaml` to echo \"Hello SkyPilot\" \n",
159+
"**Go ahead and open example.yaml and edit the run field to echo \"Hello SkyPilot\".**"
159160
]
160161
},
161162
{
@@ -176,16 +177,18 @@
176177
"cell_type": "markdown",
177178
"metadata": {},
178179
"source": [
179-
"## 💻 Launch your Sky Task!\n",
180+
"## <span style=\"color:green\">[DIY]</span> 💻 Launch your Sky Task!\n",
180181
"\n",
181-
"In a terminal window, run:\n",
182+
"**In a terminal window, run:**\n",
182183
"\n",
183184
"-------------------------\n",
184185
"```console\n",
185186
"sky launch 01_hello_sky/example.yaml\n",
186187
"```\n",
187188
"-------------------------\n",
188189
"\n",
190+
"This will take about a minute to run.\n",
191+
"\n",
189192
"> **💡 Hint** - If you're using jupyter lab, you can create a terminal in your browser by going to `File -> New -> Terminal`\n",
190193
"\n",
191194
"You'll notice that SkyPilot will perform multiple actions for you:\n",
@@ -265,8 +268,9 @@
265268
"cell_type": "markdown",
266269
"metadata": {},
267270
"source": [
268-
"## 💻 Checking your cluster status with `sky status`\n",
269-
"In a terminal window, run:\n",
271+
"## <span style=\"color:green\">[DIY]</span> 💻 Checking your cluster status with `sky status`\n",
272+
"\n",
273+
"**In a terminal window, run:**\n",
270274
"\n",
271275
"\n",
272276
"-------------------------\n",
@@ -302,14 +306,16 @@
302306
"cell_type": "markdown",
303307
"metadata": {},
304308
"source": [
305-
"## 💻 SSH into the cluster!"
309+
"## <span style=\"color:green\">[DIY]</span> 💻 SSH into the cluster!"
306310
]
307311
},
308312
{
309313
"cell_type": "markdown",
310314
"metadata": {},
311315
"source": [
312-
"For debugging and development, you can easily SSH into a SkyPilot cluster with the `ssh` utility. In a terminal window, run:\n",
316+
"For debugging and development, you can easily SSH into a SkyPilot cluster with the `ssh` utility. \n",
317+
"\n",
318+
"**In a terminal window, run:**\n",
313319
"\n",
314320
"-------------------------\n",
315321
"```console\n",
@@ -344,6 +350,8 @@
344350
"```\n",
345351
"-------------------------\n",
346352
"\n",
353+
"You can use `ctrl+d` to exit from the SSH session.\n",
354+
"\n",
347355
"> **💡 Hint** - To enable the SSH functionality, SkyPilot adds the remote cluster to your `~/.ssh/config`. This means you can use the cluster name alias with other ssh tools, such as `scp`, `rsync`, VSCode and more!"
348356
]
349357
},
@@ -358,17 +366,18 @@
358366
"cell_type": "markdown",
359367
"metadata": {},
360368
"source": [
361-
"SkyPilot clusters can exist in three states, each of which has different billing and storage implications:\n",
369+
"SkyPilot clusters can exist in four states, each of which has different billing and storage implications:\n",
362370
"\n",
363-
"* **`RUNNING`** - Cluster is up and running, you will be billed for the instance and the attached storages.\n",
371+
"* **`INIT`** - Cluster is initializing.\n",
372+
"* **`UP`** - Cluster is up and running. You will be billed for the instance and its attached storage.\n",
364373
"* **`STOPPED`** - Cluster nodes are shut down and their disks are suspended. Your data and node state is safe and the cluster can be restored to running state when required. You will be billed only for the storage.\n",
365374
"* **`TERMINATED`** - Cluster is terminated and all nodes and their attached disks are deleted. These clusters cannot be restarted and will not be shown in `sky status`.\n",
366375
"\n",
367376
"To manage these states, SkyPilot offers three useful commands:\n",
368377
"\n",
369-
"1. **`sky stop`** - stops a `RUNNING` cluster.\n",
378+
"1. **`sky stop`** - stops an `UP` cluster.\n",
370379
"2. **`sky start`** - starts a `STOPPED` cluster.\n",
371-
"2. **`sky down`** - terminates a `RUNNING` or `STOPPED` cluster.\n",
380+
"3. **`sky down`** - terminates an `UP` or `STOPPED` cluster.\n",
372381
"\n",
373382
"> **💡 Hint** - `sky stop` and `sky start` are useful when you want to suspend your experiments for a while but want to quickly resume later. `sky down` is useful to delete a cluster and restart a job from scratch."
374383
]
@@ -377,22 +386,22 @@
377386
"cell_type": "markdown",
378387
"metadata": {},
379388
"source": [
380-
"## 💻 Terminate your cluster!\n",
389+
"## <span style=\"color:green\">[DIY]</span> 💻 Terminate your cluster!\n",
381390
"Now that we are done using the cluster, let's terminate it to stop being billed for it. You can use `sky down` to terminate a cluster.\n",
382391
"\n",
383-
"First, let's get the cluster name with `sky status`.\n",
392+
"**First, get the cluster name with `sky status`.**\n",
384393
"\n",
385394
"-------------------------\n",
386395
"```console\n",
387-
"sky status\n",
396+
"$ sky status\n",
388397
"```\n",
389398
"-------------------------\n",
390399
"\n",
391-
"and then run `sky down` to terminate the cluster\n",
400+
"**and then run `sky down` to terminate the cluster**\n",
392401
"\n",
393402
"-------------------------\n",
394403
"```console\n",
395-
"sky down <cluster-name>\n",
404+
"$ sky down <cluster-name>\n",
396405
"```\n",
397406
"-------------------------"
398407
]
@@ -435,20 +444,24 @@
435444
"cell_type": "markdown",
436445
"metadata": {},
437446
"source": [
438-
"## 💻 Launch example.yaml on google cloud with with the `--cloud` flag"
447+
"## <span style=\"color:green\">[DIY]</span> 💻 Launch example.yaml on Google Cloud with the `--cloud` flag"
439448
]
440449
},
441450
{
442451
"cell_type": "markdown",
443452
"metadata": {},
444453
"source": [
445-
"To override the SkyPilot optimizer and manually pick a cloud, use the `--cloud [aws,gcp,azure]` flag for `sky launch` like so:\n",
454+
"To override the SkyPilot optimizer and manually pick a cloud, use the `--cloud <cloud>` flag for `sky launch`.\n",
455+
"\n",
456+
"**Go ahead and run the task on GCP using the `--cloud gcp` flag.**\n",
446457
"\n",
447458
"-------------------------\n",
448459
"```console\n",
449460
"sky launch 01_hello_sky/example.yaml --cloud gcp\n",
450461
"```\n",
451-
"-------------------------"
462+
"-------------------------\n",
463+
"\n",
464+
"This will take about a minute."
452465
]
453466
},
454467
{
@@ -483,18 +496,18 @@
483496
"cell_type": "markdown",
484497
"metadata": {},
485498
"source": [
486-
"## 💻 Terminate your cluster!\n",
487-
"We're at the end of this notebook and we don't want to let your cluster keep running and rack up a big bill! Let's terminate the cluster with `sky down`.\n",
499+
"## <span style=\"color:green\">[DIY]</span> 💻 Terminate your GCP cluster!\n",
500+
"We're at the end of this notebook and we don't want to let your GCP cluster keep running and rack up a big bill! Let's terminate the cluster with `sky down`.\n",
488501
"\n",
489-
"First, let's get the cluster name with `sky status`.\n",
502+
"**First, get the cluster name with `sky status`.**\n",
490503
"\n",
491504
"-------------------------\n",
492505
"```console\n",
493506
"sky status\n",
494507
"```\n",
495508
"-------------------------\n",
496509
"\n",
497-
"and then run `sky down` to terminate the cluster\n",
510+
"**and then run `sky down` to terminate the cluster**\n",
498511
"\n",
499512
"-------------------------\n",
500513
"```console\n",
@@ -507,7 +520,7 @@
507520
"cell_type": "markdown",
508521
"metadata": {},
509522
"source": [
510-
"#### 🎉 Congratulations! You have learnt the basics of SkyPilot! Please proceed to the next notebook to learn how to use accelerators and object stores in SkyPilot.\n"
523+
"#### 🎉 Congratulations! You have used SkyPilot to seamlessly run tasks on two clouds! Please proceed to the next notebook to learn how to use accelerators and object stores in SkyPilot.\n"
511524
]
512525
}
513526
],
@@ -532,4 +545,4 @@
532545
},
533546
"nbformat": 4,
534547
"nbformat_minor": 4
535-
}
548+
}
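
The cluster lifecycle described in the notebook above (states `INIT`, `UP`, `STOPPED`, `TERMINATED`; commands `sky stop`, `sky start`, `sky down`) can be sketched as a small transition table. This is a simplified model for illustration, not SkyPilot's implementation: the `next_state` helper is hypothetical, it encodes only the transitions the notebook text states, and it assumes a restarted cluster goes straight to `UP`.

```shell
# Hypothetical helper: given a cluster state and a command (stop/start/down),
# print the state the notebook text says the cluster ends up in.
# Any combination the notebook does not describe is rejected.
next_state() {
  case "$1:$2" in
    UP:stop)              echo STOPPED ;;     # sky stop: stops an UP cluster
    STOPPED:start)        echo UP ;;          # sky start: starts a STOPPED cluster
    UP:down|STOPPED:down) echo TERMINATED ;;  # sky down: terminates UP or STOPPED
    *)
      echo "unsupported: '$2' on a cluster in state '$1'" >&2
      return 1
      ;;
  esac
}
```

For example, `next_state UP stop` prints `STOPPED`, while `next_state TERMINATED start` fails, matching the note that terminated clusters cannot be restarted.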

01_hello_sky/example.yaml

+2-2
@@ -6,5 +6,5 @@ setup: |
   sudo apt install cowsay
 
 run: |
-  echo "Hello SkyPilot"
-  cowsay "Moo!"
+  echo "Hello Stranger!"
+  cowsay "Moo!"
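
For orientation, a complete task file in the `setup`/`run` shape this hunk touches might look like the sketch below. It reproduces only the lines visible in the diff; the real `example.yaml` may contain more (the hunk starts at line 6), and the output path `hello_sketch.yaml` is a hypothetical name chosen here. The launch command is left commented out because it provisions billable cloud resources.

```shell
# Write a minimal task file to a scratch location (hypothetical file name).
task_file="${TMPDIR:-/tmp}/hello_sketch.yaml"

cat > "$task_file" <<'EOF'
# setup: commands that prepare the cluster before the task runs
setup: |
  sudo apt install cowsay

# run: the task itself, shown here after the DIY edit
run: |
  echo "Hello SkyPilot"
  cowsay "Moo!"
EOF

# To launch it (starts a billable cluster):
# sky launch "$task_file"
```

Keeping the heredoc delimiter quoted (`'EOF'`) stops the shell from expanding anything inside the YAML, so the file is written exactly as typed.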

02_using_accelerators/02_using_accelerators.ipynb

+35-7
@@ -37,9 +37,11 @@
3737
"cell_type": "markdown",
3838
"metadata": {},
3939
"source": [
40-
"# Listing supported accelerators with `sky show-gpus`\n",
40+
"# <span style=\"color:green\">[DIY]</span> Listing supported accelerators with `sky show-gpus`\n",
4141
"\n",
42-
"To see the list of accelerators supported by SkyPilot , you can use the `sky show-gpus` command. You can run `sky show-gpus` by running the cell below."
42+
"To see the list of accelerators supported by SkyPilot, you can use the `sky show-gpus` command. \n",
43+
"\n",
44+
"**Run `sky show-gpus` by running the cell below:**"
4345
]
4446
},
4547
{
@@ -118,7 +120,7 @@
118120
"cell_type": "markdown",
119121
"metadata": {},
120122
"source": [
121-
"## 📝 Edit `bert.yaml` to use a T4 GPU! \n",
123+
"## <span style=\"color:green\">[DIY]</span> 📝 Edit `bert.yaml` to use a V100 GPU! \n",
122124
"\n",
123125
"We have provided an example YAML (`bert.yaml`) which fine-tunes a BERT model on the SQuAD dataset. However, it does not specify any GPU resources for training.\n",
124126
"\n",
@@ -168,19 +170,21 @@
168170
"cell_type": "markdown",
169171
"metadata": {},
170172
"source": [
171-
"## 💻 Launch your BERT training task!\n",
173+
"## <span style=\"color:green\">[DIY]</span> 💻 Launch your BERT training task!\n",
172174
"\n",
173-
"**After you have edited `bert.yaml` to use T4 GPUs**, open a terminal and use `sky launch` to create a GPU cluster:\n",
175+
"**After you have edited `bert.yaml` to use V100 GPUs, open a terminal and use `sky launch` to create a GPU cluster:**\n",
174176
"\n",
175177
"-------------------------\n",
176178
"```console\n",
177179
"sky launch 02_using_accelerators/bert.yaml\n",
178180
"```\n",
179181
"-------------------------\n",
180182
"\n",
183+
"This will take about two minutes.\n",
184+
"\n",
181185
"### Expected output\n",
182186
"\n",
183-
"After the usual SkyPilot output, you should your task run:\n",
187+
"After the usual SkyPilot output, you should see your task run:\n",
184188
"\n",
185189
"-------------------------\n",
186190
"```console\n",
@@ -205,7 +209,9 @@
205209
"cell_type": "markdown",
206210
"metadata": {},
207211
"source": [
208-
"## 💻 Remember to terminate your cluster once you're done!\n",
212+
"## <span style=\"color:green\">[DIY]</span> 💻 Remember to terminate your cluster once you're done!\n",
213+
"\n",
214+
"**Run `sky status` to get the cluster name and then use `sky down` to terminate it.**\n",
209215
"\n",
210216
"-------------------------\n",
211217
"```console\n",
@@ -216,6 +222,28 @@
216222
"-------------------------"
217223
]
218224
},
225+
{
226+
"cell_type": "markdown",
227+
"metadata": {},
228+
"source": [
229+
"# Transparently training BERT on a different cloud\n",
230+
"Moving this complex BERT training job to a different cloud is easy with SkyPilot. \n",
231+
"\n",
232+
"**Even though this task requires access to accelerators and object stores, SkyPilot can seamlessly run this job on a different cloud with a one-line change: adding the `--cloud` flag to `sky launch`.**\n",
233+
"\n",
234+
"Just like in the previous notebook, you can simply use the same YAML:\n",
235+
"\n",
236+
"-----------------\n",
237+
"```\n",
238+
"sky launch 02_using_accelerators/bert.yaml --cloud gcp\n",
239+
"```\n",
240+
"-----------------\n",
241+
"\n",
242+
"(In the interest of time, we don't run this command in this notebook but feel free to try it later!)\n",
243+
"\n",
244+
"SkyPilot will find instance types on GCP that support the required GPU (V100), and it will also mount the object store when the task runs."
245+
]
246+
},
219247
{
220248
"cell_type": "markdown",
221249
"metadata": {},

02_using_accelerators/bert.yaml

+10-2
@@ -3,17 +3,25 @@ name: bert
 resources:
   accelerators: # Add V100:1 here!
 
+  # For this task, we specify cloud and region for quota reasons.
+  # If these are not specified, SkyPilot will try the cheapest region first, and failover if quota is exceeded.
+  cloud: aws
+  region: us-west-2
+
+# file_mounts specifies any data that must be made available to the task
 file_mounts:
-  /dataset/:
-    source: s3://sky-bert-dataset/
+  /dataset/: # This specifies the destination where the object bucket will be mounted
+    source: s3://sky-bert-dataset/ # The bucket URL to be mounted
 
+# Setup repository.
 setup: |
   git clone https://github.com/huggingface/transformers.git
   cd transformers && git checkout v4.18.0
   pip install -e .
   cd examples/pytorch/question-answering/
   pip install -r requirements.txt
 
+# Run command. Note that the --train_file argument reads from the object store mounted at /dataset
 run: |
   cd transformers/examples/pytorch/question-answering/
   python run_qa.py \
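
Once the `accelerators` placeholder is filled in as the comment in the diff suggests, the resource and mount sections might read as in the sketch below. The values come from the diff itself; the output path `bert_resources_sketch.yaml` is a hypothetical name, and the fragment leaves out the `setup` and `run` sections, so it is not a launchable task file.

```shell
# Write the resources/file_mounts fragment to a scratch file (hypothetical name).
resources_file="${TMPDIR:-/tmp}/bert_resources_sketch.yaml"

cat > "$resources_file" <<'EOF'
name: bert

resources:
  accelerators: V100:1   # one V100 GPU, as the placeholder comment suggests
  cloud: aws             # pinned for quota reasons, per the diff
  region: us-west-2

file_mounts:
  /dataset/:                          # destination path on the cluster
    source: s3://sky-bert-dataset/    # bucket to be mounted
EOF

# Show the GPU request line to confirm the edit landed.
grep 'accelerators:' "$resources_file"
```

Dropping the `cloud` and `region` lines would, per the comment in the diff, let SkyPilot try the cheapest region first and fail over if quota is exceeded.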
