Commit a0f2d50

deepspeed github repo migration (#99)
* deepspeed github repo migration
* better
1 parent 5a8ed77 commit a0f2d50

7 files changed: +14 -14 lines changed

compute/accelerator/README.md

Lines changed: 1 addition & 1 deletion
@@ -716,7 +716,7 @@ AMD GPUs run on [ROCm](https://www.amd.com/en/products/software/rocm.html) - not
 The API is via [Habana SynapseAI® SDK](https://habana.ai/training-software/) which supports PyTorch and TensorFlow.

 Useful integrations:
-- [HF Optimum Habana](https://github.com/huggingface/optimum-habana) which also includes - [DeepSpeed](https://github.com/microsoft/DeepSpeed) integration.
+- [HF Optimum Habana](https://github.com/huggingface/optimum-habana) which also includes - [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) integration.

debug/pytorch.md

Lines changed: 1 addition & 1 deletion
@@ -689,7 +689,7 @@ This was a simple low-dimensional example, but in reality the tensors are much b

 Now you might say that the `1e-6` discrepancy can be safely ignored. And it's often so as long as this is a final result. If this tensor from the example above is now fed through a 100 layers of `matmul`s, this tiny discrepancy is going to compound and spread out to impact many other elements with the final outcome being quite different from the same action performed on another type of device.

-For example, see this [discussion](https://github.com/microsoft/DeepSpeed/issues/4932) - the users reported that when doing Llama-2-7b inference they were getting quite different logits depending on how the model was initialized. To clarify the initial discussion was about Deepspeed potentially being the problem, but in later comments you can see that it was reduced to just which device the model's buffers were initialized on. The trained weights aren't an issue they are loaded from the checkpoint, but the buffers are recreated from scratch when the model is loaded, so that's where the problem emerges.
+For example, see this [discussion](https://github.com/deepspeedai/DeepSpeed/issues/4932) - the users reported that when doing Llama-2-7b inference they were getting quite different logits depending on how the model was initialized. To clarify the initial discussion was about Deepspeed potentially being the problem, but in later comments you can see that it was reduced to just which device the model's buffers were initialized on. The trained weights aren't an issue they are loaded from the checkpoint, but the buffers are recreated from scratch when the model is loaded, so that's where the problem emerges.

 It's uncommon that small variations make much of a difference, but sometimes the difference can be clearly seen, as in this example where the same image is produced on a CPU and an MPS device.
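
The compounding effect described in the paragraph above is easy to reproduce. Below is a minimal sketch (not part of this commit or the book) that stands in for "two devices" with float32 vs float64 on CPU, so it runs anywhere, and shows a tiny per-op rounding difference growing over 100 matmul layers:

```python
# Illustrative sketch: emulate "two devices" with float32 vs float64 on CPU and
# watch a tiny rounding difference compound through repeated matmuls.
import torch

torch.manual_seed(0)
x64 = torch.rand(64, 64, dtype=torch.float64)
x32 = x64.float()

for layer in range(1, 101):
    w64 = torch.randn(64, 64, dtype=torch.float64) / 8  # fresh "layer" weights
    x64 = x64 @ w64
    x32 = x32 @ w64.float()
    # renormalize so only the rounding drift accumulates, not the magnitudes
    x64 = x64 / x64.norm()
    x32 = x32 / x32.norm()
    if layer in (1, 10, 100):
        drift = (x32.double() - x64).norm().item()
        print(f"after {layer:3d} layers: drift = {drift:.1e}")
```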

inference/README.md

Lines changed: 1 addition & 1 deletion
@@ -619,7 +619,7 @@ This section is trying hard to be neutral and not recommend any particular frame

 ### DeepSpeed-FastGen

-[DeepSpeed-FastGen](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen) from [the DeepSpeed team](https://github.com/microsoft/DeepSpeed).
+[DeepSpeed-FastGen](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-fastgen) from [the DeepSpeed team](https://github.com/deepspeedai/DeepSpeed).

 ### TensorRT-LLM

network/benchmarks/README.md

Lines changed: 2 additions & 2 deletions
@@ -114,7 +114,7 @@ Notes:

 You may get results anywhere between 5Gbps and 1600Gbps (as of this writing). The minimal speed to prevent being network bound will depend on your particular training framework, but typically you'd want at least 400Gbps or higher. Though we trained BLOOM on 50Gbps.

-Frameworks that shard weights and optim stages like [Deepspeed](https://github.com/microsoft/DeepSpeed) w/ ZeRO Stage-3 do a lot more traffic than frameworks like [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) which do tensor and pipeline parallelism in addition to data parallelism. The latter ones only send activations across and thus don't need as much bandwidth. But they are much more complicated to set up and run.
+Frameworks that shard weights and optim stages like [Deepspeed](https://github.com/deepspeedai/DeepSpeed) w/ ZeRO Stage-3 do a lot more traffic than frameworks like [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) which do tensor and pipeline parallelism in addition to data parallelism. The latter ones only send activations across and thus don't need as much bandwidth. But they are much more complicated to set up and run.

 Of course, an efficient framework will overlap communications and compute, so that while one stage is fetching data, the other stage in parallel runs computations. So as long as the communication overhead is smaller than compute the network requirements are satisfied and don't have to be super fantastic.

@@ -124,7 +124,7 @@ To get reasonable GPU throughput when training at scale (64+GPUs) with DeepSpeed
 2. 200-400 Gbps is ok
 3. 800-1000 Gbps is ideal

-[full details](https://github.com/microsoft/DeepSpeed/issues/2928#issuecomment-1463041491)
+[full details](https://github.com/deepspeedai/DeepSpeed/issues/2928#issuecomment-1463041491)

 Of course, the requirements are higher for A100 gpu nodes and even higher for H100s (but no such benchmark information has been shared yet).
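
The kind of measurement behind these numbers can be sketched in a few lines. The following is a simplified illustration (not this repo's `all_reduce_bench.py` nor nccl-tests; it assumes a NCCL-capable cluster and a `torchrun` launch) of timing a large `all_reduce` and converting it to the nccl-tests style bus bandwidth:

```python
# Simplified all_reduce bandwidth sketch. Launch with e.g.:
#   torchrun --nproc_per_node=8 allreduce_sketch.py
import os
import time

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

payload_gib = 1  # 1 GiB payload; contents don't matter for a bandwidth test
tensor = torch.zeros(payload_gib * 2**28, dtype=torch.float32, device="cuda")

for _ in range(5):  # warmup
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

# nccl-tests "busbw" convention: a ring all-reduce moves ~2*(n-1)/n of the payload
n = dist.get_world_size()
busbw = payload_gib * (2 * (n - 1) / n) / elapsed
if dist.get_rank() == 0:
    print(f"{payload_gib} GiB all_reduce: {elapsed*1e3:.1f} ms/iter, busbw ~{busbw:.1f} GiB/s")
dist.destroy_process_group()
```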

stabs/incoming.md

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ Make a new benchmark section:

 1. nccl-tests
 2. `all_reduce_bench.py`
-3. https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/communication
+3. https://github.com/deepspeedai/DeepSpeedExamples/tree/master/benchmarks/communication
 4. like nccl-tests, another common set of benchmarks used at HPC sites are the OSU microbenchmarks like osu_lat, osu_bw, and osu_bibw.

 https://mvapich.cse.ohio-state.edu/benchmarks/

training/fault-tolerance/README.md

Lines changed: 1 addition & 1 deletion
@@ -309,7 +309,7 @@ for batch in iterator:
     train_step(batch)
 ```

-footnote: don't do this unless you really have to, since caching makes things faster. Ideally figure out the fragmentation issue instead. For example, look up `max_split_size_mb` in the doc for [`PYTORCH_CUDA_ALLOC_CONF`](https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) as it controls how memory is allocated. Some frameworks like [Deepspeed](https://github.com/microsoft/DeepSpeed) solve this by pre-allocating tensors at start time and then reuse them again and again preventing the issue of fragmentation altogether.
+footnote: don't do this unless you really have to, since caching makes things faster. Ideally figure out the fragmentation issue instead. For example, look up `max_split_size_mb` in the doc for [`PYTORCH_CUDA_ALLOC_CONF`](https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) as it controls how memory is allocated. Some frameworks like [Deepspeed](https://github.com/deepspeedai/DeepSpeed) solve this by pre-allocating tensors at start time and then reuse them again and again preventing the issue of fragmentation altogether.

 footnote: this simplified example would work for a single node. For multiple nodes you'd need to gather the stats from all participating nodes and find the one that has the least amount of memory left and act upon that.
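
The multi-node variant hinted at in the second footnote can be sketched roughly as follows (an illustration, not code from the book; it assumes an already-initialized torch.distributed process group, e.g. via `torchrun` with the NCCL backend):

```python
# Rough sketch: every rank reports its free GPU memory, an all_reduce(MIN)
# finds the worst-off rank, and all ranks then act on that global value.
import torch
import torch.distributed as dist

def global_min_free_gpu_memory() -> int:
    """Minimum free GPU memory (bytes) across all participating ranks."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    t = torch.tensor([free_bytes], dtype=torch.int64, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.MIN)
    return int(t.item())

# hypothetical usage inside the training loop of the single-node example above
threshold_bytes = 2 * 2**30  # made-up 2GiB threshold
if global_min_free_gpu_memory() < threshold_bytes:
    # at least one rank is running low - all ranks free their caches in sync
    torch.cuda.empty_cache()
```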

training/model-parallelism/README.md

Lines changed: 7 additions & 7 deletions
@@ -145,7 +145,7 @@ PyTorch:
 - [PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel](https://arxiv.org/abs/2304.11277)

 Main DeepSpeed ZeRO Resources:
-- [Project's github](https://github.com/microsoft/deepspeed)
+- [Project's github](https://github.com/deepspeedai/DeepSpeed)
 - [Usage docs](https://www.deepspeed.ai/getting-started/)
 - [API docs](https://deepspeed.readthedocs.io/en/latest/index.html)
 - [Blog posts](https://www.microsoft.com/en-us/research/search/?q=deepspeed)

@@ -372,7 +372,7 @@ Here it's important to see how DP rank 0 doesn't see GPU2 and DP rank 1 doesn't
 Since each dimension requires at least 2 GPUs, here you'd need at least 4 GPUs.

 Implementations:
-- [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+- [DeepSpeed](https://github.com/deepspeedai/DeepSpeed)
 - [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
 - [Varuna](https://github.com/microsoft/varuna)
 - [SageMaker](https://arxiv.org/abs/2111.05972)

@@ -393,7 +393,7 @@ This diagram is from a blog post [3D parallelism: Scaling to trillion-parameter
 Since each dimension requires at least 2 GPUs, here you'd need at least 8 GPUs.

 Implementations:
-- [DeepSpeed](https://github.com/microsoft/DeepSpeed) - DeepSpeed also includes an even more efficient DP, which they call ZeRO-DP.
+- [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) - DeepSpeed also includes an even more efficient DP, which they call ZeRO-DP.
 - [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
 - [Varuna](https://github.com/microsoft/varuna)
 - [SageMaker](https://arxiv.org/abs/2111.05972)

@@ -448,7 +448,7 @@ During compute each sequence chunk is projected onto QKV and then gathered to th

 ![deepspeed-ulysses sp](images/deepspeed-ulysses.png)

-[source](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-ulysses)
+[source](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-ulysses)

 On the diagram:
 1. Input sequences N are partitioned across P available devices.

@@ -468,7 +468,7 @@ Example: Let's consider seqlen=8K, num_heads=128 and a single node of num_gpus=8
    b. the attention computation is done on the first 16 sub-heads
 the same logic is performed on the remaining 7 GPUs, each computing 8k attention over its 16 sub-heads

-You can read the specifics of the very efficient comms [here](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-ulysses#significant-communication-volume-reduction).
+You can read the specifics of the very efficient comms [here](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-ulysses#significant-communication-volume-reduction).

 DeepSpeed-Ulysses keeps communication volume consistent by increasing GPUs proportional to message size or sequence length.
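
To make the worked example above concrete, here is a tiny shape-only sketch (my own illustration, not DeepSpeed code) of the Ulysses re-partitioning: before attention each GPU holds a sequence chunk with all the heads, and the all-to-all turns that into the full sequence with a subset of the heads:

```python
# Shape-only sketch of the Ulysses all-to-all for the example above:
# seqlen=8K, num_heads=128, num_gpus=8
seqlen, num_heads, head_dim, num_gpus = 8192, 128, 128, 8  # head_dim is made up

# before the all-to-all: each GPU holds seqlen/P tokens of all the heads
per_gpu_before = (seqlen // num_gpus, num_heads, head_dim)  # (1024, 128, 128)
# after the all-to-all: each GPU holds all tokens of num_heads/P heads
per_gpu_after = (seqlen, num_heads // num_gpus, head_dim)   # (8192, 16, 128)

# the per-GPU element count is unchanged - only the partitioning axis moved
assert per_gpu_before[0] * per_gpu_before[1] == per_gpu_after[0] * per_gpu_after[1]
print("before all-to-all:", per_gpu_before)
print("after  all-to-all:", per_gpu_after, "-> 8k attention over 16 sub-heads per GPU")
```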

@@ -496,7 +496,7 @@ Paper: [Ring Attention with Blockwise Transformers for Near-Infinite Context](ht

 SP Implementations:
 - [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
-- [Deepspeed](https://github.com/microsoft/DeepSpeed)
+- [Deepspeed](https://github.com/deepspeedai/DeepSpeed)
 - [Colossal-AI](https://colossalai.org/)
 - [torchtitan](https://github.com/pytorch/torchtitan)

@@ -659,7 +659,7 @@ If the network were to be 5x faster, that is 212GBs (1700Gbps) then:

 which would be insignificant comparatively to the compute time, especially if some of it is successfully overlapped with the commute.

-Also the Deepspeed team empirically [benchmarked a 176B model](https://github.com/microsoft/DeepSpeed/issues/2928#issuecomment-1463041491) on 384 V100 GPUs (24 DGX-2 nodes) and found that:
+Also the Deepspeed team empirically [benchmarked a 176B model](https://github.com/deepspeedai/DeepSpeed/issues/2928#issuecomment-1463041491) on 384 V100 GPUs (24 DGX-2 nodes) and found that:

 1. With 100 Gbps IB, we only have <20 TFLOPs per GPU (bad)
 2. With 200-400 Gbps IB, we achieve reasonable TFLOPs around 30-40 per GPU (ok)
