diff --git a/.github/scripts/fbgemm_gpu_docs.bash b/.github/scripts/fbgemm_gpu_docs.bash
index 0087e741cd..719b940555 100644
--- a/.github/scripts/fbgemm_gpu_docs.bash
+++ b/.github/scripts/fbgemm_gpu_docs.bash
@@ -82,6 +82,14 @@ build_fbgemm_gpu_docs () {
   # shellcheck disable=SC2086
   print_exec conda env config vars set ${env_prefix} SPHINX_LINT=1
 
+  # print_exec mkdir -p ../fbgemm_gpu/experimental/
+  # print_exec cp -r ../experimental/gen_ai/gen_ai ../fbgemm_gpu/experimental/ || return 1
+  # print_exec cp -r ../experimental/gemm/triton_gemm ../fbgemm_gpu/experimental/ || return 1
+  # print_exec ls -la ../fbgemm_gpu
+  # print_exec ls -la ../fbgemm_gpu/experimental/
+  # print_exec ls -la ../fbgemm_gpu/experimental/gen_ai
+  # print_exec ls -la ../fbgemm_gpu/experimental/triton_gemm
+
   # shellcheck disable=SC2086
   if print_exec conda run ${env_prefix} make clean doxygen; then
     echo "[DOCS] Doxygen build passed"
diff --git a/README.md b/README.md
index 24b1adcc97..4c4c39f810 100644
--- a/README.md
+++ b/README.md
@@ -19,7 +19,7 @@ documentation.
   applications. Please see [the documentation](fbgemm_gpu/README.md) for more
   information.
 
-* **FBGEMM_GPU GenAI**: A collection of PyTorch GPU operator libraries that are
+* **FBGEMM GenAI**: A collection of PyTorch GPU operator libraries that are
   designed for generative AI applications, such as FP8 row-wise quantization
   and collective communications. Please see [the documentation](fbgemm_gpu/README.md)
   for more information.
diff --git a/fbgemm_gpu/docs/Doxyfile.in b/fbgemm_gpu/docs/Doxyfile.in
index 74c983b53a..7af310067a 100644
--- a/fbgemm_gpu/docs/Doxyfile.in
+++ b/fbgemm_gpu/docs/Doxyfile.in
@@ -2343,7 +2343,8 @@ SEARCH_INCLUDES        = YES
 INCLUDE_PATH           = "../codegen" \
                          "../include" \
                          "../src" \
-                         "../../include/fbgemm"
+                         "../../include/fbgemm" \
+                         "../experimental/gen_ai/src"
 
 # You can use the INCLUDE_FILE_PATTERNS tag to specify one or more wildcard
 # patterns (like *.h and *.hpp) to filter out the header-files in the
diff --git a/fbgemm_gpu/docs/src/fbgemm_genai/development/BuildInstructions.rst b/fbgemm_gpu/docs/src/fbgemm_genai/development/BuildInstructions.rst
index 64a51a0f92..12437bf9a5 100644
--- a/fbgemm_gpu/docs/src/fbgemm_genai/development/BuildInstructions.rst
+++ b/fbgemm_gpu/docs/src/fbgemm_genai/development/BuildInstructions.rst
@@ -26,9 +26,26 @@ Follow the instructions to set up the Conda environment:
 
 #. :ref:`fbgemm-gpu.build.setup.env`
 #. :ref:`fbgemm-gpu.build.setup.cuda`
+#. :ref:`fbgemm-gpu.build.setup.cutlass`
 #. :ref:`fbgemm-gpu.build.setup.tools.install`
 #. :ref:`fbgemm-gpu.build.setup.pytorch.install`
 
+.. _fbgemm-gpu.build.setup.cutlass:
+
+Install CUTLASS
+~~~~~~~~~~~~~~~
+
+CUTLASS should already be available in the FBGEMM repository as a git
+submodule (see :ref:`fbgemm-gpu.build.prepare`). The following include paths
+are already added to the CMake configuration:
+
+.. code:: cmake
+
+  set(THIRDPARTY ${FBGEMM}/external)
+
+  ${THIRDPARTY}/cutlass/include
+  ${THIRDPARTY}/cutlass/tools/util/include
+
 
 Other Pre-Build Setup
 ---------------------
diff --git a/fbgemm_gpu/docs/src/fbgemm_genai/development/InstallationInstructions.rst b/fbgemm_gpu/docs/src/fbgemm_genai/development/InstallationInstructions.rst
index 4e52bc33f9..1a854102ed 100644
--- a/fbgemm_gpu/docs/src/fbgemm_genai/development/InstallationInstructions.rst
+++ b/fbgemm_gpu/docs/src/fbgemm_genai/development/InstallationInstructions.rst
@@ -30,7 +30,7 @@ Follow the instructions for setting up the runtime environment:
 
 
 Install the FBGEMM GenAI Package
--------------------------------
+--------------------------------
 
 Install through PyTorch PIP
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -43,11 +43,11 @@ PyTorch PIP is the preferred channel for installing FBGEMM GenAI:
 
   # CUDA Nightly
   pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
-  pip install --pre fbgemm-genai --index-url https://download.pytorch.org/whl/nightly/cu126/
+  pip install --pre fbgemm-gpu-genai --index-url https://download.pytorch.org/whl/nightly/cu126/
 
   # CUDA Release
   pip install torch --index-url https://download.pytorch.org/whl/cu126/
-  pip install fbgemm-genai --index-url https://download.pytorch.org/whl/cu126/
+  pip install fbgemm-gpu-genai --index-url https://download.pytorch.org/whl/cu126/
 
   # Test the installation
   python -c "import torch; import fbgemm_gpu.experimental.gen_ai"
diff --git a/fbgemm_gpu/docs/src/fbgemm_genai/development/TestInstructions.rst b/fbgemm_gpu/docs/src/fbgemm_genai/development/TestInstructions.rst
index b5f45d90cd..78362c1ff6 100644
--- a/fbgemm_gpu/docs/src/fbgemm_genai/development/TestInstructions.rst
+++ b/fbgemm_gpu/docs/src/fbgemm_genai/development/TestInstructions.rst
@@ -6,7 +6,7 @@ benchmarks (in the ``fbgemm_gpu/experimental/gen_ai/bench/`` directory) provide
 good examples on how to use FBGEMM GenAI operators.
 
 Set Up the FBGEMM GenAI Test Environment
---------------------------------------
+----------------------------------------
 
 After an environment is available from building / installing the FBGEMM GenAI
 package, additional packages need to be installed for tests to run correctly:
diff --git a/fbgemm_gpu/docs/src/fbgemm_genai/index.rst b/fbgemm_gpu/docs/src/fbgemm_genai/index.rst
index 0c3f14d802..50d72a5f81 100644
--- a/fbgemm_gpu/docs/src/fbgemm_genai/index.rst
+++ b/fbgemm_gpu/docs/src/fbgemm_genai/index.rst
@@ -16,3 +16,15 @@ FP8 row-wise quantization and collective communications.
    development/BuildInstructions
    development/InstallationInstructions
    development/TestInstructions
+
+.. toctree::
+   :maxdepth: 2
+   :caption: FBGEMM GenAI Overview
+
+   overview/Overview
+
+.. toctree::
+   :maxdepth: 2
+   :caption: FBGEMM GenAI Python API
+
+   python-api/quantize_ops
diff --git a/fbgemm_gpu/docs/src/fbgemm_genai/overview/Overview.rst b/fbgemm_gpu/docs/src/fbgemm_genai/overview/Overview.rst
new file mode 100644
index 0000000000..ba3a48ebc4
--- /dev/null
+++ b/fbgemm_gpu/docs/src/fbgemm_genai/overview/Overview.rst
@@ -0,0 +1,48 @@
+FBGEMM GenAI Overview
+=====================
+
+High Level Overview
+-------------------
+
+FBGEMM FP8 rowwise quantization kernels have been officially adopted in the
+`Llama3.1 release <https://fb.workplace.com/groups/221503021668016/permalink/1900301927121442/>`__.
+FP8 has been applied across the Llama3 8B, 70B, and 405B models.
+Notably, for the 405B model, FP8 enables inference on a single node, achieving
+a 2x throughput improvement over the BF16 baseline running on two nodes with
+pipeline parallelism. Externally, it has been mentioned in the
+`Llama3 paper <https://ai.meta.com/research/publications/the-llama-3-herd-of-models/>`__ and
+`repo <https://github.com/meta-llama/llama-stack/blob/main/llama_stack/models/llama/quantize_impls.py>`__,
+as well as by `HuggingFace <https://huggingface.co/docs/transformers/main/quantization/fbgemm_fp8>`__,
+`vLLM <https://blog.vllm.ai/2024/07/23/llama31.html>`__, and
+`TensorRT-LLM <https://developer.nvidia.com/blog/supercharging-llama-3-1-across-nvidia-platforms/>`__.
+
+FBGEMM GenAI FP8 supports a variety of configurations:
+
+* GEMM Operators: {CUTLASS, CK, Triton} x {BF16, FP8} x
+  {tensor-wise, row-wise, block-wise} x {Nvidia H100, AMD MI300x}.
+* High/Low Precision Conversion Kernels: (FP32 / BF16 <-> FP8) with scaling
+  options {tensor-wise, row-wise, block-wise}, across hardware platforms
+  {Nvidia H100, AMD MI300x} and programming options {Triton, CUDA/HIP}.
+
+Besides FP8 support, FBGEMM GenAI operators also support:
+
+* Customized AllReduce communications (reduced latency for small message sizes).
+* GQA: optimized specifically for decoding cases, as detailed in PyTorch's
+  blog on `INT4 decoding <https://pytorch.org/blog/int4-decoding/>`__.
+* KV cache quantizations.
+* Rotary Positional Embedding (RoPE).
+
+FP8 Core API Functions
+----------------------
+
+.. code:: python
+
+  # Rowwise quantize (channel-wise) the weight from BF16 to FP8
+  wq, w_scale = torch.ops.fbgemm.quantize_fp8_per_row(w)
+  # Rowwise quantize (token-wise) the activation from BF16 to FP8
+  xq, x_scale = torch.ops.fbgemm.quantize_fp8_per_row(
+      x, num_tokens, activation_scale_ub
+  )
+  # Rowwise FP8 GEMM with FP8 inputs and BF16 output
+  y = torch.ops.fbgemm.f8f8bf16_rowwise(
+      xq,
+      wq,
+      x_scale,
+      w_scale,
+      use_fast_accum=True,
+  )
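+
+As a quick illustration, the three calls above can be combined into a single
+rowwise FP8 GEMM flow. The sketch below is illustrative only: the tensor
+shapes are arbitrary, and it assumes a CUDA device with the
+``fbgemm-gpu-genai`` package installed so that the ``torch.ops.fbgemm``
+operators are registered.
+
+.. code:: python
+
+  import torch
+  import fbgemm_gpu.experimental.gen_ai  # noqa: F401  (registers torch.ops.fbgemm)
+
+  # Illustrative BF16 activation (M x K) and weight (N x K) on the GPU
+  x = torch.randn(128, 256, dtype=torch.bfloat16, device="cuda")
+  w = torch.randn(512, 256, dtype=torch.bfloat16, device="cuda")
+
+  # Rowwise quantize both operands from BF16 to FP8, with per-row scales
+  xq, x_scale = torch.ops.fbgemm.quantize_fp8_per_row(x)
+  wq, w_scale = torch.ops.fbgemm.quantize_fp8_per_row(w)
+
+  # Rowwise FP8 GEMM producing a BF16 output of shape (M, N)
+  y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale, use_fast_accum=True)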
+
+See :ref:`genai-quantize-ops-stable-api` for more details.
diff --git a/fbgemm_gpu/docs/src/fbgemm_genai/python-api/quantize_ops.rst b/fbgemm_gpu/docs/src/fbgemm_genai/python-api/quantize_ops.rst
new file mode 100644
index 0000000000..0f9682297f
--- /dev/null
+++ b/fbgemm_gpu/docs/src/fbgemm_genai/python-api/quantize_ops.rst
@@ -0,0 +1,16 @@
+Quantization Operators
+======================
+
+.. automodule:: fbgemm_gpu
+
+.. _genai-quantize-ops-stable-api:
+
+Stable API
+----------
+
+.. autofunction:: fbgemm_gpu.experimental.gen_ai.quantize.int4_row_quantize
+
+.. autofunction:: fbgemm_gpu.experimental.gen_ai.quantize.quantize_int4_preshuffle
+
+Other API
+---------
diff --git a/fbgemm_gpu/docs/src/fbgemm_gpu/development/BuildInstructions.rst b/fbgemm_gpu/docs/src/fbgemm_gpu/development/BuildInstructions.rst
index 2a8332ea2f..84d2f939f4 100644
--- a/fbgemm_gpu/docs/src/fbgemm_gpu/development/BuildInstructions.rst
+++ b/fbgemm_gpu/docs/src/fbgemm_gpu/development/BuildInstructions.rst
@@ -151,21 +151,6 @@ cuDNN package for the given CUDA version:
   wget -q "${cudnn_url}" -O cudnn.tar.xz
   tar -xvf cudnn.tar.xz
 
-Install CUTLASS
-~~~~~~~~~~~~~~~
-
-This section is only applicable to building the experimental FBGEMM_GPU GenAI
-module. CUTLASS should be already be available in the repository as a git
-submodule (see :ref:`fbgemm-gpu.build.prepare`). The following include paths
-are already added to the CMake configuration:
-
-.. code:: cmake
-
-  set(THIRDPARTY ${FBGEMM}/external)
-
-  ${THIRDPARTY}/cutlass/include
-  ${THIRDPARTY}/cutlass/tools/util/include
-
 .. _fbgemm-gpu.build.setup.rocm:
 
diff --git a/fbgemm_gpu/docs/src/index.rst b/fbgemm_gpu/docs/src/index.rst
index f6a74b0815..392ddab407 100644
--- a/fbgemm_gpu/docs/src/index.rst
+++ b/fbgemm_gpu/docs/src/index.rst
@@ -121,3 +121,15 @@ Table of Contents
    fbgemm_genai/development/BuildInstructions
    fbgemm_genai/development/InstallationInstructions
    fbgemm_genai/development/TestInstructions
+
+.. toctree::
+   :maxdepth: 1
+   :caption: FBGEMM GenAI Overview
+
+   fbgemm_genai/overview/Overview
+
+.. toctree::
+   :maxdepth: 1
+   :caption: FBGEMM GenAI Python API
+
+   fbgemm_genai/python-api/quantize_ops
diff --git a/fbgemm_gpu/fbgemm_gpu/__init__.py b/fbgemm_gpu/fbgemm_gpu/__init__.py
index be06d84ac0..8160663884 100644
--- a/fbgemm_gpu/fbgemm_gpu/__init__.py
+++ b/fbgemm_gpu/fbgemm_gpu/__init__.py
@@ -56,7 +56,7 @@ def _load_library(filename: str, no_throw: bool = False) -> None:
     "experimental/gen_ai/fbgemm_gpu_experimental_gen_ai",
 ]
 
-# NOTE: While FBGEMM_GPU GenAI is not available for ROCm yet, we would like to
+# NOTE: While FBGEMM GenAI is not available for ROCm yet, we would like to
 # be able to install the existing CUDA variant of the package onto ROCm systems,
 # so that we can at least use the Triton GEMM libraries from experimental/gemm.
 # But loading fbgemm_gpu package will trigger load-checking the .SO file for the