
[fbgemm_gpu] Add more docs scaffolding for GenAI #3944


Open · wants to merge 1 commit into base: main
8 changes: 8 additions & 0 deletions .github/scripts/fbgemm_gpu_docs.bash
@@ -82,6 +82,14 @@ build_fbgemm_gpu_docs () {
# shellcheck disable=SC2086
print_exec conda env config vars set ${env_prefix} SPHINX_LINT=1

# print_exec mkdir -p ../fbgemm_gpu/experimental/
# print_exec cp -r ../experimental/gen_ai/gen_ai ../fbgemm_gpu/experimental/ || return 1
# print_exec cp -r ../experimental/gemm/triton_gemm ../fbgemm_gpu/experimental/ || return 1
# print_exec ls -la ../fbgemm_gpu
# print_exec ls -la ../fbgemm_gpu/experimental/
# print_exec ls -la ../fbgemm_gpu/experimental/gen_ai
# print_exec ls -la ../fbgemm_gpu/experimental/triton_gemm

# shellcheck disable=SC2086
if print_exec conda run ${env_prefix} make clean doxygen; then
echo "[DOCS] Doxygen build passed"
2 changes: 1 addition & 1 deletion README.md
@@ -19,7 +19,7 @@ documentation.
applications. Please see [the documentation](fbgemm_gpu/README.md) for more
information.

* **FBGEMM_GPU GenAI**: A collection of PyTorch GPU operator libraries that are
* **FBGEMM GenAI**: A collection of PyTorch GPU operator libraries that are
designed for generative AI applications, such as FP8 row-wise quantization and
collective communications. Please see [the documentation](fbgemm_gpu/README.md)
for more information.
3 changes: 2 additions & 1 deletion fbgemm_gpu/docs/Doxyfile.in
@@ -2343,7 +2343,8 @@ SEARCH_INCLUDES = YES
INCLUDE_PATH = "../codegen" \
"../include" \
"../src" \
"../../include/fbgemm"
"../../include/fbgemm" \
"../experimental/gen_ai/src"

# You can use the INCLUDE_FILE_PATTERNS tag to specify one or more wildcard
# patterns (like *.h and *.hpp) to filter out the header-files in the
17 changes: 17 additions & 0 deletions fbgemm_gpu/docs/src/fbgemm_genai/development/BuildInstructions.rst
@@ -26,9 +26,26 @@ Follow the instructions to set up the Conda environment:

#. :ref:`fbgemm-gpu.build.setup.env`
#. :ref:`fbgemm-gpu.build.setup.cuda`
#. :ref:`fbgemm-gpu.build.setup.cutlass`
#. :ref:`fbgemm-gpu.build.setup.tools.install`
#. :ref:`fbgemm-gpu.build.setup.pytorch.install`

.. _fbgemm-gpu.build.setup.cutlass:

Install CUTLASS
~~~~~~~~~~~~~~~

CUTLASS should already be available in the FBGEMM repository as a git
submodule (see :ref:`fbgemm-gpu.build.prepare`). The following include paths
are already added to the CMake configuration:

.. code:: cmake

    set(THIRDPARTY ${FBGEMM}/external)

    ${THIRDPARTY}/cutlass/include
    ${THIRDPARTY}/cutlass/tools/util/include


Other Pre-Build Setup
---------------------
fbgemm_gpu/docs/src/fbgemm_genai/development/InstallationInstructions.rst
@@ -30,7 +30,7 @@ Follow the instructions for setting up the runtime environment:


Install the FBGEMM GenAI Package
------------------------------
--------------------------------

Install through PyTorch PIP
~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -43,11 +43,11 @@ PyTorch PIP is the preferred channel for installing FBGEMM GenAI:

# CUDA Nightly
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
pip install --pre fbgemm-genai --index-url https://download.pytorch.org/whl/nightly/cu126/
pip install --pre fbgemm-gpu-genai --index-url https://download.pytorch.org/whl/nightly/cu126/

# CUDA Release
pip install torch --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-genai --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu-genai --index-url https://download.pytorch.org/whl/cu126/

# Test the installation
python -c "import torch; import fbgemm_gpu.experimental.gen_ai"
fbgemm_gpu/docs/src/fbgemm_genai/development/TestInstructions.rst
@@ -6,7 +6,7 @@ benchmarks (in the ``fbgemm_gpu/experimental/gen_ai/bench/`` directory) provide
good examples on how to use FBGEMM GenAI operators.

Set Up the FBGEMM GenAI Test Environment
---------------------------------------
----------------------------------------

After an environment is available from building / installing the FBGEMM GenAI
package, additional packages need to be installed for tests to run correctly:
12 changes: 12 additions & 0 deletions fbgemm_gpu/docs/src/fbgemm_genai/index.rst
@@ -16,3 +16,15 @@ FP8 row-wise quantization and collective communications.
development/BuildInstructions
development/InstallationInstructions
development/TestInstructions

.. toctree::
    :maxdepth: 2
    :caption: FBGEMM GenAI Overview

    overview/Overview

.. toctree::
    :maxdepth: 2
    :caption: FBGEMM GenAI Python API

    python-api/quantize_ops
48 changes: 48 additions & 0 deletions fbgemm_gpu/docs/src/fbgemm_genai/overview/Overview.rst
@@ -0,0 +1,48 @@
FBGEMM GenAI Overview
=====================

High Level Overview
-------------------

FBGEMM FP8 rowwise quantization kernels have been officially adopted in the
`Llama3.1 release <https://fb.workplace.com/groups/221503021668016/permalink/1900301927121442/>`__.
FP8 has been applied across the Llama3 8B, 70B, and 405B models. Notably, for
the 405B model, FP8 enables inference on a single node, achieving a 2x
throughput improvement over the BF16 baseline running on two nodes with
pipeline parallelism. Externally, it has been mentioned in the
`Llama3 paper <https://ai.meta.com/research/publications/the-llama-3-herd-of-models/>`__ and
`repo <https://github.com/meta-llama/llama-stack/blob/main/llama_stack/models/llama/quantize_impls.py>`__,
`HuggingFace <https://huggingface.co/docs/transformers/main/quantization/fbgemm_fp8>`__,
`vLLM <https://blog.vllm.ai/2024/07/23/llama31.html>`__, and
`TensorRT-LLM <https://developer.nvidia.com/blog/supercharging-llama-3-1-across-nvidia-platforms/>`__.

FBGEMM GenAI FP8 supports a variety of configurations:

* GEMM Operators: {CUTLASS, CK, Triton} x {BF16, FP8} x {tensor-wise, row-wise, block-wise} x {Nvidia H100, AMD MI300x}.
* High/low Precision Conversion Kernels: (FP32 / BF16 <-> FP8) with scaling options {tensor-wise, row-wise, block-wise} across hardware platforms {Nvidia H100, AMD MI300x} and programming options of {Triton, CUDA/HIP}.

Besides FP8 support, FBGEMM GenAI operators also support:

* Customized AllReduce communications (reduce latency for small message sizes).
* GQA: optimized specifically for decoding cases, as detailed in PyTorch's blog on `INT4 decoding <https://pytorch.org/blog/int4-decoding/>`__.
* KV cache quantizations.
* Rotary Positional Embedding (RoPE).

FP8 Core API Functions
----------------------

.. code:: python

    # Rowwise quantize (channel wise) the weight from BF16 to FP8
    wq, w_scale = torch.ops.fbgemm.quantize_fp8_per_row(w)
    # Rowwise quantize the activation (token wise) from BF16 to FP8
    xq, x_scale = torch.ops.fbgemm.quantize_fp8_per_row(
        x, num_tokens, activation_scale_ub
    )
    # Rowwise quantized GEMM with FP8 inputs and BF16 output
    y = torch.ops.fbgemm.f8f8bf16_rowwise(
        xq,
        wq,
        x_scale,
        w_scale,
        use_fast_accum=True,
    )

See :ref:`genai-quantize-ops-stable-api` for more details.
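
For reference, below is a minimal end-to-end sketch of the flow above. The
tensor shapes and the CUDA device placement are illustrative assumptions, and
the import of ``fbgemm_gpu.experimental.gen_ai`` is assumed to register the
``torch.ops.fbgemm`` GenAI operators:

.. code:: python

    import torch
    import fbgemm_gpu.experimental.gen_ai  # assumed to register the GenAI ops

    # Illustrative shapes: M tokens, K input features, N output features
    M, N, K = 16, 4096, 4096
    x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")  # activations
    w = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")  # weights

    # Rowwise quantize weight and activation from BF16 to FP8
    wq, w_scale = torch.ops.fbgemm.quantize_fp8_per_row(w)
    xq, x_scale = torch.ops.fbgemm.quantize_fp8_per_row(x)

    # FP8 x FP8 -> BF16 rowwise GEMM; y has shape (M, N)
    y = torch.ops.fbgemm.f8f8bf16_rowwise(
        xq, wq, x_scale, w_scale, use_fast_accum=True
    )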
16 changes: 16 additions & 0 deletions fbgemm_gpu/docs/src/fbgemm_genai/python-api/quantize_ops.rst
@@ -0,0 +1,16 @@
Quantization Operators
======================

.. automodule:: fbgemm_gpu

.. _genai-quantize-ops-stable-api:

Stable API
----------

.. autofunction:: fbgemm_gpu.experimental.gen_ai.quantize.int4_row_quantize

.. autofunction:: fbgemm_gpu.experimental.gen_ai.quantize.quantize_int4_preshuffle
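
Below is a hypothetical usage sketch; the exact signatures, including the
assumed ``group_size`` argument and the shapes of the returned tensors, should
be checked against the generated API reference above:

.. code:: python

    import torch

    from fbgemm_gpu.experimental.gen_ai.quantize import (
        int4_row_quantize,
        quantize_int4_preshuffle,
    )

    w = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

    # Assumed call pattern: rowwise INT4 quantization with per-group scales
    wq, scales = int4_row_quantize(w, group_size=128)

    # Assumed call pattern: INT4 quantization plus a layout preshuffle
    # tailored for the GEMM kernels
    wq_shuffled, shuffled_scales = quantize_int4_preshuffle(w, group_size=128)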

Other API
---------
15 changes: 0 additions & 15 deletions fbgemm_gpu/docs/src/fbgemm_gpu/development/BuildInstructions.rst
@@ -151,21 +151,6 @@ cuDNN package for the given CUDA version:
wget -q "${cudnn_url}" -O cudnn.tar.xz
tar -xvf cudnn.tar.xz

Install CUTLASS
~~~~~~~~~~~~~~~

This section is only applicable to building the experimental FBGEMM_GPU GenAI
module. CUTLASS should be already be available in the repository as a git
submodule (see :ref:`fbgemm-gpu.build.prepare`). The following include paths
are already added to the CMake configuration:

.. code:: cmake

set(THIRDPARTY ${FBGEMM}/external)

${THIRDPARTY}/cutlass/include
${THIRDPARTY}/cutlass/tools/util/include


.. _fbgemm-gpu.build.setup.rocm:

12 changes: 12 additions & 0 deletions fbgemm_gpu/docs/src/index.rst
@@ -121,3 +121,15 @@ Table of Contents
fbgemm_genai/development/BuildInstructions
fbgemm_genai/development/InstallationInstructions
fbgemm_genai/development/TestInstructions

.. toctree::
    :maxdepth: 1
    :caption: FBGEMM GenAI Overview

    fbgemm_genai/overview/Overview

.. toctree::
    :maxdepth: 1
    :caption: FBGEMM GenAI Python API

    fbgemm_genai/python-api/quantize_ops
2 changes: 1 addition & 1 deletion fbgemm_gpu/fbgemm_gpu/__init__.py
@@ -56,7 +56,7 @@ def _load_library(filename: str, no_throw: bool = False) -> None:
"experimental/gen_ai/fbgemm_gpu_experimental_gen_ai",
]

# NOTE: While FBGEMM_GPU GenAI is not available for ROCm yet, we would like to
# NOTE: While FBGEMM GenAI is not available for ROCm yet, we would like to
# be able to install the existing CUDA variant of the package onto ROCm systems,
# so that we can at least use the Triton GEMM libraries from experimental/gemm.
# But loading fbgemm_gpu package will trigger load-checking the .SO file for the