
Add support for optimum-habana deepseek v3/r1 fp8 quantization #2164


Draft · skavulya wants to merge 1 commit into master
Conversation

@skavulya commented Apr 4, 2025


What does this PR do?

Support FP8 static quantization for optimum-habana DeepSeek v3/r1 models using Intel Neural Compressor (INC).

This feature depends on the optimum-habana DeepSeek-v3 changes in huggingface/optimum-habana#1907 (fetched in the steps below) in addition to the INC changes in this PR.

Steps for FP8 quantization

```bash
# install OH
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana
git fetch origin pull/1907/head:deepseek_v3_fp8
git checkout deepseek_v3_fp8
pip install -e .
pip install git+https://github.com/HabanaAI/[email protected]
pip install blobfile tiktoken
```
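
Before building INC, a quick import check (illustrative, not part of this PR) confirms the editable optimum-habana build and the HPU PyTorch bridge are both visible on a Gaudi machine:

```bash
# Optional sanity check (illustrative): both imports should succeed
# before building INC against this environment.
python3 -c "import optimum.habana; print(optimum.habana.__version__)"
python3 -c "import habana_frameworks.torch.core"
```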

```bash
# install INC PR with OH deepseek_v3 support
git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
git fetch origin pull/2164/head:oh_ds_r1
git checkout oh_ds_r1
pip uninstall -y neural_compressor_pt
pip install -r requirements.txt
pip install -r requirements_pt.txt
python setup.py develop pt
```
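
Since the build replaces a previously installed `neural_compressor_pt`, it is worth verifying (again, an illustrative check rather than part of the PR) that Python now resolves INC to the editable checkout:

```bash
# Illustrative: confirm the editable INC checkout is the one on sys.path.
python3 -c "import neural_compressor; print(neural_compressor.__version__, neural_compressor.__file__)"
```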

```bash
# Step 1: FP8 measurement (calibration) with the Moonlight model on 2 cards with expert parallelism
cd ../optimum-habana/examples/text-generation/
PT_HPU_LAZY_MODE=1 INC_DYNAMIC_MOE_EXPERTS=64 QUANT_CONFIG=quantization_config/maxabs_measure.json python3 ../gaudi_spawn.py --world_size 2 run_generation.py --model_name_or_path moonshotai/Moonlight-16B-A3B --bf16 --trim_logits --batch_size 1 --use_hpu_graphs --use_kv_cache --prompt "DeepSpeed is a machine learning framework" --parallel_strategy "ep" --trust_remote_code_tokenizer
```
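
For orientation, the measurement config referenced above ships with the optimum-habana text-generation example; its contents are roughly the following (the exact fields in the checked-out branch may differ, so treat this as illustrative). The key entries are `"mode": "MEASURE"` and `dump_stats_path`, which controls where the calibration stats land:

```bash
# Illustrative: approximate contents of quantization_config/maxabs_measure.json.
cat quantization_config/maxabs_measure.json
# {
#     "method": "HOOKS",
#     "mode": "MEASURE",
#     "observer": "maxabs",
#     "allowlist": {"types": [], "names": []},
#     "blocklist": {"types": [], "names": []},
#     "dump_stats_path": "./hqt_output/measure"
# }
```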

```bash
# Step 2: FP8 quantization. The FP8 dynamic MoE op segfaults if SLICE_MAX_EXPERT > 32, so cap it at 32.
SLICE_MAX_EXPERT=32 INC_DYNAMIC_MOE_EXPERTS=64 PT_HPU_LAZY_MODE=1 QUANT_CONFIG=quantization_config/maxabs_quant_mixtral.json python3 ../gaudi_spawn.py --world_size 2 run_generation.py --model_name_or_path moonshotai/Moonlight-16B-A3B --bf16 --trim_logits --batch_size 1 --use_hpu_graphs --use_kv_cache --prompt "DeepSpeed is a machine learning framework" --parallel_strategy "ep" --trust_remote_code_tokenizer
```
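
The quantize run reads the stats dumped by the measurement run (under the `dump_stats_path` set in the config), so a quick existence check (illustrative) avoids a confusing failure in step 2:

```bash
# Illustrative: the quantize run consumes the stats written by the measure run
# under dump_stats_path (./hqt_output/measure in the stock example config).
ls -lh hqt_output/
```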

@skavulya skavulya marked this pull request as draft April 4, 2025 00:44
@skavulya skavulya marked this pull request as ready for review April 4, 2025 01:47
@skavulya skavulya marked this pull request as draft April 7, 2025 19:50