Step-by-Step

In this example, we provide the inference benchmarking script run_llm.py for models such as EleutherAI/gpt-j-6B, decapoda-research/llama-7b-hf, EleutherAI/gpt-neox-20b, and databricks/dolly-v2-3b. You can also refer to the link to do LLM inference with the cpp graph, which can give better performance but may have constraints on batched inference.

Note: The default search algorithm is beam search with num_beams = 4

Create Environment

# Create Environment (conda)
conda create -n llm python=3.9 -y
conda install mkl mkl-include -y
conda install gperftools jemalloc==5.2.1 -c conda-forge -y
pip install -r requirements.txt

# If you want to run the gpt-j model, install transformers==4.27.4
pip install transformers==4.27.4

# For other models, install transformers==4.34.1
pip install transformers==4.34.1

Note: We suggest using a transformers version no higher than 4.34.1.
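
As a quick sanity check (assuming the llm environment created above is activated), you can confirm which transformers version ended up installed:

# Print the installed transformers version; it should match one of the versions above
python -c "import transformers; print(transformers.__version__)"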

Environment Variables

export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0
# IOMP
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so
# Tcmalloc is a recommended malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support.
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
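
Before launching inference, it can help to verify that the preloaded libraries actually exist in the active conda environment (a minimal check, assuming the llm environment from the setup step is activated):

# Both files should be listed; if libtcmalloc.so is missing, re-run the gperftools install step
ls ${CONDA_PREFIX}/lib/libiomp5.so ${CONDA_PREFIX}/lib/libtcmalloc.so
# Inspect the final preload order
echo ${LD_PRELOAD}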

Performance

The fp32 models are from Hugging Face: EleutherAI/gpt-j-6B, decapoda-research/llama-7b-hf, decapoda-research/llama-13b-hf, databricks/dolly-v2-3b, and [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b). The gpt-j int8 model has been published on Intel/gpt-j-6B-pytorch-int8-static.

Generate Neural Engine model

python optimize_llm.py --model=EleutherAI/gpt-j-6B --dtype=(fp32|bf16) --output_model=<path to engine model>

# int8
wget https://huggingface.co/Intel/gpt-j-6B-pytorch-int8-static/resolve/main/pytorch_model.bin -O <path to int8_model.pt>
python optimize_llm.py --model=EleutherAI/gpt-j-6B --dtype=int8 --output_model=<path to ir> --pt_file=<path to int8_model.pt>
  • When the input dtype is fp32 or bf16, the model will be downloaded automatically if it does not exist locally (a filled-in bf16 example is shown below).
  • When the input dtype is int8, the traced int8 model passed via --pt_file must already exist.
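
For instance, a bf16 conversion of GPT-J could look like the following sketch (the output path ./ir/gptj_bf16 is only an illustrative placeholder):

# Download EleutherAI/gpt-j-6B if needed and emit a bf16 Neural Engine model
python optimize_llm.py --model=EleutherAI/gpt-j-6B --dtype=bf16 --output_model=./ir/gptj_bf16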

Inference

We support inference with FP32/BF16/INT8 Neural Engine models.

OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_llm.py --max-new-tokens 32 --input-tokens 32 --batch-size 1 --model <model name> --model_path <path to engine model>
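
For example, on a machine with 56 physical cores per socket, pinning the run to NUMA node 0 might look like this sketch (the core count, NUMA node, and IR path are placeholders; adjust them to your system):

# Bind to the 56 physical cores of NUMA node 0 and run the bf16 GPT-J model generated above
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_llm.py --max-new-tokens 32 --input-tokens 32 --batch-size 1 --model EleutherAI/gpt-j-6B --model_path ./ir/gptj_bf16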

Advanced Inference

Neural Engine also supports weight compression to fp8_4e3m, fp8_5e2m, and int8, but only when running a bf16 graph. If you want to try it, add the --weight_type argument, for example:

OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_llm.py --max-new-tokens 32 --input-tokens 32 --batch-size 1 --model_path <path to bf16 engine model> --model <model name> --weight_type=fp8_5e2m
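
Here, fp8_4e3m and fp8_5e2m denote 8-bit floating-point formats with 4 exponent/3 mantissa bits and 5 exponent/2 mantissa bits, respectively. As a concrete sketch (same placeholder core count and IR path as above), compressing the bf16 GPT-J weights to int8 would look like:

# Run the bf16 graph with its weights compressed to int8
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_llm.py --max-new-tokens 32 --input-tokens 32 --batch-size 1 --model_path ./ir/gptj_bf16 --model EleutherAI/gpt-j-6B --weight_type=int8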