
Commit 2580f3c

Refined documents of NeuralChat. (#1027)
1 parent c6858bd commit 2580f3c

File tree

7 files changed: +277 −220 lines


workflows/chatbot/README.md

+26 −204
@@ -1,210 +1,32 @@
NeuralChat
============

This example demonstrates how to finetune a pretrained large language model (LLM) on an instruction-following dataset to create NeuralChat, a chatbot that can conduct textual conversations. Given a textual instruction, NeuralChat responds with a textual response. This example has been validated on 4th Gen Intel® Xeon® Processors (Sapphire Rapids).

# Prerequisite

## 1. Environment
Python 3.9 or a higher version is recommended.
```shell
pip install -r requirements.txt
# Using ccl as the distributed backend for distributed training on CPU requires the package below.
python -m pip install oneccl_bind_pt==1.13 -f https://developer.intel.com/ipex-whl-stable-cpu
```

## 2. Prepare the Model

### LLaMA
To acquire the checkpoints and tokenizer, the user has two options: completing the [Google form](https://forms.gle/jk851eBVbX1m5TAv5) or trying [the released model on Huggingface](https://huggingface.co/decapoda-research/llama-7b-hf).

Note that the naming used by the early LLaMA model in Transformers has caused many loading issues; please refer to this [revision history](https://github.com/huggingface/transformers/pull/21955). Transformers has since reorganized the code and renamed the LLaMA model to `Llama` in the model file, but the released model on Huggingface was not updated to reflect this change. To avoid unexpected conflicts, we advise the user to modify the local `config.json` and `tokenizer_config.json` files according to the following recommendations (a small sketch of applying them follows the list):
1. The `tokenizer_class` in `tokenizer_config.json` should be changed from `LLaMATokenizer` to `LlamaTokenizer`;
2. The `architectures` in `config.json` should be changed from `LLaMAForCausalLM` to `LlamaForCausalLM`.
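
For convenience, the two edits above can also be applied programmatically. The following is a minimal sketch, assuming the checkpoint has been downloaded to a local directory (the `./llama-7b-hf` path is only an example):

```python
import json
from pathlib import Path

# Hypothetical local path to the downloaded LLaMA checkpoint; adjust as needed.
model_dir = Path("./llama-7b-hf")

# 1. Rename the tokenizer class in tokenizer_config.json.
tok_cfg_path = model_dir / "tokenizer_config.json"
tok_cfg = json.loads(tok_cfg_path.read_text())
if tok_cfg.get("tokenizer_class") == "LLaMATokenizer":
    tok_cfg["tokenizer_class"] = "LlamaTokenizer"
tok_cfg_path.write_text(json.dumps(tok_cfg, indent=2))

# 2. Rename the architecture in config.json.
cfg_path = model_dir / "config.json"
cfg = json.loads(cfg_path.read_text())
cfg["architectures"] = [
    "LlamaForCausalLM" if arch == "LLaMAForCausalLM" else arch
    for arch in cfg.get("architectures", [])
]
cfg_path.write_text(json.dumps(cfg, indent=2))
```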

### FLAN-T5
The user can obtain the [released model](https://huggingface.co/google/flan-t5-xl) from Huggingface.

## 3. Prepare Dataset
An instruction-following dataset is needed for the finetuning. We select two kinds of datasets for the finetuning process: a general domain dataset and a domain-specific dataset.

1. General domain dataset: We use the [Alpaca dataset](https://github.com/tatsu-lab/stanford_alpaca) from Stanford University as the general domain dataset to fine-tune the model. This dataset is provided in the form of a JSON file, [alpaca_data.json](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json). In Alpaca, researchers manually crafted 175 seed tasks to guide `text-davinci-003` in generating 52K instruction-following samples for diverse tasks (the record format is sketched after this list).

2. Domain-specific dataset: Inspired by Alpaca, we constructed a domain-specific dataset focusing on Business and Intel-related issues. We made minor modifications to the [prompt template](https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt) to proactively guide Alpaca in generating more Intel- and Business-related instruction data. The generated data can be found in `intel_domain.json`.
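
For reference, each record in `alpaca_data.json` (and in a domain-specific file built the same way) is a JSON object with `instruction`, `input`, and `output` fields. A minimal sketch for inspecting the file, assuming it has been downloaded locally:

```python
import json

# Path to the downloaded Alpaca-style dataset (assumed local path).
with open("stanford_alpaca/alpaca_data.json", "r", encoding="utf-8") as f:
    samples = json.load(f)  # a list of {"instruction", "input", "output"} dicts

print(len(samples), "samples")
example = samples[0]
print("Instruction:", example["instruction"])
print("Input:", example["input"])    # may be an empty string for no-input tasks
print("Output:", example["output"])
```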

# Finetune

We employ the [LoRA approach](https://arxiv.org/pdf/2106.09685.pdf) to finetune the LLM efficiently. Currently, FLAN-T5 and LLaMA are supported for finetuning.
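
The finetuning scripts select the method through the `--peft` flag. Purely as an illustration of what a LoRA setup looks like with the [peft](https://github.com/huggingface/peft) library (not necessarily the exact configuration used by `finetune_clm.py`), a sketch might be:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load the base model (same model id as used elsewhere in this example).
base_model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")

# Illustrative LoRA hyperparameters; the script's defaults may differ.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```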

## 1. Single Node Fine-tuning

For FLAN-T5, use the below command line for finetuning on the Alpaca dataset.

```bash
python finetune_seq2seq.py \
    --model_name_or_path "google/flan-t5-xl" \
    --train_file "stanford_alpaca/alpaca_data.json" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --do_train \
    --learning_rate 1.0e-5 \
    --warmup_ratio 0.03 \
    --weight_decay 0.0 \
    --num_train_epochs 5 \
    --logging_steps 10 \
    --save_steps 2000 \
    --save_total_limit 2 \
    --overwrite_output_dir \
    --output_dir ./flan-t5-xl_peft_finetuned_model
```

For LLaMA, use the below command line for finetuning on the Alpaca dataset.

```bash
python finetune_clm.py \
    --model_name_or_path "decapoda-research/llama-7b-hf" \
    --train_file "/path/to/alpaca_data.json" \
    --dataset_concatenation \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --do_train \
    --learning_rate 2e-5 \
    --num_train_epochs 3 \
    --logging_steps 100 \
    --save_total_limit 2 \
    --overwrite_output_dir \
    --log_level info \
    --save_strategy epoch \
    --output_dir ./llama_peft_finetuned_model \
    --peft lora \
    --use_fast_tokenizer false
```

The `--dataset_concatenation` argument vastly accelerates the fine-tuning process by concatenating training samples: several tokenized sentences are concatenated into one longer, denser training sample instead of several samples of different lengths. This is more efficient because the denser samples expose more parallelism.
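
As an illustration of the idea (not the script's actual implementation), concatenation can be pictured as packing tokenized samples into fixed-length blocks:

```python
def concatenate_samples(tokenized_samples, block_size=512):
    """Pack variable-length token-id sequences into fixed-length training blocks.

    `tokenized_samples` is a list of lists of token ids; the block size is an
    illustrative value, not necessarily what finetune_clm.py uses.
    """
    all_ids = [tok for sample in tokenized_samples for tok in sample]
    # Drop the tail that does not fill a whole block.
    total = (len(all_ids) // block_size) * block_size
    return [all_ids[i:i + block_size] for i in range(0, total, block_size)]

# Three short "sentences" become one 6-token block instead of three padded samples.
print(concatenate_samples([[1, 2], [3, 4, 5], [6, 7, 8, 9]], block_size=6))
```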

For finetuning on SPR, adding the `--bf16` argument will speed up the finetuning process without degrading the model's performance.
You can also use `--peft` to switch the PEFT method among P-tuning, Prefix tuning, Prompt tuning, LLaMA Adapter, and LoRA; see https://github.com/huggingface/peft.

Add the option `--use_fast_tokenizer False` when using the latest transformers if the LLaMA fast tokenizer fails.
For LLaMA, the `tokenizer_class` in `tokenizer_config.json` should be changed from `LLaMATokenizer` to `LlamaTokenizer`.

## 2. Multi-node Fine-tuning

We also support Distributed Data Parallel (DDP) finetuning in both single-node and multi-node settings. To use Distributed Data Parallel to speed up training, the bash command needs a small adjustment. For example, to finetune FLAN-T5 through Distributed Data Parallel training, the bash command looks like the following, where:

- *`<MASTER_ADDRESS>`* is the address of the master node; it is not necessary in the single-node case;
- *`<NUM_PROCESSES_PER_NODE>`* is the number of processes to use on the current node; for a node with GPUs it is usually set to the number of GPUs, and for a CPU-only node it is recommended to set it to 1;
- *`<NUM_NODES>`* is the number of nodes to use;
- *`<NODE_RANK>`* is the rank of the current node, ranging from 0 to *`<NUM_NODES>`*`-1`.

> Also please note that to use CPU for training on each node in a multi-node setting, the `--no_cuda` argument is mandatory, and `--xpu_backend ccl` is required if ccl is used as the distributed backend. In a multi-node setting, the following command needs to be launched on each node, and all the commands should be identical except for *`<NODE_RANK>`*, which is an integer from 0 to *`<NUM_NODES>`*`-1` assigned to each node.

```bash
python -m torch.distributed.launch --master_addr=<MASTER_ADDRESS> --nproc_per_node=<NUM_PROCESSES_PER_NODE> --nnodes=<NUM_NODES> --node_rank=<NODE_RANK> \
    finetune_seq2seq.py \
    --model_name_or_path "google/flan-t5-xl" \
    --train_file "stanford_alpaca/alpaca_data.json" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --do_train \
    --learning_rate 1.0e-5 \
    --warmup_ratio 0.03 \
    --weight_decay 0.0 \
    --num_train_epochs 5 \
    --logging_steps 10 \
    --save_steps 2000 \
    --save_total_limit 2 \
    --overwrite_output_dir \
    --output_dir ./flan-t5-xl_peft_finetuned_model
```

If you have enabled passwordless SSH on the CPU cluster, you can also use mpirun on the master node to start the DDP finetuning. Take the LLaMA Alpaca finetuning as an example: follow the [Huggingface guide](https://huggingface.co/docs/transformers/perf_train_cpu_many) to install Intel® oneCCL Bindings for PyTorch and IPEX.

oneccl_bindings_for_pytorch is installed along with the MPI tool set. You need to source the environment before using it.

For Intel® oneCCL >= 1.12.0:
```bash
oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
```

For Intel® oneCCL versions < 1.12.0:
```bash
torch_ccl_path=$(python -c "import torch; import torch_ccl; import os; print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))")
source $torch_ccl_path/env/setvars.sh
```

The following command enables training with a total of 16 processes on 4 Xeon nodes (node0/1/2/3, 2 sockets per node, taking node0 as the master node). `ppn` (processes per node) is set to 4, so two processes run per socket. The variables `OMP_NUM_THREADS` and `CCL_WORKER_COUNT` can be tuned for optimal performance.

On node0, you need to create a configuration file that contains the IP address of each node (for example, `hostfile`) and pass that configuration file path as an argument.
```bash
cat hostfile
xxx.xxx.xxx.xxx #node0 ip
xxx.xxx.xxx.xxx #node1 ip
xxx.xxx.xxx.xxx #node2 ip
xxx.xxx.xxx.xxx #node3 ip
```
Now, run the following command on node0 and **4DDP** will be enabled in node0 and node1 with BF16 auto mixed precision:
```bash
export CCL_WORKER_COUNT=1
export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
mpirun -f hostfile -n 16 -ppn 4 -genv OMP_NUM_THREADS=56 python3 finetune_clm.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --train_file ./alpaca_data.json \
    --bf16 True \
    --output_dir ./llama_peft_finetuned_model \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --peft ptun \
    --group_by_length True \
    --dataset_concatenation \
    --use_fast_tokenizer false \
    --do_train
```

You can also use `--peft` to switch the PEFT method among P-tuning, Prefix tuning, Prompt tuning, LLaMA Adapter, and LoRA; see https://github.com/huggingface/peft.

# Chat with the Finetuned Model

Once the model is finetuned, use the below command line to chat with it. We take FLAN-T5 as an example; you can extend it to other models.
```bash
python generate.py \
    --base_model_path "google/flan-t5-xl" \
    --peft_model_path "./flan-t5-xl_peft_finetuned_model" \
    --instructions "Transform the following sentence into one that shows contrast. The tree is rotten."
```

Add the option `--use_slow_tokenizer` when using the latest transformers if the LLaMA fast tokenizer fails.
For LLaMA, the `tokenizer_class` in `tokenizer_config.json` should be changed from `LLaMATokenizer` to `LlamaTokenizer`.

```bash
python generate.py \
    --base_model_path "decapoda-research/llama-7b-hf" \
    --peft_model_path "./llama_peft_finetuned_model" \
    --use_slow_tokenizer \
    --instructions "Transform the following sentence into one that shows contrast. The tree is rotten."
```

NeuralChat is a powerful and versatile chatbot designed to facilitate textual conversations. By providing NeuralChat with textual instructions, users can receive accurate and relevant textual responses. We provide a comprehensive workflow for building a highly customizable end-to-end chatbot service, covering model pre-training, model fine-tuning, model compression, prompt engineering, knowledge base retrieval and quick deployment.

## Fine-tuning Pipeline

We provide a comprehensive pipeline on fine-tuning a customized model. It covers the process of [generating custom instruction datasets](./fine_tuning/instruction_generator/), [instruction templates](./fine_tuning/instruction_template), [fine-tuning the model with these datasets](./fine_tuning/instruction_tuning_pipeline/), and leveraging an [RLHF (Reinforcement Learning from Human Feedback) pipeline](./fine_tuning/rlhf_learning_pipeline/) for efficient fine-tuning of the pretrained large language model (LLM). For detailed information and step-by-step instructions, please consult this [README file](./fine_tuning/README.md).

## Inference Pipeline

We focus on optimizing the inference process of the fine-tuned customized model. This covers [auto prompt engineering](./inference/auto_prompt/) techniques for improving user prompts; [document indexing](./inference/document_indexing/) for efficient retrieval of relevant information, including Dense Indexing based on [LangChain](https://github.com/hwchase17/langchain) and Sparse Indexing based on [fastRAG](https://github.com/IntelLabs/fastRAG); [document rankers](./inference/document_ranker/) to prioritize the most relevant responses; [instruction optimization](./inference/instruction_optimization/) to enhance the model's performance; and a [memory controller](./inference/memory_controller/) for efficient memory utilization. For more information on these optimization techniques, please refer to this [README file](./inference/README.md).
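
As a rough sketch of what dense indexing with LangChain can look like (a generic example rather than the code under `./inference/document_indexing/`; the input file, embedding model, and index directory are assumptions):

```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Load and chunk a local document (hypothetical file name).
docs = TextLoader("intel_faq.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# Embed the chunks and build a persistent dense index.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma.from_documents(chunks, embeddings, persist_directory="./doc_index")

# Retrieve the chunks most relevant to a user question.
for hit in db.similarity_search("What does NeuralChat provide?", k=3):
    print(hit.page_content[:80])
```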

## Deployment

### Demo

We offer a rich demonstration of the capabilities of NeuralChat. It showcases a variety of components, including a basic frontend, an advanced frontend with enhanced features, a command-line interface for convenient interaction, and different backends to suit diverse requirements. For more detailed information and instructions, please refer to the [README file](./demo/README.md).

### Service

Under construction.

To simplify the deployment process, we have also included Docker files for each part, allowing for easy and efficient building of the whole workflow service. These Docker files provide a standardized environment and streamline the deployment process, ensuring smooth execution of the chatbot service.

# Purpose of the NeuralChat for Intel Architecture
