
Commit 64f4a04

[QEff Finetune]: Adding steps about how to fine tune on any custom dataset. (#381)
1) Added steps on how to create custom_dataset.py so that fine-tuning can be run through the QEfficient pipeline on any custom dataset, along with a detailed template showing how to write custom_dataset.py.
2) Added the argument 'context_length' to the existing APIs, which enables fine-tuning with padding on a custom dataset.
3) Made alpaca_dataset the default dataset.
4) For DDP without sorting, shuffling was previously set to True; it is now False, to match the single-SoC run and to support resuming fine-tuning from an intermediate point.

---------

Signed-off-by: Swati Allabadi <[email protected]>
Signed-off-by: Swati Allabadi <[email protected]>
1 parent 6bcf5de commit 64f4a04

3 files changed: +44 -4 lines changed


QEfficient/finetune/dataset/custom_dataset.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -23,7 +23,7 @@ def load_module_from_py_file(py_file: str) -> object:
     return module
 
 
-def get_custom_dataset(dataset_config, tokenizer, split: str):
+def get_custom_dataset(dataset_config, tokenizer, split: str, context_length=None):
     if ":" in dataset_config.file:
         module_path, func_name = dataset_config.file.split(":")
     else:
@@ -38,7 +38,7 @@ def get_custom_dataset(dataset_config, tokenizer, split: str):
 
     module = load_module_from_py_file(module_path.as_posix())
     try:
-        return getattr(module, func_name)(dataset_config, tokenizer, split)
+        return getattr(module, func_name)(dataset_config, tokenizer, split, context_length)
     except AttributeError as e:
         print(
             f"It seems like the given method name ({func_name}) is not present in the dataset .py file ({module_path.as_posix()})."
```
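For context, the loader above resolves dataset_config.file, which may carry an optional ":function" suffix selecting the function to call. Below is a minimal sketch of how such a config entry and that parsing fit together; the default path, function name, and split values are placeholders for illustration, not the repository's shipped defaults.

```python
# Illustrative only: placeholder defaults, not the repository's actual config.
from dataclasses import dataclass


@dataclass
class custom_dataset:
    # Path to the user's .py file; an optional ":function" suffix picks the
    # function that the loader above will call.
    file: str = "dataset/custom_dataset.py:get_custom_dataset"
    train_split: str = "train"
    test_split: str = "test"


config = custom_dataset()
# Mirrors the split(":") parsing shown in the diff above.
module_path, func_name = config.file.split(":")
print(module_path)  # dataset/custom_dataset.py
print(func_name)    # get_custom_dataset
```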

QEfficient/finetune/utils/dataset_utils.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -51,7 +51,7 @@ def get_dataloader_kwargs(train_config, dataset, dataset_processer, split):
         )
     else:
         kwargs["sampler"] = torch.utils.data.DistributedSampler(
-            dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank(), shuffle=True
+            dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank(), shuffle=False
         )
     kwargs["batch_size"] = batch_size
     kwargs["drop_last"] = True
```
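Why shuffle=False matters here: with an unshuffled DistributedSampler, every rank walks its slice of the dataset in the same fixed order on every run, which keeps the sample-to-step mapping consistent with a single-SoC run and makes it straightforward to resume fine-tuning partway through an epoch. The snippet below is a minimal standalone sketch (not the repository's get_dataloader_kwargs) illustrating that deterministic per-rank ordering.

```python
# Minimal sketch: deterministic per-rank batches with shuffle=False.
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def build_loader(dataset, world_size, rank, batch_size):
    # Passing num_replicas and rank explicitly avoids needing torch.distributed
    # to be initialized for this small demonstration.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
    # drop_last=True keeps the number of steps identical on every rank.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)


dataset = TensorDataset(torch.arange(16))
for rank in range(2):
    loader = build_loader(dataset, world_size=2, rank=rank, batch_size=2)
    # With shuffle=False these indices are identical across reruns, so a resumed
    # job can skip already-seen batches and continue from the same point.
    print(rank, [batch[0].tolist() for batch in loader])
```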

docs/source/finetune.md

Lines changed: 41 additions & 1 deletion
````diff
@@ -64,4 +64,44 @@ to visualise the data,
 
 ```python
 tensorboard --logdir runs/<file> --bind_all
-```
+```
+
+## Some features/functionalities of the fine-tuning stack
+1) Gradient accumulation: By default, gradient accumulation happens for 4 steps. To change this value, pass the command line argument gradient_accumulation_steps. (Example: '--gradient_accumulation_steps 8')
+2) Gradient checkpointing: By default, gradient checkpointing is disabled. To enable it, pass the command line argument gradient_checkpointing.
+
+## Fine-tuning on a custom dataset
+
+To run fine-tuning on any user-specific dataset, prepare the dataset using the following steps:
+
+1) Create a directory named 'dataset' inside efficient-transformers.
+2) Inside this directory, create a file named 'custom_dataset.py'.
+3) Inside the newly created efficient-transformers/dataset/custom_dataset.py, define a function named 'get_custom_dataset'.
+4) get_custom_dataset() should have the following 4 parameters: dataset_config, tokenizer, split, context_length.
+5) Inside get_custom_dataset(), the user needs to apply the prompt and tokenize the dataset accordingly. Please refer to the template below on how to define get_custom_dataset().
+6) For examples, please refer to the Python files in [dataset](https://github.com/quic/efficient-transformers/tree/main/QEfficient/finetune/dataset). In the case of the Samsum dataset, get_preprocessed_samsum() from efficient-transformers/QEfficient/finetune/dataset/samsum_dataset.py is called.
+7) In [dataset_config.py](https://github.com/quic/efficient-transformers/blob/main/QEfficient/finetune/configs/dataset_config.py), pass the appropriate values for train_split and test_split in the custom_dataset class. Alternatively, these values can be passed as command line arguments with the finetune command, for example "--train_split train".
+8) While running fine-tuning, pass the argument "--dataset custom_dataset" to fine-tune on the custom dataset.
+
+The template for get_custom_dataset(), to be defined inside efficient-transformers/dataset/custom_dataset.py, is as follows:
+
+```python
+def get_custom_dataset(dataset_config, tokenizer, split, context_length=None):
+
+    # load the dataset
+    # based on split, retrieve only the specific portion of the dataset (train or eval), either here or at the end
+
+    def apply_prompt_template():
+        # transform the passed datapoint by applying the prompt to it
+
+    def tokenize():
+        # tokenize the passed datapoint
+
+    # define the prompt
+    # call apply_prompt_template() for each data point:
+    # dataset = dataset.map(apply_prompt_template, <other args>)
+    # call tokenize() for each data point:
+    # dataset = dataset.map(tokenize, <other args>)
+
+    return dataset
+```
````
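To make the template above concrete, here is one hypothetical way it might be filled in. Everything specific in this sketch is an assumption for illustration: the local file my_data.json, the instruction/response column names, and the prompt string are not part of the QEfficient repository, and a real recipe would typically also prepare labels/label masking as expected by the training loop.

```python
# Hypothetical example of efficient-transformers/dataset/custom_dataset.py.
# File name, column names, and prompt below are illustrative assumptions.
from datasets import load_dataset


def get_custom_dataset(dataset_config, tokenizer, split, context_length=None):
    # Load a local JSON dataset and carve out the requested portion (train or eval).
    dataset = load_dataset("json", data_files="my_data.json", split="train")
    dataset = dataset.train_test_split(test_size=0.1, seed=42)
    dataset = dataset["train"] if split == "train" else dataset["test"]

    # Define the prompt.
    prompt = "### Instruction:\n{instruction}\n\n### Response:\n"

    def apply_prompt_template(datapoint):
        # Transform the datapoint by applying the prompt to it.
        return {
            "input": prompt.format(instruction=datapoint["instruction"]),
            "label": datapoint["response"],
        }

    def tokenize(datapoint):
        # Tokenize prompt and response together; when context_length is given,
        # pad/truncate so that every sample has the same fixed length.
        return tokenizer(
            datapoint["input"] + datapoint["label"],
            max_length=context_length,
            truncation=context_length is not None,
            padding="max_length" if context_length else False,
        )

    dataset = dataset.map(apply_prompt_template, remove_columns=dataset.column_names)
    dataset = dataset.map(tokenize, remove_columns=["input", "label"])
    return dataset
```

With such a file in place, the dataset is selected at fine-tuning time with "--dataset custom_dataset", as described in step 8 above.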
