Commit 44e7482

Merge pull request #109 from markmc/sdg-flow-yaml

Propose adding a file format for custom pipelines

2 parents 88f2461 + 5c0e071

2 files changed: +280 −2 lines

.spellcheck-en-custom.txt (+22 −2)

Words added to the custom spellcheck dictionary: args, configs, disambiguating, Filesystem, filesystem, freeform, hardcode, instantiation, iters, merlinite, mixtral, num, Params, POC, runtime, specifiying, src, templating, TODO, yaml. The existing entries `PR's` and `triager's` were removed and re-inserted at new positions to keep the list sorted.

docs/sdg/sdg-flow-yaml.md (+258)
# SDG API - Add a file format for defining custom Flows

## Problem Statement

The `instructlab/sdg` library is introducing more extensive data generation pipelines. To enable customization, we should allow users of the library to provide a configuration file which defines a custom pipeline or extends an existing pipeline.

In terms of the API constructs, a Pipeline is created from a sequence of “block configurations” which express how to instantiate and invoke the individual steps (aka blocks) in the pipeline. A Flow construct serves as a template from which a sequence of block configs can be generated.

## Objective

- Library users can specify a custom flow using a well-defined file format.
- Library users can either use a custom flow standalone, or combine a custom flow with existing flows.
- The file format and library can evolve substantially without making breaking changes.
- Incompatible changes can be introduced while retaining support for existing custom flows for a deprecation period.

## Proposal

### Existing API Review

The current `Pipeline` API allows instantiation with a list of `Block` configurations. These configurations could come from one or many sources. In its simplest form:

```python
pipeline = Pipeline(block_configs)
```
or if you had two separate lists of block configurations to append together:

```python
pipeline = Pipeline(block_configs1 + block_configs2)
```

### API Additions

We will add an API that instantiates a `Pipeline` object from a YAML file:

```python
pipeline = Pipeline.from_file(ctx, 'mycustomflow.yaml')
```
The YAML file format will mirror the API and look like this:

```yaml
version: "1.0"
blocks:
  - name: gen_knowledge
    type: LLMBlock
    config: # LLMBlock constructor kwargs
      output_cols: ["question", "response"]
    gen_kwargs: # kwargs for block.generate()
      max_tokens: 2048
    drop_duplicates: ["question"]
  - name: filter_faithfulness
    type: FilterByValueBlock
    config:
      filter_column: judgment
      filter_value: "YES"
      operation: eq
    drop_columns: ["judgment", "explanation"]
```
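To make the mapping from YAML to the existing API concrete, here is a minimal sketch of how a loader might normalize the `blocks` section of an already-parsed file into block configuration dicts. The helper name and the exact set of keys are illustrative, not the library's actual implementation:

```python
def blocks_from_content(content: dict) -> list:
    """Normalize the `blocks` section of a parsed pipeline YAML document."""
    configs = []
    for block in content["blocks"]:
        configs.append({
            "name": block["name"],
            "type": block["type"],
            # Optional sections default to empty values when omitted.
            "config": block.get("config", {}),
            "gen_kwargs": block.get("gen_kwargs", {}),
            "drop_duplicates": block.get("drop_duplicates", []),
            "drop_columns": block.get("drop_columns", []),
        })
    return configs
```

A real implementation would first load the file with a YAML parser and check the `version` field before normalizing the blocks.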
### Versioning

A mandatory `version` field in the YAML file expresses major and minor versions (e.g., 1.0, 1.1, 2.0).

Compatibility rules:

1. If the major version of the YAML file is higher than the parser can handle, the parser should reject the file.
2. If the minor version of the YAML file is higher than the highest version the parser is aware of, the parser should read the file but ignore any unrecognized content.
3. If the file’s version is lower than the parser version, the parser should provide default values for any configuration introduced in later versions.

Example parsing logic:

```python
def parse_custom_flow(content):
    version = content['version']
    major, minor = map(int, version.split('.'))

    if major > PARSER_MAJOR:
        raise IncompatibleVersionError(
            "The custom flow file format is from a future major version.")
    elif major == PARSER_MAJOR and minor > PARSER_MINOR:
        logger.warning(
            "The custom flow file may have new features that will be ignored.")
```
### Pipeline Context

The following runtime parameters will no longer be part of the pipeline configuration definition and will instead be available to blocks via a `PipelineContext` object:

- `client` - an OpenAI completions API client for talking to the teacher model via the serving backend (i.e. llama-cpp or vLLM)
- `model_family` - e.g. mixtral or merlinite
- `model_id` - a path name for the specific teacher model being used
- `num_instructions_to_generate` - how many samples to generate

For now, we assume there is no need to do any sort of templating in the custom pipelines based on these runtime parameters.
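As a sketch, `PipelineContext` could be as simple as a dataclass carrying these four parameters; the field names follow the list above, but the actual class in the library may carry more state:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class PipelineContext:
    client: Any                        # OpenAI completions API client for the teacher model
    model_family: str                  # e.g. "mixtral" or "merlinite"
    model_id: str                      # path/name of the specific teacher model
    num_instructions_to_generate: int  # how many samples to generate
```

Blocks would then read e.g. `ctx.model_family` at generation time instead of having these values baked into the pipeline definition.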
### Model Prompts

Based on whether `model_family` is mixtral or merlinite, a different prompt is used with the teacher model:

```python
_MODEL_PROMPT_MIXTRAL = "<s> [INST] {prompt} [/INST]"
_MODEL_PROMPT_MERLINITE = "'<|system|>\nYou are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.\n<|user|>\n{prompt}\n<|assistant|>\n'"
```

For now, we assume that the `LLMBlock` class will choose the appropriate model prompt based on the family and that there is no need to specify a custom prompt.

### Prompt Config Files

Every `LLMBlock` references a separate prompt config file, and presumably a custom pipeline will provide custom prompt configs too.

These prompt config files are simple YAML files containing a single object with `system`, `introduction`, `principles`, `examples`, and `generation` keys. See e.g. `src/instructlab/sdg/configs/skills/freeform_questions.yaml`.

We will continue to use these config files unchanged, and custom files can be specified with an absolute path. Relative paths are assumed to be relative to the Python package, e.g. `configs/skills/...`.
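The path resolution rule above can be sketched as follows; `resolve_config_path` and the `package_root` parameter are hypothetical names, and `posixpath` is used just to keep the example platform-independent:

```python
import posixpath

def resolve_config_path(path: str, package_root: str) -> str:
    """Absolute paths are used as-is; relative paths resolve against the package."""
    if posixpath.isabs(path):
        return path
    return posixpath.join(package_root, path)
```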
### Model Serving

Custom pipelines may have more unusual model serving requirements. Instead of serving a single model, we may need to launch the model server with both a base model and an additional model+adapter. vLLM, for example, can host a model and a model+adapter under two different model IDs.

The pipeline author needs some way of disambiguating between these multiple models - i.e. the definition of each `LLMBlock` needs to specify a particular model.

Right now the `Pipeline` constructor takes two relevant parameters - the OpenAI client instance and the model ID for the default model. It's important to note that this model ID is defined by the user at runtime, and it may not match the model IDs that the pipeline author used.

The use cases will be:

1. Most `LLMBlock` definitions will use the default teacher model - and we can make the semantics that if the pipeline author doesn't specify a model in an `LLMBlock`, the default in `PipelineContext.model_id` is used.
2. In cases where a model+adapter is to be served, the pipeline author should choose a descriptive model ID using `block.gen_kwargs.model_id`, and the user should ensure that this is the model ID that is served.

For example, a pipeline author might define:

```yaml
version: "1.0"
blocks:
  - name: gen_questions
    type: LLMBlock
    config:
      config_path: configs/skills/freeform_questions.yaml
      add_num_samples: True
    gen_kwargs:
      model_id: mycustomadapter
    output_cols:
      - question
    drop_duplicates:
      - question
```

and the user will be required to define a serving configuration like:

```bash
--lora-modules=mycustomadapter=path/to/my_custom_adapter
```
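The model-selection semantics described in the two cases above amount to a simple fallback, sketched here with illustrative names:

```python
def resolve_model_id(gen_kwargs: dict, default_model_id: str) -> str:
    """Use the author-specified model ID if present, else the runtime default."""
    return gen_kwargs.get("model_id", default_model_id)
```

The runtime default would come from `PipelineContext.model_id`, so most blocks never need to name a model explicitly.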
### Re-use of Built-in Pipelines

A custom pipeline may want to extend an existing built-in pipeline. In that case, a new block type, `ImportBlock`, may be used to import the blocks from another configuration file.

```yaml
version: "1.0"
blocks:
  - name: import_from_full
    type: ImportBlock
    path: configs/full/synth_freeform_skills.yaml
  - name: custom_post_processing_block
    type: LLMBlock
    ...
```
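One way to implement this is to expand `ImportBlock` entries inline while loading, recursing so that imported pipelines can themselves import others. This is a sketch: the loader callback stands in for YAML parsing, and the function name is illustrative.

```python
def expand_imports(blocks: list, load_file) -> list:
    """Replace each ImportBlock entry with the blocks of the file it references."""
    expanded = []
    for block in blocks:
        if block.get("type") == "ImportBlock":
            imported = load_file(block["path"])
            # Recurse so imported pipelines can themselves use ImportBlock.
            expanded.extend(expand_imports(imported["blocks"], load_file))
        else:
            expanded.append(block)
    return expanded
```

A production version would also track the chain of imported paths to detect circular imports.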
### CLI Integration

The current version of `ilab` supports `simple` and `full` as arguments to `--pipeline`, selecting one of the two built-in pipelines included in the library.

Once we have support for loading custom pipelines, we need a way for these to be specified with the CLI. We believe the most common case for custom pipelines is to extend the `full` pipeline and, as such, we should support extending existing pipelines with a custom pipeline rather than simply specifying a single pipeline.

Here is a proposed CLI UX for this:

> `ilab data generate`

Use the default pipeline, `simple`.

> `ilab data generate --pipeline full`

Use the built-in `full` pipeline.

> `ilab data generate --pipeline path/to/custom_pipeline_directory/`

Use a custom pipeline configuration. The custom pipeline may include references to the built-in flows to be used in conjunction with custom ones, but those details are contained within the YAML files in the custom directory.
### File and Directory Structure

The existing contents of `default_flows.py` will become these files in the source tree:

```text
src/
  instructlab/
    sdg/
      pipelines/
        simple/
          knowledge.yaml
          freeform_skills.yaml
          grounded_skills.yaml
        full/
          knowledge.yaml # also contains the current contents of mmlu_bench.yaml
          freeform_skills.yaml
          grounded_skills.yaml
```

When the `--pipeline` option to `ilab data generate` is used to point to a custom directory, we will assume that the same 3 files are present. All three files will be loaded and used according to the type of taxonomy additions present when running `ilab data generate`.
### Future CLI Improvements

A possible improvement would be to have a well-defined place on the filesystem where custom pipeline configs can be automatically loaded and included as options to the `--pipeline` parameter.

For example, if the config format included new parameters, `name: full-extended` and `extends: full`, and the CLI discovered and loaded it automatically, we could support `--pipeline full-extended` without needing an additional `--pipeline-extend` option.

`/usr/share/instructlab/sdg/` is a proposed location for this, as a place for a distribution of InstructLab to include pre-defined custom pipelines, at least on Linux. See the [Filesystem Hierarchy Standard](https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch04s11.html) for more details on why this path is appropriate for this use case.

It would also make sense to support a configuration directory for users' own custom pipeline configurations. Assuming there is a base config directory, these could go in an `sdg` subdirectory. There is a separate proposal that discusses a proposed configuration location: <https://github.com/instructlab/dev-docs/pull/104>. Note this is separate from the distribution-provided, read-only pipelines discussed above, which have a different location.

If we have a location with pipeline examples, then a nice-to-have would be an `ilab data generate --list-pipelines` option.
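Discovery over such well-known directories could be sketched as a scan that maps each config's file name (or its `name` field, once that exists) to its path. Everything here beyond the proposed search locations is illustrative:

```python
import os

def discover_pipelines(search_dirs: list) -> dict:
    """Map pipeline names to config paths found in the given directories."""
    pipelines = {}
    for directory in search_dirs:
        if not os.path.isdir(directory):
            continue  # skip locations that don't exist on this system
        for entry in sorted(os.listdir(directory)):
            if entry.endswith((".yaml", ".yml")):
                name = os.path.splitext(entry)[0]
                pipelines[name] = os.path.join(directory, entry)
    return pipelines
```

The CLI could then treat the keys of this mapping as valid `--pipeline` values alongside the built-in `simple` and `full`.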
## Alternative Approaches

Alternatives already considered and discarded are listed below.

### No Custom Flows

It would be preferable not to support custom flows, especially so early in the project, because:

- We will need an extensive API to support this customization, and we will need to be careful about making incompatible changes to that API once it has been adopted.
- We would learn more about the pipelines that users are creating if they were added to the library.

This approach was discarded because of strong demand from downstream users to define custom flows that encapsulate proprietary pipeline configuration.

### Custom Flows as Code

If we have an API for creating flows, users could define these custom flows in Python rather than with a configuration file format.

This approach was discarded because of a desire by downstream users to separate reusable logic from proprietary pipeline configuration.

The initial version of the SDG library API design (#98) proposed using YAML files, and this was changed to Python code based on this feedback:

> Does this need to be a yaml file?
>
> or is it actually a Python dict passed to the library?
>
> I actually think it would be a nice simplification to not worry about config files at all, and from the library perspective, assume configuration is passed in via data structures.
>
> How that config is constructed could be a problem of the library consumer. Maybe they hardcode it. maybe they allow a subset to be considered. Some could be driven by CLI args, for example.

Since adopting YAML may now appear contradictory to that feedback, it is useful to understand how the feedback relates to this new design:

1. The feedback assumes that YAML will be used for custom pipelines, but wonders whether it would be better to implement that in the CLI instead of the library.
2. Not called out is that, at the time, it was unclear whether custom pipeline definitions would also need to include custom model serving configuration - if so, the model serving configuration would not belong in the SDG library. It is now better understood that no model serving configuration needs to be included in the pipeline definitions. (See above.)
3. The POC implementation of this format makes it clear - in a way that wasn't clear from an API design - that using the YAML format within the library is an improvement.
