feat: split process_data out from run_training #438
base: main
@@ -162,13 +162,10 @@ class TrainingArgs(BaseModel):
     # this field determines if ibm_legacy_tmpl should be used instead
     use_legacy_tmpl: bool = False

-    # this field specifies the filepath to the training dataset before processing
+    # this field specifies the filepath to the training dataset
     data_path: str
-    ckpt_output_dir: str

-    # this field defines where we should be saving the processed version of the training dataset
-    # after we have tokenized it
-    data_output_dir: str
Review comment, on the removed data_output_dir field:

AFAIU, the instructlab CLI currently uses this field to pass --data-output-dir from the user, and it assumes the default behavior. Would it be possible / desirable to make the transition smoother, e.g. by deprecating the field before removing it? Or is the strategy here to rewrite the CLI and bump the minimal version of the training library? Even if a CLI rewrite against the new training interface is the plan, it would be easier on the remaining CLI maintainers if the change happened through gradual deprecation.

Reply:

I'm not sure how this is handled in the current refactor @cdoern is working on. Assuming it doesn't handle this case, here is how I recommend we handle it, based on how we made the initial refactor to move to the newer version of data processing:
+    ckpt_output_dir: str

     max_seq_len: int
     max_batch_len: int
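One way to realize the gradual deprecation asked for in the thread above would be to keep `data_output_dir` as an optional field that warns for a release or two before being dropped. A minimal sketch, assuming pydantic v2; the warning text and the abbreviated field set are illustrative, not part of this PR:

```python
import warnings
from typing import Optional

from pydantic import BaseModel, model_validator


class TrainingArgs(BaseModel):
    # Abbreviated field set, for illustration only.
    data_path: str
    ckpt_output_dir: str

    # Kept temporarily so existing callers (e.g. a CLI forwarding
    # --data-output-dir) don't break; the new code path ignores it.
    data_output_dir: Optional[str] = None

    @model_validator(mode="after")
    def _warn_on_deprecated_fields(self):
        if self.data_output_dir is not None:
            warnings.warn(
                "data_output_dir is deprecated: process the dataset "
                "separately and pass the processed file via data_path.",
                DeprecationWarning,
                stacklevel=2,
            )
        return self
```

Once the CLI stops setting the field, a later release can drop it outright.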
@@ -207,9 +204,6 @@ class TrainingArgs(BaseModel):
     # quantize_dtype: QuantizeDataType = QuantizeDataType.NONE
     lora: LoraOptions | None = None

-    # This field defines whether or not data processing will occur inside of `run_training()`
-    process_data: Optional[bool] = True
-
     # This field specifies whether only the last checkpoint should be retained. When set to true, it
     # will overwrite the previous checkpoint directory, keeping only one directory called
     # "last_epoch". This works alongside the '--checkpoint_at_epoch' flag.
Review comment (booxter):

nit, opinion, not required: in some codebases I worked with, it's common to import `os.path` instead of `os` to give a hint that we care about the path functions specifically and not just any `os` stuff.

Reply:
@booxter or better yet, `from os import path` 😉
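For reference, the two import styles from this exchange read like this; the snippet is illustrative, not from the PR:

```python
import os.path         # the comment's style: flags path usage at the import site

print(os.path.join("ckpt_output_dir", "last_epoch"))

from os import path    # the reply's style: binds the submodule directly

print(path.join("ckpt_output_dir", "last_epoch"))
```

Functionally they are equivalent (`import os.path` still imports and binds the whole `os` package), so the difference is purely a readability signal at the import line.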