LLama3 MVP #802
Merged
This implements the minimum necessary changes to get a functional LLama3 implementation.
Key fixes compared to the current llama3 branch:
When trying to set up CI, we run into the problem that even the 1B model is too large to fit for training. I've tried two different things:

a) Run the reference code with `--device=cpu`. This works, but we see numerical differences quite prominently; tolerances would need to be increased by ~10x for fp32 mode.

b) Use torchao's `CPUOffloadOptimizer` (see the sketch after this list). This works, but introduces another dependency on the Python side. It also changes the numerics, but only for the optimizer, so it doesn't break the gradient step. EDIT: `CPUOffloadOptimizer` is not compatible with gradient clipping, so I had to add a terrible hack :(

The loss values set as targets in `test_llama3.cu` are generated from the `.py` file, but I ran it on a larger GPU so that these are without offloading.
What is still missing:

Q: Do we really want to store the hidden dimension size as a floating-point factor, followed by rounding to a multiple of 1024? Instead of just specifying `hidden_dim` explicitly? The factor would make sense if it were the same across all models, but it differs (a small sketch of the two options is below).

I do have code for all of these, but I'd like to keep the PRs at a manageable size, so these changes aren't included here.
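To make the question concrete, the factor-plus-rounding scheme as I understand it is roughly the following; the constants are illustrative (my reading of the Llama 3 8B config), not values taken from this PR.

```python
# Illustrative only; dim=4096, ffn_dim_multiplier=1.3, multiple_of=1024 are my
# reading of the Llama 3 8B config, not values from this PR.

def hidden_dim_from_factor(dim: int, ffn_dim_multiplier: float, multiple_of: int) -> int:
    """Factor-based scheme: scale 4*dim down and by the factor, then round up."""
    hidden = int(2 * (4 * dim) / 3)            # 4096 -> 10922
    hidden = int(ffn_dim_multiplier * hidden)  # 10922 -> 14198
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)  # -> 14336

# The alternative raised above: store the result explicitly in the config.
hidden_dim_explicit = 14336

assert hidden_dim_from_factor(4096, 1.3, 1024) == hidden_dim_explicit
```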