Fix gradient tests #801

ngc92 · 2025-04-13T08:35:16Z

The order of tensors inside our buffer differs from the order in which we run our gradient checks.
Because the reference and test tensors share the same layout, we're still comparing corresponding elements, but the mapping of (named) tensors and tolerances onto the buffer is wrong; potentially, some elements were never compared and others were compared twice.
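To illustrate the kind of mismatch described above, here is a minimal sketch. The tensor names and sizes are hypothetical and much smaller than the real model, and this is not the repository's test code; it only shows how accumulating offsets in the wrong order attaches a name (and its tolerance) to the wrong slice of the buffer.

```python
# Hypothetical tensor names and sizes, chosen only for illustration.
BUFFER_ORDER = [("wte", 6), ("ln1w", 2), ("qkvw", 4)]  # order tensors are packed in the buffer
CHECK_ORDER = [("wte", 6), ("qkvw", 4), ("ln1w", 2)]   # order the checks walk (mismatched)


def spans(order):
    """Map each name to its (start, end) range when offsets are accumulated in `order`."""
    out, offset = {}, 0
    for name, size in order:
        out[name] = (offset, offset + size)
        offset += size
    return out


actual = spans(BUFFER_ORDER)   # where each tensor really lives
assumed = spans(CHECK_ORDER)   # where the checks believe it lives

# Names whose tolerance would be applied to the wrong region of the buffer.
mislabeled = [name for name in actual if actual[name] != assumed[name]]
print(mislabeled)
```

In this toy case the layernorm tolerance ends up applied to part of the region that actually holds `qkvw`, which is one way a single tensor can appear to need a much larger tolerance than its siblings.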

This PR reorders the tensor names and thresholds to match our actual model definition, and adapts the thresholds accordingly (this is much easier to see in the diff of just the second commit). For layernorms, the thresholds had to be increased quite a lot (maybe it should have been suspicious that the last LN needed so much larger tolerances than the others; now they're more similar), but for some other tensors we could actually tighten the error bounds.
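As a rough sketch of what a per-tensor threshold means here: each named tensor gets its own bound on the largest absolute difference between test and reference gradients. The function name, tensor names, and values below are assumptions for illustration and do not reproduce the actual check or thresholds in the repository.

```python
def check_tensor(test, ref, tol):
    """Return (passed, max_abs_diff) for two equally sized gradient slices."""
    max_diff = max(abs(t - r) for t, r in zip(test, ref))
    return max_diff <= tol, max_diff


# Hypothetical per-tensor tolerances; real values depend on the model and the GPU.
thresholds = {"ln1w": 0.5, "qkvw": 0.125}

ref = [1.0, 2.0]
test = [1.0, 2.25]

loose_ok, diff = check_tensor(test, ref, thresholds["ln1w"])
tight_ok, _ = check_tensor(test, ref, thresholds["qkvw"])
print(loose_ok, tight_ok, diff)
```

The same error passes under the looser bound and fails under the tighter one, so a tolerance paired with the wrong tensor can either hide a real discrepancy or flag a harmless one.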

The error thresholds have been tested on an A6000.

karpathy · 2025-05-10T23:24:07Z

good catch, agree on the order.

ngc92 added 2 commits April 13, 2025 10:01

reorder checks to match actual tensor order

166f5e6

adjusted error thresholds

42aa891

ngc92 mentioned this pull request Apr 14, 2025

LLama3 MVP #802

Merged

karpathy merged commit f1e2ace into karpathy:master May 10, 2025
13 checks passed
