Stav/remove test compare to python vm #2086


Open · wants to merge 1 commit into base: starkware-development

Conversation

@Stavbe (Collaborator) commented May 4, 2025

Initially, we planned to fill holes only in the memory used for prover_input_info, but we realized it would be better for Stone to receive a memory file without holes as well, since the hole values are already computed by the VM.
Therefore, we no longer intend to compare the memory against the Python VM’s memory, as the two will not be identical.
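The incompatibility can be made concrete: an unfilled memory dump is a strict subset of the filled one, so a byte-for-byte comparison fails even when both VMs agree on every cell they actually wrote. A minimal sketch, assuming memory dumps are modeled as address→value dicts (the representation and function name are illustrative, not the VM's actual data structures):

```python
# Illustrative only: memory dumps modeled as {address: value} dicts.
# A filled dump agrees with an unfilled one on every shared address,
# but also assigns values to the holes, so the dumps are not identical.
def agrees_on_written_cells(unfilled, filled):
    """True if `filled` matches `unfilled` wherever `unfilled` is defined."""
    return all(filled.get(addr) == val for addr, val in unfilled.items())

unfilled = {0: 7, 2: 9}          # address 1 is a hole
filled = {0: 7, 1: 0, 2: 9}      # hole filled by the VM

assert agrees_on_written_cells(unfilled, filled)
assert unfilled != filled        # hence a byte-for-byte comparison fails
```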



@Stavbe Stavbe changed the base branch from main to starkware-development May 4, 2025 11:53

github-actions bot commented May 4, 2025

**Hyper Threading Benchmark results**




```
hyperfine -r 2 -n "hyper_threading_main threads: 1" 'RAYON_NUM_THREADS=1 ./hyper_threading_main' -n "hyper_threading_pr threads: 1" 'RAYON_NUM_THREADS=1 ./hyper_threading_pr'
Benchmark 1: hyper_threading_main threads: 1
  Time (mean ± σ):     26.330 s ±  0.006 s    [User: 25.528 s, System: 0.798 s]
  Range (min … max):   26.325 s … 26.335 s    2 runs

Benchmark 2: hyper_threading_pr threads: 1
  Time (mean ± σ):     25.640 s ±  0.033 s    [User: 24.768 s, System: 0.868 s]
  Range (min … max):   25.616 s … 25.663 s    2 runs

Summary
  hyper_threading_pr threads: 1 ran
    1.03 ± 0.00 times faster than hyper_threading_main threads: 1
```

```
hyperfine -r 2 -n "hyper_threading_main threads: 2" 'RAYON_NUM_THREADS=2 ./hyper_threading_main' -n "hyper_threading_pr threads: 2" 'RAYON_NUM_THREADS=2 ./hyper_threading_pr'
Benchmark 1: hyper_threading_main threads: 2
  Time (mean ± σ):     14.610 s ±  0.084 s    [User: 25.537 s, System: 0.835 s]
  Range (min … max):   14.551 s … 14.669 s    2 runs

Benchmark 2: hyper_threading_pr threads: 2
  Time (mean ± σ):     13.935 s ±  0.014 s    [User: 24.843 s, System: 0.913 s]
  Range (min … max):   13.925 s … 13.945 s    2 runs

Summary
  hyper_threading_pr threads: 2 ran
    1.05 ± 0.01 times faster than hyper_threading_main threads: 2
```

```
hyperfine -r 2 -n "hyper_threading_main threads: 4" 'RAYON_NUM_THREADS=4 ./hyper_threading_main' -n "hyper_threading_pr threads: 4" 'RAYON_NUM_THREADS=4 ./hyper_threading_pr'
Benchmark 1: hyper_threading_main threads: 4
  Time (mean ± σ):     10.845 s ±  0.480 s    [User: 38.708 s, System: 0.984 s]
  Range (min … max):   10.505 s … 11.184 s    2 runs

Benchmark 2: hyper_threading_pr threads: 4
  Time (mean ± σ):     10.431 s ±  0.528 s    [User: 36.853 s, System: 1.077 s]
  Range (min … max):   10.057 s … 10.804 s    2 runs

Summary
  hyper_threading_pr threads: 4 ran
    1.04 ± 0.07 times faster than hyper_threading_main threads: 4
```

```
hyperfine -r 2 -n "hyper_threading_main threads: 6" 'RAYON_NUM_THREADS=6 ./hyper_threading_main' -n "hyper_threading_pr threads: 6" 'RAYON_NUM_THREADS=6 ./hyper_threading_pr'
Benchmark 1: hyper_threading_main threads: 6
  Time (mean ± σ):     10.772 s ±  0.052 s    [User: 38.666 s, System: 0.992 s]
  Range (min … max):   10.735 s … 10.809 s    2 runs

Benchmark 2: hyper_threading_pr threads: 6
  Time (mean ± σ):     10.070 s ±  0.212 s    [User: 37.750 s, System: 1.062 s]
  Range (min … max):    9.920 s … 10.219 s    2 runs

Summary
  hyper_threading_pr threads: 6 ran
    1.07 ± 0.02 times faster than hyper_threading_main threads: 6
```

```
hyperfine -r 2 -n "hyper_threading_main threads: 8" 'RAYON_NUM_THREADS=8 ./hyper_threading_main' -n "hyper_threading_pr threads: 8" 'RAYON_NUM_THREADS=8 ./hyper_threading_pr'
Benchmark 1: hyper_threading_main threads: 8
  Time (mean ± σ):     10.467 s ±  0.079 s    [User: 39.127 s, System: 1.005 s]
  Range (min … max):   10.412 s … 10.523 s    2 runs

Benchmark 2: hyper_threading_pr threads: 8
  Time (mean ± σ):     10.194 s ±  0.066 s    [User: 37.636 s, System: 1.086 s]
  Range (min … max):   10.147 s … 10.241 s    2 runs

Summary
  hyper_threading_pr threads: 8 ran
    1.03 ± 0.01 times faster than hyper_threading_main threads: 8
```

```
hyperfine -r 2 -n "hyper_threading_main threads: 16" 'RAYON_NUM_THREADS=16 ./hyper_threading_main' -n "hyper_threading_pr threads: 16" 'RAYON_NUM_THREADS=16 ./hyper_threading_pr'
Benchmark 1: hyper_threading_main threads: 16
  Time (mean ± σ):     10.417 s ±  0.114 s    [User: 39.511 s, System: 1.097 s]
  Range (min … max):   10.337 s … 10.498 s    2 runs

Benchmark 2: hyper_threading_pr threads: 16
  Time (mean ± σ):     10.216 s ±  0.243 s    [User: 37.826 s, System: 1.167 s]
  Range (min … max):   10.044 s … 10.388 s    2 runs

Summary
  hyper_threading_pr threads: 16 ran
    1.02 ± 0.03 times faster than hyper_threading_main threads: 16
```



github-actions bot commented May 4, 2025

Benchmark Results for unmodified programs 🚀

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base big_factorial | 2.175 ± 0.021 | 2.157 | 2.215 | 1.00 ± 0.01 |
| head big_factorial | 2.169 ± 0.016 | 2.150 | 2.206 | 1.00 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base big_fibonacci | 2.110 ± 0.025 | 2.091 | 2.175 | 1.00 ± 0.02 |
| head big_fibonacci | 2.105 ± 0.025 | 2.076 | 2.158 | 1.00 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base blake2s_integration_benchmark | 7.684 ± 0.100 | 7.592 | 7.952 | 1.00 |
| head blake2s_integration_benchmark | 7.944 ± 0.060 | 7.883 | 8.036 | 1.03 ± 0.02 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base compare_arrays_200000 | 2.210 ± 0.013 | 2.201 | 2.242 | 1.00 ± 0.01 |
| head compare_arrays_200000 | 2.201 ± 0.007 | 2.190 | 2.216 | 1.00 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base dict_integration_benchmark | 1.447 ± 0.011 | 1.434 | 1.470 | 1.01 ± 0.01 |
| head dict_integration_benchmark | 1.436 ± 0.015 | 1.423 | 1.477 | 1.00 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base field_arithmetic_get_square_benchmark | 1.236 ± 0.006 | 1.223 | 1.246 | 1.00 |
| head field_arithmetic_get_square_benchmark | 1.240 ± 0.005 | 1.233 | 1.250 | 1.00 ± 0.01 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base integration_builtins | 7.705 ± 0.041 | 7.639 | 7.755 | 1.00 |
| head integration_builtins | 7.972 ± 0.041 | 7.906 | 8.033 | 1.03 ± 0.01 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base keccak_integration_benchmark | 7.927 ± 0.085 | 7.853 | 8.149 | 1.00 |
| head keccak_integration_benchmark | 8.337 ± 0.071 | 8.282 | 8.504 | 1.05 ± 0.01 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base linear_search | 2.212 ± 0.014 | 2.199 | 2.244 | 1.00 |
| head linear_search | 2.227 ± 0.029 | 2.206 | 2.301 | 1.01 ± 0.01 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base math_cmp_and_pow_integration_benchmark | 1.538 ± 0.009 | 1.523 | 1.550 | 1.00 |
| head math_cmp_and_pow_integration_benchmark | 1.551 ± 0.014 | 1.538 | 1.587 | 1.01 ± 0.01 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base math_integration_benchmark | 1.473 ± 0.005 | 1.464 | 1.479 | 1.00 |
| head math_integration_benchmark | 1.476 ± 0.009 | 1.470 | 1.498 | 1.00 ± 0.01 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base memory_integration_benchmark | 1.234 ± 0.004 | 1.229 | 1.241 | 1.00 |
| head memory_integration_benchmark | 1.240 ± 0.016 | 1.228 | 1.274 | 1.00 ± 0.01 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base operations_with_data_structures_benchmarks | 1.552 ± 0.007 | 1.544 | 1.568 | 1.00 |
| head operations_with_data_structures_benchmarks | 1.580 ± 0.003 | 1.577 | 1.585 | 1.02 ± 0.00 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|:---|:---|:---|:---|
| base pedersen | 537.5 ± 0.8 | 536.5 | 539.0 | 1.00 |
| head pedersen | 538.3 ± 3.2 | 534.7 | 545.9 | 1.00 ± 0.01 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|:---|:---|:---|:---|
| base poseidon_integration_benchmark | 640.7 ± 7.0 | 634.4 | 658.7 | 1.00 |
| head poseidon_integration_benchmark | 658.5 ± 4.7 | 651.8 | 666.7 | 1.03 ± 0.01 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base secp_integration_benchmark | 1.853 ± 0.005 | 1.848 | 1.864 | 1.00 |
| head secp_integration_benchmark | 1.878 ± 0.006 | 1.866 | 1.888 | 1.01 ± 0.00 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|:---|:---|:---|:---|
| base set_integration_benchmark | 635.0 ± 2.0 | 632.4 | 639.1 | 1.00 |
| head set_integration_benchmark | 635.7 ± 7.8 | 629.6 | 655.0 | 1.00 ± 0.01 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|:---|:---|:---|:---|
| base uint256_integration_benchmark | 4.336 ± 0.036 | 4.301 | 4.429 | 1.00 ± 0.01 |
| head uint256_integration_benchmark | 4.325 ± 0.015 | 4.304 | 4.348 | 1.00 |


codecov bot commented May 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.55%. Comparing base (33d75ca) to head (62792f5).

Additional details and impacted files
```diff
@@                    Coverage Diff                    @@
##           starkware-development    #2086      +/-   ##
=========================================================
- Coverage                  96.62%   96.55%   -0.08%
=========================================================
  Files                        102      102
  Lines                      44388    43250    -1138
=========================================================
- Hits                       42889    41759    -1130
+ Misses                      1499     1491       -8
```


@Stavbe Stavbe marked this pull request as ready for review May 4, 2025 12:26
@Stavbe Stavbe requested a review from DavidLevitGurevich May 4, 2025 12:26
@Stavbe Stavbe self-assigned this May 4, 2025
@JulianGCalderon (Contributor) commented:

Hi @Stavbe!

Is there a way to keep the memory comparison anyway? Comparing with cairo-lang is a great way to test the Cairo VM and ensure that we don't break compatibility.

We came up with two solutions:

  • Add support for both unfilled and filled memory (maybe with a flag --unfilled-memory). That way, we can keep the memory comparisons. The flag --memory would output the filled memory (incompatible with cairo-lang).
  • Sync these changes with cairo-lang, so that cairo-lang also fills memory. That way we can update the VM behaviour, but still compare the results to cairo-lang.
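A usage sketch of the first option. Hedged: only the `--unfilled-memory` spelling comes from the comment above; the binary name and everything else here are illustrative, not an implemented interface.

```shell
# Hypothetical CLI sketch only -- not an implemented interface.
# Hole-preserving dump, byte-comparable with cairo-lang's output:
cairo-vm-cli program.json --unfilled-memory unfilled.memory
# Filled dump for Stone (incompatible with cairo-lang):
cairo-vm-cli program.json --memory_file filled.memory
```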

@JulianGCalderon (Contributor) commented:

Another alternative could be to add a script that fills the memory holes when necessary, leaving the hole-filling out of the VM output itself. Would this work?
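A sketch of such a post-processing script. Hedged: the entry layout assumed here (an 8-byte little-endian address followed by a 32-byte little-endian field element) and the zero filler are assumptions about the relocated memory file format, not taken from this PR.

```python
# Assumed format: each entry is an 8-byte LE address + a 32-byte LE value.
import struct

ENTRY = 8 + 32

def read_memory(path):
    """Parse a memory dump into an {address: raw_value_bytes} dict."""
    mem = {}
    with open(path, "rb") as f:
        data = f.read()
    for off in range(0, len(data), ENTRY):
        addr = struct.unpack_from("<Q", data, off)[0]
        mem[addr] = data[off + 8 : off + ENTRY]
    return mem

def fill_holes(mem, filler=b"\x00" * 32):
    """Insert `filler` at every missing address between min and max."""
    lo, hi = min(mem), max(mem)
    return {a: mem.get(a, filler) for a in range(lo, hi + 1)}

def write_memory(mem, path):
    """Write the dict back in the same assumed binary layout."""
    with open(path, "wb") as f:
        for addr in sorted(mem):
            f.write(struct.pack("<Q", addr) + mem[addr])
```

This would let the VM keep emitting cairo-lang-compatible output, with hole-filling deferred to the point where Stone needs it.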

@DavidLevitGurevich (Collaborator) left a comment:


:lgtm:

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Stavbe)

@Stavbe Stavbe force-pushed the stav/remove_test_compare_to_python_vm branch from 62792f5 to 87648df Compare May 15, 2025 14:25
@Stavbe Stavbe force-pushed the stav/remove_test_compare_to_python_vm branch from 87648df to 69efbc5 Compare May 15, 2025 14:35
@Stavbe (Collaborator, Author) left a comment:


Hi @JulianGCalderon,
We moved the hole-filling logic to run only in proof mode, so I updated this PR to perform memory comparison checks only in the other cases.

Reviewable status: 1 of 7 files reviewed, 2 unresolved discussions (waiting on @DavidLevitGurevich and @Stavbe)


vm/src/tests/compare_outputs_dynamic_layouts.sh line 197 at r3 (raw file):

    echo "Running cairo-lang with case: $case"
    cairo-run --program "$full_program" \
        --layout "dynamic" --cairo_layout_params_file "$full_layout" --proof_mode \

proof mode


vm/src/tests/compare_factorial_outputs_all_layouts.sh line 14 at r3 (raw file):

    # Run cairo_lang
    echo "Running cairo_lang with layout $layout"
    cairo-run --layout $layout --proof_mode  --program $factorial_compiled --trace_file factorial_py.trace --memory_file factorial_py.memory --air_public_input factorial_py.air_public_input --air_private_input factorial_py.air_private_input

proof mode
