fix(eval): iterative evaluation improvements; SWE-Bench multimodal fixes #7739

Merged
xingyaoww merged 47 commits into main from xw/fix-eval on Apr 8, 2025

@xingyaoww (Collaborator) commented Apr 7, 2025

  • This change is worth documenting at https://docs.all-hands.dev/
  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

End-user friendly description of the problem this fixes or functionality that this introduces.


Give a summary of what the PR does, explaining any non-trivial design decisions.

  • Improve iterative evaluation to retry when the generated patch is empty
  • Lower some of the eval logging to debug level
  • Fix a json.loads issue for SWE-Bench Multimodal instances
  • Handle the SWE-Bench init script being in a different location (for multimodal)
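
The retry-on-empty-patch behavior in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `run_with_retry_on_empty_patch`, `run_once`, and `max_attempts` are hypothetical names, and the assumption is that one evaluation attempt returns the generated patch as a string.

```python
from typing import Callable, Optional, Tuple


def run_with_retry_on_empty_patch(
    run_once: Callable[[int], str],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Call `run_once` until it yields a non-empty patch.

    `run_once` is a hypothetical callable that receives the 1-based
    attempt number and returns the patch text for that attempt.
    Returns (patch, attempts_used); patch is None if every attempt
    produced an empty or whitespace-only patch.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        patch = run_once(attempt)
        # Treat None, "" and whitespace-only output as an empty patch.
        if patch and patch.strip():
            return patch, attempt

    return None, max_attempts
```

A caller would wrap whatever produces the patch for one instance in `run_once`; keeping the attempt number in the return value makes it easy to log how many retries an instance needed.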

Link of any specific issues this addresses.


To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:8160f2b-nikolaik \
  --name openhands-app-8160f2b \
  docker.all-hands.dev/all-hands-ai/openhands:8160f2b

juanmichelini and others added 30 commits March 12, 2025 17:26
…proved method localization. Edit distance of unsolved generated patches vs golden patches is low in many cases.
@xingyaoww changed the title from "Misc fix for eval" to "fix(eval): iterative evaluation improvements; SWE-Bench multimodal fixes" Apr 8, 2025
@xingyaoww marked this pull request as ready for review April 8, 2025 14:50
@xingyaoww enabled auto-merge (squash) April 8, 2025 15:07
@xingyaoww disabled auto-merge April 8, 2025 15:09
@juanmichelini (Contributor) left a comment

LGTM!

@xingyaoww (Collaborator, Author) commented

Did a quick validation evaluation run; this gives us 30/50. Good to merge!

@xingyaoww merged commit ddda30d into main Apr 8, 2025
21 checks passed
@xingyaoww deleted the xw/fix-eval branch April 8, 2025 18:44
4 participants