
Commit 983cc30

Added readme to explain AgentCoder use in HumanEvalComm
1 parent 122ab58 commit 983cc30

2 files changed: +53 -11 lines changed

Diff for: README.md

@@ -55,17 +55,6 @@ Here are some examples:

 In this work, for open-source models in phase 0-2, we run sockeye scripts (./scripts/sockeye_scripts/*.sh) to run model inferences in Sockeye (https://arc.ubc.ca/compute-storage/ubc-arc-sockeye), due to resource limitations of the authors' desktop.

-## Running AgentCoder baseline
-
-- Requires CodeGeeX library and a few changes to its files.
-- Inside the human-eval-comm directory, run the following command:
-```
-git clone https://github.com/THUDM/CodeGeeX
-```
-- Then navigate to CodeGeeX/codegeex/benchmark/execution.py and make the following changes:
-  - change every instance of "test_code" to "full_code"
-  - change every instance of "generation" to "completion"
-
 ## Acknowledgements
 This code is heavily influenced by the Nondeterminism evaluation research of ChatGPT (https://github.com/CodeHero0/Nondeterminism-of-ChatGPT-in-Code-Generation), and by IdentityChain (https://github.com/marcusm117/IdentityChain/tree/main) on testing models including StarCoderBase and CodeLlama.

Diff for: README_AgentFramework.md

@@ -0,0 +1,53 @@
# The AgentFramework baseline

- The original AgentCoder framework, described in the paper "AgentCoder: Multi-Agent Code Generation with Effective Testing and Self-Optimisation", is adapted for use with HumanEvalComm so that it can generate either code or clarifying questions.
- Relevant links:
  - [Paper link](https://arxiv.org/abs/2312.13010)
  - [GitHub link](https://github.com/huangd1999/AgentCoder)
- AgentCoder is one of the leading agent-based code generation pipelines, with reported pass@1 scores on the HumanEval benchmark of 96.3 with GPT-4 and 79.9 with ChatGPT, making it an ideal baseline for our project.
- AgentCoder consists of three steps (a conceptual sketch of this flow follows the list):
  - programmer.py: generates code based on the given problem
  - designer.py: generates test cases for the problem
  - executor.py: runs the code generated by "programmer.py" locally on the test cases generated by "designer.py" and improves the code
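
A minimal, self-contained sketch of this three-stage flow is shown below. The stub functions stand in for the real agents in programmer.py, designer.py, and executor.py; their names, signatures, and return values are assumptions for illustration, not the repository's actual APIs.

```python
# Illustrative sketch of the AgentCoder flow described above. The stubs below
# are placeholders for programmer.py, designer.py, and executor.py; the real
# agents call an LLM, while these just return canned strings.

def programmer_generate(problem: dict) -> str:
    """Stage 1 (programmer.py): draft a candidate solution for the problem."""
    return f"def {problem['entry_point']}(x):\n    return x  # placeholder draft"


def designer_generate_tests(problem: dict) -> str:
    """Stage 2 (designer.py): craft test cases for the problem."""
    return f"assert {problem['entry_point']}(1) == 1  # placeholder test"


def executor_run_and_refine(code: str, tests: str) -> str:
    """Stage 3 (executor.py): run the tests locally and refine failing drafts.

    In AgentCoder this step feeds failures back to the programmer agent; this
    placeholder simply returns the draft unchanged.
    """
    return code


def agentcoder_pipeline(problem: dict) -> str:
    code = programmer_generate(problem)
    tests = designer_generate_tests(problem)
    return executor_run_and_refine(code, tests)


if __name__ == "__main__":
    print(agentcoder_pipeline({"task_id": "HumanEvalComm/0", "entry_point": "candidate"}))
```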

## Running AgentCoder baseline

- AgentCoder requires a working installation of the CodeGeeX library. To set up the environment:
  - Inside the human-eval-comm directory, run the following command to clone the CodeGeeX repository:
    ```
    git clone https://github.com/THUDM/CodeGeeX
    ```
  - Then navigate to CodeGeeX and install the required dependencies:
    ```
    pip install -r requirements.txt
    ```
  - Then navigate to CodeGeeX/codegeex/benchmark/execution.py and make the following changes (a scripted version of these edits is sketched after this list):
    - change every instance of "test_code" to "full_code"
    - change every instance of "generation" to "completion"
  - These changes accommodate updates made to the CodeGeeX library after the original release of AgentCoder. More details on this issue can be found in this [GitHub discussion](https://github.com/huangd1999/AgentCoder/issues/1).
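
If you prefer to script these two renames rather than edit the file by hand, a minimal sketch is shown below. It assumes CodeGeeX was cloned into the current directory as in the command above; the replacements are blanket string substitutions, so review the resulting diff before running the benchmark.

```python
# Minimal sketch: apply the two renames to CodeGeeX's execution.py.
# Assumes the default clone location used above; adjust the path otherwise.
from pathlib import Path

exec_py = Path("CodeGeeX/codegeex/benchmark/execution.py")
source = exec_py.read_text()
source = source.replace("test_code", "full_code")
source = source.replace("generation", "completion")
exec_py.write_text(source)
```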

- Now, navigate to human-eval-comm/ and run the following steps to run AgentCoder on HumanEvalComm:
```
./scripts/script_stepwise_phase123_unix.sh "AgentCoder" 0 0 164
./scripts/script_stepwise_phase123_unix.sh "AgentCoder" 1 0 164
./scripts/script_stepwise_phase123_unix.sh "AgentCoder" 2 0 164
```
- The first step runs AgentCoder on round 1, asking the model to either generate code or ask clarifying questions. For this step we had to change how AgentCoder operates: it was originally built to run on the HumanEval dataset, whereas we run it on our HumanEvalComm dataset, whose modified prompts ask the model to generate clarifying questions.
- Because of this, round 1 uses only programmer.py from AgentCoder, since the response is not limited to code and may instead be clarifying questions. This produces a completion list that contains either clarifying questions or code (a sketch of this routing follows the list).
- For round 2, we generate answers to these questions.
- For round 3, we run the entirety of AgentCoder with all the context from the previous two rounds.
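
A minimal sketch of the round-1 routing described above is shown below; the helper and field names are hypothetical, not the actual HumanEvalComm or AgentCoder code.

```python
# Hypothetical sketch: split round-1 completions from programmer.py into code
# answers and clarifying questions. The heuristic and field names are
# illustrative only.

def looks_like_code(completion: str) -> bool:
    """Treat a completion containing a function definition or an import as
    code; anything else is assumed to be clarifying questions."""
    return "def " in completion or "import " in completion


def route_round1(completions: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition round-1 entries: questions get answered in round 2, and the
    full AgentCoder pipeline runs in round 3 with the accumulated context."""
    code_entries, question_entries = [], []
    for entry in completions:
        if looks_like_code(entry["completion"]):
            code_entries.append(entry)
        else:
            question_entries.append(entry)
    return code_entries, question_entries
```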

## Changes made to the original repository

- "max_workers" was reduced from its original value of "5" to "1", since our code sends only one request at a time to AgentCoder; this avoids concurrency issues and ensures that responses are processed sequentially.
- Instead of loading the HumanEval dataset from HuggingFace, we pass the HumanEvalComm data as a list of dictionaries directly into AgentCoder.
- File names now include the task_id, allowing precise tracking of each individual entry in the HumanEvalComm dataset.
- We changed the original prompt to programmer.py to include a "clarity_prompt", enabling the model to generate clarifying questions instead of code when necessary.
- In executor.py, we reduced the number of epochs to 1 to limit LLM calls and manage our budget efficiently.
- Since our data modifies the problem statements so that functions are named "candidate", we added checks in executor.py to accept either the original function name or "candidate" as the "entry_point" (see the sketch after this list).
- Robust error handling was introduced throughout the code to ensure smooth execution and prevent unexpected crashes.
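
A minimal sketch of the entry_point check described above is shown below; the function and field names are assumptions for illustration, not the actual executor.py code.

```python
# Hypothetical sketch of the entry_point fallback added to executor.py.
# HumanEvalComm rewrites problems so the target function may be named
# "candidate"; accept either the original entry point or "candidate".

def resolve_entry_point(problem: dict, generated_code: str) -> str:
    original = problem["entry_point"]
    if f"def {original}(" in generated_code:
        return original
    if "def candidate(" in generated_code:
        return "candidate"
    return original  # fall back to the original name if neither is found
```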

## Attribution and Licensing

- Authors: Dong Huang, Jie M. Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, Heming Cui
- License: MIT License
