Commit 8b2ac6e

cleanup
1 parent c90bb74 commit 8b2ac6e

5 files changed (+30, -46 lines)

Diff for: HumanEval/HumanEvalCommPlayground.json (-40)

This file was deleted.

Diff for: README.md (+27, -3)

@@ -1,7 +1,31 @@
-# (WIP) HumanEvalComm: Evaluating Communication Skill of Code LLM
+# HumanEvalComm: Evaluating the Communication Skill of Code LLM and LLM Agent
 
 ## Overview
-Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. Based on the observation that top-level software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, we argue that the same should be applied to LLMs for code generation tasks. By asking probing questions in various topics before generating the final code, the challenges of programming with LLMs, such as unclear intent specification, lack of computational thinking, and undesired code quality, may be alleviated. This, in turn, increases confidence in the generated code. In this work, we conducted an empirical study on the benchmark and analysis of the communication skills of LLMs toward greater confidence in generated code. We created a new benchmark, HumanEvalComm, by removing necessary information in problem descriptions.
+Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. The most recent trend is to use agent-based LLMs to iterate on the code generation process. Based on the observation that top-level software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, we argue that the same should be applied to LLMs for code generation tasks. For this purpose, we define the communication skill of LLMs as "being able to ask clarifying questions when the description of the code generation problem has issues". In this study, we restrict these issues to three matters from the software requirements engineering field: inconsistent requirements, ambiguous requirements, and incomplete requirements. By asking probing questions about the requirements of problem descriptions before generating the final code, challenges of programming with LLMs, such as unclear intent specification, may be alleviated, resulting in correct code in the initial iterations.
+
+In this work, we conducted an empirical study on the benchmark and analysis of the communication skills of LLMs for code generation. We created a new benchmark, HumanEvalComm, by modifying problem descriptions according to the three issues mentioned above: inconsistency, ambiguity, and incompleteness. We then experimented on HumanEvalComm with different Code LLMs and a new LLM Agent approach, the Code Clarification and Generation Agent (Okanagan), which identifies ambiguous parts of code and descriptions and asks questions to further refine the generated code. We defined Communication Rate and Good Question Rate as the evaluation metrics, representing the ratio of questions asked and of good-quality questions in responses. We found that 95% of responses from Code LLMs still generate code even when half of the problem descriptions are randomly removed. More than 80% of responses from Code LLMs still generate code even when the problem descriptions are manually modified according to the taxonomy of clarification types, with a lower test pass rate due to a lack of necessary information. Compared with Code LLMs, we also found that the proposed LLM Agent approach, Okanagan, effectively increased Communication Rate and Good Question Rate by an absolute 59% and 5%, respectively. This resulted in increases in Test Pass Rate and Pass@1 of 25% and 15%, respectively, indicating more effective communication capability of the LLM Agent compared with Code LLMs.
 
 ## Acknowledgements
-This code is heavily influenced by the Nondeterminism evaluation research of ChatGPT (https://github.com/CodeHero0/Nondeterminism-of-ChatGPT-in-Code-Generation)
+This code is heavily influenced by the Nondeterminism evaluation research of ChatGPT (https://github.com/CodeHero0/Nondeterminism-of-ChatGPT-in-Code-Generation), and by IdentityChain (https://github.com/marcusm117/IdentityChain/tree/main) for testing models including StarCoderBase and CodeLlama.
+
+## Reference
+Wu, Jie JW, and Fatemeh Hendijani Fard. "Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agents." arXiv preprint.
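The two metrics named in the README text above can be made concrete with a small sketch. The snippet below is illustrative only: the field names asks_question and question_quality, and the label "Good", are hypothetical placeholders rather than the repository's actual schema, and computing Good Question Rate over all responses (instead of only over question-asking responses) is an assumption.

def communication_rate(responses):
    """Fraction of responses that ask a clarifying question instead of emitting code."""
    if not responses:
        return 0.0
    asked = sum(1 for r in responses if r.get("asks_question"))
    return asked / len(responses)

def good_question_rate(responses):
    """Fraction of responses whose question was rated 'Good' (assumed denominator: all responses)."""
    if not responses:
        return 0.0
    good = sum(1 for r in responses if r.get("question_quality") == "Good")
    return good / len(responses)

# Example: two of three responses ask a question, one of them rated 'Good'.
responses = [
    {"asks_question": True, "question_quality": "Good"},
    {"asks_question": True, "question_quality": "Bad"},
    {"asks_question": False},
]
print(communication_rate(responses))  # 0.666...
print(good_question_rate(responses))  # 0.333...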

Diff for: generate_response.py (+1, -1)

@@ -1237,4 +1237,4 @@ def test_codellama(tokenizer, model, user_input, max_length):
         tokenizer.save_pretrained(args.saved_model_path)
         model.save_pretrained(args.saved_model_path)
     elif args.dataset.startswith('HumanEval'):
-        HumanEval_experiment(args.dataset, './HumanEval/'+args.dataset+'.jsonl', args.option, args.model, args.topn, args.temperature, args, model, tokenizer)
+        HumanEval_experiment(args.dataset, './Benchmark/'+args.dataset+'.jsonl', args.option, args.model, args.topn, args.temperature, args, model, tokenizer)
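The hunk above only swaps the benchmark directory passed to HumanEval_experiment. As a hedged sketch of the path construction it relies on (resolve_benchmark_path is a hypothetical helper, not a function in the repository):

import os

def resolve_benchmark_path(dataset_name, root="./Benchmark"):
    """Map a dataset name such as 'HumanEval' or 'HumanEvalComm' to its JSONL file."""
    path = os.path.join(root, dataset_name + ".jsonl")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Benchmark file not found: {path}")
    return path

# e.g. resolves to './Benchmark/HumanEvalComm.jsonl' when args.dataset == 'HumanEvalComm'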

Diff for: json_to_jsonl.py (+1, -1)

@@ -30,7 +30,7 @@ def convert_to_jsonl(input_file, output_file):
             f.write('\n')  # Add a newline character to separate each object
 
 # Example usage:
-# python json_to_jsonl.py ./HumanEval/HumanEvalComm.json ./HumanEval/HumanEvalComm.jsonl
+# python json_to_jsonl.py ./Benchmark/HumanEvalComm.json ./Benchmark/HumanEvalComm.jsonl
 if __name__ == "__main__":
     if len(sys.argv) != 3:
         print("Usage: python convert_to_jsonl.py input_file output_file")
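For context, the hunk above sits inside convert_to_jsonl, which writes one JSON object per line. Below is a minimal sketch of such a conversion, not the repository's actual implementation; it assumes the input JSON is a list of records (or a dict of records), since the hunk does not show how the file is parsed.

import json
import sys

def convert_to_jsonl_sketch(input_file, output_file):
    """Rewrite a JSON file as JSONL: one JSON object per line."""
    with open(input_file, "r") as f:
        data = json.load(f)
    # Assumption: the top-level JSON is either a list of records or a dict of records.
    records = data if isinstance(data, list) else list(data.values())
    with open(output_file, "w") as f:
        for record in records:
            f.write(json.dumps(record))
            f.write("\n")  # newline separates each object, as in the hunk above

if __name__ == "__main__":
    convert_to_jsonl_sketch(sys.argv[1], sys.argv[2])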

Diff for: measurement_summary_draw_heatmap.py (+1, -1)

@@ -345,7 +345,7 @@ def store_data_in_xlsx(correlation, file_suffix):
         with open('./dataset/code_contests_test.json', 'r') as f:
             problem_list = json.load(f)
     elif dataset == 'HumanEval' or dataset == 'HumanEvalComm':
-        with open('./HumanEval/HumanEval.jsonl', 'r') as f:
+        with open('./Benchmark/HumanEval.jsonl', 'r') as f:
             for line in f.readlines():
                 problem_list.append(json.loads(line))
     elif dataset == 'APPS':
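The hunk above reads the HumanEval problems line by line from the relocated JSONL file. A minimal sketch of that loading pattern (load_problem_list is a hypothetical helper, not a function in the repository):

import json

def load_problem_list(path):
    """Read a JSONL file into a list of problem dicts, skipping blank lines."""
    problems = []
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                problems.append(json.loads(line))
    return problems

# e.g. problems = load_problem_list('./Benchmark/HumanEval.jsonl')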
