Commit 8b2ac6e

cleanup
1 parent c90bb74 commit 8b2ac6e

5 files changed (+30, -46 lines)

Diff for: HumanEval/HumanEvalCommPlayground.json (-40)

This file was deleted.

Diff for: README.md (+27, -3)

@@ -1,7 +1,31 @@
-# (WIP) HumanEvalComm: Evaluating Communication Skill of Code LLM
+# HumanEvalComm: Evaluating the Communication Skill of Code LLM and LLM Agent
 
 ## Overview
-Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. Based on the observation that top-level software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, we argue that the same should be applied to LLMs for code generation tasks. By asking probing questions in various topics before generating the final code, the challenges of programming with LLMs, such as unclear intent specification, lack of computational thinking, and undesired code quality, may be alleviated. This, in turn, increases confidence in the generated code. In this work, we conducted an empirical study on the benchmark and analysis of the communication skills of LLMs toward greater confidence in generated code. We created a new benchmark, HumanEvalComm, by removing necessary information in problem descriptions.
+Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. The most recent trend is to use agent-based LLMs to iterate on the code generation process. Based on the observation that top-level software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, we argue that the same should be applied to LLMs for code generation tasks. For this purpose, we define the communication skill of LLMs as "being able to ask clarifying questions when the description of the code generation problem has issues". In this study, we restrict these issues to three matters from the software requirements engineering field: inconsistent requirements, ambiguous requirements, and incomplete requirements. By asking probing questions about the requirements of problem descriptions before generating the final code, challenges of programming with LLMs, such as unclear intent specification, may be alleviated, resulting in correct code in the initial iterations.
+
+In this work, we conducted an empirical study on the benchmark and analysis of the communication skills of LLMs for code generation. We created a new benchmark, HumanEvalComm, by modifying problem descriptions according to the three issues mentioned above: inconsistency, ambiguity, and incompleteness. We then experimented on HumanEvalComm with different Code LLMs and a new LLM Agent approach, the Code Clarification and Generation Agent (Okanagan), which identifies ambiguous parts of code and descriptions and asks questions to further refine the generated code. We defined Communication Rate and Good Question Rate as the evaluation metrics, representing the ratio of questions asked and of good-quality questions in responses. We found that 95% of responses from Code LLMs still generate code even when half of the problem descriptions are randomly removed. More than 80% of responses from Code LLMs still generate code even when the problem descriptions are manually modified according to the taxonomy of clarification types, with a lower test pass rate due to a lack of necessary information. Compared with Code LLMs, we also found that the proposed LLM Agent approach, Okanagan, effectively increased Communication Rate and Good Question Rate by an absolute 59% and 5%, respectively. This resulted in increases in Test Pass Rate and Pass@1 of 25% and 15%, respectively, indicating more effective communication capability of the LLM Agent compared with Code LLMs.
 
 ## Acknowledgements
-This code is heavily influenced by the Nondeterminism evaluation research of ChatGPT (https://github.com/CodeHero0/Nondeterminism-of-ChatGPT-in-Code-Generation)
+This code is heavily influenced by the Nondeterminism evaluation research of ChatGPT (https://github.com/CodeHero0/Nondeterminism-of-ChatGPT-in-Code-Generation), and by IdentityChain (https://github.com/marcusm117/IdentityChain/tree/main) for testing models including StarCoderBase and CodeLlama.
+
+## Reference
+Wu, Jie JW, and Fatemeh Hendijani Fard. "Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agents." arXiv preprint.
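The two metrics named in the README text above can be made concrete with a small sketch. The snippet below is illustrative only: the field names asks_question and question_quality, and the label "Good", are hypothetical placeholders rather than the repository's actual schema, and computing Good Question Rate over all responses (instead of only over question-asking responses) is an assumption.

def communication_rate(responses):
    """Fraction of responses that ask a clarifying question instead of emitting code."""
    if not responses:
        return 0.0
    asked = sum(1 for r in responses if r.get("asks_question"))
    return asked / len(responses)

def good_question_rate(responses):
    """Fraction of responses whose question was rated 'Good' (assumed denominator: all responses)."""
    if not responses:
        return 0.0
    good = sum(1 for r in responses if r.get("question_quality") == "Good")
    return good / len(responses)

# Example: two of three responses ask a question, one of them rated 'Good'.
responses = [
    {"asks_question": True, "question_quality": "Good"},
    {"asks_question": True, "question_quality": "Bad"},
    {"asks_question": False},
]
print(communication_rate(responses))  # 0.666...
print(good_question_rate(responses))  # 0.333...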

Diff for: generate_response.py (+1, -1)

@@ -1237,4 +1237,4 @@ def test_codellama(tokenizer, model, user_input, max_length):
         tokenizer.save_pretrained(args.saved_model_path)
         model.save_pretrained(args.saved_model_path)
     elif args.dataset.startswith('HumanEval'):
-        HumanEval_experiment(args.dataset, './HumanEval/'+args.dataset+'.jsonl', args.option, args.model, args.topn, args.temperature, args, model, tokenizer)
+        HumanEval_experiment(args.dataset, './Benchmark/'+args.dataset+'.jsonl', args.option, args.model, args.topn, args.temperature, args, model, tokenizer)
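The hunk above only swaps the benchmark directory passed to HumanEval_experiment. As a hedged sketch of the path construction it relies on (resolve_benchmark_path is a hypothetical helper, not a function in the repository):

import os

def resolve_benchmark_path(dataset_name, root="./Benchmark"):
    """Map a dataset name such as 'HumanEval' or 'HumanEvalComm' to its JSONL file."""
    path = os.path.join(root, dataset_name + ".jsonl")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Benchmark file not found: {path}")
    return path

# e.g. resolves to './Benchmark/HumanEvalComm.jsonl' when args.dataset == 'HumanEvalComm'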

Diff for: json_to_jsonl.py (+1, -1)

@@ -30,7 +30,7 @@ def convert_to_jsonl(input_file, output_file):
             f.write('\n')  # Add a newline character to separate each object
 
 # Example usage:
-# python json_to_jsonl.py ./HumanEval/HumanEvalComm.json ./HumanEval/HumanEvalComm.jsonl
+# python json_to_jsonl.py ./Benchmark/HumanEvalComm.json ./Benchmark/HumanEvalComm.jsonl
 if __name__ == "__main__":
     if len(sys.argv) != 3:
         print("Usage: python convert_to_jsonl.py input_file output_file")
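For context, the hunk above sits inside convert_to_jsonl, which writes one JSON object per line. Below is a minimal sketch of such a conversion, not the repository's actual implementation; it assumes the input JSON is a list of records (or a dict of records), since the hunk does not show how the file is parsed.

import json
import sys

def convert_to_jsonl_sketch(input_file, output_file):
    """Rewrite a JSON file as JSONL: one JSON object per line."""
    with open(input_file, "r") as f:
        data = json.load(f)
    # Assumption: the top-level JSON is either a list of records or a dict of records.
    records = data if isinstance(data, list) else list(data.values())
    with open(output_file, "w") as f:
        for record in records:
            f.write(json.dumps(record))
            f.write("\n")  # newline separates each object, as in the hunk above

if __name__ == "__main__":
    convert_to_jsonl_sketch(sys.argv[1], sys.argv[2])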

Diff for: measurement_summary_draw_heatmap.py (+1, -1)

@@ -345,7 +345,7 @@ def store_data_in_xlsx(correlation, file_suffix):
         with open('./dataset/code_contests_test.json', 'r') as f:
             problem_list = json.load(f)
     elif dataset == 'HumanEval' or dataset == 'HumanEvalComm':
-        with open('./HumanEval/HumanEval.jsonl', 'r') as f:
+        with open('./Benchmark/HumanEval.jsonl', 'r') as f:
             for line in f.readlines():
                 problem_list.append(json.loads(line))
     elif dataset == 'APPS':
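The hunk above reads the HumanEval problems line by line from the relocated JSONL file. A minimal sketch of that loading pattern (load_problem_list is a hypothetical helper, not a function in the repository):

import json

def load_problem_list(path):
    """Read a JSONL file into a list of problem dicts, skipping blank lines."""
    problems = []
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                problems.append(json.loads(line))
    return problems

# e.g. problems = load_problem_list('./Benchmark/HumanEval.jsonl')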
