HumanEvalFix integration #1908
@Muennighoff This is awesome, thank you! Is it ready to try out, or still under development?
@Muennighoff I see there are some TODOs in your doc. If it is done, could you please add more context to the PR description? If you are still developing, you can mark this PR as a draft.
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
You probably also want to include `enable_auto_lint = true`. Evaluation of CodeActAgent on SWE-bench-lite shows that this option can give the LLM a hint about indentation errors, and thus boosts the final score (if the language is Python).
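To illustrate why a lint pass can help here, the following is a minimal sketch of a syntax check that turns an indentation error into a short hint. It relies only on Python's built-in `compile()` and is not OpenDevin's actual linter; the function name `lint_hint` and the sample snippets are hypothetical.

```python
from typing import Optional


def lint_hint(source: str) -> Optional[str]:
    """Return a short hint if `source` fails to compile, else None."""
    try:
        # compile() parses the code without executing it.
        compile(source, "<agent-edit>", "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return f"{type(err).__name__} on line {err.lineno}: {err.msg}"
    return None


broken = "def f():\nreturn 1\n"  # function body is not indented
fixed = "def f():\n    return 1\n"

print(lint_hint(broken))
print(lint_hint(fixed))
```

A hint like this, fed back after each edit, lets the model correct the indentation before the test suite runs.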
@li-boxuan fixed
I've converted to draft, sorry! One TODO concerns programming languages: I'm unsure whether it makes sense to also add evaluation for the other programming languages in HumanEvalFix (Rust, C++, Java, JS, Go) or only Python.
Also cc @tangxiangru who is also working on the integration
I haven't read your paper yet, so please take my thoughts with a grain of salt:
evaluation/humanevalfix/README.md
Outdated
You can replace `eval_gpt4_1106_preview` with any model you set up in `config.toml`.

## Evaluate Generated Patches
Is this section necessary for HumanEvalFix? If not, we can remove it!
instance.declaration + instance.buggy_solution + '\n' + instance.test
)
path = os.path.join(
workspace_mount_path, f'{instance.task_id.replace("/", "__")}.py'
We need to do `instance.task_id.replace("/", "__")` because the instance id in the task contains `/`, which would be interpreted as a new folder and can cause issues.
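A minimal sketch of the path construction described above, assuming a task id of the form `Python/0`. The helper name `instance_file_path` and the `workspace` directory are hypothetical and only serve the illustration.

```python
import os


def instance_file_path(workspace_mount_path: str, task_id: str) -> str:
    # Replace "/" so the task id becomes a single file name, not a nested path.
    safe_name = task_id.replace("/", "__")
    return os.path.join(workspace_mount_path, f"{safe_name}.py")


# "Python/0" is an assumed example of a task id that contains a slash.
print(instance_file_path("workspace", "Python/0"))
```

Without the replacement, `os.path.join` would place the file in a `Python` subdirectory that may not exist in the workspace.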

# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
I added these two lines so that the new mount path can be handled by the sandbox
echo "WARNING: You are about to enable the execution of untrusted model-generated code by setting the environment variable HF_ALLOW_CODE_EVAL to '1'."
echo "It is highly unlikely that model-generated code will do something overtly malicious in response to this test suite, however, it may act destructively due to a lack of model capability or alignment."
echo "Please confirm that you have read the disclaimer, taken the necessary precautions, and wish to proceed (y/n):"
read user_input
I added this interactive prompt to set `HF_ALLOW_CODE_EVAL` to 1 after the user acknowledges the warning.
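The same confirmation gate could be sketched in Python as follows. This is an illustration rather than the script in the PR: the function name `confirm_code_eval` is hypothetical, and treating only `y` (case-insensitive) as consent is an assumption.

```python
import os


def confirm_code_eval(answer: str) -> bool:
    """Set HF_ALLOW_CODE_EVAL=1 only if the user answered 'y'."""
    # Assumption: any answer other than "y" counts as a refusal.
    if answer.strip().lower() == "y":
        os.environ["HF_ALLOW_CODE_EVAL"] = "1"
        return True
    return False


# Interactive use would pass the user's reply, e.g.:
#     confirm_code_eval(input("Proceed (y/n): "))
```

Keeping the environment variable unset by default means untrusted code is never executed unless the user has explicitly opted in.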
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Amazing, thanks so much for taking a look and for your fixes! I have moved the PR out of draft mode.
Left some nits. Mostly LGTM.
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
fix a bug: ERROR:concurrent.futures:exception calling callback for <Future at 0x309cbc470 state=finished raised NameError> concurrent.futures.process._RemoteTraceback:
added an example
added: enable_auto_lint = true
test_result = {'result': {}, 'metadata': {}}
code_metric = load('Muennighoff/code_eval_octopack')
timeout = LANGUAGE_TO_TIMEOUT[language]
num_workers = LANGUAGE_TO_NUM_WORKERS[language]
I just added this; otherwise the run fails with:
ERROR:concurrent.futures:exception calling callback for <Future at 0x309cbc470 state=finished raised NameError>
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/process.py", line 263, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 217, in process_instance
test_result = get_test_result(instance, path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 126, in get_test_result
num_workers=num_workers,
^^^^^^^^^^^
NameError: name 'num_workers' is not defined
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 340, in _invoke_callbacks
callback(self)
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 343, in update_progress
output = future.result()
^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
NameError: name 'num_workers' is not defined
20:48:42 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
20:48:43 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
20:48:43 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
20:48:43 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
20:48:44 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
ERROR:root: File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 340, in _invoke_callbacks
callback(self)
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 343, in update_progress
output = future.result()
^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 374, in <module>
future.result()
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
ERROR:root:<class 'NameError'>: name 'num_workers' is not defined
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [07:13<00:00, 86.64s/it]
Exception ignored in: <function _ExecutorManagerThread.__init__.<locals>.weakref_cb at 0x309c5ede0>
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/process.py", line 310, in weakref_cb
AttributeError: 'NoneType' object has no attribute 'util'
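The fix amounts to binding both per-language settings before they are passed on. A minimal sketch, assuming hypothetical dictionary values (the real contents of `LANGUAGE_TO_TIMEOUT` and `LANGUAGE_TO_NUM_WORKERS` are not shown in this thread) and a hypothetical helper name `get_eval_settings`:

```python
# Hypothetical per-language settings; the real values are not shown in this thread.
LANGUAGE_TO_TIMEOUT = {"python": 10}
LANGUAGE_TO_NUM_WORKERS = {"python": 4}


def get_eval_settings(language: str) -> dict:
    # Both names must be bound before they are used as keyword arguments;
    # omitting the second lookup leads to the NameError in the traceback above.
    timeout = LANGUAGE_TO_TIMEOUT[language]
    num_workers = LANGUAGE_TO_NUM_WORKERS[language]
    return {"timeout": timeout, "num_workers": num_workers}


print(get_eval_settings("python"))
```

Because the function runs inside a worker process, the `NameError` only surfaces when `future.result()` is called, which is why it appears twice in the log.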
add: evaluate package
update poetry.lock
update poetry.lock
Integrates HumanEvalFix from https://arxiv.org/abs/2308.07124