Running HumanEval safely with Riza

We're Riza and we make running untrusted code safe and easy. This post is the beginning of a series about evaluating LLM codegen capabilities using our Code Interpreter API.

The most obvious way to evaluate an LLM's code-generation ability is to ask it to produce some code, then run that code to see whether it works correctly. In fact, this is exactly how the most popular LLM codegen evaluation framework, OpenAI's human-eval, performs its evaluation.

This poses an obvious problem though, and it's spelled out right at the top of the human-eval README:

This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.

An LLM probably won't write code that sends all of your environment variable secrets to a Pastebin, but it might...

Because Riza is purpose-built to run untrusted code, it's easy to use our API as the execution engine for HumanEval evaluations. The steps below show you how. At a high level, instead of running exec() directly on your machine, you hand the generated code off to the Riza Code Interpreter API. Once you've imported and initialized the Riza API client library, it's just two additional lines of code:

diff --git a/human_eval/execution.py b/human_eval/execution.py
index bc509f5..bd002ed 100644
--- a/human_eval/execution.py
+++ b/human_eval/execution.py
@@ -44,30 +46,32 @@ def check_correctness(problem: Dict, completion: str, timeout: float,
             try:
                 exec_globals = {}
                 with swallow_io():
                     with time_limit(timeout):
-# WARNING
-# This program exists to execute untrusted model-generated code. Although
-# it is highly unlikely that model-generated code will do something overtly
-# malicious in response to this test suite, model-generated code may act
-# destructively due to a lack of model capability or alignment.
-# Users are strongly encouraged to sandbox this evaluation suite so that it
-# does not perform destructive actions on their host or network. For more
-# information on how OpenAI sandboxes its code, see the accompanying paper.
-# Once you have read this disclaimer and taken appropriate precautions,
-# uncomment the following line and proceed at your own risk:
-#                         exec(check_program, exec_globals)
+                        resp = riza.command.exec(language="PYTHON", code=check_program)
+                        assert resp.exit_code == 0
                 result.append("passed")
             except TimeoutException:
                 result.append("timed out")
             except BaseException as e:
                 result.append(f"failed: {e}")

Here are the steps required to get this working yourself. You can adapt the following to use the newer simple-evals framework, which we'll cover in a future post.

1. Clone the human-eval repository and install dependencies

git clone https://github.com/openai/human-eval
cd human-eval
pip install -e .   # you can ignore the entry point error
pip install rizaio # the Riza API client library
pip install groq   # for generating evaluation data via Groq
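
A quick way to confirm the installs worked, despite the entry point warning, is to import all three packages. This is just a minimal sanity check:

# If these imports succeed, the editable human-eval install and both
# API client libraries are available to the scripts below.
import human_eval
import rizaio
import groq

print("imports OK")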

2. Gather and configure API keys

You'll need a Riza API key from the Riza dashboard. Set it as the value of an environment variable named RIZA_API_KEY.

export RIZA_API_KEY=<your key value here>

To generate evaluation data you'll need access to a model that can generate code. We use Meta's llama3 70b model hosted by Groq for this example, because they have an easy-to-use API and a generous free tier. Get a Groq API key from https://console.groq.com/ and set it as the value of an environment variable named GROQ_API_KEY.

export GROQ_API_KEY=<your-api-key-here>
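
With both keys in place, you can sanity-check the Riza side before wiring anything into human-eval. Here's a minimal sketch using the same rizaio client call as the diff above; it assumes the client picks up RIZA_API_KEY from the environment and that the response exposes a stdout field alongside exit_code.

import rizaio

# The client reads RIZA_API_KEY from the environment.
riza = rizaio.Riza()

# Hand a trivial program to the Riza Code Interpreter API, the same call
# the modified check_correctness() will make in step 4.
resp = riza.command.exec(language="PYTHON", code="print(2 + 2)")

print(resp.exit_code)  # 0 on success
print(resp.stdout)     # assumed field: captured standard output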

3. Generate sample evaluation data

Copy and paste the following Python code into a file named generate_samples.py. Modify it as needed to improve the prompting or to generate sample data from other models.

from groq import Groq
from human_eval.data import write_jsonl, read_problems

client = Groq()

def generate_one_completion(prompt):
    print("Generating one sample completion")
    completion = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[
            {"role": "system", "content": "You write Python to solve problems. You only complete the Python code, nothing else. You don't use backticks in your responses."},
            # The HumanEval prompt is passed as an assistant message so the model
            # continues the partially written function rather than explaining it.
            {"role": "assistant", "content": prompt}
        ]
    )
    return completion.choices[0].message.content

problems = read_problems()

num_samples_per_task = 1  # one completion per task is enough to compute pass@1
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples-llama3-70b.jsonl", samples)

Run the script to generate samples; it writes them to samples-llama3-70b.jsonl. This can take a while, though usually less than ten minutes.

python generate_samples.py
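
Before moving on, it's worth peeking at the output to confirm the completions look like bare Python rather than prose or backtick-fenced code blocks. A small sketch using only the standard library (the task_id and completion keys come from the script above):

import json

# Print the first generated sample so you can eyeball its formatting.
with open("samples-llama3-70b.jsonl") as f:
    sample = json.loads(f.readline())

print(sample["task_id"])
print(sample["completion"])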

4. Modify execution.py to run code on Riza

Open the file human_eval/execution.py and add the following lines near the top of the file, underneath the existing import statements.

import rizaio

riza = rizaio.Riza()

In the same file, find the big warning about untrusted code (around line 53). Instead of uncommenting the line that reads exec(check_program, exec_globals), add the following two lines in its place, keeping the same indentation.

resp = riza.command.exec(language="PYTHON", code=check_program)
assert resp.exit_code == 0
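
For context, check_program is assembled a few lines earlier in check_correctness() from the problem prompt, the model's completion, the benchmark's unit tests, and a final call to the test entry point, so the program Riza runs is self-contained and exits non-zero if any assertion fails. Roughly (this is a paraphrase of the existing human-eval code, not something you need to add):

# Approximate shape of the program handed to Riza: prompt + completion
# + unit tests + a call to the check() entry point.
check_program = (
    problem["prompt"]
    + completion
    + "\n"
    + problem["test"]
    + "\n"
    + f"check({problem['entry_point']})"
)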

5. Modify evaluate_functional_correctness.py to increase the execution timeout

We want failures to reflect non-functioning code, not network latency or other unrelated timeouts. So open human_eval/evaluate_functional_correctness.py and bump the execution timeout up to 35 seconds.

diff --git a/human_eval/evaluate_functional_correctness.py b/human_eval/evaluate_functional_correctness.py
index 9247a68..ea3f6b6 100644
--- a/human_eval/evaluate_functional_correctness.py
+++ b/human_eval/evaluate_functional_correctness.py
@@ -9,7 +9,7 @@ def entry_point(
     sample_file: str,
     k: str = "1,10,100",
     n_workers: int = 4,
-    timeout: float = 3.0,
+    timeout: float = 35.0,
     problem_file: str = HUMAN_EVAL,
 ):

6. Evaluate the samples from step 3

We're finally ready to see how llama3 70b performed.

python human_eval/evaluate_functional_correctness.py \
samples-llama3-70b.jsonl

If all goes well you should see a progress bar as each sample is evaluated. In our most recent evaluation of llama3 70b it achieved a pass@1 of ~0.44, meaning it generated correct code for roughly 44% of the 164 HumanEval problems on the first try.
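
Beyond the pass@1 number printed at the end, human-eval also writes a per-problem results file next to your samples; in our runs it lands at samples-llama3-70b.jsonl_results.jsonl, with a passed boolean on each record (check your checkout if the name or fields differ). A short sketch for tallying and inspecting failures:

import json

passed, failures = 0, []
with open("samples-llama3-70b.jsonl_results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if record.get("passed"):
            passed += 1
        else:
            failures.append(record["task_id"])

print(f"passed: {passed}, failed: {len(failures)}")
print("first few failures:", failures[:5])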

If you'd like to see us run other LLM codegen evaluations, drop us a message in Discord.