Hey there! We're Riza and we make running untrusted code safe, easy, and even a little bit fun. If you're using an LLM to generate code, we'll run it safely in our cloud or on your infra.
Read any announcement post for a new or updated large language model and you'll find an assortment of benchmark data demonstrating how well the model performs various tasks. One benchmark that we always look for is HumanEval, an "evaluation set ... to measure functional correctness for synthesizing [Python] programs from docstrings" from OpenAI.
It includes 164 problem prompts, each with a Python function definition and a docstring that describes the expected behavior. Each problem also includes a canonical solution and a function to verify correctness, but these are not shared within the prompt for obvious reasons.
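To make that structure concrete, here is a minimal sketch of how a prompt, a completion, and the hidden test could be stitched into one checkable script. The field names follow the published HumanEval dataset (`prompt`, `canonical_solution`, `test`, `entry_point`), but the toy problem and the `build_check_program` helper are made up for illustration and are not our actual harness.

```python
# Toy record shaped like a HumanEval problem (the contents are made up).
problem = {
    "task_id": "Toy/0",
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
    "entry_point": "add",
}


def build_check_program(problem: dict, completion: str) -> str:
    """Join the prompt, a completion, and the hidden test into one script."""
    return (
        problem["prompt"]
        + completion
        + "\n"
        + problem["test"]
        + f"\ncheck({problem['entry_point']})\n"
    )


program = build_check_program(problem, problem["canonical_solution"])
# Only the trusted toy code is executed here; generated code belongs in a sandbox.
exec(program, {})
```

If the completion is wrong, the `check` call raises an `AssertionError`, which is the kind of evaluation output shown for the failures in this post.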
In a previous post, we talked about how to run HumanEval safely. Today we're going to examine cases where the generated programs don't work correctly.
We asked OpenAI's GPT-4o model to generate 10 solutions to each of HumanEval's 164 problems, and overall it performed quite well. The Pass@10 rate was 97.0%. Pass@1 rates for the individual samples ranged from 88% to 94%. OpenAI's own evaluation of GPT-4o reports a score of 90.2% (presumably Pass@1), so these results aren't surprising.
But when you consistently solve more than 90% of problems, success isn't as interesting as failure.
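For readers who want to reproduce numbers like these, a common way to compute Pass@k is the unbiased estimator from the paper that introduced HumanEval. The sketch below assumes that estimator; the post itself doesn't say exactly how the rates above were calculated, and the `pass_at_k` name is just illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n = k = 10, a problem counts as solved if any sample passed.
assert pass_at_k(10, 0, 10) == 0.0
assert pass_at_k(10, 2, 10) == 1.0
# With k = 1 the estimate is the fraction of passing samples.
assert abs(pass_at_k(10, 2, 1) - 0.2) < 1e-12

# 5 of 164 problems never passed, so 159 count as solved at k = 10.
print(f"{159 / 164:.1%}")  # 97.0%
```

Averaging `pass_at_k` over all problems gives the benchmark-level rate.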
Total failures
A Pass@10 of 97.0% still leaves five problems that GPT-4o can't solve when given ten attempts. Let's dig into those.
msg="run check" task=HumanEval/91 pass=false passes=0 failures=10 samples=10
msg="run check" task=HumanEval/115 pass=false passes=0 failures=10 samples=10
msg="run check" task=HumanEval/132 pass=false passes=0 failures=10 samples=10
msg="run check" task=HumanEval/134 pass=false passes=0 failures=10 samples=10
msg="run check" task=HumanEval/145 pass=false passes=0 failures=10 samples=10
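Log lines in this `key=value` shape are easy to filter with a few lines of Python. This is a sketch that assumes the format shown above (space-separated pairs, double quotes around values containing spaces); `parse_line` is a hypothetical helper, not part of any tool mentioned here.

```python
import shlex

log_lines = [
    'msg="run check" task=HumanEval/91 pass=false passes=0 failures=10 samples=10',
    'msg="run check" task=HumanEval/93 pass=true passes=2 failures=8 samples=10',
]


def parse_line(line: str) -> dict:
    """Split a key=value log line into a dict, honoring double-quoted values."""
    return dict(token.split("=", 1) for token in shlex.split(line))


records = [parse_line(line) for line in log_lines]
# A total failure is a task where no sample passed.
total_failures = [r["task"] for r in records if int(r["passes"]) == 0]
print(total_failures)  # ['HumanEval/91']
```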
We'll look at one attempted solution GPT-4o came up with for each of these five problems, and figure out what went wrong.
HumanEval/91
Success rate: 0/10
def is_bored(S):
    """
    You'll be given a string of words, and your task is to count the number
    of boredoms. A boredom is a sentence that starts with the word "I".
    Sentences are delimited by '.', '?' or '!'.
    For example:
    >>> is_bored("Hello world")
    0
    >>> is_bored("The sky is blue. The sun is shining. I love this weather")
    1
    """
    # Split the input string S by '.', '?' or '!' to identify sentences
    sentences = S.split('.') + S.split('?') + S.split('!')
    # Initialize the count of boredom sentences
    boredom_count = 0
    # Iterate through each sentence
    for sentence in sentences:
        # Strip leading and trailing whitespaces from the sentence
        trimmed_sentence = sentence.strip()
        # Check if the sentence starts with 'I'
        if trimmed_sentence.startswith('I '):
            # Increment the boredom count
            boredom_count += 1
    return boredom_count
Evaluation output:
AssertionError: 3 != 1 : "I love It !"
AssertionError: 4 != 2 : "I feel good today. I will be productive. will kill It"
What went wrong?
Sentences can be delimited by periods, question marks, or exclamation marks. The generated code splits by all three and combines the results, resulting in duplicate sentences.
The canonical solution uses the re module, but here's a fix that's more in line with the style of the generated code.
- sentences = S.split('.') + S.split('?') + S.split('!')
+ sentences = S.replace('.', '!').replace('?', '!').split('!')
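To see the double counting in action, here is a small side-by-side sketch of the two splitting strategies. The function names `count_buggy` and `count_fixed` are ours, and the counting logic is condensed from the generated code rather than copied verbatim.

```python
def count_buggy(S):
    # Generated approach: three independent splits, concatenated.
    sentences = S.split('.') + S.split('?') + S.split('!')
    return sum(s.strip().startswith('I ') for s in sentences)


def count_fixed(S):
    # Normalize the delimiters first, then split once.
    sentences = S.replace('.', '!').replace('?', '!').split('!')
    return sum(s.strip().startswith('I ') for s in sentences)


text = "I feel good today. I will be productive. will kill It"
print(count_buggy(text), count_fixed(text))  # 4 2
```

Each split that finds no delimiter returns the whole string as a single "sentence", which is where the extra matches come from.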
HumanEval/115
Success rate: 0/10
def max_fill(grid, capacity):
    import math
    """
    You are given a rectangular grid of wells. Each row represents a single well,
    and each 1 in a row represents a single unit of water.
    Each well has a corresponding bucket that can be used to extract water from it,
    and all buckets have the same capacity.
    Your task is to use the buckets to empty the wells.
    Output the number of times you need to lower the buckets.
    Example 1:
        Input:
            grid : [[0,0,1,0], [0,1,0,0], [1,1,1,1]]
            bucket_capacity : 1
        Output: 6
    Example 2:
        Input:
            grid : [[0,0,1,1], [0,0,0,0], [1,1,1,1], [0,1,1,1]]
            bucket_capacity : 2
        Output: 5
    Example 3:
        Input:
            grid : [[0,0,0], [0,0,0]]
            bucket_capacity : 5
        Output: 0
    Constraints:
        * all wells have the same length
        * 1 <= grid.length <= 10^2
        * 1 <= grid[:,1].length <= 10^2
        * grid[i][j] -> 0 | 1
        * 1 <= capacity <= 10
    """
    # Count total units of water in the grid
    water_units = sum(sum(row) for row in grid)
    # Calculate the number of bucket operations needed
    bucket_operations = math.ceil(water_units / capacity)
    return bucket_operations
What went wrong?
Okay, I'm not going to lie, I had to read the prompt a few times to understand this one. The issue with the generated code is that it treats all the water units together instead of per well. If you have two wells that need to be emptied and a huge bucket, it acts as though a single full bucket can draw from more than one well.
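Here's a sketch of the difference, with the rounding done once for the whole grid versus once per well. The function names and the two-well input are made up for illustration.

```python
import math


def max_fill_total(grid, capacity):
    # Generated approach: pool all the water, then divide by the bucket size.
    return math.ceil(sum(sum(row) for row in grid) / capacity)


def max_fill_per_well(grid, capacity):
    # Per-well approach: round up separately for each row.
    return sum(math.ceil(sum(row) / capacity) for row in grid)


grid = [[1], [1]]  # two wells, one unit each (made-up input)
print(max_fill_total(grid, 2), max_fill_per_well(grid, 2))  # 1 2
```

The two versions happen to agree on all three examples in the docstring, so the examples alone give the model no hint that pooling the water is wrong.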
HumanEval/132
Success rate: 0/10
def is_nested(string):
    '''
    Create a function that takes a string as input which contains only square brackets.
    The function should return True if and only if there is a valid subsequence of brackets
    where at least one bracket in the subsequence is nested.
    is_nested('[[]]') ➞ True
    is_nested('[]]]]]]][[[[[]') ➞ False
    is_nested('[][]') ➞ False
    is_nested('[]') ➞ False
    is_nested('[[][]]') ➞ True
    is_nested('[[]][[') ➞ True
    '''
    stack = []
    for char in string:
        if char == '[':
            stack.append(char)
        elif char == ']':
            if stack:
                stack.pop()
                if stack:  # if there's still an opening bracket in the stack after pop
                    return True
            else:
                return False
    # Check if there was any nesting
    return False
Evaluation output:
assert candidate('[[]') == False
What went wrong?
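The code returns True as soon as a closing bracket leaves an opening bracket on the stack, assuming that the outer group will eventually be closed. In '[[]' the outer bracket never gets its closing bracket, so there is no valid nested subsequence.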
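One way to get this right is to treat the problem as a subsequence search. The sketch below assumes the requirement boils down to `[[]]` appearing as a subsequence of the input, which is our reading of the prompt and is consistent with the docstring examples and the failing assertion; it is not the canonical solution.

```python
def is_nested(string):
    # Consume the input left to right, looking for '[', '[', ']', ']' in order.
    chars = iter(string)
    return all(bracket in chars for bracket in "[[]]")


print(is_nested('[[]'))     # False
print(is_nested('[[]][['))  # True
```

Because `bracket in chars` advances the shared iterator, each bracket has to be found after the previous one, so a single pass is enough.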
HumanEval/134
Success rate: 0/10
def check_if_last_char_is_a_letter(txt):
    '''
    Create a function that returns True if the last character
    of a given string is an alphabetical character and is not
    a part of a word, and False otherwise.
    Note: "word" is a group of characters separated by space.
    Examples:
    check_if_last_char_is_a_letter("apple pie") ➞ False
    check_if_last_char_is_a_letter("apple pi e") ➞ True
    check_if_last_char_is_a_letter("apple pi e ") ➞ False
    check_if_last_char_is_a_letter("") ➞ False
    '''
    # Handle the case when the input string is empty
    if len(txt) == 0:
        return False
    # Strip the text to ignore any trailing spaces for analysis
    stripped_txt = txt.rstrip()
    # In case the string is only spaces (empty post-strip)
    if len(stripped_txt) == 0:
        return False
    # Check if last character is a letter
    if stripped_txt[-1].isalpha():
        # Further check if it is a part of a word; whether followed by a non-letter
        preceding_index = len(stripped_txt) - 2
        if preceding_index < 0 or stripped_txt[preceding_index].isspace():
            return True
        else:
            return False
    return False
Evaluation output:
assert candidate("eeeee e ") == False
What went wrong?
Trailing whitespace is important in this function, so we shouldn't throw it away.
- stripped_txt = txt.rstrip()
+ stripped_txt = txt
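With the `rstrip()` gone, the rest of the logic collapses quite a bit. This is a condensed sketch of the patched function, not the canonical solution.

```python
def check_if_last_char_is_a_letter(txt):
    # No rstrip(): a trailing space means the string does not end in a letter.
    if not txt or not txt[-1].isalpha():
        return False
    # A lone letter, or a letter preceded by whitespace, is not part of a word.
    return len(txt) == 1 or txt[-2].isspace()


print(check_if_last_char_is_a_letter("eeeee e "))    # False
print(check_if_last_char_is_a_letter("apple pi e"))  # True
```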
HumanEval/145
Success rate: 0/10
def order_by_points(nums):
    """
    Write a function which sorts the given list of integers
    in ascending order according to the sum of their digits.
    Note: if there are several items with similar sum of their digits,
    order them based on their index in original list.
    For example:
    >>> order_by_points([1, 11, -1, -11, -12]) == [-1, -11, 1, -12, 11]
    >>> order_by_points([]) == []
    """
    def sum_of_digits(n):
        return sum(int(digit) for digit in str(abs(n)))
    # Sort based on the sum of digits, with stable sorting to maintain original order for ties
    return sorted(nums, key=sum_of_digits)
What went wrong?
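Taking the absolute value of the number before turning it into a string? That's not going to work. The expected output only makes sense if the leading digit of a negative number counts as negative (so -11 sums to -1 + 1 = 0), and abs() throws that sign away.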
The canonical solution looks like this:
    def digits_sum(n):
        neg = 1
        if n < 0: n, neg = -1 * n, -1
        n = [int(i) for i in str(n)]
        n[0] = n[0] * neg
        return sum(n)
    return sorted(nums, key=digits_sum)
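Running both sort keys on the docstring example makes the difference visible. The two helper names below are ours, and `signed_digit_sum` is a restatement of the canonical logic rather than a copy of it.

```python
def abs_digit_sum(n):
    # Generated approach: the sign is discarded.
    return sum(int(d) for d in str(abs(n)))


def signed_digit_sum(n):
    # Canonical approach: the leading digit of a negative number counts as negative.
    digits = [int(d) for d in str(abs(n))]
    if n < 0:
        digits[0] = -digits[0]
    return sum(digits)


nums = [1, 11, -1, -11, -12]
print(sorted(nums, key=abs_digit_sum))     # [1, -1, 11, -11, -12]
print(sorted(nums, key=signed_digit_sum))  # [-1, -11, 1, -12, 11]
```

Only the second ordering matches the example in the prompt.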
Partial failures
GPT-4o also struggled with HumanEval/93, coming up with a correct solution only twice in ten attempts.
msg="run check" task=HumanEval/93 pass=true passes=2 failures=8 samples=10
HumanEval/93
Success rate: 2/10
def encode(message):
    """
    Write a function that takes a message, and encodes in such a
    way that it swaps case of all letters, replaces all vowels in
    the message with the letter that appears 2 places ahead of that
    vowel in the english alphabet.
    Assume only letters.
    Examples:
    >>> encode('test')
    'TGST'
    >>> encode('This is a message')
    'tHKS KS C MGSSCGG'
    """
    vowels = 'aeiouAEIOU'
    replacements = {'a': 'c', 'e': 'g', 'i': 'k', 'o': 'q', 'u': 'w',
                    'A': 'C', 'E': 'G', 'I': 'K', 'O': 'Q', 'U': 'W'}
    encoded_message = ''
    for char in message:
        if char in vowels:
            encoded_message += replacements[char]
        else:
            encoded_message += char.swapcase()
    return encoded_message
Evaluation output:
AssertionError: 'tGst' != 'tgst'
AssertionError: 'mwDcSkR' != 'mWDCSKR'
AssertionError: 'yGs' != 'ygs'
AssertionError: 'tHkS kS c MgSScGg' != 'tHKS KS C MGSSCGG'
AssertionError: 'K dqnT kNQW wHCT TQ wRKTG' != 'k dQnT kNqW wHcT Tq wRkTg'
What went wrong?
This is a logic bug: each character either has its case swapped or has its vowel replaced, never both.
- encoded_message += replacements[char]
+ encoded_message += replacements[char].swapcase()
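Put together, the corrected behavior can be written as a single pass that does both steps for every character. This is a compact sketch of the same idea, not the canonical solution, and it builds the vowel table with `chr` and `ord` instead of spelling it out.

```python
def encode(message):
    # Map each vowel to the letter two places ahead, keeping its case for now.
    shifted = {v: chr(ord(v) + 2) for v in 'aeiouAEIOU'}
    # Replace vowels where needed, then swap case for every character.
    return ''.join(shifted.get(ch, ch).swapcase() for ch in message)


print(encode('test'))               # TGST
print(encode('This is a message'))  # tHKS KS C MGSSCGG
```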
Overall results
Here's a table with the number of problems that failed a given number of times.
Failure count | Number of problems |
---|---|
0 | 125 |
1 | 15 |
2 | 6 |
3 | 8 |
4 | 0 |
5 | 2 |
6 | 3 |
7 | 1 |
8 | 3 |
9 | 1 |
10 | 5 |
Automatic fixes?
Join us next time, when we'll feed these failures back into the model and see how many HumanEval problems it can solve with some additional error context.