What GPT-4o Can't Code

Hey there! We're Riza and we make running untrusted code safe, easy, and even a little bit fun. If you're using an LLM to generate code, we'll run it safely in our cloud or on your infra.

Read any announcement post for a new or updated large language model and you'll find an assortment of benchmark data to demonstrate how well the model performs various tasks. One benchmark that we always look for is HumanEval, an "evaluation set ... to measure functional correctness for synthesizing [Python] programs from docstrings" from OpenAI.

It includes 164 problem prompts, each with a Python function definition and a docstring that describes the expected behavior. Each problem also includes a canonical solution and a function to verify correctness, but these are not shared within the prompt for obvious reasons.

In a previous post, we talked about how to run HumanEval safely. Today we're going to examine cases where the generated programs don't work correctly.

We asked OpenAI's GPT-4o model to generate 10 solutions to each of HumanEval's 164 problems, and overall it performed quite well. The Pass@10 rate was 97.0%. Individual Pass@1 rates for each sample ranged from 88% to 94%. OpenAI's own evaluation of GPT-4o reports a rate of 90.2% (presumably Pass@1), so these results aren't surprising.
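For readers who haven't worked with these metrics, here's a minimal sketch of the pass@k estimator described in the HumanEval paper. We're not claiming this is exactly how the numbers above were computed. The `pass_counts` list is a hypothetical stand-in for per-problem results, built only from the fact that five problems never passed.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem, given n samples of which c passed."""
    if n - c < k:
        # Too few failing samples to fill k draws, so any draw includes a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem pass counts: 159 problems solved at least once,
# 5 problems never solved in 10 attempts
pass_counts = [1] * 159 + [0] * 5

pass_at_10 = sum(pass_at_k(10, c, 10) for c in pass_counts) / len(pass_counts)
print(f"{pass_at_10:.1%}")  # 97.0%
```

With ten samples and k=10, the estimator reduces to "did at least one sample pass?", which is why five unsolved problems out of 164 gives 97.0%.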

But when you consistently solve more than 90% of problems, success isn't as interesting as failure.

Total failures

A Pass@10 of 97.0% still leaves five problems that GPT-4o can't solve when given ten attempts. Let's dig into those.

msg="run check" task=HumanEval/91 pass=false passes=0 failures=10 samples=10
msg="run check" task=HumanEval/115 pass=false passes=0 failures=10 samples=10
msg="run check" task=HumanEval/132 pass=false passes=0 failures=10 samples=10
msg="run check" task=HumanEval/134 pass=false passes=0 failures=10 samples=10
msg="run check" task=HumanEval/145 pass=false passes=0 failures=10 samples=10

We'll look at one attempted solution GPT-4o came up with for each of these five problems, and figure out what went wrong.

HumanEval/91

Success rate: 0/10

def is_bored(S):
    """
    You'll be given a string of words, and your task is to count the number
    of boredoms. A boredom is a sentence that starts with the word "I".
    Sentences are delimited by '.', '?' or '!'.
   
    For example:
    >>> is_bored("Hello world")
    0
    >>> is_bored("The sky is blue. The sun is shining. I love this weather")
    1
    """
    # Split the input string S by '.', '?' or '!' to identify sentences
    sentences = S.split('.') + S.split('?') + S.split('!')
    # Initialize the count of boredom sentences
    boredom_count = 0
    # Iterate through each sentence
    for sentence in sentences:
        # Strip leading and trailing whitespaces from the sentence
        trimmed_sentence = sentence.strip()
        # Check if the sentence starts with 'I'
        if trimmed_sentence.startswith('I '):
            # Increment the boredom count
            boredom_count += 1
    return boredom_count

Evaluation output:

AssertionError: 3 != 1 : "I love It !"
AssertionError: 4 != 2 : "I feel good today. I will be productive. will kill It"

What went wrong?

Sentences can be delimited by periods, question marks, or exclamation marks. The generated code splits by all three and combines the results, resulting in duplicate sentences.

The canonical solution uses the re package, but here's a fix that's more in line with the style of the generated code.

-    sentences = S.split('.') + S.split('?') + S.split('!')
+    sentences = S.replace('.', '!').replace('?', '!').split('!')
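To see where the extra counts come from, here's a small sketch that reproduces the generated approach next to a single regular-expression split. The function names are ours, for illustration only, and the regex version isn't meant to match the canonical solution line for line.

```python
import re

def count_boredoms_concatenated(S):
    # Mirrors the generated code: three independent splits, concatenated,
    # so the same sentence can show up once per delimiter
    sentences = S.split('.') + S.split('?') + S.split('!')
    return sum(1 for s in sentences if s.strip().startswith('I '))

def count_boredoms_regex(S):
    # Split once on any of the three delimiters
    sentences = re.split(r'[.?!]', S)
    return sum(1 for s in sentences if s.strip().startswith('I '))

for text in ["I love It !",
             "I feel good today. I will be productive. will kill It"]:
    print(count_boredoms_concatenated(text), count_boredoms_regex(text))
# 3 1
# 4 2
```

The inflated counts of 3 and 4 line up with the assertion errors in the evaluation output.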

HumanEval/115

Success rate: 0/10

def max_fill(grid, capacity):
    import math
    """
    You are given a rectangular grid of wells. Each row represents a single well,
    and each 1 in a row represents a single unit of water.
    Each well has a corresponding bucket that can be used to extract water from it, 
    and all buckets have the same capacity.
    Your task is to use the buckets to empty the wells.
    Output the number of times you need to lower the buckets.

    Example 1:
        Input: 
            grid : [[0,0,1,0], [0,1,0,0], [1,1,1,1]]
            bucket_capacity : 1
        Output: 6

    Example 2:
        Input: 
            grid : [[0,0,1,1], [0,0,0,0], [1,1,1,1], [0,1,1,1]]
            bucket_capacity : 2
        Output: 5
    
    Example 3:
        Input: 
            grid : [[0,0,0], [0,0,0]]
            bucket_capacity : 5
        Output: 0

    Constraints:
        * all wells have the same length
        * 1 <= grid.length <= 10^2
        * 1 <= grid[:,1].length <= 10^2
        * grid[i][j] -> 0 | 1
        * 1 <= capacity <= 10
    """
    # Count total units of water in the grid
    water_units = sum(sum(row) for row in grid)
    
    # Calculate the number of bucket operations needed
    bucket_operations = math.ceil(water_units / capacity)
    
    return bucket_operations

What went wrong?

Okay, I'm not going to lie, I had to read the prompt a few times to understand this one. The issue with the generated code is that it pools all the water units together instead of counting per well. If you have two wells that need to be emptied and a huge bucket, it acts as though a single bucketful can draw from more than one well.
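Here's a minimal sketch of the difference, assuming the per-well reading described above is the intended one. Both function names are ours, and the two-well grid is a made-up example rather than one of the HumanEval test cases.

```python
import math

def max_fill_total(grid, capacity):
    # Mirrors the generated code: pool all the water, then divide
    return math.ceil(sum(sum(row) for row in grid) / capacity)

def max_fill_per_well(grid, capacity):
    # Round up separately for each well (row), then add the trips together
    return sum(math.ceil(sum(row) / capacity) for row in grid)

# Two wells with one unit of water each, and a bucket that holds two units
grid = [[1, 0], [1, 0]]
print(max_fill_total(grid, 2))     # 1
print(max_fill_per_well(grid, 2))  # 2
```

Both versions happen to agree on all three examples in the docstring (6, 5, and 0), which may be part of why the model never found its way to the per-well version.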

HumanEval/132

Success rate: 0/10

def is_nested(string):
    '''
    Create a function that takes a string as input which contains only square brackets.
    The function should return True if and only if there is a valid subsequence of brackets 
    where at least one bracket in the subsequence is nested.

    is_nested('[[]]') ➞ True
    is_nested('[]]]]]]][[[[[]') ➞ False
    is_nested('[][]') ➞ False
    is_nested('[]') ➞ False
    is_nested('[[][]]') ➞ True
    is_nested('[[]][[') ➞ True
    '''
    stack = []
    for char in string:
        if char == '[':
            stack.append(char)
        elif char == ']':
            if stack:
                stack.pop()
                if stack:  # if there's still an opening bracket in the stack after pop
                    return True
            else:
                return False
    
    # Check if there was any nesting
    return False

Evaluation output:

assert candidate('[[]') == False

What went wrong?

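The code returns True as soon as a closing bracket leaves an opening bracket on the stack, without checking that the outer bracket is ever closed. For '[[]', the inner pair closes while the first '[' is still open, so the function reports nesting even though no closing bracket for the outer pair ever arrives.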
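One way to read the docstring is that the string must contain '[[]]' as a subsequence: two opening brackets followed, somewhere later, by two closing brackets. Here's a sketch built on that assumption. It isn't the canonical solution, and we've only checked it against the docstring examples and the failing assertion above.

```python
def is_nested(string):
    # Greedily match "[[]]" as a subsequence: characters must appear in
    # order, but other brackets may sit in between
    pattern = "[[]]"
    matched = 0
    for char in string:
        if char == pattern[matched]:
            matched += 1
            if matched == len(pattern):
                return True
    return False

print(is_nested('[[]'))     # False
print(is_nested('[[]][['))  # True
```

Because the outer closing bracket has to actually appear, '[[]' no longer counts as nested.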

HumanEval/134

Success rate: 0/10

def check_if_last_char_is_a_letter(txt):
    '''
    Create a function that returns True if the last character
    of a given string is an alphabetical character and is not
    a part of a word, and False otherwise.
    Note: "word" is a group of characters separated by space.

    Examples:
    check_if_last_char_is_a_letter("apple pie") ➞ False
    check_if_last_char_is_a_letter("apple pi e") ➞ True
    check_if_last_char_is_a_letter("apple pi e ") ➞ False
    check_if_last_char_is_a_letter("") ➞ False 
    '''
    # Handle the case when the input string is empty
    if len(txt) == 0:
        return False
    
    # Strip the text to ignore any trailing spaces for analysis
    stripped_txt = txt.rstrip()

    # In case the string is only spaces (empty post-strip)
    if len(stripped_txt) == 0:
        return False

    # Check if last character is a letter
    if stripped_txt[-1].isalpha():
        # Further check if it is a part of a word; whether followed by a non-letter
        preceding_index = len(stripped_txt) - 2
        if preceding_index < 0 or stripped_txt[preceding_index].isspace():
            return True
        else:
            return False
    
    return False

Evaluation output:

assert candidate("eeeee e ") == False

What went wrong?

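Trailing whitespace is significant in this function: "eeeee e " ends in a space, so the answer should be False. Stripping it first makes the string look like it ends in a standalone letter, so we shouldn't throw it away.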

-    stripped_txt = txt.rstrip()
+    stripped_txt = txt
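For comparison, here's a more compact sketch of the same idea. It assumes that "not a part of a word" means the final space-separated token is exactly one letter, and it's our own illustration rather than the canonical solution.

```python
def check_if_last_char_is_a_letter(txt):
    # Splitting on a single space means a trailing space produces an
    # empty final token, which correctly fails the check
    last_token = txt.split(' ')[-1]
    return len(last_token) == 1 and last_token.isalpha()

print(check_if_last_char_is_a_letter("apple pi e"))  # True
print(check_if_last_char_is_a_letter("eeeee e "))    # False
```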

HumanEval/145

Success rate: 0/10

def order_by_points(nums):
    """
    Write a function which sorts the given list of integers
    in ascending order according to the sum of their digits.
    Note: if there are several items with similar sum of their digits,
    order them based on their index in original list.

    For example:
    >>> order_by_points([1, 11, -1, -11, -12]) == [-1, -11, 1, -12, 11]
    >>> order_by_points([]) == []
    """
    def sum_of_digits(n):
        return sum(int(digit) for digit in str(abs(n)))

    # Sort based on the sum of digits, with stable sorting to maintain original order for ties
    return sorted(nums, key=sum_of_digits)

What went wrong?

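Taking the absolute value of the number before turning it into a string? That's not going to work. The expected output treats the leading digit of a negative number as negative, so -11 sums to -1 + 1 = 0, not 2. Taking the absolute value throws that sign away.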

The canonical solution looks like this:

    def digits_sum(n):
        neg = 1
        if n < 0: n, neg = -1 * n, -1 
        n = [int(i) for i in str(n)]
        n[0] = n[0] * neg
        return sum(n)
    return sorted(nums, key=digits_sum)
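To make the difference concrete, here's a short sketch that runs both sort keys on the docstring example. The helper names are ours, and `signed_digit_sum` is just a restatement of the signed-leading-digit rule, not a copy of the canonical code.

```python
def abs_digit_sum(n):
    # Mirrors the generated code: the sign is discarded entirely
    return sum(int(d) for d in str(abs(n)))

def signed_digit_sum(n):
    # The leading digit keeps the sign of the number
    digits = [int(d) for d in str(abs(n))]
    if n < 0:
        digits[0] = -digits[0]
    return sum(digits)

nums = [1, 11, -1, -11, -12]
print(sorted(nums, key=abs_digit_sum))     # [1, -1, 11, -11, -12]
print(sorted(nums, key=signed_digit_sum))  # [-1, -11, 1, -12, 11]
```

Only the second ordering matches the docstring.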

Partial failures

GPT-4o also struggled with HumanEval/93, only coming up with a correct solution twice.

HumanEval/93

msg="run check" task=HumanEval/93 pass=true passes=2 failures=8 samples=10

Success rate: 2/10

def encode(message):
    """
    Write a function that takes a message, and encodes in such a
    way that it swaps case of all letters, replaces all vowels in
    the message with the letter that appears 2 places ahead of that
    vowel in the english alphabet.
    Assume only letters.

    Examples:
    >>> encode('test')
    'TGST'
    >>> encode('This is a message')
    'tHKS KS C MGSSCGG'
    """
    vowels = 'aeiouAEIOU'
    replacements = {'a': 'c', 'e': 'g', 'i': 'k', 'o': 'q', 'u': 'w',
                     'A': 'C', 'E': 'G', 'I': 'K', 'O': 'Q', 'U': 'W'}

    encoded_message = ''
    for char in message:
        if char in vowels:
            encoded_message += replacements[char]
        else:
            encoded_message += char.swapcase()

    return encoded_message

Evaluation output:

AssertionError: 'tGst' != 'tgst'
AssertionError: 'mwDcSkR' != 'mWDCSKR'
AssertionError: 'yGs' != 'ygs'
AssertionError: 'tHkS kS c MgSScGg' != 'tHKS KS C MGSSCGG'
AssertionError: 'K dqnT kNQW wHCT TQ wRKTG' != 'k dQnT kNqW wHcT Tq wRkTg'

What went wrong?

This is a logic bug: the code either swaps the case or replaces the vowel, never both, so replaced vowels keep their original case.

-            encoded_message += replacements[char]
+            encoded_message += replacements[char].swapcase()
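Another way to avoid the either/or branch is to do the two steps in sequence: swap the case of the whole message, then shift the vowels. Here's a sketch of that approach using `str.translate`. It's our own variation, not the canonical solution.

```python
def encode(message):
    # Map each vowel to the letter two places later, keeping its case
    shift_vowels = str.maketrans({v: chr(ord(v) + 2) for v in 'aeiouAEIOU'})
    # Swap case first, then replace vowels, so every letter gets both steps
    return message.swapcase().translate(shift_vowels)

print(encode('test'))               # TGST
print(encode('This is a message'))  # tHKS KS C MGSSCGG
```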

Overall results

Here's a table with the number of problems that failed a given number of times.

Failure count	Number of problems
0	125
1	15
2	6
3	8
4	0
5	2
6	3
7	1
8	3
9	1
10	5

Automatic fixes?

Join us next time, when we'll feed these failures back into the model and see how many HumanEval problems it can solve with some additional error context.

AI writes code. Riza runs it.
