Evaluating LLMs on COBOL

Gabriel Gordon-Hall

March 30th, 2024

How well can LLMs write COBOL?

LLMs are fast changing the way we write software. Over a million developers now pay for GitHub Copilot, and recent breakthroughs in LLM reasoning have brought the dream of a fully autonomous AI software engineer closer to reality. But while it’s not hard to find a demo of an LLM coding a website or a clone of Flappy Bird, little is known about their ability to write code in older ‘legacy’ languages like COBOL.

The opportunity for LLM COBOL generation is huge. Although the language was first released in 1959, it continues to power critical systems - 95% of US ATM transactions are processed in COBOL. But it’s not taught in computer science courses or bootcamps, and the engineers who write it professionally are steadily retiring. If LLMs could understand and write COBOL, they could help maintain the 800 billion lines still in production today.

So, how well can LLMs write COBOL? As far as we know, nobody has publicly tried to answer this question. Until now…

Introducing COBOLEval

Today we’re releasing COBOLEval, the first evaluation benchmark for LLM code completions in COBOL. It consists of 146 challenging coding problems that have been converted into COBOL from the widely-used HumanEval Python generation benchmark. Each problem is paired with an average of 6 test cases. An LLM-generated solution has to pass all of them to be correct. We’re also releasing a test harness that you can use to evaluate your own models, as well as mAInframer-1 - a series of open-source models based on CodeLlama that we’ve fine-tuned specifically to write COBOL - which outperform GPT-4.

You can get started with COBOLEval here: https://github.com/BloopAI/cobolEval

From HumanEval to COBOLEval

Functions

Converting HumanEval to COBOL isn’t as straightforward as it sounds. Each HumanEval problem consists of a prompt - a typed Python function signature and docstring - that is passed directly to an LLM, which then implements the body of the function.

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

But this immediately poses a problem. COBOL is a procedural programming language; it doesn’t have functions.

It does, however, have subprograms. We therefore transform each problem into a COBOL subprogram whose arguments and return variables are defined in the LINKAGE SECTION, so that they can be passed in and read back by a calling program.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. HAS-CLOSE-ELEMENTS.

       ENVIRONMENT DIVISION.

       INPUT-OUTPUT SECTION.

       DATA DIVISION.

       LINKAGE SECTION.

       01 LINKED-ITEMS.
           05 L-NUMBERS OCCURS 100 TIMES INDEXED BY NI COMP-2.
           05 L-THRESHOLD COMP-2.
           05 RESULT PIC 9.

      * Check if in given list of numbers, are any two numbers closer to each other than
      * given threshold.
      * >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
      * False
      * >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
      * True
      *

      * Complete the WORKING-STORAGE SECTION and the PROCEDURE DIVISION
      * Store the result in the RESULT variable and mark the end of your program with END PROGRAM

       WORKING-STORAGE SECTION.

This is the format of COBOLEval prompts that are passed to the LLM.
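
To make this concrete, here is a minimal sketch of how you might feed COBOLEval prompts to a model of your choice. The file name, field names and generate stub below are assumptions for illustration only - the repo’s evaluation harness handles this for you.

import json

def load_problems(path: str = "cobol_eval.jsonl") -> list[dict]:
    """Read one problem per line. The file name and field names are assumptions."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def generate(prompt: str) -> str:
    """Stand-in for your LLM call (OpenAI API, a local CodeLlama, etc.)."""
    raise NotImplementedError

if __name__ == "__main__":
    for problem in load_problems():
        completion = generate(problem["prompt"])
        # Save prompt + completion as a full subprogram ready to be compiled and called.
        out_path = problem["task_id"].replace("/", "_") + ".cbl"
        with open(out_path, "w") as out:
            out.write(problem["prompt"] + completion)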

Types

COBOL’s type system is radically different from that of modern programming languages. Variables are declared with PICTURE clauses (PIC for short), which specify the number of characters they occupy in memory. For example, PIC X(100) is a string of 100 characters, while PIC S9(10) is a 10-digit signed integer. Unlike Python, COBOL doesn’t have variable-length strings, integers or arrays, so we fix these lengths to upper bounds (no COBOLEval problem accepts or returns an array with a length greater than 100).

This gives us our mapping from Python to COBOL types:

int => PIC S9(10)
float => COMP-2
str => PIC X(100)
List[int] => OCCURS 100 TIMES INDEXED BY I PIC S9(10)
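
As an illustration, this mapping can be applied mechanically when emitting LINKAGE SECTION entries. The helper below is a hypothetical sketch (the names and exact clause ordering are ours, not the COBOLEval repo’s):

# Hypothetical sketch of turning Python annotations into LINKAGE SECTION items.
PY_TO_COBOL = {
    "int": "PIC S9(10)",
    "float": "COMP-2",
    "str": "PIC X(100)",
    "List[int]": "OCCURS 100 TIMES INDEXED BY I PIC S9(10)",
}

def linkage_entry(arg_name: str, py_type: str) -> str:
    """Render one level-05 LINKAGE SECTION item for a converted argument.
    (A real program needs a unique index name for each OCCURS table.)"""
    cobol_name = "L-" + arg_name.upper().replace("_", "-")
    return f"           05 {cobol_name} {PY_TO_COBOL[py_type]}."

print(linkage_entry("threshold", "float"))
# ->            05 L-THRESHOLD COMP-2.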

Note that some Python types, like Any or Dict, can’t easily be represented in COBOL. We ignore HumanEval problems that accept or return them. Luckily there aren’t many of them.

Working-Storage Section

The veteran COBOL programmers among you will have spotted a bug in the prompt program above. The LINKAGE SECTION and WORKING-STORAGE SECTION are the wrong way round. If we tried to compile it, we’d get an error.

Why are they in the wrong order? The problem is that - unlike modern languages (you’ll be used to hearing that by now) - COBOL does not have local variables. All variables - even temporary ones - used in the program logic have to be declared ahead of time in the WORKING-STORAGE SECTION.

Clearly, the LLM needs to generate the WORKING-STORAGE SECTION so that it can declare the variables it will use in its solution. But at the same time, it needs to know which variables have already been declared in the LINKAGE SECTION. COBOL’s strict structure prevents an LLM from generating solutions in neat left-to-right order.

We offer two possible approaches. One is illustrated above: we swap the order of the sections in the prompt and have the model generate the WORKING-STORAGE SECTION and the PROCEDURE DIVISION one after the other. We then reinsert the generated WORKING-STORAGE SECTION into the program at its correct position. This approach is simple, but it requires the LLM to generalise beyond its training data (it won’t have seen many programs where the sections are out of order).
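
A rough sketch of that reinsertion step, assuming each section header appears exactly once in the completed program (the helper is ours, not part of the COBOLEval harness):

def reorder_sections(program: str) -> str:
    """Move the model-generated WORKING-STORAGE SECTION back to its legal
    position, i.e. before the LINKAGE SECTION in the DATA DIVISION."""
    ws_start = program.index("WORKING-STORAGE SECTION.")
    proc_start = program.index("PROCEDURE DIVISION")
    working_storage = program[ws_start:proc_start]
    remainder = program[:ws_start] + program[proc_start:]
    link_start = remainder.index("LINKAGE SECTION.")
    return remainder[:link_start] + working_storage + remainder[link_start:]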

The other approach is to use a technique called infilling. Here, we decompose the program into a prefix, middle, and suffix (delimited by the special tokens <PRE>, <MID> and <SUF>) and generate a completion in the order: prefix, suffix, middle. This lets us fill in code in the middle of a program.

If we wanted to fill in the third line of this Python function

def factorial(n):
  if n > 1:
    return n * factorial(n - 1)
  elif n == 1:
    return 1

Our infilling prompt would look like this (everything up to and including <MID> is the prompt; the text after <MID> is the completion):

<PRE>def factorial(n):
  if n > 1:
<SUF>  elif n == 1:
    return 1
<MID>    return n * factorial(n - 1)

Returning to COBOL, we can generate both the WORKING-STORAGE SECTION and the PROCEDURE DIVISION by prompting the LLM like this:

<PRE>
       IDENTIFICATION DIVISION.
       PROGRAM-ID.  SUM-OF-CUBES.
       ENVIRONMENT DIVISION.
       
       INPUT-OUTPUT SECTION.

       DATA DIVISION.
<SUF>
       LINKAGE SECTION.

       01 LINKED-ITEMS.
           05 L-MAX-STEP PIC S9(10).
           05 RESULT PIC S9(10).

      * 
      * Given an integer number, return the sum of the cubes of all the integers below it.
      * 
      * Example:
      * 
      * sum_of_cubes(3) == 1**3 + 2**3 == 9
      * sum_of_cubes(5) == 100
      *  

      * Store the result in the RESULT variable and mark the end of your program with END PROGRAM

       PROCEDURE DIVISION USING LINKED-ITEMS.
       
           PERFORM VARYING STEP FROM 0 BY 1 
               UNTIL STEP IS EQUAL TO L-MAX-STEP
               COMPUTE CUBE = STEP ** 3
               ADD CUBE TO CUBE-SUM
           END-PERFORM
           .
           MOVE CUBE-SUM TO RESULT.
           DISPLAY 'THE SUM OF THE CUBES IS ' RESULT.
           GOBACK.

       END PROGRAM SUM-OF-CUBES.
<MID>
       WORKING-STORAGE SECTION.
       
       01 STEP         PIC S9(10).
       01 CUBE         PIC 9(7).
       01 CUBE-SUM     PIC 9(7) VALUE 0.

We cut the WORKING-STORAGE SECTION out of the program, so the LLM first generates the solution logic in the PROCEDURE DIVISION and the special <MID> token, then declares the variables it used by generating the WORKING-STORAGE SECTION.

Note that this works best with models that have been trained to support infilling (e.g. CodeLlama).
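
Assembling such a prompt is just string formatting. The sketch below follows the <PRE>/<SUF>/<MID> convention shown above; the exact sentinel tokens and spacing vary between models, so check your model’s documentation before reusing it:

def fim_prompt(prefix: str, suffix: str) -> str:
    """Build a fill-in-the-middle prompt: the model sees the prefix and the
    suffix, then generates the middle after the <MID> token."""
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

prefix = "def factorial(n):\n  if n > 1:\n"
suffix = "  elif n == 1:\n    return 1\n"
print(fim_prompt(prefix, suffix))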

Putting it all together

Each HumanEval problem is accompanied by a set of test cases.

def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False

We generate a COBOLEval calling program for each test case. The calling program defines the test data and passes it to the LLM-generated solution. It then writes the RESULT variable to a file.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. CAR-RACE-COLLISION-CALL.

       ENVIRONMENT DIVISION.

       INPUT-OUTPUT SECTION.

       FILE-CONTROL.

       SELECT OUTPUT-FILE ASSIGN TO "CAR-RACE-COLLISION.TXT"
           ORGANIZATION IS LINE SEQUENTIAL
           STATUS IS OUTPUT-FILE-STATUS.

       DATA DIVISION.

       FILE SECTION.
       FD OUTPUT-FILE.
       01 OUTPUT-RECORD PIC S9(10) SIGN LEADING.

       WORKING-STORAGE SECTION.

       01 OUTPUT-FILE-STATUS PIC X(02).

       01 LINKED-ITEMS.
           05 L-N PIC S9(10).
           05 RESULT PIC S9(10).

       PROCEDURE DIVISION.

       MOVE 10 TO L-N

       CALL "CAR-RACE-COLLISION" USING LINKED-ITEMS

       OPEN OUTPUT OUTPUT-FILE

       IF OUTPUT-FILE-STATUS NOT = "00"
           DISPLAY "ERROR OPENING OUTPUT FILE"
           STOP RUN
       END-IF

       MOVE RESULT TO OUTPUT-RECORD
       WRITE OUTPUT-RECORD

       IF OUTPUT-FILE-STATUS NOT = "00"
           DISPLAY "ERROR WRITING TO OUTPUT FILE"
           STOP RUN
       END-IF

       CLOSE OUTPUT-FILE

       STOP RUN.

Solution outputs are compared to the true values and a score is calculated. The COBOLEval repo includes an evaluation harness that fully automates this process. Note: COBOLEval uses the open-source GnuCOBOL compiler.
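
The scoring step itself is simple. A rough sketch, assuming each test case’s expected value is known and the solution’s output file has been read back (the real harness also has to handle compile failures, parse PIC-formatted output and time out hanging programs):

from pathlib import Path

def problem_passes(expected_outputs: list[str], output_files: list[str]) -> bool:
    """A solution only counts as correct if every test case's RESULT, as written
    to its output file, matches the expected value."""
    for expected, path in zip(expected_outputs, output_files):
        actual = Path(path).read_text().strip()
        if actual != expected:
            return False
    return True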

So how good are they?

We have a benchmark, so how well can state-of-the-art LLMs write COBOL? We calculated pass@1 scores (with temperature = 0) for some widely used models.
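
With temperature 0 we draw a single completion per problem, so pass@1 reduces to the percentage of problems whose one greedy completion passes all of its tests:

def pass_at_1(passed: list[bool]) -> float:
    """passed[i] is True if problem i's single completion passed every test case."""
    return 100 * sum(passed) / len(passed)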

Model            Pass@1    % Compile
GPT-3.5 Turbo     4.11      19.17
GPT-4             8.90      47.94
CodeLlama-7b      0.68      25.34
CodeLlama-13b     1.36      13.01
CodeLlama-34b     2.05      78.76

GPT-4 - the best-performing model - generates a correct solution for just 8.90% of problems. Compare this to HumanEval, where it solves 67% of problems. CodeLlama, one of the best open-source coding models, fares even worse, with the 34b variant clocking in at just 2.05%. COBOLEval is hard.

Looking at the failure cases, we can see that state-of-the-art LLMs struggle to generate COBOL that even compiles. Only 47.94% of GPT-4 generated solutions compile with GnuCOBOL.

          PERFORM VARYING I FROM Y BY -1 UNTIL I < X
               IF I MOD 2 = 0 THEN
                   MOVE I TO MAX-EVEN
                   EXIT PERFORM
               END-IF
           END-PERFORM

Here GPT-4 has tried to use the MOD intrinsic function without preceding it with the FUNCTION keyword - GnuCOBOL expects FUNCTION MOD(I, 2).

mAInframer-1

We’re also releasing mAInframer-1, a series of models that we’ve fine-tuned to write COBOL. You can download them here: https://huggingface.co/bloopai

Model              Pass@1    % Compile
mAInframer-7b       6.16      69.17
mAInframer-13b      8.9       54.1
mAInframer-34b     10.27      73.97

All three mAInframer models considerably outperform the CodeLlama models they were fine-tuned from, and the 34b model achieves a higher pass@1 than GPT-4! More details about how we did it are coming soon 😀

What’s next?

There’s clearly a lot of room to improve LLM-generated COBOL. We hope that the community can use COBOLEval to track the performance of the latest models and build LLMs that help maintain the world’s supply of critical COBOL code.