How well can LLMs write COBOL?
LLMs are rapidly changing the way we write software. Over a million developers now pay for GitHub Copilot, and recent breakthroughs in LLM reasoning have brought the dream of a fully autonomous AI software engineer closer to reality. But while it's not hard to find a demo of an LLM coding a website or a clone of Flappy Bird, little is known about their ability to write code in older 'legacy' languages like COBOL.
The opportunity for LLM COBOL generation is huge. Although the language was first released in 1959, it continues to power critical systems: 95% of US ATM transactions are processed in COBOL. But it's not taught in computer science courses or bootcamps, and the engineers who write it professionally are steadily retiring. If LLMs could understand and write COBOL, they could help maintain the 800 billion lines still in production today.
So, how well can LLMs write COBOL? As far as we know, nobody has publicly tried to answer this question. Until now…
Introducing COBOLEval
Today we're releasing COBOLEval, the first evaluation benchmark for LLM code completions in COBOL. It consists of 146 challenging coding problems converted into COBOL from the widely used HumanEval Python generation benchmark. Each problem is paired with an average of 6 test cases, and an LLM-generated solution has to pass all of them to be correct. We're also releasing a test harness that you can use to evaluate your own models, as well as mAInframer-1, a series of open-source models based on CodeLlama that we've fine-tuned specifically to write COBOL, the largest of which outperforms GPT-4.
You can get started with COBOLEval here: https://github.com/BloopAI/cobolEval
From HumanEval to COBOLEval
Functions
Converting HumanEval to COBOL isn't as straightforward as it sounds. Each HumanEval problem consists of a prompt (a typed Python function signature plus a docstring) that is passed directly to an LLM, which then implements the body of the function.
from typing import List
def has_close_elements(numbers: List[float], threshold: float) -> bool:
""" Check if in given list of numbers, are any two numbers closer to each other than
given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True
"""
But this immediately poses a problem. COBOL is a procedural programming language; it doesn’t have functions.
It does, however, have subprograms. So we transform each problem into a COBOL program where arguments and return variables are defined in the LINKAGE SECTION, so they can be passed in and read back by a calling program.
IDENTIFICATION DIVISION.
PROGRAM-ID. HAS-CLOSE-ELEMENTS.
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
DATA DIVISION.
LINKAGE SECTION.
01 LINKED-ITEMS.
05 L-NUMBERS OCCURS 100 TIMES INDEXED BY NI COMP-2.
05 L-THRESHOLD COMP-2.
05 RESULT PIC 9.
* Check if in given list of numbers, are any two numbers closer to each other than
* given threshold.
* >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
* False
* >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
* True
*
* Complete the WORKING-STORAGE SECTION and the PROCEDURE DIVISION
* Store the result in the RESULT variable and mark the end of your program with END PROGRAM
WORKING-STORAGE SECTION.
This is the format of COBOLEval prompts that are passed to the LLM.
Types
COBOL's type system is radically different from that of modern programming languages. Variables are declared with PICTURE clauses (PIC for short), which specify exactly how much memory they occupy. For example, PIC X(100) is a string of 100 characters, while PIC S9(10) is a 10-digit signed integer. Unlike Python, COBOL doesn't have variable-length strings, integers or arrays, so we fix these lengths to upper bounds (no COBOLEval problem accepts or returns an array with a length greater than 100).
This gives us our mapping from Python to COBOL types:
Int => PIC S9(10)
Float => COMP-2
Str => PIC X(100)
List[Int] => OCCURS 100 TIMES INDEXED BY I PIC S9(10)
Note that some Python types, like Any or Dict, can't easily be represented in COBOL. We ignore HumanEval problems that accept or return them. Luckily there aren't many of them.
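To make the conversion concrete, here is a minimal sketch of that mapping in Python; the dictionary, helper name and exact clause strings are illustrative, not the actual COBOLEval code:

# Illustrative mapping from HumanEval's Python type annotations to COBOL
# declarations. Hypothetical names; not the COBOLEval implementation.
PY_TO_COBOL = {
    "int": "PIC S9(10)",
    "float": "COMP-2",
    "str": "PIC X(100)",
    "bool": "PIC 9",  # as in the RESULT field of the prompt above
    "List[int]": "OCCURS 100 TIMES INDEXED BY I PIC S9(10)",
}

def to_linkage_entry(name: str, py_type: str, level: str = "05") -> str:
    """Render a LINKAGE SECTION entry such as '05 L-THRESHOLD COMP-2.'"""
    clause = PY_TO_COBOL.get(py_type)
    if clause is None:
        # Types like Any or Dict have no easy COBOL equivalent, so the
        # corresponding HumanEval problems are skipped.
        raise ValueError(f"no COBOL mapping for {py_type}")
    return f"{level} L-{name.upper().replace('_', '-')} {clause}."

print(to_linkage_entry("threshold", "float"))  # 05 L-THRESHOLD COMP-2.
print(to_linkage_entry("max_step", "int"))     # 05 L-MAX-STEP PIC S9(10).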
Working-Storage Section
The veteran COBOL programmers among you will have spotted a bug in the prompt program above: the LINKAGE SECTION and WORKING-STORAGE SECTION are the wrong way round. If we tried to compile it, we'd get an error.
Why are they in the wrong order? The problem is that, unlike modern languages (you'll be used to hearing that by now), COBOL does not have local variables. All variables used in the program logic, even temporary ones, have to be declared ahead of time in the WORKING-STORAGE SECTION.
Clearly, the LLM needs to generate the WORKING-STORAGE SECTION so that it can declare the variables it will use in its solution. But at the same time, it needs to know which variables have already been declared in the LINKAGE SECTION. COBOL's strict structure prevents an LLM from generating solutions in neat left-to-right order.
We offer two possible approaches. One is illustrated above: we swap the order of the sections in the prompt and have the model generate the WORKING-STORAGE SECTION and the PROCEDURE DIVISION one after the other. We then reinsert the generated WORKING-STORAGE SECTION into the program at the correct position. This approach is simple, but it requires the LLM to generalise beyond its training data (it won't have seen many programs where the sections are out of order).
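A minimal sketch of that reinsertion step in Python (the function and the string handling are illustrative assumptions, not the actual COBOLEval harness code):

# Hypothetical post-processing for the "swapped sections" approach: the prompt
# ends with a trailing WORKING-STORAGE SECTION. header, the model completes it
# and then writes the PROCEDURE DIVISION, and we splice the generated section
# back in before the LINKAGE SECTION, where COBOL expects it.
def reassemble(prompt: str, completion: str) -> str:
    generated = "WORKING-STORAGE SECTION." + completion
    storage, _, procedure = generated.partition("PROCEDURE DIVISION")
    header, _, linkage = prompt.partition("LINKAGE SECTION.")
    # Drop the out-of-place trailing header from the prompt.
    linkage = linkage.replace("WORKING-STORAGE SECTION.", "").rstrip()
    return (
        header
        + storage.rstrip() + "\n"
        + "LINKAGE SECTION." + linkage + "\n"
        + "PROCEDURE DIVISION" + procedure
    )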
The other approach is to use a technique called infilling. Here, we decompose the program into prefix, middle, and suffix (delimited by the special tokens <PRE>, <MID> and <SUF>) and generate a completion in the order prefix, suffix, middle. This allows us to fill in code in the middle of a program.
Say we wanted to fill in the third line of this Python function:
def factorial(n):
    if n > 1:
        return n * factorial(n - 1)
    elif n == 1:
        return 1
Our infilling prompt would look like this, with the completion following the <MID> token:
<PRE>def factorial(n):
    if n > 1:
<SUF>    elif n == 1:
        return 1
<MID>        return n * factorial(n - 1)
Back to COBOL: we can generate both the WORKING-STORAGE SECTION and the PROCEDURE DIVISION by prompting the LLM like this:
<PRE>
IDENTIFICATION DIVISION.
PROGRAM-ID. SUM-OF-CUBES.
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
DATA DIVISION.
<SUF>
LINKAGE SECTION.
01 LINKED-ITEMS.
05 L-MAX-STEP PIC S9(10).
05 RESULT PIC S9(10).
*
* Given an integer number, return the sum of the cubes of all the integers below it.
*
* Example:
*
* sum_of_cubes(3) == 1**3 + 2**3 == 9
* sum_of_cubes(5) == 100
*
* Store the result in the RESULT variable and mark the end of your program with END PROGRAM
PROCEDURE DIVISION USING LINKED-ITEMS.
PERFORM VARYING STEP FROM 0 BY 1
UNTIL STEP IS EQUAL TO L-MAX-STEP
COMPUTE CUBE = STEP ** 3
ADD CUBE TO CUBE-SUM
END-PERFORM
.
MOVE CUBE-SUM TO RESULT.
DISPLAY 'THE SUM OF THE CUBES IS ' RESULT.
GOBACK.
END PROGRAM SUM-OF-CUBES.
<MID>
WORKING-STORAGE SECTION.
01 STEP PIC S9(10).
01 CUBE PIC 9(7).
01 CUBE-SUM PIC 9(7) VALUE 0.
We cut the WORKING-STORAGE SECTION out of the program, so the LLM first completes the suffix by generating the solution logic in the PROCEDURE DIVISION, then emits the special <MID> token, and finally declares the variables it used by generating the WORKING-STORAGE SECTION.
Note that this works best with models that have been trained to support infilling (e.g. CodeLlama).
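For a model with CodeLlama-style infill support, the prompt can be assembled roughly as follows. This is a sketch, not the COBOLEval code; the literal <PRE>/<SUF> strings stand in for the model's FIM special tokens, which in practice are usually added through the tokenizer.

# Build an infilling prompt from a COBOLEval prompt program. The <MID> token is
# deliberately left off: the model keeps writing the suffix (the PROCEDURE
# DIVISION), emits <MID> itself, and then fills in the WORKING-STORAGE SECTION.
def build_infill_prompt(cobol_prompt: str) -> str:
    prefix, _, rest = cobol_prompt.partition("LINKAGE SECTION.")
    # The COBOLEval prompt ends with a WORKING-STORAGE SECTION. header, which is
    # exactly the part we want infilled, so cut it out of the suffix here.
    suffix = ("LINKAGE SECTION." + rest).replace("WORKING-STORAGE SECTION.", "").rstrip()
    return "<PRE>" + prefix + "<SUF>" + suffix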
Putting it all together
Each HumanEval problem is accompanied by a set of test cases.
def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False
We generate a COBOLEval calling program for each test case. The calling program defines the test data and passes it to the LLM-generated solution, then writes the RESULT variable to a file. Here, for example, is a calling program for the CAR-RACE-COLLISION problem:
IDENTIFICATION DIVISION.
PROGRAM-ID. CAR-RACE-COLLISION-CALL.
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
SELECT OUTPUT-FILE ASSIGN TO "CAR-RACE-COLLISION.TXT"
ORGANIZATION IS LINE SEQUENTIAL
STATUS IS OUTPUT-FILE-STATUS.
DATA DIVISION.
FILE SECTION.
FD OUTPUT-FILE.
01 OUTPUT-RECORD PIC S9(10) SIGN LEADING.
WORKING-STORAGE SECTION.
01 OUTPUT-FILE-STATUS PIC X(02).
01 LINKED-ITEMS.
05 L-N PIC S9(10).
05 RESULT PIC S9(10).
PROCEDURE DIVISION.
MOVE 10 TO L-N
CALL "CAR-RACE-COLLISION" USING LINKED-ITEMS
OPEN OUTPUT OUTPUT-FILE
IF OUTPUT-FILE-STATUS NOT = "00"
DISPLAY "ERROR OPENING OUTPUT FILE"
STOP RUN
END-IF
MOVE RESULT TO OUTPUT-RECORD
WRITE OUTPUT-RECORD
IF OUTPUT-FILE-STATUS NOT = "00"
DISPLAY "ERROR WRITING TO OUTPUT FILE"
STOP RUN
END-IF
CLOSE OUTPUT-FILE
STOP RUN.
Solution outputs are compared to the true values and a score is calculated. The COBOLEval repo includes an evaluation harness that fully automates this process. Note: COBOLEval uses the open-source GnuCOBOL compiler.
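Roughly, checking one test case means compiling the calling program together with the candidate subprogram, running the executable, and comparing the file it writes against the expected value. Here is a sketch of that step using GnuCOBOL's cobc; the paths, output file name and flags are assumptions for illustration, not the harness's exact code.

import subprocess
import tempfile
from pathlib import Path

def run_test_case(caller_cob: Path, solution_cob: Path, out_name: str, expected: str) -> bool:
    workdir = Path(tempfile.mkdtemp())
    exe = workdir / "testcase"
    # cobc -x links the calling program and the LLM-generated subprogram into
    # a single executable (-free assumes free source format).
    compile_proc = subprocess.run(
        ["cobc", "-x", "-free", "-o", str(exe), str(caller_cob), str(solution_cob)],
        capture_output=True,
    )
    if compile_proc.returncode != 0:
        return False  # feeds the "% Compile" column
    try:
        run_proc = subprocess.run([str(exe)], cwd=workdir, capture_output=True, timeout=10)
    except subprocess.TimeoutExpired:
        return False
    if run_proc.returncode != 0:
        return False
    # The calling program writes RESULT to a file, e.g. CAR-RACE-COLLISION.TXT.
    out_file = workdir / out_name
    return out_file.exists() and out_file.read_text().strip() == expected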
So how good are they?
We have a benchmark, so how well can state-of-the-art LLMs write COBOL? We calculated pass@1 scores (with temperature = 0) for some widely used models.
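Since we draw a single greedy sample per problem, pass@1 here reduces to the percentage of problems whose one completion compiles and passes every test case; as a quick sketch:

def pass_at_1(problem_passed: list[bool]) -> float:
    # One sample per problem at temperature 0, so pass@1 is simply the solve rate in %.
    return 100.0 * sum(problem_passed) / len(problem_passed)

The results: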
| Model | Pass@1 (%) | Compile (%) |
|---|---|---|
| GPT-3.5 Turbo | 4.11 | 19.17 |
| GPT-4 | 8.90 | 47.94 |
| CodeLlama-7b | 0.68 | 25.34 |
| CodeLlama-13b | 1.36 | 13.01 |
| CodeLlama-34b | 2.05 | 78.76 |
GPT-4, the best-performing model, generates a correct solution for just 8.9% of problems. Compare this to HumanEval, where it solves 67% of problems. CodeLlama, one of the best open-source coding models, fares even worse, with the 34b variant only clocking 2%. COBOLEval is hard.
Looking at the failure cases, we can see that state-of-the-art LLMs struggle to generate COBOL that even compiles. Only 47.94% of the solutions generated by GPT-4 compile with GnuCOBOL.
PERFORM VARYING I FROM Y BY -1 UNTIL I < X
IF I MOD 2 = 0 THEN
MOVE I TO MAX-EVEN
EXIT PERFORM
END-IF
END-PERFORM
Here GPT-4 has tried to use the MOD intrinsic function as an infix operator, without the required FUNCTION keyword; the condition should read IF FUNCTION MOD(I, 2) = 0 THEN.
mAInframer-1
We’re also releasing mAInframer-1, a series of models that we’ve fine-tuned to write COBOL. You can download them here: https://huggingface.co/bloopai
| Model | Pass@1 (%) | Compile (%) |
|---|---|---|
| mAInframer-7b | 6.16 | 69.17 |
| mAInframer-13b | 8.90 | 54.10 |
| mAInframer-34b | 10.27 | 73.97 |
All three mAInframer models considerably outperform the CodeLlama models they are based on, and the 34b model achieves a higher pass@1 than GPT-4! More details about how we did that coming soon 😀
What’s next?
There’s clearly a lot of room to improve LLM-generated COBOL. We hope that the community can use COBOLEval to track the performance of the latest models and build LLMs that help maintain the world’s supply of critical COBOL code.