Building an AI Agent (To Modernise COBOL)

Louis Knight-Webb

June 16th, 2024

You'd be forgiven for thinking that a company with a (simplified) mission of "convert COBOL to Java with AI" would go to great lengths to protect the AI part. First of all, language models are improving at a pace that makes any approach potentially redundant within 12 months. Secondly, we have a long list of unsolved problems and if we talk about them publicly, there's a chance that more people might become interested in solving them.

Legacy code is everywhere in Fortune 500 companies, yet only a small fraction of these codebases are available as open source. As a result, most large language models (LLMs) are trained primarily on code written in modern languages and therefore are more accurate on tasks involving modern languages. We exploit this imbalance by converting COBOL to Java using a static transpiler, and using LLMs to refactor the Java into more modern and readable Java.

The effectiveness of AI agents in any domain depends on their ability to evaluate their work. Since LLMs are probabilistic and prone to errors, this self-evaluation is crucial. In the domain of code translation, the original codebase's behaviour serves as our ground truth and we test whether each refactor alters the program's behaviour. Additionally, we use static code complexity calculators and model-graded evaluations to ensure that each refactor improves code readability.
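The behaviour check described above can be sketched as a characterisation test: replay a set of recorded inputs through both the original and refactored versions of a method and require identical outputs. The method, its inputs, and the zero-padding example here are hypothetical stand-ins, not taken from a real customer codebase.

```java
import java.util.List;
import java.util.function.Function;

// Characterisation-test sketch: the original (transpiled) method is the
// ground truth; a refactor is only accepted if it agrees on every input.
public class EquivalenceCheck {

    // Hypothetical original method: zero-pads an account id to 8 digits,
    // written the way a transpiled COBOL STRING sequence might look.
    static String formatAccountIdOriginal(String id) {
        StringBuilder sb = new StringBuilder(id);
        while (sb.length() < 8) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    // Candidate LLM refactor of the same method.
    static String formatAccountIdRefactored(String id) {
        return "0".repeat(Math.max(0, 8 - id.length())) + id;
    }

    // True iff both versions agree on every recorded input.
    static boolean behaviourPreserved(List<String> recordedInputs,
                                      Function<String, String> original,
                                      Function<String, String> refactored) {
        return recordedInputs.stream()
                .allMatch(in -> original.apply(in).equals(refactored.apply(in)));
    }

    public static void main(String[] args) {
        List<String> inputs = List.of("42", "123456", "12345678");
        boolean ok = behaviourPreserved(inputs,
                EquivalenceCheck::formatAccountIdOriginal,
                EquivalenceCheck::formatAccountIdRefactored);
        System.out.println(ok ? "EQUIVALENT" : "DIVERGED");
    }
}
```

In practice the recorded inputs would come from exercising the original program, but the accept/reject logic is this simple comparison.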

With unsupervised evaluation in place, we can make infinite attempts at refactoring a codebase, safe in the knowledge that any errors introduced by the LLM will be caught and corrected.
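That retry loop can be sketched as a propose-and-check routine: ask the model for a refactor, keep it only if it preserves behaviour and lowers a complexity score, and otherwise discard it and try again. The function names and the use of a plain integer complexity metric are illustrative assumptions, not the actual pipeline.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.ToIntFunction;

// Sketch of an unsupervised refactor loop: errors introduced by the LLM are
// caught by the evaluators and the faulty candidate is simply thrown away.
public class RefactorLoop {

    static Optional<String> refactorUntilValid(
            String originalSource,
            Function<String, String> proposeRefactor,   // stand-in for the LLM call
            Predicate<String> behaviourPreserved,       // test-suite oracle
            ToIntFunction<String> complexity,           // static complexity metric
            int maxAttempts) {
        int baseline = complexity.applyAsInt(originalSource);
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String candidate = proposeRefactor.apply(originalSource);
            boolean behaves = behaviourPreserved.test(candidate);
            boolean simpler = complexity.applyAsInt(candidate) < baseline;
            if (behaves && simpler) {
                return Optional.of(candidate); // accept the refactor
            }
            // otherwise discard the candidate and ask again
        }
        return Optional.empty(); // give up and keep the original source
    }
}
```

Because rejected candidates are discarded rather than committed, the loop can run unattended for as many attempts as the budget allows.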

GPT-4o's output is limited to a maximum of 4096 tokens, or roughly 350 lines of code (LoC). That's not much when you consider that our first commercial project is a codebase with five million LoC. So we start with localised changes, refactoring each method in place to reduce the overall LoC. COBOL's minimal standard library and static memory allocation create plenty of opportunities to improve the code: in one case, a 26 LoC method was refactored into 4 by using Java's built-in date formatting utilities instead of manual string manipulation. Once each method has been refactored in place, we find that entire classes start to fit within our 4096 token limit and can be rewritten by the LLM.
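To illustrate the kind of refactor described, here is a hedged reconstruction of a before/after: the actual method from the codebase isn't shown publicly, so the names and the "YYYYMMDD" to "DD/MM/YYYY" conversion are assumptions standing in for it.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Illustrative before/after of replacing manual string manipulation with
// Java's built-in date formatting; method names are hypothetical.
public class DateRefactorExample {

    // Before: transpiled-COBOL style, assembling "DD/MM/YYYY" by hand
    // from a "YYYYMMDD" string, much like a COBOL MOVE/STRING sequence.
    static String formatDateOriginal(String yyyymmdd) {
        String year = yyyymmdd.substring(0, 4);
        String month = yyyymmdd.substring(4, 6);
        String day = yyyymmdd.substring(6, 8);
        StringBuilder out = new StringBuilder();
        out.append(day);
        out.append('/');
        out.append(month);
        out.append('/');
        out.append(year);
        return out.toString();
    }

    // After: the same behaviour via java.time parsing and formatting,
    // which also rejects malformed dates instead of passing them through.
    static String formatDateRefactored(String yyyymmdd) {
        LocalDate date = LocalDate.parse(yyyymmdd, DateTimeFormatter.BASIC_ISO_DATE);
        return date.format(DateTimeFormatter.ofPattern("dd/MM/yyyy"));
    }
}
```

The refactored version is shorter, and the behaviour check from earlier is what lets us accept it with confidence.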

That's a very brief summary of our progress so far, and here are some of the open questions we're trying to answer:

  • Can we fine-tune open-source models like Codestral or Llama-3 to replicate the accuracy and context utilisation of closed-source models? Regulations and internal policies prevent some customers from sharing code with external model providers, so they have to self-host their models.
  • How do we evaluate whether a refactor has fundamentally changed the behaviour of the program?
  • Can we design the system so that it learns from its mistakes over time?

If you think you might be able to help solve some of these open questions, we're hiring. Please reach out by email. The team is talent-dense, based on-site in London, and we have paying customers as well as plenty of runway.