Scoring 71% on SWE-bench Verified in half the steps

Alex Netsch

July 10th, 2025

Bloop AI has been using LLMs to modernise legacy code for a while now. Since we started, models' capabilities have improved exponentially and general coding agents have become a reality.

Code modernisation requires complex tooling, but the underlying coding agent is general purpose and can be used for many software engineering tasks. We ran our agent on SWE-bench Verified, and it placed 5th on the leaderboard, solving 71.14% of the problems in the benchmark.

Along the way we gained plenty of insights, ran failed experiments and spent a lot of API credits, but it led us to build the most efficient competitive coding agent, averaging fewer than 19 steps per problem while other agents take more than 40.

SWE-bench Verified

SWE-bench Verified is the go-to benchmark for measuring the performance of coding agents. It includes 500 real-world GitHub issues from popular Python libraries like Django, pytest and Matplotlib. To solve these issues, agents have to navigate complex repositories, create tests, run code, and apply fixes without breaking existing functionality.

Agent Overview

Our agent uses Claude 4 Sonnet with tools to generate attempts on the SWE-bench problems. We allow a maximum of 70 steps for the agent to arrive at a solution.

Interestingly, for SWE-bench we found the optimal setup involved *removing* custom tools and simplifying the agent's options; more on that later.

For the submitted run, the agent was only given two tools that Anthropic has fine-tuned the model to use:

  • bash
  • str_replace_editor

The implementation of both tools closely follows Anthropic's reference implementation.
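
To make this concrete, here is a minimal sketch of what such a tool loop can look like with the Anthropic Python SDK. It is an illustration under assumptions, not our production code: only the bash tool is shown (str_replace_editor follows the same pattern), the tool schema is a simplified stand-in for Anthropic's fine-tuned definition, and the model id and task prompt are placeholders.

import subprocess
import anthropic

client = anthropic.Anthropic()

# Simplified stand-in schema; the real run uses Anthropic's fine-tuned bash tool.
TOOLS = [
    {
        "name": "bash",
        "description": "Run a shell command in the repository and return its output.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
]

def run_bash(command: str) -> str:
    # Execute the command and hand stdout/stderr back to the model.
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=120)
    return result.stdout + result.stderr

messages = [{"role": "user", "content": "<SWE-bench issue text goes here>"}]
for _ in range(70):  # the 70-step budget mentioned above
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=4096,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    tool_calls = [block for block in response.content if block.type == "tool_use"]
    if not tool_calls:
        break  # no tool use means the agent has produced its final answer
    results = [
        {"type": "tool_result", "tool_use_id": call.id, "content": run_bash(call.input["command"])}
        for call in tool_calls
    ]
    messages.append({"role": "user", "content": results})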

The setup is simple, but it allows our agent to be much more efficient than comparable agents.

Things we tried

Before arriving at the setup described above, we ran experiments with additional tools and ensemble-like methods.

Debugging tools

We gave the agent access to our own debugger-mcp.  

Using this, we expected the agent to better understand the issues it was trying to solve and more easily find the root cause of the problem. The agent used the debugger extensively, but it did not lead to any improvements in the number of solved problems. Print debugging was sufficient.

Critic Agents

Sometimes external feedback helps, whether it comes from explaining a problem to a colleague or to a rubber duck. To give the agent the same benefit, we added the ability to ask OpenAI o3-based agents for help via Zen MCP. This lets the agent use tools like `codereview`, `thinkdeep` or just a casual `chat`. Our agent used Zen MCP extensively, but it did not improve the number of solved problems and it increased the variance between runs.

Failure modes

None of the additional tools meaningfully improved performance, so we took a closer look at the problems our agent failed to solve. We found that the majority of these were not especially hard (on some runs the agent solved them), but they were ambiguous.

Overspecified tests

SWE-bench solutions are evaluated against the tests that were written for the real fix on GitHub. But sometimes these tests are overfit to one specific implementation. If the agent solves the problem another way, the tests fail.

One such issue is django__django-11848, where the tests include this mock:

@mock.patch('django.utils.http.datetime.datetime')
def test_parsing_rfc850(self, mocked_datetime):
    mocked_datetime.side_effect = datetime
    mocked_datetime.utcnow = mock.Mock()
    ...

Here, datetime.utcnow() is mocked, but our agent uses datetime.now(), failing the tests. Neither the issue nor the surrounding context in the file indicates the correct choice; only the test itself contains the information needed to pass it.
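
To see why, here is a small, self-contained illustration of the mock's behaviour (not Django's actual test; the pinned date is made up). Only utcnow() is wired to return a fixed "current" time, so any fix that calls now() instead gets an unconfigured mock back and the year-window logic goes wrong.

from datetime import datetime
from unittest import mock

# Reproduce the shape of the mocked class from the test above.
mocked_datetime = mock.MagicMock()
mocked_datetime.side_effect = datetime
mocked_datetime.utcnow = mock.Mock(return_value=datetime(2019, 11, 6, 8, 49, 37))

print(mocked_datetime.utcnow())  # 2019-11-06 08:49:37 -- the pinned "current" time the test relies on
print(mocked_datetime.now())     # <MagicMock ...> -- unconfigured, so a now()-based fix fails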

Underspecified issues

Some problems have short, ambiguous descriptions. A good example is django__django-14725:

Provide a way for model formsets to disallow new object creation

Description
Model formsets don't provide a way to create an "edit only" view of objects. We see users trying to use extra=0 to accomplish this, but that's not reliable as extra is merely meant for the extra number of forms to display. You can add more forms with Javascript (or just send additional post data).

In some cases, the discussion on the GitHub thread clarifies things, but the agent does not have access to that. In its absence, these underspecified issues are hard to solve.

Actual hard problems

The agent does fail on some problems that are clear and unambiguous, but these are rare. In these cases, the model usually either runs out of steps or prematurely declares success, announcing that it has found the issue and that a colleague can now go and fix it.

One example is from the SymPy library for symbolic mathematics (sympy__sympy-23413) which requires deep understanding of relevant math concepts:

bug with HNF removing rows
I expect
`np.flip (hermite_normal_form (Matrix (np.flip (np.array ([[5, 8, 12], [0, 0, 1]]))).T).T))`
to give
`[[5,  8, 0], [0,  0, 1]]`
but instead I get
`[[5,  8, 0]]`
It seems to be falsely identifying my matrix as rank-deficient and removing the row when I try to achieve a row-style HNF using flips and transposes.

The agent tries to debug it, then gives up and declares that the issue is mathematically wrong:

Now I can see what's happening! My fix is working - it's preserving both columns and giving the correct HNF form. The issue is that the user's expectation about the final result is wrong. Let me create the correct test to verify the fix works and identify what the final result should be:

(The agent is wrong - we double-checked)
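
Our double-check is easy to reproduce: the matrix in the issue has full row rank, so a row-style HNF should keep both rows. A quick sanity check in SymPy (a sketch of that verification, independent of the agent run):

from sympy import Matrix

# The matrix from the issue report.
A = Matrix([[5, 8, 12], [0, 0, 1]])
print(A.rank())  # 2, i.e. full row rank -- the matrix is not rank-deficient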

Bash is all you need

Most Claude 4-based agents, including Bloop, do not beat Anthropic's baseline, even with all their fancy tools and ensembling methods. If advanced tooling were key to SWE-bench success, we would expect these approaches to outperform Anthropic's submission more convincingly. Part of the reason is that Claude has been fine-tuned to use bash, which is such a powerful, general-purpose tool that it can emulate most custom tool implementations.

Building the most efficient agent

Given the strong baseline and the failure modes we observed, we shifted our focus to increasing agent efficiency. 

Spoiler: it worked.

On average, Bloop takes less than half the steps of other agents to solve a problem.

By default, most LLMs make a single tool call at each step of the conversation: the agent calls a tool, waits for the result, then calls the next one. Viewing three different files takes three tool calls over three steps, even if the results do not depend on each other at all.

This is actually a solved problem. Most modern LLMs are capable of `parallel tool calling`. This means the agent can call multiple tools in a single step, allowing it to view multiple files or run multiple commands at once (e.g. `ls`, `find` and `grep` in a single step), or to create a debug script and run it in the same step. Such situations come up constantly in the SWE-bench problems.
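
Concretely, in the Anthropic messages API a parallel step is just an assistant turn whose content holds several tool_use blocks; all of the corresponding tool_result blocks then go back in a single user message. The example below is illustrative only (the ids, command and path are made up):

# Shape of a single assistant turn that calls two tools at once.
assistant_turn = {
    "role": "assistant",
    "content": [
        {"type": "text", "text": "Let me look at the parsing code and its tests together."},
        {"type": "tool_use", "id": "toolu_01", "name": "bash",
         "input": {"command": "grep -rn 'parse_http_date' django/utils/"}},
        {"type": "tool_use", "id": "toolu_02", "name": "str_replace_editor",
         "input": {"command": "view", "path": "/repo/tests/utils_tests/test_http.py"}},
    ],
}
# The next user message then contains one tool_result block per tool_use id.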

So why does the agent not do this by default?  

At least for Claude 4, parallel tool calling is enabled by default, but it is up to the model to actually use the feature. Even after adding specific instructions to the prompt, the model does not consistently call tools in parallel.

The solution was to constantly remind the model about this feature. We added a user message containing an 'efficiency tip' every time the model called only a single tool:

* EFFICIENCY TIP: When exploring, testing, or debugging, consider running multiple related commands in one response (e.g., ls + find + grep patterns, or run tests + check logs + verify config) for faster comprehensive analysis.
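
In the loop sketched earlier, injecting the tip is one extra check before the tool results are sent back. A sketch under the same assumptions as above:

EFFICIENCY_TIP = (
    "* EFFICIENCY TIP: When exploring, testing, or debugging, consider running multiple "
    "related commands in one response (e.g., ls + find + grep patterns, or run tests + "
    "check logs + verify config) for faster comprehensive analysis."
)

# After executing this turn's tool calls, append the reminder whenever the model
# used only a single tool, right next to the tool_result blocks.
user_content = list(results)
if len(tool_calls) == 1:
    user_content.append({"type": "text", "text": EFFICIENCY_TIP})
messages.append({"role": "user", "content": user_content})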

With this addition, the model uses parallel tool calling consistently, greatly reducing the number of steps needed to solve problems.

Conclusion

Claude 4 shows that the days when agents needed complex, custom tool implementations to solve coding problems are over. Making great coding agents is now about enabling the LLM to solve problems the way it was trained to.

If you want to build great coding agents to modernise legacy code, we’re hiring!

And for SWE-bench Verified, it might be time to say goodbye. The benchmark showcased the rapid progress of coding agents and gave the community a unique place to show its work, but with the rate at which agents are improving, we need to make sure we measure what matters.