Multi-language Evidence: Guiding Coding Agents Beyond Python


This page provides supporting evidence that the value of using CodeHealth as a compass for coding agents generalizes beyond Python.

For this series of experiments, we deployed a self-hosted instance of the open-source coding agent gptme on infrastructure equipped with NVIDIA A100 GPUs. The agent used Qwen3-Coder-30B, served on the same machine, as its underlying LLM. This is the strongest medium-sized model we have worked with and, critically, it supports tool calling and integrates with the CodeScene MCP server. For each file, the agent was allowed up to 50 refactoring iterations.
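The per-file loop can be sketched roughly as follows. Note that the callback names (`code_health`, `refactor_step`, `run_tests`) and the accept/reject policy are illustrative assumptions, not the actual gptme/CodeScene MCP protocol:

```python
MAX_ITERATIONS = 50  # per-file refactoring budget used in the experiments

def refactor_file(source, code_health, refactor_step, run_tests):
    """Iteratively refactor `source`, keeping only candidates that pass
    the tests and measurably improve the CodeHealth score.

    `code_health`, `refactor_step`, and `run_tests` are hypothetical
    callbacks standing in for the MCP health check, one LLM refactoring
    attempt, and the project's test suite, respectively.
    """
    best, best_health = source, code_health(source)
    for _ in range(MAX_ITERATIONS):
        candidate = refactor_step(best)
        if not run_tests(candidate):      # reject behavior-breaking edits
            continue
        health = code_health(candidate)
        if health > best_health:          # keep only measurable improvements
            best, best_health = candidate, health
        if best_health >= 10.0:           # 10 is CodeHealth's maximum score
            break
    return best, best_health
```

The key design point is that CodeHealth acts as the compass: an edit is only accepted when the score improves and the tests still pass, which bounds how far the agent can drift from working code.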

Test pass rate as a function of CodeHealth

Figure 1. gptme with Qwen3 and CodeHealth guidance via MCP.

File-level refactoring results

Each horizontal line represents the CodeHealth journey of a single file; the files are sorted vertically by their original CodeHealth score:
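As a toy illustration of that ordering (the file names and scores below are made up for the sketch, not taken from the study):

```python
# Made-up (before, after) CodeHealth pairs on the 1-10 scale, one per file.
journeys = {
    "parser.cpp": (3.2, 8.9),
    "core.py":    (4.5, 10.0),
    "utils.java": (6.1, 9.4),
}

# Sort rows by the *original* score, mirroring the figure's vertical ordering.
rows = sorted(journeys.items(), key=lambda item: item[1][0])
for name, (before, after) in rows:
    print(f"{name:11s} {before:4.1f} -> {after:4.1f}")
```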

Figure 2. Detailed results from refactoring with gptme guided by CodeHealth. Note that the apparent bumps in the C++ and Java plots are effects of the file sampling strategy.

This research was conducted at CodeScene and Lund University with support from Vinnova, Sweden’s Innovation Agency.
