Kong engineers have built a multi-agent AI workflow that automates the elimination of flaky tests in their CI/CD environment. The system, based on an Orchestrator-Investigator-Verifier architecture, resolved complex debugging tasks that had remained open for years.

What Happened
The Kong team built the system on two models: Claude 3 Opus analyzed logs and generated fixes, while Claude 3 Haiku verified the results. During the experiment, the agents fixed 12 of the 15 most problematic tests. One of those fixes eliminated a bug in the configuration loader that engineers had been unable to resolve for five years.
Context
The problem of "flaky" tests (unstable tests that yield different results with the same code) is a critical challenge for the reliability of CI/CD pipelines. Traditional approaches to using LLMs are often limited by context windows and the lack of an iterative verification cycle, which makes solving deep technical problems difficult.
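The definition above suggests a simple check: rerun the same test against unchanged code and see whether the outcomes differ. A minimal Python sketch follows; `classify_test` and `make_intermittent_test` are hypothetical helpers written for illustration and are not part of Kong's tooling.

```python
import itertools
from typing import Callable, Iterable


def classify_test(test: Callable[[], bool], runs: int = 20) -> str:
    """Run one test repeatedly against unchanged code and classify it."""
    outcomes = {test() for _ in range(runs)}
    if outcomes == {True}:
        return "stable-pass"
    if outcomes == {False}:
        return "stable-fail"
    # Both a pass and a failure were observed with identical code.
    return "flaky"


def make_intermittent_test(pattern: Iterable[bool]) -> Callable[[], bool]:
    """Hypothetical stand-in for a flaky test: replays a fixed pass/fail pattern."""
    results = itertools.cycle(pattern)
    return lambda: next(results)


always_passes = classify_test(lambda: True)
intermittent = classify_test(make_intermittent_test([True, True, False]), runs=10)
```

A test that fails only rarely can still pass every one of a small number of reruns, so a "stable-pass" result here is evidence of stability rather than proof.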
Why It Matters for the Industry
This case marks a transition from simply using LLMs as chatbots to creating specialized multi-agent systems. The architecture, which divides roles into Orchestrator, Investigator, and Verifier, allows for an efficient implementation of the "observe-fix-verify" cycle, while optimizing token costs and context management when performing complex engineering tasks.
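The observe-fix-verify cycle described above can be sketched as a small loop. This is an illustrative outline only, not Kong's implementation: the `Orchestrator` class, the `investigate` and `verify` callables, and the `max_rounds` limit are all assumptions, and the model calls are replaced by plain Python stubs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Attempt:
    patch: str
    verified: bool


@dataclass
class Orchestrator:
    # Investigator role: (logs, previously rejected patches) -> proposed patch.
    investigate: Callable[[str, List[str]], str]
    # Verifier role: proposed patch -> True if the fix holds up.
    verify: Callable[[str], bool]
    max_rounds: int = 3  # assumed budget, caps cost per test
    history: List[Attempt] = field(default_factory=list)

    def run(self, logs: str) -> Optional[str]:
        """Repeat investigate/verify until a patch is accepted or the budget ends."""
        for _ in range(self.max_rounds):
            rejected = [a.patch for a in self.history]
            patch = self.investigate(logs, rejected)
            accepted = self.verify(patch)
            self.history.append(Attempt(patch, accepted))
            if accepted:
                return patch
        return None


# Stubs standing in for the model calls (hypothetical behaviour).
def stub_investigate(logs: str, rejected: List[str]) -> str:
    return f"patch-{len(rejected) + 1}"


def stub_verify(patch: str) -> bool:
    return patch == "patch-2"


orchestrator = Orchestrator(stub_investigate, stub_verify)
result = orchestrator.run("placeholder CI log text")
```

Keeping the two roles as separate callables is what would let a larger model handle investigation while a cheaper one handles verification, as the article describes. Passing the rejected patches back to the investigator is one possible way to carry context between rounds.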
Why It Matters for Users
For engineers, this is practical evidence that AI agents can work effectively on real technical debt and automate routine debugging. Integrating such tools into the Software Development Life Cycle (SDLC) frees specialists to spend time on designing new features and reduces the operational cost of infrastructure maintenance.
What Is Not Yet Known / Limitations
Delegating debugging tasks to autonomous agents carries potential business risks and raises intellectual property (IP) protection questions that still need to be addressed.
Author
Look at AI, Editorial Staff
