clawmark, an open-source CLI utility written in Rust, has been released. It runs objective A/B tests of CLAUDE.md instruction files against the SWE-bench Lite benchmark.

What Happened

Developer emiliolugo introduced clawmark, which compares the effectiveness of two variants of a CLAUDE.md instruction file. The tool runs Claude locally to generate patches, then evaluates them automatically with the official SWE-bench harness in a Docker environment. When the run completes, it produces a summary report of the results.
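To make the final step concrete, here is a minimal sketch of how per-task outcomes for two variants could be rolled up into a summary. It is not clawmark's actual report format or code: the `VariantSummary` type, the `summarize` and `format_report` helpers, and the task IDs are all hypothetical, and the outcomes are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantSummary:
    """Aggregate outcome for one CLAUDE.md variant (hypothetical structure)."""
    name: str
    resolved: int
    total: int

    @property
    def resolve_rate(self) -> float:
        # Guard against an empty run so the rate is always defined.
        return self.resolved / self.total if self.total else 0.0


def summarize(name: str, results: dict[str, bool]) -> VariantSummary:
    """Count resolved tasks; `results` maps task ID -> resolved flag."""
    return VariantSummary(name, sum(results.values()), len(results))


def format_report(a: VariantSummary, b: VariantSummary) -> str:
    """Render a plain-text side-by-side summary of two variants."""
    lines = [
        f"{s.name}: {s.resolved}/{s.total} resolved ({s.resolve_rate:.1%})"
        for s in (a, b)
    ]
    delta = b.resolve_rate - a.resolve_rate
    lines.append(f"delta ({b.name} - {a.name}): {delta:+.1%}")
    return "\n".join(lines)


# Placeholder task IDs and outcomes, invented purely for illustration.
variant_a = summarize(
    "A", {"task-1": True, "task-2": False, "task-3": False, "task-4": True}
)
variant_b = summarize(
    "B", {"task-1": True, "task-2": True, "task-3": False, "task-4": True}
)
report = format_report(variant_a, variant_b)
print(report)
```

Both variants are scored on the same task set, so the difference in resolve rate can be read as the effect of the instruction file, subject to run-to-run variance in the model's output.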

Context

Traditionally, configuring system prompts and CLAUDE.md files for AI agents relies on developer intuition, which makes objective quality assessment difficult. clawmark makes the process measurable: it uses standardized SWE-bench Lite tasks to check whether an agent actually solves software problems, not merely whether it follows textual instructions.

Why It Matters for the Industry

The tool supports a shift from intuition-based prompt engineering to eval-driven development. Developers of LLM-based systems can measure how instruction changes affect agent behavior and build automated testing and verification of system instructions on those measurements.

Why It Matters for Users

Developers and engineers can quickly test hypotheses about improving their AI agents' behavior. clawmark reduces the risk of regressions in agent output quality when prompts are updated, because manual testing is replaced by an automated process backed by empirical data.
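A regression check of this kind can be sketched as a per-task diff between the current instruction file and a proposed one. The sketch below is an assumption about how such a check might look, not part of clawmark: the `diff_outcomes` helper and the task IDs are hypothetical, and the outcomes are invented.

```python
def diff_outcomes(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> tuple[list[str], list[str]]:
    """Compare two runs over the tasks they share.

    Returns (regressions, improvements): tasks the baseline resolved but the
    candidate did not, and tasks the candidate resolved but the baseline did not.
    """
    shared = baseline.keys() & candidate.keys()
    regressions = sorted(t for t in shared if baseline[t] and not candidate[t])
    improvements = sorted(t for t in shared if candidate[t] and not baseline[t])
    return regressions, improvements


# Placeholder outcomes (task ID -> resolved), invented for illustration.
current_prompt = {"task-1": True, "task-2": True, "task-3": False}
updated_prompt = {"task-1": True, "task-2": False, "task-3": True}

regressions, improvements = diff_outcomes(current_prompt, updated_prompt)
print("regressions:", regressions)
print("improvements:", improvements)
```

Listing the affected tasks, rather than only the aggregate rate, shows when an update trades one solved task for another while the overall score stays the same.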

Author

Look at AI, Editorial Team