Epoch AI and METR have introduced MirrorCode, a new benchmark that tests whether AI agents can reconstruct software from scratch. Agents work only from a program's observable behavior (black-box access) and its documentation, with no access to the source code.
What Happened
MirrorCode covers 25 programs across six programming languages: Python, C, Rust, Go, OCaml, and Ada. A reconstruction is verified by byte-exact matching of its output (stdout/stderr) against the original program. The best result came from the Claude Opus 4.7 model, which scored 56%. Notably, it successfully reconstructed gotree, a 16,000-line bioinformatics toolkit in Go, in just 14 hours at a cost of $251.
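To make byte-exact output matching concrete, here is a minimal sketch in Python. It is not MirrorCode's actual harness: the function names (`run_capture`, `outputs_match`, `pass_rate`) are hypothetical, and the sketch assumes that only stdout and stderr are compared, since the source does not say whether exit codes are checked as well.

```python
import subprocess
import sys
from typing import Iterable, Sequence, Tuple


def run_capture(cmd: Sequence[str], stdin_bytes: bytes = b"",
                timeout: float = 10.0) -> Tuple[bytes, bytes]:
    """Run a command and return its raw (stdout, stderr) bytes, undecoded."""
    proc = subprocess.run(
        list(cmd),
        input=stdin_bytes,
        capture_output=True,
        timeout=timeout,
        check=False,  # a non-zero exit code is not an error for this sketch
    )
    return proc.stdout, proc.stderr


def outputs_match(reference_cmd: Sequence[str], candidate_cmd: Sequence[str],
                  stdin_bytes: bytes = b"") -> bool:
    """True only if stdout and stderr are identical byte for byte.

    Assumption: exit codes are ignored, because the source mentions
    only stdout/stderr.
    """
    return (run_capture(reference_cmd, stdin_bytes)
            == run_capture(candidate_cmd, stdin_bytes))


def pass_rate(reference_cmd: Sequence[str], candidate_cmd: Sequence[str],
              test_inputs: Iterable[bytes]) -> float:
    """Fraction of test inputs on which the candidate matches the reference."""
    inputs = list(test_inputs)
    if not inputs:
        raise ValueError("at least one test input is required")
    passed = sum(outputs_match(reference_cmd, candidate_cmd, data)
                 for data in inputs)
    return passed / len(inputs)


if __name__ == "__main__":
    # Two stand-in "programs": the candidate omits the trailing newline,
    # so it fails a byte-exact check even though the text looks the same.
    reference = [sys.executable, "-c", "print('hi')"]
    candidate = [sys.executable, "-c", "import sys; sys.stdout.write('hi')"]
    print(outputs_match(reference, candidate))
```

Comparing raw bytes rather than decoded text is the point of the design: a missing newline or a stray space counts as a failure, which leaves no room for approximately correct output.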
Context
The benchmark uses black-box testing: agents see only how the original program behaves, not how it is written. This reduces the risk of hardcoded answers and shifts the focus from syntax checking to the agents' capacity for long-horizon planning. Testing reasoning at this depth is expensive, with compute budgets reaching tens of billions of tokens and thousands of dollars for a single run.
Why It Matters for the Industry
MirrorCode sets a new standard for evaluating agent autonomy, shifting attention from simple chatbots to agents that carry out long chains of actions on their own. It encourages the development of tools for evaluating long-horizon planning and creates demand for specialized infrastructure to run resource-intensive agentic workloads. In the long term, this could lead to commercial tools for automated legacy code migration.
Why It Matters for Users
For developers and users, the benchmark is an important indicator of how realistic it is to delegate the re-creation of existing utilities to AI. Agents already succeed with simple tools, but human involvement remains essential for quality control and system design in large-scale architectural projects (the Large tier).
What Is Still Unknown / Limitations
The current high cost of computation makes large-scale industrial use of AI agents for complex tasks economically infeasible. In addition, models still make errors on edge cases and in complex modular architectures.
Author
Look at AI, Editorial Team