Can LLMs rewrite software from scratch?
Epoch AI and METR have introduced MirrorCode, a benchmark that tests whether AI agents can reconstruct software from scratch using only black-box access to a program's behavior and its documentation, with no access to the source code. The benchmark covers 25 programs across six languages (Python, C, Rust, Go, OCaml, Ada), and results are verified by byte-exact output matching. Claude Opus 4.7 leads with a score of 56%, and it successfully reconstructed the bioinformatics toolkit *gotree* (16,000 lines of Go) in just 14 hours.
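To make "byte-exact output matching" concrete, here is a minimal sketch of what such a check could look like. It is not MirrorCode's actual harness: the function names (`run_bytes`, `byte_exact_match`, `pass_fraction`), the choice to also compare exit codes, and the simple pass fraction are all assumptions made for illustration.

```python
import subprocess
import sys


def run_bytes(cmd, stdin_bytes=b""):
    """Run a command and return (raw stdout bytes, exit code)."""
    proc = subprocess.run(
        cmd,
        input=stdin_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=60,
    )
    return proc.stdout, proc.returncode


def byte_exact_match(reference_cmd, candidate_cmd, stdin_bytes=b""):
    """True only if stdout and exit code are identical, byte for byte.

    Comparing the exit code as well is an assumption of this sketch.
    """
    return run_bytes(reference_cmd, stdin_bytes) == run_bytes(candidate_cmd, stdin_bytes)


def pass_fraction(reference_cmd, candidate_cmd, inputs):
    """Fraction of inputs on which the candidate matches the reference.

    Hypothetical aggregation; the benchmark's real scoring may differ.
    """
    if not inputs:
        return 0.0
    passed = sum(byte_exact_match(reference_cmd, candidate_cmd, data) for data in inputs)
    return passed / len(inputs)


if __name__ == "__main__":
    # Two tiny stand-in "programs": the candidate adds one trailing space.
    reference = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    candidate = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper() + ' ')"]
    print(byte_exact_match(reference, candidate, b"hello\n"))
```

A check this strict means that a single stray space or newline counts as a failure, which is one reason edge cases weigh so heavily in this kind of evaluation.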
MirrorCode sets a new standard for evaluating the "long-horizon" planning of agents, demonstrating that serious tasks require massive computational budgets (up to several thousand dollars per run). The results show that AI is beginning to understand programming logic beyond mere syntax, but still struggles with edge cases and modular architecture.
This is a significant step toward understanding how realistic it is to give an AI the task: "write me an analog of this utility." For now, agents handle simple tools well, but large-scale projects (the benchmark's Large tier) still need human involvement for quality control and architecture.