Linear's AI agent sent incorrect emails to an existing client six times, exposing a critical gap: standard text-quality evaluation methods cannot detect errors in action logic or fact-checking.

What Happened
Linear's AI agent repeatedly sent incorrect sales emails to an existing customer. The problem was not the quality of the writing: the agent ignored the customer's actual status and contact history, violating the so-called "state contract."
Context
Common AI evaluation methods such as LLM-as-a-judge focus on linguistic qualities like fluency and coherence. With autonomous agents, however, the errors move from hallucinated text to unauthorized actions taken on incomplete or incorrect data about the external world.
Why This Matters for the Industry
The industry is shifting from evaluating generated text to verifying the "evidence path." Developers need state-verification mechanisms that check the agent's intended action against the current system state (customer status, email domain, contact history) before a critical action is executed.
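A state-verification step of this kind might look roughly like the sketch below. It is illustrative only and does not describe Linear's actual system: the names `CustomerState`, `ProposedEmail`, and `check_before_send`, the `"sales_outreach"` label, and the seven-day cooldown are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class CustomerState:
    """Hypothetical snapshot of what the system of record knows about a recipient."""
    email: str
    is_existing_customer: bool
    last_contacted: list[datetime] = field(default_factory=list)


@dataclass
class ProposedEmail:
    """Hypothetical action the agent wants to execute."""
    recipient: str
    kind: str  # assumed labels, e.g. "sales_outreach" or "support_reply"


def check_before_send(action, state, customer_domains, now,
                      cooldown=timedelta(days=7)):
    """Return the reasons to block the action; an empty list means it may proceed."""
    reasons = []
    domain = action.recipient.rsplit("@", 1)[-1].lower()

    if action.kind == "sales_outreach":
        # Customer status: do not pitch someone who has already bought.
        if state.is_existing_customer:
            reasons.append("recipient is already a customer")
        # Email domain: a colleague at a customer company is not a new lead.
        if domain in customer_domains:
            reasons.append(f"domain {domain} belongs to an existing customer")

    # Contact history: avoid repeated messages inside the cooldown window.
    if any(now - sent < cooldown for sent in state.last_contacted):
        reasons.append("recipient was contacted within the cooldown window")

    return reasons
```

The check runs on system state rather than on the draft text, so a fluent, polite email can still be blocked. Returning reasons instead of a bare boolean also leaves a record of why an action was stopped.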
Why This Matters for Users
Anyone using or building AI agents cannot rely solely on how polite and grammatical a bot's responses are. It is vital to check whether the system has mechanisms to confirm critical data before it hits the "send" button; without them, the operational risk remains.
Author
Look at AI, Editorial Team
