Rubric: A New Python Framework for Deep Testing of AI Agent Behavior

Rubric has been introduced—an independent Python framework designed for the comprehensive evaluation of AI agent behavior. Unlike classical methods that only check the textual output, Rubric analyzes the system's internal processes: tool calls, passed arguments, the order of execution, latency, and reasoning logic.

What Happened

A developer has introduced Rubric, a tool for testing and benchmarking LLM agents. The framework allows for the analysis of the action trace and reasoning trace. The project supports integration with LangGraph and the OpenAI message format, and also enables automated testing in CI/CD via GitHub Actions.

Context

Traditional LLM evaluation methods often focus exclusively on the final textual response (output-only). However, when working with autonomous agents, the problem of "invisible" regressions arises: changing a prompt or a model might not change the text of the response, but it can disrupt the logic of interacting with external tools or the execution order of critical steps.

Why It Matters for the Industry

The tool allows for a transition from evaluating LLMs as chatbots to testing them as full-fledged operating systems (Agentic OS). For the industry, this means the possibility of implementing automated QA systems where the focus shifts from text generation to verifying the reliability of multi-step task execution and adherence to constraints when using tools.

Why It Matters for Users

Developers gain a ready-made mechanism for creating unit tests for agent logic, which helps minimize risks when updating prompts or switching models. This provides the ability to control not just "pretty answers," but also actual compliance with behavioral rules, such as prohibiting the use of certain tools in specific scenarios.

Sources

GitHub - Kareem-Rashed/rubric-eval

Author

Look at AI, Editorial Team