The shift toward agentic AI for tax reporting exposes a critical liability problem rooted in the probabilistic nature of large language models. TaxCalcBench results show that even flagship general-purpose models score below 50% accuracy on specialized tax calculations, creating risk for both users and developers.

What Happened

TaxCalcBench results (July 2025) showed low accuracy among flagship models on tax tasks: GPT-5 scored 41.7%, Gemini 2.5 Pro 32.4%, and Claude Opus 4 27.5%. By contrast, Filed, a specialized architecture, reached 72.5%. The stakes are high because the IRS holds the taxpayer fully responsible for any errors.

Context

There is a fundamental gap between the probabilistic approach of modern LLMs and the deterministic requirements of tax law. The situation is complicated by the legal precedent *United States v. Heppner* (February 2026), which confirmed that data processed by AI may not be covered by attorney-client privilege.

Why It Matters for the Industry

The industry needs to move from general-purpose chatbots to specialized multi-agent architectures (Vertical AI) that can deliver deterministic results. New regulatory standards and benchmarking methods for agentic systems, comparable to the EU AI Act, are expected, along with mandatory verification mechanisms and guardrails.
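
To make the idea of a verification guardrail concrete, here is a minimal Python sketch in which a model's proposed tax figure is accepted only if it matches a deterministic recomputation. Everything in it is an assumption made for illustration: the bracket table is invented and does not reflect any real tax schedule, and the names `ILLUSTRATIVE_BRACKETS`, `compute_tax`, and `verify_model_answer` are hypothetical. It does not describe how Filed or any benchmarked system works.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical brackets for illustration only; these are NOT real tax rates.
# Each entry is (upper bound of the bracket, marginal rate); None = no upper bound.
ILLUSTRATIVE_BRACKETS = [
    (Decimal("10000"), Decimal("0.10")),
    (Decimal("40000"), Decimal("0.20")),
    (None, Decimal("0.30")),
]


def compute_tax(taxable_income: Decimal) -> Decimal:
    """Deterministically compute tax under the illustrative brackets."""
    if taxable_income < 0:
        raise ValueError("taxable_income must be non-negative")
    tax = Decimal("0")
    lower = Decimal("0")
    for upper, rate in ILLUSTRATIVE_BRACKETS:
        if upper is None or taxable_income < upper:
            # Income ends inside this bracket: tax the remainder and stop.
            tax += (taxable_income - lower) * rate
            break
        # Income exceeds this bracket: tax the full bracket width.
        tax += (upper - lower) * rate
        lower = upper
    return tax.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def verify_model_answer(taxable_income: Decimal, model_answer: Decimal,
                        tolerance: Decimal = Decimal("0.00")):
    """Guardrail: accept a model's figure only if it matches the recomputation.

    Returns (accepted, expected) so a caller can fall back to the
    deterministic value when the model's answer is rejected.
    """
    expected = compute_tax(taxable_income)
    accepted = abs(model_answer - expected) <= tolerance
    return accepted, expected


if __name__ == "__main__":
    # A model answer that is off by 100 is rejected, not silently used.
    print(verify_model_answer(Decimal("25000"), Decimal("4100.00")))
```

The design choice in this sketch is that the language model never has the final word on a number: arithmetic lives in ordinary code that returns the same result on every run, and the guardrail only compares the two values.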

Why It Matters for Users

Using public LLMs for tax calculations carries serious financial risk: errors can cost users thousands of dollars. There is also a privacy threat: personal data sent to an AI service may be accessible to third parties and used in legal proceedings.

What Is Not Yet Known / Limitations

It remains unclear how new regulatory frameworks will affect the pace of Vertical AI adoption in the financial sector.

Author

Look at AI, Editorial Staff