Lessons from Building Local AI Workflows: Multi-Agent System Challenges and Solutions

Developing local multi-agent systems to automate complex tasks, such as video editing, has revealed fundamental limitations in modern language models and architectural approaches.

What Happened

Building an automated video editing system surfaced three technical problems: the Lost-in-the-Middle effect (LLMs making poor use of information in the middle of a long context), the sycophancy effect (reviewer agents tending to agree with generator agents), and Whisper's inaccuracy in determining logical sentence boundaries. Each problem received its own fix: a "sandwich" technique that repeats key information at the beginning and end of a prompt, model heterogeneity that assigns verification to a different LLM family than generation, and the replacement of Whisper with Vosk for more precise acoustic alignment.
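The "sandwich" technique can be illustrated with a minimal sketch. The function name `build_sandwich_prompt` and the `reminder_prefix` wording are assumptions made for illustration; the source does not show the actual prompt layout.

```python
def build_sandwich_prompt(
    instructions: str,
    context_chunks: list[str],
    reminder_prefix: str = "Reminder of the task:",  # hypothetical wording
) -> str:
    """Place the key instructions both before and after a long context block."""
    key = instructions.strip()
    # The long material goes in the middle, where models use it least reliably.
    body = "\n\n".join(chunk.strip() for chunk in context_chunks)
    # Repeating the instructions at the end puts them at both edges of the prompt.
    return "\n\n".join([key, body, f"{reminder_prefix} {key}"])


prompt = build_sandwich_prompt(
    "Select the three best takes and return their timestamps.",
    ["Transcript part 1 ...", "Transcript part 2 ..."],
)
```

The resulting string starts and ends with the same instruction, with the transcript chunks in between.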

Context

The Lost-in-the-Middle problem is associated with how LLMs distribute attention over a long context: attention concentrates on the beginning and end, so material in the middle is used less reliably. The sycophancy effect in multi-agent systems (MAS) arises when the same model serves as both generator and critic: the critic tends to endorse the generator's output, the discussion collapses, and no objective verification takes place. Additionally, general-purpose models like Whisper may underperform compared to specialized lightweight solutions in tasks that require precise timestamps and audio segmentation.
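A generator-critic setup that enforces model heterogeneity might look like the sketch below. Everything here is hypothetical: `Agent`, `ModelFn`, and the `family` labels are illustrative names, and each `call` stands in for whatever client actually sends a prompt to a local model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that sends a prompt to a model and returns text.
ModelFn = Callable[[str], str]


@dataclass
class Agent:
    family: str   # label for the model lineage, not a real model name
    call: ModelFn


def review_with_heterogeneity(generator: Agent, critic: Agent, task: str) -> tuple[str, str]:
    """Generate a draft, then have a critic from a different family review it."""
    if generator.family == critic.family:
        # Same-family pairs are the setup the article warns against.
        raise ValueError("critic must come from a different model family than the generator")
    draft = generator.call(task)
    verdict = critic.call(
        "Review the following draft critically and list concrete errors.\n\n"
        f"Task: {task}\n\nDraft:\n{draft}"
    )
    return draft, verdict
```

Checking the family label up front is a simple guard; it does not prove the two models are independent, only that they were not configured as the same lineage.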

Why It Matters for the Industry

This case highlights the need to move from simple call chains to hierarchical systems with deliberate architectural heterogeneity. The industry needs Agentic Eval methods that test agents for sycophancy, as well as frameworks that support the integration of specialized tools (audio/vision/math) instead of relying on a single universal model for all tasks.
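One possible shape for such a sycophancy check is sketched below: feed the reviewer drafts that are known to be flawed and measure how often it approves them. The metric, the function name `sycophancy_rate`, and the boolean reviewer interface are assumptions for illustration, not an established Agentic Eval standard.

```python
from typing import Callable


def sycophancy_rate(reviewer: Callable[[str], bool], flawed_drafts: list[str]) -> float:
    """Return the fraction of known-flawed drafts that the reviewer approves.

    `reviewer` is a hypothetical wrapper that returns True when the reviewing
    agent accepts a draft. Lower values mean a less sycophantic reviewer.
    """
    if not flawed_drafts:
        raise ValueError("need at least one flawed draft")
    approved = sum(1 for draft in flawed_drafts if reviewer(draft))
    return approved / len(flawed_drafts)
```

Running the same set of seeded-defect drafts through a same-family reviewer and a different-family reviewer would give two rates that can be compared directly.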

Why It Matters for Users

Developers building AI agents should avoid using the same model for both generation and verification, since identical models tend to confirm each other's errors. For audio and editing tasks, consider Vosk instead of Whisper to get more accurate phrase boundary detection. For long contexts, structure prompts with the "sandwich" technique so that key information appears at both the beginning and the end.
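As a rough illustration of phrase boundary detection from word-level timestamps, the sketch below groups words into phrases wherever the silence between them exceeds a threshold. It assumes word entries shaped like the per-word results Vosk can emit when word output is enabled (dictionaries with `word`, `start`, and `end` in seconds). The pause-based rule and the `min_pause` value are assumptions; the source does not describe how boundaries were actually derived.

```python
def split_into_phrases(words: list[dict], min_pause: float = 0.5) -> list[dict]:
    """Group timestamped words into phrases separated by pauses of at least min_pause seconds."""
    groups: list[list[dict]] = []
    current: list[dict] = []
    for w in words:
        # Start a new phrase when the gap since the previous word is long enough.
        if current and w["start"] - current[-1]["end"] >= min_pause:
            groups.append(current)
            current = []
        current.append(w)
    if current:
        groups.append(current)
    return [
        {
            "text": " ".join(x["word"] for x in g),
            "start": g[0]["start"],
            "end": g[-1]["end"],
        }
        for g in groups
    ]


words = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.45, "end": 0.9},
    {"word": "next", "start": 1.8, "end": 2.1},
]
phrases = split_into_phrases(words)  # two phrases: "hello world" and "next"
```

Each returned phrase carries the start of its first word and the end of its last word, which is the information an editing tool needs to place a cut.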

Sources

Lessons from a Weekend Building Local AI Workflows

Author

Look at AI, Editorial Staff