🤖 Video Report on the SWE-rebench Benchmark
A video of the presentation by Ibragim Badertdinov (Nebius) at the AI Engineer Europe conference has been released. The talk covers SWE-rebench, a dynamic benchmark that evaluates AI agents on real GitHub tasks and refreshes its task set monthly to prevent data leakage. In the current rankings, gpt-5.5 and Junie lead, while Cursor shows the best cost-efficiency ($0.23 per task).
🌍 Dynamic benchmarks with temporal splitting are becoming critically important for the objective evaluation of agentic systems, as static datasets quickly become contaminated.
👤 SWE-rebench tests whether models can act in a real-world environment, helping to distinguish true AI coders from models that have simply memorized solutions.
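
To make the idea of temporal splitting concrete, here is a minimal Python sketch. It is not SWE-rebench's actual pipeline: the `Task` class, its fields, the `uncontaminated` helper, and the sample dates are all hypothetical and exist only for illustration. The idea is simply to keep tasks created after a model's training cutoff, so the model cannot have seen their solutions during training.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    """A hypothetical benchmark task; the field names are illustrative only."""
    task_id: str
    created: date  # date the underlying issue or pull request was created


def uncontaminated(tasks: list[Task], training_cutoff: date) -> list[Task]:
    """Keep only tasks created strictly after the model's training cutoff."""
    return [t for t in tasks if t.created > training_cutoff]


# Made-up sample data, not real benchmark tasks.
tasks = [
    Task("repo-a#101", date(2024, 11, 3)),
    Task("repo-b#202", date(2025, 2, 14)),
    Task("repo-c#303", date(2025, 3, 1)),
]

fresh = uncontaminated(tasks, training_cutoff=date(2025, 1, 1))
print([t.task_id for t in fresh])
```

Because the filter depends on each model's own cutoff date, a regularly refreshed task pool keeps a non-empty evaluation set available even for the newest models.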
Source 1: https://youtu.be/wcUJWP6WpGM
Source 2: https://swe-rebench.com/
