Bug-fix scenarios
Every run starts from a reproducible failing test and ends with a verified pass.
| Metric | Codna | Cursor | Cline |
|---|---|---|---|
| Fixes verified | 87/87 | 87/87 | 64/87 |
| Accuracy | 100% | 100% | 73.6% |
| Avg wall time | 13.4s | 22.6s | 72.9s |
| Avg tokens / fix | 16.2K | 81.0K | 64.8K |
| Avg cost / fix | $0.021 | n/a | $0.139 |
| Repo understanding | 0 tokens | LLM context | LLM context |
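The derived figures in the table can be re-checked from the raw counts. The sketch below recomputes the accuracy row and the per-fix token and cost ratios relative to Codna. The dictionary layout and function names are illustrative assumptions, not part of any Codna tooling.

```python
# Figures copied from the table above; the data structure is illustrative only.
results = {
    "Codna":  {"verified": 87, "total": 87, "tokens_k": 16.2, "cost_usd": 0.021},
    "Cursor": {"verified": 87, "total": 87, "tokens_k": 81.0, "cost_usd": None},  # cost is n/a
    "Cline":  {"verified": 64, "total": 87, "tokens_k": 64.8, "cost_usd": 0.139},
}


def accuracy_pct(row):
    """Share of scenarios whose fix was verified, as a percentage."""
    return round(100 * row["verified"] / row["total"], 1)


def ratio_vs_codna(tool, key):
    """How many times larger a tool's figure is than Codna's (None if n/a)."""
    value = results[tool][key]
    if value is None:
        return None
    return round(value / results["Codna"][key], 1)


if __name__ == "__main__":
    for tool, row in results.items():
        print(
            tool,
            accuracy_pct(row),
            ratio_vs_codna(tool, "tokens_k"),
            ratio_vs_codna(tool, "cost_usd"),
        )
```

On these numbers, Cursor uses about 5× the tokens per fix, and Cline about 4× the tokens at roughly 6.6× the cost.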
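Methodology
The benchmark set covers single bugs, multiple bugs in the same codebase, and codebases the agent has never seen. Each scenario is counted only after its tests pass.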
Codna mapped 130 codebases in 9.2 seconds, covering 110 languages.
Codna sends the agent an evidence bundle rather than the entire codebase.
Measured spend to ship one verified fix — about 28× cheaper than a typical agentic edit.
Measured 2026-06-15 on identical isolated checkouts. Every scenario counts only when your own test passes. The raw dataset is available for download as JSON.
Codna ran head-to-head against OpenAI Codex CLI and Google Gemini CLI across 8 real bug-fix scenarios. Every fix was verified by the project's own tests. No synthetic tasks, no self-reported numbers.
A fix counts only when the test suite passes; one that compiles but breaks tests is not counted. Codna went 8 for 8.
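The counting rule can be sketched as a small harness: a scenario is counted only if the project's test command exits with status 0. This is a minimal illustration, not the harness used for the benchmark. The function names and the stand-in test commands are assumptions.

```python
import subprocess
import sys


def is_verified(test_cmd, cwd="."):
    """Return True only if the test command exits with status 0."""
    completed = subprocess.run(test_cmd, cwd=cwd, capture_output=True)
    return completed.returncode == 0


def count_verified(scenarios):
    """scenarios: list of (name, test_cmd) pairs. Returns (verified, total)."""
    verified = sum(1 for _, cmd in scenarios if is_verified(cmd))
    return verified, len(scenarios)


if __name__ == "__main__":
    # Stand-in commands: one "suite" that passes and one that fails.
    demo = [
        ("passing", [sys.executable, "-c", "raise SystemExit(0)"]),
        ("failing", [sys.executable, "-c", "raise SystemExit(1)"]),
    ]
    print(count_verified(demo))
```

In a real run, each `test_cmd` would be the project's own test invocation, executed inside that scenario's checkout via `cwd`.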
Codna sends the AI agent an evidence bundle of roughly 600 tokens, 162× less context than reading the full repository. That translates directly to lower API costs, typically pennies per fix.
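Taken at face value, the two figures above imply a full-repository context of about 97,200 tokens. The sketch below makes that arithmetic explicit and shows how input cost scales with context size. The price per million tokens is a placeholder assumption, not a quoted rate for any model.

```python
BUNDLE_TOKENS = 600        # approximate evidence-bundle size quoted above
REDUCTION_FACTOR = 162     # context reduction factor quoted above
PLACEHOLDER_USD_PER_MILLION = 1.00  # hypothetical input price, for illustration


def implied_full_repo_tokens(bundle_tokens, factor):
    """Full-repository context size implied by the bundle size and factor."""
    return bundle_tokens * factor


def input_cost_usd(tokens, usd_per_million_tokens):
    """Input-side cost of sending `tokens` at a given per-million price."""
    return tokens * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    full = implied_full_repo_tokens(BUNDLE_TOKENS, REDUCTION_FACTOR)
    print(full)  # 97200
    print(input_cost_usd(full, PLACEHOLDER_USD_PER_MILLION))
    print(input_cost_usd(BUNDLE_TOKENS, PLACEHOLDER_USD_PER_MILLION))
```

Whatever the actual price, the input cost ratio between the two contexts stays at the same factor, since cost is linear in token count.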
The deterministic engine maps the dependency and blast-radius graph in about 60 ms using zero LLM tokens. The agent receives only the relevant evidence, so there is far less to process before a fix is produced.
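To illustrate what a blast-radius computation over a dependency graph can look like, here is a minimal sketch using a breadth-first walk over reverse dependencies. It is not Codna's engine. The function name, graph format, and module names are hypothetical.

```python
from collections import defaultdict, deque


def blast_radius(dependencies, changed):
    """Return every module that transitively depends on `changed`.

    dependencies maps each module to the modules it imports.
    """
    # Invert the graph: for each module, record who imports it.
    dependents = defaultdict(set)
    for module, imports in dependencies.items():
        for imported in imports:
            dependents[imported].add(module)

    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent != changed and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


if __name__ == "__main__":
    # Hypothetical project layout, for illustration only.
    deps = {
        "api": ["service"],
        "service": ["db", "utils"],
        "cli": ["utils"],
        "db": [],
        "utils": [],
    }
    print(sorted(blast_radius(deps, "utils")))  # ['api', 'cli', 'service']
```

Because the walk is pure graph traversal, it needs no model calls. Only the modules in the resulting set would be candidates for the evidence sent to the agent.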
Yes. In measured testing, Codna mapped 130 repositories in 9.2 seconds, using zero tokens for the mapping step.
The benchmark compared Codna against the default configurations of OpenAI Codex CLI and Google Gemini CLI as available at time of testing. Codna is bring-your-own-key, so you choose the underlying model.
codna fix . --issue "the failing test"