ADE-Bench Benchmark Results

altimate-code + DeepSeek V4 Pro achieves a 78.0% pass rate in its best run (32/41 tasks on DuckDB), matching altimate-code on Sonnet 4.6 at a fraction of the cost.

About ADE-Bench

ADE-Bench is a benchmark created by Benn Stancil (founder of Mode) in collaboration with dbt Labs. It evaluates AI agents on real-world analytics and data engineering tasks using actual dbt projects and databases. Each task runs in a Docker container sandbox; the agent attempts to resolve the task, and success is measured by whether all dbt tests pass afterward. Tasks include realistic data problems: vague requests like "it’s broken," debugging, schema issues, and complex analytics queries.
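The pass criterion described above can be sketched in a few lines. This is a minimal illustration, not ADE-Bench's actual harness code: `DbtTestOutcome`, `task_passed`, and the test names are hypothetical, and the sketch assumes each dbt test reduces to a simple pass/fail outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DbtTestOutcome:
    """Outcome of one dbt test (hypothetical structure, for illustration only)."""
    name: str
    passed: bool


def task_passed(outcomes: list[DbtTestOutcome]) -> bool:
    """A task counts as solved only if it has tests and every one of them passed."""
    return bool(outcomes) and all(o.passed for o in outcomes)


# One failing test is enough to fail the whole task, even if the rest pass.
partial = [DbtTestOutcome("not_null_id", True), DbtTestOutcome("unique_id", False)]
full = [DbtTestOutcome("not_null_id", True), DbtTestOutcome("unique_id", True)]

print(task_passed(partial), task_passed(full))
```

Under this all-or-nothing rule, a task with most of its tests passing still counts as a failure.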

Test Configuration

Harness and LLM	altimate-code (DeepSeek V4 Pro)
Model source	OpenRouter (deepseek/deepseek-chat-v4-pro)
Database	DuckDB (local)
Total Tasks	41
Max Retries on failures	3
Best Run (pass@3)	32/41 (78.0%)
Single-run range	26–28/41 (63–68%)
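
The percentages above follow directly from the task counts. A quick arithmetic check, using a hypothetical helper `pass_rate` introduced only for illustration:

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage of total tasks."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


TOTAL_TASKS = 41

best_run = pass_rate(32, TOTAL_TASKS)     # best run (pass@3)
single_low = pass_rate(26, TOTAL_TASKS)   # low end of the single-run range
single_high = pass_rate(28, TOTAL_TASKS)  # high end of the single-run range

print(f"best run: {best_run:.1f}%")
print(f"single-run range: {single_low:.0f}-{single_high:.0f}%")
```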

Benchmark Comparison

Agents evaluated on ADE-Bench with DuckDB.

altimate-code (DeepSeek V4 Pro · DuckDB): 32/41 (78%)

altimate-code (Sonnet 4.6 · DuckDB): 32/41 (78%)

dbt Labs (Sonnet 4.5 · DuckDB): ~25/43 (59%), as reported in the dbt Labs blog (see Sources)

Claude Code (Sonnet 4.6 · baseline · DuckDB): ~17/43 (40%)

Key Insight: The Harness Matters More Than the Model

Across both benchmarks, altimate-code on Sonnet 4.6 beats competitors running Opus 4.6, a more capable and more expensive model; for example, Snowflake reports 65% for Cortex Code on Opus 4.6 (see Sources). Purpose-built tooling and deterministic operations outperform raw model capability alone.

The harness — not the model — is the differentiator.

Per-Task Results — DuckDB

Best Run — 32 passed, 9 failed out of 41 tasks

#	Task	Result	Tests Passed	Test Pass Rate
1	airbnb001	✓	10/10	100%
2	airbnb002	✓	11/11	100%
3	airbnb003	✓	7/7	100%
4	airbnb004	✓	2/2	100%
5	airbnb005	✓	4/4	100%
6	airbnb006	✓	7/7	100%
7	airbnb007	✓	11/11	100%
8	airbnb008	✓	4/4	100%
9	airbnb009	✗	0/1	0%
10	analytics_engineering001	✓	1/1	100%
11	analytics_engineering002	✓	2/2	100%
12	analytics_engineering003	✓	2/2	100%
13	analytics_engineering004	✗	1/2	50%
14	analytics_engineering005	✓	3/3	100%
15	analytics_engineering006	✗	4/7	57%
16	analytics_engineering007	✓	10/10	100%
17	analytics_engineering008	✓	1/1	100%
18	asana001	✓	2/2	100%
19	asana002	✓	3/3	100%
20	asana003	✗	16/17	94%
21	asana004	✗	5/6	83%
22	asana005	✗	7/8	87%
23	f1001	✓	6/6	100%
24	f1002	✗	9/10	90%
25	f1003	✓	4/4	100%
26	f1004	✓	2/2	100%
27	f1005	✓	4/4	100%
28	f1006	✓	4/4	100%
29	f1007	✓	6/6	100%
30	f1009	✓	1/1	100%
31	f1010	✓	2/2	100%
32	f1011	✗	5/6	83%
33	intercom001	✓	2/2	100%
34	intercom002	✓	4/4	100%
35	intercom003	✓	2/2	100%
36	quickbooks001	✓	12/12	100%
37	quickbooks002	✓	8/8	100%
38	quickbooks003	✗	5/14	35%
39	quickbooks004	✓	48/48	100%
40	simple001	✓	1/1	100%
41	simple002	✓	1/1	100%
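
The per-task rows can also be rolled up by dbt project. The sketch below copies the scores from the table above; the grouping rule (strip the trailing three-digit task number) and the helper names `project_of` and `summarize` are assumptions made for illustration, not part of ADE-Bench.

```python
import re
from collections import defaultdict

# (task, tests_passed, tests_total), copied from the per-task table.
RESULTS = [
    ("airbnb001", 10, 10), ("airbnb002", 11, 11), ("airbnb003", 7, 7),
    ("airbnb004", 2, 2), ("airbnb005", 4, 4), ("airbnb006", 7, 7),
    ("airbnb007", 11, 11), ("airbnb008", 4, 4), ("airbnb009", 0, 1),
    ("analytics_engineering001", 1, 1), ("analytics_engineering002", 2, 2),
    ("analytics_engineering003", 2, 2), ("analytics_engineering004", 1, 2),
    ("analytics_engineering005", 3, 3), ("analytics_engineering006", 4, 7),
    ("analytics_engineering007", 10, 10), ("analytics_engineering008", 1, 1),
    ("asana001", 2, 2), ("asana002", 3, 3), ("asana003", 16, 17),
    ("asana004", 5, 6), ("asana005", 7, 8),
    ("f1001", 6, 6), ("f1002", 9, 10), ("f1003", 4, 4), ("f1004", 2, 2),
    ("f1005", 4, 4), ("f1006", 4, 4), ("f1007", 6, 6), ("f1009", 1, 1),
    ("f1010", 2, 2), ("f1011", 5, 6),
    ("intercom001", 2, 2), ("intercom002", 4, 4), ("intercom003", 2, 2),
    ("quickbooks001", 12, 12), ("quickbooks002", 8, 8),
    ("quickbooks003", 5, 14), ("quickbooks004", 48, 48),
    ("simple001", 1, 1), ("simple002", 1, 1),
]


def project_of(task: str) -> str:
    """Strip the trailing three-digit task number, e.g. 'f1002' -> 'f1'."""
    return re.sub(r"\d{3}$", "", task)


def summarize(results):
    """Return {project: (tasks_passed, tasks_total)}.

    A task counts as passed only if all of its tests passed.
    """
    counts = defaultdict(lambda: [0, 0])
    for task, passed, total in results:
        entry = counts[project_of(task)]
        entry[0] += int(passed == total)
        entry[1] += 1
    return {project: tuple(pair) for project, pair in counts.items()}


for project, (ok, n) in summarize(RESULTS).items():
    print(f"{project:<24}{ok}/{n}")
```

Grouping this way shows which projects account for the nine failed tasks in the best run.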

Sources

ADE-Bench on GitHub (dbt-labs) — Benchmark repository and methodology
Snowflake Blog: Cortex Code CLI Expands Support — Cortex Code benchmark results (65%, Opus 4.6)
dbt Labs Blog: Introducing ADE-Bench — dbt Labs benchmark results (59%, Sonnet 4.5)