Text-to-SQL LLM Benchmark: Accuracy and Latency (2026)

A text-to-SQL LLM benchmark answers one deceptively simple question: when a user asks a database something in plain English, does the model write SQL that returns the correct rows? In 2026 the headline numbers are easy to find and easy to misread. A model that scores beautifully on one dataset can collapse on a harder one, and a leaderboard percentage tells you nothing about latency, cost, or how the system behaves on your messy production schema.

This matters now because text-to-SQL has moved from demo to deployment. Analysts, support agents, and internal copilots increasingly route natural-language questions straight into a SQL generator. A wrong answer that looks confident is worse than no answer. So you need a benchmark you can actually run, not a screenshot of someone else’s leaderboard.

This post gives you a reproducible methodology for measuring text to SQL accuracy and latency, grounded in the public Spider and BIRD literature. What this covers: why text-to-SQL is hard, a full evaluation harness, indicative results by model tier, an error taxonomy, accuracy-boosting techniques, and the trade-offs nobody puts on a slide.

Context: why text-to-SQL is hard

Translating natural language to SQL looks like ordinary code generation, but it carries four hidden difficulties that make it uniquely brittle.

The first is schema linking. The model must map vague human phrases onto exact table and column names it has never seen before. “Top customers last quarter” implies a customers table, an orders join, a date filter, and a ranking. None of those words appear literally in the schema. Schema linking is the single largest source of error in the research literature, and it gets worse as schemas grow wider and more cryptically named.

The second is joins. Real databases normalize data across many tables. Answering a single question often requires three or four joins along the correct foreign-key paths. Pick the wrong join key and the query runs cleanly but returns silently wrong numbers. This is the most dangerous failure mode, because nothing crashes.

The third is dialect. SQL is not one language. Date arithmetic, string functions, window syntax, and type casting differ across PostgreSQL, MySQL, BigQuery, Snowflake, and SQLite. A query that is valid in one engine is a syntax error in another. Benchmarks that only test SQLite hide this entirely.

The fourth is ambiguity. Human questions are underspecified. “Revenue” might mean gross, net, or recognized. “Active users” needs a definition. A good system either resolves ambiguity from context or asks, but most benchmarks reward a single confident guess.

A fifth, quieter difficulty is compositional reasoning. Many analytical questions require nested subqueries, window functions, or common table expressions chained in a specific order. The model has to plan a multi-step query, not pattern-match a template. Difficulty scales sharply here. Single-table lookups are nearly solved; deeply nested analytical queries remain where most of the accuracy gap lives, and they are exactly the queries enterprises care about most.

There is also a values problem that schema alone cannot solve. A filter like “status is cancelled” needs to know the column actually stores “CANCELLED” or the code “3”. Without sample values, the model guesses the literal and the query runs but matches nothing. This is why serving sample rows in the prompt matters so much, and why a benchmark that omits them flatters every model it measures.

These difficulties are why benchmark design is itself a research problem. The academic baseline is Spider, the Yale cross-domain dataset, and the harder, more realistic BIRD benchmark built on large, dirty databases. For a broader map of the field, the survey Next-Generation Database Interfaces catalogs methods and datasets in depth. If you evaluate LLM outputs more generally, our LLM-as-judge evaluation pipeline covers the surrounding harness discipline.

The landscape has shifted in an important way. A few years ago, the leading text-to-SQL systems were elaborate pipelines with dedicated schema-linking modules, grammar-constrained decoders, and hand-built intermediate representations. By 2026 the center of gravity has moved to general-purpose LLMs wrapped in a thin agentic loop: retrieve schema, generate, execute, repair. The model does the heavy lifting; the scaffolding mostly feeds it the right context and verifies its output. That shift is why a benchmark must measure the whole system, not the bare model. The same model can look mediocre or excellent depending on the scaffolding around it.

This also reframes what “the model’s accuracy” even means. There is no single accuracy for a model in isolation. There is accuracy for a model plus a specific schema serialization, plus a specific retrieval strategy, plus a specific repair budget. Quoting a model name with a percentage and nothing else is like quoting an engine’s horsepower without the car. The number is real but it tells you very little about how the thing actually drives on your roads.

To make the difficulty concrete, consider a single question: “Which sales reps beat their quota in the last two quarters?” A human analyst pictures a reps table, a sales table, a quotas table, a date filter, an aggregation, and a comparison. The model must infer all of that from words that name none of those tables. It has to join reps to sales on a key it must guess. Then it sums sales per rep per quarter, finds the matching quota row, filters to the last two quarters in the right dialect, and compares. One wrong link anywhere yields a clean query with wrong numbers. That is text-to-SQL in one sentence.

Benchmark methodology – the core

The whole point of a text-to-SQL LLM benchmark is that the score is reproducible. If you cannot rerun it on your own schema, it is marketing. This section specifies a harness you can implement in an afternoon. The pipeline is shown in fig1.

Datasets: Spider, BIRD, and Spider 2.0

Use at least two datasets, because a single one will lie to you.

Spider is the classic cross-domain benchmark. It spans 200-plus databases across many domains, with questions labeled by difficulty. It is the right baseline for measuring whether a model generalizes across schemas it was not trained on. Its weakness is that the databases are small and relatively clean.

BIRD raises the bar deliberately. It uses large, real, often messy databases, adds external-knowledge requirements, and cares about query efficiency, not just correctness. The well-known general finding is that execution accuracy on BIRD sits far below Spider for the same model. That gap is the most honest signal of production readiness you can get.

Spider 2.0 pushes further toward enterprise reality, with complex analytical workflows, multiple SQL dialects, and far larger schemas. If your use case is enterprise analytics, weight it heavily. Run all three and report them separately. An averaged single number hides exactly the difficulty profile you need to see.

Beyond the public sets, build a private evaluation set from your own schema and real user questions. Fifty to a hundred carefully labeled pairs are enough to start. This set is the only one immune to contamination, because no model has seen it. It is also the only one whose distribution matches your actual users. Public benchmarks tell you whether a model can do text-to-SQL in general; your private set tells you whether it can do yours. Refresh it as your schema and question patterns drift, or it slowly stops representing production.

Metrics: execution accuracy, exact-set match, VES

Three metrics matter, and they measure different things.

Execution accuracy (EX) runs both the predicted and the gold SQL against the database and compares result sets. It is the metric that matters in production, because users care about answers, not syntax. EX correctly credits two different queries that return the same rows.

Exact-set match (ESM) compares the SQL structure component by component after normalization, ignoring literal values. It is stricter and order-insensitive across clauses. ESM can penalize a correct query that simply phrases the logic differently, so never report it alone.

Valid-efficiency score (VES), introduced with BIRD, rewards queries that are both correct and efficient. A query that returns the right rows after a full table scan when an indexed path existed scores lower. VES is what separates a benchmark that cares about real databases from one that does not.

A practical note on EX: it can over-credit. Two queries returning the same rows on this particular database may diverge on another. A query that hard-codes a value instead of joining can pass on test data and fail in production. So pair EX with a spot check of the SQL on your hardest questions. Execution accuracy is the best single metric, but it is not a substitute for reading the query on the cases that matter.

You may also want a soft latency metric: time to first token versus time to validated answer. An interactive copilot lives and dies on perceived latency, and self-correction rounds inflate the gap between those two. Report both. A model with a fast first token but three repair rounds can feel slower than a steadier model that gets it right the first time.

Harness, prompting, and self-correction

Hold these constant or your numbers are noise.

Schema serialization. Decide exactly how the schema enters the prompt: CREATE TABLE statements, a compact column list, plus sample rows and foreign-key hints. Sample rows help enormously with value linking. Serialization format alone can swing accuracy by several points, so pin one format per run.

Prompting strategy. Specify zero-shot versus few-shot, and how exemplars are chosen. Retrieval-selected few-shot examples beat random ones. Document the exact prompt template, temperature, and token budget so anyone can rerun it.

Few-shot selection deserves its own attention, because it is a quiet accuracy lever. Random exemplars help a little. Exemplars retrieved by similarity to the current question help a lot, because they show the model the right query shape for this kind of ask. The strongest variant retrieves examples that share schema structure or question intent, not just surface wording. Whatever you pick, freeze it. If exemplar selection drifts between two model runs, you are no longer comparing the models, only the example pools they happened to draw.

The number of examples also trades off against latency and cost. More exemplars usually lift accuracy with diminishing returns, while every added example inflates the prompt and the bill. Sweep this as a hyperparameter on a small validation slice, pick a count, and hold it constant for the headline benchmark. Report the count alongside the score. A result obtained with eight retrieved exemplars is not comparable to one obtained zero-shot, and pretending otherwise quietly corrupts the comparison.

Self-correction. Allow the model one or two rounds to repair its own SQL using the database error trace or empty result. The harness executes the candidate, feeds back the error, and regenerates. Cap the rounds and log them, because each round adds latency and cost. Measure latency end to end, including every self-correction round, not just the first token.

A minimal harness is small. The loop below is intentionally framework-free so you can read every step and port it to your own stack.

def evaluate(question, schema_text, gold_sql, db, model, max_repairs=1):
    prompt = build_prompt(question, schema_text, few_shot=select_examples(question))
    sql = model.generate(prompt, temperature=0)
    for _ in range(max_repairs):
        ok, rows, err = db.run(sql)          # sandboxed, read-only, timeout
        if ok:
            break
        sql = model.generate(repair_prompt(prompt, sql, err), temperature=0)
    ok, rows, _ = db.run(sql)
    gold_rows = db.run(gold_sql)[1]
    return {
        "ex": set_match(rows, gold_rows),    # execution accuracy
        "latency_s": db.elapsed,             # includes repair rounds
        "tokens": model.last_usage,          # for cost
    }

The set comparison should be order-insensitive unless the question asks for ordering. Run every database in a disposable sandbox with a statement timeout and a strict read-only role, so a runaway join cannot harm anything. Record latency and token usage on every call, because those two columns are what turn an accuracy benchmark into a cost-and-latency benchmark.

Finally, fix randomness. Set temperature to zero for the headline accuracy run, and report a separate sampled run if you use self-consistency. Seed any exemplar retrieval. A benchmark you cannot rerun to the same number is not a benchmark; it is an anecdote with error bars you never measured.

What the results actually show

Here is where honesty matters most. The numbers below are indicative ranges synthesized from the public Spider and BIRD literature, not fresh measurements from a controlled lab run. They show the shape of the landscape so you can plan, then verify on your own data. Treat every cell as a planning prior, not a fact. The accuracy-versus-latency trade-off is sketched conceptually in fig2.

Model tier	Spider EX (indicative)	BIRD EX (indicative)	Relative latency	Relative cost
Frontier (largest hosted)	~80-90% (indicative)	~55-70% (indicative)	High	High
Mid (efficient hosted)	~75-85% (indicative)	~45-60% (indicative)	Medium	Medium
Small (open, 7-14B)	~60-75% (indicative)	~30-45% (indicative)	Low	Low
Fine-tuned on schema	~80-90% on trained domains (indicative)	varies, often strong (indicative)	Low	Low after training

Read four lessons from this shape, none of which depend on the exact digits.

First, the BIRD-versus-Spider gap is large and universal. Every tier drops substantially moving from Spider to BIRD. That gap, not the Spider headline, predicts production behavior. A model that “passes” at 85% on Spider may answer barely half of hard real questions correctly.

Second, schema linking dominates the errors at every tier, which the next section unpacks. The frontier advantage is partly better schema linking and partly better recovery via self-correction.

Third, fine-tuned small models can rival frontier accuracy on their trained schemas while running far cheaper and faster. They generalize poorly to unseen schemas, so the win is narrow and real. This is the classic specialist-versus-generalist trade.

Fourth, latency and cost scale opposite to convenience. The frontier tier that maximizes accuracy also maximizes latency and per-query cost, which is the central text-to-SQL latency 2026 tension. For an interactive copilot, two extra seconds per query changes the product.

It is worth dwelling on why the ranges are ranges. The exact number a model earns depends heavily on the harness around it. The same model can swing by ten points or more depending on schema serialization, sample-row inclusion, few-shot selection, and whether self-correction is enabled. That sensitivity is precisely why you cannot copy a leaderboard cell into your planning doc. The tier shape is stable; the digit is not. Anyone quoting a single decimal place without naming the harness is selling you a number they cannot defend.

Note also that difficulty bucketing changes the story. Both Spider and BIRD label questions by difficulty. Aggregate accuracy hides the fact that easy single-table questions are largely solved across all tiers, while hard nested-join questions separate the field dramatically. Always report accuracy by difficulty band, not just the headline. A model that is 90% on easy and 40% on hard is a very different product from one that is 65% flat, even though both average the same.

How should you actually pick a tier from this table? Set an accuracy floor on your held-out hard questions, a latency ceiling from your product, and a monthly cost cap. Then choose the cheapest tier that clears all three at once. Most teams discover the mid tier plus good schema retrieval beats the frontier tier alone, at a fraction of the cost. The frontier tier earns its premium only on genuinely hard analytical workloads.

Reproducing this benchmark step by step

The methodology above is only credible if you can run it. Here is the concrete sequence, start to finish, on a single machine.

Step 1: get the datasets. Download Spider and BIRD from their official sites. Each ships SQLite databases, gold SQL, and question files. Load them into a sandbox you can wipe and rebuild. Keep the gold queries separate from anything the model sees.

Step 2: stand up a sandbox executor. Wrap each database in a read-only connection with a statement timeout, say five seconds. Expose one function that runs SQL and returns rows, an error, or a timeout. This single chokepoint is where you enforce safety and where you measure latency. Never run generated SQL against anything you cannot reset.

Step 3: serialize each schema once. For every database, render CREATE TABLE statements, a column list, foreign keys, and three sample rows per table. Cache the serialization. This keeps the prompt input identical across every model you test, which is the only way your comparison is fair.

Step 4: assemble prompts deterministically. Combine question, cached schema, and retrieved exemplars into your fixed template.

Text-to-SQL LLM Benchmark: Accuracy and Latency (2026)

Text-to-SQL LLM Benchmark: Accuracy and Latency (2026)

Context: why text-to-SQL is hard

Benchmark methodology – the core

Datasets: Spider, BIRD, and Spider 2.0

Metrics: execution accuracy, exact-set match, VES

Harness, prompting, and self-correction

What the results actually show

Reproducing this benchmark step by step

Related

Comments

Leave a Reply Cancel reply

Tag Cloud

Categories