14 June 2026·5 min read·By Elena Vance

Google Gemini-SQL2 Achieves 80.04% on BIRD Text-to-SQL Benchmark

Google Research's Gemini-SQL2, powered by Gemini 3.1 Pro, has achieved 80.04% execution accuracy on the BIRD Text-to-SQL Leaderboard.

```html


Google Research's latest text-to-SQL system, Gemini-SQL2, scored 80.04% execution accuracy in the single-model category of the BIRD Text-to-SQL Leaderboard. Powered by Gemini 3.1 Pro, it now leads that category at turning natural-language questions into executable SQL statements.

The BIRD benchmark is a rigorous industry standard for evaluating text-to-SQL performance. It is built from over 12,000 question-SQL pairs spread across 95 databases and 37 professional domains, and it presents challenges such as dirty data values and the need for external knowledge grounding. The metric is execution accuracy: the generated SQL must run successfully and return results that match those of the reference query. Google emphasized this point, stating that Gemini-SQL2's SQL "doesn't just look right, it also runs successfully."

A New Leader in Text-to-SQL

Google's own chart, shared on X, now shows Gemini-SQL2 at the top, ahead of its predecessor, Gemini-SQL. The single-trained-model track of the leaderboard is particularly telling because it restricts the use of ensemble frameworks that can inflate scores, so it measures the core capability of the model itself. The result reflects the growing sophistication of AI in understanding and manipulating structured data.

Google Cloud's previous record on this specific track, noted on November 15, 2025, stood at 76.13%. Gemini-SQL2's 80.04% is a gain of 3.91 percentage points over that record. Human performance on the BIRD benchmark is estimated at 92.96%, so a gap of roughly 13 points remains, down from nearly 17.

The Significance of Execution Accuracy

The distinction between syntactically valid SQL and SQL verified by execution is critical. Many older benchmarks might accept queries that look plausible but would fail when executed or return incorrect data, a serious problem for real-world applications. Because BIRD scores execution accuracy, Gemini-SQL2 is credited only when its SQL performs the intended task correctly, not merely when it follows SQL syntax.

The announcement on X acknowledged that generating accurate SQL from natural language is very hard, as "data subtlety & complex business contexts" remain major obstacles. Gemini-SQL2's improved SQL understanding should boost natural-language capabilities across Google's data services, potentially through integrations into BigQuery Studio, AlloyDB AI, and Cloud SQL Studio, all of which already feature Gemini-based SQL generation.

Real-World Applications and Implications

The implications of Gemini-SQL2's performance extend to several key areas:

  • Self-service Analytics: Business users could potentially ask complex questions about revenue, churn, or other metrics, and receive accurate, executable SQL queries without needing deep technical expertise. For example, a revenue manager could ask for monthly recurring revenue by region for accounts that churned within 90 days of an upgrade, a query that typically involves joins, window logic, and date arithmetic.
  • Data Engineering Drafts: Developers might use Gemini-SQL2 to draft complex data transformations in BigQuery more rapidly, reviewing and refining the generated SQL instead of writing it from scratch. This could significantly speed up development cycles.
  • Embedded "Ask Your Data" Features: Software-as-a-service (SaaS) platforms could integrate more robust natural language query interfaces, allowing their users to explore data directly. While 80% accuracy sets a strong foundation, the current score still suggests that human review remains necessary for critical applications, as approximately one in five queries might still require correction.

Google hasn't detailed which specific products will integrate Gemini-SQL2, and it hasn't released an API or a model string for it yet. It has, however, provided a schema-grounded pattern as a template that shows how to use current Gemini models via the google-genai SDK, with instructions to swap the model string once Gemini-SQL2 becomes available.

"data subtlety & complex business contexts make generating accurate SQL from natural language notoriously hard."

That X post from Google Research sums up the problem, and Gemini-SQL2 is the team's answer. Its result on BIRD, a tough benchmark built around dirty values and external-knowledge grounding, marks a real step toward making complex data analysis both more accurate and more accessible through natural language interfaces.

Early community reaction was positive. Engagement with the Gemini-SQL2 post on X and LinkedIn during its first three hours, tracked on June 12, 2026, pointed that way: the high bookmark-to-like ratio on X suggests approval rather than controversy. The picture is incomplete, however, because a detailed sentiment analysis of the comments was not yet available.

While Gemini-SQL2 is not yet publicly available as a standalone product or API, its performance on the BIRD benchmark marks a notable advancement in the field of text-to-SQL translation, signaling a future where interacting with databases through natural language becomes more reliable and efficient.

```
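The schema-grounded pattern mentioned in the article is not reproduced here, so the following is only a minimal sketch of the general idea: read the table definitions from the database and place them in the prompt ahead of the user's question. The function names, the prompt wording, and the toy tables are all hypothetical. The model call itself is left out; the resulting string is what one would pass to a current Gemini model through the google-genai SDK.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection) -> str:
    """Return the CREATE statements of all user tables, one per line."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return "\n".join(row[0] for row in rows if row[0])


def build_prompt(conn: sqlite3.Connection, question: str) -> str:
    """Combine schema and question into one prompt (hypothetical wording)."""
    return (
        "You write SQLite queries. Use only the tables and columns below.\n\n"
        f"Schema:\n{describe_schema(conn)}\n\n"
        f"Question: {question}\n"
        "Return a single SQL statement and nothing else."
    )


# Toy in-memory database standing in for a real warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, region TEXT, mrr REAL)")
conn.execute("CREATE TABLE upgrades (account_id INTEGER, upgraded_on TEXT)")

prompt = build_prompt(conn, "What is the total MRR per region?")
print(prompt)
```

Grounding the prompt in the live schema keeps the model from guessing table or column names, which is one plausible reason such a pattern is recommended; it does not by itself address dirty values or business context.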

Frequently Asked Questions

What is Gemini-SQL2's execution accuracy on the BIRD benchmark?

Gemini-SQL2 achieved 80.04% execution accuracy on the BIRD Text-to-SQL Leaderboard in the single-model category. This milestone was powered by Gemini 3.1 Pro.
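To make the metric concrete, here is a simplified sketch of how execution accuracy can be computed: each predicted query is run against the database and its rows are compared with the rows returned by a reference query. This is an illustration under stated assumptions, not BIRD's official evaluator. The helper names, the toy `sales` table, and the choice to compare rows as unordered multisets are all assumptions.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute sql and return its rows as an unordered multiset, or None on error."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Share of (predicted, reference) pairs whose result rows match."""
    correct = 0
    for predicted, reference in pairs:
        expected = run_query(conn, reference)
        actual = run_query(conn, predicted)
        if actual is not None and actual == expected:
            correct += 1
    return correct / len(pairs)


# Toy data (assumption): three sales rows in two regions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 10.0), ("EU", 5.0), ("US", 7.0)],
)

pairs = [
    # Different text, same rows: counted as correct.
    ("SELECT region, SUM(amount) FROM sales GROUP BY region",
     "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region"),
    # Valid SQL, wrong value casing: runs but returns the wrong count.
    ("SELECT COUNT(*) FROM sales WHERE region = 'eu'",
     "SELECT COUNT(*) FROM sales WHERE region = 'EU'"),
    # Misspelled table name: fails at execution time.
    ("SELECT SUM(amount) FROM sale",
     "SELECT SUM(amount) FROM sales"),
    # Two different ways to get the largest amount.
    ("SELECT MAX(amount) FROM sales",
     "SELECT amount FROM sales ORDER BY amount DESC LIMIT 1"),
]

score = execution_accuracy(conn, pairs)
print(f"Execution accuracy: {score:.2%}")
```

The second and third pairs show why the metric is stricter than a syntax check: one query is well-formed but returns the wrong answer, and the other does not run at all.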

Why is the BIRD benchmark considered rigorous for evaluating text-to-SQL performance?

The BIRD benchmark is built from over 12,000 question-SQL pairs across 95 diverse databases, covering 37 professional domains. It presents complex challenges such as dirty data values and the need for external knowledge grounding.

How does Gemini-SQL2's performance compare to human accuracy on the BIRD benchmark?

Human performance on the BIRD benchmark is estimated at 92.96%, while Gemini-SQL2 achieved 80.04%. This demonstrates a substantial leap forward for AI-driven SQL generation, narrowing the gap considerably.

When did Google Cloud achieve its previous record on the single-model track of the BIRD leaderboard?

Google Cloud's previous record on the single-model track was noted on November 15, 2025, and it stood at 76.13%. Gemini-SQL2 surpassed this record with its 80.04% accuracy.
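The three figures quoted in these answers can be put side by side with a few lines of arithmetic. The sketch below uses only the numbers from the article; the variable names are illustrative, and the "share of the gap closed" is a derived quantity rather than one Google reported.

```python
PREVIOUS_RECORD = 76.13  # Google Cloud's earlier single-model result
GEMINI_SQL2 = 80.04      # Gemini-SQL2 on the same track
HUMAN_ESTIMATE = 92.96   # estimated human performance on BIRD

gain = GEMINI_SQL2 - PREVIOUS_RECORD        # improvement in percentage points
gap_before = HUMAN_ESTIMATE - PREVIOUS_RECORD
gap_after = HUMAN_ESTIMATE - GEMINI_SQL2
share_closed = gain / gap_before            # fraction of the old gap removed

print(f"Gain over previous record: {gain:.2f} points")
print(f"Gap to human estimate: {gap_before:.2f} -> {gap_after:.2f} points")
print(f"Share of the earlier gap closed: {share_closed:.1%}")
```

The gain works out to 3.91 points, which closes a little under a quarter of the earlier distance to the human estimate and leaves 12.92 points.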

What real-world applications are implied by Gemini-SQL2's performance?

Potential applications include self-service analytics for business users, data engineering drafts for developers, and embedded 'Ask Your Data' features in SaaS platforms. However, human review remains necessary for critical applications, as approximately one in five queries might still require correction.
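The "one in five" figure follows directly from the score, and a short calculation shows what it could mean for review workload. This is a rough sketch under two loud assumptions: that the benchmark accuracy carries over to a team's own data, which it may not, and that errors on different queries are independent. The function names are hypothetical.

```python
EXECUTION_ACCURACY = 0.8004  # 80.04% from the BIRD single-model track


def expected_reviews(n_queries, accuracy=EXECUTION_ACCURACY):
    """Expected number of queries in a batch that need correction."""
    return n_queries * (1 - accuracy)


def all_correct_probability(n_queries, accuracy=EXECUTION_ACCURACY):
    """Chance that every query in a session is correct, assuming independence."""
    return accuracy ** n_queries


print(f"Expected corrections per 100 queries: {expected_reviews(100):.2f}")
print(f"Chance that 5 queries in a row are all correct: {all_correct_probability(5):.1%}")
```

Under these assumptions about 20 of every 100 generated queries would need a fix, and a five-query session would come out fully correct only about a third of the time, which is why review remains advisable for critical work.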

Written by Elena Vance
Artificial Intelligence Correspondent

Elena Vance reports on artificial intelligence, from frontier research labs to the products reshaping everyday work. She focuses on how machine learning is moving out of the lab and into the real world, and what that shift means for readers.
