
How Query Engines Work, Part 1: The small compiler hiding behind every SQL query

·10 mins
Rafael Fernandez
How Query Engines Work - This article is part of a series.
Part 1: This Article

Most of us use SQL every week and still treat the engine on the other side as a black box. We write a query, hit enter, and wait. If it is fast, great. If it is slow, we rewrite it, add an index, or stare at EXPLAIN ANALYZE hoping the plan will explain itself.

The good news is that the internals are not alien at all. A query engine looks a lot like a compiler: it parses text, builds an intermediate representation, rewrites that representation to make it better, and then executes the result. The difference is that instead of producing a binary that runs tomorrow, it produces an execution plan that runs right now, against your data. Once you see it that way, a lot of database behavior stops feeling magical.

This post walks through that pipeline end to end. Not as a vague overview, but in Rust, with real code, against a small CSV file, looking at what each stage actually produces.

The five-second version
#

You write this:

SELECT department, AVG(salary) as avg_salary
FROM employees
WHERE hire_date > '2020-01-01'
GROUP BY department
ORDER BY avg_salary DESC

Between you and the results, four things happen:

  1. Parsing. The SQL text becomes a syntax tree. Pure grammar, no meaning.
  2. Logical planning. The syntax tree becomes a tree of relational operations. The engine checks that employees exists, that salary is a number, that AVG is a valid aggregate. This is type checking for data.
  3. Optimization. The plan is rewritten. Filters move closer to the data source. Columns that nobody asked for are dropped before they leave disk. The query computes the same thing, but faster.
  4. Physical execution. The abstract plan becomes concrete operators: a CSV scanner, a hash aggregation, a sort. Data flows through these operators in batches of columnar records, and results come out the other end.

That is the whole trick. Every query engine you have ever used, from SQLite to Spark to BigQuery, implements some variation of this pipeline. The details change. The shape does not.

Why this matters if you write SQL for a living
#

Most developers interact with query engines the way most drivers interact with engines: they press the pedal and expect the car to move. That works until something goes wrong. A query that took 200 milliseconds yesterday takes 40 seconds today. The dashboard is timing out. Someone is paging you. At that point, you need a mental model, not luck.

If you understand the pipeline, you can read the EXPLAIN ANALYZE output and see exactly where the problem is. A sequential scan where there should be an index lookup. A nested loop join where a hash join would be three orders of magnitude faster. Fifty columns being read from a Parquet file when only three are needed.

If you do not understand the pipeline, you rewrite the SQL, cross your fingers, and run it again. Sometimes it works. Sometimes you lose the afternoon.

Understanding the internals does not make you a database engineer. It just gives you a better way to think when a query goes sideways.

The tiny dataset we will use
#

Every example in this post runs against the same CSV file. Eight employees, five columns, nothing fancy:

id,name,department,salary,hire_date
1,Alice,Engineering,95000,2021-03-15
2,Bob,Sales,87000,2019-06-01
3,Carol,Engineering,102000,2022-01-10
4,Dave,Marketing,78000,2018-11-20
5,Eve,Engineering,110000,2023-05-01
6,Frank,Sales,92000,2020-09-15
7,Grace,Marketing,85000,2021-07-22
8,Hank,Engineering,98000,2017-03-01

Save it as employees.csv. We are going to push this tiny file through the same machinery that larger engines use on tables with millions or billions of rows.
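If you would rather not paste the file by hand, a few lines of Rust will write it for you. The contents match the table above exactly:

```rust
use std::fs;

// Write the sample dataset to disk so the later examples can find it.
fn main() -> std::io::Result<()> {
    let csv = "\
id,name,department,salary,hire_date
1,Alice,Engineering,95000,2021-03-15
2,Bob,Sales,87000,2019-06-01
3,Carol,Engineering,102000,2022-01-10
4,Dave,Marketing,78000,2018-11-20
5,Eve,Engineering,110000,2023-05-01
6,Frank,Sales,92000,2020-09-15
7,Grace,Marketing,85000,2021-07-22
8,Hank,Engineering,98000,2017-03-01
";
    fs::write("employees.csv", csv)
}
```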

Stage 1: Parsing
#

A SQL string is just text. Before the engine can do anything meaningful, it needs structure. The parser turns raw characters into a tree of nodes that represent the grammar of the query, not its meaning. It does not know whether employees is a real table. It does not know whether salary is a number. It checks syntax, nothing more.

DataFusion, the Rust query engine we will use throughout this post, delegates this work to sqlparser-rs, one of the most widely used SQL parsers in the Rust ecosystem. We can call it directly:

use sqlparser::dialect::GenericDialect;
use sqlparser::parser::Parser;

fn main() {
    let sql = "SELECT department, AVG(salary) \
               FROM employees \
               WHERE hire_date > '2020-01-01' \
               GROUP BY department \
               ORDER BY AVG(salary) DESC";

    let dialect = GenericDialect {};
    let ast = Parser::parse_sql(&dialect, sql).unwrap();

    println!("{:#?}", ast);
}

What comes out is a deeply nested Rust enum. A Statement::Query containing a Select with projection items, a TableWithJoins, a WHERE expression, a GROUP BY clause, and an ORDER BY list. The structure mirrors the grammar of SQL, not the semantics. It is the same relationship that Rust’s syn crate has to Rust source code: faithful to the text, blind to the meaning.

If you write SELECT foo FROM bar WHERE baz > 42, the parser will happily accept it even if none of those names exist anywhere. That awkward conversation happens in the next stage.
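To make "structure without meaning" concrete, here is a hand-rolled sketch of the kind of tree a parser produces for our query. These are not sqlparser-rs's real types, just a simplified illustration:

```rust
// A toy AST: grammar only, no knowledge of any catalog.
#[derive(Debug, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    FnCall { name: String, args: Vec<Expr> },
    Gt(Box<Expr>, Box<Expr>),
}

#[derive(Debug)]
struct SelectAst {
    projection: Vec<Expr>,
    from: String,
    selection: Option<Expr>,
    group_by: Vec<Expr>,
}

fn example_ast() -> SelectAst {
    // SELECT department, AVG(salary) FROM employees
    // WHERE hire_date > '2020-01-01' GROUP BY department
    SelectAst {
        projection: vec![
            Expr::Column("department".into()),
            Expr::FnCall {
                name: "AVG".into(),
                args: vec![Expr::Column("salary".into())],
            },
        ],
        from: "employees".into(),
        selection: Some(Expr::Gt(
            Box::new(Expr::Column("hire_date".into())),
            Box::new(Expr::Literal("2020-01-01".into())),
        )),
        group_by: vec![Expr::Column("department".into())],
    }
}

fn main() {
    // Nothing here knows whether "employees" or "salary" actually exist.
    println!("{:#?}", example_ast());
}
```

Notice that nothing in the tree records types or table existence; it is pure syntax, exactly like the sqlparser-rs output above.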

Stage 2: Logical planning
#

The AST enters the planner. The planner has something the parser did not: a catalog, a registry of tables, their schemas, and their column types. Now the engine can answer questions. Does employees exist? Yes. Does it have a column called salary? Yes. Is salary a type that supports AVG? Yes. Is hire_date comparable to a string literal? We can coerce that.

The output is a logical plan: a tree of relational operators.

Sort: AVG(salary) DESC
  Projection: department, AVG(salary)
    Aggregate: groupBy=[department], aggr=[AVG(salary)]
      Filter: hire_date > '2020-01-01'
        Scan: employees

Read it bottom-up, like a call stack. The data starts at Scan, flows up through Filter (which discards rows hired before 2020), into Aggregate (which groups by department and computes averages), through Projection (which selects the output columns), and finally into Sort.

Each node in this tree knows its output schema: what columns it produces and what types they have. The Filter node passes through the same schema as its input. The Aggregate node produces two columns: department (Utf8) and AVG(salary) (Float64). This schema propagation is how the engine catches errors like referencing a column that was dropped by a projection above.

The logical plan describes what to compute. It says nothing about how. There is no mention of hash tables, no mention of sort algorithms, no mention of batch sizes. That separation is the whole point: the optimizer can rearrange this tree freely, and the physical planner can choose the best algorithm for each node independently.
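Schema propagation is easy to sketch. The types below are illustrative, not DataFusion's, but they show the mechanism: each node answers "what columns do I produce?" in terms of its children:

```rust
#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

// A minimal logical-plan tree (sketch only).
enum LogicalPlan {
    Scan { table: String, schema: Vec<Field> },
    Filter { input: Box<LogicalPlan> },
    Aggregate { input: Box<LogicalPlan>, group: String, agg: String },
}

impl LogicalPlan {
    fn schema(&self) -> Vec<Field> {
        match self {
            LogicalPlan::Scan { schema, .. } => schema.clone(),
            // A filter drops rows, never columns: schema passes through.
            LogicalPlan::Filter { input } => input.schema(),
            // An aggregate produces exactly the group key plus the aggregate.
            LogicalPlan::Aggregate { group, agg, .. } => vec![
                Field { name: group.clone(), data_type: "Utf8".into() },
                Field { name: agg.clone(), data_type: "Float64".into() },
            ],
        }
    }
}

fn main() {
    let scan = LogicalPlan::Scan {
        table: "employees".into(),
        schema: vec![
            Field { name: "department".into(), data_type: "Utf8".into() },
            Field { name: "salary".into(), data_type: "Float64".into() },
            Field { name: "hire_date".into(), data_type: "Utf8".into() },
        ],
    };
    let agg = LogicalPlan::Aggregate {
        input: Box::new(LogicalPlan::Filter { input: Box::new(scan) }),
        group: "department".into(),
        agg: "AVG(salary)".into(),
    };
    // Two columns come out of the aggregate, whatever went in below it.
    println!("{:?}", agg.schema());
}
```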

Stage 3: Optimization
#

The logical plan works, but it is naive. It reads all five columns from the CSV even though the query only needs three (department, salary, hire_date). A smarter plan would tell the scanner to skip id and name entirely.

This is what the optimizer does. It applies rewrite rules that transform the plan into an equivalent but more efficient one. The key word is equivalent: the optimizer never changes what the query returns, only how fast it gets there.

The rules that matter most in practice:

Projection push-down. Walk the plan tree, collect every column name that any node actually references, and push that list down to the scan. If the scan is reading a Parquet file with 50 columns and the query only touches 3, this single rule reduces I/O by 94%. For CSV files, the savings come from not building Arrow arrays for unused columns.
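The column-collection step behind projection push-down is a straightforward tree walk. A sketch, using illustrative expression types rather than DataFusion's:

```rust
use std::collections::BTreeSet;

#[derive(Debug)]
enum Expr {
    Column(String),
    Literal(String),
    Gt(Box<Expr>, Box<Expr>),
    FnCall { name: String, args: Vec<Expr> },
}

// Walk an expression and record every column it references.
fn collect_columns(e: &Expr, out: &mut BTreeSet<String>) {
    match e {
        Expr::Column(c) => {
            out.insert(c.clone());
        }
        Expr::Literal(_) => {}
        Expr::Gt(l, r) => {
            collect_columns(l, out);
            collect_columns(r, out);
        }
        Expr::FnCall { args, .. } => {
            for a in args {
                collect_columns(a, out);
            }
        }
    }
}

fn main() {
    let mut needed = BTreeSet::new();
    // Columns referenced by the example query's SELECT, WHERE and GROUP BY.
    collect_columns(&Expr::Column("department".into()), &mut needed);
    collect_columns(
        &Expr::FnCall {
            name: "AVG".into(),
            args: vec![Expr::Column("salary".into())],
        },
        &mut needed,
    );
    collect_columns(
        &Expr::Gt(
            Box::new(Expr::Column("hire_date".into())),
            Box::new(Expr::Literal("2020-01-01".into())),
        ),
        &mut needed,
    );
    // The scan can now skip `id` and `name` entirely.
    println!("scan columns: {:?}", needed);
}
```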

Predicate push-down. Move filters as close to the data source as possible. If a filter only references columns from one side of a join, push it below the join so it runs before the expensive operation. For data sources that support it (like Parquet with row group statistics), the predicate can skip entire chunks of data without reading them.

Constant folding. Replace 2 + 3 with 5 at plan time. Replace WHERE 1 = 1 with nothing. Small savings, but they compound.
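Constant folding is the simplest rewrite rule to show in full. A toy pass over an expression tree (real optimizers work on richer IRs and also eliminate tautological filters like `1 = 1`):

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Int(i64),
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

// Bottom-up: fold children first, then combine literal operands.
fn fold(e: Expr) -> Expr {
    match e {
        Expr::Add(l, r) => {
            let (l, r) = (fold(*l), fold(*r));
            match (&l, &r) {
                // Both sides are literals: compute now, at plan time.
                (Expr::Int(a), Expr::Int(b)) => Expr::Int(a + b),
                _ => Expr::Add(Box::new(l), Box::new(r)),
            }
        }
        other => other,
    }
}

fn main() {
    // (1 + 2) + salary folds to 3 + salary before execution starts.
    let e = Expr::Add(
        Box::new(Expr::Add(Box::new(Expr::Int(1)), Box::new(Expr::Int(2)))),
        Box::new(Expr::Column("salary".into())),
    );
    println!("{:?}", fold(e));
}
```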

Join reordering. When joining three or more tables, the order matters enormously. Joining a 10-row lookup table with a billion-row fact table first produces a much smaller intermediate result than the reverse.

In DataFusion, we can see the before and after:

// Before optimization
let naive_plan = ctx.sql(sql).await?.into_unoptimized_plan();
println!("{}", naive_plan.display_indent());

// After optimization
let optimized = ctx.state().optimize(&naive_plan)?;
println!("{}", optimized.display_indent());

The optimized plan looks similar in structure, but with a critical difference at the bottom: the scan now lists only the columns it needs. For our tiny CSV file, the difference is negligible. For a production Parquet table with hundreds of columns, it can mean reading 50 MB instead of 2 GB.

Andy Grove’s book reports a 5x speedup from projection push-down alone on a 17-column CSV. On Parquet, the gains are typically larger because the format supports true column-level reads: column chunks that nobody asked for are never even decompressed.

Stage 4: Physical execution
#

The optimized logical plan is abstract. It says “aggregate by department” but does not say how. A hash table? A sort-based approach? The physical planner makes these choices and produces a tree of concrete operators.

let physical_plan = ctx.state()
    .create_physical_plan(&optimized)
    .await?;
println!("{}", datafusion::physical_plan::displayable(physical_plan.as_ref())
    .indent(true));

The output names real operators: CsvExec reads the file, CoalesceBatchesExec merges small batches for efficiency, AggregateExec builds a hash map from department to running average, and SortExec orders the results.

These operators communicate by passing Arrow RecordBatches: columnar chunks of data, typically between 1,024 and 65,536 rows. The data never becomes a Vec<Row>. It stays in columnar format from disk to output, which is why modern query engines can process millions of rows per second on a single core. We will explore Arrow in detail in the next post.
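The hash aggregation at the heart of that pipeline is conceptually simple: one hash-map entry per group key, holding running state. The real operator works on Arrow RecordBatches with vectorized accumulators; this row-at-a-time sketch shows only the idea:

```rust
use std::collections::HashMap;

// Per-group running state: (sum, count), finalized into an average.
fn avg_by_group(rows: &[(&str, f64)]) -> HashMap<String, f64> {
    let mut state: HashMap<String, (f64, u64)> = HashMap::new();
    for (dept, salary) in rows {
        let entry = state.entry((*dept).to_string()).or_insert((0.0, 0));
        entry.0 += salary;
        entry.1 += 1;
    }
    state
        .into_iter()
        .map(|(key, (sum, n))| (key, sum / n as f64))
        .collect()
}

fn main() {
    // The employees that survive the hire_date filter.
    let rows = [
        ("Engineering", 95000.0),
        ("Engineering", 102000.0),
        ("Engineering", 110000.0),
        ("Sales", 92000.0),
        ("Marketing", 85000.0),
    ];
    println!("{:?}", avg_by_group(&rows));
}
```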

The full pipeline in one program
#

Here is everything wired together. Add these dependencies:

[dependencies]
datafusion = "46"
tokio = { version = "1", features = ["full"] }

And the code:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    ctx.register_csv("employees", "employees.csv", CsvReadOptions::new())
        .await?;

    let sql = "SELECT department, AVG(salary) as avg_salary \
               FROM employees \
               WHERE hire_date > '2020-01-01' \
               GROUP BY department \
               ORDER BY avg_salary DESC";

    // Stage 2: Logical plan
    let logical_plan = ctx.sql(sql).await?.into_unoptimized_plan();
    println!("=== Logical Plan ===");
    println!("{}\n", logical_plan.display_indent());

    // Stage 3: Optimized plan
    let optimized_plan = ctx.state().optimize(&logical_plan)?;
    println!("=== Optimized Plan ===");
    println!("{}\n", optimized_plan.display_indent());

    // Stage 4: Physical plan
    let physical_plan = ctx.state()
        .create_physical_plan(&optimized_plan)
        .await?;
    println!("=== Physical Plan ===");
    println!("{}\n", datafusion::physical_plan::displayable(physical_plan.as_ref())
        .indent(true));

    // Execute
    let df = ctx.sql(sql).await?;
    println!("=== Results ===");
    df.show().await?;

    Ok(())
}

Run it. You will see four outputs, one for each stage. The logical plan reads all columns. The optimized plan reads three. The physical plan names concrete operators. Then the results show up:

+-------------+------------+
| department  | avg_salary |
+-------------+------------+
| Engineering | 102333.33  |
| Sales       | 92000.00   |
| Marketing   | 85000.00   |
+-------------+------------+

Hank (hired 2017), Bob (2019), and Dave (2018) were filtered out before the aggregation touched them. The optimizer pushed the predicate below the aggregate. Projection push-down dropped id and name before they left the scanner. Three numbers appeared on screen. Behind them, a small compiler did its job.

The compiler hiding in your database
#

The analogy is not a metaphor. It is a structural correspondence:

Compiler                  Query engine
Source code               SQL text
Lexer + Parser            SQL parser (tokenize, build AST)
Type checker              Logical planner (validate columns, resolve types, build plan)
IR optimization passes    Optimizer rules (push-down, folding, reordering)
Code generation           Physical planner (choose algorithms, produce operators)
Runtime execution         Query execution (process batches, stream results)

A compiler like rustc goes from source to machine code through multiple intermediate representations, each more concrete than the last. A query engine goes from SQL text to executable operators through the same progression. The difference is lifecycle: the compiler produces an artifact that runs many times; the query engine produces a plan that runs once and is discarded.

This is why the same techniques that made compilers fast in the 1970s (dead code elimination, constant propagation, loop reordering) make query engines fast today. And it is why people who build query engines tend to have the same background as people who build compilers: they think in trees, transformations, and type systems.

What comes next
#

We traced the pipeline from SQL text to results. We saw parsing produce a syntax tree, planning produce a logical tree, optimization rewrite that tree, and execution turn it into batches of columnar data flowing through concrete operators.

But we skipped over a question that sits underneath everything: why columnar? Why do these operators pass data around as columns instead of rows? The answer involves cache lines, SIMD instructions, and a memory format called Apache Arrow that has become the lingua franca of modern data processing.

That is the next post.
