In the previous post, we traced a SQL query through four stages and watched DataFusion produce a result. Right at the end, something important happened so quickly it was easy to miss: the physical operators passed data between each other not as rows, but as columns. The CSV scanner did not produce a list of employee records. It produced a bundle of arrays, one per column, packed contiguously in memory.
That was not a small implementation detail. It was the design decision that made the rest of the engine fast. If post 1 was about the pipeline, this one is about the physical shape of the data moving through it.
Every modern analytical query engine (DataFusion, DuckDB, Polars, Velox, Spark’s Tungsten) ends up in roughly the same place: store data column by column, process it in batches, and avoid touching bytes you do not need. The reason is not taste. It is hardware. Modern CPUs reward columnar layout so heavily that changing only the layout can turn a slow query into a surprisingly fast one on the same machine, over the same data.
This post explains why that happens, and then shows the format that made the idea portable: Apache Arrow. Arrow won because it solved a real pain shared by everyone building data tools, not because someone declared it the standard first.
Rows were built for a different job #
Traditional databases store data the way you would store a spreadsheet on paper: one row after another.
Memory layout (row-based):
[1, "Alice", "Engineering", 95000, "2021-03-15"]
[2, "Bob", "Sales", 87000, "2019-06-01"]
[3, "Carol", "Engineering", 102000, "2022-01-10"]
...

This makes perfect sense for transactions. When you INSERT a row, you write one contiguous block. When you UPDATE a field, you find the row and patch it in place. When you fetch a customer by primary key, you read one chunk and get everything about that customer at once.
But analytical queries are not transactions. They scan millions of rows and touch a handful of columns:
SELECT AVG(salary) FROM employees WHERE department = 'Engineering'

This needs two columns: department and salary. In a row-based layout, the CPU loads all five fields of every row into cache just to read two of them. For a production table with 50 columns where the query uses 3, that is 94% wasted memory bandwidth. Every cache line the CPU fetches is mostly garbage.
The machine is working hard. Much of that work is unnecessary.
Three reasons columnar wins #
Columnar storage inverts the layout. Instead of grouping all fields of one record together, it groups all values of one column together:
Memory layout (columnar):
ids: [1, 2, 3, ...]
names: ["Alice", "Bob", "Carol", ...]
departments: ["Engineering", "Sales", "Engineering", ...]
salaries: [95000, 87000, 102000, ...]
hire_dates: ["2021-03-15", "2019-06-01", "2022-01-10", ...]

Now reading salary means reading one contiguous array of numbers. No detours through names and dates. No wasted bytes. The CPU loads exactly what it needs and nothing else.
This layout exploits three properties of modern hardware.
Cache lines #
CPUs do not read individual bytes from RAM. They read cache lines: aligned blocks of 64 bytes. When you access a single integer, the CPU fetches the entire 64-byte line it lives in. Whatever else is in that line comes along for free.
In a row-based layout, a cache line might contain the id, name, department, salary, and hire_date of one employee. If you only need the salary, four out of five fields are waste.
In a columnar layout, a cache line from the salary column contains 16 consecutive salary values (64 bytes / 4 bytes per i32 = 16 values). Every single byte is useful. The CPU fetches less total memory and does more useful work per fetch.
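The arithmetic is easy to check. Here is a standalone sketch (the EmployeeRow struct is hypothetical, just five 4-byte fields) comparing how many whole rows versus how many salaries fit in one 64-byte cache line:

```rust
use std::mem::size_of;

// Hypothetical row struct: five 4-byte fields, 20 bytes per row.
#[allow(dead_code)]
#[repr(C)]
struct EmployeeRow {
    id: i32,
    salary: i32,
    dept_code: i32,
    hire_epoch_day: i32,
    name_id: i32,
}

fn main() {
    const CACHE_LINE: usize = 64;
    // Row layout: one 64-byte line holds only 3 whole rows, and if the
    // query reads just `salary`, most of each row is wasted bandwidth.
    println!("rows per line: {}", CACHE_LINE / size_of::<EmployeeRow>()); // 3
    // Columnar layout: the same line holds 16 consecutive salaries.
    println!("salaries per line: {}", CACHE_LINE / size_of::<i32>()); // 16
}
```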
SIMD #
Modern processors have SIMD registers, 256 or 512 bits wide, that apply one instruction to multiple values simultaneously. An AVX2 instruction can compare eight 32-bit integers against a threshold in a single cycle. An AVX-512 instruction doubles that.
But SIMD requires contiguous, aligned data. You cannot scatter eight salaries across eight different row-structs and expect the hardware to vectorize the comparison. Columnar layout hands the data to the CPU in exactly the format SIMD needs: one contiguous array of homogeneous values.
When DataFusion evaluates WHERE salary > 90000, the underlying Arrow compute kernel can process the salary column in SIMD-width chunks, producing a bitmap of matching indices with minimal branching. Row-based evaluation would check one row at a time, with a branch per row.
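As a rough illustration of the shape the compiler wants, here is a scalar comparison loop over a contiguous slice (a sketch only, not the arrow crate’s actual kernel; real kernels produce packed bitmaps rather than a Vec of bools):

```rust
// A branch-free comparison over contiguous homogeneous values: the loop
// shape LLVM can auto-vectorize into SIMD compares.
fn gt_mask(values: &[i32], threshold: i32) -> Vec<bool> {
    values.iter().map(|&v| v > threshold).collect()
}

fn main() {
    let salaries = [95_000, 87_000, 102_000, 78_000, 110_000];
    let mask = gt_mask(&salaries, 90_000);
    println!("{mask:?}"); // [true, false, true, false, true]
}
```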
Compression #
Columns contain homogeneous data: all strings, all integers, all dates. Values in the same column tend to cluster. Departments repeat. Salaries fall in a narrow range. Timestamps are monotonically increasing.
This homogeneity makes columnar data extraordinarily compressible:
- Dictionary encoding replaces repeated strings with integer codes. A department column with four distinct values becomes four integers plus a lookup table, regardless of row count.
- Run-length encoding collapses sequences of identical values. A sorted status column with a million “active” rows followed by ten thousand “inactive” rows compresses to two entries.
- Delta encoding stores sorted integers as differences from the previous value. A column of sequential IDs becomes a column of ones.
Row-based data mixes types in every block, defeating all of these schemes. Columnar data hands each scheme a single-type, highly redundant stream. The result: smaller files, less I/O, faster queries. Parquet files are typically 5-10x smaller than equivalent CSV files, and the savings compound with column count.
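To make the first of those schemes concrete, here is a minimal dictionary-encoding sketch (illustrative only; Arrow’s DictionaryArray and Parquet’s encoders implement this for real):

```rust
use std::collections::HashMap;

// Dictionary encoding: repeated strings become small integer codes plus a
// lookup table whose size depends on distinct values, not row count.
fn dictionary_encode(values: &[&str]) -> (Vec<u32>, Vec<String>) {
    let mut dict: Vec<String> = Vec::new();
    let mut index: HashMap<&str, u32> = HashMap::new();
    let codes = values
        .iter()
        .map(|&v| {
            *index.entry(v).or_insert_with(|| {
                dict.push(v.to_string());
                (dict.len() - 1) as u32
            })
        })
        .collect();
    (codes, dict)
}

fn main() {
    let depts = ["Engineering", "Sales", "Engineering", "Marketing", "Engineering"];
    let (codes, dict) = dictionary_encode(&depts);
    println!("{codes:?}"); // [0, 1, 0, 2, 0]
    println!("{dict:?}");  // ["Engineering", "Sales", "Marketing"]
}
```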
Apache Arrow: the format everyone eventually converged on #
Before Arrow, every tool had its own in-memory columnar format. Spark had Tungsten. Pandas had its internal block manager. Impala had its format. When you moved data between tools, you serialized from format A and deserialized into format B. For a Python notebook calling a Spark cluster, this meant copying gigabytes of data through a serialization bottleneck that often dominated the total runtime.
Apache Arrow was born from a simple observation: if every tool used the same in-memory format, data could move between them at the cost of a pointer, not a copy.
Arrow is three things:
- A specification for how columnar data is laid out in memory: buffer alignments, null representation, offset schemes for variable-length types.
- A set of libraries implementing that specification in Rust, C++, Java, Python, Go, and more.
- A contract that any Arrow-aware system can read any other Arrow-aware system’s data without conversion.
Arrow is not a file format. Parquet is a file format (columnar, compressed, on disk). Arrow is the in-memory representation: what the data looks like once it leaves the file and enters the CPU’s domain. Parquet goes to disk. Arrow lives in RAM. DataFusion reads Parquet from disk into Arrow in memory, processes it, and produces Arrow as output. No intermediate format. No copies.
Arrow at the byte level #
An Arrow array is refreshingly simple. No object headers. No vtable pointers. No indirection. Just flat buffers.
Fixed-width types #
A column of i32 values [1, NULL, 3, 4] is stored as two buffers:
Validity bitmap: [1, 0, 1, 1] (1 bit per value: 1=valid, 0=null)
Data buffer: [1, ?, 3, 4] (contiguous i32 values; ? = undefined)

The validity bitmap uses one bit per value. A million-row column needs a 125 KB bitmap. Checking null at index i is one bitwise AND: bitmap[i / 8] & (1 << (i % 8)).
The data buffer is a flat [i32]. Accessing value i is pointer arithmetic: base + i * 4. No indirection. No pointer chasing. The access pattern is perfectly predictable, which means the CPU prefetcher can stay ahead of the computation and virtually eliminate cache misses.
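The null check described above can be written out as a standalone sketch (plain Rust over a packed byte slice, not the arrow crate’s API):

```rust
// Validity check at index i: one index, one shift, one AND over the
// packed bitmap, exactly as described above.
fn is_valid(bitmap: &[u8], i: usize) -> bool {
    bitmap[i / 8] & (1 << (i % 8)) != 0
}

fn main() {
    // Bits 0..4 for [1, NULL, 3, 4]: valid, null, valid, valid => 0b1101
    let bitmap = [0b0000_1101u8];
    println!("index 0 valid: {}", is_valid(&bitmap, 0)); // true
    println!("index 1 valid: {}", is_valid(&bitmap, 1)); // false
}
```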
Variable-width types #
Strings cannot be fixed-width. Arrow handles them with an offset buffer:
Values: ["hello", "world", "!"]
Offsets: [0, 5, 10, 11] (n+1 integers: start position of each string, plus total length)
Data: "helloworld!" (all strings concatenated, no separators, no null terminators)

To read string i: slice data[offsets[i]..offsets[i+1]]. String 1 is data[5..10] = "world". Two array accesses, one slice, zero heap allocations.
Compare this to a Vec<String> in Rust, where every string is a separate heap allocation with its own pointer, length, and capacity. For a million strings, that is a million allocations scattered across the heap. Arrow stores the same million strings in two contiguous buffers: one for offsets, one for characters. The difference in cache behavior is enormous.
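The offset scheme is small enough to sketch in plain Rust (a standalone illustration of the layout, not the arrow crate’s API):

```rust
// Arrow-style string access: two offset lookups and one slice into the
// shared data buffer, with no per-string allocation.
fn get_str<'a>(data: &'a str, offsets: &[usize], i: usize) -> &'a str {
    &data[offsets[i]..offsets[i + 1]]
}

fn main() {
    let data = "helloworld!";
    let offsets = [0, 5, 10, 11]; // n+1 offsets for n strings
    println!("{}", get_str(data, &offsets, 1)); // world
    println!("{}", get_str(data, &offsets, 2)); // !
}
```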
Record batches #
A single column is rarely useful alone. A record batch bundles multiple equal-length columns with a schema:
RecordBatch:
Schema: {id: Int32, name: Utf8, salary: Float64}
Columns:
id: [1, 2, 3]
name: ["Alice", "Bob", "Carol"]
salary: [95000.0, 87000.0, 102000.0]
Row count: 3

Query engines stream data as record batches. Typical sizes range from 1,024 to 65,536 rows: large enough to amortize per-batch overhead, small enough to fit in L2 cache, and naturally compatible with streaming (you do not need to load the entire dataset before processing the first batch).
In the previous post, when we saw DataFusion’s operators passing data between stages, this is what they were passing: Arrow RecordBatches. The CSV scanner produced them. The filter operator consumed and produced them. The aggregate operator consumed them and produced a final batch. No format conversion anywhere in the pipeline.
Building Arrow in Rust #
The arrow crate is one of the foundations under DataFusion. Let us build the same shapes we have been talking about.
[dependencies]
arrow = "55"

Typed arrays #
use arrow::array::{Int32Array, StringArray, Float64Array};
fn main() {
// A column of integers with a null at index 1
let ids = Int32Array::from(vec![Some(1), None, Some(3), Some(4)]);
println!("ids[0] = {}", ids.value(0)); // 1
println!("ids[1] null? {}", ids.is_null(1)); // true
// A column of strings
let names = StringArray::from(vec!["Alice", "Bob", "Carol", "Dave"]);
println!("names[2] = {}", names.value(2)); // Carol
// A column of floats
let salaries = Float64Array::from(vec![95000.0, 87000.0, 102000.0, 78000.0]);
println!("salaries[0] = {}", salaries.value(0)); // 95000.0
}

Int32Array::from(vec![Some(1), None, Some(3), Some(4)]) allocates two buffers: a 16-byte data buffer and a 1-byte validity bitmap. The None sets bit 1 to zero. Under the hood, this is exactly the layout we described: flat arrays, no indirection. The type system prevents you from accidentally treating a null value as valid, and the memory layout is identical across every language that speaks Arrow.
Record batches #
use std::sync::Arc;
use arrow::array::{Int32Array, StringArray, Float64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
fn main() -> Result<(), arrow::error::ArrowError> {
let schema = Schema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("name", DataType::Utf8, false),
Field::new("department", DataType::Utf8, false),
Field::new("salary", DataType::Float64, false),
]);
let batch = RecordBatch::try_new(
Arc::new(schema),
vec![
Arc::new(Int32Array::from(vec![1, 2, 3, 4, 5])),
Arc::new(StringArray::from(vec!["Alice", "Bob", "Carol", "Dave", "Eve"])),
Arc::new(StringArray::from(vec![
"Engineering", "Sales", "Engineering", "Marketing", "Engineering",
])),
Arc::new(Float64Array::from(vec![95000.0, 87000.0, 102000.0, 78000.0, 110000.0])),
],
)?;
println!("Rows: {}, Columns: {}", batch.num_rows(), batch.num_columns());
// Rows: 5, Columns: 4
Ok(())
}

RecordBatch::try_new validates at construction time: all columns must have equal length and match the schema types. Pass a Float64Array where the schema expects Int32 and you get an error before any query runs. This is the same principle as Rust’s type system: catch the mistake at the boundary, not in the middle of computation.
Filtering: boolean masks instead of if-else #
In a row-based world, you filter with a loop and a branch per row. In Arrow, you filter with a boolean mask applied to an entire column:
use arrow::array::{StringArray, Float64Array, BooleanArray};
use arrow::compute;
fn main() {
let departments = StringArray::from(vec![
"Engineering", "Sales", "Engineering", "Marketing", "Engineering",
]);
let salaries = Float64Array::from(vec![95000.0, 87000.0, 102000.0, 78000.0, 110000.0]);
// Build the mask: which rows are Engineering?
let mask = BooleanArray::from(
departments.iter()
.map(|d| Some(d == Some("Engineering")))
.collect::<Vec<_>>()
);
// Apply the mask
let filtered = compute::filter(&salaries, &mask).unwrap();
let result = filtered.as_any().downcast_ref::<Float64Array>().unwrap();
for i in 0..result.len() {
println!("{}", result.value(i));
}
// 95000.0
// 102000.0
// 110000.0
}

The mask is itself an Arrow array: one bit per row. The compute::filter kernel uses that mask to extract matching values from the salary column in a tight loop that the compiler can often auto-vectorize into SIMD instructions. No per-row branch. No object-per-row overhead. Just arrays in, arrays out.
Aggregation: compute kernels #
use arrow::array::Float64Array;
use arrow::compute;
fn main() {
let salaries = Float64Array::from(vec![95000.0, 102000.0, 110000.0]);
let sum = compute::sum(&salaries).unwrap();
let min = compute::min(&salaries).unwrap();
let max = compute::max(&salaries).unwrap();
let avg = sum / salaries.len() as f64;
println!("Sum: {sum}"); // 307000
println!("Min: {min}"); // 95000
println!("Max: {max}"); // 110000
println!("Avg: {avg:.2}"); // 102333.33
}

These kernels are tight loops over contiguous memory. LLVM auto-vectorizes the sum kernel into SIMD additions. For a million-element array, the computation runs at memory bandwidth speed: the CPU adds values as fast as it can fetch them from RAM. This is the payoff of the columnar layout. The data is arranged exactly the way the hardware wants it.
From disk to memory: Parquet to Arrow #
Parquet is the columnar file format. Arrow is the columnar memory format. Together they form the storage-to-computation bridge:
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;
use std::fs::File;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let file = File::open("employees.parquet")?;
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
// Projection push-down: only decompress these columns.
// Build the mask first, because with_projection consumes the builder.
let projection = ProjectionMask::leaves(
builder.parquet_schema(),
vec![2, 3], // department, salary
);
let reader = builder
.with_projection(projection)
.with_batch_size(4096)
.build()?;
for batch in reader {
let batch = batch?;
println!("Batch: {} rows, {} columns", batch.num_rows(), batch.num_columns());
}
Ok(())
}

The .with_projection() call is projection push-down at the file level. The Parquet reader never decompresses the id, name, or hire_date column chunks. On a table with 50 columns, this means reading and decompressing 2 columns instead of 50. The data enters memory already in Arrow format, ready for compute kernels. No parsing step, no conversion step, no intermediate representation.
The real-world evidence #
This is not a theoretical win. You can see it in real systems.
DuckDB runs TPC-H analytical benchmarks on a laptop faster than some distributed clusters running row-based databases. A single thread saturates memory bandwidth because the columnar engine wastes no cache line, no SIMD lane, no memory bus cycle on data it does not need.
Polars, built on Arrow in Rust, processes multi-gigabyte CSV files in seconds. The secret is not clever algorithms. It is reading data into Arrow format on ingestion and never leaving it.
Apache Spark learned this lesson the hard way. Early Spark used a row-based Row object for everything. Performance was poor. Project Tungsten introduced an off-heap columnar format, and later Spark adopted Arrow for Python interop (PySpark). The result was order-of-magnitude improvements for aggregation and join-heavy workloads.
DataFusion uses Arrow as the sole data exchange format between operators. A filter produces an Arrow RecordBatch. An aggregate consumes it. No serialization between stages. No format conversion. Data flows through the engine as contiguous memory, from disk to final result.
When rows still win #
Columnar is not universally better. The tradeoff is real.
Transactions. When you INSERT one row, row storage writes one contiguous block. Columnar storage must append to every column file. This is why PostgreSQL and MySQL use row-based storage: their bread and butter is OLTP, one row at a time, millions of times per second.
Point lookups. Fetching one user by primary key in row storage is one read. In columnar storage, it is one read per column. With an index, the row-based read is one cache line. The columnar read is N cache lines for N columns.
Full-row access patterns. If the query is SELECT * FROM users WHERE id = 42, columnar has no advantage: you need every column anyway.
The split is clean: OLTP favors rows. OLAP favors columns. This is why the industry is converging on hybrid architectures: PostgreSQL for transactions, DuckDB or DataFusion for analytics, Parquet as the bridge between them. Some systems, like AlloyDB and CockroachDB, are experimenting with both layouts in the same engine, choosing per-query.
What comes next #
We have now covered the pipeline (post 1) and the memory format that makes it fast (this post). The next step is the part where Rust’s type system gets to shine: logical plans and expressions. How a SELECT ... WHERE ... GROUP BY becomes a tree of typed enum variants that the optimizer can pattern-match on, rewrite, and hand to the executor. If the compiler analogy holds, this is where we build the intermediate representation.