You have been writing formal grammars your entire career. Every enum in Rust, every sealed trait in Scala, every TypeScript discriminated union: these are not just “types.” They are formal grammars defining the syntax of your domain. They say what values can exist, what shapes data can take, what states are expressible. That is exactly what a grammar does.
Most engineers never take a formal methods course. The ones who do tend to forget it the moment the exam ends. But the concepts never left. They migrated into the tools we use daily: type systems, pattern matching, algebraic data types. We use them without recognizing the ancestry.
This series is about making that ancestry visible. Not for academic credit, but because understanding the distinction between syntax (what can be expressed) and semantics (what it means) changes how you design types, validate data, and prevent bugs. In this first post, we start with grammars.
What is a formal grammar #
In 1956, Noam Chomsky published “Three models for the description of language” and introduced what we now call the Chomsky hierarchy: a classification of grammars by their generative power. A grammar is a set of rules that defines which strings belong to a language. The hierarchy ranks grammars from most restrictive to most powerful.
| Type | Name | Rule form | Recognizer | Programming relevance |
|---|---|---|---|---|
| 3 | Regular | `A -> aB` or `A -> a` | Finite automaton | Regex, lexical analysis (tokenization) |
| 2 | Context-free | `A -> γ` | Pushdown automaton | Parser grammars, BNF, most syntax definitions |
| 1 | Context-sensitive | `αAβ -> αγβ` | Linear-bounded automaton | Type checking, name resolution, scoping rules |
| 0 | Unrestricted | `α -> β` | Turing machine | Arbitrary computation |
Programming languages live mostly at Type-2 (context-free grammars) for their syntax: the parser checks whether your source code is structurally well-formed. But a program that parses correctly can still be nonsense: referencing undefined variables, applying a function to the wrong type. Catching those errors requires context-sensitive analysis. That is what the type checker does. Syntax says the sentence is grammatical. Semantics says whether it means anything.
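To make the hierarchy concrete, here is a minimal sketch of why nesting pushes a language out of the regular class: balanced parentheses cannot be recognized by a finite automaton because they require unbounded counting, but a single counter (standing in for a pushdown stack) suffices. The function name `balanced` is mine, chosen for illustration:

```rust
// Sketch: balanced parentheses are Type 2, not Type 3.
// A regex (finite automaton) has finitely many states and cannot
// count arbitrary nesting depth; a counter -- the degenerate case
// of a pushdown automaton's stack -- can.
fn balanced(input: &str) -> bool {
    let mut depth: i64 = 0; // stands in for the pushdown stack
    for c in input.chars() {
        match c {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth < 0 {
                    return false; // closed more than we opened
                }
            }
            _ => {} // other characters are irrelevant to nesting
        }
    }
    depth == 0 // every open paren was eventually closed
}

fn main() {
    assert!(balanced("(1 + (2 * 3))"));
    assert!(!balanced("(1 + 2))"));
}
```

This is the same boundary your tooling respects: the lexer is a pile of regular expressions, but the parser that checks nesting needs a stack.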
BNF: the notation you already know #
When John Backus and Peter Naur formalized the grammar of ALGOL 60, they gave us Backus-Naur Form (BNF), a notation for writing context-free grammars as production rules. You have seen it in every language specification. Here is a tiny grammar for arithmetic expressions:
```bnf
<expr> ::= <num>
         | <expr> "+" <expr>
         | <expr> "*" <expr>
```

Three production rules. An expression is either a number, an addition of two expressions, or a multiplication of two expressions. Now look at the same grammar expressed as a Rust enum:
```rust
// Rust: the BNF grammar as a type
enum Expr {
    Num(f64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}
```

```scala
// Scala: the same grammar as a type
sealed trait Expr
case class Num(value: Double) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Mul(left: Expr, right: Expr) extends Expr
```

The correspondence is exact. Each variant (Rust) or case class (Scala) is a production rule. The recursive references to `Expr` / `Box<Expr>` are nonterminals: symbols that expand into further structure. The leaf values (`f64`, `Double`) are terminals: the atoms that cannot be broken down further. The type definition IS the grammar, expressed in a language the compiler can enforce.
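The correspondence runs in the other direction too: consuming a value of this type means following its production rules. A minimal sketch of an evaluator, where each match arm handles exactly one rule of the BNF grammar:

```rust
// Sketch: an evaluator over the Expr grammar. Each match arm
// corresponds to one production rule, and the compiler's
// exhaustiveness check guarantees no rule is forgotten.
enum Expr {
    Num(f64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

fn eval(e: &Expr) -> f64 {
    match e {
        Expr::Num(n) => *n,                   // <num>: a terminal
        Expr::Add(l, r) => eval(l) + eval(r), // <expr> "+" <expr>
        Expr::Mul(l, r) => eval(l) * eval(r), // <expr> "*" <expr>
    }
}

fn main() {
    // (1 + 2) * 3
    let e = Expr::Mul(
        Box::new(Expr::Add(Box::new(Expr::Num(1.0)), Box::new(Expr::Num(2.0)))),
        Box::new(Expr::Num(3.0)),
    );
    assert_eq!(eval(&e), 9.0);
}
```

Delete a match arm and the code no longer compiles: the compiler is checking your traversal against the grammar.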
AST vs CST: what your parser throws away #
When a parser reads the string (1 + 2) * 3, it first builds a Concrete Syntax Tree (CST), a tree that preserves every detail of the source text: parentheses, whitespace, operator tokens, comments. The CST is a faithful representation of what was written.
Then the parser transforms it into an Abstract Syntax Tree (AST). The AST drops everything that was only needed for parsing. Parentheses disappear because the tree structure already encodes precedence: Mul(Add(Num(1), Num(2)), Num(3)). Whitespace disappears because it carries no structural information. What remains is pure structure.
When you define enum Expr with those three variants, you are defining the AST, the abstract syntax of your expression language. You are declaring which shapes a value can take, without prescribing how those shapes appear as text. This is a fundamental distinction. The grammar defines the space of valid structures. Parsing is the function that maps concrete text into that space. The type is the grammar. The parser is the bridge.
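To see the bridge in code, here is a minimal recursive-descent sketch (the structure and names like `parse_term` are mine, not from any parsing library). Precedence lives in the call structure, not in the tree: `parse_expr` handles `+`, `parse_term` handles `*`, and the parentheses are consumed in `parse_atom` and never stored:

```rust
// Sketch: mapping concrete text into the Expr grammar.
// Parentheses and whitespace exist only during parsing;
// the returned AST is pure structure.
#[derive(Debug, PartialEq)]
enum Expr {
    Num(f64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

struct Parser<'a> {
    chars: std::iter::Peekable<std::str::Chars<'a>>,
}

impl<'a> Parser<'a> {
    fn new(src: &'a str) -> Self {
        Parser { chars: src.chars().peekable() }
    }

    fn skip_ws(&mut self) {
        while self.chars.peek() == Some(&' ') {
            self.chars.next();
        }
    }

    fn eat(&mut self, c: char) -> bool {
        self.skip_ws();
        if self.chars.peek() == Some(&c) {
            self.chars.next();
            true
        } else {
            false
        }
    }

    // <expr> ::= <term> ("+" <term>)*
    fn parse_expr(&mut self) -> Expr {
        let mut left = self.parse_term();
        while self.eat('+') {
            left = Expr::Add(Box::new(left), Box::new(self.parse_term()));
        }
        left
    }

    // <term> ::= <atom> ("*" <atom>)*
    fn parse_term(&mut self) -> Expr {
        let mut left = self.parse_atom();
        while self.eat('*') {
            left = Expr::Mul(Box::new(left), Box::new(self.parse_atom()));
        }
        left
    }

    // <atom> ::= <num> | "(" <expr> ")"
    fn parse_atom(&mut self) -> Expr {
        if self.eat('(') {
            let inner = self.parse_expr();
            self.eat(')');
            inner // the parentheses vanish here: tree structure replaces them
        } else {
            self.skip_ws();
            let mut digits = String::new();
            while matches!(self.chars.peek(), Some(c) if c.is_ascii_digit() || *c == '.') {
                digits.push(self.chars.next().unwrap());
            }
            Expr::Num(digits.parse().expect("expected a number"))
        }
    }
}

fn main() {
    let ast = Parser::new("(1 + 2) * 3").parse_expr();
    let expected = Expr::Mul(
        Box::new(Expr::Add(Box::new(Expr::Num(1.0)), Box::new(Expr::Num(2.0)))),
        Box::new(Expr::Num(3.0)),
    );
    assert_eq!(ast, expected);
}
```

Note that this parser's grammar is a refactored version of the BNF above: splitting `<expr>` into `<term>` and `<atom>` layers is the standard way to encode precedence and remove the ambiguity of `<expr> ::= <expr> "+" <expr>`.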
ADTs as domain grammars #
The examples so far have been about expression languages, the traditional home of formal grammars. But the insight generalizes. Every algebraic data type you write is a grammar for some domain. Consider an order workflow:
```rust
// Rust
enum OrderStatus {
    Pending,
    Confirmed { confirmed_at: DateTime<Utc> },
    Shipped { tracking: String, shipped_at: DateTime<Utc> },
    Delivered { delivered_at: DateTime<Utc> },
    Cancelled { reason: String },
}
```

```scala
// Scala
sealed trait OrderStatus
case object Pending extends OrderStatus
case class Confirmed(confirmedAt: Instant) extends OrderStatus
case class Shipped(tracking: String, shippedAt: Instant) extends OrderStatus
case class Delivered(deliveredAt: Instant) extends OrderStatus
case class Cancelled(reason: String) extends OrderStatus
```

This is a grammar. It defines the vocabulary of your order domain. Five production rules, each generating a different sentence shape. `Pending` is a terminal: it takes no arguments, it just is. `Shipped` is a production with two fields: a tracking string and a timestamp. Together, the variants define what states are expressible.
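And as with the expression grammar, consuming this type means handling every production rule. A sketch (timestamps simplified to `String` here to keep the example dependency-free; the real type would use `DateTime<Utc>`):

```rust
// Sketch: an exhaustive match over the domain grammar.
// Add a sixth variant to OrderStatus and this match becomes a
// compile error until the new production rule is handled.
enum OrderStatus {
    Pending,
    Confirmed { confirmed_at: String },
    Shipped { tracking: String, shipped_at: String },
    Delivered { delivered_at: String },
    Cancelled { reason: String },
}

fn describe(status: &OrderStatus) -> String {
    match status {
        OrderStatus::Pending => "awaiting confirmation".to_string(),
        OrderStatus::Confirmed { confirmed_at } => format!("confirmed at {confirmed_at}"),
        OrderStatus::Shipped { tracking, .. } => format!("in transit ({tracking})"),
        OrderStatus::Delivered { delivered_at } => format!("delivered at {delivered_at}"),
        OrderStatus::Cancelled { reason } => format!("cancelled: {reason}"),
    }
}

fn main() {
    let s = OrderStatus::Shipped {
        tracking: "1Z999".to_string(),
        shipped_at: "2024-01-01T00:00:00Z".to_string(),
    };
    assert_eq!(describe(&s), "in transit (1Z999)");
}
```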
Notice what this grammar does NOT define. It does not say that a Pending order should transition to Confirmed before it can become Shipped. It does not say that a Cancelled order should never become Delivered. It does not say that tracking must be non-empty. The grammar defines what can be said. It says nothing about what should be said. That distinction, between syntax and semantics, is the entire point of this series.
A well-designed grammar shrinks the space of expressible states to something close to the space of meaningful states. But “close” is not “equal.” There is always a gap.
The gap: syntactically valid, semantically meaningless #
In 1957, Chomsky wrote a sentence that became famous precisely because it illustrates the boundary between syntax and semantics:
“Colorless green ideas sleep furiously.”
The sentence is grammatically perfect. Adjective, adjective, noun, verb, adverb, all in the right positions. A parser would accept it. But it means nothing. The syntax is valid; the semantics are empty.
The same thing happens in code. Consider this perfectly well-typed value:
```rust
let status = OrderStatus::Shipped {
    tracking: String::new(), // empty string
    shipped_at: Utc::now(),
};
```

```scala
val status: OrderStatus = Shipped(tracking = "", shippedAt = Instant.now())
```

The type system accepts it. The grammar says `Shipped` takes a `String` and a timestamp, and that is exactly what we provided. But a shipment with an empty tracking number is nonsense. It is the domain equivalent of colorless green ideas. The grammar, your enum, your sealed trait, defines the syntax of your domain. It controls what shapes values can take. But it does not control what those shapes mean.
This is not a failure of the type system. It is a statement about what grammars can and cannot do. Grammars generate structure. Meaning requires something else: a semantic framework that assigns interpretation to the structures the grammar produces. Closing this gap, making syntactically valid but semantically meaningless values impossible, is the central challenge of type-driven design.
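One standard way to shrink the gap is a validating constructor around a newtype, so the meaningless value cannot be built in the first place. A minimal sketch (the name `TrackingNumber` is mine, chosen for illustration):

```rust
// Sketch: making the empty tracking number unrepresentable.
// The inner String is private, so outside this module the only
// way to obtain a TrackingNumber is through `new`, which rejects
// the semantically meaningless case at the boundary.
pub struct TrackingNumber(String);

impl TrackingNumber {
    pub fn new(raw: &str) -> Option<TrackingNumber> {
        let trimmed = raw.trim();
        if trimmed.is_empty() {
            None // "colorless green" shipments stop here
        } else {
            Some(TrackingNumber(trimmed.to_string()))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

fn main() {
    assert!(TrackingNumber::new("").is_none());
    assert!(TrackingNumber::new("   ").is_none());
    assert_eq!(TrackingNumber::new("1Z999").unwrap().as_str(), "1Z999");
}
```

A `Shipped` variant that carries `TrackingNumber` instead of `String` has moved the invariant into the grammar itself: the syntactically expressible states and the meaningful ones are now closer to the same set.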
What comes next #
In Part 2, we look at the three classical approaches to defining semantics: operational, denotational, and axiomatic. We show how each one maps to a concrete technique you already use: match arms, pure functions, and type-level contracts. The grammar defines what your domain can say. The semantics define what it should mean.