Visiting an Abstract Syntax Tree
Joshua Tree National Park (via: Wikimedia Commons)
In my last post, I explored how Crystal parsed a simple program and produced a data structure called an abstract syntax tree (AST). But what does Crystal do with the AST? Why bother going to such lengths to create it?
After Crystal parses my code, it repeatedly steps through all the entries or nodes in the AST and builds up a description of the intended meaning and behavior of my code. This process is known as semantic analysis. Later, Crystal will use this description to convert my program into a machine language executable.
But what does this description contain? What does it really mean for a compiler to understand anything? Let’s pack our bags and visit an abstract syntax tree with Crystal to find out.
The Visitor Pattern
Imagine several tourists visiting a famous tree: Each of them sees the same tree in a different way. The tree doesn’t change, but the perspective of each person looking at it is different. They each take a different photo, or remember different details.
In Computer Science this separation of the data structure (the tree) from the algorithms using it (the tourists) is known as the visitor pattern. This technique allows compilers and other programs to run multiple algorithms on the same data structure without making a mess.
The visitor pattern calls for two functions: accept
and visit
. First, a
node in the data structure “accepts” a visitor:
After accepting a visitor, the node turns around and calls the visit
method on
Visitor
:
The visit
method implements whatever algorithm that visitor is interested in.
This seems kind of pointless… why use accept
at all? We could just call
visit
directly. The key is that, after calling the visitor and passing
itself, the node passes the visitor to each of its children, recursively:
And then the visitor can visit each of the child nodes also. The Visitor
class doesn’t necessarily need to know anything about how to navigate the node
data structure. And more and more visitor classes can implement new algorithms
without changing the underlying data structure and breaking each other.
The Visitor Pattern in the Crystal Compiler
In order to understand what my code means, Crystal reads through my program’s AST over and over again using different visitors. Each algorithm looks for certain syntax, records information about the types and objects my code uses or possibly even transforms my code into a different form.
Crystal implements the basics of the visitor pattern in visitor.cr, inside the superclass of all AST nodes:
class ASTNode def accept(visitor) if visitor.visit_any self if visitor.visit self accept_children visitor end visitor.end_visit self visitor.end_visit_any self end end end
Each subclass of ASTNode
implements its own version of accept_children
.
My Tiny Crystal Program
To get a sense of how the visitor pattern works inside of Crystal, let’s look at one line of code from my last post:
arr = [12345, 67890]
As I explained last month, the Crystal parser generates this AST tree fragment:
Once the parser is finished and has created this small tree, the Crystal
compiler steps through it a number of different times, looking for classes,
variables, type declarations, etc. Each of these passes through the AST is
performed by a different visitor class: TopLevelVisitor
,
InstanceVarsInitializerVisitor
or ClassVarsInitializerVisitor
among many
others.
The most important visitor class the Crystal compiler uses is called simply
MainVisitor
. You can find the code for MainVisitor
in
main_visitor.cr:
# This is the main visitor of the program, ran after types have been declared # and their type declarations (like `@x : Int32`) have been processed. # # This visits the "main" code of the program and resolves calls, instantiates # methods and visits them, recursively, with other MainVisitors. # # The visitor keeps track of a method's variables (or the main program, split into # several files, in case of top-level code). It keeps track both of the type of a # variable at a single point (stored in @vars) and the combined type of all assignments # to it (in @meta_vars). # # Call resolution logic is in `Call#recalculate`, where method lookup is done. class MainVisitor < SemanticVisitor
Since Crystal supports typed parameters and method overloading, the visitor
class implements a different visit
method for each type of node that it visits,
for example:
class MainVisitor < SemanticVisitor def visit(node : Assign) def visit(node : Var) def visit(node : ArrayLiteral) def visit(node : NumberLiteral) Etc…
Now let’s look at three examples of what the MainVisitor
class does with my
code: identifying variables, assigning types and expanding array literals. The
Crystal compiler is much too complex to describe in a single blog post, but
hopefully I can give you glimpse into the sort of work Crystal does during
semantic analysis.
Identifying Variables
Obviously, my example code creates and initializes a variable called arr
:
arr = [12345, 67890]
But how does Crystal identify this variable’s name and value? What does it do
with arr
?
The MainVisitor
class starts to process my code by visiting the root node of
this branch of my AST, the Assign
node:
As you can see, earlier during the parsing phrase Crystal had saved the target
variable and value of this assign statement in the child AST nodes. The target
variable, arr
, appears in the Var
node, and the value to assign is an
ArrayLiteral
node. The MainVisitor
now knows I declared a new variable, called
arr
, in the current lexical scope. Since my program has no classes, methods or
any other lexical scopes, Crystal saves this variable in a table of variables
for the top level program:
Actually, to be more accurate, there will always be many other variables in
this table along with arr
. All Crystal programs automatically include the
standard library, so Crystal also saves all of the top level variables from the
standard library in this table.
In a more normal program, there will be many lexical scopes for different
method and class or module definitions, and MainVisitor
will save each
variable in the corresponding table.
Assigning Types
Probably the most important function of MainVisitor
is to assign a type to each
value in my program. The simplest example of this is when MainVisitor
visits a
NumberLiteral
node:
Looking at the size of the numeric value, Crystal determines the type should be
Int32
. Crystal then saves this type right inside of the NumberLiteral
node:
Strictly speaking, this violates the visitor pattern because the visitors
shouldn’t be modifying the data structure they visit. But the type of each
node, the type of each programming construct in my program, is really an
integral part of that node. In this case the MainVisitor
is really just
completing each node. It’s not changing the structure of the AST in this case…
although as we’ll see in a minute the MainVisitor
does this for other nodes!
Type Inference
Sometimes type values can’t be determined from the intrinsic value of an AST node. Often the type of a node is determined by other nodes in the AST.
Recall my example line of code is:
arr = [12345, 67890]
Here Crystal automatically sets the type of the arr variable to the type of the
array literal expression: Array(Int32)
. In Computer Science, this is known as
type inference. Because Crystal can automatically determine the type of
arr
, I don’t need to declare it explicitly by writing something more
complicated like this:
arr = uninitialized Array(Int32) arr = [12345, 67890]
Type inference allows me to write concise, clean code with fewer type annotations. Most modern, statically typed languages such as Swift, Rust, Julia, Kotlin, etc., use type inference in the same way as Crystal. Even newer versions of Java or C++ use type inference.
The Crystal compiler implements type inference when the MainVisitor encounters
an Assign
AST node, what we saw above.
After encountering the Assign
node, Crystal recursively processes one of the
two child nodes, the ArrayLiteral
value, and its child nodes. When this process
finishes, Crystal knows the type of the ArrayLiteral
node is Array(Int32)
:
I’ll take a closer look at how Crystal processes the ArrayLiteral
node next.
But for now, once Crystal has the type of the ArrayLiteral
node it copies that
type over to the Var
node and sets its type also:
But Crystal does something else interesting here: It sets up a dependency between the two AST nodes: it “binds” the variable to the value:
This binding dependency allows Crystal to later update the type of the arr
variable whenever necessary. In this case the value [12345, 67890]
will always
have the same type, but I suspect that sometimes the Crystal compiler can
update types during semantic analysis. In this way if the Crystal compiler ever
changed its mind about the type of some value, it can easy update the types of
any dependent values. I also suspect Crystal uses these type dependency
connections to produce error messages whenever you pass an incorrect type to
some function, for example. These are just guesses, however; if anyone from the
Crystal team knows exactly what these type bindings are used for let me know.
Update: Ary Borenszweig explained that sometimes the Crystal compiler updates the type of variables based on how they are used. He posted an interesting example on The Crystal Programming Language Forum.
Expanding an Array Literal
So far we’ve seen Crystal set the type of the NumberLiteral
node to Int32
,
and we’ve seen Crystal assign arr
a type of Array(Int32)
. But how did Crystal
determine the type of the array literal [12345, 67890]
?
This is where things get even more complicated. Sometimes during semantic
analysis the Crystal compiler completely rewrites parts of your code, replacing
it with something else. This happens even with my simple example. When visiting
the ArrayLiteral
node, the MainVisitor
expands this simple line of code into
something more complex:
__temp_621 = ::Array(typeof(12345, 67890)).unsafe_build(2) __temp_622 = __temp_621.to_unsafe __temp_622[0] = 12345 __temp_622[1] = 67890 __temp_621
Reading this, you can see how later my compiled program will create the new
array. First Crystal creates an empty array with a capacity of 2, and an
element type of Int32
. typeof(12345, 67890)
returns the type (or multiple
types inside a union type) found in the given set of values, in this case just
Int32
. Later Crystal sets the two elements in the array just by assigning
them.
Crystal achieves this by replacing part of my program’s AST with a new branch:
For clarity, I’m not drawing the AST nodes for the inner assign operations, only the first line:
__temp_621 = ::Array(typeof(12345, 67890)).unsafe_build(2)
Putting It All Together
With this new, updated AST we can see exactly how Crystal determines the type
of my variable arr
. Starting at the root of my AST, MainVisitor
visits all of
the AST nodes in this order in a series of recursive calls:
And it determines the types of each of these nodes as it returns from the recursive calls:
Some interesting details here that I don’t understand completely or have space to explain here:
-
The
TypeOf
node calculates a common union type using a type formula. In this example, it just returnsInt32
because both elements of my array,12345
and67890
, are simple 32 bit integers. -
I believe the
Generic
node refers to a Crystal generic class via thePath
node shown above, in this exampleArray(T)
. WhenMainVisitor
processes theGeneric
node, it setsT
to the typeInt32
, arriving at the typeArray(Int32).class
. -
The
Call
node looks up the method my code is calling (unsafe_build
) and uses the type from that method’s return value. I didn’t have time to explore how method lookup works in Crystal, however, so I’m not sure about this.
Scratching the Surface
Today we looked at a tiny piece of what the Crystal compiler can do. There are
many more types of AST nodes, each of which the MainVisitor
class handles
differently. And there are many different visitor classes also, beyond
MainVisitor
. When analyzing a more complex program Crystal has to understand
class and module definitions, instance and class variables, type annotations,
different lexical scopes, macros, and much, much more. Crystal will need all of
this information later, during the code generation phase, the next step that
follows semantic analysis.
But I hope this article gave you a sense of what sort of work a compiler has to do in order to understand your code. As you can see, for a statically typed language like Crystal the compiler spends much of its time identifying all of the types in your code, and determining which programming constructs or AST nodes have which types.
Next time I’ll look at code generation: Now that Crystal has identified the variables, function calls and types in my code it is ready to generate the machine language code needed to execute my program. To do that, it will leverage the LLVM framework.