Parsing: How Ruby Understands Your Code
I've started working on a new edition of Ruby Under a Microscope that covers Ruby 3.x. I'm working on this in my spare time, so it will take a while. Leave a comment or drop me a line and I'll email you when it's finished.
Update: I’ve made a lot of progress so far this year. I had time to completely rewrite Chapters 1 and 2, which cover Ruby’s new Prism parser and the Ruby compiler which now handles the Prism AST. I also updated Chapter 3 about YARV and right now I’m working on rewriting Chapter 4 which will cover YJIT and possibly other Ruby JIT compilers.
Here’s an excerpt from the new version of Chapter 1. Many thanks to Kevin Newton, who reviewed the content about Prism and had a number of corrections and great suggestions. Also thanks to Douglas Eichelberger who had some great feedback as well.
I’ll post more excerpts from Chapters 2, 3 and 4 in the coming weeks. Thanks for everyone’s interest in Ruby Under a Microscope!
Chapter 1: Tokenization And Parsing
- Tokens: The Words That Make Up the Ruby Language
- Which Words Are Reserved Words?
- Experiment 1-1: Using Prism to Tokenize Different Ruby Scripts
- Parsing: How Ruby Understands Your Code
- Identifying Tokens
- Parsing Subexpressions Recursively
- Comparing Tokens
- Operator Precedence
- Left and Right Associative Operators
- Binding Powers
- Experiment 1-2: Using Prism to Parse Different Ruby Scripts
- Summary
Parsing: How Ruby Understands Your Code
Once Ruby converts your code into a series of tokens, what does it do next? How does it actually understand and run your program? Does Ruby simply step through the tokens and execute each one in order?
No. Your code still has a long way to go before Ruby can run it. The next step on its journey through Ruby is called parsing, where words or tokens are grouped into sentences or phrases that make sense to Ruby. When parsing, Ruby takes into account the order of operations, methods, blocks, and other larger code structures.
Ruby’s parsing engine defines Ruby’s syntax rules. Reading in tokens, Ruby matches the token types and the order the tokens appear with a large series of patterns. These patterns, indeed, are the heart and soul of the Ruby language. How we write a function call, how we define a method using the def keyword, how we write classes and modules - the patterns Ruby looks for define the language.
Ruby’s parse algorithm has three high level steps:
- Identify: First, Ruby identifies what the next token represents. Ruby does this by comparing the token’s type - and possibly the types of the following tokens - with a large series of patterns. If one pattern matches, Ruby understands what your code means. If not, Ruby emits a syntax error.
- Recurse: Second, Ruby calls itself. Each value in one of the syntax patterns can itself be a subexpression - a smaller program that Ruby needs to parse. To do this, Ruby calls itself recursively.
- Compare: Third, Ruby compares the current token with the next token to determine which has a higher precedence. This comparison leads Ruby down a specific path, processing the tokens in a certain order.
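To make the three steps concrete, here is a minimal sketch of the same loop for a toy arithmetic grammar. It is not Prism's code: the `BINDING_POWER` table, the `parse_expression` name, and the use of plain Ruby arrays for tokens and AST nodes are all assumptions made for illustration.

```ruby
# Hypothetical binding powers for this sketch only; a real parser's table
# covers far more operators.
BINDING_POWER = { "+" => 1, "-" => 1, "*" => 2, "/" => 2 }.freeze

# Parses an array of Integers and operator strings into nested arrays
# of the form [operator, left, right].
def parse_expression(tokens, min_power = 0)
  # Identify: in this toy grammar every expression starts with a number.
  left = tokens.shift
  unless left.is_a?(Integer)
    raise ArgumentError, "syntax error: expected a number, got #{left.inspect}"
  end

  # Compare: continue only while the next operator binds more tightly
  # than the operator that called us.
  while (power = BINDING_POWER[tokens.first]) && power > min_power
    operator = tokens.shift
    # Recurse: the right-hand side is itself a smaller expression.
    right = parse_expression(tokens, power)
    left = [operator, left, right]
  end

  left
end

p parse_expression([1, "+", 2, "*", 3])  # => ["+", 1, ["*", 2, 3]]
p parse_expression([1, "*", 2, "+", 3])  # => ["+", ["*", 1, 2], 3]
```

In both calls the multiplication ends up nested inside the addition, because the comparison step decides how far each recursive call is allowed to read.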
Let’s break down these ideas further, by following Ruby through the “Hello World” program. Afterwards, we’ll look at a second, slightly more complicated example.
puts "Hello World"
As we saw in the previous section, Ruby first converts the text in this code file into tokens. For Hello World, Ruby’s tokenizer produces these five tokens:
Figure 1-14: Hello World Tokenized
To make the following diagrams simpler, let’s redraw these tokens in a more compact format:
Figure 1-15: Hello World tokens in a more compact format
Using a single gray line of text, Figure 1-15 shows the five tokens from Figure 1-14 in a more compact format. First, PM_TOKEN_IDENTIFIER represents the word “puts” from the beginning of the program. Next, three tokens make up the string literal value: PM_TOKEN_STRING_BEGIN for the first double quote, PM_TOKEN_STRING_CONTENT for the words Hello and World, and PM_TOKEN_STRING_END for the second quote. Finally, the program ends with PM_TOKEN_EOF to mark the end of the source code file.
Now let’s follow Ruby as it processes the Hello World example using the three steps: identify, recurse and compare.
Identifying Tokens
First, identify. How does Ruby understand what the first token, PM_TOKEN_IDENTIFIER, means?
Figure 1-16: Parsing the first token
Figure 1-16 represents the state of Ruby’s parser when it starts to parse this code. At this moment, Ruby is just getting started by inspecting the puts identifier. One of the patterns Ruby looks for matches the identifier; but what does this identifier mean? Ruby knows puts could be a local variable, or it could be the name of a function to call. Since there are no local variables defined in this program, Ruby determines that the puts identifier represents a function the program is calling. (It’s also possible that the program is about to create a new local variable like this: puts = "Hello World". If that were the case, Ruby would see the assignment operator next and parse things differently.)
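You can observe this decision from ordinary Ruby code. In the short sketch below, the names `greeting` and `demo` are made up for the example. The same identifier is treated as a method call before Ruby has seen an assignment to it, and as a local variable afterwards.

```ruby
def greeting
  "from the method"
end

def demo
  first = greeting             # no local named greeting yet: a method call
  greeting = "from the local"  # assignment: Ruby now knows a local variable
  second = greeting            # the same identifier is now a variable read
  [first, second]
end

p demo  # => ["from the method", "from the local"]
```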
What happens next? After matching the token to the function call pattern, Ruby records this match in a data structure called an abstract syntax tree (AST). Ruby and most other programming languages use ASTs to record the results of parsing tokens like this. As we’ll see, the AST’s tree structure is well suited for holding the nested, recursive structure of computer programs.
Figure 1-17: The first AST node
Figure 1-17 shows the first node Ruby saves in the AST. In a moment, Ruby will begin to add more nodes to the tree.
Before proceeding to the next token, let’s imagine the syntax pattern for a function call:
function-name ( argument1, argument2, argument3, etc. )
In Ruby the parentheses are optional, so this pattern also applies:
function-name argument1, argument2, argument3, etc.
NOTE
The original version of the Ruby parser used patterns or grammar rules like this directly with a tool called a parser generator. Ruby's new parser, Prism, which first shipped with Ruby 3.3 and became the default parser in Ruby 3.4, instead detects these patterns directly using handwritten C code.
After parsing the first token, Ruby inspects the second token. According to the function call pattern, Ruby knows the second token might represent the first argument to the function call. But how many arguments are there? And what is each argument? The program in Listing 1-11 is very simple, but it could have instead printed a very complex expression - the arguments to puts could have run on for many lines and used hundreds of tokens.
Parsing Subexpressions
Second, recurse. To parse each of the arguments to puts, Ruby has to call itself.
Figure 1-18: Parsing the second token
Figure 1-18 shows two levels of the Ruby parser’s call stack; the top line shows Ruby parsing the puts identifier token, and matching the function call pattern. The second line shows how Ruby called itself to parse the second token, PM_TOKEN_STRING_BEGIN, the leading quote of the string literal. Think of these lines as the backtrace of the Ruby parser.
Figure 1-18 also shows the value 14 on the right side. While calling itself recursively, Ruby passes in a numeric value called the binding power. We’ll return to this later.
Now that Ruby has called itself, Ruby starts the three-step process all over again: identify, recurse and compare. This time, Ruby has to identify what the PM_TOKEN_STRING_BEGIN token means. This token always indicates the start of a string value. In this example PM_TOKEN_STRING_BEGIN represents the double quote that appears after puts. But the same token might represent a single quote or one of the other ways you can write a string in Ruby, for example using %Q or %q.
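As a quick reminder of those spellings, the snippet below writes the same string four ways. The method name `hello_world_spellings` is only a label for this example. Each form opens with a different delimiter, yet all four evaluate to the same string value.

```ruby
def hello_world_spellings
  [
    "Hello World",    # double quotes
    'Hello World',    # single quotes
    %Q(Hello World),  # %Q behaves like double quotes
    %q(Hello World)   # %q behaves like single quotes
  ]
end

p hello_world_spellings.uniq  # => ["Hello World"]
```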
Ruby’s new parser, Prism, next parses the string contents directly by processing the following two tokens:
Figure 1-19: Parsing the third and fourth tokens
In this example, Ruby’s parser is done after finding the PM_TOKEN_STRING_END token and can continue to the next step. More complex strings - strings that contain interpolated values using #{} for example - might have required Ruby to call itself yet again to process more nested expressions. But for the simple "Hello World" string Ruby is done.
To record the string value, Ruby creates a new AST node called PM_STRING_NODE.
Figure 1-20: Two AST nodes
Figure 1-20 shows two AST nodes Ruby has created so far: the call node created earlier, and now a new string node.
Ruby’s parser is a recursive descent parser. This computer science term describes parsers whose code resembles the grammar or syntax rules of the programs they parse, and which call themselves recursively in a top-down manner as they process nested structures. Many modern programming languages use this general approach.
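The walk-through above can be condensed into a toy recursive descent parser written in Ruby. This is only a sketch under several assumptions: `ToyToken`, `ToyParser`, and `HELLO_WORLD_TOKENS` are hypothetical names, the token types are shortened stand-ins for the PM_TOKEN_ names, AST nodes are plain hashes, and the parser handles at most one string argument with no interpolation. Prism itself is written in C and handles vastly more cases.

```ruby
ToyToken = Struct.new(:type, :text)

class ToyParser
  def initialize(tokens)
    @tokens = tokens
    @position = 0
  end

  def parse_expression
    token = advance
    case token.type
    when :IDENTIFIER
      # Identify: with no local variables defined, treat this as a call.
      arguments = []
      # Recurse: the argument is a subexpression, so call ourselves.
      arguments << parse_expression if current.type == :STRING_BEGIN
      { node: :call, name: token.text, arguments: arguments }
    when :STRING_BEGIN
      # A simple string: read its text token, then the closing quote.
      content = expect(:STRING_CONTENT)
      expect(:STRING_END)
      { node: :string, value: content.text }
    else
      raise ArgumentError, "syntax error: unexpected #{token.type}"
    end
  end

  private

  def current
    @tokens[@position]
  end

  def advance
    token = current
    @position += 1
    token
  end

  def expect(type)
    token = advance
    unless token.type == type
      raise ArgumentError, "syntax error: expected #{type}, got #{token.type}"
    end
    token
  end
end

HELLO_WORLD_TOKENS = [
  ToyToken.new(:IDENTIFIER, "puts"),
  ToyToken.new(:STRING_BEGIN, "\""),
  ToyToken.new(:STRING_CONTENT, "Hello World"),
  ToyToken.new(:STRING_END, "\""),
  ToyToken.new(:EOF, "")
].freeze

p ToyParser.new(HELLO_WORLD_TOKENS).parse_expression
```

The result is a call node for puts whose single argument is a string node, the same two-node shape shown in Figure 1-20.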

