How Rust Implements Tagged Unions
Rust describes itself as:
…a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.
Of course, this is in contrast to C, a different systems programming language that encourages segfaults and makes no guarantees at all about thread safety. Rust improves on C in many ways, most famously with its innovative ownnership model for managing memory.
Another less obvious improvement Rust makes to C has to do with the union keyword. The Rust compiler implements tagged
unions, which prevent you from crashing your program by initializing a union
with one variant and accessing it with another.
But the Rust doesn’t include
the union keyword at all; instead, Rust
uses enum to improve on both C enums and C unions at
the same time.
(Update: I heard on Twitter and in the comments that Rust does include untagged unions for use in FFI to interoperate with C, or for unsafe code building custom unions.)
Not sure what a tagged union is? Or why it’s an improvement over an old fashioned C union? Today I’ll explain. First I’ll start with a quick review of C unions, how they work and why they are dangerous. Then I’ll show you how Rust enums improve on them.
Unions are one of the most dangerous features of C. Here’s an example:
Here the union num_or_str saves either a number or a character pointer but not both. (A union can contain any number of members; for simplicity my example union has only two.) On the right I show how the C compiler would allocate memory for an instance of num_or_str. It allocates enough memory to hold the longest value in the union, but not both values at the same time. The integer is a short, meaning it occupies 16 bits or two bytes, and the string is a char pointer which takes 64 bits or 8 bytes using a modern 64-bit CPU. The two options for what might be stored in the union, num and str in this example, are known as variants.
Why C Unions Are Dangerous
Unions are dangerous because you, the C programmer, need to remember which variant you set in the union. If you save one type of value but then access the other, your program will crash.
For example this code works fine:
But if you forget a_number contains a number, and use a_number as a string instead, your program will crash:
Notice the C compiler didn’t help me here at all. It didn’t display any sort of warning or error when I wrote a_number.str. It silently allowed me to write dangerous code; in fact, union syntax encouraged me to introduce a segmentation fault.
Writing C code with unions is like driving very fast down a highway full of potholes. You might be the best driver in the world, but eventually you’re going to hit one of the holes and crash.
C programmers have been writing code with unions for years - for decades in fact. How have they avoided this problem? There must be a safe way of writing C code with unions.
The most common and robust solution is to keep track of which union variant is valid using an integer value saved right next to the union in memory. This integer is known as a tag, and the combination of the tag and the union is a tagged union.
Here’s an example:
On the right side I’ve allocated some memory right before the union for the tag using a struct. C structs, unlike unions, allocate enough memory to store all of their members at once. Note: Using two bytes to save a small integer value is unnecessary. C programs often use only one byte, or even represent the integer value using a bit mask inside the union’s values. But the principle remains the same.
Now when I save an integer in an instance of the union I can also set the tag to the value 1, for example, which I decide will mean that a_number contains a number:
And if I want to save a string instead, I set the tag to 2, for example:
Later when I access the tagged union, I first check the tag before deciding which variant I can access:
Of course, tagged unions are not foolproof. I invented the tag values 1 and 2 and wrote the code that checks for them. There’s nothing to prevent me from forgetting to save the tag value, saving the wrong tag value or misinterpreting the tag value when I later read it. And, if I ever add new variants to the union, I have to add a new branch to every if statement in my app that checks the tags, handling the new value. Needless to say, the C compiler won’t help me find those if statements or check whether I’ve covered all the possible tag values.
I’m a forgetful and easily distracted person. I need a programming language that will keep me out of trouble. Even with tagged unions I’m sure I would write dangerous, crashing C code before long.
Tagged Unions in Rust
Rust implements tagged unions using the enum keyword. For example, to declare a Rust enum type equivalent to the C tagged union above I write:
The questions for today are: Why are enums equivalent to tagged unions in C? And: What should I draw on the right side? What would I see if I could find and examine an enum in the memory space of a running Rust process?
Saving a Rust Enum
To find out, let’s create an instance of NumOrStr:
Notice that instead of 4, I’ve saved a more recognizable value, 1234. Now, if I compile it with the -—emit asm flag:
…Rust generates a file called union.s which contains the assembly language version of my program. If I open union.s and search for 1234, the integer value I saved above, I see:
I’ve found it; here are the x86 assembly language instructions that initialize a_number. These show me exactly how Rust represents enums in memory, how Rust implements tagged unions.
The only problem is… I have no idea what this means!
The movw x86 Instruction
What does movw mean? And what about -32(%rbp)?
It turns out x86 assembly language isn’t that hard to follow, once you learn the basic syntax. For a quick introduction, see my article from 2016: Learning to Read x86 Assembly Language. Intel, the company that built the microprocessor inside my Mac, defines the mov instruction to mean “move.” (Note: the instructions I show here that rustc —emit asm generates aren’t written using Intel x86 syntax, but with GAS x86 syntax instead.)
Here’s a diagram showing what the first movw instruction moves:
It turns out that movw stands for “move a word.” A word is defined as 16 bits, or 2 bytes. There are a few different variations on move, movb, movw, movl, movq, which move 1 byte, 2 bytes, 4 bytes or 8 bytes respectively.
Next, the $ notation indicates a literal value - in this case zero: $0. Now we can see the first instruction above is moving 2 bytes containing the value zero. Similarly, the second instruction is moving 2 bytes containing the value 1234:
The rbp Register
But where are these movw instructions moving these values to? To understand that we need to understand the odd -32(%rbp) syntax on the right side of the instructions. The % sign indicates a register inside my Mac’s microprocessor, in this case the “base pointer” register. So "bp" means "base pointer." And the “r” prefix in "rbp" means the move instruction is using all 8 bytes (64 bits) of this register’s value.
The -32(%rbp) notation calculates a memory address for the instruction using the contents of the %rbp register - in this case the address of where to move the data to. The expression -32(%rbp) in English means: “Take the 64 bit memory address value from the base pointer register, and subtract 32 from it.”
Compiled Rust programs - all programs really - that run on the x86 platform store values for local variables on the stack, using the base pointer register in this fashion. The base pointer, as it’s name indicates, stores the base address of my program's current stack frame. Each local variable in my code, for example a_number, is saved somewhere on the stack. If you’re not familiar with the concept of a stack, think of it as a convenient place for quickly saving and retrieving values while your program is running.
How Rust Saves an Integer Enum Variant
Taking a step back for a moment, here’s what we’ve learned so far. When I save an enum value containing an integer, Rust saves two values, 0 and 1234:
What does the 0 mean? Rust records a zero to indicate that a_number uses the NumOrStr::Num variant. In other words, a_number is a tagged union, and the zero value is the tag. We know the tag occupies 2 bytes because of the movw instruction above. The integer value itself, 1234, also takes 2 bytes because I declared it using Num(i16), and we saw Rust used a movw to save that also.
How Rust Saves an String Enum Variant
But what about the other variant, the string? When I save a string in NumOrStr, what does Rust do? To find out, I’ll replace my main function from above with this line of code:
The I’ll compile it again using the --emit asm option. Now I find this assembly language code in the union.s file:
Unfortunately this code snippet is much more complex: It first calls String::from passing a string literal, and then saves the string into the enum via a method called drop_in_place. This is much harder to understand.
Rather than trying to figure this out, I decided to debug my Rust sample program using LLDB, and inspect the memory a_string occupies. I found that Rust used 26 bytes to represent the string variant, starting with a 16 bit word containing 1:
This is again the tag; in this case 1 means a_string uses the NumOrStr::Str variant. Following this I found a pointer to the string itself:
Pointers on a 64-bit microprocessor occupy 8 bytes and contain the memory address of something, in this case my string "This is a test.” After the pointer I found two 64 bit values, each containing 15:
These are two attributes of the string: its capacity and length. By inspecting my process's memory I’ve started to learn a bit about how Rust manages memory for strings.
But what’s important for me today is the first word, the value 1. Again, we see the same pattern. Rust saves an integer value, the tag, indicating which variant this instance of the enum uses. Then Rust saves the enum variant’s payload in the memory that follows:
Tagged Unions in Rust and C
Let’s review by declaring a tagged union in C and Rust:
On the left using C, I have to include the tag explicitly in a surrounding struct. Rust handles this for me automatically, saving the tag value inside the enum alongside the enum’s value. The code looks very different, but as we saw above the implementations are_ identical_.
Using a tagged union looks somewhat similar in C and Rust:
But there are very important differences here! Using C, I need to remember to check the tag and to use the proper variant inside the union. The Rust compiler, on the other hand, checks the tag for me automatically and won’t allow me to access the wrong variant. The code inside of if let will never be executed unless the internal tag value matches the NumOrStr::Num variant.
Under the hood, the two languages implement tagged unions the same way. But writing code in C and Rust is very different. C encourages me to write dangerous, crashing code, while Rust prevents me from writing dangerous code in the first place.