Manipulating Trees Using SQL and the Postgres LTREE Extension

December 14th 2017 — Comments and Reactions

Yesterday, I used the LTREE extension to save a tree data structure in a Postgres table. After saving the tree, I used the @> or ancestor operator to count the number of descendant nodes on a given branch.

But that’s not all LTREE can do. Today I’ll show you how to delete, move and copy branches from one place to another in your tree, using @> in combination with other LTREE functions. After that, in my last post in this series, I’ll look at how LTREE works under the hood, at the Computer Science that makes all of this possible.

My Example Tree Again

Here’s the tree I’ve been working with during the last few blog posts:

In my last post, I saved this tree in my database using a series of insert statements:

insert into tree (letter, path) values ('A', 'A');
insert into tree (letter, path) values ('B', 'A.B');
insert into tree (letter, path) values ('C', 'A.C');
insert into tree (letter, path) values ('D', 'A.C.D');
insert into tree (letter, path) values ('E', 'A.C.E');
insert into tree (letter, path) values ('F', 'A.C.F');
insert into tree (letter, path) values ('G', 'A.B.G');

And we saw how easy it is to count the number of tree nodes in a given branch using the @> operator:

select count(*) from tree where 'A.C' @> path;

Cutting Off a Branch

But suppose I wanted to remove these nodes from the tree entirely; that is, suppose I wanted to “cut off this branch” of the tree, so to speak:

How can I do this? Simple! I just use a SQL delete statement:

delete from tree where 'A.C' @> path;

As you can see, I can use @> equally well in delete statements as in select statements.

Replanting a Branch as a New Tree

Now suppose I want to keep this branch, and save it as a separate tree in my table. That is, I want two trees: the original A tree and a new tree consisting of the C branch “replanted” as a new root:

Thinking about this for a moment, moving some nodes from one location to another in my tree means I’ll need to update their path values somehow in my table. That is, I’ll need to use an update statement and not a select or delete statement. But how? Writing an update statement is easy enough, but how do I know what the new path of each tree node will be? Let’s take C as an example. Because C will become the root node of my new tree, I want to change its path from A.C to just C:

update tree set path = 'C' where path = 'A.C';

And I will want to update D, one of C’s children, in a similar way:

update tree set path = 'C.D' where path = 'A.C.D';

I could write a separate update statement for each node, just 4 SQL statements for my example. But imagine I had 100s or 1000s of nodes in my tree. Updating the records one SQL statement at a time would require repeated network connections from my application to Postgres, slowing down the overall operation tremendously.

Instead, I need to update the path of C and each of its descendants all in a single operation. But how can I do this? Two LTREE functions, NLEVEL() and SUBPATH(), can help.

The NLEVEL Function

First, NLEVEL. As you might guess, this returns the number of levels in a given path string:

select letter, path, nlevel(path) from tree;

letter | path  | nlevel 
-------+-------+--------
A      | A     |      1
B      | A.B   |      2
C      | A.C   |      2
D      | A.C.D |      3
E      | A.C.E |      3
F      | A.C.F |      3
G      | A.B.G |      3
(7 rows)

Looking at this, it’s easy to understand what the function returns: For a root node like A, NLEVEL returns 1. For A’s child nodes, A.B and A.C, NLEVEL returns 2, and for the grandchild nodes it returns 3. It simply counts the number of levels in each path string; internally, it parses the path string for period characters.

Before we continue, consider one subtle but important point. Notice that I was able to calculate NLEVEL on all of the records in the tree table with a single SQL statement! Postgres applied the function to all of the matching paths for me. The power of LTREE’s functions is that they seamlessly integrate with SQL, harnessing and extending the power of Postgres.

The SUBPATH Function

LTREE provides another new SQL function that will also help us write a general tree path formula: SUBPATH. As you might guess, this returns a selected substring from a given path. Let’s try running it on my example tree:

select letter, subpath(path, 1) from tree;
ERROR:  invalid positions
STATEMENT:  select letter, subpath(path, 1) from tree;

Oops - I’ve done something wrong here. Calling SUBPATH(path, 1) returns the portion of the path starting with offset 1. Not a character offset, but a path offset. So SUBPATH(path, 1) drops the first level of the path, A in my tree, and returns the remaining portion of each path starting from the second path element. Internally, LTREE parses the periods for me, drops the requested number of path levels and removes the extra leading period.

In the statement above, the error was caused by the root node in the tree: A. This path has only one level, and so LTREE returns an error in this case.

Let’s try using SUBPATH only on the C branch, the branch we want to move:

select letter, subpath(path, 1) from tree where path <@ 'A.C';
letter | subpath 
-------+---------
C      | C
D      | C.D
E      | C.E
F      | C.F
(4 rows)

Now I get only four records in the result, one for C and one for each node that appears under C. And you can see the subpath column contains the portion of the path that appears after A, for each of these 4 records.

And again, notice that I was able to execute the SUBPATH function on all 4 tree records I wanted to, in a single operation. This time, the SUBPATH function worked in concert with the <@ operator. LTREE has made the SQL language I already know how to use even more powerful.

Moving Tree Nodes Using One UPDATE Statement

Now let’s return to the question of moving a branch into a new tree. As this diagram shows, I want to delete C and its children from the A tree, and move them to a new location:

Earlier I considered moving the nodes using a single update statement for each:

update tree set path = 'C' where path = 'A.C';
update tree set path = 'C.D' where path = 'A.C.D';
update tree set path = 'C.E' where path = 'A.C.E';
update tree set path = 'C.F' where path = 'A.C.F';

Now that I know about SUBPATH, it’s easy to write a single SQL update statement that will move all 4 nodes in the C branch in one operation:

update tree set path = subpath(path, 1) where path <@ 'A.C';

I use where path <@ 'A.C' to scope the update to the C branch, and I use subpath(path, 1) to remove the A root element from the path of C and each of its descendants.

I can generalize this a bit more using the NLEVEL function also:

update tree set path = subpath(path, nlevel('A.C')-1) where path <@ 'A.C';

This follows because nlevel('A.C') = 2, and therefore, nlevel('A.C')-1 returns the same formula we had above. Replacing A.C with “BRANCH_PATH” I arrive at a general formula for “replanting” a branch as a new tree using a single SQL statement:

update tree set path = subpath(path, nlevel(BRANCH_PATH)-1) where path <@ BRANCH_PATH

…assuming nlevel(BRANCH_PATH) > 1, that is assuming the branch we want to replant isn’t already a root.

The || Concatenation Operator

This seems somewhat useful, but what if I want to move a branch from one location in my tree to some other location, not necessary to the root? This is a more general problem. For example, suppose I want to move the C branch under G, like this:

To write a formula for this transformation using SQL, we need to use one more important LTREE operator: the || or concatenation operator. Let’s try it out with an example first:

select 'A.B.G' || path as concatenated from tree;
concatenated 
--------------
A.B.G.A
A.B.G.A.B
A.B.G.A.C
A.B.G.A.C.D
A.B.G.A.C.E
A.B.G.A.C.F
A.B.G.A.B.G
(7 rows)

You can see LTREE has automatically added A.B.G along with a period separator to each path in my table. And it has done this for all the paths in my table in a single operation.

Moving a Branch

Now using || I can write a single SQL statement to move a tree branch from one location to another. First, of course, I need to scope the SQL operation to the target branch using the ancestor operator:

select 'A.B.G' || path as concatenated from tree where path <@ 'A.C';
concatenated 
---------------
A.B.G.A.C
A.B.G.A.C.D
A.B.G.A.C.E
A.B.G.A.C.F
(4 rows)

I get the same results as above, but now only for the tree nodes I want to move.

But my next problem is the new paths above start with A.B.G.A.C…. Instead, I want them to be A.B.G.C…. I need to remove that extra A character from the new paths, using the SUBPATH operator:

select 'A.B.G' || subpath(path, 1) as concatenated from tree where path <@ 'A.C';
concatenated 
--------------
A.B.G.C
A.B.G.C.D
A.B.G.C.E
A.B.G.C.F
(4 rows)

And finally, converting this into an update statement:

update tree set path = 'A.B.G' || subpath(path, 1) where path <@ 'A.C'

…I have the single SQL statement I need!

And generalizing this, we arrive at a SQL formula you could use in your own Postgres database:

update tree set path = DESTINATION_PATH || subpath(path, nlevel(SOURCE_PATH)-1)
where path <@ SOURCE_PATH;

Copying a Branch

One last puzzle: How can I copy a tree branch instead of moving it? I just use an insert SQL statement instead of update. Simple, right?

But how, exactly? I need to insert multiple rows, one record for each node in the branch I copy. Again, I could write a series of insert statements like this:

insert into tree (letter, path) values ('C', 'A.B.G.C');
insert into tree (letter, path) values ('D', 'A.B.G.C.D');
insert into tree (letter, path) values ('E', 'A.B.G.C.E');
insert into tree (letter, path) values ('F', 'A.B.G.C.F');

But using LTREE functions and operators, I can achieve this using a single SQL statement! I just have to write SQL that will insert the result of a select, like this:

insert into tree (letter, path) (
    select letter, 'A.B.G' || subpath(path, 1) from tree where 'A.C' @> path
)

Executing this, Postgres will first find all the nodes inside the branch I want to copy, and recalculate their paths. Then it will insert that result set into the tree as a copy, leaving my original branch unchanged!

By writing this tree-related logic using LTREE operators in SQL, I ask Postgres to do all of the hard work of manipulating and copying the path strings for me. I don’t have to write application code to keep track of these strings, and no data needs to be transmitted back and forth between my application server and the database server.

What’s Next? LTREE Internals

In my last post about LTREE, I’ll look closely at how it works internally. It’s easy enough to imagine how simple functions like NLEVEL, || or SUBPATH work. That’s not the interesting part for me. These functions are shorthand for fairly simple string operations.

The special sauce that makes LTREE such a powerful tool is that it integrates with Postgres GiST indexes. By using an index, Postgres can execute any of the SQL expressions I wrote above equally fast on 7000 records as it would on 7! How? The only way to find out is by Looking Inside Postgres at a GiST Index.