Trees in SQL, adjacency lists and sorting.
BUU
created: 2006-01-05 15:59:35
Recently I found myself wishing to store a tree in a database. A secondary goal was to do so very, very fast. (Why? Because). After much pondering and head scratching I came up with what is basically an adjacency list. Here is an example:

Give a tree that looks like:
            A
           / \
          B   C
         / \   \
        D   E   F
I store that in a table that looks like this:
node_id | ancestor | distance
-----------------------------
B       | A        | 1
C       | A        | 1
D       | B        | 1
D       | A        | 2
E       | B        | 1
E       | A        | 2
F       | C        | 1
F       | A        | 2
Note that nodes D, E and F have multiple rows in the table.

The advantage of this lay out is that it is very, very fast to select from. I can get any number of children with a single query. For example, all of the children of B would be: SELECT * FROM table WHERE ancestor = 'B' which returns "D" and "E". The downside is that it takes more room on disk. However, some basic testing with SQLite showed me that I could fit some 5 million of these rows in a table as small as 70 megs, so I'm not overly worried about that.

As I've described it, my table works great. Selecting is very fast, I can find any level I want easily, it's fast to add new ones and so on. My single problem is that I figure out how to order my nodes once I've selected them. Give the above tree/table, my ideal order looks like: A, B, D, E, C, F.

So I get to my question. How can I sort my nodes so that they come out in the order I want? I'd definitely prefer something that would be pure sql, mostly for speed purposes. I'm also open to suggestions of what I could add to my table to enable easier/faster ordering, or perhaps a better way to store this.

I've considered several other methods of storing trees in my database. The first method I considered was simply having each node store a parent node. This is space efficient but very slow, a simple 10 node tree could take up to 10 queries to return all the nodes. The second thought was using Celko's left/right number tree, which is fast to select from and fairly easy to utilize, but when you add a node it requires updating every other node in the tree, I thought this would be inefficient for large trees.
Re: Trees in SQL, adjacency lists and sorting.
created: 2006-01-05 16:05:22

"A Look at SQL Trees" might be of interest; and there's more in the next month's article.

Not that it's really Perl related, though . . . :)

Re: Trees in SQL, adjacency lists and sorting.
created: 2006-01-05 16:21:40
One way I have done it in the past is using coded strings.... used it for threaded discussions... so you have say a 3 character key... top level is
AAA 
AAB
...
000
children of AAA are
AAAAAA
AAAAAB
...
AAA000
etc... it limits how wide you can go and how deep, but it works well for querying...

                - Ant
                - Some of my [/index.pl?node_id=56739#Best|best] work - (1 2 3)

Re^2: Trees in SQL, adjacency lists and sorting.
BUU
created: 2006-01-05 16:28:16
It's an interesting thought and I tried doing something similar to that (but with slashes) and I couldn't make it work. Do you have any specific suggestions on how I should implement it given my current structure?
Re^3: Trees in SQL, adjacency lists and sorting.
created: 2006-01-05 16:50:43
Well.. obviously you have a large unique field for the string....

I did the following... I had a unique varchar(255) that held the tree string and used 2 digit base 36 ids... obviously you could extend the character set, but I figured 1296 children per leaf was enough. If you only need binary then one char is fine, 0-1 or something... you could probably encode something, but that may make queries difficult... obviously with 2 char codes you can only get 127 levels in a tree with a 255 char field.

I used a couple of self written functions to turn the codes to and from numbers:

sub de36 {
  my($a,$b) = (uc substr($_[0],-2,1),uc substr($_[0],-1,1));
  $CHARS{$a}*36+$CHARS{$b};
}

sub en36 {
  $CHARS[$_[0]/36].$CHARS[$_[0]%36];
}
To insert I would would first grab the max id within the group I wanted, something like
SELECT MAX(thread) FROM messages WHERE thread LIKE "${parent}__"
matching those threads that were in the same parent....

you can get all immediate children with stuff like

SELECT * FROM messages WHERE thread LIKE "${parent}__"
and you can get a history with something like
SELECT * FROM messages WHERE thread = LEFT("$thread",LENGTH($thread))
sorting is, in this case, super easy.

I am using MySQL syntax but it should be clear enough...

You'll have to work out how to make it binary yourself, if you need it... not terribly difficult, select and compare, then add accordingly... table locking could probably help, too. :)

                - Ant
                - Some of my [/index.pl?node_id=56739#Best|best] work - (1 2 3)

Re^2: Trees in SQL, adjacency lists and sorting.
created: 2006-01-06 11:00:49
This style of notation can also be used with polyheirarchies (items w/ more than one parent), if you leave a couple of reserved characters for the notation. Here's an example that my teacher gave us for a class on thesaurus design:
Facet A : Subject
  A1 : Science
  A1.1 : Physics
  A1.1.1 : Optics
Facet B : Grade Level
  B1 : High School
  B1.1 : 10th Grade

So, '10th grade optics' would be 'A1.1.1B1.1'. We use a delimitor between items, so we aren't restricted to subsets that each have a unique character in them, allowing for wider branching. (depth is still a problem when dealing with fixed line lengths)

The problem is, it's fine for humans to understand, but a little messy to sort on in SQL. You can get around this by adding another delimiter between facets, and ending every notation w/ a null heirarchy: ';A1.1.1.;B1.1.'. We can then get all items relating to 10th grade physics w/ LIKE '%;A1.1.%;B1.1.%', without fear of getting 'A1.10AB1.1', and still getting items that may have even more facets, eg a movie for 10th grade physics ';A1.1;B1.1;C1'

The only requirement is that you keep each of the facets in order. (what order doesn't matter, so long as you're consistent)

If you'd like a more complex example:

If we were to try to categorize cooking implements:

Facet A : Use
  A1 : Utensil
  A1.1 : Spoon
  A1.2 : Whisk
  A2 : Pan
  A2.1 : Roasting Pan
  A2.1.1 : Covered Roasting Pan
  A2.1.2 : Uncovered Roasting Pan
Facet B : Material
  B1 : Metal
  B1.1 : Stainless Steel
  B1.2 : Aluminum
  B1.3 : Cast Iron
  B1.3.1 : Enameled Cast Iron
  B2 : Ceramic
  B2.1 : Glass
  B2.1.1 : Pyrex
  B2.2 : Stone
Facet C : Size
  C1 : 1 qt
  C2 : 2 qt
  C2.1 : 2 qt, shallow

Therefore, a standard 9x13 pyrex dish would be ';A2.1.2.;B2.1.1.;C2.1.', and we could find all 2qt glass pans with '%;A2.%B2.1.%;C2.%'

Re: Trees in SQL, adjacency lists and sorting.
created: 2006-01-05 16:31:12
"My single problem is that I figure out how to order my nodes once I've selected them. Give the above tree/table, my ideal order looks like: A, B, D, E, C, F"

So if you do SELECT * FROM table WHERE ancestor = 'A' you want the ordered list A, B, D, E, C, F. That sounds like an preorder traversal of the tree. Doing that in pure SQL is a bit byond me, but you mnight google for "sql preorder tree" to find something.

HTH

Re: Trees in SQL, adjacency lists and sorting.
created: 2006-01-05 20:56:11
Here's a link to another node about storing trees in SQL: Tree Structure and Db. It has lots of replies which link to articles about trees and sql. (Sorry I can't describe it more specifically -- I bookmarked it a while ago and still haven't gone back to read it.)
Re: Trees in SQL, adjacency lists and sorting.
created: 2006-01-05 21:42:43
This is, yet again, another node displaying a misunderstanding of relational theory. These things are sets, which are inherently unordered. Ordering is provided as an additional feature, like triggers. It's not part of the relational theory and, in fact, violates relational theory. That said ...
  1. I'm assuming you're going to have a row that says "A | NULL | 0". You need to represent the root somewhere. Implicit representation is not going to cut it.
  2. This is going to be an absolute nightmare to update. For example, if you remove a node and promote one of the children (a common operation in red-black trees, for example), You're going to have to recalculate in your application every single depth. Furthermore, if your application code is wrong in any way, the database isn't going to help you catch your error, particularly in the depth area.
  3. Trees are ordered with respect to siblings. In your case, you're assuming that B comes before C and D comes before E. How are you planning on capturing that in your schema? You're storing depth (which is easily derived and doesn't really help improving select speed), but you're not storing sibling sort order (which isn't derivable because you cannot depend on the order the RDBMS will store the rows).
  4. Assuming you have all of that, let's see what your sort would look like in Perl:
    my @nodes = map {
        $_->[0]
    } sort {
        # What goes here??
    } map {
        [ $_->{node_id}, $_->{depth}, $_->{sibling_order} ]
    } $sth->fetchall_hashref();
    

    If you wanted a level-order traversal, you'd use:

        $a->{depth} <=> $b->{depth}
            ||
        $a->{sibling_order} <=> $b->{sibling_order}
    
    Except, that gives you A-B-C-D-E-F-G, which isn't what you want.

    Take a look at [cpan://Tree], particularly the traversal code I wrote in order to provide traversals as an iterator into the tree versus an array (which is what you're trying to do in an ORDER BY clause). The code is non-trivial, particularly in that it requires maintaining a stack to keep track of what comes next.

In short, you cannot do it with ANSI SQL-1992, SQL-1999, or even SQL-2003. You would have to use a vendor-specific extension that allows user-defined comparators for sorting. I recommend PostgreSQL as MySQL, Oracle, Sybase, and SQL*Server all don't provide this. SQLite definitely doesn't provide this.


My criteria for good software:
  1. Does it work?
  2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re^2: Trees in SQL, adjacency lists and sorting.
BUU
created: 2006-01-05 23:17:21
So what's the Right Way to store a tree? You seem to have fairly strong opinions on the matter and I'm always willing to learn. As for the every single depth bit, if I delete a node I can update all its children in a single query.
Re^3: Trees in SQL, adjacency lists and sorting.
created: 2006-01-06 09:18:12
Data structures exist to organize data in ways that are useful for what we're going to do with it. For example, if all you're doing is storing a HoHoHoHoH that you want to walk but never search, then a database is an extremely poor solution and something like [cpan://DBM::Deep] is preferable. If you have a bunch of items that are on the same level and you want to do random-access searches based on arbitrary criteria, then a relational database is much better than [cpan://DBM::Deep].

How does this relate to trees? You have a perfectly fine way to store a tree in a database . . . depending on what you're trying to do with it. You're using a variant of the self-referential method which stores every single node-to-node relationship, not just the immediate ones. It looks like it would be extremely fast at finding all descendents and all ancestors of a given node, doing each one in one query. However, it will be no better than the standard self-referential method in finding all siblings of a given node.

It sounds like you really want to use the nested-sets method. It's where you don't store a reference to your parent node, but instead store a left and right bound which indicates where in the nested sets you stand. To find all your ancestors, you find all nodes where "my_id BETWEEN left AND right". To find all your descendents, you find all nodes where "id BETWEEN my_left AND my_right". It's no better than the standard self-referential in finding all siblings.

Both methods (yours and self-referential), it sounds, have near-equal benefits when it comes to speeding up searches. Yours is slightly better when doing inserts and updates at the cost of significantly more disk, but worse when doing deletes. That may be an acceptable tradeoff depending on your intended usage. For example, if your spec says you will never delete a node except in rare situations, this is probably a good direction to look at.

As for the every single depth bit, if I delete a node I can update all its children in a single query.

I don't think so. Let's say you have the tree:

       A
      /
     B
    /
   C
  /
 D
This translates to the following table:
 id | ancestor | depth
 A  |  NULL    |   0
 B  |  A       |   1
 C  |  A       |   2
 C  |  B       |   1
 D  |  A       |   3
 D  |  B       |   2
 D  |  C       |   1

Let's say you remove B. That means you have to do the following:

  1. Remove the entry where id == B
  2. Decrement the depth for all nodes which have an ancestor relationship for B, but only for those rows which have a depth greater than the distance between the node and B.
  3. Remove all entries where ancestor == B

Otherwise, you'll have an invalid depth in your table between some descendent of B and at least one of its ancestors.

Personally, I'm not sold on the depth thing. I think that storing every node-to-node relationship is definitely a solid strategy. I don't know what having the depth gets you. Personally, I'd use this as an adjunct table to the standard self-referential. Yes, you'd have duplicated information between the main table and the relationships table (the direct parent-child relationship would be duplicated), but if that's the optimization I need, then that's fine because the performance cost of the loss of normalization is made up by the speed gains I get with the caching.


My criteria for good software:
  1. Does it work?
  2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Trees in SQL, adjacency lists and sorting.
created: 2006-01-05 22:50:18
You might want to peruse this book.
Re^2: Trees in SQL, adjacency lists and sorting.
created: 2006-01-06 04:05:17
I cannot recommend Joe Celkos books highly enough. This book should give you what you need. I used the nested set method for making an FTP server with DB backend and it worked a treat.
Re: Trees in SQL, adjacency lists and sorting.
created: 2006-01-06 04:03:42

Seems to me that this strategy is only really suitable for small trees. For every node of depth N you will need (N+1)*(N/2) records. This means that for even a moderately sized tree you will be dealing with a lot of records.

Also, celko algorithm doesnt require reordering all the nodes. Only those perturbed by an insert (basically anything "right" of the inserted element". And the update required is simple and efficient. IMO, given that the latter would only hold N records for N nodes Id bet updating it would be more efficient that working with your structure due to scaling. Also, celko chose a simple partitioning logic, you need not use the same one, which would mean that updates following inserts need not occur at all.

Anyway, you haven't outlined what operations you need to perform on your trees and how often they occur. Without that info its hard to advise you beyond that your proposed algorithm looks unsatisfactory for much.

Also have you considered how much work needs to be done if you reparent a node? I should think such an operation would be woefully painful given your current structure.

BTW, a parent pointer representation is probably better.

  node_id, parent_id, root_id, depth

Can provide a fairly efficient structure. You can get the full tree by selecting on root_id, you can get children of a given node by selecting on parent_id and you can constrain by depth. Inserting nodes is O(1), deleting nodes is O(1), and you will have only N records for a tree of N nodes.

Celkos approach can be done by adding left_point,right_point fields to the parent pointer representation and gives the ability to do subheirarchy queries efficiently at the cost of some insert overhead. Generally trees are a data structure you want to query much more often than you want to insert into so normally this is an acceptable trade off.

---
$world=~s/war/peace/g

Re: Trees in SQL, adjacency lists and sorting.
created: 2006-01-06 05:33:13
The standard approach, despite the temptation to just use one table with an auto-foreign key, is:
CREATE TABLE node (node_name VARCHAR(32) NOT NULL)

CREATE TABLE node_node ( parent_fk VARCHAR(32) NOT NULL, child_fk VARCHAR(32) NOT NULL )
The next bit is DBMS specific but you also need to define the primary keys: (node.node_name) and (node_node.parent_fk, node_node.child_fk) and the foreign key constraints: node.node_name -> node_node.parent_fk and node.node_name -> node_node.child_fk.

The advantage of this over the single table solution is that it is normalised and can be extended into a multi-type, multi-tree model without having to keep adding foreign keys to the master table.

Re^2: Trees in SQL, adjacency lists and sorting.
created: 2006-01-06 14:25:36
The problem with this is that you can't do a single select query to get an entire branch of the tree in the correct order. You have to write something like a recursive function and you end up with one query per level of the tree.

Someone above mentioned "nested set" - this is what I use and it's great for quick lookups, at the cost of expensive (relatively) inserts. I've never had a database big enough where inserts have been a problem.
Re^3: Trees in SQL, adjacency lists and sorting.
created: 2006-01-09 05:01:04
You certainly can't do that with the single table solution either. Such a branch is iterative and therefore cannot be expressed as an RSE whichever RDBMS solution you choose.

Everything but the troll

Re^4: Trees in SQL, adjacency lists and sorting.
created: 2006-01-09 18:00:24
I beg to differ. In fact, here is an article about it: click
Re^5: Trees in SQL, adjacency lists and sorting.
created: 2006-01-10 04:35:38
About twelve years ago I had a similar idea to the one in the article for a multiple concurrent junior-senior structure but it isn't as simple as the article pretends. A whole family of stored procedures are in fact needed to support the complexity the writer either isn't seeing or isn't presenting and the developer is more likely to be told to come up with something simpler, especially if the DBMS doesn't have stored procedures.

Everything but the troll

Re^6: Trees in SQL, adjacency lists and sorting.
created: 2006-01-10 11:00:12
Actually I wrote a PHP class based on that article (and others that are similar) which requires no stored procedures and lets me pull out data with one query at the expense of slower inserts (it takes more than one query to insert data). I will be happy to post the source, although I have no perl counterpart.
Re^3: Trees in SQL, adjacency lists and sorting.
created: 2006-01-10 11:27:08

Thats not really true. Assuming your nodes can be ordered, and assuming you store a root id, you grab everything from the tree on or later to the node you are interested in. Then filter out the noncompliant nodes when you read them. One query sufficies then. Admittedly at the cost of some client side processing, but if that isnt a problem then it should be fine.

For the record this exactly what is doe in Recently Added Threads. We grab all the relevent nodes in two queries normally. (Roots then children as our roots arent the same type as the children).

---
$world=~s/war/peace/g

Re: Trees in SQL, adjacency lists and sorting.
created: 2006-01-06 08:54:57

Interesting node -- but I see a problem brewing with the rows that have 'distance to the next node' > 1. That's data that can be inferred, and if it gets entered wrong, or if someone monkeys with the database, the whole thing comes crashing down.

However, I think I understand what you're doing -- you're trying to pre-cook as much as possible for performance. Thus I'd probably have two tables: the first would have just nearest neighbours, and the second would have all of the inferred data. Then you could use a UNION to get the same performance as your original design.

Just some idle thoughts over my morning coffee.

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Trees in SQL, adjacency lists and sorting.
created: 2006-01-06 17:46:51
If you want ordered traversal, I'd add an 'order' column with sibling order, analogous to the depth column. Add a row ('a','a',0,0). select node_id from tree where ancestor='a' order by depth,order;

perlmonks.org content © perlmonks.org and Anonymous Monk, blahblahblah, BUU, demerphq, dragonchild, DungeonKeeper, Fletch, jhourcle, renodino, saberworks, simon.proctor, suaveant, talexb, traveler

prlmnks.org © 2006 edmund von der burg (eccles & toad)

v 0.03