A
/ \
B C
/ \ \
D E F
I store that in a table that looks like this:
node_id | ancestor | distance ----------------------------- B | A | 1 C | A | 1 D | B | 1 D | A | 2 E | B | 1 E | A | 2 F | C | 1 F | A | 2Note that nodes D, E and F have multiple rows in the table.
"A Look at SQL Trees" might be of interest; and there's more in the next month's article.
Not that it's really Perl related, though . . . :)
AAA AAB ... 000children of AAA are
AAAAAA AAAAAB ... AAA000etc... it limits how wide you can go and how deep, but it works well for querying...
I did the following... I had a unique varchar(255) that held the tree string and used 2 digit base 36 ids... obviously you could extend the character set, but I figured 1296 children per leaf was enough. If you only need binary then one char is fine, 0-1 or something... you could probably encode something, but that may make queries difficult... obviously with 2 char codes you can only get 127 levels in a tree with a 255 char field.
I used a couple of self written functions to turn the codes to and from numbers:
sub de36 {
my($a,$b) = (uc substr($_[0],-2,1),uc substr($_[0],-1,1));
$CHARS{$a}*36+$CHARS{$b};
}
sub en36 {
$CHARS[$_[0]/36].$CHARS[$_[0]%36];
}
To insert I would would first grab the max id within the group I wanted, something like
SELECT MAX(thread) FROM messages WHERE thread LIKE "${parent}__"
matching those threads that were in the same parent....
you can get all immediate children with stuff like
SELECT * FROM messages WHERE thread LIKE "${parent}__"
and you can get a history with something like
SELECT * FROM messages WHERE thread = LEFT("$thread",LENGTH($thread))
sorting is, in this case, super easy.
I am using MySQL syntax but it should be clear enough...
You'll have to work out how to make it binary yourself, if you need it... not terribly difficult, select and compare, then add accordingly... table locking could probably help, too. :)
Facet A : Subject A1 : Science A1.1 : Physics A1.1.1 : Optics Facet B : Grade Level B1 : High School B1.1 : 10th Grade
So, '10th grade optics' would be 'A1.1.1B1.1'. We use a delimitor between items, so we aren't restricted to subsets that each have a unique character in them, allowing for wider branching. (depth is still a problem when dealing with fixed line lengths)
The problem is, it's fine for humans to understand, but a little messy to sort on in SQL. You can get around this by adding another delimiter between facets, and ending every notation w/ a null heirarchy: ';A1.1.1.;B1.1.'. We can then get all items relating to 10th grade physics w/ LIKE '%;A1.1.%;B1.1.%', without fear of getting 'A1.10AB1.1', and still getting items that may have even more facets, eg a movie for 10th grade physics ';A1.1;B1.1;C1'
The only requirement is that you keep each of the facets in order. (what order doesn't matter, so long as you're consistent)
If you'd like a more complex example:
If we were to try to categorize cooking implements:
Facet A : Use A1 : Utensil A1.1 : Spoon A1.2 : Whisk A2 : Pan A2.1 : Roasting Pan A2.1.1 : Covered Roasting Pan A2.1.2 : Uncovered Roasting Pan Facet B : Material B1 : Metal B1.1 : Stainless Steel B1.2 : Aluminum B1.3 : Cast Iron B1.3.1 : Enameled Cast Iron B2 : Ceramic B2.1 : Glass B2.1.1 : Pyrex B2.2 : Stone Facet C : Size C1 : 1 qt C2 : 2 qt C2.1 : 2 qt, shallow
Therefore, a standard 9x13 pyrex dish would be ';A2.1.2.;B2.1.1.;C2.1.', and we could find all 2qt glass pans with '%;A2.%B2.1.%;C2.%'
So if you do SELECT * FROM table WHERE ancestor = 'A' you want the ordered list A, B, D, E, C, F. That sounds like an preorder traversal of the tree. Doing that in pure SQL is a bit byond me, but you mnight google for "sql preorder tree" to find something.
HTH
my @nodes = map {
$_->[0]
} sort {
# What goes here??
} map {
[ $_->{node_id}, $_->{depth}, $_->{sibling_order} ]
} $sth->fetchall_hashref();
If you wanted a level-order traversal, you'd use:
$a->{depth} <=> $b->{depth}
||
$a->{sibling_order} <=> $b->{sibling_order}
Except, that gives you A-B-C-D-E-F-G, which isn't what you want.
Take a look at [cpan://Tree], particularly the traversal code I wrote in order to provide traversals as an iterator into the tree versus an array (which is what you're trying to do in an ORDER BY clause). The code is non-trivial, particularly in that it requires maintaining a stack to keep track of what comes next.
In short, you cannot do it with ANSI SQL-1992, SQL-1999, or even SQL-2003. You would have to use a vendor-specific extension that allows user-defined comparators for sorting. I recommend PostgreSQL as MySQL, Oracle, Sybase, and SQL*Server all don't provide this. SQLite definitely doesn't provide this.
How does this relate to trees? You have a perfectly fine way to store a tree in a database . . . depending on what you're trying to do with it. You're using a variant of the self-referential method which stores every single node-to-node relationship, not just the immediate ones. It looks like it would be extremely fast at finding all descendents and all ancestors of a given node, doing each one in one query. However, it will be no better than the standard self-referential method in finding all siblings of a given node.
It sounds like you really want to use the nested-sets method. It's where you don't store a reference to your parent node, but instead store a left and right bound which indicates where in the nested sets you stand. To find all your ancestors, you find all nodes where "my_id BETWEEN left AND right". To find all your descendents, you find all nodes where "id BETWEEN my_left AND my_right". It's no better than the standard self-referential in finding all siblings.
Both methods (yours and self-referential), it sounds, have near-equal benefits when it comes to speeding up searches. Yours is slightly better when doing inserts and updates at the cost of significantly more disk, but worse when doing deletes. That may be an acceptable tradeoff depending on your intended usage. For example, if your spec says you will never delete a node except in rare situations, this is probably a good direction to look at.
As for the every single depth bit, if I delete a node I can update all its children in a single query.
I don't think so. Let's say you have the tree:
A
/
B
/
C
/
D
This translates to the following table:
id | ancestor | depth A | NULL | 0 B | A | 1 C | A | 2 C | B | 1 D | A | 3 D | B | 2 D | C | 1
Let's say you remove B. That means you have to do the following:
Otherwise, you'll have an invalid depth in your table between some descendent of B and at least one of its ancestors.
Personally, I'm not sold on the depth thing. I think that storing every node-to-node relationship is definitely a solid strategy. I don't know what having the depth gets you. Personally, I'd use this as an adjunct table to the standard self-referential. Yes, you'd have duplicated information between the main table and the relationships table (the direct parent-child relationship would be duplicated), but if that's the optimization I need, then that's fine because the performance cost of the loss of normalization is made up by the speed gains I get with the caching.
Seems to me that this strategy is only really suitable for small trees. For every node of depth N you will need (N+1)*(N/2) records. This means that for even a moderately sized tree you will be dealing with a lot of records.
Also, celko algorithm doesnt require reordering all the nodes. Only those perturbed by an insert (basically anything "right" of the inserted element". And the update required is simple and efficient. IMO, given that the latter would only hold N records for N nodes Id bet updating it would be more efficient that working with your structure due to scaling. Also, celko chose a simple partitioning logic, you need not use the same one, which would mean that updates following inserts need not occur at all.
Anyway, you haven't outlined what operations you need to perform on your trees and how often they occur. Without that info its hard to advise you beyond that your proposed algorithm looks unsatisfactory for much.
Also have you considered how much work needs to be done if you reparent a node? I should think such an operation would be woefully painful given your current structure.
BTW, a parent pointer representation is probably better.
node_id, parent_id, root_id, depth
Can provide a fairly efficient structure. You can get the full tree by selecting on root_id, you can get children of a given node by selecting on parent_id and you can constrain by depth. Inserting nodes is O(1), deleting nodes is O(1), and you will have only N records for a tree of N nodes.
Celkos approach can be done by adding left_point,right_point fields to the parent pointer representation and gives the ability to do subheirarchy queries efficiently at the cost of some insert overhead. Generally trees are a data structure you want to query much more often than you want to insert into so normally this is an acceptable trade off.
CREATE TABLE node (node_name VARCHAR(32) NOT NULL) CREATE TABLE node_node ( parent_fk VARCHAR(32) NOT NULL, child_fk VARCHAR(32) NOT NULL )The next bit is DBMS specific but you also need to define the primary keys: (node.node_name) and (node_node.parent_fk, node_node.child_fk) and the foreign key constraints: node.node_name -> node_node.parent_fk and node.node_name -> node_node.child_fk.
The advantage of this over the single table solution is that it is normalised and can be extended into a multi-type, multi-tree model without having to keep adding foreign keys to the master table.
Everything but the troll
Everything but the troll
Thats not really true. Assuming your nodes can be ordered, and assuming you store a root id, you grab everything from the tree on or later to the node you are interested in. Then filter out the noncompliant nodes when you read them. One query sufficies then. Admittedly at the cost of some client side processing, but if that isnt a problem then it should be fine.
For the record this exactly what is doe in Recently Added Threads. We grab all the relevent nodes in two queries normally. (Roots then children as our roots arent the same type as the children).
Interesting node -- but I see a problem brewing with the rows that have 'distance to the next node' > 1. That's data that can be inferred, and if it gets entered wrong, or if someone monkeys with the database, the whole thing comes crashing down.
However, I think I understand what you're doing -- you're trying to pre-cook as much as possible for performance. Thus I'd probably have two tables: the first would have just nearest neighbours, and the second would have all of the inferred data. Then you could use a UNION to get the same performance as your original design.
Just some idle thoughts over my morning coffee.
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
perlmonks.org content © perlmonks.org and Anonymous Monk, blahblahblah, BUU, demerphq, dragonchild, DungeonKeeper, Fletch, jhourcle, renodino, saberworks, simon.proctor, suaveant, talexb, traveler
prlmnks.org © 2006 edmund von der burg (eccles & toad)
v 0.03