Author: Arun NandaLast Updated: Wed, Jan 4, 2023
Handling complex data, like social networks and organizational hierarchies, is increasingly relevant in modern information systems. Modeling such data involves the use of data structures like trees and graphs. These data structures are multilayered and interconnected, like the underlying reality they represent. Programs written to access this data must navigate its complex structure to get useful information. This is done using recursive queries.
This is a hands-on guide. To benefit from it, you are expected to have prior practical experience with PostgreSQL. To test the examples, it is assumed you already have PostgreSQL running either on a standalone server or as a managed instance.
It is necessary to know how Common Table Expressions (CTEs) work in PostgreSQL. Recursion is based on the use of CTEs. Representing a graph in the form of tables leads to a normalized database design. So, it is useful to understand how views work to simplify
JOIN queries. It is helpful to have an understanding of basic computer science concepts, like recursion and data structures like graphs and trees.
Some functionality for recursive queries, such as the
CYCLE keyword, was introduced in PostgreSQL 14.0, released in 2021. The code samples in this guide are tested on PostgreSQL 14.5. They should be compatible with all PostgreSQL versions higher than 14.0.
This section is an overview of some basic concepts of tree and graph data structures.
Tree and graph data structures consist of nodes and edges. A node represents a specific entity, like a person, place, or component. An edge represents the connection between two nodes, such as the relationship between two people. Nodes are connected directly via a single edge or indirectly via other nodes and edges. Directly connected nodes are referred to as adjacent nodes. Adjacent nodes are often represented as pairs, like (Node1, Node2). Traversing a graph refers to visiting all or some related nodes of a graph, given a starting node.
In undirected graphs, edges do not have a direction. For example, "A is friends with B" is programmatically the same as "B is friends with A". In directed graphs, the direction of the edge is relevant. An edge from A to B is not the same as an edge from B to A. For example, "A follows B" is not the same as "B follows A". Directed node pairs are ordered: (A, B) is distinct from (B, A). Given a directed edge from A to B, A is the ancestor, and B is the descendant. A leaf node is a node with no descendants.
A cycle in a graph occurs when there is a direct or indirect path from a node back to itself. For example, in directed graphs, a cycle occurs when there is a path back from a descendant node to an ancestor node. Consider a directed graph with nodes A, B, and C. A is an ancestor of B, and B is an ancestor of C. A cycle occurs when C is connected back to A - thus, C becomes an ancestor of A. Note that in this case, a connection from A to C will not make a cycle.
Trees are a subclass of directed acyclic graphs (DAG). Trees have a single "root" node. They are hierarchical data structures - useful to represent things like the reporting structure of an organization as an orgchart. In a tree, each ancestor can have multiple descendants. Each descendant has only one ancestor. The depth of a node is the number of edges from that node to the root node.
Consider two examples:
Given an orgchart represented as a tree, you want to find all the people who are in the reporting chain of a particular manager. To do this, you need all the nodes that are directly or indirectly descendants of the given node.
For a social network represented as a graph, you want to find all the friends and friends of friends of a person. This is represented by all the nodes which are within 2 degrees of separation from the given node.
To extract this kind of information using regular SQL queries, first find all the nodes adjacent to the given node. Then find all the nodes adjacent to these nodes. Continue this process iteratively until you either:
reach the leaf nodes, or
traverse the desired depth.
For a small graph, it is possible, though laborious, to write multiple subqueries to do this sequentially. Real-world graphs are large and interconnected, with many layers between the root and leaf nodes. In many cases, the depth is not known in advance. Thus a semi-manual iterative approach, as described above, is not practical. Recursive queries solve this problem.
In a regular programming language, a recursive function repeatedly calls itself until the termination condition is met. The output of each iteration serves as the input to the next iteration. The initial value is given to the function and is called the seed value.
In SQL, a recursive query has two parts (sub-queries). The first sub-query is non-recursive. Its result is analogous to the seed value. The second sub-query (recursively) calls the main query. In each iteration, it combines the result with the results from previous iterations of the query.
Because of this structure, CTEs are uniquely suited to implement recursion. CTEs are used to organize complex queries into sub-queries which form the logical building blocks of the (main) query. A recursive CTE is identified by the use of the keyword
RECURSIVE before the name of the CTE.
To simplify the SQL examples, only directed graphs are considered throughout this guide. The initial examples are limited to DAGs (trees are a subclass of DAGs). Later examples show the features of CTEs in PostgreSQL that deal with cyclic graphs.
To illustrate the concepts, this guide uses a normalized data model consisting of nodes and edges.
node has an ID field. The
name field is a property of the node. The table
edge stores the IDs of two nodes that are adjacent (directly linked) to each other.
In real-world usage, you would also want to have additional columns with properties of the nodes and edges in the respective tables. For example, an edge denoting "is friends with" might have a property like "friends since." An edge between two cities might have properties about the distance and travel time between the cities.
Create a table
node with two columns -
CREATE TABLE node ( node_id INTEGER PRIMARY KEY, node_name VARCHAR NOT NULL );
Create a table
edge with two columns -
node2, both referencing foreign keys to the table
CREATE TABLE edge ( node1 INTEGER REFERENCES node (node_id), node2 INTEGER REFERENCES node (node_id), PRIMARY KEY (node1, node2) );
Insert test data into each of the tables. Create a few nodes (representing people):
INSERT INTO node (node_id, node_name) VALUES (1, 'Tom'), (2, 'Dick'), (3, 'Harry'), (4, 'Jane'), (5, 'Susan'), (6, 'Mary'), (7, 'Sam'), (8, 'Sally'), (9, 'Jack') ;
Create edges to define hypothetical relationships between the nodes:
INSERT INTO edge (node1, node2) VALUES (1, 2), (1, 8), (2, 3), (2, 4), (4, 5), (4, 6), (4, 7), (8, 9) ;
Tom is the ancestor of Dick and Sally; Dick is the ancestor of Harry and Jane; Jane of Susan, Mary, and Sam; and Sally is the ancestor of Jack. It is recommended to draw out (on paper) the relationship structure as a tree, using both IDs and names.
Using a regular (non-recursive) CTE, define a view
nodes_n_edges which contains the names and IDs of adjacent node pairs:
CREATE VIEW nodes_n_edges AS WITH cte1 AS ( SELECT n.node_id, n.node_name, e.node2 FROM node n JOIN edge e ON n.node_id = e.node1 ) SELECT cte1.node_id AS node1_id, cte1.node_name AS node1_name, node.node_name AS node2_name, node.node_id AS node2_id FROM cte1 JOIN node ON cte1.node2 = node.node_id;
This view is used in later sections to conveniently query the data structure after making changes. Using a view like this is more efficient than repeatedly joining the
SELECT * FROM nodes_n_edges;
Suppose you need to get the names of all the people who belong to the branch with the root node as Dick. This is analogous to traversing an orgchart to find all the people in Dick's reporting chain.
RECURSIVE keyword and create a recursive CTE named
-- incomplete code - do not copy this directly WITH RECURSIVE rcte AS (
Write the non-recursive sub-query to generate the seed. These are the nodes adjacent to Dick (people directly reporting to Dick):
-- incomplete code - do not copy this directly SELECT node1_id, node1_name, node2_name, node2_id FROM nodes_n_edges WHERE node1_name = 'Dick'
Write the recursive sub-query to
JOIN the entire query (
rcte) with the view
-- incomplete code - do not copy this directly SELECT ne.node1_id, ne.node1_name, ne.node2_name, ne.node2_id FROM rcte JOIN nodes_n_edges ne ON rcte.node2_id = ne.node1_id
UNION of the two sub-queries inside the (recursive) CTE
rcte created in Step 1. This is the recursive CTE to traverse the tree.
WITH recursive rcte as ( SELECT node1_id, node1_name, node2_name, node2_id FROM nodes_n_edges WHERE node1_name = 'Dick' UNION SELECT ne.node1_id, ne.node1_name, ne.node2_name, ne.node2_id FROM rcte JOIN nodes_n_edges ne ON rcte.node2_id = ne.node1_id ) SELECT node1_name, node2_name FROM rcte ;
The output of the CTE shown above is the list of node pairs in the sub-tree with its root at Dick.
It is helpful to see the path that links two indirectly connected nodes. Since a recursive CTE traverses all the nodes in order, it can also store the path taken to reach each node. To do this, start with the recursive query created in the last section. Initialize an array in the non-recursive sub-query. Each iteration of the recursive sub-query appends the next part of the path to this array.
WITH RECURSIVE rcte AS ( SELECT node1_id, node1_name, node2_name, node2_id, -- initialize the array ARRAY [node1_name, (select('-')), node2_name] AS traversal_path FROM nodes_n_edges WHERE node1_name = 'Dick' UNION SELECT ne.node1_id, ne.node1_name, ne.node2_name, ne.node2_id, -- append the path to the array rcte.traversal_path||(select(' ; '))||ne.node1_name||(select('-'))||ne.node2_name FROM rcte JOIN nodes_n_edges ne ON rcte.node2_id = ne.node1_id ) SELECT node1_name, node2_name, -- select and prettify the text of the path REPLACE(REPLACE(traversal_path::text, '"', ''), ',','') AS traversal_path FROM rcte ;
Associating a depth with each relationship helps to understand the depth of the subgraph and limit the query to a specific depth. For example, you could do this to find nodes within N degrees of separation from a given node. This is analogous to finding the friends and friends of friends of a person on a social network (graph).
To track the depth of the search, start with the query from the last section. Initialize a
depth variable to 1 in the non-recursive sub-query, and the recursive sub-query increments it on every iteration.
Initialize the non-recursive sub-query to start from the node Tom to traverse the entire graph (tree). Create a view on the graph traversal query. This view is also reused in later sections.
CREATE VIEW graph_structure AS WITH RECURSIVE rcte AS ( SELECT node1_id, node1_name, node2_name, node2_id, -- initialize a depth variable 1 AS depth, ARRAY [node1_name, (select('-')), node2_name] AS traversal_path FROM nodes_n_edges WHERE node1_name = 'Tom' UNION SELECT ne.node1_id, ne.node1_name, ne.node2_name, ne.node2_id, -- increment the depth variable rcte.depth + 1, rcte.traversal_path||(select(' ; '))||ne.node1_name||(select('-'))||ne.node2_name FROM rcte, nodes_n_edges ne WHERE rcte.node2_id = ne.node1_id ) SELECT node1_name, node2_name, REPLACE(REPLACE(traversal_path::text, '"', ''), ',','') AS traversal_path, depth FROM rcte ;
Check the traversal path and depths:
SELECT * FROM graph_structure ;
To limit the depth of the search, add a
WHERE clause on the
WITH RECURSIVE rcte AS ( SELECT node1_id, node1_name, node2_name, node2_id, 1 AS depth, ARRAY [node1_name, (select('-')), node2_name] AS traversal_path FROM nodes_n_edges WHERE node1_name = 'Tom' UNION SELECT ne.node1_id, ne.node1_name, ne.node2_name, ne.node2_id, rcte.depth + 1, rcte.traversal_path||(select(' ; '))||ne.node1_name||(select('-'))||ne.node2_name FROM rcte, nodes_n_edges ne WHERE rcte.node2_id = ne.node1_id ) SELECT node1_name, node2_name, REPLACE(REPLACE(traversal_path::text, '"', ''), ',','') AS traversal_path, depth FROM rcte -- limit the traversal depth WHERE depth < 3 ;
The previous section discussed recursive CTEs on trees. Trees are a special case of acyclic graphs. In a tree, each descendant has a single ancestor. In a graph, each descendant can have multiple ancestors.
In the current data, Sam has a single ancestor - Jane. Convert the Example Data Model from a tree to a graph by adding an edge from Tom to Sam. So, Sam now has two ancestors - Jane and Tom.
WITH tom AS ( SELECT node_id FROM node WHERE node_name = 'Tom' ) ,sam AS ( SELECT node_id FROM node WHERE node_name = 'Sam' ) INSERT INTO edge (node1, node2) VALUES ((SELECT node_id FROM tom), (SELECT node_id FROM sam)) ;
Recheck the view
SELECT * FROM graph_structure ;
The edge from Tom to Sam is now a part of the tree.
The query structure for trees used in the previous section works for acyclic graphs but not cyclic graphs. Note the order of nodes inserted in the above query - (Tom, Sam). If this order is reversed, it leads to a cycle.
In the data model so far, there are no cycles. Consider the path from Tom to Jane: Tom is an ancestor of Dick and Dick of Jane. Add a cycle in the data by connecting Jane back to Tom:
WITH jane AS ( SELECT node_id FROM node WHERE node_name = 'Jane' ) ,tom AS ( SELECT node_id FROM node WHERE node_name = 'Tom' ) INSERT INTO edge (node1, node2) VALUES ((SELECT node_id FROM jane), (SELECT node_id FROM tom)) ;
Recheck the traversal path using the view
graph_structure created earlier:
SELECT * FROM graph_structure ;
This query hangs because of an infinite loop. If you are using the terminal, interrupt it manually with
The infinite loop is created because every time the search reaches Jane, the path leads back to Tom. This cycle continues ad infinitum.
To avoid cycles, use the
CYCLE clause introduced in PostgreSQL 14.0. The
CYCLE clause is added at the end of the CTE before the
SELECT. It defines the columns to check for identifying a cycle. In this case, the presence of a cycle is identified based on the columns
node2 of the table
edge. These columns are referred to as
node2_id in the view
Start from the query previously written for the view
graph_structure in the section Calculate and Limit Traversal Depth. Add the
CYCLE keyword after the sub-queries of the CTE:
WITH RECURSIVE rcte AS ( SELECT node1_id, node1_name, node2_name, node2_id, 1 AS depth, ARRAY [node1_name, (SELECT('-')), node2_name] AS traversal_path FROM nodes_n_edges WHERE node1_name = 'Tom' UNION SELECT ne.node1_id, ne.node1_name, ne.node2_name, ne.node2_id, rcte.depth + 1, rcte.traversal_path||(SELECT(' ; '))||ne.node1_name||(SELECT('-'))||ne.node2_name FROM rcte JOIN nodes_n_edges ne ON rcte.node2_id = ne.node1_id ) -- cycle detection clause CYCLE node1_id, node2_id SET is_cycle USING path SELECT node1_name, node2_name, REPLACE(REPLACE(traversal_path::text, '"', ''), ',','') AS traversal_path, depth FROM rcte ;
Notice that after the path reaches Tom back from Jane, it (again) goes through the nodes linked to Tom. But it stops before reaching Jane for the second time, thus avoiding getting into an infinite loop.
Recursive queries make it possible to model, store, and analyze graph data structures on relational databases. The material in this guide covered the basic usage of recursive queries in PostgreSQL, particularly in the context of directed graphs. Practical applications of directed graphs include modeling social networks based on the "follower" relationship, modeling travel cost and distance to find the most efficient route covering different cities, and so on.
There are two ways of traversing graphs - breadth-first and depth-first. The official documentation on recursive CTEs explains how to use the
SEARCH keyword (introduced in PostgreSQL 14.0) to do this. It also contains a more detailed explanation of how recursive queries work internally. Additionally, it shows how to detect and avoid cycles without using the