Migrating hierarchical queries from Oracle to PostgreSQL

This is the second part in a series of blog posts describing PostgreSQL analogs of common Oracle queries

One of the most intricate Oracle specific constructions is "START WITH ... CONNECT BY". According to Oracle's documentation, the syntax is: SELECT [query] [START WITH initial_condition] CONNECT BY [nocycle] condition. This statement is commonly used to traverse hierarchical data in the parent-child order. It's easier to illustrate how it works with an example.

Consider a table that stores opponents moves in a game of chess. Each table row contain coordinates (in algebraic notation) of a single move by whites and the move in response by blacks, as well as a column that references a preceding move, making it possible to keep multiple continuations of a specific move for the post-game analysis.

CREATE TABLE moves(id integer, parent integer, white varchar(10), 
black varchar(10));

The following statements describe 2 variants of a very short game, the first one leading to the early checkmate (known as a scholar's mate), and the second one to the position where black successfully avoids being checkmated.

INSERT INTO moves VALUES(1, 0, 'e4', 'e5');
INSERT INTO moves VALUES(2, 1, 'Qh5', 'Nc6');
INSERT INTO moves VALUES(3, 2, 'Bc4', 'g6');
INSERT INTO moves VALUES(4, 3, 'Qf3', 'Nf6'); -- checkmate is avoided
INSERT INTO moves VALUES(5, 2, 'Bc4', 'Nf6');
INSERT INTO moves VALUES(6, 5, 'Qxf7#', NULL); -- blacks being checkmated

Let's build an Oracle query showing a sequence of moves that leads to the checkmate:

SELECT DISTINCT id AS final_move_id, 
LTRIM(SYS_CONNECT_BY_PATH(NVL(white,'')||':'||NVL(black,''),';'),';')||';' 
AS moves, LEVEL AS mate_in 
FROM moves WHERE white LIKE '%#' OR black LIKE '%#' START WITH id = 1 
CONNECT BY PRIOR id = parent;

FINAL_MOVE_ID MOVES MATE_IN
6 e4:e5;Qh5:Nc6;Bc4:Nf6;Qxf7#:; 4

The query instructs Oracle to look for a checkmate:

  • The search starts at the move with id = 1, as indicated in the START WITH clause, and considers all possible continuations that lead to a checkmate, denoted by the final '#' in the move's description.
  • Each move and its direct continuation, for instance, moves 2 and 5 represent the parent-child relationship, described by the PRIOR condition.
  • The search depth is stored in the LEVEL pseudo-column.

As a result, Oracle goes from one row to another only if the parent column of the new row contains the id of the current row, accumulating all visited rows in a result set. The SYS_CONNECT_BY_PATH clause produces a string out of the specified columns of the visited rows, connecting each (parent, child) pair by the designated character (';' in our case).

Being Oracle SQL extension, CONNECT BY is not available in PostgreSQL. Recent versions of PostgreSQL implement Common Table Expressions (CTE), SQL-standard way of dealing with hierarchical data. Here's one possible rewrite of the query above for PostgreSQL using recursive CTEs:

WITH RECURSIVE search_moves(id, parent, white, black, level, path, checkmate)
AS (
  SELECT id, parent, white, black, 1 AS level, 
  COALESCE(white,'')||':'||COALESCE(black,'')||';' AS path, 
  FALSE as checkmate FROM moves WHERE id = 1
UNION ALL
  SELECT m.id, m.parent, m.white, m.black, sm.level + 1 AS level,
  sm.path||COALESCE(m.white,'')||':'||COALESCE(m.black,'')||';' AS path, 
  CASE WHEN m.white LIKE '%#' OR m.black LIKE '%#' THEN true ELSE false END 
  AS checkmate FROM moves m, search_moves sm WHERE m.parent = sm.id
) SELECT id AS final_move_id, path AS moves, level AS mate_in 
FROM search_moves WHERE checkmate = true;
final_move_id moves mate_in
6 e4:e5;Qh5:Nc6;Bc4:Nf6;Qxf7#:; 4

The SELECT statement above fetches rows produced by the recursive CTE, defined in the WITH RECURSIVE block. This block consists of 2 parts, separated by the UNION ALL clause:

WITH RECURSIVE search_moves(id, parent, white, black, level, path, checkmate)
AS (
  SELECT id, parent, white, black, 1 AS level, 
  COALESCE(white,'')||':'||COALESCE(black,'')||';' AS path, 
  FALSE AS checkmate FROM moves WHERE id = 1
UNION ALL
  SELECT m.id, m.parent, m.white, m.black, sm.level + 1 AS level,    
  sm.path||COALESCE(m.white,'')||':'||COALESCE(m.black,'')||';' AS path, 
  CASE WHEN m.white LIKE '%#' OR m.black LIKE '%#' THEN true ELSE false END 
  AS checkmate FROM moves m, search_moves sm WHERE m.parent = sm.id
)

Let's break it down into smaller fragments and examine them in more detail. The first one is the recursion base case, adding the first row to the result set:

SELECT id, parent, white, black, 1 AS level, 
COALESCE(white,'')||':'||COALESCE(black,'')||';' AS path, 
FALSE AS checkmate FROM moves WHERE id = 1

This part is analogous to the START WITH clause in the Oracle statement above. The second part contains rules to return the whole result set out of the rows already added to it and those from the moves table that matches the WHERE condition:

SELECT m.id, m.parent, m.white, m.black, sm.level + 1 AS level,
sm.path||COALESCE(m.white,'')||':'||COALESCE(m.black,'')||';' AS path, 
CASE WHEN m.white LIKE '%#' OR m.black LIKE '%#' THEN true ELSE false END
AS checkmate FROM moves m, search_moves sm WHERE m.parent = sm.id

This condition is equivalent to the Oracle's PRIOR clause, choosing the new row to add to the result set: the row should be a direct child of another already added row.

In order to highlight the sequence of moves leading to a checkmate 'white' and 'black' attributes of rows added to the result set are concatenated into a string containing attributes from previously added rows, producing the output similar to Oracle's SYS_CONNECT_BY_PATH:

sm.path||COALESCE(m.white,'')||':'||COALESCE(m.black,'')||';'

Note that there is no special LEVEL pseudo-column in PostgreSQL: instead, an arbitrary counter, named 'level' out of convenience, is incremented for each transition from parent to a child row.

Here's another example. Suppose we have a table describing bus routes, each row containing 2 consecutive stops and a route they belong to:

CREATE TABLE segments(start_id integer, end_id integer, route_id integer);
INSERT INTO segments VALUES(1,2,10);
INSERT INTO segments VALUES(1,2,1);
INSERT INTO segments VALUES(2,8,1);
INSERT INTO segments VALUES(2,3,10);
INSERT INTO segments VALUES(3,5,10);
INSERT INTO segments VALUES(5,8,10);
INSERT INTO segments VALUES(8,1,10);

It's possible to extract all bus stops in order belonging to a specific route using the technique described above. There is a caveat, though: by following the stops along the way sooner or later the algorithm reaches the starting point and, unless special precautions are taken, loops infinitely. One can prevent it (starting from Oracle 10g) by adding the NOCYCLE clause:

SELECT * FROM (SELECT DISTINCT route_id, LEVEL, 
SYS_CONNECT_BY_PATH(start_id, '/') "route" 
FROM segments WHERE route_id = 10 START WITH start_id = 1 
CONNECT BY NOCYCLE start_id = PRIOR end_id AND route_id = PRIOR route_id 
ORDER BY level DESC) WHERE ROWNUM <= 1;

ROUTE_ID LEVEL route
10 5 /1/2/3/5/8

Consider the analogous PostgreSQL query without the clause that detects cycles:

WITH RECURSIVE routes(route_id, level, route, start_id, end_id) AS (
  SELECT route_id, 1 AS level, '/'||start_id AS route, start_id, end_id 
  FROM segments WHERE start_id = 1
UNION ALL
  SELECT cs.route_id, ps.level +1 AS level, ps.route||'/'||cs.start_id 
  AS route, cs.start_id, cs.end_id FROM segments cs, routes ps 
  WHERE cs.start_id = ps.end_id AND cs.route_id = ps.route_id
) SELECT route_id, level, route FROM routes 
  WHERE route_id = 10 ORDER BY level DESC LIMIT 1;

The query above works correctly if there are no cycles in the data hierarchy. For the segments table it loops until a user cancels it. Let's change it to handle cycles correctly:

WITH RECURSIVE routes(route_id, level, route, cycle, start_id, end_id) AS (
  SELECT route_id, 1 AS level, array[start_id] AS route, false AS cycle, 
  start_id, end_id FROM segments WHERE start_id = 1
UNION ALL
  SELECT cs.route_id, ps.level +1 as level, ps.route || cs.start_id as route, 
  cs.start_id = ANY(ps.route) as cycle, cs.start_id, cs.end_id 
  FROM segments cs, routes ps WHERE cs.start_id = ps.end_id AND 
  cs.route_id = ps.route_id AND cycle = false
) SELECT route_id, level, '/'||array_to_string(route,'/') AS route 
  FROM routes WHERE cycle = false AND route_id = 10 
  ORDER BY level DESC LIMIT 1;
route_id level route
10 5 /1/2/3/5/8

The difference between the last two queries is the addition of the array to store stops visited along the route:

  • Each new stop is compared to the contents of this array (using the ANY operator).
  • If the stop id is already present in the array, then a cycle is detected and subsequent searches along the corresponding route are stopped.
  • Otherwise, the stop is added to the array and the execution continues.

This actually works a bit differently from Oracle's CONNECT BY NOCYCLE (but similarly to the newer cycle detection code in Oracle 11g R2): we can only detect a cycle upon entering one and, as a result, the first row of a cycle is also saved by CTE and can be retrieved by lifting off the cycle = false restriction in the resulting SELECT.

Common Table Expressions are available in PostgreSQL 8.4 and above. For the earlier versions, there is tablefunc contrib module, containing the function that emulates most of the connect by functionality. While not as general as recursive CTEs, it nevertheless brings the power of hierarchical queries to earlier versions of PostgreSQL.

Latest version of Oracle Database (11g R2) includes SQL standard recursive CTEs, making possible to port CTE queries from 11g R2 to PostgreSQL with only minor changes. There are a couple of useful extensions implemented by Oracle, for instance, it's possible to do depth-first or breadth-first search and there exists a mechanism for cycle detection not yet implemented in PostgreSQL (which can be emulated as demonstrated in the examples). There are CTE extensions in PostgreSQL as well: version 9.1 introduced writable CTEs, allowing INSERTs, UPDATEs and DELETEs with RETURNING in the WITH blocks, making it possible to combine results of multiple DML queries and SELECTs in a single query.

Finally, despite the fact that this post is about Oracle to PostgreSQL conversion, I can't consider it complete without mentioning some interesting applications of recursive CTEs: there is an awesome library called pgchess written by Gianni Ciolli that actually allows a user to play chess against the PostgreSQL database, Oracle and PostgreSQL queries to solve a sudoku puzzle and a number of other interesting examples one can find in the snippets section of the PostgreSQL wiki. Have fun and stay tuned for further posts!