Recursive Queries, Lateral Joins, Pagination, JSONB Arrays | Scaling Postgres 127

In this episode of Scaling Postgres, we discuss recursive queries, using lateral joins, paging through data sets efficiently, and working with JSONB arrays.

Transcript

All right. I hope you, your family, friends, and coworkers continue to do well. Our first piece of content is "PostgreSQL 12.4, 11.9, 10.14, 9.6.19, 9.5.23, and 13 Beta 3 Released!". This is from postgresql.org. Now there are two security issues. One is an uncontrolled search path element in logical replication and the second is an uncontrolled search path element in CREATE EXTENSION. So be sure to check those out to see if you need to upgrade sooner rather than later, depending on how you use Postgres. Then at the bottom, they mention all of the different bug fixes and improvements that were made. So review those and decide when you want to upgrade your version of Postgres.

The next piece of content is "UNDERSTANDING RECURSIVE QUERIES IN POSTGRESQL". This is from cybertec-postgresql.com. So this is looking at recursive queries and how you do it in Postgres via CTEs. So first they review common table expressions and basically convey it as a view that only exists within the scope of a single query because you can usually always write a CTE as a subquery. Then they cover the syntax of a recursive query. So you use WITH RECURSIVE, the name of your CTE, and then its definition. How you do the definition is with two queries combined with a UNION or a UNION ALL. The first one is the non-recursive branch and the second one is the recursive branch that references itself. So therein lies the recursion. Then lastly you say SELECT ALL from the CTE you've defined and they go a little bit over how recursive queries are processed internally. 

They also go a little bit over how recursive queries are processed internally: it uses a working table, so it really works through things procedurally. Then it goes into an example, and of course, a lot of recursive examples use people managing people in a hierarchy, so that's what they use here. You have this MGR column that defines who the manager of each person is. What they want to find is all of the subordinates of person 7566. So the non-recursive branch selects the employee whose number is 7566. Then the recursive branch selects from the table, joining it to the CTE where the manager ID equals the employee ID. Then they show the complete query building out the CTE, and the results are what you'd expect. Then they show how you can add computed columns, as long as you use UNION ALL.

Here, at each step of the recursion, they add one to a level column they define, so you can see a first-level subordinate, a second-level subordinate, and so on. Next they show building a path, so that you can say, for example, that Jones supervises Ford, who supervises Smith. Basically, in the recursive branch they append the next employee's name onto the path column they've defined. Lastly, they go into how this compares to doing it in Oracle. They mention that Oracle does not conform to the SQL standard, but it has its own way of doing this. They also look at computing a Fibonacci sequence using recursion. So if you want a refresher on recursive queries, definitely check out this post.
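To make that concrete, here's a hedged sketch assuming the classic emp table (empno, ename, mgr) the post appears to be using; the actual queries in the article may differ:

```sql
-- Find all subordinates of employee 7566, tracking the depth (level)
-- and the chain of names (path) at each step of the recursion.
WITH RECURSIVE subordinates AS (
    SELECT empno, ename, 0 AS level, ename::text AS path
    FROM emp
    WHERE empno = 7566                       -- non-recursive branch: the starting manager
    UNION ALL
    SELECT e.empno, e.ename, s.level + 1, s.path || ' -> ' || e.ename
    FROM emp e
    JOIN subordinates s ON e.mgr = s.empno   -- recursive branch: direct reports of the previous level
)
SELECT * FROM subordinates;
```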

The next piece of content is "Iterators in PostgreSQL with Lateral Joins". This is from crunchydata.com. So a lateral join, as they say here, is that "...the output will apply the right-hand part of the join to every record in the left-hand part of the join". So they use this example of a generate_series and you're generating a series of numbers from 1-4. So in the left column, you'll see 1234. But then they do a cross_join_lateral to another generate_series that goes 1 through whatever the value is in the generate_series. So for example, when the left column is 1, it's going to generate a series from 1 to 1. So you've got 1 and 1 because there's nothing other than the 1. 

When the left-hand value is 2, the lateral generate_series runs again and produces two rows, 1 and 2. When you have 3 in the left-hand column, it generates three rows; when you have 4, it generates four rows; and so on and so forth. Now, they use functions as examples here, including one showing movie recommendations based on a user ID, returning a name and a rank. You could also do this as a subquery; here they're picking the five most popular streams by zip code. So this was a relatively brief post giving a good explanation of lateral joins. They even reference an online class that Crunchy Data has put together to help understand it better. So if you want to learn more about lateral joins in PostgreSQL, check out this blog post.
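Here's a minimal version of the generate_series example described above, which you can run as-is:

```sql
-- For each value i produced by the outer series, the lateral subquery runs
-- again and produces i rows (1 through i) on the right-hand side.
SELECT i, j
FROM generate_series(1, 4) AS i
CROSS JOIN LATERAL generate_series(1, i) AS j;
```

The same pattern covers top-N-per-group queries like the five most popular streams per zip code, with the lateral subquery doing its own ORDER BY and LIMIT for each outer row.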

The next post is "Keyset Pagination". This is from momjian.us. So this is basically how to do Keyset Pagination, paging through a data set. Now, there was a previous post, and the way he does his blog post is on the same page as the next post. So this one posted is called Pagination Tips and covers a couple of different ways to do it, but I would probably only choose the Keyset Pagination that he goes into more detail about here for performance reasons. Basically using offset with a limit is really poor on performance as you get to larger and larger records. 

How he does it is basically remembering the last product ID that was shown and then selecting records with an ID greater than that, with a limit. Now, the reason he uses a limit of 11 when he wants to display 10 is to know whether there's a next page to show. Then he talks about a method using a CTE to potentially be able to page backward. But there's usually a disconnect between good performance and actually doing things by pages, because it gets very complicated: what happens if a record you're looking at is deleted, or a new record is inserted in its place for whatever reason? How would you handle that in the paging? Do you need to cache values in the application? So there are a lot of things to consider when doing pagination.
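Still, the core keyset query itself is simple. Here's a hedged sketch using a hypothetical products table; his actual table and column names may differ:

```sql
-- Keyset pagination: remember the last product id shown on the previous page,
-- then fetch the rows after it. LIMIT 11 pulls one extra row purely to detect
-- whether a further page exists (display 10, and if an 11th row came back,
-- show a "next page" link).
SELECT *
FROM products
WHERE id > 10010      -- last id from the previous page
ORDER BY id
LIMIT 11;
```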

Now, I'm also going to reference two other blog posts that cover this in a lot of detail and have their own recommendations. One is a post called "Pagination with Relative Cursors", and essentially it's keyset pagination. Basically, they are saying you want to avoid LIMIT with OFFSET when you have large offsets because the performance is just horrible. They show that once you have an offset of 100,000, the query takes over 2 seconds, and it continuously gets worse the larger the offsets get. Now, they're calling it relative cursor pagination, but basically it's the same technique: you're using an ID and a limit, so you're selecting the products above a particular point. They mention the disadvantage is that it's hard to go to a specific page: you can go forward and backward, but you aren't really able to jump to, say, page three or page six. Then they talk about how you can even sort and skip records by a column other than the ID, and they show a technique for doing that.

Now, a third post related to this is "Paging Through Results" from use-the-index-luke.com, and he covers different databases. For PostgreSQL, he again shows that the least efficient approach is using an OFFSET and a LIMIT. Now, he uses the SQL-standard syntax FETCH NEXT 10 ROWS ONLY, but again, he uses the same technique of selecting rows greater than or less than some particular date, with a limit. He says you can use row values syntax in Postgres to compare a date and an ID at the same time. He shows in a great graph at the bottom how the performance with OFFSET gets progressively worse as you go up in page numbers, whereas the performance using what he calls a seek method, or keyset pagination, stays pretty good no matter how large the page number gets. So if you're interested in designing an efficient way of paginating in your application, definitely check out these blog posts.
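For reference, here's a hedged sketch of that row-values technique; the table and column names are hypothetical, not taken from his post:

```sql
-- Keyset pagination on a date column with the id as a tie-breaker.
-- The row-value comparison (created_at, id) keeps rows that share the same
-- timestamp as the last one seen from being skipped or repeated.
SELECT *
FROM sales
WHERE (created_at, id) < ('2020-08-01', 4711)   -- values taken from the last row of the previous page
ORDER BY created_at DESC, id DESC
FETCH NEXT 10 ROWS ONLY;
```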

The next piece of content is "Working with a JSONB Array of Objects in PostgreSQL". This is from levelup.gitconnected.com Now, this is an application developer describing how to work with JSONB in their data, and he uses a number of different JSONB functions to do it. His example is having purchases in an array of products in the JSONB field. Now, I wouldn't necessarily advocate this particular data model, so I'm not sure if I would follow this data model myself, but as an example of how to manipulate the results and use all the different JSONB functions, this was a great post to read through to learn more about that. So if you're interested, check out this post.

The next piece of content is "Running multiple PgBouncer instances with systemd". This is from 2ndquadrant.com. So PgBouncer runs as a single process, and it only uses one CPU or one CPU core. So if you have a multicore machine, PgBouncer is not able to use all of the cores of those machines. Now, what people typically do is set up multiple PgBouncer services on different ports. So you have to run different services to be able to leverage all the CPUs. However, this has a technique of running multiple PgBouncer instances without having to do that using a common port. So they cover an example of the unit file for the systemd configuration as well as the PgBouncer configuration that he's using. 

There's a setting, so_reuseport, that you can enable to be able to run multiple instances on the same port. He made a modification to the PgBouncer services using the systemd template system. After making these modifications, you're able to start PgBouncer service one and PgBouncer service two; he says they don't even have to be numbers, they could be some other identifier that you use. Now, he says the problem with this particular setup is that the Unix-domain sockets can't be shared, but he has a solution for that: you can set up socket activation, which is supported in PgBouncer 1.14, and he shows the configuration for that.
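To give a rough idea of the multiple-instances part (not the socket-activation part), here's a sketch under my own assumptions rather than the exact files from the post; so_reuseport needs a PgBouncer build and kernel that support SO_REUSEPORT:

```ini
; pgbouncer.ini (sketch): with so_reuseport enabled, several PgBouncer
; processes can bind the same listen_port and the kernel spreads incoming
; connections across them.
[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
so_reuseport = 1
```

```ini
# /etc/systemd/system/pgbouncer@.service (sketch): a systemd template unit,
# so "systemctl start pgbouncer@1 pgbouncer@2" runs one instance per suffix,
# each pointed at its own per-instance ini file.
[Unit]
Description=PgBouncer instance %i

[Service]
Type=simple
ExecStart=/usr/bin/pgbouncer /etc/pgbouncer/pgbouncer-%i.ini
```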

What socket activation gives you is a common port on that system for all clients to connect to, plus per-instance TCP/IP and Unix-domain sockets for administration and monitoring. So basically, you can start the services; each one gets its own port for administering that instance, but they all share the same connection port for connecting to the database. They show that here: production traffic can just connect to the common port, whereas if you want to administer a particular PgBouncer instance, you use the port defined when you started that service. So this is a pretty interesting technique that I'm definitely going to keep in mind if I'm ever setting up multiple PgBouncer instances on a single server. If you're interested in getting more details, definitely check out this post.

The next piece of content is "Partitioning Improvements in PostgreSQL 13". This is from highgo.ca, and they first talk about the history of all the improvements that have been made to partitioning over all the different versions. And in 13, they talk about partition wise join improvements, partition pruning improvements, as well as before row-level triggers. So if you want to get all the details and examples of queries on how to check out these improvements, definitely check out this post.

The next post, also from highgo.ca, is "PostgreSQL High Availability: The Considerations and Candidates". So they're looking at what people generally want when they're talking about a High Availability PostgreSQL solution. They came up with a set of requirements or considerations, as well as what High Availability actually means: for example, there should not be any single point of failure in the system, there should be continuous health monitoring of back-end servers and systems, and there should be reliable failover in case of failure. They also list four High Availability solutions that they're going to evaluate against these requirements. The first is Pgpool-II. The second is PostgreSQL Automatic Failover, or PAF. The third is repmgr, the Replication Manager. The fourth is Patroni. They give a very basic summary of each of these solutions, and in future blog posts they're going to examine how each one meets the High Availability needs of Postgres. So if you're interested, you can check out this post and look forward to the future ones covering each solution.

The next piece of content is "Consensus based failover with Pgpool-II". This is from highgo.ca. So this is talking about failover and how you can do it with Pgpool II based upon their watchdog feature and using consensus to do it and showing the different values that you can configure for it. So if you're interested in doing that, definitely check out this blog post.

The next post is "Building a recommendation engine inside Postgres with Python and Pandas". This is from crunchydata.com. This was an interesting post where they actually were building a basic recommendation system, but doing it within PostgreSQL and using functions to do it. If you're interested in checking that out, check out this post.

The last piece of content, the PostgreSQL person of the week is Michael Brewer. So if you're interested in learning more about Michael and his contributions to PostgreSQL, definitely check out this blog post.
