Recursive CTEs, DB Access Patterns, Dates & Times, Cluster | Scaling Postgres 106
In this episode of Scaling Postgres, we discuss recursive CTEs, database access patterns, date and time data types, and how to cluster your data.
Content Discussed
- Recursive SQL Queries with PostgreSQL
- The Troublesome Active Record Pattern
- PostgreSQL Date and Time Data Types Explained
- CLUSTER: Improving PostgreSQL performance
- Is Postgres running?
- [YouTube] Southern California Linux Expo Channel
- Creating a PostgreSQL procedural language – Part 5 – Returning Results
- Bruce Momjian
- Spatial Tile Serving with PostgreSQL Functions
- PostgreSQL GSSAPI Authentication with Kerberos part-1: how to setup Kerberos on Ubuntu
- pg_timetable: Advanced PostgreSQL cron-like scheduler released!
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Recursive SQL Queries with PostgreSQL". This is from the towardsdatascience.com blog. Basically, they're talking about Recursive CTEs, or common table expressions. He first talks about why you would want to use a recursive query and gives a few examples. The first is where you have self-referential data: for example, an employee table where one of the columns indicates who that person's manager is, or some other relationship of that type. Or you have a category that belongs to another category, referenced within the same table. The next example is where you have a tree of information: some sort of taxonomy system, links between articles, and things of that nature. Now, what I really like about this post is that it starts from a basic standpoint and explains everything going on. I would say it's one of the best Recursive CTE posts I've seen in a while.
Now, he first discusses what a CTE is, showing the WITH statement that lets you compartmentalize a sub-query so complex queries are easier to reason about. Then he shows the self-referential employee table, where a manager ID column points to the primary key of the person who is that employee's manager, and he inserts some sample data. You can see the CEO has no manager ID, whereas all of the others are managed by someone else in the chart. Then he goes into recursion, basically how you use a Recursive CTE. He gives a basic example where the base case selects the value one and a UNION ALL adds one on each pass, and when you run the query it prints out ten rows.
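As a rough sketch of that base-case-plus-recursive-step shape (not the post's exact query, but the same pattern):

```sql
WITH RECURSIVE counter AS (
    SELECT 1 AS n                            -- base case
    UNION ALL
    SELECT n + 1 FROM counter WHERE n < 10   -- recursive step
)
SELECT n FROM counter;                       -- returns the rows 1 through 10
```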
He then uses the employee table he created to get a manager tree. He uses the base case of Alice, the VP, and unions it with a query that increments a level counter, starting at one, to find who each person manages. You can see the results of this Recursive CTE, with the staff who are on the second level and those on the third, and he even gives a convenient diagram so you can see what the output means graphically (a minimal sketch of this pattern appears below). He also computes the degrees of separation between two employees. Lastly, he looks at graphs, defining the relationships that exist between multiple points. He sets that up in a Recursive CTE, and the output traces the paths between the different points. So if you want to learn more about Recursive CTEs, I highly suggest checking out this blog post.
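Here is that manager-tree sketch; the table, columns, and starting name are illustrative assumptions, not necessarily the post's exact schema:

```sql
CREATE TABLE employees (
    id         serial PRIMARY KEY,
    name       text NOT NULL,
    manager_id integer REFERENCES employees (id)  -- NULL for the CEO
);

WITH RECURSIVE reports AS (
    -- base case: start from the VP
    SELECT id, name, 1 AS level
    FROM employees
    WHERE name = 'Alice'
    UNION ALL
    -- recursive step: pull in each person's direct reports
    SELECT e.id, e.name, r.level + 1
    FROM employees e
    JOIN reports r ON e.manager_id = r.id
)
SELECT name, level FROM reports ORDER BY level;
```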
The next piece of content is "The Troublesome Active Record Pattern". This is from calpaterson.com. Now there is a Ruby library called Active Record, but what he's referring to here is the actual Active Record pattern and he mentioned the definition here in Martin Fowler's definition. Basically, converting a whole programmable object into the primitives that are stored in a database table as a row. So basically he has an issue with the pattern itself, not explicitly saying the Ruby on Rails Active Record library, although he does have issues with ORMs themselves because he does mention here issues with object-relational mappers. Now, some of the problems that he sees is that typically by default it retrieves the entire author object. Now actually that is what happens with an SQL query.
If you're using SELECT *, you're going to be pulling all of the columns from that table. So as much as you should narrow down what you retrieve in SQL, if you're going through an ORM you should likewise select only the columns you need. Then he mentions that the entire book object has to be retrieved for each author, because the pattern does it as a nested loop. But of course, in many ORMs you can use joins and send a single query like the sketch below. His third point is that the nested loop precludes much parallelism. So yes, there are problems, but the reason I brought this post up is that I still use an ORM because it's efficient for development, and I know how to use it to generate SQL that is efficient too.
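A minimal sketch of the joined, column-restricted query, using a hypothetical authors/books schema:

```sql
-- Let the database do the join and return only the needed columns,
-- instead of looping over author objects in application code.
SELECT a.name, b.title
FROM authors a
JOIN books b ON b.author_id = a.id
WHERE a.country = 'GB';
```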
So you don't want to use an ORM blindly; understand how to use it so that you can restrict how many columns you pull back and join to a table rather than pulling more data than you need. Basically, in my opinion, you can use an ORM, but you need to use it intelligently: know how the database works and use it efficiently. Now, he also mentions the issue with transactions, where you get a record, change the number of copies in stock, and then save it. Of course, no locking has happened there.
So another process could have already updated the stock, and now you're writing back a wrong value, because the value you retrieved earlier was never locked. He has an example of how you would want to implement this. You can even do it in a single UPDATE query using raw SQL (sketched below), and there are ways to send that kind of statement through an ORM if you need to. Then he also discusses RESTful APIs. So while this post isn't PostgreSQL-specific, he covers things you need to be aware of when using tools such as ORMs that make certain operations more convenient; you just want to know how to use them correctly to avoid performance and data problems.
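A minimal sketch of the single-statement approach, with a hypothetical books table:

```sql
-- The decrement happens atomically inside the database, so concurrent
-- sessions can't overwrite each other with stale values.
UPDATE books
SET copies_in_stock = copies_in_stock - 1
WHERE id = 42
  AND copies_in_stock > 0;   -- guard against overselling
```

Alternatively, a SELECT ... FOR UPDATE inside a transaction locks the row while you read and modify it.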
The next piece of content is "PostgreSQL Date and Time Data Types Explained". This is from nanmu.me. So this basically just goes over date and data types and how they work. Now that first covers a timestamp and the difference between a timestamp with a time zone and a timestamp without a time zone, and how most people encourage using timestamp with a time zone. It gives you the code here because it handles time zone things for you and you can make your programs easier to work with. Now, you don't always have to have a set session here. You can do it within the query itself to define what the time zone is. But for his demonstration purposes, he just shows it here. If you have a timestamp with the time zone included, it can do the time zone translation for you.
There's also timestamp without a time zone, if you want to rely on, say, your application framework to do the time zone conversions. He covers timestamps thoroughly and then goes into dates, which exclude the time component. He shows how you can compare a timestamp to a date, cases where it's true and cases where it's false, and how you can convert between them (a small sketch of both behaviors follows). Then he goes into times of day and intervals, and follows up with time ranges. If you want a refresher on using dates and times with Postgres, this is definitely a blog post to check out.
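Here's a minimal sketch of those two behaviors; the literal values are illustrative:

```sql
-- timestamptz stores an absolute instant; the session time zone
-- only affects how that instant is displayed.
SET TIME ZONE 'UTC';
SELECT TIMESTAMP WITH TIME ZONE '2020-03-30 12:00:00+00';
-- 2020-03-30 12:00:00+00

SET TIME ZONE 'America/New_York';
SELECT TIMESTAMP WITH TIME ZONE '2020-03-30 12:00:00+00';
-- 2020-03-30 08:00:00-04  (same instant, displayed differently)

-- Comparing a plain timestamp to a date promotes the date to midnight.
SELECT TIMESTAMP '2020-03-30 13:30:00' >= DATE '2020-03-30';  -- true
SELECT TIMESTAMP '2020-03-29 23:59:59' >= DATE '2020-03-30';  -- false
```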
The next piece of content is "CLUSTER: IMPROVING POSTGRESQL PERFORMANCE". This is from cybertec-postgresql.com and they're talking about CLUSTER, which is a tool that reorders the data based on an index. So the reason why you would potentially do this is that physically ordered data on the disk, a lot of times it's easier to do lookups or faster and more efficient to do lookups. Now for this, he created a test table and did a generate_series of 10 million rows. Then he did a second table that randomized the order of it. So the tables were the same size and included the exact same data. They are just in a different order. He created an index on each of the tables and you can even see the random index takes longer to generate because data is mixed all around on the disk.
He ran a vacuum and then looked at how long it took to read a range of blocks. Now, all of this was pretty much served from memory, so you're not seeing disk access costs here, but the sequentially ordered table came in at about 12 milliseconds, whereas the randomized one was about 13 milliseconds. Again, that's because it didn't necessarily need to go to disk. But look at the shared buffer hits: it only had to touch 130 buffers when the data was sequential, whereas it had to touch over 18,000 when it was randomized.
Now, if you then run the CLUSTER command on that index, the query is still about 13 milliseconds, but the buffers read drop down to 135. So basically, if you know a table is typically accessed through a particular index, it can be beneficial to use CLUSTER to order its rows on disk (the whole experiment is sketched below). The disadvantage of CLUSTER is that it doesn't maintain that order: if you do a lot of updates, the rows start spreading around again and you'll need to run CLUSTER again. It also can't be done while the table is active, since it needs an exclusive lock on the table. But if you're looking to boost performance, maybe, as he mentions, for a more analytical database or one with fairly static data, it could be beneficial.
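A condensed sketch of the experiment; the table and index names are my own, and the row count matches the post:

```sql
-- Two copies of the same 10 million rows: one sequential, one shuffled.
CREATE TABLE t_seq  AS SELECT id FROM generate_series(1, 10000000) AS id;
CREATE TABLE t_rand AS SELECT id FROM t_seq ORDER BY random();

CREATE INDEX idx_seq  ON t_seq (id);
CREATE INDEX idx_rand ON t_rand (id);
VACUUM ANALYZE t_seq;
VACUUM ANALYZE t_rand;

-- BUFFERS shows how many shared blocks each range scan touches.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t_rand WHERE id < 10000;

-- Rewrite the table in index order (takes an exclusive lock).
CLUSTER t_rand USING idx_rand;
```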
The next post is "Is Postgres running?". This is from mydbanotebook.org and it's just a post of what you can check to see if Postgres is running. Because if there's a problem that you can't connect, that's probably the first thing you're going to look for. And he just has this very quick PS command to look and see what the Postgres processes are. And once you know what processes are running, you can actually reference the directory issued for the postmaster to kind of find where the PostgreSQL comp file is because to connect to it you're going to need to know the port. Now you also need to consult if there are any other conf files that are included in postgresql.conf as well as the auto.com file. So that's something to keep in mind.
Then, for completeness, the first example is for Red Hat and similar distributions, but he also shows what the ps output looks like on Debian and related systems. He even shows a quick command to find the port in the conf file, but again, make sure the setting isn't coming from an included file or postgresql.auto.conf instead. And then, how do you start and stop Postgres? He references some of the Debian tools, such as pg_lsclusters to list clusters and the commands to start, stop, and restart them, and also systemd (the basic checks are sketched below). So, a pretty quick blog post, but if you work with different PostgreSQL instances, these are definitely some quick commands to determine if Postgres is running.
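A short sketch of those checks; the data directory path is the Red Hat default and will differ on other systems:

```bash
# Is Postgres running? The bracket trick keeps grep from matching itself.
ps aux | grep '[p]ostgres'

# The postmaster's -D argument points at the data directory; the port
# normally lives in postgresql.conf there (but check include files and
# postgresql.auto.conf too).
grep port /var/lib/pgsql/data/postgresql.conf

# On Debian/Ubuntu, wrapper tooling and systemd:
pg_lsclusters
sudo systemctl status postgresql
```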
The next piece of content is actually a YouTube channel. The Southern California Linux Expo has posted a number of videos on their YouTube channel, and about five or six of them cover PostgreSQL. For example: "Using PostgreSQL for Data Privacy and Security", "Declarative Caching with Postgres and Redis", "No Excuses PostgreSQL Security", "PostgreSQL Worst Practices", "Enterprise Authentication with PostgreSQL", "Linux IO Internals for PostgreSQL Administrators", "Welcome to Postgres 12", et cetera. So if you're interested in video content, definitely check out this YouTube channel.
The next post is "Creating a PostgreSQL procedural language - Part 5 - Returning Results". This is from 2ndquadrant.com and has been working through setting up Julia as a programming language to use within PostgreSQL. So if you've been following this series, the fifth post is available.
The next piece of content, the PostgreSQL person of the week is Bruce Momjian. So if you want to learn about Bruce's contributions and work in PostgreSQL, definitely check out this blog post.
The next post is "Spatial Tile Serving with PostgreSQL Functions". This is from crunchydata.com, and again they're following along with the web services that they've created to work with Postgres and how you can generate a spatial tile serving. So if you have an interest in this content, definitely check out this blog post.
The next piece of content is "PostgreSQL GSS API Authentication with Kerberos part-1: how to set up Kerberos on Ubuntu". This is from higho.ca. So this looks to be the first part of a three-part blog post talking about authenticating to PostgreSQL using Kerberos. Now, this first post doesn't cover PostgreSQL, but it covers setting up Kerberos on Ubuntu. So if you have an interest in that, maybe you want to check out this first post.
The last piece of content is "PG_TIMETABLE: ADVANCED POSTGRESQL CRON-LIKE SCHEDULER RELEASED!". This is from cybertec-postgresql.com. They've released the second version of their pg_timetable tool because they made some changes to how its data is stored, and the post explains how the new version works. So if you're interested, definitely check out this blog post.