Slow Updates, Lateral Joins, Serial To Identity, PG 15 Changes | Scaling Postgres 238
In this episode of Scaling Postgres, we discuss why and what to do when updates are slow, use cases for lateral joins, moving from serial to identity for auto-incrementing ids and changes to Postgres 15.
Content Discussed
- Why are my PostgreSQL updates getting slower?
- 5mins of Postgres E40: How to debug slow Postgres UPDATEs and bloated tables with auto_explain & page_inspect
- For each loops with LATERAL Joins
- Migrate PostgreSQL ID’s from serial to identity after upgrading to Django 4.1
- Postgres 15 Configuration Changes
- postgresql.conf comparison
- Exclusive Backup Mode Finally Removed in Postgres 15
- PostgreSQL 15: logging in JSON
- Postgres 15 Adds Copy Header and Matching
- Upgrading PostgreSQL 14 to 15 on Fedora, RHEL, CentOS, Rocky, Alma Linux with PGDG RPM packages
- Using pg_upgrade to upgrade PostgreSQL 9.6 PostGIS 2.4 to PostgreSQL 15 3.3 on Yum
- PostgreSQL at the PASS Data Community SUMMIT 2022
- Stored procedures
- Cédric Duprez
- Rubber Duck Dev Show Episode 62 | Staying Off The Rails With Joel Drapper
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Why are my PostgreSQL updates getting slower?". This is from cybertec-postgresql.com. He's talking about a situation where a job was running that was constantly doing updates, but each update kept getting slower, such that the final one was, I think, close to 100 times slower. So here's an example of what caused it. There was an item table that records items of some value. Then there is a summary table that only has one row in it: basically the sum total of the values that are in the item table. Now, a trigger has been created so that whenever an item gets inserted, for each row, it fires off a function that updates the value. So if you insert a five, it's going to add five to the total in this one-row table here. Or if you add a ten, it'll add a ten to it. So basically, as items are inserted into the item table, the total in the summary table is added up. Now, they are running this trigger as DEFERRABLE INITIALLY DEFERRED to try and make it as performant as possible.
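As a rough sketch of the setup being described (the table, function, and trigger names here are placeholders, not necessarily what the blog post uses), it might look something like this:

```sql
CREATE TABLE item (
    value bigint NOT NULL
);

CREATE TABLE summary (
    total bigint NOT NULL
);
INSERT INTO summary VALUES (0);  -- the single row holding the running total

CREATE FUNCTION add_to_summary() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    -- one UPDATE of the same single row for every inserted item
    UPDATE summary SET total = total + NEW.value;
    RETURN NEW;
END;
$$;

-- Deferring the constraint trigger pushes all the firings to COMMIT time,
-- but each firing still leaves behind a dead row version in "summary".
CREATE CONSTRAINT TRIGGER item_summary_trigger
    AFTER INSERT ON item
    DEFERRABLE INITIALLY DEFERRED
    FOR EACH ROW EXECUTE FUNCTION add_to_summary();
```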
And they inserted the values using the COPY command. So basically, they're inserting 100,000 random numbers using this technique. So what is causing the slowdown? To help diagnose it, they went ahead and set up auto_explain, which is an extension that logs the execution plan of queries that meet criteria you specify. With this configuration, they're basically logging all queries and getting information like buffers, et cetera. And what they found is that the first update looks like this, and the 100,000th update looks like this. Now, they're pretty similar, but you can tell that the actual time for doing the sequential scan is far longer for the last update than for the first. The other thing you can see is that for shared buffers, the number being hit by the first update is one, whereas it's 443 for the last update. So what's causing this to happen? It's basically Postgres's multiversion concurrency control (MVCC). Whenever you update a row, Postgres doesn't update that row in place; it actually creates a new row version.
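A minimal sketch of the kind of auto_explain setup that would capture this (the exact thresholds in the post may differ):

```sql
-- Load auto_explain for the current session and log every statement's plan,
-- including actual timings and buffer usage, plus the trigger's nested UPDATEs.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;        -- log all statements
SET auto_explain.log_analyze = on;            -- include actual row counts and times
SET auto_explain.log_buffers = on;            -- include shared buffer hits/reads
SET auto_explain.log_nested_statements = on;  -- catch statements run inside the trigger
```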
And then later those old row versions have to be vacuumed up. So basically, you keep adding more rows with every update you do to the table. And then he says, okay, well, vacuum is supposed to clean this up. Why is it not cleaning it up? Well, the problem is that this is all happening in one transaction. It's essentially one long-running transaction inserting 100,000 rows, so there's no opportunity for vacuum to clean things up. The other thing is that they didn't create an index on that summary table, so the column they're updating has no index on it. That should enable heap-only tuple (HOT) updates, so you shouldn't have to update any indexes, which should make things faster. You should also get HOT pruning, which is kind of like a mini vacuum done as part of the update. But not even this can work. It's the same issue as the vacuum: it can't happen because the transaction is still open; it's one long-running transaction. He said even if you try to add indexes, it'll actually make the problem worse because you have more information to keep up to date in terms of managing the indexes too.
Now, to help see what was going on, he also installed the pageinspect extension. And here you can see that HOT updates are being done, but these tuples are just dead. So every update that happens is creating a new row version. Now, the solution he took to address this is not to have the trigger run after every inserted row but to run it after every statement. So when all the data has been inserted for a given statement, it runs the SUM function once, and the function has been modified to handle that type of update. I think this went from over a minute to 90 milliseconds using this technique. So basically, you want to be cautious whenever you're doing updates in Postgres because each one creates an extra row version.
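A hedged sketch of that statement-level rewrite, continuing the placeholder names from above (the blog's actual function may differ; note that constraint triggers can't be statement level, so this is a plain AFTER trigger):

```sql
-- Read the inserted rows from a transition table, so "summary" is updated
-- once per statement instead of once per row.
CREATE OR REPLACE FUNCTION add_to_summary() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    UPDATE summary
       SET total = total + (SELECT coalesce(sum(value), 0) FROM new_rows);
    RETURN NULL;  -- return value is ignored for statement-level triggers
END;
$$;

DROP TRIGGER item_summary_trigger ON item;

CREATE TRIGGER item_summary_trigger
    AFTER INSERT ON item
    REFERENCING NEW TABLE AS new_rows
    FOR EACH STATEMENT
    EXECUTE FUNCTION add_to_summary();
```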
So if you're trying to update a single row multiple times, you're going to be creating multiple row versions, so you need to be very careful about how many times you're doing that. Here, they addressed it essentially by doing that update once per statement as opposed to once for each row in a statement, 100,000 times. But there are different ways you could address it. You could focus on using append-only tables to try and avoid this situation. Or, if you're doing a bulk load of something, incorporate a pause into the loading of data so that autovacuum has time to catch up, or even run a manual vacuum between stages of a data load. That could help as well.
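As one hedged illustration of that pause-and-vacuum idea (the file paths and batch split are placeholders):

```sql
-- Split the load into batches, commit each one, and VACUUM the heavily
-- updated table in between so dead tuples can be cleaned up.
-- (VACUUM cannot run inside a transaction block.)
BEGIN;
COPY item (value) FROM '/path/to/batch_1.csv' (FORMAT csv);
COMMIT;
VACUUM summary;

BEGIN;
COPY item (value) FROM '/path/to/batch_2.csv' (FORMAT csv);
COMMIT;
VACUUM summary;
```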
Now, related to this, the next piece of content is "5mins of Postgres E40: How to debug slow Postgres UPDATEs and bloated tables with auto_explain & page_inspect". This is from pganalyze.com, and the previous post is exactly what Lukas covers in this episode. So if you want to learn more about that, definitely check out his piece of content as well.
Next piece of content- "For each loops with LATERAL Joins". This is from sqlfordevs.com and he's talking about using lateral joins to efficiently pull data that is otherwise hard to get. So his example here is you have customers and you want to find the last three sales from those customers. Now you can use a subquery to do this but basically, you have to pull and group by the customer ID and only pull the last three and then create that type of join to the customer table. That can end up leading to an inefficient join query, whereas using a lateral join is frequently much more efficient and gives you better performance. So anytime you kind of want to do a for each loop within a particular query for each row, definitely look to use a lateral join to do that. This post explains how to do that, how it works in Postgres, and even mentions MySQL as well. So definitely check this out if you want to learn more.
Next piece of content- "Migrate PostgreSQL IDs from serial to identity after upgrading to Django 4.1". This is from adamj.eu. Since the introduction of identity in PostgreSQL 10, they give you a different way to create an auto-incrementing integer, basically. Historically, you could use serial but now you can use the SQL standard generated by default as IDENTITY or generated always as IDENTITY. It gives you a number of conveniences for working with these types of auto-incrementing integers. One- it's easier to grant permissions to end users, it is a part of the SQL standard so once you learn it you could use the same syntax in other databases.
It's also easier to manage the next value compared with having to call nextval on a sequence. And if you copy a table, the identity sequence gets copied along with it rather than being shared with the original. The other benefit is that you can strictly enforce the use of auto-generated integers, meaning you can't manually insert an ID if you use GENERATED ALWAYS. So in my opinion, it's really great that an application framework has started using these. I use Rails, and to my knowledge, they aren't supporting this yet; they still use serial. But Django has moved forward in version 4.1 and is supporting this for new tables. Serial still works for old tables, but he talks about whether you should potentially migrate your older columns, and he actually developed some code to do this type of conversion for you. So if you use Django and are interested in making this transition, definitely check out this blog post.
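Here's a minimal sketch of the difference, with hypothetical table names:

```sql
-- The old approach: serial creates and wires up a sequence behind the scenes.
CREATE TABLE orders_old (
    id serial PRIMARY KEY
);

-- The SQL-standard approach; GENERATED ALWAYS rejects manually supplied ids
-- unless you explicitly override.
CREATE TABLE orders_new (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
);

INSERT INTO orders_new DEFAULT VALUES;               -- id generated automatically
INSERT INTO orders_new (id) VALUES (42);             -- error: column is GENERATED ALWAYS
INSERT INTO orders_new (id) OVERRIDING SYSTEM VALUE
    VALUES (42);                                     -- allowed, but only explicitly
```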
Next piece of content- "POSTGRES 15 CONFIGURATION CHANGES". This is from rustprooflabs.com and what they do is they have a tool that they created called a pgConfig site and they compare the configuration changes of postgresql.conf between versions. So from version 14 to 15, there are six new parameters you can change. Three have been updated potentially with new defaults and then one was actually removed. So some of the changes are they now have a log startup progress interval to basically minimize logging up on startup. Log_checkpoints is now on by default and log_auto_vacuum_min_duration is now set at ten minutes. They talk about a few others here and basically, the link to this pgConfig is here and here you can see all the different changes: what it was in 14, what it is now in 15, and you can make changes by looking at other versions as well. If you're looking to migrate from something earlier than 14 to 15 as well, you could use it using this tool. So definitely check this out if you're interested in that.
Next piece of content- "Exclusive Backup Mode Finally Removed in Postgres 15". This is from enterprisedb.com. Exclusive backup mode was something that was deprecated in 9.6, but now it's fully removed. And to make sure that no one is really using it, they actually changed some of the function names from pg_start_backup and pg_stop_backup to pg_backup_start and pg_backup_stop. So if you actually use some of these lower-level functions to do your backup, like for example, I have a client that's using this for doing file system-based snapshots for taking their backup. They didn't want to use pg_base_backup, which is what I typically use. We had to use these functions in order to do a snapshot-based backup. So if you want to learn more about these changes, you can check out this blog post.
Next post- "PostgreSQL 15: logging in JSON". This is from fluca1978.github.io and he's talking about how you can enable the new JSON logging and also log to stderr as well. And he shows you how you can output this information. Then, of course, you can consume it with something that consumes JSON. So check out this blog post if you're interested in that.
Next piece of content- "Postgres 15 Adds Copy Header and Matching". This is from enterprisedb.com. They're talking about how now not only CSV files but also just standard text you can output with HEADER. So they show an example here where you say WITH HEADER and be aware that once you try to actually load that file back in, you're going to need to specify WITH HEADER before or you'll get an error. You can also specify HEADER MATCH and that will make sure that the columns that have been output as part of the output file match the columns that are being inserted. So in this example, they renamed one column and then they tried to import it with a match and it gave an error message. So definitely a great addition. Check this blog post out if you want to learn more.
Next piece of content- "Upgrading PostgreSQL 14 to 15 on Fedora, RHEL, Rocky, Alma, Linux with PGDG RPM Packages". If you run one of those and want to upgrade to Postgres 15, you can definitely check out this blog post for guidance on how to do that using pg_upgrade. And I should say this is from endpointdev.com.
Next piece of content related to that is "Using pg_upgrade to upgrade PostgreSQL 9.6 PostGIS 2.4 to PostgreSQL 15 3.3 on Yum". This is from bostongis.com. This one covers not only the Postgres upgrade but also how to upgrade PostGIS. So definitely check out this blog post if you're having to do that.
Next piece of content- "PostgreSQL at the PASS Data Community SUMMIT 2022". This is from softwareandbooze.com and he's mentioning there will be a full-day pre-conference training about PostgreSQL. It looks primarily for people who are coming from Microsoft SQL Server to Postgres, but you can definitely check out the course content here if you're interested in attending.
Next piece of content. There was another episode of Postgres FM this week. This one was on "Stored procedures" and whether you should implement your business logic in the database or on the application side.
Next piece of content. The PostgreSQL person of the week is Cédric Duprez. If you want to learn more about Cédric and his contributions to Postgres, definitely check out this blog post.
And the last piece of content, we did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was on "Staying Off The Rails With Joel Drapper". It’s about choosing to go off the rails with your application framework and choosing to stay off them. So if you're interested in that kind of long-form developer discussion, we welcome you to check out our show.