Safety Systems, Failover Slots, Transaction ID Complications, Repartitioning | Scaling Postgres 118
In this episode of Scaling Postgres, we discuss safety systems, the purpose and existence of failover slots, complications with transaction IDs and how to repartition without downtime.
Content Discussed
- Safety Systems Can Reduce Safety
- Failover slots for PostgreSQL
- PG Phriday: 10 Things Postgres Could Improve – Part 1
- Control Runaway Postgres Queries With Statement Timeout
- Repartitioning with logical replication in PostgreSQL 13
- Multi-Master Replication Solutions for PostgreSQL
- Composite type performance issues in PostgreSQL
- Transactions in PostgreSQL and their mechanism
- Developers Diary 1
- Developers Diary 2
- Using the Crunchy PostgreSQL Operator with PostGIS
- PgBouncer 1.14.0
- How to use AdaBoost Machine Learning model with 2UDA – PostgreSQL and Orange (Part 6)
- Oleksii Kliukin
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Safety Systems Can Reduce Safety". This is from momjian.us. He typically does quick little blog posts that are a paragraph or two. This one's a little bit longer, but what he had to say about safety systems I found particularly interesting. He mentions three instances where a safety system actually caused a problem: for example, a safety system caused a jet to crash, caused the Three Mile Island nuclear reactor meltdown, and caused the Chernobyl nuclear meltdown. So basically, the safety systems introduced to protect something caused the problem. He looks at this from the perspective of PostgreSQL and mentions a few different safety systems with regard to backups, error reporting, replication, and connection pooling.
Talking about backups, he looks at it through a few different lenses. How serious is the failure that the safety system is trying to prevent? High. How likely is the failure that the safety system is trying to prevent? High. How reliable is the safety system? About a medium. Then, what impact will a failure of the safety system have on the overall system? Relatively low. He says, well, maybe you run out of disk space while the backup is occurring, for example, but usually you're not going to cause the system to go down by taking a backup. Maybe it has a small performance impact, but not very much. What I find interesting, thinking about this, is how it applies to other areas he mentions, like replication. If you're going with streaming replication, the option that causes the least impact is asynchronous replication. But once you turn on synchronous replication, a safety system to ensure you don't lose any data, you now have essentially two servers that always need to be operating to keep synchronous replication going.
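As a minimal sketch of what turning that on looks like (the standby name is hypothetical), the key settings are synchronous_standby_names and synchronous_commit; once they're in place, a commit waits for the named standby to confirm, so if that standby is unreachable, writes on the primary simply stall:

```sql
-- Minimal sketch; 'standby1' is a made-up standby name.
-- Once this is in place, a COMMIT waits for the named standby to confirm,
-- so if that standby is down, writes on the primary stall.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();  -- both settings can be applied with a reload
```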
Maybe you can add more standbys, but each one you add is another point of failure. So each time you add to this safety system, you're introducing another potential source of failure, because your database will stop accepting writes if synchronous replication stops working. Essentially, the safety system causes a problem to happen. Then he talks about connection pooling: maybe you want a connection pooler, but now that is a source of error. Maybe you want to introduce two connection poolers; well, now you've got more complexity you need to coordinate and deal with to make sure those work. The same thing happens with High Availability solutions. Those manage the switchover from one database to another if there's an issue, but what if there's a failure in that system? Could that actually cause downtime? So not a long post, but definitely something to consider as you add more of these safety systems: do they really create a safer system?
The next piece of content is "Failover slots for PostgreSQL". This is from 2ndquadrant.com. I don't know if this has been updated or if this is a new post, but I have seen a number of other articles talking about failover slots. This is the concept of a replication slot, when you failover from a primary to a replica, any replication slots do not get carried over, they stay with the old replica. So you need to recreate them on the new replica. And this can cause problems with things like they mentioned here, logical decoding, where essentially you lose the place that you were at and you potentially have to make adjustments to handle it. And they talk about the failover problem here with the fact that the slot does not get crossed over.
The issue is that you have to track not only the existence of the slot but also where it is in the WAL, so that when a failover occurs, the slot knows exactly where it needs to be, and the post talks about a patch for this. But as I said, I don't know if this is an old post, because I don't see a link to the patch or any indication of whether it's potentially coming in Postgres 13 or a future version 14. So I don't know the state of this. Even so, it brings up an important thing to keep in mind if you're using logical replication or logical decoding with Postgres: the slots don't get copied over on failover, and you need to handle that case. If you happen to know of a patch for Postgres 13 or 14 that may address this, go ahead and let me know.
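In the meantime, handling it manually looks roughly like this (the slot name is made up): recreate the logical slot on the newly promoted primary and accept that its position in the WAL is not where the old slot left off.

```sql
-- Run on the newly promoted primary; 'my_decoding_slot' is hypothetical.
-- The new slot starts at the current WAL position, not where the old slot
-- on the failed primary left off, so the consumer may need to reconcile.
SELECT pg_create_logical_replication_slot('my_decoding_slot', 'pgoutput');

-- Check which slots exist and how far they have been consumed.
SELECT slot_name, slot_type, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;
```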
The next piece of content is "PG Phriday: 10 Things Postgres Could Improve- Part 1". This is from 2ndquadrant.com as well. Now, interestingly, they say 10 things Postgres could improve, but I don't see a listing of 10 things. They're listing four things that they're going to address in four parts. So I was a little confused by that. But basically, this addresses part one of the things that could be improved and it covers transaction ID complications. So this is from a post that we covered in a previous episode of Scaling Postgres, talking about 10 Things I Hate about Postgres. So it's a little play on that. But they're focusing on the transaction ID complications. This is the fact that a transaction ID is a 32-bit integer and that means there is a finite number of active transactions you can have. So you can run into problems if you're not actively vacuuming up your system. If you have long-running transactions, you could run out of those active transactions essentially within tables you're working with, or for using things like prepared_transactions.
So there are a few different scenarios where you can run out of these transaction IDs, and you really want to be aware of those possibilities. Now, how can you protect yourself? The number one way is to monitor it: track transaction ID age across all the tables and databases to make sure you're not going to run out. The second thing, of course, is to make sure your tables are well autovacuumed so that old transaction IDs get frozen and become available for reuse, and to make sure you don't have a lot of long-running transactions, especially idle-in-transaction sessions. In terms of the future, they are looking toward things like making the transaction ID a 64-bit number just to increase the runway, or alternatively, a new storage system like zheap could offer an alternative. So if you want to learn more about transaction ID complications, go ahead and check out this blog post.
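As a minimal sketch of the kind of monitoring he's describing (alert thresholds are up to you), you can check the age of the oldest unfrozen transaction ID per database and per table:

```sql
-- Transaction ID age per database; values approaching ~2 billion are an emergency.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

-- The same per table in the current database.
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;
```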
The next piece of content is "Control Runaway Postgres Queries with Statement Timeout". This is from crunchydata.com and it's talking about statement_timeout, which is a configuration option that enables you to timeout statements that exceed a certain duration. Now, you can set this on the cluster by session. They are even talking about per user, per database. So this helps you avoid the kind of things that we were talking about in the previous post having too many long-running transactions within your database system. Now, one thing he says he likes to do is at the database level, set the statement_timeout to 60 seconds so that no statement can exceed that duration within that database. If you have to run a longer query, then in whatever session you're using, just set the statement_timeout to a longer duration and then once you close out of that session, those settings will be forgotten. So if you want to learn more about the statement_timeout configuration option, check out this blog post.
The next piece of content is "Repartitioning with logical replication in PostgreSQL 13". So this is from blog.hagander.net and he's talking about a way to repartition a table from say, a daily partitioning scheme to a weekly partitioning scheme. Although you could migrate it one way or the other. And he's looking to do this without downtime. So clearly you can remove access to the system and be able to do it that way. But how could you do it without downtime? Now he says, quote "But first, a disclaimer -- this is definitely not pretty". But apparently, it does work as a quick hack. So he has one table that is partitioned daily and he inserted some data into it and then he set up a new table with a weekly partitioning scheme.
Now, in order to do this, you actually need to work with two databases: you publish the existing table and subscribe to it from a table in a second database, and then from that second database you publish that table and subscribe to it from the new table back in the first database. He goes through all the code here, but the description of everything he's doing and how it works is laid out in an eight-step process, and it's pretty much what I described: you replicate from one table in database one to a second table in the second database, and then replicate from that to another table in database one, all using logical replication. That should let you repartition your data without any downtime. If you're interested in learning more about this technique, I definitely encourage you to check out this blog post.
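For orientation, here is a very rough sketch of the central Postgres 13 feature that makes the trick possible (the table, publication, and subscription names and the connection string are all made up; the post's eight-step recipe bounces the data through a second database and back):

```sql
-- In the source database: publish the daily-partitioned table.
-- publish_via_partition_root (new in PG 13) makes changes look as if they
-- belong to the parent table, so the subscriber may partition them differently.
CREATE PUBLICATION repart_pub FOR TABLE measurements
    WITH (publish_via_partition_root = true);

-- In the other database: a table with the same name but a weekly partitioning
-- scheme, plus a subscription that copies existing rows and streams new ones.
CREATE SUBSCRIPTION repart_sub
    CONNECTION 'host=localhost dbname=sourcedb'
    PUBLICATION repart_pub;
```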
The next post is "Multi-Master Replication Solutions for PostgreSQL". This is from percona.com, and they're basically describing different solutions that do multi-master replication. They cover some key concepts with regard to database replication, synchronous replication, asynchronous replication, and a single master replication, which is essentially what Postgres does out of the box, only a single master solution. But then they go into multi-master. Talk about some of the pros, then, of course, some of the cons with regard to it. I'll just mention some of the cons. The main disadvantage of it, of course, is complexity. Second, conflict resolution is very difficult because you can have simultaneous writes on more than one master at a time.
So if someone updates the same data in the same table on two or three of the servers, which one wins? Sometimes manual intervention is necessary in that case. Then they mention the different solutions they're aware of. The first is BDR, the bi-directional replication product from 2nd Quadrant. They discuss xDB, PostgreSQL-XL, PostgreSQL-XC and XC2, Rubyrep, and Bucardo. So if you're interested in looking into a multi-master database solution built on Postgres, definitely check out this blog post and the different solutions offered.
The next piece of content is "COMPOSITE TYPE PERFORMANCE ISSUES IN POSTGRESQL". This is from cybertec-postgresql.com. By composite types, they're talking about custom data types that you can create in Postgres. So for example, you can create a type. This one they've chosen to call Person, and it includes an ID, a name, and an income. That one type encompasses all of these different types. You can store this in a single column. As you can see here, the type of that column is Person when you create it with that data type of person. Now, one thing they say here is to be careful about database performance. Now, they didn't use this example.
For the demonstration they use pgstattuple instead, because what it returns when you call it is a composite type. Depending on how you query it, one form returns in almost 2 seconds, whereas a normal SELECT with the function in the FROM clause returns in about 200 milliseconds. He says this is because, when you expand a composite type in the select list, Postgres apparently calls the function once for each column. So it's just something interesting to be aware of: depending on how you use composite types, you could run into a performance issue. Personally, I haven't used composite types in my database design; I tend to stick with plain, non-composite types, and if I need something composite-like, I generally use JSONB fields instead. But go ahead and check out this blog post if you want to learn more.
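As a hedged illustration of the two query shapes being compared (the table name is made up), expanding a function's composite result in the SELECT list can re-evaluate the function for every output column, while putting it in the FROM clause evaluates it once:

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Slower pattern: the composite is expanded in the target list,
-- which can re-run pgstattuple once per output column.
SELECT (pgstattuple('t_demo')).*;

-- Faster pattern: the function sits in the FROM clause and runs once.
SELECT * FROM pgstattuple('t_demo');
```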
The next piece of content is "Transactions in PostgreSQL and their mechanism". This is from highgo.ca. So this is talking about transactions, and it covers essentially normal transactions that are auto-committed by default in Postgres. They talk about that and show the different ways it gets executed. And they start digging deep into how this actually works, and they query the table to look at the different hidden fields and how that tracks and manages that transaction. They talk about subtransactions that can happen within a primary transaction. They talk about multi-transactions that cover basically row-level locks, and then finally they follow up with two-phase commit transactions and how they work. So if you want to deep dive into the technical details of how transactions work in Postgres, definitely check out this blog post.
The next piece of content is a set of two posts, "Developers Diary 1" and "Developers Diary 2", from Paul Ramsey at cleverelephant.ca. He's talking about performance enhancements they've been making to PostGIS. So if you want to learn more about the work he's been doing to improve the performance of large object manipulation in PostGIS, definitely check out these two blog posts.
The next piece of content is "Using the Crunchy PostgreSQL Operator with PostGIS". This is from crunchydata.com, and they're talking about exactly what it says using the Crunchy Data Kubernetes operator to set up PostGIS in Kubernetes. So if you're interested in doing that, definitely check out this blog post.
The next piece of content is that PgBouncer 1.14.0 has been released. This is from pgbouncer.org. The main feature they list is that encrypted SCRAM secrets in PgBouncer can now be used for server-side login, so storing plain-text passwords in PgBouncer is no longer necessary in order to use SCRAM. If you're interested in this new feature, definitely check out this new version.
The next piece of content is "How to use AdaBoost Machine Learning model with 2UDA - PostgreSQL and Orange (Part 6)". So this is part six of the post from 2ndquadrant.com. It covers how to do the machine learning that they're mentioning here. So definitely check out this blog post to learn more.
The last piece of content, the PostgreSQL person of the week is Oleksii Kliukin. So if you're interested in learning more about Oleksii and his contributions to PostgreSQL, definitely check out this blog post.