
Postgres & The US Treasury Attack | Scaling Postgres 354
In this episode of Scaling Postgres, we discuss the role of Postgres in the US Treasury online attack, new releases of Postgres, optimizing how streaming replication changes are applied, and a query that filled a database's disk.
Content Discussed
- PostgreSQL 17.3, 16.7, 15.11, 14.16, and 13.19 Released!
- Critical PostgreSQL bug tied to zero-day attack on US Treasury
- Out-of-cycle release scheduled for February 20, 2025
- PostgreSQL 17.4, 16.8, 15.12, 14.17, and 13.20 Released!
- End of the road for PostgreSQL streaming replication?
- Postgres Is
- How to Upgrade RDS PostgreSQL with Minimal Downtime
- Postgres in the time of monster hardware
- SQL vs NoSQL
- Free PostgreSQL Performance Monitoring with pgNow
- Waiting for PostgreSQL 18 – Add cost-based vacuum delay time to progress views.
- Waiting for PostgreSQL 18 – Add delay time to VACUUM/ANALYZE (VERBOSE) and autovacuum logs.
- Geospatial Search in Postgres
- pgBackRest preview: verify recovery target timeline on restore
- Important PostgreSQL Parameters: Understanding Their Importance and Recommended Values
- Stop Over-Engineering AI Apps: The Case for Boring Technologies
- Pgai Vectorizer Meets Python: Integrating SQLAlchemy and Alembic
- One Line of SQL, All the LiteLLM Embeddings
- Expanding Pgai Vectorizer: SQLAlchemy and LiteLLM Make Vector Search Simple
- PostgreSQL 18: part 2 or CommitFest 2024-09
- PgPedia Week, 2025-02-16
- The Dangers of Testing in SQLite as a Postgres User
- I Dropped a Table in Production—Now What?
YouTube Video
Podcast Audio
Transcript
Well, a lot can happen with Postgres in a week. Normally I record these shows on Wednesday and Thursday, and the video editing takes some time because I'm very challenged when recording these shows, and there are numerous other steps to get everything ready for release on Sunday evenings. Well, it seems that since the last show there have been two new releases of Postgres, so we'll definitely talk about that this week. Be sure to stay until the end of this episode for my Consulting Corner, where I talk about a query that filled up a database's disk. But I hope you, your friends, family and co-workers continue to do well.
Our first piece of content is “PostgreSQL 17.3, 16.7, 15.11, 14.16, and 13.19 Released!”. The main highlight of this release is a fix for a security issue, CVE-2025-1094, titled “PostgreSQL quoting APIs miss neutralizing quoting syntax in text that fails encoding validation”. Definitely a mouthful, but it looks like it's in the libpq functions. Basically, there are some string escaping issues that needed to be resolved. This also means psql is affected, so you're probably not only going to have to patch Postgres but, I'm assuming, anything that uses the libpq library, including client libraries.
Now, this also got highlighted in the news because it was related to a zero-day security attack on the US Treasury in December. Rapid7, who was doing the security investigation, actually revealed this, and the vulnerability was disclosed last Thursday. So you can definitely read this post if you want to learn more about that.
Now, this release also includes all sorts of bug fixes, including some race condition issues that can happen with vacuum and cause corruption to the system catalog. So there are definitely several other things fixed in these releases as well. But shortly after this release, some issues were found. So they posted on the 14th that an out-of-cycle release was scheduled for February 20th, 2025, because the patch for the security vulnerability actually caused a regression in string handling. So basically, the recommendation was, as they say here, if you weren't immediately impacted by this issue, consider waiting for the next round of releases. And as I'm recording this on February 20th, that has happened: “PostgreSQL 17.4, 16.8, 15.12, 14.17, and 13.20” have been released, which fix that regression in the libpq library, and they also fixed a small memory leak in pg_createsubscriber. So, because of the security issues, you should definitely upgrade to these releases as soon as you're able.
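As a quick sanity check after patching (just a minimal sketch, not something from the release announcement), you can confirm which minor version a server is actually running:

```sql
-- Full version string; should show 17.4, 16.8, 15.12, 14.17 or 13.20 once patched
SELECT version();

-- Or just the version number as reported by the server
SHOW server_version;
```

Keep in mind this only covers the server side; client libraries linked against libpq need to be updated separately.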
Next piece of content: “End of the road for PostgreSQL streaming replication?”. This is from cybertec-postgresql.com, and when I first read this I imagined, ah, are they talking about logical replication maybe taking over for streaming replication, so that all of our replicas become logical replicas? That is something I could see happening in the future, although there are definitely more changes that need to be made to logical replication first. One is sequence synchronization, so that sequences get updated on replicas, as well as DDL changes, so that when you create a new table, there is a mechanism to create that table in the destination database as well. But that's not exactly what this blog post is talking about. It's talking about the issue with streaming replication where, when you think about it, your database system runs many different backends, many different processes, on the primary database, and all of them commit things to memory that eventually get flushed to the WAL, and then those WAL records are transferred to the replica. Now, the issue is there's only a single process that applies these changes on the replica, so the apply worker can be a potential bottleneck. And I've seen some of this firsthand, where a replica can struggle to keep up with the WAL the primary is sending it. Now, in the case where they were struggling, there was actually compression going on with the ZFS file system, but still, that behavior became apparent. And what this blog post focuses on is how to find the limits, what's slowing things down, and what can be done to actually speed them up. So they go into a lot of detail on things that would need to be done to speed things up. They mention that doing things in parallel will be quite difficult, because the sequencing of how the WAL needs to be applied is important, but they still found some areas where parallelism could potentially work, particularly in terms of reading ahead to get certain data. With various optimizations they believe they could, to quote, possibly hit half a million transactions per second in terms of applying to the replica. But this is basically an examination of how this apply process could be accelerated further for replicas in Postgres. And if you're interested, you can definitely check out this blog post.
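If you want to see whether your own replicas are keeping up on the apply side, here's a minimal sketch of the kind of query I'd start with, assuming Postgres 10 or later (the first runs on the primary, the second on the replica):

```sql
-- On the primary: how far behind each standby is in writing, flushing and replaying WAL
SELECT application_name, state, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- On the replica: byte gap between WAL received and WAL actually replayed
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_lag_bytes;
```

A growing replay_lag while write_lag and flush_lag stay healthy is a decent hint that the single apply process, not the network or disks, is the bottleneck.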
Next piece of content is “Postgres Is”. This is from drunkdba.medium.com, with the tagline “do not drink and database”. This blog post asks, what is Postgres? Because a lot of different database systems come out and say they're Postgres compatible. But the question is, what does that really mean? Craig mentioned on Twitter or X that Amazon's distributed SQL offering, DSQL, is Postgres compatible, yet it doesn't have views, triggers, sequences, foreign key constraints, extensions, NOTIFY, nested transactions, or JSONB. So how compatible is it really? That's what this blog post is talking about, along with a lot of memes that you can see here. And from this, someone developed a Postgres compatibility index to be able to run tests against a particular database to see how compatible it is. They have assessments for various database systems that you can look at here, and even some detail about which areas are not supported, so you can definitely check that out. And then, given that Postgres is this data platform where you can plug in various different things using its extension ecosystem, is Postgres AI? Is it a graph DB? Is it a streaming engine? Is it a time-series database? It can be essentially all of these things. So I just thought this was a particularly interesting blog post, and definitely check it out if you're interested.
Next piece of content: “How to Upgrade RDS PostgreSQL with Minimal Downtime”. This is from dataegret.com, and this blog post is talking about how Amazon RDS for PostgreSQL has the ability to do blue-green deployments as a way to upgrade your database system. It's basically using logical replication under the covers, plus some steps to do the blue-green deployment and upgrade your database system. But it has some caveats. One, you can't use RDS Proxy when doing it. You can't do it if your DB has subscriptions. You must be on the last major version of the database, and you don't get a built-in rollback after cutover. But they say if you want all of that, they developed their own strategy, again using logical replication at its core. Because it's RDS, they're using a snapshot technique to be able to transfer the data a lot more easily. Well, it's not easier, it's definitely faster, because there are quite a few steps involved. There are 13 steps, and each is pretty comprehensive, covering everything you would need to do for this kind of logical database upgrade. But a technique similar to this is how I upgrade other clients' databases, so it's definitely a valid way to do it, and this is a way to do it on RDS. So check it out if you're interested.
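The blog post has all 13 steps, but as a rough sketch of the logical replication core underneath a strategy like this (the names and connection string here are made up, and the snapshot handling is omitted):

```sql
-- On the source (old) database: publish the tables to replicate
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the target (new) database: subscribe to that publication
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-db.example.com dbname=app user=replicator password=secret'
    PUBLICATION upgrade_pub;

-- Before cutover: confirm the subscription has caught up
SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription;
```

In a snapshot-based approach you would typically restore the snapshot first and create the subscription without the initial data copy, which is exactly the kind of detail those 13 steps walk through.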
Next piece of content: “Postgres in the time of monster hardware”. This is from enterprisedb.com, and it's a discussion of what happens when you put Postgres on really large hardware, like a CPU with 768 threads or a system with 192 cores per socket, which is a lot. So at the upper end of hardware, what kind of issues can you run into? The first area she mentions is non-uniform memory access (NUMA), because different cores have affinity to different areas of memory. This can particularly affect shared buffers: if a lot of the shared buffers live in one area of memory but a core in another area needs access to them, it takes longer to reach that non-local, more remote memory, and vice versa, something stored near one core is harder for a distant core to access. They say it can be up to two to three times slower to access those remote memory areas. And of course, the biggest limitation when you have something like this is usually the I/O bandwidth: above we're talking about slow memory, but what about slow disks? That can impact it too. Then the next question is, when you have this massive machine with, say, 768 threads and 10 terabytes of RAM, can you configure Linux and Postgres to actually use all those resources, particularly with regard to connection count, and how well will parallelization work? So, an interesting discussion if you're pushing the limits of what's available with physical hardware.
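If you want to see what your own instance is currently configured for before worrying about monster hardware, a quick way to eyeball the relevant knobs (just a sketch of the settings the discussion touches on):

```sql
-- Current values of settings that matter most when scaling up hardware
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers',
               'max_connections',
               'max_worker_processes',
               'max_parallel_workers',
               'huge_pages');
```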
Next piece of content: there was another episode of Postgres FM last week. This one was “SQL vs NoSQL”. So Nikolai and Michael had Franck Pachot on to discuss this topic. Franck has done work with PostgreSQL, YugabyteDB and a number of other database systems, and he has now joined the MongoDB team. So there was definitely a back and forth about SQL versus NoSQL. And whenever I see this discussion, the NoSQL advocates generally talk about developer experience and how easy it is to get started, and how you don't have to translate from objects to relations using an ORM. But the result of that is exactly what this post here calls “schema later”. In other words, I observe that the data model isn't really thought about; you just go with what you have and build from there, whatever is fast and easy. But as this post, by no less than Michael Stonebraker and Álvaro Hernández, argues, schema later is considered harmful, because basically you don't put any forethought into how you're going to store and use the data in the future, or at least that's a ramification of doing it. And Franck mentioned that he's basically a developer advocate at MongoDB, seemingly focused on that. So if you're using a document-oriented database like MongoDB, definitely take more time to think about how you're storing the data and what the future will be. Because I think that's one of the advantages of SQL: flexibility. Your data model is not necessarily tied to how your application works; instead, you've tried to model the data the way it works logically in the business, and the application interfaces with it. But this show had a lot of great discussion back and forth about the advantages and disadvantages of each approach, and I definitely encourage you to listen to the episode here or watch the YouTube video down here.
Next piece of content: “Free PostgreSQL Performance Monitoring with pgNow”. This is from scarydba.com, and pgNow, apparently produced by Redgate, is a free tool that gives you a Postgres dashboard to look at workload analysis, live sessions, indexing, vacuum health, etc. A lot of this can be accessed through the system views in Postgres, of course, but here it's being put, I assume, into a nice-to-look-at interface, and it says it's free and will continue to be free. So if you're interested in that, you can check out this piece of content.
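If you'd rather poke at the underlying system views yourself, here are a couple of queries of the sort a dashboard like this presumably builds on (a sketch, not pgNow's actual implementation):

```sql
-- Live sessions: what each non-idle backend is doing and what it's waiting on
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state <> 'idle';

-- Vacuum health: dead tuples and when each table was last (auto)vacuumed
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```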
Next piece of content: “Waiting for PostgreSQL 18 – Add cost-based vacuum delay time to progress views.” So now the vacuum progress views will include delay time, so you can see how long a vacuum has had to wait while it's running.
In addition, he posted another blog post, “Waiting for PostgreSQL 18 – Add delay time to VACUUM/ANALYZE (VERBOSE) and autovacuum logs.” So not only is this visible in the system views while a vacuum is actively running, but that information now gets logged as well.
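As a hedged sketch of what that could look like once Postgres 18 ships (the delay_time column name is my reading of the commit these posts describe, so treat it as an assumption, and my_table is just a placeholder):

```sql
-- Watch a running vacuum, including time spent sleeping in cost-based delay (PG 18)
SELECT pid, relid::regclass AS table_name, phase, delay_time
FROM pg_stat_progress_vacuum;

-- In PG 18, verbose output should likewise report the accumulated delay time
VACUUM (VERBOSE) my_table;
```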
And now it's time for the Consulting Corner. The interesting issue I had to deal with a number of weeks ago was a query, a SELECT query, that was filling the disk of the entire database system. Thankfully this was not a production system, it was an earlier environment, but it had a query that, if you executed it, would fill the entire disk. Now, it wasn't a very large database system, so the disk only had to grow by maybe 20 to 30 gigabytes in order to fill up, but still, running one query could do it. And we never got to see how long this query actually took, because it always got killed, or we had to cancel it before it filled up the disk. Basically, this query was constructed by an ORM that was doing a whole lot of SQL gymnastics, and I assume this was required given the data model, or at least they weren't pulling the data in the most efficient way or using the right tables. Basically, there were three duplicate subqueries, all retrieving the same data, and there were multiple ORs within the query. The result is it would basically run for 20 minutes, never finish, and, you know, fill the disk. Now, I've never used a materialized CTE before, or at least I've never really seen a performance advantage to using materialization, but I thought I would give it a try here, particularly for those identical subqueries: why not just run the subquery once and materialize the result? Its end result was only a few records, but those were then used in other parts of this larger query. And then I had an idea: can we convert the ORs into unions? Because a lot of times you can get a performance boost by using unions instead of ORs. And by making that change, using materialized CTEs for those duplicate subqueries and using unions instead of ORs, I was able to get a runtime of approximately 40 to 50 milliseconds, compared to the query running for 20 minutes and filling up the disk. So I thought this was interesting, just because it was a use of materialized CTEs that I actually hadn't found helpful before. And also, just a suggestion: be cautious of the queries your ORM is generating. In my opinion, ORMs are fine, but once your queries start getting super complex, you definitely want to be more careful.
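I can't share the actual ORM-generated query, so here is a minimal sketch of the shape of the rewrite, with made-up table and column names:

```sql
-- Before (simplified): the same subquery repeated three times, glued together with ORs
--   SELECT o.* FROM orders o
--   WHERE o.customer_id IN (SELECT id FROM customers WHERE region = 'EU')
--      OR o.account_id  IN (SELECT id FROM customers WHERE region = 'EU')
--      OR o.billing_id  IN (SELECT id FROM customers WHERE region = 'EU');

-- After: run the duplicate subquery once as a materialized CTE,
-- and turn the ORs into a UNION so each branch can use its own index
WITH eu_customers AS MATERIALIZED (
    SELECT id FROM customers WHERE region = 'EU'
)
SELECT o.* FROM orders o WHERE o.customer_id IN (SELECT id FROM eu_customers)
UNION
SELECT o.* FROM orders o WHERE o.account_id  IN (SELECT id FROM eu_customers)
UNION
SELECT o.* FROM orders o WHERE o.billing_id  IN (SELECT id FROM eu_customers);
```

UNION also deduplicates rows the way the OR version would, so the results stay equivalent; if duplicates weren't possible in the data, UNION ALL would be even cheaper.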