
The Future Is Containers? | Scaling Postgres 348


In this episode of Scaling Postgres, we discuss whether the future is containers and Postgres, the UUIDv7 commit, and a pg_duckdb interview.


Transcript

And we're back. Happy 2025 to you. I hope you had a great holiday season and I hope your new year is going well so far. I myself have spent the last couple of weeks moving to a new place during this break, so I've definitely been super busy. But it's time to get back into Postgres. Of course there's about two to three weeks of content that was missed, so I'm going to be covering it all in this episode. But again, I'm only covering about the first 15 or so posts in the show. All the links will be in the show notes on the website. What's been interesting about the posts over the last two to three weeks is that one theme emerged across a number of blog posts, and it was all related to containers and Postgres. So that'll kind of be the theme, at least for the first part of this episode. But I hope that you, your friends, family and co-workers continue to do well.

Our first piece of content is “PG Phriday: Kubernetes Killed the High Availability Star”. This is from bonesmoses.org, and the post actually comes from a presentation that was done at PASS this year. This blog is from Shaun Thomas, and he basically wrote the book on Postgres high availability: PostgreSQL 12 High Availability, by Shaun Thomas. So I would say he knows a thing or two about high availability in Postgres. But even though he wrote the book on it, he says he won't continue that series, meaning the book, because Postgres high availability is dead and Kubernetes killed it. When someone who has worked that much on Postgres high availability says it's dead, I think that's something to take notice of. It looks like Kubernetes is the destination for Postgres high availability: he talks about all the stuff you have to get right with regard to high availability, and the fact that Kubernetes has built a lot of that in, which can be utilized through these Postgres operators that run Postgres on Kubernetes. This is definitely an interesting blog post I encourage you to check out, given his unique perspective. At least from his viewpoint, right now the future looks to be Kubernetes, and specifically he seems to favor CloudNativePG, which is actually the Postgres operator I've started learning the most. And it's interesting, the meme here says YAML goes in and containers come out, you can't explain that. He does talk a little bit about what Kubernetes is and how it works, the advantages of using the operators, and how you can create a cluster with just a declarative set of YAML. And even though he favors CloudNativePG at this time, there are plenty of other operators out there as well. But basically he's declaring that Postgres high availability is dead and the route forward is Kubernetes.

The next blog post is “Challenges of Postgres Containers”. This is from ardentperf.com, and he's exploring using containers with Postgres and some of the challenges you can encounter with them, although hopefully a lot of these issues the operators will help you bypass. I thought this was an interesting read as well, just to get more container knowledge under your belt if you're looking to run Postgres on containers. Specifically, the thing I would look out for is whether your Postgres operator solution is using the new built-in collation provider that came in version 17, because I think this is going to be essential for working with containers. One of my big concerns with using some of these operators is collations and whether they will change across different versions.
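
As a concrete illustration of that, here's a minimal sketch assuming the Postgres 17 builtin collation provider (the database name is hypothetical, and this is not from the post itself):

```sql
-- Create a database on the builtin collation provider (new in Postgres 17),
-- so sort order no longer depends on the glibc or ICU version shipped
-- inside the container image
CREATE DATABASE app
    TEMPLATE template0
    ENCODING 'UTF8'
    LOCALE_PROVIDER builtin
    BUILTIN_LOCALE 'C.UTF-8';
```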

Next post: “CloudNativePG in 2024: Milestones, Innovations, and Reflections”. This is from gabrielebartolini.it, and like the first post, it endorses this particular Postgres operator. This one is by, I guess you would call him, the founder of CloudNativePG, or at least its most prolific contributor. You can see the meteoric rise in the GitHub star history for CloudNativePG since the beginning of 2024, and it has now overtaken every other operator in terms of that one metric, GitHub stars. So clearly its popularity is increasing. He goes over all the different releases that have happened this year, calling out specific features that he thought were important, and also their work directly with the Kubernetes project to add support for running Postgres workloads on Kubernetes, particularly Volume Group Snapshot support in Kubernetes 1.32. So check this out if you want to learn more.

The next piece of content, also from gabrielebartolini.it, is “CNPG Recipe 16 - Balancing Data Durability and Self-Healing with Synchronous Replication”. This covers an enhancement that was done in CloudNativePG 1.25, where they have introduced a new data durability option. It gives you options for how to balance availability of the data versus consistency. For example, if you want to ensure strict data consistency and durability, you can set it to required, meaning it will always keep however many synchronous copies of the data you've configured. Or you can use the preferred method for data durability, which prioritizes self-healing and high availability by disabling synchronous replication when there are no replicas in the cluster. That allows you to still use the cluster even when a synchronous replica is not available. So if you want to learn more about that, definitely check out this blog post.

Next piece of content: “Waiting for PostgreSQL 18 – Add UUID version 7 generation function.” So hopefully this new commit will finally land in Postgres 18, coming up next fall, where we finally get official UUID version 7 support via a uuidv7() function. For those of you who are not familiar with it, this gives you a time-based portion at the start of the UUID and a randomized portion at the end, which should mean much less index bloat, because inserts don't have to do all the random writes they would with a fully random UUID. Whereas every part of a random UUID is random, with UUID version 7 you can even extract the timestamp from it, and there's a function to do that. Oh, and I should say this is from depesz.com.
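
Here's roughly what that looks like, as a sketch assuming the function names covered in the post, uuidv7() (new in Postgres 18) and the existing uuid_extract_timestamp() helper:

```sql
-- Generate a UUIDv7: a time-ordered prefix followed by random bits
SELECT uuidv7();

-- Pull the embedded timestamp back out of a v7 UUID
SELECT uuid_extract_timestamp(uuidv7());
```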

The next post, also related to UUIDv7, is “Postgres UUIDv7 + per-backend monotonicity”. This is from brandur.org, and he points out that within each backend, the IDs you generate are guaranteed to keep going up. It says “We select a sub second timestamp so that it monotonically increases for generated UUIDs within the same back end even when the system clock goes backward or when generating UUIDs at very high frequency”. So that's definitely great, and he goes into a bit more detail about UUIDv7 as well.
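
A quick way to see that per-backend ordering, sketched against the same assumed uuidv7() function and run in a single session:

```sql
-- Generate a handful of UUIDs in one backend and confirm they sort
-- in the same order they were generated
WITH gen AS (
    SELECT g AS n, uuidv7() AS id
    FROM generate_series(1, 5) AS g
)
SELECT n, id, id > lag(id) OVER (ORDER BY n) AS greater_than_previous
FROM gen;
```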

Next post. There were two episodes that were missed from Postgres FM. The more recent one is an episode on “pg_duckdb”, where they had Joe from, I think, Hydra and Jelte from, I think, MotherDuck, and they discussed the pg_duckdb project. This is where they take the DuckDB engine and embed it within Postgres, so you can send queries to Postgres but have the DuckDB engine be the one that actually executes the query to get the data you're looking for. So the DuckDB engine can, say, be used against Parquet files in S3: you could run a query in Postgres that actually gets passed off to the DuckDB engine to execute against column-oriented Parquet files in S3. And they say with this project they even want to get to the point where you could potentially do a join between that data and Postgres heap tables. So, definitely an interesting episode. Feel free to check it out or watch the YouTube video down here.
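
As a rough illustration of the idea only (not from the episode; the bucket and path are made up, and pg_duckdb's exact syntax for reading Parquet and configuring S3 credentials may differ by version):

```sql
-- With the pg_duckdb extension installed and S3 access configured,
-- a query like this is handed to the embedded DuckDB engine, which reads
-- the columnar Parquet data directly from object storage
SELECT count(*)
FROM read_parquet('s3://my-bucket/events/2024/*.parquet');
```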

The next episode from Postgres FM was “RLS vs performance”, RLS being row-level security. Personally, I've never used row-level security; I try to avoid it because I know the big performance impact it can have. But there are ways to mitigate that, and they discuss them in this episode. Basically, make sure whatever column you're using to define your security policy is included in your indexes. They cover a few other tips as well, and also some things to watch out for. In general, in my experience, most applications handle security themselves and don't use row-level security at the database level. But if you are using it, you can probably learn some things by checking out this episode or watching the YouTube video down here.
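
To make that indexing tip concrete, here's a minimal sketch; the table, policy, and app.tenant_id setting are all hypothetical, not from the episode:

```sql
-- Hypothetical multi-tenant table protected by row-level security
CREATE TABLE documents (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tenant_id integer NOT NULL,
    body      text
);

ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

-- Every query from a non-owner role is implicitly filtered by this predicate
CREATE POLICY tenant_isolation ON documents
    USING (tenant_id = current_setting('app.tenant_id')::integer);

-- Lead your indexes with the column the policy filters on, so the implicit
-- RLS predicate can be satisfied by an index scan instead of filtering every row
CREATE INDEX documents_tenant_idx ON documents (tenant_id, id);
```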

Next piece of content: “Running an Async Web Query Queue with Procedures and pg_cron”. This is from crunchydata.com. Usually I would do something like this within my web application framework, but if you want to do it in the database, here's a way you can do it. What he first says is that if you're going to be using the HTTP extension within Postgres, you can get yourself into trouble, because that one backend has to sit and wait for the web request to come back, which introduces a lot of latency while the backend is just sitting around. So there are different ways you can design it to make it more efficient, hence the async web query queue. The example he goes over is geocoding: if you've got an address and you want to find out, say, the longitude and latitude of that location, you do a geocoding lookup. Well, that requires a call out to a web service to get back that geographic information. What he proposes is basically using pg_cron to run a regular job and to use a procedure, and he says the trick is to select everything that needs to be geocoded and then geocode it one row at a time. But if you want to learn more, definitely check out this blog post.
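
Here's a sketch of the general shape of that approach, not the post's actual code: the table, columns, and geocoder URL are hypothetical, and it assumes the pg_cron and pgsql-http extensions are installed.

```sql
-- Hypothetical queue table of addresses awaiting geocoding
CREATE TABLE geocode_queue (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    address     text NOT NULL,
    lon         double precision,
    lat         double precision,
    geocoded_at timestamptz
);

-- Procedure that works through the queue one row at a time, committing after
-- each web call so a slow lookup never holds one long transaction open
CREATE OR REPLACE PROCEDURE process_geocode_queue()
LANGUAGE plpgsql
AS $$
DECLARE
    rec record;
BEGIN
    FOR rec IN SELECT id, address FROM geocode_queue WHERE geocoded_at IS NULL LOOP
        -- http_get() comes from the pgsql-http extension; the geocoder URL is made up,
        -- and real code would URL-encode the address and parse the JSON response
        PERFORM http_get('https://geocoder.example.com/search?q=' || rec.address);
        -- (real code would store the returned lon/lat here as well)
        UPDATE geocode_queue SET geocoded_at = now() WHERE id = rec.id;
        COMMIT;
    END LOOP;
END;
$$;

-- Have pg_cron kick the worker off every minute
SELECT cron.schedule('geocode-worker', '* * * * *', 'CALL process_geocode_queue()');
```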

Next piece of content: “Can we use this index, please? – Why not?” This is from hdombrovskaya.wordpress.com. She was brought a query that looks roughly like the one shown here, and it was running very inconsistently, sometimes for many minutes, sometimes for just a minute or so. It did have an index on columns one and two as well as created_at, but it was still pretty slow. She thought about potentially doing a partial index for this, but how many partial indexes would you need for each hour and minute you might want to extract? That seemed like a very inefficient way to do it. But then, while doing some tests, she defined a version of the query that just got a count, and it returned in milliseconds. She was like, wait a minute, what happened? The reason it was so fast is that it was doing an index-only scan, so it was able to get all of the information from the index alone. So what she decided to do was take that count query, keep every portion of it except the count, and instead return columns one and two and created_at, then use an IN query to get all the columns from the table by matching on columns one, two and created_at. That allowed the query to run consistently with good performance. So check it out if you want to learn more.
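
Here's the general shape of that rewrite, as a hedged sketch with made-up table and column names rather than the query from the post:

```sql
-- Assumed index: CREATE INDEX ON events (col1, col2, created_at);

-- The inner query touches only indexed columns, so it can be satisfied by an
-- index-only scan; the outer query then fetches the full rows for just the
-- matching keys
SELECT e.*
FROM events e
WHERE (e.col1, e.col2, e.created_at) IN (
    SELECT col1, col2, created_at
    FROM events
    WHERE col1 = 42
      AND col2 = 7
      AND created_at >= now() - interval '1 hour'
);
```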

And the last piece of content: “Postgres backend statistics (Part 1): I/O statistics”. This is from bdrouvot.github.io, and it looks like Postgres 18 will have a way to look at backend-level I/O statistics. So basically you can look up a specific backend and get statistics on it with regard to reads, read time, writes, extends, etc. It definitely fleshes out the information you can get back about running backends. So if you want to learn more, definitely check out this blog post.
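
For example, something along these lines, a sketch assuming the pg_stat_get_backend_io() function the post describes (the exact output columns may still change before the Postgres 18 release):

```sql
-- I/O statistics for the current backend, broken down by object and context
SELECT *
FROM pg_stat_get_backend_io(pg_backend_pid());
```

The same function can be pointed at any other live backend by passing a PID taken from pg_stat_activity.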

 
