UUID Use Cases, pg_settings, Automated Restore, Parallel Future | Scaling Postgres 102
In this episode of Scaling Postgres, we discuss the use cases for UUIDs, using pg_settings, setting up an automated restore and the future of parallelism.
Content Discussed
- Why I’m not fan of uuid datatype
- Natural versus Surrogate Primary Keys in a Distributed SQL Database — The Distributed SQL Blog
- Generate a UUID as PostgreSQL default value
- Take advantage of pg_settings when dealing with your configuration
- Compression of PostgreSQL WAL Archives Becoming More Important
- Automated Testing of PostgreSQL Backups
- Parallelism, what next?
- shared_buffers: Looking into the PostgreSQL I/O cache
- Which tables should be auto vacuumed or auto analyzed – UPDATE
- Waiting for PostgreSQL 13 – Add %x to default PROMPT1 and PROMPT2 in psql
- Have An Eye On Locks Of PostgreSQL
- [Tool] pg_flame
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Why I'm not a fan of uuid datatype". This is from depesz.com and he's basically stating the reasons why he's not a big fan of the UUID data type. Now, this is a controversial topic, as you can tell by the number of comments on this post, but I still found it particularly interesting. The first problem, he says, is that UUIDs are completely opaque. You have these sequences of letters and numbers, whereas something like an integer is much easier to reason about, to remember if you're trying to reference it somewhere, and it typically occurs in order. The second problem is that you can't sort them very well. Now, there are ways around this. He mentions version 1 UUIDs as a possibility, but in the comments there's also a link to an article that we actually covered on a previous episode of Scaling Postgres that enables you to have partially sequential UUIDs.
The third problem is that it's a pretty wide value, definitely larger than most integers you would be using. Then he did a little bit of a performance test to compare the speed of doing a process with integers versus UUIDs, and the UUIDs were more than twice as slow. However, he does note that if this was a more complex table, the difference would be smaller. And then there are a number of comments that discuss this, including a link to that sequential UUID generator, and they go back and forth on this particular issue. Now, I would say I have to agree with him in that I'm not a fan of UUIDs. What I mean by that is that I tend to use integers as primary keys and only reach for UUIDs in specific use cases. The first use case is if you're going to have a data set that spans multiple databases or servers; then I could see needing to use a UUID.
So for example, some of the shared-nothing database architectures use UUIDs because you're literally generating IDs on separate machines. Another use case is if a client is generating a particular ID itself: having it generate a UUID that you can then store in the database, so the record effectively gets created outside the database. Then I could see using UUIDs. Similarly, for separate database systems, someone mentioned how they have an on-premises solution and a hosted solution, and to make it easier to deal with, they used UUIDs so that someone could easily move back and forth between those situations. That's listed in the comments here. So, an interesting discussion, but I tend to look at it from the perspective of which use cases would lead me to use UUIDs.
Now, related to this is another article that is not PostgreSQL specific: "Natural versus Surrogate Primary Keys in a Distributed SQL Database - The Distributed SQL Blog". This is from medium.com. Now, this is specifically about YugabyteDB, but the kinds of things they mention here still apply: you would want to use a UUID as a surrogate primary key when you essentially have multiple databases, and in this case I believe they're kind of synchronizing with each other. So in the discussion of UUIDs, this was another article this week that I thought was relevant, even though it's not specifically about Postgres.
And lastly, another article is "Generate a UUID as PostgreSQL default value". This is from medium.com. So this is a relatively brief post that tells you how to set up UUIDs in your table as a default value. You can do it using either the pgcrypto extension or the uuid-ossp extension. The latter is the one that I use, with uuid_generate_v4() as the default value when creating tables that need it. So if you're interested in this UUID content, definitely check out one of these three blog posts.
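For reference, here's a minimal sketch of that approach using the uuid-ossp extension; the users table and its columns are just hypothetical examples:

```sql
-- Enable the uuid-ossp extension and use uuid_generate_v4() as the column default.
-- (pgcrypto's gen_random_uuid() is the other option the post mentions.)
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE users (
    id    uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
    email text NOT NULL
);
```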
The next article is "Take advantage of pg_settings when dealing with your configuration". This is from fluca1978.github.io. Now he's talking about where someone's trying to diagnose something and they ask you what's your configuration. So they just grep say the postgresql.conf file to give what the configuration is. But the problem is that's the kind of star configuration not even considering the auto.conf file, but what is the runtime configuration? Now he lists all the different downsides of this about omitting the postgresql.conf.auto file. It includes commented lines and again it doesn't reflect the running configuration. So how do you get the real configuration? He says here you use the pg_settings system table so you can do a query as you want to and it shows you the setting, what the current setting is, and whether it's changed pending a restart. So you need to do a restart to enable it. So this post is just a quick reminder about the existence of pg_settings and you can use it in addition to show commands to see what your current running PostgreSQL settings are.
The next post is "Automated Testing of PostgreSQL Backups". This is from pgdash.io. So basically this is doing automated restores and this blog post runs through two different scenarios. So say you're using pg_base_backup. They go through a scenario of the scripts needed to create a directory, unzip the items into that directory, go ahead, and start at the database. However, if you're starting it on an existing system you'll probably want to adjust the port and some additional settings and then do a query to test out that that restore was successful. It had the data that you expected to be in there. So a table count or the last update at something of that nature to validate that it was a successful restore.
And they have a full script here to do that. Then they have a second version using pg_dump. So if you use pg_dump, it goes through the process of using either psql or pg_restore to reload the files and walks through the whole process. They also make a note to watch out for triggers, because for some triggers, particularly if they're contacting another service or some other database system, you want to disable those triggers as data is being loaded. So that's definitely something to keep in mind with pg_dump and pg_restore. And lastly, they cover doing point-in-time recovery testing and how you could set that up. So this is definitely a critical thing to do to verify that your backups are working and successful, and here's a great post that covers how you could do that.
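As a rough idea of the validation step they describe, here's a hypothetical post-restore sanity check; the orders table and updated_at column are stand-ins for whatever data you expect your restore to contain:

```sql
-- Run against the freshly restored database and compare the results against
-- what the source system reported at backup time.
SELECT count(*)        AS row_count,
       max(updated_at) AS most_recent_update
FROM orders;
```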
The next post is "Parallelism, what next?". This is from amitkapila16.blogspot.com. In the initial set of paragraphs here, he covers what's been developed in terms of parallelism in Postgres from 9.6 on through 12. Then he's talking about what is in progress for 13. The first one is a parallel vacuum. So being able to vacuum indexes in parallel, could be a huge benefit if you have a lot of indexes on your table because it's the indexes that tend to take the longest time with vacuum in my experience. Second, improving EXPLAIN's handling of per-worker details can be very beneficial to try to understand what's going on with EXPLAIN plans. Then avoid unnecessary shared memory writes in parallel hash join. So this could be a good performance benefit in certain cases in terms of what's being discussed for the future parallel grouping sets. The second is a parallel copy, so being able to load data copy in parallel, could be advantageous. And then a parallel file Foreign Data Wrapper, probably similar to the work being done for COPY to be able to have that operate in parallel. And in the future, they're talking about doing more parallel work with things like gin and GiST indexes because right now it's primarily done for B-tree indexes. And then of course there are additional comments that you may want to review as well. So if you're interested in what's in store for parallelism in the future, definitely a blog post to check out.
The next post is "SHARED_BUFFERS: LOOKING INTO THE POSTGRESQL I/O CACHE". This is from cybertec-postgresql.com and it discusses how you can get insight into what the shared memory cache is storing. And he creates a simple test database to do it. And he's using the extension pg_buffercache to be able to get that information. Then he shows you how you can expect to inspect per database caching as well as inspecting your current database. So if you have a desire to investigate what your cache is storing, definitely a blog post to check out.
The next post is "Which table should you be auto vacuumed or auto analyzed - UPDATE". This is from depesz.com. So this is an update to a previous post where he basically made a prediction of what the next vacuum that's scheduled to be run, or the next analyze suspected to be run. But that version did not do it per table. It was only consulting the system-wide values that are set here, say in the postgresql.conf file. This one takes into account table settings as well. So if you've altered a particular table and adjusted the vacuum settings, this script now consults those as well. So if you're interested in this tool, definitely check out this blog post.
Also from depesz.com is "Waiting for PostgreSQL 13 - Add %x to default PROMPT1 and PROMPT2 in psql". So it looks like what they're referring to here is that they've added some additional indicators of the transaction state. For example, in the current version of Postgres, when you do a BEGIN, then a SELECT that gets an error, and then another SELECT while you're still inside the failed transaction, the prompt gives no additional indication; it's just the usual =# symbols. Whereas in the new version, the prompt has additional markers to say, okay, we're inside a transaction here, and the ! means we're inside a transaction that has failed but hasn't been rolled back yet. So some additional prompts as you're working with transactions that could be beneficial. So if you're interested in learning more, go ahead and check out this post.
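Here's a small sketch of what that looks like in a psql session on PostgreSQL 13, with the prompts shown as comments since they come from psql rather than SQL; %x expands to * inside an open transaction and ! inside a failed one:

```sql
BEGIN;
-- prompt is now:  postgres=*#
SELECT 1/0;        -- this fails and puts the transaction into an aborted state
-- prompt is now:  postgres=!#
ROLLBACK;
-- prompt is back to:  postgres=#
```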
The next article is "Have An Eye On Locks Of PostgreSQL". This is from highgo.ca. It's a post that talks all about locks. The first thing they mention here is that you can look at what locks exist using the pg_locks system table. Then they go into table-level locks and then all the different share types that exist for them. Then they go into row-level locks and how they can be shared, covering transaction locks, page locks, and advisory locks. So if you're interested in learning more about locks in PostgreSQL, definitely a blog post to check out.
The last piece of content is actually a tool. Now, I don't normally mention tools, but I thought this was interesting. It's called pg_flame and it creates flame graphs of your EXPLAIN ANALYZE output. It looks like it runs on a Mac, or you can use it with Docker as well. But basically, you take your EXPLAIN ANALYZE output and it can generate a flame graph for you. So if you want a more visual way of looking at your EXPLAIN ANALYZE output, definitely a tool to check out.
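If you want to try it, the input it visualizes is EXPLAIN ANALYZE output; a hypothetical query in the machine-readable JSON format that such tools typically consume looks like this, and the exact pg_flame invocation is described in its README:

```sql
-- Produce execution statistics in JSON form for a (hypothetical) query.
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT *
FROM orders
WHERE customer_id = 42;
```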