Unlogged Tables, pg_stat_io, Type Constraints, Text Types | Scaling Postgres 253
In this episode of Scaling Postgres, we discuss working with unlogged tables, the new pg_stat_io feature, handling complex type constraints and choosing the best text type.
Content Discussed
- PostgreSQL Unlogged Tables - Look Ma, No WAL!
- Waiting for Postgres 16: Cumulative I/O statistics with pg_stat_io
- Type Constraints in 65 lines of SQL
- Should You Use char, varchar, or text in PostgreSQL?
- UUIDs vs serials for keys
- Avoid Postgres performance cliffs with MultiXact IDs and foreign keys
- PostgreSQL Patch of Interest (to me): Using the System CA Pool for Certificate Verification
- PostgreSQL: Finding the current timestamp
- Get the most out of PostgreSQL using psql with must known features.
- Ranking Data with SQL Window Functions
- Exploring Row Level Security In PostgreSQL
- Postgres WAL Files and Sequence Numbers
- Listen to Database Changes with Postgres Triggers and Elixir
- Go & PostgreSQL: FOSDEM 2023 talk
- Storing Network Addresses in Oracle vs PostgreSQL vs SQL Server
- Real-time analytics
- Takamichi Osumi
- Getting Stuff Done With Drew Bragg | Rubber Duck Dev Show 74
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL Unlogged Tables - Look Ma, No WAL!". This is from crunchydata.com. They're talking about unlogged tables, which means nothing gets written to the WAL for them. Now that has some advantages and some disadvantages. The main advantages they list are improved write performance, which they highlight below, less vacuum impact, and less total WAL, basically because these changes never enter the write-ahead log stream. But there are some significant disadvantages. One, the tables are truncated on crash recovery, so all of your data gets wiped out. Now they do show that unlogged tables persist through a clean restart, but if there's any type of crash, the system doesn't try to preserve the data.
It just truncates the tables. Second, unlogged tables can only be accessed on the primary, not on a replica. So again, this really only suits temporary use cases. And three, unlogged tables cannot be used in logical replication or physical backups. So there are definite advantages and disadvantages, and what I always say is: if you care about your data, don't put it in an unlogged table. But if you have temporary data, maybe you're loading some data and need a staging ground before you place it in the final tables, you could use unlogged tables for that purpose, because if something crashes you can always restart the data loading process. The reality is that a lot of times I just use temporary tables, but unlogged tables could be advantageous if you need the data to be visible across multiple sessions or you're trying to load data in parallel.
There may be some use cases for them, but I haven't used unlogged tables a lot. This blog post goes into a lot of detail about how they work and shows the performance implications of the advantages as well as the disadvantages that were discussed. They also point out that you can convert a table from logged to unlogged and from unlogged to logged. So there may be cases where you want to set up a table quickly and then turn it into a logged table. Just be aware that when you turn logging on, the entire table has to be written to the WAL, because it now has to be set up for crash recovery. But if you want to learn more, definitely check out this blog post.
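To make that concrete, here's a minimal sketch of the workflow described above; the table name and columns are just for illustration.

```sql
-- Unlogged: fast to bulk load, but truncated after a crash and not available on replicas.
CREATE UNLOGGED TABLE staging_import (
    id   bigint,
    data jsonb
);

-- ... bulk load into staging_import here ...

-- Converting to logged writes the whole table to the WAL so it becomes crash safe.
ALTER TABLE staging_import SET LOGGED;

-- And the reverse, if you want to go back:
-- ALTER TABLE staging_import SET UNLOGGED;
```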
Next piece of content- "Waiting for Postgres 16: Cumulative I/O statistics with pg_stat_io". This is from pganalyze.com. They're talking about a new feature in 16 called pg_stat_io. So basically it takes the concept of pg_stat_statements, which records cumulative queries being run against the system. pg_stat_io records cumulative I/O activity on the database and breaks it out by the backend being used and the context, it looks like this. It records the reads, the writes, and the extends, which extends objects to store more data, as well as highlights evictions, reuses, and fsyncs. So this is a lot of detail that wasn't present before and this post mentions it's due to some of the groundwork laid in 15 that has enabled this feature to now come to fruition.
This post goes over a lot of the advantages in terms of being able to really narrow down the different I/O activities and what's happening so you can potentially figure out performance problems. It particularly talks about shared buffer evictions because sometimes when you want to read a lot of data you have to evict a lot of things from the shared buffers, which has an I/O implication. This can help narrow down and understand exactly what's going on with your I/O systems as well as how much vacuum is being used. So this is a really great improvement. If you want to learn more about it, definitely encourage you to check out this blog post.
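As a quick example of what you could run once you're on Postgres 16, here's a simple query against the new view; the filter and ordering are just one way to look at it.

```sql
-- Which backend types are doing the most I/O, and in what context?
-- (pg_stat_io is available starting in Postgres 16.)
SELECT backend_type, object, context,
       reads, writes, extends,
       evictions, reuses, fsyncs
FROM pg_stat_io
WHERE reads > 0 OR writes > 0
ORDER BY writes DESC NULLS LAST;
```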
Next piece of content- "Type Constraints in 65 lines of SQL". This is from supabase.com. They're talking about wanting to create a data type in Postgres called Semantic Versioning. So they want to support showing the major version ·(dot) miner version ·(dot) patch version -(hyphen) then maybe a pre-release +(plus) metadata associated with it. So that's a pretty complex type. So how they handled it is that you first created a composite type by doing create type and they are adding an integer for the major minor and patch. Then they use a text array for pre-release information and a text array for the build metadata portion. So that gives you the basic structure but there are no constraints, other than the individual data types, that prevent invalid semantic versions from being created. So then what they did is they created a domain, which is your own custom type, that you can apply constraints to. So for example, the major portion of the symmetric versioning, they said, is not null and the value must be greater than or equal to zero.
And they applied that same thing to the minor and the patch integers as well. And then they put a regular expression constraint on the pre-release and the metadata. So now they have some control of being able to reject invalid versions. Then the next step is presenting it because as it's stored once you present it, it's just going to present it as three integers and two text arrays. So they show that example here which is not appealing and is not the proper format. So how they handled the format is they actually created a custom function that outputs the data in precisely the correct format. So this is a great post using custom types and custom domains to be able to present a highly complex data type. If you want to learn more, definitely check out this blog post.
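To give a flavor of the approach, here's a condensed sketch in the spirit of the post, not its exact 65 lines; the type and function names are mine, and the regular expression checks on pre-release and metadata are omitted for brevity.

```sql
-- Composite type holding the pieces of a semantic version.
CREATE TYPE semver_parts AS (
    major      integer,
    minor      integer,
    patch      integer,
    prerelease text[],
    metadata   text[]
);

-- Domain layering constraints on top of the composite type.
CREATE DOMAIN semver AS semver_parts
    CHECK ((VALUE).major IS NOT NULL AND (VALUE).major >= 0)
    CHECK ((VALUE).minor IS NOT NULL AND (VALUE).minor >= 0)
    CHECK ((VALUE).patch IS NOT NULL AND (VALUE).patch >= 0);

-- Formatting function to render the canonical "1.2.3-alpha+build" text.
CREATE FUNCTION semver_to_text(v semver) RETURNS text
LANGUAGE sql IMMUTABLE AS $$
    SELECT format('%s.%s.%s', (v).major, (v).minor, (v).patch)
        || COALESCE('-' || array_to_string((v).prerelease, '.'), '')
        || COALESCE('+' || array_to_string((v).metadata, '.'), '');
$$;
```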
Next piece of content- "Should You Use char, varchar, or text in PostgreSQL?". This is from maximorlov.com. We've seen this discussed in previous episodes of Scaling Postgres but this was a very succinct description of what has seemed to become the accepted path. In that char or character is fixed length, so it's always going to use the number of characters you set aside. So it's going to store a little bit larger than the other two- varchar and text. But the important thing to realize is that there's no performance advantage in Postgres for any of these. So any of these can be used and because you get the benefit of space savings with a varchar or text, it makes more sense to use one of those.
But the other issue that this discusses is when you enforce a maximum length, you've now limited yourself. If you need to change the size and increase it, you're essentially going to have to change the data type. So the better solution is to just use the text data type and if you need constraints, do a check constraint to do it. He has an example of using a check constraint. So basically, for storing text or strings, the best approach for Postgres now seems to be just to use the text data type and if you need to or want to limit it use a check constraint. But if you want to learn more, definitely check out this blog post.
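Here's a small illustration of that pattern; the table and column names are hypothetical.

```sql
-- Use text plus a CHECK constraint instead of varchar(n).
CREATE TABLE users (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    username text NOT NULL,
    CONSTRAINT username_length CHECK (char_length(username) <= 50)
);

-- Raising the limit later only swaps the constraint; the column type never changes:
-- ALTER TABLE users DROP CONSTRAINT username_length;
-- ALTER TABLE users ADD CONSTRAINT username_length
--     CHECK (char_length(username) <= 100) NOT VALID;
```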
Next piece of content- "UUIDs vs. serials for keys". This is from thebuild.com. We've seen this discussion before, I think we've even had this discussion on The Rubber Duck Dev Show, about whether you should use UUIDs or bigints for primary keys. His opinion was there may be a better way to ask that question. One- should your keys be random or sequential? Two- should your keys be 64 bits or larger? Because you could use a bigint that is random or you could use it as sequential. There are also ways to get UUIDS that have a component that are sequential in nature. Now being sequential has advantages for performance. So the more random, the more difficult it is to insert update values and Postgres. Or you're going to have some sort of performance issues. The other issue he discusses is whether the keys should be 64 bits larger. He states that working with 128 bit values is actually more difficult to address.
I mean, you're moving more data around than the 64 bit. But every time I have this discussion, I kind of go back to that it depends on where your IDs are being generated. If it's being generated outside the database, UUIDs make a ton of sense. Because presumably, they should be created in a unique fashion and you just store those in your database. If the database is producing them, then to me it makes more sense to make it a bigint and make it sequential. If you need randomization, do that in a separate column, for example, if you want to present something to the public. So that way you get the advantage of performance by having a sequential integer ID, at least for most of your keys. And if there's anything external, go ahead and rely on a UUID for that. But definitely, some insights into this discussion if you want to take a look.
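For what that looks like in practice, here's a sketch of the pattern I'm describing: a sequential bigint primary key plus a separate random identifier for public exposure. The table name is just an example; gen_random_uuid() is built into Postgres 13 and later.

```sql
CREATE TABLE orders (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- sequential, index friendly
    public_id  uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,   -- random, safe to expose publicly
    created_at timestamptz NOT NULL DEFAULT now()
);
```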
Next piece of content- "Avoid Postgres performance cliffs with MultiXact IDs and foreign keys". This is from pganalyze.com. And this is the next episode of "Five minutes of Postgres". They cover a post from last month talking about a foreign key pathology to avoid. Now we covered this in a previous episode of Scaling Postgres. This is where you have a streaming service and you have one streamer that suddenly has millions of viewers and if you have a foreign key relationship, from millions of rows to that one row, that one streamer row, you're going to be producing a ton of multi-transaction IDs trying to track. I think it's the four-key share lock that's happening on that row. You can run into performance problems. Well, Lukas goes into more depth on this and even covers an instance where somebody was seeing poor performance. So a real-life scenario in the second post he covers. So if you want to hear about a real-world issue where something like this happened, as well as his perspective on this, definitely check out his piece of content.
Next piece of content- "PostgreSQL Patch of Interest (to me): Using the System CA POOL for Certificate Verification". This is from softwareandbooze.com. He's talking about psql or I believe a libpq in general. When it sets up SSL for communication with Postgres, it actually looks in a specific location on the client for the certificate and that is in the home directory in a .postgresql hidden directory for the root cert- .postgresql/root.crt. So basically, this has to be in the same place every time. It doesn't look at where all the other certificates are stored in the local system. It only looks in this one place and it also causes problems if you need to connect to more than one database server, you can only have this one root certificate. Well, there's an enhancement that will hopefully be coming in 16 where you can specify the SSL root cert to be a system. Then it will know to look in your systems certificate authority pool. It won't be looking at this one location in all cases. So this sounds like a great enhancement. And if you want to learn more about that, check out this blog post.
Next piece of content- "POSTGRESQL: FINDING THE CURRENT TIMESTAMP". This is from cybertec-postgresql.com. They're talking about the three different timings that you can get in Postgres. One is the real-time, one is the statement time, and one is the transaction time. Now the most common time component I use and that's just because what comes to mind immediately is now(). But now() is actually the transaction time. So whenever you start a transaction, that is the time it's going to record throughout the transaction. So you can see there are three different statements that were run but they are all at the same time within this transaction. So now() is the transaction time. There's also a statement_timestamp(), which is the timestamp for each statement. So here you could see no matter how many times you call statement timestamp, even with a slight delay with sleep, you're always going to get the same timestamp for that particular statement. But the actual real-time, in other words, that's going to give you the exact time every time that function is run. You use the clock_timestamp(). So that essentially gives you real-time. So if you want to learn more, you can check out this blog post.
Next piece of content- "Get the most out of PostgreSQL using psql with must known features". This is from databaserookies.wordpress.com. He talks all about his favorite features in psql, from defining what code editor you want to use to being able to change your password without showing it visually on the screen. Setting up custom aliases within psql, using the shortcut table to select everything from a table, monitoring the output using the watch command, using \gexec to execute multiple SQL queries based upon a previous query, and echoing all internals query using the -E option. So definitely a great set of features and if you want to learn more about these, check out his blog post.
Next piece of content- "Ranking Data with SQL Window Functions". This is from antonz.org. He's taking an employee table with different salaries and he's going to rank those salaries in different ways. Now this is quite a long blog post but he goes through step-by-step on setting up window functions to do each part of the query. So if you want to get more familiar with windows functions, I definitely encourage you to check out this piece of content. He also covers more than just PostgreSQL, I believe, and it works also on mySQL and SQLite as well apparently.
Next piece of content-"Exploring Row Level Security In PostgreSQL". This is from pgdash.io. So if you're interested in learning more about row level security, you can definitely check out this blog post.
Next piece of content- "Postgres WAL Files and Sequence Numbers". This is from crunchydata.com. They're talking about understanding the difference between the WAL files that are being generated in your system and the actual sequence numbers that you can query on the system to know more about where you're at vs. what's being written. So if you want to understand more about that, definitely check out this blog post.
Next piece of content- "LISTEN TO DATABASE CHANGES WITH POSTGRES TRIGGERS AND ELIXIR''. This is from peterulrich.com. Now this is very Elixir heavy in terms of the use case of setting up a listen notify. There is some SQL in here, you can see where he's performing the notify with a particular name and a payload. But I believe his listening is very Elixir heavy because he relies on a library. But if you use that language, maybe you'd be interested in this.
Next piece of content- "GO & POSTGRESQL: FOSDEM 2023 TALK". This is from cybertec-postgresql.com. So whereas the previous post was about Elixir, this one focuses on GO in Postgres. So definitely try to check out if you use both of those.
Next piece of content- "Storing Network Addresses in Oracle vs PostgreSQL vs SQL Server". This is migops.com. If you're migrating from one of these databases to another, this covers how network data types are stored in each of these databases to understand better how to convert data between them.
Next piece of content. There was another episode of Postgres FM this week. This time it was on "Real-time analytics". So if you want to learn how to handle that better in Postgres, definitely check out their episode. You can listen to it or watch it on YouTube.
Next piece of content, the PostgreSQL person of the week is Takamichi Osumi. If you're interested in learning more about Takamichi and his contributions to Postgres, definitely check out this blog post.
And the last piece of content, we did have another episode of The Rubber Duck Dev Show this past Thursday afternoon. This one was on "Getting Stuff Done with Drew Bragg". So if you're a developer and want to be more efficient in your work, we definitely welcome you to check out our show to see if our discussion might be of benefit to you.