Book Sales, B-tree Boost, More Postgres 12, Using pgBackRest | Scaling Postgres 92
In this episode of Scaling Postgres, we discuss book sales, boosts to B-tree indexes, more Postgres 12 features, and how to set up and use pgBackRest.
Content Discussed
- Book: SQL Performance Explained
- Book: The Art of PostgreSQL (Black Friday & Cyber Monday sale)
- B-tree index improvements in PostgreSQL v12
- Discovering less-known PostgreSQL v12 features
- Setting SSL/TLS protocol versions with PostgreSQL 12
- Hope Is Not A Strategy: How Hardware Issues Affect Your PostgreSQL Data
- How To Get Started with pgBackRest and PostgreSQL 12
- [YouTube] PGConf.Russia Channel
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. The first piece of content is actually an announcement that two books are on sale this Black Friday and Cyber Monday. Though of course, by the time this episode was published, you may have missed the sales, so hopefully you're watching this episode early. SQL Performance Explained at sql-performance-explained.com is on sale, and there's a Cyber Monday deal where you can get the PDF for free. I've actually purchased this one, so if you're interested, it's definitely a book to check out. The second book is The Art of PostgreSQL at theartofpostgresql.com, and they're also having a sale: 50% off with a coupon. So, it's definitely a great time to get some great educational materials on PostgreSQL.
The next post is "B-TREE INDEX IMPROVEMENTS IN POSTGRESQL V12". This is from cybertec-postgresql.com and it talks specifically about the improvements that were made to the index. It's mostly an infrastructure improvement and the example that they list here is they created a two-column table that's thought to be considered a join table. So they have a column aid and bid that bigints and there is a composite or multicolumn index for the primary key across both. Then a secondary index on bid in case you want to query those numbers faster. And then to that they added test data using this query seen here using generate_series, and they say each bid is related to 10,000 aids. Now the first thing you'll notice is the size difference. So in version 11, the size is 545 megabytes for the bid index alone, but in version 12, it's 408 megabytes, so it's 33% bigger in version 11.
So the question is: how is this happening? They note that every value of bid occurs 10,000 times in this index, so there are a lot of duplicates, and on many of the leaf pages all of the keys are going to be the same. Since PostgreSQL doesn't know exactly where a new index entry is going to go, you could get one of these middle leaf pages being split, and you're not necessarily going to be filling in each leaf page. The reason is that entries with the same value are stored in no special order in the index. They did mention the rightmost leaf page was always split towards the right end to optimize for monotonically increasing inserts. In contrast to this, other leaf pages were split in the middle, which wasted space. What they're doing in version 12 is adding the physical address, the TID or tuple ID, as part of the index key. And as they say, duplicate index entries are then stored in table order, that is, the physical order of the table.
So of course the obvious benefit is that you should get significant performance benefits when trying to pull out this value, because the entries will all be clustered together on disk. Moreover, they say, pages that consist only of duplicates will be split at the right end, resulting in a densely packed index. So it basically packs the entries more tightly to get that space savings. The second index improvement is compressed storage of internal index pages. Looking at version 11 versus version 12, when doing a query using what looks to be the primary key, one fewer block is read in version 12, and they say that's because internal pages can fit more index entries. So how does it do that? The article describes how version 12 started storing the TID as part of the index key, but they also have a way to reduce this redundancy, because if you stored it for every single index entry, it would hurt the space savings.
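If you want to see this kind of difference yourself, a buffers-enabled EXPLAIN shows how many blocks a query touches (continuing the hypothetical names from the earlier sketch):

```sql
-- Compare the "Buffers: shared hit=..." counts between a v11
-- and a v12 server for the same primary-key lookup:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM rel
WHERE aid = 4200042 AND bid = 420;
```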
Where the TID is redundant, as are non-key attributes from an INCLUDE clause, they introduced truncation of these redundant index attributes. They show that in version 11 you see an item length of 24 when looking at it through the pageinspect extension, whereas it's 16 in version 12. So that also gives you some space savings. But the important part to remember, if you want to get these improvements, is that you actually have to reindex or rebuild your indexes. Doing an upgrade using, say, pg_upgrade is not going to automatically upgrade the indexes; you have to do a reindex of each index to get these benefits. So definitely a good explanation of the index changes that came with version 12 of PostgreSQL.
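Since version 12 also adds REINDEX CONCURRENTLY (covered below), the rebuild after a pg_upgrade can be done without blocking writes; a sketch, where the index and database names are hypothetical:

```sql
-- Rebuild a single index without taking a write lock (v12+):
REINDEX INDEX CONCURRENTLY rel_bid_idx;

-- Or rebuild every index in the current database:
REINDEX DATABASE CONCURRENTLY mydb;
```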
The next post, also from cybertec-postgresql.com, is "DISCOVERING LESS-KNOWN POSTGRESQL V12 FEATURES". Now, some of these are going to be redundant with ones we've covered in previous episodes, but I'll quickly run through some of the ones mentioned here. The first is the automatic inlining of common table expressions, which was discussed before. Allow parallelized queries when in SERIALIZABLE isolation mode, which could give some potential performance benefits if you're using the SERIALIZABLE isolation mode for some of your queries. JIT is enabled by default. But they also mention that if you don't have a data warehouse use case, or maybe your database is not that large, you could tune the JIT threshold family of parameters, jit_*_cost, so that small and medium data sets won't use JIT; see the sketch below.
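For instance, a minimal sketch of raising those thresholds (the values are illustrative, not recommendations):

```sql
-- Defaults in v12 are 100000 / 500000 / 500000; raising them keeps
-- JIT compilation reserved for genuinely expensive plans.
SET jit_above_cost = 1000000;
SET jit_inline_above_cost = 5000000;
SET jit_optimize_above_cost = 5000000;
```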
The next one is support for the SQL/JSON path language. Allow foreign keys to reference partitioned tables, which we've heard before. Add partition introspection functions to be able to look at the roots and ancestors in the tree of a set of partitioned tables. Add the connection parameter tcp_user_timeout to control libpq's TCP timeout. Show the manual page URL in psql's \help output for a SQL command. Allow VACUUM to skip index cleanup, which could have some benefits, but of course you don't want to rely on it, since you do eventually need to vacuum those indexes. Add the EXPLAIN option SETTINGS to output non-default optimizer settings; turning SETTINGS on can be beneficial for getting additional detail when you're examining an EXPLAIN plan (see the sketch after this paragraph). Allow logging of statements from only a percentage of transactions to minimize log volume.
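A quick sketch of that EXPLAIN option (the table name orders is hypothetical):

```sql
-- SETTINGS appends any non-default planner settings to the plan:
SET enable_nestloop = off;
SET work_mem = '256MB';
EXPLAIN (SETTINGS) SELECT count(*) FROM orders;
-- The plan output ends with a line like:
--   Settings: enable_nestloop = 'off', work_mem = '256MB'
```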
Cause recovery to advance to the latest timeline by default. This applies when you're doing recovery, and it's beneficial because "latest" is usually what you'd set it to anyway, and now it's the default. Concurrent index rebuilding using REINDEX CONCURRENTLY, which is great. pg_checksums can now enable or disable page checksums for an offline cluster, and they discuss it a little bit here; hopefully this will eventually work in online mode, but right now the cluster must be offline, as in the sketch after this paragraph. Allow CREATE TABLE's tablespace specification for a partitioned table to affect the tablespace of its children, which looks like a definite convenience.
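A sketch of pg_checksums usage (the data directory path is hypothetical, and the cluster must be stopped first):

```bash
# Enable data page checksums on a stopped v12 cluster:
pg_checksums --enable -D /var/lib/pgsql/12/data

# Verify them later:
pg_checksums --check -D /var/lib/pgsql/12/data
```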
Allow values produced by queries to be assigned to pgbench variables, which will be useful if you use a lot of pgbench. Allow fractional input for integer server parameters, which is interesting because now you can do things like set your work memory to 1.5 GB as opposed to 1536 MB. Allow vacuumdb to select tables for vacuum based upon their wraparound horizon. And lastly, the changes around recovery.conf: it has gone away, replaced by the signal files. So it's yet another post covering all the different changes that came in PostgreSQL 12, and if you're interested in learning more, it's definitely a blog post to check out.
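Before moving on, a quick illustration of that fractional parameter input:

```sql
-- v12 accepts fractional values for integer parameters:
SET work_mem = '1.5GB';
SHOW work_mem;   -- reported as 1536MB
```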
The next post is "Setting SSL/TLS protocol versions with PostgreSQL 12". This is from 2ndquadrant.com, they're talking about the new settings, the ssl_min_protocol_version, and the ssl_max_protocol_version. So I tend to do a lot of this working with NGINX and now PostgreSQL has the capability to set a min_protocol_version as well as a max. They go through when different things were released in terms of the different protocol versions. SSL is not considered secure anymore, but they were mentioned here for completeness and then they mentioned the TLS versions. 1.2 is pretty much what most people are using. There are some that are still supporting one and 1.1 and 1.3 are relatively new, not really much use as of yet.
But if you actually want to see what your PostgreSQL instance is doing, there is, as mentioned here, the pg_stat_ssl view to see the SSL status of all connections. He mentions you can also enable log_connections on the server, because each connection attempt will then list the SSL parameters used. When you connect, you'll generally get output indicating whether it's an SSL connection, along with the different parameters. You can also use the psql command \conninfo to have that information printed at any time. So if you're interested in learning more about these settings, how to adjust them, and these different ways to determine what SSL version PostgreSQL is currently using, definitely a blog post to check out.
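A quick sketch of checking connection encryption from SQL:

```sql
-- One row per backend; version shows the negotiated TLS version:
SELECT pid, ssl, version, cipher
FROM pg_stat_ssl;
```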
The next post is "Hope Is Not A Strategy: How Hardware Issues Affect Your PostgreSQL Data". This is from crunchydata.com and this is pretty much an opinion piece. He was saying that he was made aware of a bulletin from Hewlett-Packard where certain models of their SSDs have a firmware bug that will cause drives to deterministically fail very suddenly at precisely 32,768 hours of operation. He says, of course, this is pretty much due to an overflow of assigned short integers and that this could also be affecting other models of these hard drives. So if you have these in some sort of arrayed array, you could suddenly have the whole system go down.
That's because you're not looking at a single drive failing due to some mechanical or electronic issue, but a programmatic issue that's going to bring all of the data offline at once. This prompted him to ask some questions, like: where exactly is your data? If you're trusting the cloud with it, are they taking care of the hard drives, and are they keeping track of issues like this? What kind of drives are holding your data: SSDs, or magnetic storage? Does the firmware get updated, et cetera? Are you doing regular, continuous backups, and are you testing those backups to verify that you can restore them? He mentions that someone he worked with previously used to say "hope is not a strategy". So if you're just hoping, that's not a strategy; you want to know, and be reasonably confident, that your data is being taken care of.
Now, related to this, they have another post called "How To Get Started with pgBackRest and PostgreSQL 12". Basically, they go through the whole process of using pgBackRest to do a backup and then a test restore, which is exactly what they're advocating. They do this on a CentOS box using pgBackRest 2.19, the most recent version, but they note that if you're using PostgreSQL 12 with its new recovery changes, you'll want pgBackRest version 2.18 or greater. So, using the recent version, they install PostgreSQL, install pgBackRest, and configure PostgreSQL to be able to do the backups. They configure pgBackRest, put it all together along with archiving the WAL files, perform the first backup, and then, as the last step, restore the backup. So if you want to get started using pgBackRest, this is definitely a great post to check out.
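As a rough sketch of that workflow (the stanza name "demo" and the paths here are hypothetical, not necessarily what the article uses):

```bash
# /etc/pgbackrest.conf:
#   [demo]
#   pg1-path=/var/lib/pgsql/12/data
#
#   [global]
#   repo1-path=/var/lib/pgbackrest
#   repo1-retention-full=2

# postgresql.conf must ship WAL to the repository:
#   archive_mode = on
#   archive_command = 'pgbackrest --stanza=demo archive-push %p'

pgbackrest --stanza=demo stanza-create   # initialize the backup repository
pgbackrest --stanza=demo check           # verify archiving is working
pgbackrest --stanza=demo backup          # take the first backup

# To test the restore, stop PostgreSQL, clear the data directory, then:
pgbackrest --stanza=demo restore
```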
The last post is that the PGConf.Russia YouTube channel has recently added a number of videos from a recent PGConf conference in Russia. Some of these are in English and others are translated into English. So if you're looking for some video educational content, definitely a channel to check out.