Prewarming, Nondeterministic Collations, Generated Column Performance, Foreign Keys | Scaling Postgres 86
In this episode of Scaling Postgres, we discuss prewarming your cache, working with nondeterministic collations, generated column performance and foreign keys with partitions.
Content Discussed
- Prewarming PostgreSQL I/O caches
- Nondeterministic collations
- Regenerated
- PostgreSQL 12: Foreign Keys and Partitioned Tables
- Postgres-BDR: It is also about fast safe upgrades
- Managing another PostgreSQL Commitfest
- Tuning checkpoints
- PgBouncer 1.12.0
- PostgreSQL Connection Pooling: Part 1 — Pros & Cons
- PostGIS 3.0.0rc2
- Generating land-constrained geographical point grids with PostGIS
- Braces Are Too Expensive
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. The first piece of content is "PREWARMING POSTGRESQL I/O CACHES". This is from cybertec-postgresql.com and it's talking about an extension called pg_prewarm that lets you warm up your shared buffers, because normally when you restart the system you're going to lose that in-memory cache. Some of it may still be present because Postgres also relies on the OS cache, but this extension helps you repopulate the shared buffers themselves. He says there are two ways you can use it: manual caching, or automatic caching on startup, which is probably what most people would want. To manually cache something, you first create the extension in your database; it's part of the contrib modules, so install that first. Then you just call the pg_prewarm function and specify the table name.
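As a minimal sketch, assuming a hypothetical table named accounts (the table name is made up; the function and its optional arguments come from the extension):

```sql
-- Install the extension (it ships with the contrib modules).
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Load the table's blocks into shared buffers; returns the number of pages read.
SELECT pg_prewarm('accounts');

-- Optional arguments control the mode, relation fork, and block range,
-- e.g. only the first 100 blocks of the main fork.
SELECT pg_prewarm('accounts', 'buffer', 'main', 0, 99);
```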
In his example, the function returns 4,425, meaning 4,425 pages were read and put into the cache. There are some other options you can use, like specifying the first block, the last block, and things of that nature. Then he goes into automatically populating your PostgreSQL cache: once you install the extension, you add pg_prewarm to shared_preload_libraries in your postgresql.conf file and do a restart. You will then see an autoprewarm master process started in addition to the other standard processes. As he says, pg_prewarm will store on disk a list of blocks that are currently in memory, and after a crash or restart it will automatically restore the cache as it was when that file was last exported. So if you have a larger database system and you find it takes time after a restart for the caches to warm up, this is definitely an extension to check out and try using.
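The automatic setup is just a configuration change plus a restart. As a sketch, you can either edit postgresql.conf directly or use ALTER SYSTEM; note this overwrites any existing shared_preload_libraries value, so include any libraries you already load:

```sql
-- Equivalent to shared_preload_libraries = 'pg_prewarm' in postgresql.conf.
-- A server restart is required for preload libraries to take effect.
ALTER SYSTEM SET shared_preload_libraries = 'pg_prewarm';
```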
The next post is "Nondeterministic collations". This is from postgresql.verite.pro and apparently started since version 12, PostgreSQL collations are created with a parameter called deterministic. So basically, doing a deterministic comparison it basically compares it essentially at the binary level to make sure that things are comparable. But you can also of course set this to false. So essentially you can get nondeterministic collations and what that enables you to do is potentially do case-insensitive searching or removing or searching and ordering without considering accents that are a part of words. And he has a list here of what he calls fancy comparison features that are enabled by nondeterministic collations. So this is something you couldn't do before, but now you can in version 12. So you can determine equality between canonically equivalent sequences of code points.
He compares two such values, and when deterministic is false they come out equal: in a binary sense they are different, but visually they are identical, and the nondeterministic comparison treats them as the same. The next feature is equality between compatible sequences of code points, for example comparing the two Latin letters "ff" with the single combined "ﬀ" ligature. You can get an equal comparison there if you create a collation with secondary strength; deterministic set to false alone is not enough, you also have to set the strength to the secondary level for these comparisons to come out true. Then he goes on to ignoring case, which you can also do as long as the collation is set to secondary strength.
You can ignore both case and accents if you set the strength to primary; at primary strength, capitalization and accents are both disregarded. You can also ignore accents but not case by using some more keywords, setting the strength to primary but the case level to yes. You can ignore spacing and punctuation by setting other values, such as alternate = shifted, and there are also features for matching compatible symbols and ignoring code points assigned to individual characters. So there are a lot of different capabilities with this support for nondeterministic collations. I'm not 100% up on all of this, but there are very interesting possibilities in terms of ordering and comparison operators when working with text. So if you do a lot of that, definitely check out this blog post to learn more about this new feature and potentially how it can be used.
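As a rough sketch of what these collations look like, assuming an ICU-enabled PostgreSQL 12 build (the collation names here are made up; the ICU locale keywords ks-level2 and ks-level1 select secondary and primary strength):

```sql
-- Basic nondeterministic collation: canonically equivalent code point
-- sequences compare as equal even though their bytes differ.
CREATE COLLATION nd (provider = icu, locale = 'und', deterministic = false);

-- Secondary strength: additionally ignores case differences.
CREATE COLLATION ci (provider = icu, locale = 'und-u-ks-level2', deterministic = false);

-- Primary strength: ignores case and accents.
CREATE COLLATION ci_ai (provider = icu, locale = 'und-u-ks-level1', deterministic = false);

-- Example: a case-insensitive equality check.
SELECT 'Hello' = 'HELLO' COLLATE ci AS case_insensitive_match;  -- true
```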
The next post is called "Regenerated". This is from pgdba.org. It's basically about generated columns. Now, we've seen some similar posts about generated columns in previous episodes of Scaling Postgres. What is particularly interesting about this one is that they do a comparison of trigger-related and then use the new generated column features. If you want to achieve what generated columns do, basically generate a column based upon existing data within that table row, for example adding two values together. But what he does is he does a trigger approach because you have to use triggers because this feature did not exist before version 12. So he did a trigger approach and then he used the native-generated column approach and then he looked at their performance.
That's what I found interesting here. He compares inserts using both methods: with the trigger method, a million rows were inserted in 6 seconds, while with the generated columns feature it took 4 seconds. So it's more performant, which you would kind of expect since it's now a core feature of PostgreSQL. But with no generated column at all, the INSERT took about 2.4 seconds. So generated columns still take a performance hit, just not as significant as when you're using triggers, and that's something to keep in mind when using this new feature. Then he looks at updates.
An UPDATE using the trigger took 13 seconds, about 9 seconds with the generated column, and about 5.8 seconds with no generated column or trigger. So again you'll see some performance degradation with UPDATE, just not as much as with the trigger. For deletes, the performance of generated columns and the trigger was actually pretty similar, about four and a half seconds each, and less than half that time when no trigger or generated column was used. So it's definitely a great post covering the generated column feature in general, how you could use triggers instead if you're on a version prior to 12, as well as the performance comparisons.
The next post is "PostgreSQL 12: Foreign Keys and Partitioned Tables". This is from 2ndquadrant.com. So this is a pretty brief post but it talks about the pretty big benefits for referential integrity that comes with these new features added to 12 to be able to create foreign keys and then reference foreign keys within a partitioned table. They give an example where you have an items partition table, a stock partition table, and then a warehouses table that is not partitioned and how in the stock, you can actually use foreign keys that reference the items and reference the warehouses. Pretty much any way you can use foreign keys so this is a great boon to help maintain your referential integrity when you are using partition tables.
The next post, also from 2ndquadrant.com, is "Postgres-BDR: It is also about fast safe upgrades". BDR is their bi-directional replication product, basically their proprietary master-to-master PostgreSQL solution. I thought this post was interesting because it argues that BDR has some advantages particularly with upgrades. A lot of people use it because they want 24/7 availability with nothing going down. When you have a primary and a replica database and you need to switch over, there is some time you're actually going to be down, because maybe the primary fails and you have to wait for your monitoring system to detect that it's down.
Then some automated failover process runs, which can take on the order of minutes, whereas they're claiming they've gotten switchover down to about a second with their BDR product. And again, everything's always in sync; you don't have to wait for something to be synced over from a primary to a replica. Things may eventually need to catch up, but I could definitely see this having some advantages. In particular, they're talking about upgrades and how you can have bi-directional replication between different versions, which of course we've seen some posts do with regard to logical replication, but their product apparently does it more out of the box, keeping everything up and replicated. So this is an interesting use case for BDR that I had not considered before. If you're interested, it's definitely a blog post to check out.
Another post from 2ndquadrant.com is "Managing another PostgreSQL Commitfest". So if you want some insights into how the PostgreSQL commit fests are run, here's a brief post that discusses that.
The next post is "Braces Are Too Expensive". This is from rhaas.blogspot.com and basically, he's talking about the PostgreSQL executor being volcano-style. What does that mean? It means a number of things. "...the query planner generates a plan that is organized into a sort of tree structure". The "...system is designed to be extensible with individual nodes that appear in the plan tree having no datatype-specific knowledge, but rather obtaining their knowledge about specific data types from support functions...". And it means that the executor function is using a pull model. So plan nodes generally support an operation that says give me the next tuple. So he has an example here where basically the nested loop one is run. Then it says, give me the next tuple from this one.
Then this says, okay, give me the next tuple from this one. Retrieves it, and then says, give me the next tuple for this one, retrieves it. Then this request, this one. So I would say this is kind of like a more top-down model, but all the interesting work happens essentially on the leaf nodes here, here. He says this can create an overhead of a lot of function calls, which is why I believe he's saying braces are too expensive. It's not about orthodontics, it's about the number of function calls that are being used.
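To make the tree structure concrete, here is a hypothetical pair of tables and a join query; the EXPLAIN output shows a plan tree whose top node pulls tuples one at a time from the scan nodes beneath it (the exact node types chosen depend on the planner):

```sql
-- Minimal tables so the EXPLAIN below can run.
CREATE TABLE customers (customer_id integer PRIMARY KEY, name text, region text);
CREATE TABLE orders (order_id integer PRIMARY KEY,
                     customer_id integer REFERENCES customers);

-- The plan shape is the point: a join node at the top of the tree
-- repeatedly asking its child nodes for the next tuple.
EXPLAIN
SELECT o.order_id, c.name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
WHERE  c.region = 'EU';
```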
They had already done an overhaul of PostgreSQL expressions, which used to be evaluated in a similar tree of executable nodes: they reworked it so the expression tree is transformed into a series of steps executed one after another, like a sort of dynamically constructed program, and they got a noticeable performance increase on the expression side by doing that. So now they're considering potentially doing the same thing to optimize these plan trees. This is something being thought about for PostgreSQL that may be able to gain some better performance, so if you're interested in digging into some of this, it's definitely a blog post to check out.
The next post is "Tuning checkpoints". This is from evol-monkey.blogspot.com and he's basically talking about how to set up your configuration for checkpointing in terms of adjusting checkpoint_timeout, min_wal_size, max_wal_size, and checkpoint_completion_target. It gives some advice about the first kind of set that you want your checkpoint_timeout to be and then kind of backtrack from there. As well as doing some queries like this to see how fast you're generating WAL files to get a sense of what you want to set these parameters to. He gives an example where they helped optimize a database that went from this type of very high activity and were able to calm it down by optimizing these parameters. So if you haven't done this yet, definitely a blog post to check out.
The next post is that PgBouncer 1.12.0 is released. This is from pgbouncer.org. They say this release contains a variety of minor enhancements and fixes, and it also contains some fixes to the new SCRAM support introduced in PgBouncer 1.11.0, improving interoperability with newer PostgreSQL versions, so users of SCRAM are particularly advised to upgrade. If you upgraded for SCRAM support, you'd probably want to move to 1.12 relatively soon.
The next post is "PostgreSQL Connection Pooling: Part 1 - Pros & Cons". This is from medium.com. So again, after a PgBouncer post, this is a very general post that talks at a very high level about the benefits of connection pooling and some of its disadvantages. The benefits are being able to have more connections, and virtual connections as they were relative to the physical connections to the database. Some of the cons are you're introducing another source of downtime if something goes down as well as increased latency. You're going to be putting essentially your authentication layer within PgBouncer in a lot of cases because the clients need to authenticate to PgBouncer. But if you're considering using a connection pooler if you haven't already, definitely a blog post to check out.
The next post is that PostGIS 3.0.0rc2 is released. The information is from postgis.net.
Related to that is a post about PostGIS: "Generating land-constrained geographical point grids with PostGIS". This is from korban.net. So if you have a use case for doing that, definitely a blog post to check out.