
Statistics Speed Up, Synchronous Commits, TLS Updates, Avoiding Cursors | Scaling Postgres 128


In this episode of Scaling Postgres, we discuss speeding up performance with statistics, setting synchronous_commit, updates to TLS settings and how to avoid cursors.


Transcript

All right. I hope you, your family, friends, and coworkers continue to do well. Our first piece of content is "How we used Postgres extended statistics to achieve a 3000x speedup". This is from affinity.co. So they had a query that was giving them poor performance, and when they ran EXPLAIN ANALYZE on it, they noticed that the statistics predicted this join would pull one row. But when the query was actually run, it was pulling over 2,000 rows and taking a lot longer than they anticipated, on the order of minutes. So whenever there's a difference between what the planner estimates and the actual results, you want to check out your statistics. That's basically what they did. Now, this post goes into the basics of statistics: Postgres keeps statistics for each column based on a sample of the data, but it doesn't keep cross-column statistics by default, which we'll mention here in a second. You can set the statistics target for the database as a whole, but you can also set it for individual columns, as they indicate here.
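As a rough illustration of those two levels, here is a minimal sketch of adjusting the sampling target; the table and column names just follow the post's songs example, introduced in a moment, and are otherwise assumptions:

```sql
-- Cluster-wide default sampling target (100 out of the box)
SHOW default_statistics_target;

-- Raise the target for a single column, then re-sample so the planner sees it
ALTER TABLE songs ALTER COLUMN album_id SET STATISTICS 1000;
ANALYZE songs;
```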

Now, extended statistics actually let you define additional statistics where you describe relationships between columns. In the example they use here, say you have artists, albums, and songs, where there's a relationship between each of these. A song belongs to a particular album (maybe more than one), and that album belongs to an artist (possibly more than one in that case, too). But the relationship isn't random: given a particular song, there's a much smaller set of possible albums and artists, so there's a correlation between the columns. And when you have that, in Postgres 10 and above you can create extended statistics to record that fact so the query planner can take it into account. They give an example of the command here, where they use CREATE STATISTICS on the album ID and artist ID columns from songs. Now, one thing they also suspected is that they were using an inefficient join, so they go ahead and review nested loop, merge, and hash joins.
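Here's a minimal sketch of that kind of command, following the post's songs example; the statistics name is mine, and dependencies is one of the statistics kinds you can request:

```sql
-- Tell the planner that album_id and artist_id are functionally related
CREATE STATISTICS songs_album_artist_stats (dependencies)
    ON album_id, artist_id FROM songs;

-- Extended statistics are only populated on the next ANALYZE
ANALYZE songs;
```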

They noticed that a nested loop can run into problems if there are too many rows in the left relation, because for each of those rows it has to repeat the lookup against the right side of the join. So I believe they were hoping that if they updated the statistics, the planner would potentially choose a different join strategy. Now, what they observed in the EXPLAIN ANALYZE output is that the most costly step was looping over an index access on a particular index more than 13,000 times. Even if each individual index access is fast, it's going to take forever if you have to do it over 13,000 times. Plus, there's the problem of the planner's estimates differing from the actual row counts, like we saw earlier. So what they did is that they quote "...maxed out per-column statistics for..." all the columns involved. I think these are relatively generic, but they're org_id and list_id in one table and entity_id and attribute_id in the other, as in the sketch below.
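Roughly, "maxing out" looks like the following; 10,000 is the maximum statistics target, the column names come from the post, and the table names are my own placeholders since the post doesn't name them:

```sql
ALTER TABLE lists    ALTER COLUMN org_id       SET STATISTICS 10000;
ALTER TABLE lists    ALTER COLUMN list_id      SET STATISTICS 10000;
ALTER TABLE entities ALTER COLUMN entity_id    SET STATISTICS 10000;
ALTER TABLE entities ALTER COLUMN attribute_id SET STATISTICS 10000;

-- Re-sample both tables so the larger targets take effect
ANALYZE lists;
ANALYZE entities;
```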

Then they created dependency statistics for each of these as well. As a result, they got a different query plan that finishes in 42 milliseconds, vastly faster than the two-minute one, and therein lies the 3,000-fold speed increase. Now, it's interesting that they didn't mention how much of the improvement came from just maximizing the per-column statistics versus adding the extended, or correlation, statistics. I would assume that increasing the sampling of each column already lets the planner produce better estimates, but I'm not sure how much additional benefit the extended statistics give; that would have been an interesting thing to know. So if you run into a scenario where the planner expects one result from a query but the actual results are quite different, check out your statistics to see if you can increase how much data is being sampled, or look for correlations between columns that may get you better performance. And if you're interested in learning more about that, go ahead and check out this blog post.

The next piece of content is "PostgreSQL synchronous_commit options and Synchronous Standby Replication". This is from percona.com and they're talking about the synchronous_commit setting. Now first they talk about WAL propagation. So WAL is the write-ahead log. It's the log of all activities that are happening on the database in terms of object changes and INSERTS, UPDATES, and DELETES, and they go through this flow diagram here. So this is assuming you have one local PostgreSQL system and then a replica: the remote PostgreSQL system. So first the WAL inserts are written to the WAL buffers in memory. They are then written to the operating system page cache and then at some interval they are flushed to the actual disks of that local system. Meanwhile, if you have replication set up, the WAL sender sends it to the WAL receiver and it does a remote write to the page cache on the remote system. 

Then at some interval it flushes that to disk, and usually sometime later it applies the changes it has received to the database itself, so the inserted data becomes readable by others. Now, with this flow in mind, you can determine what setting you want to use for synchronous_commit. When the setting is off, essentially you're just writing to the WAL buffers and you have no guarantee that things actually make it to disk. So clearly that's fast, but you're opening yourself up to data loss even if you're operating on a single server. If you care about your data, you probably don't want to choose off. The next option is local, which ensures that the WAL records actually get flushed to disk, so everything is safe on disk, at least on that local system. The next option is remote_write.

This ensures that the WAL is written to at least the OS page cache on the remote system. That gives you a fast acknowledgment with little slowdown, but if the replica fails, you can't be sure the records actually made it into the WAL files on its disk. The on setting for synchronous_commit ensures that the WAL does get flushed to disk on the replica. This is the default setting for Postgres, so you have to change it if you want a different synchronous_commit behavior. And then the last option is remote_apply. This takes the longest but is the most durable: it ensures that the information written on the local database has been applied on the remote database, so it's written all the way through and is available for SELECTs. So it's the most durable but has the greatest performance penalty.
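For the remote_* levels to actually wait on a standby, you also need synchronous standbys configured. A minimal sketch, assuming a standby whose application name is replica1 (that name is illustrative):

```sql
-- These correspond to postgresql.conf entries and can be set with ALTER SYSTEM
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica1)';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';  -- off | local | remote_write | on | remote_apply
SELECT pg_reload_conf();
```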

Then the next section of the article explains how you can change the scope of synchronous_commit. You can, of course, set it at the cluster level, but there are several other levels at which you can set it. For example, you can do a SET LOCAL of the setting, which applies it at the transaction level and lets you customize how durable you want a particular transaction to be. You can also set it at the session level or at the user level, so you can have certain users where synchronous_commit is off and other users where it is on. So you can definitely tailor this so that certain transactions or certain work is more durable, while other work is faster and less durable. If you are interested in learning more about how to adjust your synchronous_commit settings, definitely check out this blog post.
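Here's a quick sketch of those scopes, with an assumed role name (reporting_user) purely for illustration:

```sql
ALTER SYSTEM SET synchronous_commit = 'on';               -- cluster level (then reload the config)
ALTER ROLE reporting_user SET synchronous_commit = 'off'; -- user level
SET synchronous_commit = 'remote_apply';                  -- current session only

BEGIN;
SET LOCAL synchronous_commit = 'off';                     -- this transaction only
-- ... lower-durability work ...
COMMIT;
```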

The next piece of content is "TLS Related Updates in PostgreSQL 13". This is from highgo.ca, and they're talking about additions to the postgresql.conf file where you can define an ssl_min_protocol_version and an ssl_max_protocol_version. In previous versions of Postgres, the default minimum was TLS version 1.0, which is not considered secure anymore; in Postgres 13 it's increased to 1.2. These settings enable you to customize your TLS configuration to match what's acceptable for your environment.
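As a sketch, the server-side settings can be set directly in postgresql.conf or via ALTER SYSTEM, something like this:

```sql
ALTER SYSTEM SET ssl_min_protocol_version = 'TLSv1.2';  -- the new default in 13
ALTER SYSTEM SET ssl_max_protocol_version = '';         -- empty string means no maximum
SELECT pg_reload_conf();
```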

The next thing they mention is that there are also min and max protocol version settings on the client side, so you can pass these values in when connecting with a psql client. The next area they mention is channel binding when using SCRAM, which we've covered in a previous episode of Scaling Postgres; it covers how you can set up channel binding to help the client authenticate the server it's connecting to. They also mention the new sslpassword connection option for the client's SSL key. So definitely some great changes to take into account with PostgreSQL 13.
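As a sketch of the client side, a connection might look like this (the host and database names are illustrative; ssl_min_protocol_version and channel_binding are libpq connection options added in Postgres 13):

```
psql "host=db.example.com dbname=app sslmode=verify-full ssl_min_protocol_version=TLSv1.2 channel_binding=require"
```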

The next piece of content is "Oracle to PostgreSQL - Cursors and Common Table Expressions". This is from 2ndquadrant.com. So I actually found this quote interesting: "...in a 23 year career with PostgreSQL, I have only actually found a need to use cursors twice and I regret one of those". So he's basically communicating that with Oracle you apparently use a lot of cursors when doing programming, whereas with PostgreSQL you don't tend to do that and there are other ways to achieve the same thing. 

He talks a little bit about how cursors can be expensive compared to relying on set-based logic, essentially using the code that's already been written and optimized for pure SQL rather than programming row-by-row processing yourself to return data. As an example, he looks at generating some hierarchical data. Instead of using a cursor to do this, he uses a common table expression with the WITH RECURSIVE feature to generate the hierarchy that you might otherwise build with a cursor or some other row-by-row method; a sketch of that technique follows. So if you're interested in learning more about this, definitely check out this blog post.
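Here is a minimal WITH RECURSIVE sketch; the employees table and its columns are my own illustration, not the data from the post:

```sql
-- Walk a manager/report hierarchy without a cursor
WITH RECURSIVE org AS (
    SELECT id, manager_id, name, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL        -- start at the top of the hierarchy
  UNION ALL
    SELECT e.id, e.manager_id, e.name, org.depth + 1
    FROM employees e
    JOIN org ON e.manager_id = org.id
)
SELECT * FROM org ORDER BY depth, name;
```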

The next piece of content is "8 Fascinating Things You Probably Didn't Know PostgreSQL Can Do". This is from enterprisedb.com and this is an eclectic set of different SQL tricks or tips that you can use with Postgres. So I'm not going to run through all of these in detail, so you should definitely consult this blog post. But they're talking about utilizing whole row references, comparing a selection of columns, hard-coded tables, and custom config parameters. Booleans can stand alone, convert column data types for free, find which rows belong to which underlying partition, and tables are types. So if you're interested in working a bit more in-depth with Postgres, definitely check out this list of tips and tricks and some I haven't really seen before.

The next piece of content is "Updating the PostgreSQL root.crt file". This is from 2ndquadrant.com. He's saying you have a scenario where you have SSL authentication set up between two different Postgres servers. Perhaps you're using Postgres Foriegn Data Wrapper they mentioned here. So if you have server certificates and the server keys, client certificates, client keys, and you have your own certificate authority that has generated this root.crt file that has signed all of these different certificates, well, what happens when the root certificate comes up for expiration? What's the best way to do that? 

What you don't want to do is recreate the root certificate, overwrite it on a single server along with regenerating that server's keys, and update just that one system. You'll run into problems because the other server is still trusting the old root certificate. He says the correct way to do it is to generate a new CA certificate and add it to the same root.crt file, so that both the old and the new CA certificates are trusted. Then you generate your new certificates signed by the new CA, which allows you to update all of your certificates; a rough sketch of the idea follows. So if you want to know a bit more about the process, definitely check out this post.
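A rough sketch of that approach with openssl, where all of the file names are illustrative:

```
# Create the new CA certificate (with a freshly generated CA key)
openssl genrsa -out new_ca.key 4096
openssl req -new -x509 -days 3650 -key new_ca.key -out new_ca.crt -subj "/CN=new-ca"

# Trust both the old and the new CA during the transition
cat old_ca.crt new_ca.crt > root.crt

# Then re-issue server and client certificates signed by new_ca and roll them out
```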

The next piece of content is "POSTGRESQL: GROUP BY EXPRESSION". This is from cybertec-postgresql.com and they're talking about GROUP BYs. So for example you can group data by the values in a column for example. So here they do region or they go by whatever the first column is. So that's pretty much a GROUP BY. GROUP BY expression is that you define an expression to GROUP BY. So in this case they're saying where the production is greater than 9,000. I think this is the case with barrels of oil. Or using a case statement in a GROUP BY. And that is what I've used a lot of this for is case statements in a GROUP BY in order to get particular data. So this continues and shows you how to use that as well as following up with grouping sets. So if you want to learn more about some capabilities of GROUP BY, definitely check out this post.

The next piece of content is "Using Postgres Row-Level Security in Python and Django". This is from pganalyze.com and they're talking about implementing Postgres's row-level security. So in this scenario with your application, you're actually going to need to set up a role per database user because you want to do the security in the database system itself. So in their example, they have a salespeople table and they create each salesperson and then for each salesperson, they create a role. They happen to use the ID from the salespeople table and there's probably a better way to do that, but that's what they use for this example. Each of these roles are part of the salespeople role so you can grant certain permissions to that. Then they enabled the row level security to access the tables that they should. 

The way they switch to each person is with the SET ROLE command. Django uses a common database user to connect and pull all the data, but when it needs to act as a particular user, it issues SET ROLE first and then runs the command. After showing how the database side works, they then show the implementation in Django itself. The main thing they have to do is, after inserting a new salesperson, actually create the role and grant it membership in the salespeople role as well. In addition, whenever they're running a query, they need to set the role to that person, roughly as sketched below.
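Continuing the illustrative names from the sketch above, switching identities on the shared connection looks something like this:

```sql
SET ROLE salesperson_42;   -- the RLS policy now filters to this person's rows
SELECT * FROM deals;
RESET ROLE;                -- back to the application's connection role
```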

So I don't necessarily think I would use row-level security for something like a web application unless it's for internal use only, where the users may already have database accounts for other purposes and it makes sense for each of them to have a database role. Otherwise, to me, this tends to get very complicated. The other thing they mention here is performance: queries under row-level security aren't as fast as queries without it. So if you're focused on scaling Postgres, this may be a feature you want to avoid. But if you're interested in learning more about row-level security and a basic setup for it, you can check out this post.

The next piece of content is "How to set up PostgreSQL on an IPv6 enabled network". This is from highgo.ca, and it talks about how to set up IPv6 for PostgreSQL, and it goes into a lot of detail about how you can do this. So if you're interested in doing that, check out this blog post.

The next piece of content is "Webinar: Similarity Queries in PostgreSQL [Follow Up]". This is from 2ndquadrant.com. So this is a webinar on similarity queries, and it gives about a ten-minute YouTube video of an introduction of what the webinar is like. So you can take a preview of it here. And if you're interested, then click here to register for the full webinar.

The last piece of content: the PostgreSQL person of the week is Charles Clavadetscher. So if you're interested in learning more about Charles and his contributions to PostgreSQL, definitely check out this blog post.
