
Collation Issues, Fast Loading, Recovery, SCRAM Authentication | Scaling Postgres 72


In this episode of Scaling Postgres, we discuss collation issues, how to load data quickly, recovery changes in Postgres 12 and moving to SCRAM authentication.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. The first piece of content is "ICU COLLATIONS AGAINST GLIBC 2.28 DATA CORRUPTION". This is from cybertec-postgresql.com. They're basically talking about changes to glibc with version 2.28 that can cause data corruption, mostly in terms of how things are ordered, and it can cause issues with indexes. For example, a unique index could end up storing more than one entry for what should be a single value; in other words, it's no longer unique once the collation changes, or you could get ordering differences once your OS upgrades to glibc 2.28. For example, I checked Ubuntu and the most recent LTS version was at 2.27, so not quite at 2.28. So why is it a problem? Basically, the results of ORDER BY clauses in SELECT statements and the order of keys in B-tree indexes depend on the collation, so the change could lead to incorrect query results, as they say here, or to duplicate entries in unique indexes. Now, you can do a REINDEX command, and thankfully the upcoming version 12 has REINDEX CONCURRENTLY, but we don't have that in the versions available today. But what this post talks about is potentially moving to ICU collations as opposed to relying on the glibc ones.
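To make the REINDEX fix they mention concrete, here is a minimal sketch; the index name is hypothetical and not from the post.

-- Rebuild indexes on affected text columns after the OS (and glibc) upgrade,
-- so the B-tree ordering matches the new collation rules:
REINDEX INDEX users_email_key;

-- On PostgreSQL 12 and later, this can be done without blocking writes:
REINDEX INDEX CONCURRENTLY users_email_key;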

Now, more of this was discussed in a previous episode of Scaling Postgres with the post "Beware of your next glibc upgrade". It covers this issue and demonstrates some of it with two different versions of glibc. If you're interested, I also encourage you to check out the comments, because commenters describe where they ran into duplicate key violations on unique constraints with some of their indexes and what they had to do to work through it. So basically this is a problem to be aware of, and the Cybertec post suggests you may want to change to ICU collations because they are more independent of glibc changes.

They list several advantages of moving to ICU: the collations work the same on all operating systems, you can create your own collations, you can use case-insensitive and accent-insensitive collations, and they're versioned. But the unfortunate fact of the matter is, as he says here, quote "You cannot use ICU collations as default collation for a database or a cluster, so you have to specify them for all table columns explicitly. This restriction may be lifted at some point in the future". So basically this would be a pain to do, but if you wanted to go ahead and move to ICU collations, he goes over how they're laid out, how to create them, how you can update their recorded versions using ALTER COLLATION ... REFRESH VERSION, and how to actually switch columns over to ICU collations.
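As a rough idea of what that looks like, here is a minimal sketch, assuming a PostgreSQL build with ICU support; the collation, table, and column names are hypothetical.

-- Create a named ICU collation:
CREATE COLLATION german_icu (provider = icu, locale = 'de-DE');

-- ICU collations cannot yet be a database default, so apply them per column:
ALTER TABLE customers ALTER COLUMN name TYPE text COLLATE "german_icu";

-- After upgrading the ICU library, record the new collation version:
ALTER COLLATION german_icu REFRESH VERSION;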

Again, you have to do it per column and, of course, you have to reindex the affected indexes. But the issue he also talks about is how to do this without downtime, which basically means using logical replication, and the post finishes by describing how you could potentially do that. So this is definitely something coming down the pike for the different OS versions where you have PostgreSQL installed that you need to be aware of. It looks like it's mostly just an ordering change, but it could cause greater problems if you have unique indexes on text columns that are impacted by these collation changes coming in glibc 2.28. So it's just something to be aware of and two posts for you to check out.

The next post is "Fastest Way to Load Data Into PostgreSQL Using Python. From two minutes to less than half a second". This is from hakibenita.com. Now basically they have retrieved data and, using Python, went through a number of different steps on what's the fastest way to load hundreds of thousands of rows of data. Now this is a very long post, pretty Python-centric, not so much PostgreSQL, although the ideas are still there. Basically, single inserts are usually slower, and multi-line inserts are faster, although interestingly, that wasn't that significant in the tests he has here. At the very bottom, he covered his result summary of the different iterations that he did and the executemany actually didn't improve it by that much. But then of course the fastest one is using the COPY command, and he had a way of parsing the data out and using the COPY command. 

And that's what got it down from roughly two minutes, about 128 seconds doing one insert at a time, to less than half a second. So if you're a Python developer, there's definitely some insight to be gained here on how you can load data quickly. But the general rule is that single inserts done one at a time are the slowest, a multi-row insert gives you better performance, and the COPY command is the fastest. This is similar to what I found in a video tutorial I did for Ruby, "Fast PostgreSQL Data Loading Using Ruby". It showed the same results: single inserts are the slowest, multi-row inserts (multiple rows per INSERT) are faster, and the COPY command is the fastest. So if you're a Python or Ruby developer, you could potentially check out these two pieces of content.

The next post is "Big News in Databases - Summer 2019". This is from winand.at, and this is his newsletter and the subtitle for the page title here is The SQL Standard Extensions and Cloud War. So basically he talks about how the SQL standard continues to grow. This is something this individual regularly monitors: the changes to SQL and tracks them between the Oracle, Microsoft SQL Server, PostgreSQL, Db2, and all the relational databases on how it's keeping up with the standards. They're talking about introducing things like multidimensional arrays, things like actually a graph query language, GQL, and also some streaming data-related features. And then with reference to the Cloud Wars he mentioned, basically there's this quote "New add-on features (options and management packs) are only available in Oracle Cloud and on engineered systems". 

This means only on Oracle's cloud or Oracle's own engineered hardware, which is interesting, because you can no longer get those features and install them on your own hardware. He has the subtitle here: "Cloud War Strange Allies Working Against Your Own Hardware". So again, another great thing about PostgreSQL is that it's open source and you can take it and run it wherever you want, or run it in a cloud environment and pretty much get the same thing. Then he also goes into some technology and science discussions, new releases for the different database systems that he tracks, new content he's created, as well as some other areas. So if you're interested in more news-related items for the summer of 2019, definitely a blog post to check out.

The next post is "PostgreSQL & recovery.conf". This is from fluca1978.github.io. And he's referring to the fact that with version 12 of Postgres, the recovery.conf file is removed. He says here that the server will literally not start with it. So they've moved all of the things that you configured in recovery.com into the main postgresql.conf file, or if you included files with it. There are also some new potential files that trigger files that get created, a standby signal, and a recovery signal. 

Basically, standby.signal tells the server to start as a hot standby, while recovery.signal tells it to recover to a point where it is ready to take on primary duties. He makes the point that one of the reasons this was done is that moving these settings into postgresql.conf "...makes them a good candidate to be changed by an ALTER SYSTEM statement". So basically, when you upgrade to version 12, you need to make sure that your processes, or the backup and recovery software you're using, take this into account: they have to handle the absence of recovery.conf as well as put the configuration in the main postgresql.conf config file.
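As a rough illustration, here is a minimal sketch of what that looks like in version 12; the paths and connection string are hypothetical.

-- Settings that used to live in recovery.conf are now regular parameters,
-- so they can be managed with ALTER SYSTEM:
ALTER SYSTEM SET restore_command = 'cp /mnt/wal_archive/%f %p';
ALTER SYSTEM SET primary_conninfo = 'host=primary.example.com user=replicator';

-- The server's role is signaled by empty files in the data directory instead:
-- create standby.signal to start as a hot standby, or recovery.signal to
-- recover to a target and then be ready to act as a primary.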

The next post is "How to Upgrade Your PostgreSQL Passwords to SCRAM". This is from crunchydata.com. And so with PostgreSQL 10, they've added SCRAM authentication as an option, so it's a much more secure and standardized version of doing password management for your users. They go over a brief overview of SCRAM and what it is. It's basically an acronym that's called the Salted Challenge Response Authentication Mechanism. They go through the steps of how to upgrade to determine if you can upgrade to SCRAM because there's some things that can't use SCRAMs. So for example, to my knowledge, PgBouncer by default does not support SCRAM. 

I think there may be some GitHub patches that allow it, but by default I don't think it does at this time. The next step is to validate your pg_hba.conf settings, and they suggest having md5 for the users who are connecting. Then change PostgreSQL's password authentication method, basically setting it to scram-sha-256. Next, determine who needs to upgrade: they give you a script to look through the users that still need to be upgraded to SCRAM. Then upgrade each user's password: each user's password needs to be reset, and they discuss two methods for doing this.
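The check for who still needs upgrading is essentially a catalog query; here is a minimal sketch of the idea, not necessarily the article's exact script.

-- Roles whose stored password is still an md5 hash rather than a SCRAM verifier
-- (reading pg_authid requires superuser):
SELECT rolname
FROM pg_authid
WHERE rolpassword LIKE 'md5%';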

One thing they mention that you want to avoid is ALTER ROLE username PASSWORD 'newpassword', because the plain-text password could end up being logged, which defeats some of the security improvements of SCRAM. Lastly, update your pg_hba.conf to use only scram-sha-256. And because you change this in the pg_hba.conf file per connection type, if you do have utilities that don't support SCRAM yet, you could have, say, general users who connect interactively use SCRAM while leaving something like PgBouncer on md5 for now. But if you're interested in moving to SCRAM, here's a process you can use.
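Here is a minimal sketch of the main moving parts, assuming you are working from psql; the role name is hypothetical.

-- Have new password hashes generated as SCRAM verifiers:
ALTER SYSTEM SET password_encryption = 'scram-sha-256';
SELECT pg_reload_conf();

-- Reset each password with psql's \password command, which hashes it
-- client-side so the plain text never reaches the server logs:
--   \password app_user

-- Finally, change the relevant pg_hba.conf entries from md5 to scram-sha-256
-- for the connections that should require SCRAM, and reload again.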

The next post is actually a YouTube channel, and this is the ITPUG Italian PostgreSQL Users Group. They recently had a PGDay in Italy 2019, so they posted a number of videos. The first set is for the individual speakers, basically who they are and what they're talking about in their presentations, and they're relatively short, one to three minutes. Then they have the greater presentations here that occurred at PGDay.IT. Now, I did notice some of them are in Italian or a language that was not English, but if you're interested in video content, there may be some presentations here you'd like to check out.

The next post is "IMPLEMENTING AUTONOMOUS TRANSACTIONS IN POSTGRESQL". This is from cybertec-postgresql.com. And he says, what is an autonomous transaction? It's basically you want to send off or fire and forget some subtransactions that are not connected to the main transactions. So you're in a transaction, you're doing some sort of work, and you want to send some sort of notification. And he says here mostly logging, auditing, and progress tracking into some tables. So basically this information would persist even if the transactions rolled back. Now, he said there's a hacky way to do it where someone literally uses the COPY command to essentially write something to a file. But that has some downsides regarding the cryptic structure. The COPY program has a superuser requirement, and you need physical access to the database server to read the log. 

Another alternative he considers is using dblink, creating a separate connection back to the database and inserting into a particular table over it. Another alternative is using PL/Python and a Python driver, again to make a separate connection and do an INSERT into the database. Now, interestingly, if you had a requirement for this, I wonder whether there's another alternative using the asynchronous notification mechanism, LISTEN and NOTIFY; I wonder if those types of notifications would work and still be essentially fire and forget when the transaction gets rolled back. So if you have a requirement for something like this, definitely a blog post to check out.
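For context, a minimal sketch of the dblink approach described in the post might look like this; the connection string, table, and message are hypothetical.

-- dblink opens a second connection whose work commits independently
-- of the surrounding transaction:
CREATE EXTENSION IF NOT EXISTS dblink;

BEGIN;
-- ... main work that might get rolled back ...
SELECT dblink_exec('dbname=mydb user=app_logger',
                   'INSERT INTO app_log (msg) VALUES (''step 1 done'')');
ROLLBACK;  -- the main transaction rolls back, but the logged row persists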

The next post is "The database deficit: how full-stack developer's blind spot harms agile teams". This is from medium.com. Now, this is not explicitly related to PostgreSQL, although he does in general talk about relational databases and it is an opinion piece. But basically, he's talking about with full stack developers and particularly with the rise of Node.js. Being able to use one language on the client and the back-end server side, essentially keeping track of all the different changes that are happening. Basically, the database side gets the short end of the stick usually in terms of the skill set in a quote "...full-stack developer...". So it's an interesting piece talking about this database deficit and kind of what some examples look like not using a migration strategy for schema changes. And this is something I've quite thought often about as developers that basically don't spend a lot of time focusing on the database side, they focus on everything else. But it's an interesting opinion piece and I definitely encourage you to check it out if this type of content is interesting to you.

The next post is "PostgreSQL WAL Retention and Clean Up: pg_archivecleanup". This is from percona.com. So this is a utility that you can use to be able to clean up WAL archives. And he basically describes this built-in command that's provided with PostgreSQL. You can use it as a dry run to see what potential files would be removed and then do it in a delete mode. But again, what's critical is do not run this in your main PostgreSQL data directory like the pg_wal directory or pg_xlog directory. Do not do that, only do it to archive destinations because otherwise, you'll corrupt your database. But of course in using this utility, again, the consideration is you need to make sure not to delete the WAL that you need. They go through four different criteria here that are very important to follow in terms of being able to restore your database to a point in time. So if you're going to be using this utility for cleaning up your WAL archives, definitely check out this blog post and follow their criteria to ensure that you're only deleting the WAL files that you intend to.

The next post is "Generate Primary Keys (almost) Automatically". This is from fluca1978.github.io. He's talking about. He was recently looking at a quite large database with a lot of tables and essentially, almost all tables did not have a primary key. So basically he went through and developed a couple of different ways that you could query the database. Looking at the system tables and generating queries where you can automatically generate a primary key. At first, he just used and he used a couple of different methods. So he used this CTE method to be able to generate this, and then you could run this file against the database. Then he used a function that offers the ability to add different parameters to be able to control how the SQL generation works. And then he actually did everything within a procedure to execute the ALTER TABLE commands. So three different ways to do it. So if you ever need to run a lot of different commands using information that's located in the system tables, definitely a blog post to check out.

The last post is "Serving Dynamic Vector Tiles from PostGIS". This is from crunchydata.com. Now again, I don't know that much at all about PostGIS, but if you're interested in content where you need to serve these dynamic vector tiles, here's a new blog post that covers these techniques.


 
