
Postgres Releases, Useless Vacuum, Isolation Differences, WAL Compression | Scaling Postgres 101


In this episode of Scaling Postgres, we discuss new postgres releases, useless vacuuming, isolation differences between databases, and different ways to compress WAL files.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is that new PostgreSQL versions have been released. This is from postgresql.org, and versions 12.2, 11.7, 10.12, 9.6.17, 9.5.21, and 9.4.26 have been released. This is the end-of-life release for PostgreSQL 9.4, so if you're still on version 9.4 or even earlier, you should definitely upgrade to one of the more recent versions. There was one security issue mentioned, CVE-2020-1720, where ALTER ... DEPENDS ON EXTENSION is missing authorization checks. So yet another reason to upgrade is to take care of this, although you could probably take some mitigation steps to reduce the risk. This post also lists all the bug fixes and improvements for these versions, so definitely review it if you want to learn more. There was also a link to the PostgreSQL roadmap, so if you want to know when the minor releases are targeted, they show that on the developer/roadmap page, along with a projection of when the next major release, version 13, will arrive.

The next piece of content is "Useless Vacuuming". This is from rhaas.blogspot.com and basically, he's talking about a situation where vacuum is running. It completes its process, but there are still a lot of IDs that haven't been cleaned up, so it appears to be doing nothing and he's mentioning a way to recognize the problem. Of course, the worst part is the error "database not accepting commands due to a wraparound data loss", where you've literally run out of your XIDs, your transaction IDs, or you have a warning with a lot of rows that need to be transactions that need to be vacuumed. So basically he says this issue comes from about four different situations and he goes into depth on the XIDs, why they exist, the 4 billion possible transaction IDs because it's a 32-bit integer, and how that exists. But the 4 ways that an XID can be in use is that the transaction with the XID is still running. 

So basically, you have long-running transactions, which are generally a no-no for Postgres; you want to keep transactions as fast as you can. Another possibility is that another transaction that cares about the status of an XID is still running. So again, the long-running transaction problem. The third issue that can arise is that the transaction with that XID has been prepared using PREPARE TRANSACTION, but neither COMMIT PREPARED nor ROLLBACK PREPARED has been executed yet. So this is a situation using two-phase commit. This is not the same as prepared statements; it's a prepared transaction, where you're going to be doing a two-phase commit. Generally, this isn't used much in application development at all, only in specialized use cases, so I would hope you would not be running into that. And then, of course, the fourth issue is that there's a replication slot whose xmin or catalog_xmin is that XID.

So basically, there was a replication slot, and essentially it's probably orphaned now; the replica that was using it is no longer retrieving information from that replication slot, so the XIDs keep accumulating. He gives a few quick queries for checking cases one and two, the long-running transactions: look for the age of backend_xid and backend_xmin in pg_stat_activity. If you're using prepared transactions, you can check the pg_prepared_xacts view, and you can check the pg_replication_slots view for an essentially orphaned slot or some other problem with replication. He also goes into how you could monitor this: set up some queries using these values, determine a particular threshold, and be alerted on them. So definitely valuable advice to follow.
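As a rough sketch of the kinds of checks he describes (these are not the post's exact queries), something like the following can surface each of the four cases:

```sql
-- Long-running transactions holding back the xmin horizon (cases 1 and 2)
SELECT pid, age(backend_xid) AS xid_age, age(backend_xmin) AS xmin_age, state, query
FROM pg_stat_activity
ORDER BY greatest(age(backend_xid), age(backend_xmin)) DESC NULLS LAST;

-- Old prepared transactions (case 3)
SELECT gid, prepared, age(transaction) AS xid_age
FROM pg_prepared_xacts
ORDER BY age(transaction) DESC;

-- Replication slots pinning an old xmin or catalog_xmin (case 4)
SELECT slot_name, active, age(xmin) AS xmin_age, age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;
```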

Then lastly, he follows up with how to recover from the situation. For long-running transactions, you'll need to cancel them. If you have prepared transactions, you need to issue COMMIT PREPARED or ROLLBACK PREPARED for the ones with a high transaction age. Or, if you have a problem with an orphaned slot, you should go ahead and drop that slot. And then he advocates running a verbose vacuum, not necessarily a freeze yet, but a verbose vacuum to get things under control, and then you can move on from there. So definitely a really great blog post about useless vacuuming.
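As a hedged sketch of those recovery steps, where the pid, the prepared-transaction gid, and the slot name are all placeholders you would look up first using the queries above:

```sql
-- Cancel or terminate a long-running backend (12345 is a placeholder pid from pg_stat_activity)
SELECT pg_cancel_backend(12345);      -- cancels the current query
SELECT pg_terminate_backend(12345);   -- ends the whole session if needed

-- Resolve an old prepared transaction (the gid comes from pg_prepared_xacts)
ROLLBACK PREPARED 'some_old_gid';     -- or COMMIT PREPARED 'some_old_gid'

-- Drop an orphaned replication slot
SELECT pg_drop_replication_slot('orphaned_slot');

-- Then vacuum with progress output to bring the XID age back down
VACUUM (VERBOSE);
```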

The next post is "Isolation Repeatable Read in PostgreSQL versus MySQL". This is from postgresql.verite.pro. He goes over and talks about the concept of isolation, how transactions that are occurring within the databases are isolated and that there are four versions according to the SQL standard. There's Read Uncommitted, in which a transaction can see the changes of other transactions before they're committed. PostgreSQL doesn't implement this mode, he says. Read Committed, in which a transaction sees changes from other transactions as soon as they're committed. This is the default in Postgres. Repeatable Read, when a transaction reads back a row that's already been read by a previous query, it must read the same values even if the row is changed by another transaction that is committed in the meantime. And from what he says, MySQL follows this as the default. 

Then lastly there's Serializable, where a transaction cannot see or produce results that could not have occurred if other transactions were not concurrently changing the data. He then goes over some very interesting differences. Number one, Postgres's default is Read Committed, whereas MySQL's default is Repeatable Read, and he goes into examples of this. But even if you set it to Repeatable Read in both Postgres and MySQL, you get different results depending on how things are executed. So the implementations of the engines differ, and he goes on to show further examples of some of these differences in a third and a fourth example. So if you're considering moving from MySQL to Postgres and you rely a lot on transactions, this is definitely a valuable post for seeing how Postgres versus MySQL handle these implementation details. A very interesting blog post to check out.
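To make the Repeatable Read behavior concrete, here's a minimal sketch in Postgres, assuming a hypothetical accounts table and two concurrent sessions (this is not the post's exact example):

```sql
-- Session 1: start a repeatable-read transaction and read a row
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;   -- say this returns 100

-- Session 2: change and commit that row concurrently
UPDATE accounts SET balance = 200 WHERE id = 1;

-- Session 1: re-reading returns the original value (100), because the
-- transaction keeps using the snapshot taken at its first query
SELECT balance FROM accounts WHERE id = 1;
COMMIT;
```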

The next post is "Compression of PostgreSQL WAL Archives Becoming More Important". This is from percona.com. Now, he's talking about literally compressing the WAL files using some sort of compression tool such as gzip. Or I believe he talks about p7zip here as well, not so much the WAL compression setting in PostgreSQL. So there is a WAL compression setting that you can set in Postgres that does some things to compress. I believe it does a full page image  rights to the wall, and definitely, that's advocated to go ahead and set that you burn a little bit more CPU in exchange for disk space-saving issues. But I've generally found that for large database systems, it's much better to have that enabled and take a very minor CPU hit compared to the disk space savings as well as potential network transfer savings with WAL files as you get a highly active database. 

Now, what he's also mentioning here is compressing the WAL files as part of the archive_command. He says there are ways to do this using pgBackRest as well as WAL-G, but you can also do it yourself just using a gzip utility. So if you typically copy WAL to another location, you can use gzip to compress it to a particular location, or 7za, because it achieves really high compression as fast as possible, and he shows where you can get it and install it for Ubuntu and CentOS. Then he talks about restoring: he looks at the different ways you can restore with WAL-G and pgBackRest, because you have to essentially unarchive the file before you can use it again, as well as how to do that with gzip and 7za. Now, with some of these compression tools, he was able to get down from the standard 16-megabyte WAL file to 197 KB. So extremely high compression with 7za. So if you want to conserve disk space and maybe make some of your WAL handling more efficient, definitely investigate this blog post to see if you can add some additional compression to how you're handling your WAL files.
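As a minimal sketch of the do-it-yourself gzip approach he describes (the archive directory path here is a placeholder, and %p/%f are the standard archive_command placeholders for the WAL file path and name):

```sql
-- archive_mode needs a server restart; archive_command only needs a reload
ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command = 'gzip < %p > /var/lib/pgsql/wal_archive/%f.gz';

-- On restore, the matching restore_command would decompress it again, e.g.:
-- restore_command = 'gunzip < /var/lib/pgsql/wal_archive/%f.gz > %p'
```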

The next post is "CONFIGURING WORK_MEM IN POSTGRES". This is from pgmustard.com. And it basically talks about the work_mem setting. So it basically defines the amount of memory to retain for each connection for work. Now, it's not that each connection used up to that point. There are certain transactions you run that will use multiple amounts of work_mem. So you need to keep that in mind as you are configuring it now by default. As he says here, the work_mem is 4 megabytes, and generally that can be boosted up. Depending on how many connections you have to your database and how much memory you have, you can adjust what it is. Ideally, you want to have as many query operations happening in memory as possible, so that's the reason you would want to increase it. 

But you don't want to increase it so much that, essentially, you run out of memory in your system, given how many connections you have and the complexity of the queries on those connections. So generally, if you have an online transaction processing database, you probably want to keep work_mem a little bit lower, because you probably have a lot of connections. Whereas if you have more of an online analytical processing load, then you'll probably have far fewer users and you can ramp up work_mem, because presumably the queries are going to be more complex. And he quotes, "For example, Christophe Pettus suggests that 16MB is a good starting point for most people". But he also goes into ways you can set it per session to try out different work_mem values. So if you're interested in that, it's a relatively short blog post, check it out.
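A minimal sketch of trying out a different work_mem value for a single session, using a hypothetical orders table:

```sql
-- Try a larger work_mem for just this session and check whether the sort
-- stays in memory (quicksort) or spills to disk (external merge) in the plan
SET work_mem = '16MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY created_at;
RESET work_mem;
```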

Next post is "Quickly Load CSVs Into PostgreSQL using Python and Pandas". This is from medium.com. So basically it's showing you a quick way to load CSV into Postgres. He's using Python and a tool called Pandas. Now it looks like Pandas just basically makes it easier to create a table and you can use it to actually load the data. But he has some options down here working with larger data sets. Option two is the most interesting to me because it basically just creates the table, looks at the data set to create a schema, and then it looks like it relies upon the COPY command. Again, that's the most efficient way to load data into Postgres, to actually load the data into the table that was generated. So if you use Python, perhaps this is a blog post you'd like to check out.

The next post is "Why Dropping a Column does not Reclaim Disk Space? (or better, why is it so fast?)". This is from fluca1978.github.io, and he's talking about how when you drop a column from a table, it doesn't immediately reclaim the space. So he has an example here where he creates a one-column table with 346 megabytes of data in it. He then adds a new table with a large DEFAULT, and the table goes up to almost double the size, 651 megabytes. Then he looks at some of the attributes of some of the system tables to look at the status of the columns. He then drops the column. It happens super fast, 20 milliseconds. So all that data probably wasn't deleted because, essentially, they just dropped the reference to it. 

So you can no longer reference that column in queries, but when he checks the size, it's essentially the same, and if he looks at the attributes, you can see a pg.dropped placeholder for that particular column, so as far as the system catalogs are concerned, it's considered invisible at this point. Then how do you get the space back? You do a VACUUM FULL, and that drops it back down to the original size, but if you look at the catalogs you can still see that the dropped attribute is there, so it looks like it essentially never goes away. He says the only potential drawback is that dropped attributes probably still count as normal ones toward the table's column limit. Now, I would think it would be quite rare for that to matter, but it's an interesting detail of how Postgres works, so if you want to learn more, definitely a blog post to check out.
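A minimal sketch of what the post demonstrates, using placeholder table and column names:

```sql
-- Dropping a column is fast because only the catalog entry is marked dropped
ALTER TABLE big_table DROP COLUMN payload;

-- The column lingers in pg_attribute as a dropped attribute
SELECT attname, attisdropped
FROM pg_attribute
WHERE attrelid = 'big_table'::regclass AND attnum > 0;

-- Rewriting the table actually reclaims the space (takes an exclusive lock)
VACUUM FULL big_table;
```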

The next post is "MIGRATING FROM MS SQL TO POSTGRESQL: UPPERCASE VS. LOWERCASE". This is from cybertec-postgresql.com. So this is going from Microsoft SQL Server to PostgreSQL and apparently, Microsoft SQL Server has all caps for its table names and its column names, and that's typically not done with Postgres, but you can actually set up a couple of queries to actually change your schema over. So for example, he's using a query here to query the system tables to change the table name and rename it to the lower of a table name. Then using, in psql, the gexec function which basically takes this built query that you're processing here and then enables it to be run for each table. Now he does mention that this first implementation has an SQL injection issue and if you use quote_ident, it will avoid it. So definitely you'd want to use this version when working with these types of queries. Now that handles the tables and then this is how you handle the columns. So it gives you a tool that allows you to be able to go through each column and set it for lowercase. So if you have a use case for using this, definitely a very interesting blog post.

The next post is "How to Automate PostgreSQL 12 Replication and Failover with repmgr - Part 2". This is from 2ndquadrant.com. So this is the second version. The first part is linked here and it covers a three-node cluster, one primary, two standby PostgreSQLs, and then a witness node elsewhere to handle the election process of who's the primary or not. This post goes over setting up replication failover and it goes into a lot of depth with repmgr being able to set it up. So if you're interested in setting up clusters in this way and you want to investigate repmgr, definitely a blog post to check out.

Now, related to that, there's another blog post covering repmgr: "Failover & Recovery with Repmgr in PostgreSQL 11". This is from blog.smallcase.com. In this post, I guess they're still on version 11, and they go through the process of how they set up repmgr for their use case. So, two blog posts to check out if you are interested in using repmgr for cluster management.

The next post is "Creating a PostgreSQL procedural language - Part 2 - Embedding Julia". This is from 2ndquadrant.com. Now we covered where the initial setup was done for running it and setting it up as an extension. And this is actually embedding Julia because it says that Julia provides an API so that Julia functions can be called from C. So they're going to leverage this to execute the Julia code from its defined function stored procedure. So he has an example here of setting up the next phase of it. So again, it's quite short, but it's the second part of creating an extension that will generate a Julia procedure language.

Next post is "Can PgBouncer survive transient network events?" This is from enterprisedb.com, and it's the fifth part of the series of PgBouncer handling network outages and again using their EnterpriseDB IP Manager to do it. So this covers a lot of the conclusions and what they covered. So if you've been interested in this series, here is the last post.

The last piece of content is "How to map a PostgreSQL ARRAY to a Java List with JPA and Hibernate". This is from vladmihalcea.com. So if you have an interest in doing that, mapping arrays to a Java list, definitely a blog post to check out.
