State of Postgres, Dumping DBs, Configuration Parameters, Partial WAL | Scaling Postgres 94
In this episode of Scaling Postgres, we discuss the State of Postgres, how best to dump your database, setting configuration parameters and managing partial WAL files.
Content Discussed
- State of Postgres
- 5 Quick Thoughts on the State of Postgres Survey
- How to effectively dump PostgreSQL databases
- Setting PostgreSQL configuration parameters
- Barman 2.10 – Recovery of partial WAL files
- How to Perform a Major Version Upgrade Using pg_upgrade in PostgreSQL
- How to run Postgres on Docker part 3
- PoWA 4: New powa-collector daemon
- Odyssey Scalable PostgreSQL Connection Pooler
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "The State of Postgres". This is from stateofpostgres.com, and it's a survey published along with results from 500 members of the Postgres community. It was done by Timescale, and they have a very interactive site that goes over the results and some conclusions from the survey. Now, personally, I like the top link here, the full results, to be able to see where the different numbers came from. It's from about 500 people, so not a huge number of the people using Postgres, but there are nevertheless interesting pieces of information. For example, it was predominantly software developers and engineers who responded, and the vast majority are using Postgres for work doing software development, along with a whole other set of questions that were asked. So if you're interested in this type of content, it's definitely a link to check out.
The next piece of content, related to that, is "5 Quick Thoughts on the State of Postgres Survey". This is from the zillablog at xzilla.net, and he mentions five points from his reading of the survey. The first thing he mentions is that 500 people is a pretty small subset of everyone in the Postgres community, so that's something to keep in mind. He appreciates the slow and steady growth: it's not that people have suddenly started using Postgres in the last few years; it's had a long run-up, which hopefully demonstrates that it will remain relatively stable. He also notes that it was predominantly software developers using Postgres for work who filled out the survey. With regard to the discussion of the clouds, I didn't necessarily agree with some of the things mentioned, as I felt the survey gave some pretty clear indications about how people are running Postgres. He also had some opinions about particular quotes that were picked out. Overall, I thought this was a good additional piece of content to complement the survey, so if you're checking that out, you may want to check this out as well.
The next piece of content is "How to effectively dump PostgreSQL databases". This is from depesz.com. Now of course there's the tool pg_dumpall, which dumps the entire database cluster, every database with it along with the global things such as roles, table spaces, users, et cetera. And then pg_dump just dumps a single database. All of those globals that are global for the entire cluster are not dumped. Basically, he discusses the advantages and disadvantages of each one. He prefers using pg_dump and not pg_dumpall. Pg_dumpall is very convenient to get started with. But really as you increase your sophistication you're probably going to want to switch to using pg_dumps. So using pg_dumpall has a number of disadvantages. You can't easily restore, say, a single table. You can't easily do dumps or restores in parallel. So what he advocates is dumping the globals and then dumping each database individually. He talks about the different formats that you can do with pg_dump, that you can't do with pg_dumpall.
Pg_dumpall only dumps to the plain text format, whereas other formats, such as the directory format, allow you to do dumps and restores in parallel; you can use the -j option to specify how many jobs to run. You can even do a partial restore, so you can restore just a particular table, for example. He has an example script for this backup process: basically, he dumps the roles into one file, dumps the tablespaces into another, and then loops over the databases, doing a pg_dump for each one. So, a very good blog post that explains the rationale for using pg_dump versus pg_dumpall. If you're using logical backups as your primary way to back up Postgres, I would definitely advocate the solution he has outlined here.
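As a rough illustration of the approach he describes, a minimal sketch might look like this; the backup path and the choice of four parallel jobs are assumptions to adapt to your environment:

```sh
#!/usr/bin/env bash
set -e
backup_dir=/var/backups/pg
mkdir -p "$backup_dir"

# Dump the globals separately: roles in one file, tablespaces in another
pg_dumpall --roles-only       > "$backup_dir/roles.sql"
pg_dumpall --tablespaces-only > "$backup_dir/tablespaces.sql"

# Dump each database individually in directory format (-Fd),
# which supports parallel dump and restore via -j
for db in $(psql -Atqc "SELECT datname FROM pg_database WHERE NOT datistemplate"); do
    pg_dump -Fd -j 4 -f "$backup_dir/$db.dump" "$db"
done

# Directory-format dumps restore in parallel, or even one table at a time:
#   pg_restore -j 4 -d appdb "$backup_dir/appdb.dump"
#   pg_restore -d appdb -t some_table "$backup_dir/appdb.dump"
```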
The next post is "SETTING POSTGRESQL CONFIGURATION PARAMETERS". This is from cybertec-postgresql.com. Now, the first way to set the configuration, of course, is the postgresql.conf file. So that has all the different configuration parameters that you can just go in and edit, although it does support having included files. So there's usually a separate directory called conf.d that you can put files into that will be included as part of the postgresql.conf. Now, I tend to use Ansible configuration management to do my PostgreSQL configurations, and you can do a copy of this, but I prefer to use included files. So just say this is the configuration that I'm changing and basically leave the postgresql.conf file unchanged. Now, a thing to keep in mind is that the last parameter set wins. So if you've defined, say, shared buffers in two different places, once in postgresql.conf and once in an included file, the last one to be read will be set.
So generally, the included files will override what's set in the postgresql.conf file. The next way to change configuration is by running the ALTER SYSTEM command. What ALTER SYSTEM does is modify the postgresql.auto.conf file; this is a file that specifically says not to edit it manually, because it will be overwritten by the ALTER SYSTEM command. So any ALTER SYSTEM commands write their changes to this file to persist across restarts, and these values take precedence over what's set in the postgresql.conf file.
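A quick sketch of what that looks like in practice; the parameter and value are just examples:

```sh
psql -c "ALTER SYSTEM SET work_mem = '64MB'"   # written to postgresql.auto.conf
psql -c "SELECT pg_reload_conf()"              # apply the change if it is reloadable
psql -c "ALTER SYSTEM RESET work_mem"          # remove the override again
```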
The other thing they mention is that you can set parameters when starting the server. So for example, when you start Postgres with the pg_ctl command, you can pass in parameters at that time so that the database cluster runs with them. The next level at which you can set configuration is the database level, using an ALTER DATABASE command. Not all parameters can be set this way; certain ones that cover the entire cluster cannot, of course, be set using an ALTER DATABASE statement. But it's another way to set certain parameters.
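Here's a sketch of both of those methods; the data directory, database name, and values are assumptions:

```sh
# Pass a parameter on the command line at server start
pg_ctl start -D /var/lib/postgresql/data -o "-c work_mem=64MB"

# Set a parameter for every new connection to one particular database
psql -c "ALTER DATABASE appdb SET work_mem = '64MB'"
```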
You can also alter a user and set parameters that impact only that user, and you can even alter a user in one particular database, so there's more granularity still. Lastly, of course, there's the session level: once you've made a connection to a PostgreSQL server, within that session you can set certain parameters that persist for the duration of the session. They even go into the concept of assigning parameters to functions. So, definitely an interesting blog post; if you want to know where the different configuration settings can be set in PostgreSQL, this is a very good post that reviews all of those capabilities.
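To round out the hierarchy, here's a sketch of the remaining levels; the role, database, and function names are hypothetical:

```sh
# Per user, and per user in one specific database
psql -c "ALTER ROLE reporting SET work_mem = '256MB'"
psql -c "ALTER ROLE reporting IN DATABASE appdb SET work_mem = '512MB'"

# Per session: affects only the current connection
psql <<'SQL'
SET work_mem = '1GB';
SHOW work_mem;
SQL

# Per function: the setting applies while the function executes
psql -c "ALTER FUNCTION heavy_report() SET work_mem = '512MB'"
```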
The next post is "Barman 2.10 - Recovery of partial WAL files". This is from 2ndquadrant.com. Now this is generally an announcement of the availability of Barman 2.10 but it also discusses partial WAL files and how they've made it a lot easier to handle those. Now I bring this up because it's another post that also educates along with doing an announcement. So for example, normally when you have archive mode enabled in PostgreSQL, you're going to enable an archive_command that's going to save those WAL files to a particular location. Generally, I always have archive_command running, but there's also the ability to stream the WAL files to another system that's running the pg_receivewal process. So essentially doing physical replication of the WAL to a separate system or process. Barman actually supports both methods. Archive_command can copy it to a particular directory defined by Barman as well as use a pg_receivewal process to stream the WAL changes.
Now, the difference is that archive_command waits for an entire WAL file to be filled before archiving it, so a segment must be complete before the archive command runs, whereas pg_receivewal constantly streams the changes as they're written to the WAL. In that case, while a file isn't yet full, you get a file with a .partial suffix. The advantage is that you will have more data available if you're using pg_receivewal rather than relying on archive_command alone, so you will potentially lose less data, as they discuss here: your recovery point objective will be much closer to zero, and it can be zero if you're using synchronous WAL streaming. They say Barman already supported fulfilling an RPO of zero, but the 2.10 release makes the recovery process a lot more seamless by automatically handling these .partial WAL files. So if you use Barman, it's definitely a blog post to check out for the update, and you'll also get some good information about the differences between using an archive_command and using pg_receivewal.
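As an illustration of what that looks like on disk, plus the synchronous option; the segment names and slot are hypothetical:

```sh
# The segment currently being streamed carries a .partial suffix until complete:
#   000000010000000000000041          <- completed, fully received segment
#   000000010000000000000042.partial  <- still being written

# --synchronous flushes WAL to disk as it arrives and reports back; combined
# with synchronous_standby_names on the primary, this can achieve an RPO of zero
pg_receivewal -h primary -D /var/lib/barman/pg/streaming -S barman --synchronous
```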
The next post is "How to Perform a Major Version Upgrade Using pg_upgrade in PostgreSQL". This is from crunchydata.com, and this is a relatively short post, but it goes through the process of how you could do a pg_upgrade, listing the different commands here. Now, you'll probably find much more comprehensive documentation for doing an upgrade, but this is a simple post that actually just goes over the steps you would need to go through. There are points of validation you'll want to do and things of that nature, but this gives you a high-level overview of how to do an actual pg_upgrade. So definitely a blog post to check out if you're interested in that.
The next post is "How to run Postgres on Docker part 3". This is from optimadata.nl. So this discusses the process of setting up a primary and a replica as docker containers and having them sync data between each other. Now, I didn't actually see where Part 2 and Part 1 were, however, this shows you how to set up a cluster of PostgreSQL instances. So if you're interested in doing that, definitely a blog post to check out.
The next post is "PoWA 4: New powa-collector daemon". This is from rjuju.github.io. This discusses the performance monitor, PoWA, for Postgres, and it's in a beta state. This discusses the implementation of the collector. So if you use PoWA or you're considering it, definitely a blog post to check out.
The last post covers a connection pooler you may be unfamiliar with: "Odyssey Scalable PostgreSQL Connection Pooler". They're classifying it as scalable because it uses a multi-threaded architecture. It's by Yandex, so it is apparently driving a fair amount of traffic. Now, I tend to use PgBouncer, but if you're interested in connection pooler alternatives, you may want to check out the Odyssey project here.