Scaling Advice, Vicious Cycle, Improve Observability, Addressing Corruption | Scaling Postgres 150
In this episode of Scaling Postgres, we discuss scaling advice, a vicious cycle of WAL writes, how to improve observability and addressing database corruption.
Content Discussed
- Postgres scaling advice for 2021
- The vicious circle of huge writes
- Configuring PostgreSQL for Observability
- Avoiding, Identifying, and dealing with PostgreSQL database corruption – Part 1
- Our Journey to PostgreSQL 12
- Monitoring replication: pg_stat_replication
- How to run some tasks without user intervention, at specific times? – part 2
- [YouTube] Managing PostgreSQL with Ansible
- How to Create PostgreSQL Custom Builds and Debian Packages
- A preliminary exploration of dynamic hashing in PostgreSQL
- Deploy PostgreSQL With TLS in Kubernetes
- Deploy TLS for pgBouncer in Kubernetes
- Does PostGIS 3.1 mean the end of TopologyException?
- Better OpenStreetMap places in PostGIS
- Franck Pachot
Transcript
All right. I hope you, your friends, family, and coworkers can continue to do well. Our first piece of content is "POSTGRES SCALING ADVICE FOR 2021". This is from cybertec-postgresql.com and in a nutshell, most of the advice is don't prematurely optimize. So if you want to scale, don't start by creating some huge distributed database system. Just start with a single server, because Postgres can get you a long way on just that single server. It covers points such as: all distributed systems are inherently complex and difficult to get right, so the more you try to scale out or become distributed, the more complicated things get. He mentions that with a good server, a single PostgreSQL instance can easily do hundreds of thousands of transactions per second, and he did some read-only tests on his local system and got 32,000 transactions per second.
Then he looked at a test on another workstation that did around 45,000 write transactions per second. So he says basically a single node can typically do tens of thousands of write transactions per second and that a single Postgres instance can easily handle dozens of terabytes of data, which is true. He also says a single node instance is literally bulletproof as far as data consistency is concerned. But to help achieve this, you need to declare your constraints so that your data is in a good state, and don't fool around with the fsync or asynchronous commit options. So make sure that things get committed to the disks appropriately and use good disk systems, because we're actually going to cover a post in a little bit that talks about corruption, and one of the main ways corruption can happen is through hardware issues.
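On the constraints point, here is a rough illustration (the table and column names are hypothetical, not from the post) of letting Postgres enforce consistency for you rather than trusting the application:

```sql
-- Hypothetical schema: constraints let Postgres reject inconsistent data outright
CREATE TABLE customers (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE orders (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL REFERENCES customers (id),  -- foreign key keeps orphan rows out
    total_cents integer NOT NULL CHECK (total_cents >= 0),  -- check constraint guards bad values
    created_at  timestamptz NOT NULL DEFAULT now()
);
```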
But he also covers how you have to be prepared for when you need to scale. So if you need to scale out, what do you need to do? So the first thing he mentions is don't be afraid to run your own database and that you can migrate to the cloud later if that's something you feel you need to do. Don't make the mistake of having your entire data architecture centered around one huge table. Be sure to use plenty of tables to spread things out. Thirdly, make sure you bake in a proper sharding key in case that is a direction you have to go. So in other words, make sure you can partition all of your tables pretty much with a shard key. So if all of your tables include that key, then presumably you could shard your data that way. So this post didn't have a lot of specific advice for scaling because those are covered in other posts, but it's more kind of a thought post for not prematurely optimizing your database. So if you're interested in learning more you can check out this post.
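On the sharding-key point, a minimal sketch (hypothetical tables, not from the post) would be to carry the prospective shard key, say a customer ID, in every table so the data could later be partitioned or distributed along it:

```sql
-- Hypothetical: every table includes customer_id, the prospective shard key
CREATE TABLE accounts (
    customer_id bigint NOT NULL,
    account_id  bigint NOT NULL,
    PRIMARY KEY (customer_id, account_id)
);

CREATE TABLE events (
    customer_id bigint NOT NULL,
    event_id    bigint NOT NULL,
    payload     jsonb,
    PRIMARY KEY (customer_id, event_id)
);
```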
The next piece of content is "The vicious circle of huge writes". This is from mydbanotebook.org. They're talking about a cycle of WAL writing that you can get into and that if you have a lot of write activity, a lot of WAL files are being written that can cause checkpoints to occur more frequently, which of course triggers more full-page image writes which gives you more WAL. So more WAL leads to more frequent checkpoints, which leads to more WAL being generated, which leads to more checkpoints. So it's a vicious circle and then you have autovacuum kicking in when you're updating a lot of rows for example, which then also causes more WAL to be written, thus propagating this vicious circle. So there's a graph right here that depicts that you have huge writes on a database in a short time that causes a lot of WAL to be written. Now this triggers checkpoints more frequently, which of course causes more WAL to be written because of the full-page image rights.
Then, while this is going on, it also triggers autovacuum to happen more frequently when you have a lot of huge writes, which causes more WAL to be written, and it's a constant vicious circle. Now, in terms of how to solve that, the number one thing she's advocating is to look at what your application is doing. So for example, if you have a lot of updates, is there a way you can reduce the number of updates? Like maybe you insert a lot of this data and then update it less frequently? Or can you architect your writes a little differently? The other thing you can do is increase your max_wal_size so that checkpoints don't happen as frequently, or change your checkpoint_timeout or other checkpoint settings so that checkpoints are less frequent, which gives you fewer full-page image writes. You could also potentially do some tweaks to autovacuum. So if you're interested in learning more about this, you can check out this post.
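To make that concrete, here is a rough sketch of the checkpoint-related settings involved (the values are purely illustrative, not recommendations from the post; the right numbers depend on your write volume and hardware):

```sql
-- Illustrative values only; tune to your workload
ALTER SYSTEM SET max_wal_size = '16GB';              -- allow more WAL to accumulate between checkpoints
ALTER SYSTEM SET checkpoint_timeout = '30min';       -- time-based checkpoints happen less often
ALTER SYSTEM SET checkpoint_completion_target = 0.9; -- spread checkpoint I/O out over the interval
SELECT pg_reload_conf();                             -- these parameters only need a reload, not a restart
```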
The next piece of content is "Configuring PostgreSQL for Observability". This is from pgdash.io. This is being able to determine what's going on in your database. Now the first thing they mentioned doing is actually changing your log_line_prefix to add a bit more information to add information such as the application being used in terms of the client, the username, and the database connected to the IP address, so you could potentially identify the host and things of that nature. So this gives you more observability of what's going on. When a particular error, say, gets written to the logs, the next thing they mentioned is to log your slow queries. So you could use log_min_duration_statement.
They propose 1 second, so any query longer than a second will get logged. Now, if you have too many logs, you can actually use sampling in more recent versions of Postgres, and they give a proposal of how you could sample, say, 25% of the slow statements. In addition to that, they mention logging your locks, so for example turning on logging of lock waits, and they also mention logging autovacuums, specifying a duration so that when a vacuum is taking too long you can know what's going on and what happens with the process. They also cover logging checkpoints, and you might want to alter, say, track_io_timing or track_commit_timestamp.
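Put together, those logging settings might look something like this (representative values, not necessarily the exact ones from the post):

```sql
-- Representative observability-related logging settings
ALTER SYSTEM SET log_line_prefix = '%m [%p] app=%a user=%u db=%d host=%h ';
ALTER SYSTEM SET log_min_duration_statement = '1s';   -- log statements slower than one second
ALTER SYSTEM SET log_lock_waits = on;                 -- log waits longer than deadlock_timeout
ALTER SYSTEM SET log_autovacuum_min_duration = '1s';  -- log autovacuum runs that take too long
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();
```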
Now in addition to configuration settings, you could also add the pg_stat_statements extension so that you can get a sense of all the different statements being run. You could also use the auto_explain extension, which will automatically put an EXPLAIN plan in the log when queries take too long, and they show you how to configure that. Then lastly, on the client side, you can actually specify the application_name parameter so that you can uniquely identify which clients are connecting to the database. So for example, they set an application name for the client that's running the weekly revenue report, so they'll know exactly what connection is causing particular issues if there's a problem in the logs. This is an interesting blog post that you can check out.
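A sketch of those remaining pieces (values and the application name are illustrative; changing shared_preload_libraries requires a server restart):

```sql
-- pg_stat_statements and auto_explain must be preloaded (restart required)
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements,auto_explain';

-- after the restart:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
ALTER SYSTEM SET auto_explain.log_min_duration = '1s';  -- log EXPLAIN plans for slow queries

-- on the client side, identify the connection (name is hypothetical)
SET application_name = 'weekly_revenue_report';
```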
The next piece of content is "Avoiding, Identifying, and dealing with PostgreSQL database corruption- Part 1". This is from highgo.ca. So they talk a little bit about database corruption and some of its causes. Mainly bad hardware and bad software are also culprits. One thing they're mentioning here is that CFS has been known to have some issues on Linux. I don't know how true this is anymore, but that's another indication. Misconfiguration in user errors and then they go into symptoms of a corrupted database. So maybe you have duplicate rows that shouldn't be there or rows are deleting or not appearing anymore. Or you have system crashes attempting to read the data or even error messages that are related to internal tables or internal file structures and it's not related to client activity. Those could be indications of corruption. Now in terms of the best practices to avoid data corruption, they talk about backups and restores. Now, of course, this won't prevent corruption but will help you recover from it in the worst-case scenario.
But in terms of ways to avoid it, keep your fsync on, keep your full_page_writes on, and set an appropriate value for wal_sync_method. Additionally, if you want to protect yourself even more, you can turn on checksums for your data. The last thing they cover is some other best practices. For example, they mention doing a plug-pull test of your system to make sure that it can recover successfully, so you know all your hardware and/or software is working as you expect. Next, never modify your data files manually; it would be very bad if you did that. Don't go in and delete the postmaster.pid file. Don't run antivirus on the data directory, and don't kill any PostgreSQL processes; use pg_cancel_backend or, if you really need to, pg_terminate_backend instead. So this is a great list of recommendations for trying to avoid corruption of your database. I definitely suggest you check it out.
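As a quick illustration of verifying those reliability settings and stopping backends the safe way (the pid is just a placeholder):

```sql
-- These reliability settings default to on; verify nobody has turned them off
SHOW fsync;
SHOW full_page_writes;
SHOW wal_sync_method;      -- should be a method appropriate for your platform
SHOW data_checksums;       -- checksums are enabled at initdb time (initdb --data-checksums)

-- Never kill -9 a backend; cancel or terminate it from inside Postgres instead
SELECT pg_cancel_backend(12345);     -- cancel the running query (placeholder pid)
SELECT pg_terminate_backend(12345);  -- terminate the whole connection if you must
```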
The next piece of content is "Our Journey to PostgreSQL 12". This is from coffeemeetsbagel.com and they're talking about an upgrade they did from 9.6 to 12.4. Now, what I found interesting is that they give a description of their architecture here. Basically, they have three replicas, an additional background replica, and an ETL replica. They do have PgBouncer that's in the mix and they needed to upgrade this system. Now, the reason for the upgrade is that the primary database had been running for about 3.5 years and it had a number of oddities. So for example, quote "...new services would refuse to run in systemd (we ended up running a datadog agent in a screen session), and sometimes CPU usage went above 50% it would completely be unresponsive to SSH". So that's a serious concern. Then their disk usage was an issue because they were running out.
So they needed to address that. When doing the upgrade they wanted to have minimal downtime, of course, and to roll out the new clusters on new instances to replace the current aging fleet, upgrading their instances to have a little bit more headroom. Now, they considered doing a logical dump and restore, but that would take way too long. Then they considered pg_upgrade, but they didn't want to have any sort of downtime at all. So they chose to use logical replication. Specifically, they decided to use pglogical because I believe it has some additional niceties, such as being able to synchronize the sequences, which logical replication normally does not. They went through the process of doing it: basically, they first brought up a new Postgres 12.4 primary that was replicating via pglogical. Then they set up new 12.4 replicas from the new 12.4 primary using standard streaming replication.
Then once everything looked good, they put the site into maintenance mode, switched the DNS record to the new primary database, and forced a sync of all the primary key sequences, which I believe is a pglogical capability. They ran a manual checkpoint on the old primary, performed some data validation tests against the new primary, and brought the system back up, and it happened successfully. Now, they did mention a couple of issues, one being that slow synchronization can be dangerous. During the initial pglogical synchronization, they had to copy a lot of data, and that was causing WAL to build up, because the WAL is retained while the initial data copy is being done. That was a concern for them, so they had to cancel it.
What they did to make it happen more quickly is they dropped all the indexes on the table being synchronized. They set fsync to off, which I would be really concerned about setting given what we understand about database corruption. They set max_wal_size to 50GB and checkpoint_timeout to 1 hour, so basically they spread out the checkpoints to minimize the amount of WAL writing. With that, they were able to get the data transferred over. Also, they mentioned that every update is logged as a conflict, so they went ahead and set pglogical.conflict_log_level to debug to avoid those messages. So if you're interested in learning more about how they did this, go ahead and check out this blog post.
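Roughly, the temporary settings described would look like this (a sketch based on the post's description, not their exact commands; fsync = off risks corruption after a crash, which they accepted only because the copy could be thrown away and redone):

```sql
-- Temporary settings for the initial pglogical copy only; revert afterwards
ALTER SYSTEM SET max_wal_size = '50GB';
ALTER SYSTEM SET checkpoint_timeout = '1h';
ALTER SYSTEM SET fsync = off;                             -- dangerous; only for a disposable copy
ALTER SYSTEM SET pglogical.conflict_log_level = 'debug';  -- quiet the per-update conflict messages
SELECT pg_reload_conf();
```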
The next piece of content is "MONITORING REPLICATION: PG_STAT_REPLICATION". This is from cybertec-postgresql.com. So this talks about looking at the pg_stat_replication view on the primary to determine what replication is happening. As they mentioned here, it only knows what replicas are connected to that primary. So if you have a cascading streaming setup here where this primary synchronizes to these replicas, it doesn't know anything about these replicas. So it only has two entries in pg_stat_replication. Whereas this replica that has three replicas against it has three entries of these. So just something to keep in mind if you're using this view to understand what's going on with your pg_stat_replication. You have four LSN indicators.
One is the sent LSN: has the primary sent this LSN to the replica? Second is the write LSN: has it been written to the operating system on the replica? Not necessarily flushed to disk, but the OS knows that it's a write that needs to be done. Third is the flush LSN: has that LSN been flushed to disk on the replica? Then finally the replay LSN, which is the last LSN that has been replayed into the database itself. So if you query this view you get that data back. They also talk about replication lag: there are time-based lag columns in the pg_stat_replication view, so you can determine the write lag, the flush lag, and the replay lag, and know how closely the replicas are keeping up with the primary. So if you're interested in learning more you can check out this post.
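For reference, querying the view on the primary looks roughly like this:

```sql
-- Run on the primary: one row per standby (or other WAL receiver) connected directly to it
SELECT application_name,
       client_addr,
       state,
       sent_lsn, write_lsn, flush_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
```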
The next piece of content is "How to run some tasks without user intervention, at specific times- part 2". This is from depesz.com. So this is part two of a post and this one talks about pg_cron. So it's another extension you can use with Postgres to do Cron scheduling of tasks. So if you're interested in that you can check out this blog post.
Next piece of content is actually a YouTube video: "Managing PostgreSQL with Ansible". This is from the San Francisco Bay Area PostgreSQL Users Group. This has about a 50-minute presentation on how to use Ansible to configure and set up Postgres.
The next piece of content is "How to Create PostgreSQL Custom Builds and Debian packages". This is from percona.com, so it does exactly that. It walks you through the process of actually creating your own Debian packages for custom builds of Postgres.
The next piece of content is "A preliminary exploration of dynamic hashing in PostgreSQL". This is from highgo.ca. Now, this is a very internally related post to Postgres, so if that is of interest to you, you can check out this post.
Next piece of content is "Deploy PostgreSQL with TLS and Kubernetes". This is from blog.crunchydata.com. So if you use Kubernetes and want to set up TLS with Postgres, check out this post.
Also from blog.crunchydata.com is "Deploy TLS for pgBouncer in Kubernetes". So if you are setting up PgBouncer with Kubernetes, you can follow this post to determine how to set that up.
The next piece of content- "Does Post GIS 3.1 mean the end of TopologyException?". This is from elephanttamer.net. So apparently when you're doing a spatial analysis, certain activities can fail when geometries don't comply with OGC validity rules. But the hope was that with this update, these exceptions could be reduced. Unfortunately, in the testing that was done here, the conclusion was that there were still some issues with it. So he basically says you probably shouldn't hurry to update PostGIS, but if you want to learn more about that, check out this post.
The next piece of content is "BETTER OPENSTREETMAP PLACES IN POSTGIS". This is from rustprooflabs.com and they have been talking about the new flex output available in OpenStreetMap. They cover some additions and changes since 0.0.3 they were looking at 0.0.7. So if you're interested in some of the additions and changes, you can check out this blog post.
The last piece of content: the PostgreSQL person of the week is Franck Pachot. So if you're interested in learning more about Franck and his contributions to Postgres, definitely check out this blog post.