Managing Autovacuum, Better JSON, Avoiding Updates, OS Tuning | Scaling Postgres 168
In this episode of Scaling Postgres, we discuss managing autovacuum, better JSON in Postgres 14, how to avoid redundant updates and operating system tuning.
Content Discussed
- Enabling and disabling autovacuum in PostgreSQL
- Better JSON in Postgres with PostgreSQL 14
- PostgreSQL Builtin Trigger Function to Speed Up Updates
- Tuning Debian / Ubuntu for PostgreSQL
- Monitor PostgreSQL Performance Using a Database Health Dashboard
- postgres_fdw Enhancement in PostgreSQL 14
- Postgres 14 highlight - REINDEX TABLESPACE
- Dropping support for version 2 protocol
- Implementing Incremental View Maintenance for PostgreSQL (Part 2)
- Getting started with pg_bulkload
- Disaster Recovery Strategies for PostgreSQL Deployment on Kubernetes
- PostGIS at 20, The Beginning
- pgSCV — metrics exporter for PostgreSQL (not only).
- Nikolay Samokhvalov
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "ENABLING AND DISABLING AUTOVACUUM IN POSTGRESQL". This is from cybertec-postgresql.com, and they're talking about more than just enabling and disabling; they also cover a bit of how vacuum works. The three main things that vacuum takes care of are: first, creating statistics for the Postgres optimizer; second, cleaning out dead rows from a table; and third, handling wraparound protection. In terms of creating statistics, the query planner uses statistics gathered from the tables to build the most optimized plan for answering queries. One of autovacuum's jobs is to keep those statistics up to date. Now, you can run a separate ANALYZE command, but autovacuum does it as well. They mention that you can also tweak or adjust the frequency at which it runs, either for the whole database or at the table level.
You just use a few different configuration parameters, such as autovacuum_analyze_threshold and autovacuum_analyze_scale_factor. Now, autovacuum also cleans out dead rows. Anytime you run an UPDATE or DELETE, Postgres doesn't remove the row immediately, because it uses multiversion concurrency control (MVCC). It keeps old row versions around so other users of the system can still see that data until it's no longer needed by any transaction. Therefore, an UPDATE or DELETE merely marks rows for deletion, and at some point later they are actually removed. That removal is handled by vacuum or autovacuum. The next area autovacuum covers is handling transaction ID wraparound. Every transaction in the system is identified by an ID, and those IDs eventually wrap around: each new transaction gets a new ID, and the counter keeps moving forward.
But there is a roughly 2 billion transaction ID limit, and you don't want to hit it. So vacuum goes in and freezes old rows so that those transaction IDs can be safely reused in the future. That wraparound issue needs to be addressed, and it's another thing autovacuum does. Now, they say you can enable or disable autovacuum on a per-table basis with the autovacuum_enabled storage parameter (ALTER TABLE ... SET (autovacuum_enabled = off) or = on), but you really don't want to do that. They say there's only one use case where it may make sense, and that is to avoid autovacuum work on a table you're about to throw away. If you're creating a staging table to process some data and you're going to discard it, that's the only case where it makes sense, so you don't incur additional processing for a table you're going to get rid of anyway. Basically, you're just creating the data and storing it elsewhere.
In this particular case it's a throwaway table, so you could turn autovacuum off there. The last thing they cover is why autovacuum is so slow. Generally, when you run a manual VACUUM, it runs pretty quickly, whereas autovacuum runs much slower. The main reason is autovacuum_vacuum_cost_delay. In previous versions it was set to 20 milliseconds; for versions 12 and higher, the default is 2 milliseconds. But for a manual VACUUM, it's essentially zero. The other factor that can help performance is increasing the cost limit, which controls how much work (each action having a particular cost) can be done before vacuum pauses for the delay. You can raise that limit or reduce autovacuum_vacuum_cost_delay; either should speed up your vacuums. So if you want to learn more, you can check out this blog post.
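As a quick illustration of the kinds of settings discussed, here is a minimal sketch; the table names are hypothetical and the values are only examples to adapt to your own workload:

```sql
-- Analyze a busy table more often by lowering its table-level thresholds
ALTER TABLE orders SET (
    autovacuum_analyze_scale_factor = 0.02,
    autovacuum_analyze_threshold = 1000
);

-- Disable autovacuum only for a throwaway staging table you will drop anyway
ALTER TABLE staging_import SET (autovacuum_enabled = off);

-- Let autovacuum do more work per cycle cluster-wide (requires a config reload)
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
SELECT pg_reload_conf();
```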
The next piece of content- "Better JSON in Postgres with PostgreSQL 14". This is from blog.crunchydata.com. They're talking about the new feature of being able to do subscripting with JSON. And they also cover the fact that Postgres supports JSON with two data types. One is a JSON data type, the other is a JSONB. Generally, you always want to use JSONB because it enables indexing and it compresses and gets rid of white space in the JSON. If you want to preserve how JSON was received, then you could use the JSON data type. Otherwise, just use the JSONB. Now, before when you wanted to query a particular table and get information from it in the where clause, you'd have to use these arrow operators to be able to pull out the exact data that you want. But now with subscripting, you can do it this way.
So it's much cleaner, particularly for people who are used to Ruby or Python or other languages that do something similar. You can also easily update a single value inside the JSON with an UPDATE statement using the same bracket syntax. So this is far superior to using what I'll call the arrow operators from before. Now, the key thing you also need to do is index it for better performance. You can index JSONB columns using a GIN index, but you're going to have to query it slightly differently and use the contains operator (@>). I haven't tested it yet, but it looks like you can't necessarily use subscripting and get the indexes to work in some cases, so maybe just run an EXPLAIN ANALYZE if you're trying it out with subscripting. But this is another post from Crunchy Data discussing the new JSON subscripting allowed in Postgres 14.
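Here's a small sketch of what the subscripting and indexing look like; the shirts table and its contents are just an illustration, not necessarily the exact example from the post:

```sql
-- Hypothetical table with a JSONB column
CREATE TABLE shirts (id bigserial PRIMARY KEY, details jsonb);
INSERT INTO shirts (details)
VALUES ('{"attributes": {"color": "neon yellow", "size": "medium"}}');

-- Postgres 14 subscripting in a WHERE clause and in an UPDATE
SELECT * FROM shirts
WHERE details['attributes']['color'] = '"neon yellow"';

UPDATE shirts
SET details['attributes']['color'] = '"black"'
WHERE id = 1;

-- GIN index; to make use of it, query with the contains (@>) operator
CREATE INDEX idx_shirts_details ON shirts USING gin (details);

SELECT * FROM shirts
WHERE details @> '{"attributes": {"color": "neon yellow"}}';
```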
The next piece of content- "PostgreSQL Builtin Trigger Function to Speed Up Updates". This is from fluca1978.github.io. Now, I don't quite agree with this blog post name for what they discuss. They're discussing a trigger that is called supress_redundant_update_trigger. That's exactly what this post discusses. It's a way to avoid essentially redundant updates of a value. So for example, if you're going to say, update a pgbench account and set the filler equal to the filler, well, essentially every update is going to be entirely redundant. You don't need to do that. Now, if you run this command, what essentially you get is a double in the size of rows, like we were talking about MVCC or MultiVersion Concurrency Control creates a new row, it doesn't update the old one, and you need to vacuum it away.
So it doubles the size of the table and accomplishes no useful work in the end; you can see that it updated 10 million rows here. However, if you use this built-in trigger function (they created a BEFORE UPDATE trigger on pgbench_accounts FOR EACH ROW that executes it), the same update touches zero rows. In terms of timing, depending on how bloated the table was, he saw roughly a 10-60% difference in execution time. But what's great is that it essentially does no work and shouldn't create any additional row versions that then need to be vacuumed up. He goes on to do some further analysis and actually implements the same behavior with normal triggers, but the internal trigger function is much more efficient. So this is a great tool to avoid redundant updates that may be happening in your application. If you want to learn more about it and how it works, definitely check out this blog post.
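A minimal sketch of wiring up the built-in function (the trigger name is arbitrary):

```sql
-- Attach the built-in trigger function so updates that change nothing are skipped
CREATE TRIGGER avoid_redundant_updates
    BEFORE UPDATE ON pgbench_accounts
    FOR EACH ROW
    EXECUTE FUNCTION suppress_redundant_updates_trigger();

-- This now reports "UPDATE 0" instead of rewriting every row
UPDATE pgbench_accounts SET filler = filler;
```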
The next piece of content- "Tuning Debian/Ubuntu for PostgreSQL". This is from enterprisedb.com, and they're talking about some settings you should make at the OS level to fully optimize PostgreSQL performance. Now, I haven't necessarily set all of these in my installations, but we'll cover some of what they discuss here. One is doing some kernel-level tuning using sysctl in terms of changing how often things swap, how things are flushed to dirty pages, turning off TCP Timestamps, and a few others. They also discussed disabling transparent Huge Pages. So that's another recommendation as well as enabling and configuring Huge Pages. So if you want to get the most performance out of your hardware, definitely check out this post.
The next piece of content- "Monitor PostgreSQL Performance Using a Database Health Dashboard''. This is from arctype.com, and a lot of this post does talk about Arctype and using it as a dashboard tool. But what I found more interesting is that in the post they do discuss queries that you can use to get the same information to present on a dashboard. So you could actually just use these queries against Postgres and build your own dashboard using another tool. So for example, this is the query they use to get a list of all the tables from the database. This is how you can get row counts in a given database schema, check for open connections in your database, get the size of your database, calculate cache hit ratios, scan for unused indexes, and also a way to inspect database caches. So this post is a great one to cover all the different ways to collect metrics about your database just using SQL queries. So if you're interested in that, you can check out this post.
The next piece of content- "postgres_fdw Enhancement in PostgreSQL 14". This is from percona.com, and they're talking about the enhancements that have been made to the foreign data wrapper for Postgres. Again, they're moving toward basically a scale-out scenario, so they keep adding new features to the foreign data wrapper because it's envisioned as the method for scaling out databases. The first performance feature they mention is being able to push down parallel/async scans to a foreign server, and they discuss how you can enable and use this. They mention bulk inserts, which were covered in last week's episode of Scaling Postgres, where you can insert rows in bulk more efficiently. You can now run the TRUNCATE command on a foreign table. The LIMIT TO clause of IMPORT FOREIGN SCHEMA can now pull in child partitions. You can also see which foreign data wrapper connections are open and decide whether to keep them, as well as re-establish broken connections. So if you want to learn more about the foreign data wrapper features being added to Postgres 14, definitely check out this blog post.
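A short sketch of what a couple of these look like in practice; the foreign table name is hypothetical, and this assumes postgres_fdw is already set up:

```sql
-- New in Postgres 14: TRUNCATE is pushed down to the remote table
TRUNCATE remote_events;

-- Inspect the connections postgres_fdw is holding open, and drop them if needed
SELECT * FROM postgres_fdw_get_connections();
SELECT postgres_fdw_disconnect_all();
```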
The next piece of content- "Postgres 14 highlight- REINDEX TABLESPACE". This is from paquier.xyz. He's talking about a new feature coming in 14 where you can actually do a reindex and at the same time choose a different table space for those indexes to go to. So this is great if you're running out of disk space on say, your primary table space and you want to set up a second one. The number one thing that I mostly do is recreate indexes on that secondary table space to be able to reclaim space on the primary one. Well, with this, you could just do a REINDEX CONCURRENTLY command and specify this other table space and it will move the indexes for you. So it sounds much more convenient. So if you want to learn more about this patch, definitely check out this post.
The next piece of content- "Dropping support for version 2 protocol". This is from pgsqlpgpool.blogspot.com and he's talking about since version Postgres Go 7.4, Postgres started to use a version three protocol for clients talking to servers. Through all this time, they've still been supporting version 2, the one prior. But with 14, they're actually getting rid of version 2 and only using version 3 now. So basically the thing to be aware of. If you have a really old client wanting to talk to Postgres and you're upgrading to 14, maybe you want to update that client as well because they're going to be getting rid of this version 2 protocol. But if you want to learn more, definitely check out this blog post.
The next piece of content- "Implementing Incremental View Maintenance for PostgreSQL (Part 2)". This is from yugonagata-pgsql.blogspot.com. This is a case where they want to have a dynamically updated Materialized View so you don't have to refresh it. It refreshes automatically and they're calling it Incremental View Maintenance. So basically when one of the underlying tables gets updated, the materialized view is automatically updated as well. Now, they had some relatively good performance with I think it was a single table with basically an OLTP workload. This one, they're looking at multiple tables. I think they're joining six tables here and seeing what the performance is like.
Unfortunately, as they add more connections, performance without the materialized view is pretty good, but it tanks once they start using Incremental View Maintenance, because it has to take locks to apply the updates to the materialized view, particularly when the view joins multiple tables. That's not great, but it's not unexpected; you can't really bend the laws of physics and get something for nothing. This is a feature in progress, and they're going to continue working on it to see what improvements they can make with this potential new feature of Incremental View Maintenance. So if you want to learn more, definitely check out this blog post.
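For context, the syntax proposed by the IVM patch set looks roughly like the sketch below; it is not part of core Postgres, and the table and view names here are made up:

```sql
-- Proposed (not in core) syntax from the Incremental View Maintenance patch set
CREATE INCREMENTAL MATERIALIZED VIEW order_totals AS
SELECT customer_id, sum(amount) AS total
FROM orders
GROUP BY customer_id;

-- After this, changes to orders are meant to update order_totals automatically,
-- without a manual REFRESH MATERIALIZED VIEW.
```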
The next piece of content- "Getting started with pg_bulkload". This is from highgo.ca and they're talking about bulk-loading data using the pg_bulkload tool. Now, typically, I just use PostgreSQL's COPY. I haven't needed to get every single bit of performance out of it. But if you're looking for that, maybe you want to consider a tool like this, because this post does cover how to install it and how to get it working on importing data.
The next piece of content- "Disaster Recovery Strategies for PostgreSQL Deployment on Kubernetes". This is from b-peng.blogspot.com and they're talking about using Kubernetes and deploying Postgres clusters, using it to achieve disaster recovery and high availability. For that purpose, they're using the Crunchy Data PostgreSQL operator and they have a diagram of the two clusters they created in different availability zones, presumably, and they go through all the commands that they ran to set it up and how it works. So if you're interested in doing this, definitely check out this blog post.
The next piece of content- "PostGIS at 20, The Beginning". This is from blog.cleverelephant.ca. So Postgres is 20 years old as of May 31, 2021, and this blog post goes into the history and how it evolved and developed. So if you're interested in that historical context, definitely check out this blog post.
The next piece of content- "pgSCV- metrics exporter for PostgreSQL (not only)". This is from lesovsky.medium.com and this is a new metrics exporter for Prometheus. So if you use Prometheus for monitoring Postgres, maybe you want to check out this exporter to use instead of the Postgres exporter. Because he also mentions that this does support a PgBouncer as well as just Postgres. So if you want to learn more about this tool, definitely check out this blog post.
The last piece of content, the PostgreSQL person of the week is Nikolay Samokhvalov. So if you're interested in learning more about Nikolay and his contributions to Postgres, definitely check out this blog post.