Ballad of Bloat, Auditing Options, Configuration Options, Need Vacuum? | Scaling Postgres 99
In this episode of Scaling Postgres, we discuss the "Ballad of Bloat", options for auditing, a configuration help resource and determining when a vacuum is needed.
Content Discussed
- [YouTube] Nothing Compares To VACUUM/The Ballad of Bloat
- Row change auditing options for PostgreSQL
- Postgresqlco.nf: Configuration for Humans
- POSTGRESQLCO.NF
- The Most-Neglected Postgres Feature
- Which tables should be auto vacuumed or auto analyzed?
- Don’t do these things in PostgreSQL
- pg_ctl Tips and Tricks
- Debian integration of Patroni and vip-manager
- Scheduled PostgreSQL Backups and Retention Policies with Kubernetes
- Migrating from Oracle to PostgreSQL: Tips and Tricks
- Barman Cloud – Part 1: WAL Archive
- pg_timetable: Advanced PostgreSQL scheduling
- Understanding Prepared Transactions and Handling the Orphans
- Developing PostgreSQL for Windows, Part 1
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. We have a lot of posts and a lot of content to get through, so we'll go ahead and get started. The first piece of content is just something fun and interesting. It's a song that someone put together and posted on YouTube: "Nothing Compares To VACUUM/The Ballad of Bloat". This is from the Warsaw PostgreSQL Users Group. It's a song, basically, using the melody from "Nothing Compares 2 U" by Sinead O'Connor. So quite interesting. I definitely encourage you to check it out.
The next piece of content is "ROW CHANGE AUDITING OPTIONS FOR POSTGRESQL". This is from cybertec-postgresql.com. So if you want to audit row changes, INSERTs, UPDATEs, DELETEs, and TRUNCATEs, they go through a number of different options you can use. The first way is to change log_statement in the postgresql.conf file. You can set it to none (don't log anything), ddl (log just the statements where you're creating or altering objects), mod (log data-modifying statements as well), or all, but this approach has a number of downsides, and they go through the different pros and cons here. The pros are that it's super easy to set up, it's one configuration variable to tune, and it starts giving you some auditing. The cons are that it catches only the top-level DML statement issued by the user, so stored procedures and things of that nature can bypass it. There's no row context for multi-row updates, no table or schema filtering, and the information is purely text-based.
Queries from failed transactions are also included. The next option they mention is the pgaudit extension. The pros of this are that you get quite granular logging and auditing options, and internally, dynamically generated SQL can also be logged. The cons are that it is a third-party extension, it can have a heavy disk IO footprint, there's no row context for multi-row updates, the information is purely text-based, and queries from failed transactions are also included. The next option is to build your own custom audit tables and triggers for those tables to populate. This gives you the most flexibility in what you want to audit, but it's probably the most work, and they go through the different pros and cons for that. Now, the next option is interesting, and I've used it myself: using one generic audit trigger and table for all audited tables. So you have just a generic log table.
You create indexes on it and you're basically using a JSONB field to store the row data and the changes that were made. He creates the different indexes necessary to do that. Then the trigger function defines what to store and how to store the rows based on the changes. And then you put this trigger after INSERT, UPDATE, or DELETE on each table that you want to be able to audit; a rough sketch of the idea follows below. Now, the pros of this are that there's less code to manage, and automatic attachment of audit trail triggers can easily be configured for new tables. The cons are that it's a bit more resource-intensive to do it this way, and some exotic indexing (GIN) may be needed, though I don't necessarily think that's too much of a downside. SQL searching may also become a bit more complex; working with JSONB fields takes a little more work, but it's not so bad. So I personally kind of like this approach.
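To make the idea concrete, here's a minimal sketch of a generic JSONB audit table and trigger, assuming PostgreSQL 11 or later; the table, trigger, and function names are my own placeholders rather than the ones used in the article.

```sql
-- One generic audit table for every audited table; row data goes into JSONB.
CREATE TABLE generic_audit_log (
    event_time  timestamptz NOT NULL DEFAULT now(),
    table_name  text        NOT NULL,
    operation   text        NOT NULL,  -- 'INSERT', 'UPDATE', or 'DELETE'
    row_data    jsonb
);

-- A GIN index lets you search inside the captured row data.
CREATE INDEX ON generic_audit_log USING gin (row_data);

-- A single trigger function shared by all audited tables.
CREATE OR REPLACE FUNCTION generic_audit_trigger() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO generic_audit_log (table_name, operation, row_data)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
    ELSE
        INSERT INTO generic_audit_log (table_name, operation, row_data)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

-- Attach it to each table you want audited (orders is a placeholder table).
CREATE TRIGGER audit_orders
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE FUNCTION generic_audit_trigger();
```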
The next approach is logical replication, and there are two different ways to do it. One way is PostgreSQL-native: you set up a separate database server, create a publication on the primary, and then create a subscription on the server you're replicating to for the audit changes. The other way is logical replication with a custom application. So you develop an application that supports logical decoding, and you essentially read the WAL stream and log whatever you want from it. They also include a good summary table that compares each of the options. So, definitely a great blog post to consider if you want to implement auditing or perhaps change how you're auditing your current application.
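As a rough illustration of the native approach, the publication and subscription commands look like this; the table names and connection details are placeholders, and the primary also needs wal_level set to logical.

```sql
-- On the primary: publish changes from the tables you want to audit.
CREATE PUBLICATION audit_pub FOR TABLE orders, customers;

-- On the separate audit server (which needs matching table definitions):
CREATE SUBSCRIPTION audit_sub
    CONNECTION 'host=primary.example.com dbname=app user=replicator password=secret'
    PUBLICATION audit_pub;
```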
The next post is "POSTGRESQLCO.NF: CONFIGURATION FOR HUMANS". This is from ongres.com. This blog post is talking about PostgreSQLCO.NF, which they describe as your postgresql.conf documentation and ultimate recommendations source. Basically, every configuration parameter has documentation that mirrors some of what's in the official PostgreSQL documentation, but it also gives recommendations. So it's essentially another resource, and the link they're referring to is this one here. It has all the different configuration options; you can click on one or just do a search, and it comes in handy for finding the best configuration options to change out of the 300 or so that are available. For example, if you click autovacuum, it gives a discussion of what it is and some recommendations. They even offer the ability to add comments, and it gives general parameter info as well as a per-version view. So if you want additional resources to help you tune your PostgreSQL configuration, definitely check out this one.
The next post is "The Most Neglected Postgres Feature". This is from richyen.com, and basically he's talking about log_line_prefix. Many times he's seen an empty log_line_prefix on older installations of Postgres, or just a simple timestamp or something similarly undescriptive. Now, more recent versions ship with a more comprehensive default, but if you've upgraded and haven't updated it, this is definitely something to consider doing, because you can add the escapes he lists as his favorites: the application name, the username, the database name (particularly if you have multiple databases on one database cluster), the remote host IP, the process ID, the per-session log line number, as well as transaction IDs.
So you may want to review this post if you want to add a little more descriptive information to your logs. Then at the end, he says that's not all: he covers some other logging settings that are fairly important and that he feels you should set. One of my favorites is log_min_duration_statement to help find slow queries, along with log_statement, log_connections and log_disconnections, log_rotation_age and log_rotation_size, log_autovacuum_min_duration, log_checkpoints, log_temp_files, and auto_explain. A sketch of some of these settings is below. So if you want to update how your logging is set up for PostgreSQL, definitely a blog post to check out.
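Here is a rough sketch of what such a configuration might look like; the exact values are my own illustration rather than the article's recommendations, and you could equally edit postgresql.conf instead of using ALTER SYSTEM.

```sql
-- Prefix with timestamp, PID, per-session line number, user, database,
-- application name, client host, and transaction ID.
ALTER SYSTEM SET log_line_prefix = '%m [%p] %l user=%u,db=%d,app=%a,client=%h,xid=%x ';

-- A few of the other settings mentioned above.
ALTER SYSTEM SET log_min_duration_statement = '1s';  -- log queries slower than one second
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_autovacuum_min_duration = 0;    -- log every autovacuum run
ALTER SYSTEM SET log_temp_files = 0;                 -- log every temporary file created

SELECT pg_reload_conf();  -- these settings take effect on reload
```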
The next post is "Which tables should be auto vacuumed or auto analyzed?". This is from depesz.com. Basically, he had an issue where autovacuum wasn't keeping up with the changes being made to the database, so he needed to run a manual vacuum to catch everything up. But he didn't have a great way to check which tables were waiting for autovacuum. So he developed a query that does this: it consults the different autovacuum settings to work out which tables are next in line for an autovacuum or autoanalyze. He also wraps it in a view, so you can see what should be vacuumed or analyzed next. So if you think you could benefit from this, definitely check out this blog post.
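This isn't his exact query, but a minimal sketch of the idea looks something like this: compare each table's dead tuples against the global autovacuum threshold (ignoring any per-table storage-parameter overrides).

```sql
-- Tables whose dead tuples exceed the default autovacuum vacuum threshold.
SELECT s.schemaname,
       s.relname,
       s.n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::numeric * c.reltuples
         AS vacuum_threshold
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.n_dead_tup >
      current_setting('autovacuum_vacuum_threshold')::bigint
        + current_setting('autovacuum_vacuum_scale_factor')::numeric * c.reltuples
ORDER BY s.n_dead_tup DESC;
```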
The next blog post, also from depesz.com, is "Don't do these things in PostgreSQL". Now, this is based on the Wiki page that was set up on what not to do, and we talked about it when it was posted, a year before last. I remember saying some of those items were kind of opinionated, in that I might not do the same thing. But what's interesting here is that he took all of these don't-do items and set up a SQL script that looks in the database to see how things are configured and whether any of the rules are being violated. So if you're interested in that, check out pgWikiDon't to get access to the SQL script to run on your own database.
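I haven't reproduced his script here, but as an example of the kind of check it performs, the wiki's advice against char(n), money, and timestamp without time zone could be checked with a query along these lines.

```sql
-- Columns using types the "Don't Do This" wiki page recommends against.
SELECT table_schema, table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
  AND data_type IN ('character', 'money', 'timestamp without time zone');
```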
The next post is "pg_ctl Tips and Tricks". This is from pgdash.io, and they're talking about pg_ctl, the utility for initializing and controlling your database cluster. They cover where this command is typically located on Debian and Ubuntu as well as Red Hat and CentOS, and they go over the process of creating a database cluster, how to start it, and the different ports and options you can set when starting it. So if you're interested in learning more about pg_ctl, definitely check out this blog post.
The next post is "Debian integration of Patroni and vip-manager". This is from credativ.com. They've integrated vip-manager into Debian; it's a virtual IP manager that helps you manage a cluster of PostgreSQL servers, one acting as the primary with one or more replicas, and it works together with Patroni. The virtual IP manager essentially directs write traffic to the active primary. So if you're interested in Patroni and vip-manager for managing a set of PostgreSQL servers on Debian, or perhaps Ubuntu, definitely check out this blog post.
The next post is "Scheduled PostgreSQL Backups and Retention Policies with Kubernetes". This is from crunchydata.com. They develop the PostgreSQL Operator, which helps manage Postgres instances within Kubernetes, and they've added the ability to use pgBackRest for backups and to schedule backups and retention policies, all within Kubernetes. So if you're interested in using that, definitely check out this blog post.
The next blog post, also from crunchydata.com, is "Migrating from Oracle to PostgreSQL: Tips and Tricks". They talk about solutions to issues you may run into when migrating from Oracle to Postgres. They mention Orafce, an extension that provides Postgres versions of functions you're familiar with from Oracle, and they show how to install it and create the extension. They talk about how to handle things you would do in Oracle, like disabling and enabling constraints or disabling NOT NULL constraints, and how to handle that in Postgres, as well as how the GRANT command is similar but a little different. They also cover how best to drop objects in PostgreSQL versus Oracle, how to check for NOT NULL, and they discuss ROWID, CTID, and identity columns. So if you are migrating from Oracle to PostgreSQL, definitely a blog post to check out.
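As a quick illustration of the Orafce idea (assuming the extension package is installed on the server; depending on the Orafce version, some functions live in an oracle schema you may need on your search_path):

```sql
CREATE EXTENSION IF NOT EXISTS orafce;

-- Oracle-style functions now work in Postgres:
SELECT nvl(NULL, 'fallback');             -- behaves like COALESCE
SELECT add_months(date '2020-01-31', 1);  -- Oracle-style date arithmetic
```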
The next post is "Barman Cloud - Part 1: WAL Archive". This is from 2ndquadrant.com. As of Barman 2.10, there's a way to archive your WAL files directly to a cloud location such as Amazon S3, and that's what this post discusses: barman-cloud-wal-archive. Essentially, in the archive_command you configure in Postgres, you call a Barman command that sends the WAL files to S3. Part two will cover barman-cloud-backup. So if you use Barman and want to start archiving directly to a cloud location such as Amazon S3, definitely check out this blog post.
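A minimal sketch of that idea, with a placeholder bucket and server name, might look like this; note that archive_mode requires a restart, while archive_command only needs a reload.

```sql
ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command =
    'barman-cloud-wal-archive s3://my-backup-bucket/postgres pg-primary %p';
SELECT pg_reload_conf();
```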
The next post is "PG_TIMETABLE: ADVANCED POSTGRESQL SCHEDULING". This is from cybertec-postgresql.com. Now, it's interesting that they call it pg_timetable, because I don't actually see that this is an extension for Postgres. It's a separate application written in Go, and it's typically delivered as a container, it looks like. I'm used to seeing "pg_something" refer to a Postgres extension, but this is actually a separate application that does the scheduling of tasks. It uses PostgreSQL essentially as its data store, but by being its own application, it can do all sorts of things outside of Postgres, like send emails, download files, et cetera. It looks like all the configuration is done in SQL, and they have a cron-style syntax you can use as well. It's definitely an interesting project. I'm going to stick with cron myself, I think, but if you're looking for another scheduling option, maybe check out this blog post.
The next post is "Understanding Prepared Transactions and Handling the Orphans". This is from highgo.ca. Prepared transactions are part of a two-phase commit process, so you don't normally use them in typical application development, and I haven't really needed to consult this myself. But if you are considering distributed transaction management with things like two-phase commit, you may want to check out this blog post.
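For a minimal illustration of what a prepared transaction looks like (the table and transaction identifier here are placeholders, and max_prepared_transactions must be set above zero for this to work):

```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
PREPARE TRANSACTION 'transfer_42';   -- the transaction survives crashes until resolved

-- Later, possibly from another session or a transaction manager:
COMMIT PREPARED 'transfer_42';
-- or: ROLLBACK PREPARED 'transfer_42';

-- Orphaned prepared transactions show up here (and hold back vacuum):
SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
```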
The last post is "Developing PostgreSQL for Windows, Part 1". This is from 2ndquadrant.com. So if you are doing development of PostgreSQL itself and want to do it on Windows, here's a blog post that gives you some tips and tricks to help with that.