
Insert-Only Vacuum, Settings, Sequence Replication, Improvements | Scaling Postgres 111


In this episode of Scaling Postgres, we discuss insert-only vacuums, the Postgres settings hierarchy, sequence replication and desired improvements.


Transcript

All right. I hope you, your friends, family, and coworkers are doing well. For this week, our first piece of content is "POSTGRESQL V13 NEW FEATURE: TUNING AUTOVACUUM ON INSERT-ONLY TABLES". This is from cybertec-postgresql.com. Now, you may be thinking: why would you need to vacuum insert-only tables? Generally, vacuum is used to clean up dead rows that are the result of rows being deleted or updated. Postgres keeps the old row versions around for a time so they remain visible to any concurrent transactions; once no transaction in the system can see them anymore, they are vacuumed up. So an update results in a new row version being added while the old one is retained and vacuumed later. Why, then, would you need to vacuum insert-only tables? The two reasons, shown in the little graph here, are first that the transaction ID age has a limit, and vacuum resets it by freezing rows so that transaction IDs become available again.

The second reason is to update the visibility map, which is important for insert-only tables. That's essentially what this blog post runs through: this new feature being added in 13 that could be advantageous for insert-only, or essentially append-only, tables. They mention it adds a new n_ins_since_vacuum column to pg_stat_all_tables, tracking the number of inserts since the last vacuum. You can adjust an autovacuum_vacuum_insert_threshold and an autovacuum_vacuum_insert_scale_factor to customize when this vacuum triggers, and apparently they even have settings for TOAST tables as well. Next, they talk about use case one, which I mentioned: handling transaction ID wraparound. Roughly two billion transaction IDs can be in use at once, and before you hit that limit you need to freeze old transaction IDs so that more become available. What normally happens is that when the transaction ID age reaches 200 million, a special anti-wraparound vacuum is triggered to handle it. The problem you can run into with, say, insert-only tables is that this vacuum doesn't get done in time, which can mean having to take the table offline and do a manual vacuum.
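
As a rough sketch of how these settings might be used (the table name app_events is hypothetical; the parameter and column names are the ones Postgres 13 introduces):

```sql
-- Per-table tuning of the new insert-driven autovacuum (Postgres 13+).
-- "app_events" is a hypothetical append-only table.
ALTER TABLE app_events SET (
    autovacuum_vacuum_insert_threshold = 10000,   -- default 1000
    autovacuum_vacuum_insert_scale_factor = 0.05  -- default 0.2
);

-- Check how many rows have been inserted since the last vacuum.
SELECT relname, n_ins_since_vacuum
FROM pg_stat_all_tables
WHERE relname = 'app_events';

-- Monitor transaction ID age to see how close tables are getting to the
-- anti-wraparound threshold (autovacuum_freeze_max_age, default 200 million).
SELECT relname, age(relfrozenxid)
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;
```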

Also, a mandatory anti-wraparound autovacuum will not give up when it blocks a concurrent transaction, so it can block all other access to the table and processing can come to a standstill. They give some configuration recommendations for using this feature in version 13. The second use case they mention is index-only scans. This comes down to the visibility map, which an index-only scan consults to determine whether it needs to check the heap or not. So if you want more index-only scans that don't have to visit the heap, you want to make sure the visibility map stays updated, and this new feature will let that happen for append-only tables. Now, they do say that if a table receives enough UPDATEs or DELETEs that it's not strictly append-only, you can set autovacuum_vacuum_scale_factor relatively low, but with an insert-only table that's harder to do. The last case they mention is hint bits on insert-only tables.

So basically, the first query that reads a newly created row has to consult the commit log to figure out whether the transaction that created the row committed or not. If a lot of rows were recently inserted into a table, that can cause a performance hit for the first reader. Subsequent readers won't have to check the commit log because they can just look at the hint bits. Therefore, it is considered good practice to vacuum a table after you INSERT or COPY a lot of rows into it. With this new feature, they say, PostgreSQL automatically vacuums insert-only tables after large inserts so you don't necessarily have to worry about it. So definitely an interesting new feature, and if you want to learn more about it, definitely check out this blog post.
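
A minimal sketch of that bulk-load-then-vacuum practice (the table and file path here are hypothetical):

```sql
-- Bulk-load a hypothetical table, then vacuum it so hint bits get set and
-- the visibility map is updated before the first readers hit the table.
COPY measurements FROM '/tmp/measurements.csv' WITH (FORMAT csv);
VACUUM measurements;
```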

The next post- "Where do my Postgres settings come from?". This is from dbanotebook.org and they're talking about all the different places where you can make settings changes in Postgres. And they have this convenient graph here and it is in a hierarchy so the top takes precedence over those lower. So at the bottom, you have the default settings. Up above that the start configuration for the postmaster. Up above that is the postgresql.conf file. Above that is the postgresql.auto.conf file that you can change with an ALTER SYSTEM command. Up above that, each database has configurations that you can set through an ALTER DATABASE. The roles have their own settings through an ALTER ROLE, the sessions have their own settings where you just SET and then make a particular setting change. Then lastly, transactions you can SET LOCAL within a particular transaction. So this is a very convenient graph to kind of understand where different settings can be made and their hierarchy. You can also consult the pg_settings table which tells you kind of the source of the settings that exist. Then finally he presents this decision matrix in order to try to determine where a particular setting is coming from. So if you're interested in that, definitely check out this blog post.

The next piece of content is "Can Sequence Relation be Logically Replicated?". This is from highgo.ca and they're talking about logical replication where sequences don't get replicated over. So the data tables get replicated over but not the sequences. Now first, they address what a sequence is. So basically, when you create a serial column or you have a primary key that has a sequence associated with it, it's a counter that determines what the next new value is for that primary key. That is an integer or a bigint or something of that nature. And these sequence values are not replicated. And it says why they don't do it actually because of how they save things to the WAL. So the WAL logs sequences every 32 increments. So for example, if you're incrementing by one and you are at 500, what it's going to log to the WAL is 532, and then it's going to wait until 32 more increments pass before it logs again. 

Logging ahead like this is safe because sequences are not guaranteed to be gap-free anyway. For example, if an INSERT into a table fails with an error, the next successful INSERT will show a gap, because the failed insert still consumed a value. And the reason for logging ahead is performance: you don't have to write to the WAL on every increment. So they say that, in theory, sequences could be logically replicated, and since gaps are okay, replicating a logged-ahead sequence value should be fine. They actually went through the relevant code and investigated how to do it. I'm not going to go into the details here, of course, but if you want to check it out, it shows how this could potentially be added, and apparently they may be looking to contribute this to Postgres. So if this is of interest, definitely check out this blog post.
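
A quick sketch showing how a failed INSERT still consumes a sequence value and leaves a gap (the table and data are hypothetical):

```sql
-- A serial column is backed by a sequence.
CREATE TABLE items (id serial PRIMARY KEY, name text NOT NULL);

INSERT INTO items (name) VALUES ('first');   -- id = 1
INSERT INTO items (name) VALUES (NULL);      -- fails: NOT NULL violation,
                                             -- but nextval() was already called
INSERT INTO items (name) VALUES ('second');  -- id = 3, leaving a gap at 2

SELECT id, name FROM items;  -- rows 1 and 3; id 2 was consumed by the failure
```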

The next piece of content is "7 THINGS THAT COULD BE IMPROVED IN POSTGRESQL". This is from cybertec-postgresql.com. Now this is inspired by the article that we've covered in a previous episode of Scaling Postgres called "10 Things I Hate About PostgreSQL". And it's basically mentioning proposed improvements. So this mentions seven things that could be improved. The first one mentioned is rigid query planning. Now, he's not necessarily mentioning the absence of query hints that were mentioned in the previous blog post, but this is talking about having a more adaptive planner, because basically the planner just looks at the query, the table structure, and the statistics that exist, but doesn't make any determination based upon previous query runs in terms of deciding how to do that query. So he's saying having that type of knowledge could be beneficial to help adapt the plans the query planner is coming up with. 

So that definitely seems interesting. Number two is some level of automatic tuning. Postgres mostly doesn't look at the operating system at all, because it runs on so many different ones and tries to remain operating-system agnostic. But there could be benefits to setting some things automatically based on the available memory and disk space, or on what gets written to the logs. That sounds interesting on the surface; I'd be a little wary of it, and would of course want to be able to turn any automatic tuning off, but it's definitely an interesting proposal. Number three is statistics being lost during a pg_upgrade. When you do a pg_upgrade, you basically have to run ANALYZE again to repopulate the statistics, and since the data doesn't really change across an upgrade, it would be advantageous if they could somehow be retained.
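
In the meantime, a minimal sketch of regenerating them by hand (run in each database; pg_upgrade also generates a script that does this in stages):

```sql
-- Rebuild planner statistics for every table in the current database
-- after a pg_upgrade, since the old statistics are not carried over.
ANALYZE;
```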

I would say retaining them would be advantageous, but even for large databases it doesn't take that long to regenerate statistics. Number four is autovacuum and transaction ID wraparound. This is a big, huge issue, and I definitely agree that making it easier to deal with would be a great improvement. Projects like zheap could help here: instead of an update inserting a new row version that has to be vacuumed up later, old versions are written to a separate undo storage area, which handles rollbacks and concurrent visibility of a particular row's state. I'm definitely looking forward to seeing what that looks like once it's complete. Number five is the disk footprint and bloat, so basically looking for better ways to optimize how data is stored on disk. They say zheap could contribute here too, because you'd have less bloat.

They also mention ZedStore, which could be beneficial for ultra-large data warehouses, and there's cstore_fdw as another option to look at. This could be advantageous, though you can also place the data on a compressed filesystem if you so choose, which is another way to get some of this benefit. Number six is the out-of-the-box vulnerability to brute-force password attacks. They mention there's no built-in protection against this, but you can add the auth_delay contrib module to get it. So just be aware of that and add it to avoid this potential problem. And seven: no query cache. Now, he does mention this is incredibly hard to do, because anything dealing with caching is hard, but it would generally be beneficial if, knowing that no data has changed, Postgres could return the same query results for the same input. That could definitely be an advantage. So if you're interested in investigating these a bit more, check out this blog post.
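
Circling back to item six, a minimal sketch of what enabling auth_delay might look like in postgresql.conf (the delay value is just an example, and changing shared_preload_libraries requires a server restart):

```
# postgresql.conf -- slow down repeated authentication failures
shared_preload_libraries = 'auth_delay'
auth_delay.milliseconds = 500   # pause this long after each failed login attempt
```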

The next post is "A Beginner's Guide to Basic Indexing". This is from bannister.me. Now this is a very basic guide and it's basically how this, primarily a Laravel developer, increased the performance of his application by 100 times. Now this is not surprising if you don't have an index. Adding an index gives you these kinds of huge performance wins. So if you're a developer and want to learn more about databases in terms of improving your application's performance, then this is a good blog post to review about how he did it. But essentially he identified indexes that were needed and added them and even looked into whether to make it an index-only scan or not. So again, it is on the basic side, but if you're a developer and want to learn more about indexing and the performances they can result in, definitely check out this blog post. And I should mention it's not PostgreSQL specific.

The next post is "Things I Wished More Developers Knew About Databases". This is from a medium.com under Software Engineering. Now again, this is not PostgreSQL specific, but more generally talking about different types of database concerns like ACID, Network uptime, and AUTOINCREMENTing. So again, this is more focused on developers than any DBA and considerations to take into account as you're developing applications. So if you're a developer, you've probably gained some insight by looking over this post.

The next post is "Stored Procedures also have rights". This is from highgo.ca, and it's talking about the rights that you can assign to store procedures. So primarily, you can assign INVOKER rights or DEFINER rights. So that procedure can be run based on who defined it, or it can be based on who is running that procedure. And when you're creating them, you need to kind of take into account how you want it to operate. So this goes through an example about how to set it up. Two different users, two different procedures that have each of these procedures. So you can see how it works and how you get an error if one has an INVOKER right, but doesn't have the permissions to the particular table to query. So if you're interested in learning more, definitely check out this blog post.

The next post is "Automated rapid switchover with a BDR database cluster in Kubernetes". This is from 2ndquadrant.com. So this is particularly interesting. Using Kubernetes, you have a cluster of PostgreSQL servers, in conjunction with BDR, which is their bi-directional replication product from 2ndQuadrant. So basically you have, say, three database servers. They are all masters. So it's multi-master and it shows how one server can die in that cluster. It can almost immediately switch to one of the active masters and then proceed to rebuild the missing one. It does this in conjunction with their CloudNative BDR operator that helps operate the Kubernetes cluster. They have a little presentation here titled "Hard to kill: Automated rapid switchover with BDR cluster and Kubernetes". So I find a really interesting use case with Kubernetes. Now, of course, the thing you have to take into account with a multi-master setup is your application development practices need to be different. So there are considerations you need to take with that. But definitely an interesting blog post.

The next blog post is "Easy PostgreSQL 12 and pgAdmin4 Setup with Docker". This is from crunchydata.com. So this just runs through setting up Postgres 12 and pgAdmin4 on a Docker container. So potentially maybe you want to use that with Kubernetes as well.

The second post related to Kubernetes, also from crunchydata.com, is "Set Up and Run PostgreSQL Operator Client (pgo) on Windows". It shows how to do this on Windows, and also how to set things up after you have the cluster up and running. So if you're interested in doing this on Windows, definitely check out this blog post.

The next post is "Oracle to PostgreSQL: START WITH/CONNECT BY". This is from 2ndquadrant.com, and they're talking about how Oracle has the START WITH or CONNECT BY to give a particular result. And they're showing you how you can do that. The same type of query in PostgreSQL using a WITH RECURSIVE CTE. So if you're interested in migrating from Oracle to PostgreSQL, it is definitely a blog post to check out.

In the next post, the PostgreSQL person of the week is Mark Wong. So if you're interested in learning about Mark and his contributions to PostgreSQL, definitely check out this blog post.

The last blog post is "A Chat with Luca Ferrari", and this is from superhighway.dev, and he is another person in the PostgreSQL community. So if you're interested in learning more about Luca, definitely check out this blog post.
