
Partition Migration, Like Performance, Best Fillfactor, In-Memory Tables | Scaling Postgres 124


In this episode of Scaling Postgres, we discuss how best to migrate to using partitions, like & ilike performance, determining the best fillfactor and the work towards in-memory tables.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Partitioning a large table without a long-running lock". This is from 2ndquadrant.com. He's talking about the situation where you have a large table in a database that just keeps growing and growing, and you finally realize you really should have partitioned it, or that you need to partition it now. Typically, you would have to do this offline, or somehow make the table unavailable while you do it. But this blog post describes a technique you can use to partition the table while keeping it online. The first thing they do is create a table they just call orig_table, put some indexes on it, and insert, it says, about 10 million rows. Then they go ahead and create the partition structure, starting with a table called part_table as the parent table.

They partition it by range and create four partitions as the child tables. Now, the way they keep the partitioned table and the original table in sync is with a trigger, and this is the trigger function they create to do that. Whenever there's an INSERT, it inserts the new row into the partitioned table. When there's a DELETE, it deletes the row from both the partitioned table and the original table. And when there's an UPDATE, the ELSE branch, it first deletes that row from the original table. If the row was found there when it was deleted, that means it hadn't been migrated yet, so the trigger inserts the row, with its new values, into the partitioned table. However, if the row already exists in the partitioned table, it just updates it there. So by using this trigger, everything is kept in sync.
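To make those moving parts easier to follow, here's a rough sketch of that setup and the routing trigger function. The schema, names, and partition boundaries are hypothetical stand-ins, not the post's actual code; old_orig_table is the name the original table gets renamed to in the next step.

```sql
-- Hypothetical stand-in schema for the original table
CREATE TABLE orig_table (
    id         bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    text
);
CREATE INDEX ON orig_table (created_at);

-- New partitioned parent plus four range-partitioned children
CREATE TABLE part_table (
    id         bigint NOT NULL,
    created_at timestamptz NOT NULL,
    payload    text
) PARTITION BY RANGE (id);

CREATE TABLE part_table_1 PARTITION OF part_table FOR VALUES FROM (0)       TO (2500000);
CREATE TABLE part_table_2 PARTITION OF part_table FOR VALUES FROM (2500000) TO (5000000);
CREATE TABLE part_table_3 PARTITION OF part_table FOR VALUES FROM (5000000) TO (7500000);
CREATE TABLE part_table_4 PARTITION OF part_table FOR VALUES FROM (7500000) TO (MAXVALUE);

-- Trigger function implementing the routing logic described above.
-- old_orig_table is the name the original table will be renamed to later.
CREATE OR REPLACE FUNCTION route_orig_table_dml() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO part_table VALUES (NEW.*);
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        DELETE FROM part_table     WHERE id = OLD.id;
        DELETE FROM old_orig_table WHERE id = OLD.id;
        RETURN OLD;
    ELSE  -- UPDATE
        DELETE FROM old_orig_table WHERE id = OLD.id;
        IF FOUND THEN
            -- Row still lived in the old table: move it over with its new values
            INSERT INTO part_table VALUES (NEW.*);
        ELSE
            -- Row was already migrated: update it in place in the partitioned table
            UPDATE part_table
            SET    created_at = NEW.created_at, payload = NEW.payload
            WHERE  id = NEW.id;
        END IF;
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;
```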

To implement this, it all happens within a single transaction. The first thing they do is rename the original table to an old table name. It's just a rename, so it should execute quickly. Then they alter that old table to disable autovacuum. Next, they create a view with the same name the original table had, which selects all the data from the old table and UNIONs it with the data from the new partitioned table. So this view gives you the full state of the data as it exists across the partitioned table and the original table. Then, with the function mentioned above, they create a FOR EACH ROW trigger from it, so that all writes against that name get routed through the function.
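A minimal sketch of that cut-over transaction, assuming the trigger is attached to the view as an INSTEAD OF trigger (the usual way to route DML aimed at a view) and that the view uses UNION ALL; the post's exact statements may differ slightly.

```sql
BEGIN;

-- Rename the original table out of the way (fast; just a catalog change)
ALTER TABLE orig_table RENAME TO old_orig_table;

-- Disable autovacuum on the old table while rows drain out of it
ALTER TABLE old_orig_table SET (autovacuum_enabled = false, toast.autovacuum_enabled = false);

-- A view with the original name shows the combined contents of both tables
CREATE VIEW orig_table AS
    SELECT * FROM old_orig_table
    UNION ALL
    SELECT * FROM part_table;

-- Route inserts, updates, and deletes on that name through the trigger function
CREATE TRIGGER orig_table_route
    INSTEAD OF INSERT OR UPDATE OR DELETE ON orig_table
    FOR EACH ROW EXECUTE FUNCTION route_orig_table_dml();

COMMIT;
```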

Now, once this is done, things are kept in sync, and data will slowly move from that old table to the new partitioning structure. Then what you do is run an update against that table name, essentially the view, for each ID that still exists in the old table, and that slowly migrates the data from the old non-partitioned table into the new partitioning scheme. They have a Perl script that does this, migrating something like 10,000 rows at a time. Once the old table is empty, there are some cleanup commands: dropping the view and the function, adjusting sequence names, and renaming the partitioned table back to the original table name. So this is a really interesting technique, and I think I might actually use it in the future. If you want to learn more about this technique, definitely check out this blog post.
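The batch move itself can be as simple as a no-op update through the view, repeated by a driver script (the post uses Perl); here's a hypothetical illustration of one batch plus the cleanup.

```sql
-- One batch: a no-op update through the view fires the routing trigger,
-- which deletes each row from old_orig_table and inserts it into part_table.
-- An external script repeats this until no rows remain in old_orig_table.
UPDATE orig_table
SET    id = id
WHERE  id IN (SELECT id FROM old_orig_table LIMIT 10000);

-- Cleanup once old_orig_table is empty
DROP VIEW orig_table;                        -- also drops the INSTEAD OF trigger
DROP FUNCTION route_orig_table_dml();
DROP TABLE old_orig_table;
ALTER TABLE part_table RENAME TO orig_table;
-- plus any sequence renaming/ownership fixes, as described in the post
```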

The next piece of content is "POSTGRESQL: MORE PERFORMANCE FOR LIKE AND ILIKE STATEMENTS". This is from cybertec-postgresql.com and it's talking about LIKE performance. So they create a table that is basically filled with a hash and they did it for about 50 million rows. Then they did a query looking for a string that looked like this. They knew that there was only one row that had this string and they were looking it for anywhere within this column. Searching for that one row from 50 million rows took a little over 4 seconds, which is quite a long time. Now, doing an EXPLAIN plan shows it's doing a parallel sequential scan, so it has to scan through all 50 million rows to find this. So then the question is, how can you improve performance? 

So what he did is use the pg_trgm extension. It breaks that column up into trigrams, and he shows how a value gets broken up into trigrams when you run the function against it. The next step is to index it. First, he used a GiST index, but that was huge, 2.5 times the size of the actual table, and the performance was really bad, over a minute and 45 seconds. So it took longer searching through this GiST index than just scanning the table; clearly you would not want to use a GiST index here. Most text searches that I've seen use GIN indexes, so the next thing to do is use a GIN index. With the GIN index, searching for that one row with the LIKE query finished in 75 milliseconds, so much faster than the 4.7 seconds before.
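The trigram approach boils down to enabling pg_trgm and building a GIN index with its operator class; a minimal sketch against the hypothetical table above.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- See how a value is decomposed into trigrams
SELECT show_trgm('e2e5fa7a8');

-- GIN index over trigrams; wildcard LIKE/ILIKE searches can now use it
CREATE INDEX idx_hash_name_trgm ON t_hash USING gin (name gin_trgm_ops);

EXPLAIN ANALYZE
SELECT * FROM t_hash
WHERE  name LIKE '%e2e5fa7a8%';
```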

But the thing to keep in mind is that a GIN trigram index is not that efficient when it comes to looking for an exact match on a column. So if you also need the ability to do an exact match, you would want to add a B-tree index on the column as well, because with a B-tree index, pulling out a single row returns in less than one millisecond. It's just something to keep in mind when you're doing text searching: a lot of times the GIN index can be much more efficient for things of that nature, like full-text search or even JSON data. So if you're interested in the performance improvements you can get for LIKE and ILIKE, definitely check out this blog post.
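For the exact-match case, the usual answer is a plain B-tree alongside the trigram index; again, just a sketch on the hypothetical table.

```sql
-- B-tree for equality lookups; the trigram GIN index stays for LIKE/ILIKE
CREATE INDEX idx_hash_name_btree ON t_hash (name);

-- Exact match now returns in under a millisecond via the B-tree
SELECT * FROM t_hash
WHERE  name = md5('42');
```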

The next piece of content is "WHAT IS FILLFACTOR AND HOW DOES IT AFFECT POSTGRESQL PERFORMANCE?". This is from cybertec-postgresql.com. So fillfactor is how full you're going to make a table or index and leave space on the page available for new data to be inserted. So when would that data be inserted? Basically, when an update happens. So you're reserving some of that space so updates can happen there as opposed to splitting out and creating a new page for it. Now by default, the fillfactor is 100%. So the question becomes is it advantageous to drop down the fillfactor? Now if you never have updates to a table, a fill factor for 100% is basically where you want to go unless you have randomized inserts. But if you have a lot of updates, maybe you want to drop that fillfactor. 

Now, he talks about the test setup he used and the tests he wanted to run, looking at fillfactors of 100, 90, 80, and 70. Because you're deliberately leaving part of each page empty when you decrease the fillfactor, you run into diminishing returns: those pages get cached, and if they're not as full, caching becomes less efficient and some performance is lost. He looked at both transactions per second and the mean time to run the queries, and in his tests the sweet spot, at least for this sample size, was a fillfactor of 90%.

In some cases you could maybe drop it to 80%, but basically a slightly smaller fillfactor could give you a little bit of a performance boost. I believe he said, quote, "In average a 10% boost, when decreasing the FF by 10 or 20%...". So that seems to be about the sweet spot, but again, because of the diminishing returns, you want to take that into account when adjusting this. Adjusting this value seems very specific to your use case, so you definitely want to test it out, but in the tests he did here, a relatively minor fillfactor decrease seemed to work best for getting a little bit of a performance boost. So if you want all the details about the tests and the conclusions, definitely check out this post.

The next piece of content is "Approaches to Achieve in-Memory Table Storage with PostgreSQL Pluggable API". This is from highgo.ca. Now, there is no in-memory storage available for PostgreSQL, yet this post talks about working potentially towards that because with the release of PostgreSQL 12, now allows, quote "...custom table storage Access Methods..." to be developed. So if you've heard in previous episodes of Scaling Postgres, we talked about zheap. It's a new storage infrastructure for storing data that could potentially eliminate the need for doing vacuum. So they're actually using this API to say, hey, can we store data in memory only? 

So this goes through their thinking and what kind of changes need to be made to develop a new storage method. They talk through a number of questions, in particular how to handle the buffer manager, because the buffer manager manages what's in memory and flushes it out to disk. It's used heavily by the other table storage methods, such as zheap, but if you're going straight to memory, there's not really a reason to buffer the data: why would you buffer it in memory just to then save it to memory? So really you would just want to go directly to memory, and the post thinks through how to bypass the buffer manager, or what makes the most sense. So this is an exploratory post on how they're developing it, basically a work in progress. If you're interested in finding out more about it, definitely check out this post.

The next piece of content is "Webinar: Being Committed - A Review of Transaction Control Statements 1/3 [Follow Up]". So this is from 2nquadrant.com and it's a webinar that talks about transaction control statements. They talk about transactions and the persistence of the data, how to handle application retries, and getting into a little bit about transactions and visibility. So this was, I would say, a basic presentation on transaction control statements. So if you're interested in that, go ahead and check out this webinar.

The next post is "Monitoring System Activity with the New system_stat Extension for PostgreSQL". This is from the enterprisedb.com blog and we had covered this extension in a previous episode of Scaling Postgres, posted I believe, on the personal blog of Dave Page. But this again covers the extension in detail. It exposes utilities that track CPU usage, disk usage, network activity, and things of that nature through to the SQL interface. So it basically gives you these new system tables where you can query and use these utilities to get back detailed operating system information. So if you're interested in learning more about this, how to install it, and what kind of OS statistics you can get from it, check out this blog post.

The next post, also from enterprisedb.com, is "Maintaining PostgreSQL is More Than Just a Maintenance Plan". This is a general post covering common database maintenance tasks. They run through four: vacuum analyze every week to keep your stats updated, reindex your heavily updated tables every month, plan for VACUUM FULL events when you need to shrink tables, and monitor your connections and your load. They also have a framework for addressing maintenance on an ongoing basis: first examine what state your database is currently in, what system issues you're experiencing, and whether queries are waiting on locks or showing any kind of performance degradation, and then work out how to approach and address those issues. So if you're looking for a bit of a framework for tackling ongoing maintenance of your PostgreSQL instance, you can check out this blog post.
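The commands behind those tasks are the usual ones; a short sketch with a hypothetical table name.

```sql
-- Weekly: clean up dead tuples and refresh planner statistics
VACUUM (ANALYZE) accounts;

-- Monthly, for heavily updated tables: rebuild bloated indexes
-- (CONCURRENTLY requires PostgreSQL 12+ and avoids blocking writes)
REINDEX TABLE CONCURRENTLY accounts;

-- Only when you actually need to shrink the table; takes an exclusive lock
VACUUM FULL accounts;

-- Keep an eye on connections and on queries stuck waiting for locks
SELECT pid, state, wait_event_type, wait_event, query
FROM   pg_stat_activity
WHERE  wait_event_type = 'Lock';
```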

The next piece of content is "Unicode normalization in PostgreSQL 13". This is from 2ndquadrant.com. So basically, Unicode normalization is converting multiple different forms of a character to a consistent form. Now this post goes into more detail with regard to that, but it's basically normalizing a character that essentially looks the same to be the exact same Unicode character. This option becomes available in PostgreSQL 13. So they go ahead and show you how you can determine if NFC is normalized or is NFD normalized. Basically, NFC is the default, and they say that most of the world essentially uses NFC. So if you want to learn more about this normalization feature for Unicode with regard to PostgreSQL, definitely check out this blog post.

The next piece of content is "How to Monitor PostgreSQL 12 Performance with OmniDB - Part 2". This is from 2ndquadrant.com. So this goes into more so of building a performance monitoring dashboard for PostgreSQL. So this is a very comprehensive and long blog post describing how to do it, along with a ton of graphs showing you what the output looks like. So if you're interested in building a monitoring solution using their OmniDB tool, check out this blog post.

The next post, also from 2ndquadrant.com, is "Barman 2.11: barman-cloud-restore and barman-cloud-wal-restore". This shows how, once you have a backup made with Barman, you can do the restore process, including both the database and the WAL. So if you're interested in using Barman for that purpose, definitely check out this blog post.

The next piece of content is "Snapshot Isolation Mode". This is from pgsqlpgpool.blogspot.com, and this is a new feature that's coming to Pgpool II version 4.2 that basically allows atomic visibility across multiple PostgreSQL instances. So it looks like this is a feature they're building towards where essentially you have global atomic visibility across multiple database servers and they go into some of the advantages and how they're thinking about it. Basically, this would enable this atomic visibility no matter the version of PostgreSQL. So if you're interested in that, check out this blog post.

The last piece of content, the PostgreSQL person of the week is Umair Shahid. If you want to learn more about Umair and his contributions to PostgreSQL, definitely check out this blog post.
