Toast Impact, zheap Storage, More Performance, pg_rewind Changes | Scaling Postgres 137
In this episode of Scaling Postgres, we discuss the impact of toast, zheap storage results, more Postgres 13 performance data and changes to pg_rewind.
Content Discussed
- Statement from the PostgreSQL Core Team on the EDB acquisition of 2ndQuadrant
- The Surprising Impact of Medium-Size Texts on PostgreSQL Performance
- zheap: Inspecting storage sizes
- Postgres 13 Performance with OpenStreetMap data
- TPC-H performance since PostgreSQL 8.3
- pg_rewind changes in PostgreSQL 13
- Data consistency issues in logical replication
- Logical Replication Upgrade in Postgres
- How to Upgrade PostgreSQL 11 to PostgreSQL 12 with Zero Downtime
- Using PostgreSQL to Shape and Prepare Scientific Data
- The way PG store null value in record
- Free Space Mapping file in details
- Damien Clochard
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Statement from the PostgreSQL Core Team on the EDB acquisition of 2ndQuadrant". This is from postgresql.org. They're talking about the issue where EDB acquired 2ndQuadrant, and now, I believe, up to 60% of the core team is composed of EDB employees. As they say, quote "There's been a long unwritten rule that there should be no more than 50% of the membership of a Core Team working for the same company". So basically, they say they're looking into this and discussing what to do, and they wanted to put a post up for the community to let them know that they are aware of it and are looking at what solutions they could pursue. So it's interesting that they actually posted this on the PostgreSQL website. Just some information to keep in mind.
The next piece of content is "The Surprising Impact of Medium-Size Texts on PostgreSQL Performance". This is from hakibenita.com. Now, in terms of medium texts, he basically says small texts are very small strings, like names, slugs, usernames, emails, et cetera. You may be using a varchar instead of an actual text. Large text is large content such as a blog post, articles, and HTML content, whereas medium texts are things like longer descriptions, comments, reviews, stack traces, and things like that. So it's text that's not quite like a full blog post or a full document, but not very short strings of text. That's what he's calling medium text. Basically, this is a great article, talking about TOAST and explaining what it is and how it works. So he goes into what TOAST is, and that because PostgreSQL has a fixed page size, a tuple cannot occupy more than one page.
Therefore, TOAST was conceived as a way to store more than 8 KB within a row. So basically, it overflows into this TOAST storage. He has information here from the docs that basically says TOAST will compress and/or move field values out of line until the row value is shorter than the TOAST tuple target, which is typically 2 KB. So whenever a table is created, a TOAST table is created as well. He used this technique here to identify the TOAST table of a table he created. Then he actually described that TOAST table, and you can see that it has a chunk_id column (an OID), a chunk_seq column (an integer), and a chunk_data column (the raw data). Even a TOASTed value can exceed that 8 KB, so it gets split up into separate chunks of data, each with a sequence number, all referencing the value that's TOASTed.
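The lookup technique he references can be done through pg_class; here's a minimal sketch, assuming a table I'll call toast_test:

```sql
-- Find the TOAST table behind a given table (toast_test is my stand-in name)
SELECT reltoastrelid::regclass
FROM pg_class
WHERE relname = 'toast_test';
-- Returns something like pg_toast.pg_toast_16450, which you can then
-- inspect with \d pg_toast.pg_toast_16450 in psql
```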
Now he has an example here where he inserts a small text value into the table. When you check the TOAST table, you can see it's still empty. So everything fits within the main table itself; nothing is TOASTed. But when he inserted a longer text string of over 4,000 characters, you can see that the TOAST table is now populated and that the value actually spans three rows in it. Each row references the same chunk ID, but as this diagram represents, the value is split into three rows in the TOAST table, each with its own sequence number, as you can see here. Now, he also mentions that TOAST can do compression, and how well a string compresses depends on how random it is. To look at some examples, he actually created a function called generate_random_string where you can specify the length as well as the number of different characters to use in generating the random string.
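The article defines its own generate_random_string function; here is a rough sketch of what such a function might look like (parameter names and the default character set are my own):

```sql
-- Build a random string of a given length from a limited character set
CREATE OR REPLACE FUNCTION generate_random_string(
    len   integer,
    chars text DEFAULT 'abcdefghijklmnopqrstuvwxyz0123456789'
) RETURNS text AS $$
    SELECT string_agg(
        substr(chars, floor(random() * length(chars))::int + 1, 1),
        ''
    )
    FROM generate_series(1, len);
$$ LANGUAGE sql;

-- A fully random 10 KB string vs. one drawn from only three characters
SELECT generate_random_string(1024 * 10);
SELECT generate_random_string(1024 * 10, 'abc');
```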
So he inserted a 10 KB string of entirely random characters, and it was split up into six chunks within the TOAST table. But when he restricted it to only three different characters, the string becomes much easier to compress. Inserting the same 10 KB length, it's stored in only two chunks and takes up about 3 KB of space, so it compressed by roughly 60 to 70%. Now, what's even more interesting is that when he uses only one character for the same 10 KB, it compresses so well it doesn't even get stored in the TOAST table. Now, he does mention for completeness that you can configure TOAST per table, and that includes toast_tuple_target, which is the minimum tuple length after which PostgreSQL tries to move long values into TOAST, and the column storage setting, which is the TOAST strategy.
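As a hedged illustration of those two settings (table and column names are placeholders):

```sql
-- Only try to TOAST rows longer than 4 KB instead of the ~2 KB default
ALTER TABLE toast_test SET (toast_tuple_target = 4096);
-- Store the column out of line but without compression
ALTER TABLE toast_test ALTER COLUMN value SET STORAGE EXTERNAL;
```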
The default is EXTENDED, but there are three other TOAST strategies, which vary in whether values are compressed and whether they're moved out of line. Then he gets into TOAST performance. For this example, he creates a small table, a medium table, and a large table; the names indicate the length of the strings stored. He inserted 500,000 rows into each of these tables using strings of the corresponding lengths. He turned off parallel queries, so each query is just doing a sequential scan, and he looks for a single ID in each table. The small table ran in 25 milliseconds, the medium table eventually ran in about 173 milliseconds, whereas the large table did it in 37 milliseconds. So the question is why is the medium table so much slower than both the small and the large? The reason is where the data is stored. The small values are all stored in line and it's a relatively small table.
There's essentially nothing being stored in the TOAST. Now, for the medium values, he purposefully inserted strings that just barely fit into the table and did not spill over into the TOAST. So you can see the TOAST size is zero, but all those medium strings had to be stored somewhere: they're stored in line in the table. As a consequence, the table, as you can see, is almost a gigabyte, whereas the small one is at 21 megabytes. Then with the large one, pretty much all of the data spilled into the TOAST. It's almost 2 GB in the TOAST table, yet the actual table itself is small. So when you're doing a scan, it's much faster to scan the 21-megabyte small table and the 25-megabyte large table than the almost 1 GB medium table. Now, in terms of this analysis, what he was searching on was the ID. So the ID is located here.
If he was actually searching on the text value itself, you'd get a much different result, because you would have to search through that TOAST table. That's what he did next, searching on the TOASTed values. Here you can see a search on the large table took over 7 seconds, the medium eventually got down to 260 milliseconds, whereas the small was 50 milliseconds. So that makes a lot more sense: the search time tracks how large the data is. Then he looked at the impact of adding an index on the ID. Of course, when you query on the ID, you get approximately the same speed, less than a millisecond for each table. But then he tried querying a lot of data from each of the tables using the ID, with a range that pulls out half the ID values, and here he saw something different. The small table was about 60 milliseconds, the medium table took longer at 284 milliseconds, and the large table was faster at 70 milliseconds.
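For reference, that wide range query would be shaped something like this (the table name and bounds are my guesses):

```sql
-- Fetch roughly half the table by ID, returning all columns
EXPLAIN (ANALYZE)
SELECT * FROM toast_test_medium
WHERE id BETWEEN 1 AND 250000;
```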
So I am a little surprised by this result. I would have expected the medium and the large to be similar, because the ID index sizes should be the same across all three tables, and because you're selecting all columns, you still have to go to the heap to pull in the values required. So I'm a little surprised that the medium takes longer than the large. He says that the database had to read a larger portion of the table, but because it's returning all columns, I would have expected it to have to read the TOASTed values as well to pull back the results. So, definitely an interesting finding. In terms of adjusting to try to get better performance, he suggests tuning the toast_tuple_target to determine how much gets stored in the TOAST table or not. The next possibility is creating a separate table to store some of these values. But definitely a very interesting post covering TOAST, how it works, and some performance impact you may run into depending on what values you're storing in Postgres. So definitely a blog post I suggest checking out.
The next piece of content is "ZHEAP: INSPECTING STORAGE SIZES". This is from cybertec-postgresql.com and they're talking about how zheap handles storage differently with regards to sizing. In constructing zheap, which is a different storage mechanism for Postgres, they've made the tuple header much smaller and the alignment has been improved on how data is actually stored at the page level. So first they created a sample table as a temporary table that's about 500 megabytes. Then they created a table like that one using the standard heap. So that's the default storage system and inserted 10 million rows into it. Then they did the same thing for the zheap storage system and inserted 10 million rows as well. Now, what's interesting is how much slower the zheap solution was.
Now, zheap is still in beta and not production-ready yet, so that may have some impact. But in the standard storage system it took 7.5 seconds to insert 10 million rows, and it took 28 seconds to insert into zheap. So that's quite a difference. What's also interesting is that zheap's table is about half the size. If you look at the standard heap, it was about 500 megabytes, whereas zheap stores it in 250 megabytes. Now, they say this is due to the things mentioned before: the tuple headers are smaller, more of the transaction information is stored at the page level as opposed to the tuple level, and the alignment and padding used to organize the data is more efficient. They also mention that zheap excels when you have tables with few columns and many rows; you're going to get a lot more savings there as opposed to tables with many, many columns.
Because again, transaction information is stored at the page level as opposed to the tuple level. And then, of course, the hallmark of zheap, the reason people are potentially investigating it, is in-place updates. You can see that when an update is done incrementing each row, the size of the table is exactly identical to before: there's no doubling in size. If you did this with the standard heap storage system, the table would double in size, because an update consists of inserting a new row version and then eventually vacuuming away the old one. This isn't necessary with zheap, but the thing to keep in mind is that all of this information is kept around in a rollback area, called undo; there's actually a directory called undo where it stores all of this information.
So even though it's not stored in the table, it's stored elsewhere. Now, as a consequence, rollbacks are handled differently. With the heap storage system, the old rows are right there, so an immediate rollback is essentially free. A rollback with zheap is more intense, because it actually has to literally undo everything and put things back in place. For example, this rollback on zheap took 41 milliseconds versus about a tenth of a millisecond for the regular heap. So it's a trade-off: if you move to zheap, you get in-place updates with less table bloat, but if you do have a lot of rollbacks, that is going to zap your performance; a quick way to test this yourself is sketched below. So, definitely an interesting post about zheap. I definitely suggest you check it out.
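As referenced above, a quick psql sketch for seeing the rollback trade-off, using the hypothetical tables from the earlier sketch:

```sql
\timing on
BEGIN;
UPDATE t_zheap SET id = id + 1;
ROLLBACK;  -- on zheap this must apply the undo, so it is comparatively slow
```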
The next piece of content is "POSTGRES 13 PERFORMANCE WITH OPENSTREETMAP''. This is from rustprooflabs.com. So this is another post about the performance of Postgres 13. Here he tested versions 13, 12.4, and 11.9 and he just used the OpenStreetMap data as a source for checking things out. When loading some of the data, there wasn't any appreciable difference. Then you run it through the PGOSM and that happened 12% faster than version 12 and 18% faster than version 11. Then they did some testing with pgbench and they got a 16% faster result with OpenStreetMap PostGIS data calculations.
He talked a little bit about the B-tree index deduplication that's present in 13, and he also checked index creation time: you can see that 11 is faster than both, but 13 was faster than 12. I suspect that's perhaps because of the deduplication taking a little more time when creating the indexes. But, as is the hallmark of version 13, for some of these indexes he's seen a dramatic 70% reduction in size. Now, in terms of query performance, he didn't see that big a difference with this size of data, but he estimates that when you get into millions of rows, you're really going to see a performance difference with the indexes as the size impact comes into play. But check out this post for the performance of Postgres 13.
The next piece of content is "TPC-H performance since PostgreSQL 8.3". This is from 2ndquadrant.com, and this is the second in a series of blog posts. The first one looked at OLTP performance. This second post looks at more of a data warehousing data mart performance for Postgres. Again, they covered versions from 8.3 up to version 13. They used a consistent hardware setup to do these tests. So basically it gives you a trend of showing how performance has changed over time for the different versions.
They used a data set that essentially fits into the shared buffers, a data set that can fit into memory, and a data set that has to spill onto disk. So basically, three different variants. They also added some parallelism for certain activities. For the data loading, you can tell that it's improved over time; 13 is maybe not the best, but still pretty good in terms of loading data. Loading data with parallelism enabled, it looks like 13 is probably the winner across all the versions, which is good to see. Then they got into the queries, and you can see results at 1 GB, 10 GB, and 75 GB, both parallel and non-parallel.
You can see the trend is for performance to improve over time, and 13 looks to be about the best out of all of these. Then they looked at some parallel queries, and there was some variance between the different versions, but again, the more recent ones seem, overall, to be pretty much the best. So things are heading in the right direction and there are no big regressions in version 13. So, another post about Postgres 13 performance if you would like to check it out.
The next piece of content is "pg_rewind changes in PostgreSQL 13". This is from sql-info.de So this highlights some changes that have been made to pg_rewind. The first one is that you can now write a recovery.conf file in the case of versions prior to 12 and version 12 it actually adds a signal file. They do mention some caveats in terms of working with how to handle things and that generally it writes the connection information to the postgresql.auto.conf file because that is essentially the last file checked to assign configuration values to Postgres, so it chooses to place those there.
The next option they're talking about is restoring the target WAL. This tells the target of a rewind to look in the WAL archive and retrieve any missing WAL from there in order to catch back up with the primary. They also talk about changes to better handle automatic crash recovery: starting the instance in single-user mode to ensure that crash recovery is performed before doing the sync with the primary. So if you're interested in learning about these changes to pg_rewind in Postgres 13, definitely check out this blog post.
The next piece of content is "Data consistency issues in logical replication". This is from elephanttamer.net. He's talking about the scenario where you have a publisher and subscriber and the data on the subscriber can still be UPDATED, DELETED, and INSERTED into and that can cause issues with the replication. So he's basically covering some of these scenarios. So scenario one: data INSERTed on the subscriber side. So generally you're going to have a replica identity or a primary key in order to logically replicate over data. But the issue is that if you INSERT data into the subscriber table and then you get an INSERT coming in from the publisher table with the same primary key, you're going to get an error.
Namely, a "duplicate key value violates unique constraint" error. He says replication will stop and WAL segments will start to pile up on the publisher. So definitely an issue, and basically you need to watch out for and avoid these situations. The way to get around it is to DELETE the conflicting data for that particular ID on the subscriber. Scenario two is data UPDATEd on the subscriber side. Much like you can INSERT data, you can UPDATE it, so again, it's an issue to be aware of. Similarly, you can DELETE data, which becomes a problem if that data then needs to be UPDATEd. In that case, with the log level set to debug1, you'll get "logical replication did not find a row for UPDATE in the replication target".
So that could be a concern. He then covers a few different ways you can resynchronize data for logical replication. First: lock the table for writes on the publisher, dump the data, copy the dump to the subscriber, truncate the table on the subscriber, restore the dump on the subscriber, and remove the lock on the publisher. Or second: exclude the table from the current publication, truncate the table on the subscriber, and create a new publication/subscription pair for it, roughly as sketched below. He also describes a very black magic way of doing it; I'll let you take a look at that if you want to examine it. But definitely some things to be aware of when you're using logical replication.
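A minimal sketch of that second approach, with all object names hypothetical:

```sql
-- On the publisher: take the table out of the existing publication
ALTER PUBLICATION my_pub DROP TABLE my_table;
-- On the subscriber: tell the existing subscription about the change
ALTER SUBSCRIPTION my_sub REFRESH PUBLICATION;
-- On the subscriber: clear out the inconsistent copy
TRUNCATE my_table;
-- On the publisher: a dedicated publication for just this table
CREATE PUBLICATION my_table_fix FOR TABLE my_table;
-- On the subscriber: a new subscription whose initial sync re-copies the table
CREATE SUBSCRIPTION my_table_fix_sub
    CONNECTION 'host=publisher.example.com dbname=mydb'
    PUBLICATION my_table_fix;
```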
Now, related to that, there's an article called "Logical Replication Upgrade in Postgres". This is from blog.crunchydata.com. It goes through the whole process of using logical replication to do an upgrade of Postgres. Now, they didn't mention sequences here; I'm not sure if those get brought over with the globals, I'll have to take a look. But it's definitely a process for getting logical replication up and then choosing to switch over to it if you want to do an upgrade.
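For what it's worth, built-in logical replication does not carry sequence values over, so after a cutover you would typically reset each sequence from its table's current maximum; a minimal sketch with hypothetical names:

```sql
-- Repeat per sequence-backed column on the new primary
SELECT setval('orders_id_seq', (SELECT max(id) FROM orders));
```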
Similar to that is an article from severalnines.com on "How to Upgrade PostgreSQL 11 to PostgreSQL 12 with Zero Downtime". In this case, they're again using logical replication. So if you want to look through these three posts to see how you can potentially use logical replication to do an upgrade, definitely check them out.
The next piece of content is "Using PostgreSQL to Shape and Prepare Scientific Data. This is from blog.crunchydata.com. So this is data you want to analyze, ideally in Postgres. This describes a process of importing the data into Postgres and using Postgres to manipulate and massage the data to get it in the exact format and data types that you're going to be using for your analysis. So he has a process to run through doing it all in Postgres as opposed to using spreadsheets in order to update data values manually. So if you want to check out a more automated way to prepare your data for scientific analysis, you can check out this blog post.
The next piece of content is "The way PG store null value in record". This is from highgo.ca. So this post goes into some of the internals about how Postgres stores NULLS and some of the consequences with regard to comms when you delete them is it can take up some more space because deleted columns are considered NULLS for all of those values and that has a storage impact. So if you're interested in learning more about the internals in Postgres, you can check out this post.
The next piece of content, also from highgo.ca, is "Free Space Mapping file in details". So if you want to learn more about the free space map file in Postgres, definitely check out this post.
The last piece of content, the PostgreSQL person of the week is Damien Clochard. If you're interested in learning more about Damien and his contributions to PostgreSQL, definitely check out this blog post.