External Compression, Parallel Vacuum, JSON Aggregates, JSON & Arrays | Scaling Postgres 130
In this episode of Scaling Postgres, we discuss external compression, vacuuming in parallel, working with JSON Aggregates and the JSON & Array datatypes.
Content Discussed
- Need for external compression methods in PostgreSQL
- Parallelism comes to VACUUM
- Postgres json aggregate functions are pretty cool
- Webinar: JSON & ARRAY – Contemporary PostgreSQL Data Types [Follow Up]
- How to Get the Best Out of PostgreSQL Logs
- PostgreSQL vs PostGreSQL vs Postgre vs Oracle vs all the rest
- Oracle to PostgreSQL — Cursors and ltrees
- Three Easy Things To Remember About Postgres Indexes
- An Overview of Trusted Extensions in PostgreSQL 13
- PostgreSQL Backups
- Combining pgBackRest and Streaming Replication, PG13 update
- Marco Slot
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Need for external compression methods in PostgreSQL". This is from amitdkhan-pg.blogspot.com, and he's talking about a means of doing compression within Postgres using external methods. Postgres offers some utilities with the ability to do compression, but it's usually just gzip. There are some internal ways of doing compression, like with The Oversized-Attribute Storage Technique, or TOAST, tables. However, this author indicates there could be a need to use external compression mechanisms to compress a table or to compress data to get better performance. So, as he says here: "...the idea is smaller data size means fewer data pages to scan, which means lesser disk i/o and faster data access".
Of course, the data decompression methods need to be fast enough not to hamper the query performance gains that you would get. Some of the compression methods that it would be good to utilize in some way are zlib, lz4, zstd, snappy, gzip, et cetera. Now, he also mentions another benefit of putting this feature in would be the ability to utilize hardware acceleration for compression and decompression. He mentions a few of those here, and it looks like a few of them are on the ARM64 platform. Now, he does say that PostgreSQL uses its own built-in compression algorithm based on LZ for TOAST table compression. But he says, what if you could choose that compression algorithm, what if you could configure the compression level, and what if it could use hardware acceleration? He says that the community is working on something exactly like that.
So there is a proposed feature in the making with an experimental patch available, and he took a look at it. He created two tables, a zlibtab table and an lztab table. The zlibtab table is, of course, compressed with zlib at a level of 4, and he just used pglz compression on the other one. Now, what's interesting is that with each of these, you get different behaviors and different levels of performance depending on what you're doing. So, for example, with a COPY where you're inserting data, zlib was almost twice as fast as pglz. In terms of table size, pglz was smaller, looks like maybe 1.2 GB versus 1.7 GB. But in terms of SELECT performance, pglz was twice as fast as zlib. So what's interesting is that it seems as if you could choose your compression algorithm and get different levels of performance.
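To give a rough idea of what this could look like, here is a sketch of column-level compression settings along the lines the episode describes. This is only an approximation based on the description above; the experimental patch's actual syntax, options, and table definitions may differ.

```sql
-- Rough sketch only; the experimental patch's real syntax may differ.
CREATE TABLE zlibtab (t text COMPRESSION zlib WITH (level '4'));
CREATE TABLE lztab   (t text COMPRESSION pglz);

-- Load the same data into both, then compare load time, table size,
-- and SELECT time to see which trade-off suits your workload.
```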
Maybe you want to have the fastest queries, even though the table is a little bit larger or it takes longer to insert. Or maybe you're just interested in raw insert performance. Well, you can choose an algorithm that is best at that. He also mentions you could even define your own compression algorithm by defining a particular compression handler, and he gives an example of that. So I found this blog post super interesting in terms of how we could potentially use compression and decompression of data in Postgres to get better performance, where you are able to define the type of performance you're looking for. Are you looking for a way to save space? Are you looking for fast queries? Are you looking for fast inserts? So I definitely encourage you to check out this blog post.
The next piece of content is "Parallelism comes to VACUUM". This is from 2ndquadrant.com. Now first they cover the different phases of vacuum as they exist. So you've got the heap scan, the index vacuum phase, the heap vacuum phase, the index cleanup phase, and the heap truncation phase. Now, of all of these, the vacuuming of the indexes takes the longest and that is exactly what they're looking to make happen in parallel with version 13 of Postgres. So by default, when you run a manual vacuum, so vacuum some tables, the parallel option will be enabled. So it's on by default. However, they say here quote "Parallel vacuum is always disabled in autovacuum". So that's a little bit of a disappointment because I would really love to have it run there. But I assume they may be enabling this in a future version of Postgres.
Normally you see new features such as this come out where they're not the default, so they won't potentially impact the vast majority of Postgres users, but then maybe in a later version it's deemed safe enough to be the default. We'll just see what happens in future versions. Then, if you do specify the PARALLEL clause, you can define how many workers you're going to be using, or you can disable parallelism by setting PARALLEL to 0, so no workers. Now, he also mentions down here that there are different settings based on the indexes and that you're not always going to get a parallel scan even though you ask for one. So if a B-tree index is small enough, it still may not do parallel operations, because it would take more time to set up and coordinate those workers than it would to just do the work with one process. But otherwise, you do get B-tree indexes being vacuumed in parallel.
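As a quick sketch of the Postgres 13 syntax being described, the PARALLEL option is passed to a manual VACUUM like this; the table name here is just a placeholder.

```sql
-- Placeholder table name; use up to 4 workers for the index vacuum phase.
VACUUM (PARALLEL 4) mytable;

-- Setting it to 0 disables parallel index vacuuming for this run.
VACUUM (PARALLEL 0) mytable;
```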
Now, he also mentions that you do have support for different phases for the other index types mentioned down here, such as hash, GIN, GiST, BRIN, et cetera, and they may or may not be performed in parallel depending upon the phase. Now, in terms of performance, he set this up with a 6 GB table with eight 3 GB indexes and then ran it with different numbers of parallel workers. As you can see here, there's a dramatic performance gain by going to eight workers, which is the number of indexes that are on the table. What he mentions here is, quote, "...the execution time of the index vacuum accounted for more than 95% of the total execution time. Therefore, parallelization of the index vacuum phase helped to reduce the vacuum execution time much". So definitely a great improvement coming in Postgres 13. And again, as I mentioned, I look forward to future versions where they may enable autovacuum to do this in parallel as well.
The next piece of content is "Postgres json aggregate functions are pretty cool". This is from the blog.k-nut.eu and he basically has a set of tables where a query looks like this. You have a username, an age, and a group name, and he wanted the API to return something that looks like this in terms of a JSON where things are grouped together. So different users and different groups are presented in this JSON format. So first he shows the different tables and how the data is inserted to be able to get a query that looks like this.
Then he converts it to JSONB by combining this information here, and he removes the ID column because it wasn't necessary in the output, which gives the results shown here. He then uses the JSON aggregate function, json_agg, to collect the rows for each group into a single row, grouped on the group name. So you can see here you have two records inside of this one group. Then lastly, he did a json_object_agg, which combines this into a single JSON object, the final output that he was looking for. So this is a pretty quick post, but it shows you how you can get Postgres to output JSON in a format that could be directly served by an API.
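As a minimal sketch of that technique, something along these lines nests rows into JSON arrays and then into one JSON object. The table and column names here are assumptions for illustration; the post's actual schema differs in its details.

```sql
-- Hypothetical schema: users(name, age, group_id) and groups(id, name).
SELECT json_object_agg(group_name, members) AS api_payload
FROM (
    SELECT g.name AS group_name,
           json_agg(json_build_object('name', u.name, 'age', u.age)) AS members
    FROM users u
    JOIN groups g ON g.id = u.group_id
    GROUP BY g.name
) AS per_group;
```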
Now, we've covered different articles that show this, and what they mention is that they usually get some big performance gains from having Postgres do this work through queries as opposed to using, say, your application framework to do the manipulation. So you potentially lose some flexibility, but you generally gain some performance. So if you're interested in learning more about how to do this, definitely check out this blog post.
The next piece of content is "Webinar: JSON & ARRAY - Contemporary PostgreSQL Data Types [Follow Up]". This is a webinar that's given by 2ndquadrant.com. You can click here to look at the webinar, and there's a ten or eleven-minute preview down here. But it basically reviews the JSON data types as well as arrays and works with them to avoid having to follow any kind of an entity attribute value pattern which can really give poor performance. So basically it takes what was in the previous post and goes into those functions and more manipulations that you can do in more detail in a webinar format. So if you're interested in learning more about JSON and manipulating it in Postgres, definitely check out this webinar.
The next piece of content is "How to Get the Best Out of PostgreSQL Logs". This again is from 2ndquadrant.com and they're talking about setting up logging for Postgres. Some of their recommendations are not to make manual changes to postgresql.conf. Basically, use some sort of configuration tool like Ansible, Puppet, or Chef. They recommend using the logging_collector. Some people choose not to use it and just use the system logging of their operating system. But they advocate using the logging_collector. How to set your log destination to a particular format. They have recommendations that they suggest with regard to that. How to configure your log file names, how to configure your long line prefix, and the different configuration settings for setting up your logging. Then they cover some final things like using tools to analyze the logs, such as pgBadger and other management applications to be able to track log changes. So if you're interested in learning more about that, definitely check out this blog post.
The next piece of content is "POSTGRESQL VS POSTGRESQL VS POSTGRE VS ORACLE VS ALL THE REST". This is from cybertec-postgresql.com and they're talking about searching within Postgres where you're doing similar searches. Now, it's not purely about similarity searches, but there are ways to find information through text searching. The first thing they cover is the citext data type, which is available at this extension, or the citext which is a case and sensitive text. So basically you can do searches without having to worry about the case and it will find the appropriate records if you use this data type with this extension.
Next, they go into LIKE queries and how you can search on portions of records using LIKE and ILIKE; again, ILIKE is case-insensitive. Then they cover doing a similarity search using pg_trgm and give some examples of using that. Then they go into full-text search, which does an exact search, but only on certain words that exist within the rows you're searching, and then a phrase search, where you can require that certain words follow others in terms of phrasing. So if you're interested in learning more about these types of searches, definitely check out this post.
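Here is a minimal sketch of those search styles side by side, assuming a hypothetical products table; the post's examples and data are its own.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical table for illustration.
CREATE TABLE products (id serial PRIMARY KEY, name text, description text);

-- Case-insensitive pattern match on part of a value.
SELECT * FROM products WHERE name ILIKE '%postgre%';

-- Trigram similarity: % is the pg_trgm similarity operator.
SELECT name, similarity(name, 'postgresql') AS score
FROM products
WHERE name % 'postgresql'
ORDER BY score DESC;

-- Full-text search for a word, then a phrase search.
SELECT * FROM products
WHERE to_tsvector('english', description) @@ to_tsquery('english', 'database');

SELECT * FROM products
WHERE to_tsvector('english', description) @@ phraseto_tsquery('english', 'open source database');
```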
The next piece of content is "Oracle to PostgreSQL - Cursors and ltrees". This is from 2ndquadrant.com. So this follows on a post that we discussed in a previous episode of Scaling Postgres where they're talking about how when you go from Oracle to Postgres, you don't have to use as many cursors. Now the last one, they talked about working with hierarchical data using recursive CTEs, this one covers ltrees, which is a label tree that is an extension that exists for Postgres that helps you work with very complex hierarchical data. So for example, they have the taxonomy list here and how this goes 21 levels deep with over 600,000 records and shows you how you can use ltrees to be able to work with this type of data. So if you're interested in that, definitely check out this post.
The next piece of content is "Three Easy Things To Remember About Postgres Indexes". This is from blog.crunchydata.com. This talks about how indexes can speed up other operations too, meaning not just searching on the where statement, but also when you're doing joins or you're looking to group and sort data. Having indexes in place can help those operations and speed them up. Second, indexes aren't always used, so depending upon your table and how you query the data, like for example, they're using the LIKE syntax here. There are cases where an index will not be used in that case. Then thirdly, that indexes come at a cost. Every time that you're inserting updating data, those indexes have to be updated and there's a performance impact for that. So it's a pretty simple post, but if you're interested in learning more, definitely check out this one.
The next piece of content is "An Overview of Trusted Extensions in PostgreSQL 13". This is from severalnines.com. So prior to Postgres 13, you needed to be a superuser in order to create extensions on databases. But apparently, with 13, there are certain extensions that are classified as trusted and certain ones that are untrusted. So something like they mentioned HStore here is in 13 and considered a trusted one. So you just need certain permissions like CREATE DB privileges to be able to add this to your database.
You don't need to have Postgres superuser permissions, whereas other extensions, like the file foreign data wrapper, are considered untrusted, and therefore it says you must be a superuser to create them. So it's an interesting adjustment that's coming with Postgres 13. They mention there are 24 trusted and 24 untrusted extensions. So basically, it gives more power to users who have access to the system, if they have certain privileges, to enable some of these extensions for the database they work with. If you're interested in learning more, check out this post.
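As a minimal sketch of the behavior described, assuming a hypothetical non-superuser role and database (the names are placeholders):

```sql
-- Run as a superuser: give the application owner CREATE on its database.
GRANT CREATE ON DATABASE appdb TO app_owner;

-- Run as app_owner, connected to appdb: hstore is trusted in Postgres 13, so this succeeds.
CREATE EXTENSION hstore;

-- file_fdw is untrusted, so a non-superuser gets a permission error here.
CREATE EXTENSION file_fdw;
```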
The next piece of content is "PostgreSQL Backups". This is from highgo.ca. They cover all the different ways that you can backup Postgres, from doing logical pg_dumps and pg_dumpalls to file system level backups using manual backup methods. Such methods are starting the backup, backing up all the files, and then stopping the backup using pg_base_backup. So if you're interested in looking at different backup solutions for Postgres, check out this post.
The next piece of content is "Combining pgBackRest and Streaming Replication, PG13 update". This is from pgstef.github.io. So it shows you how to use pgBackRest with Streaming Replication and get that all set up for version 13. So if you're interested, check out this post.
The last piece of content, the PostgreSQL person of the week is Marco Slot. So if you're interested in learning about Marco and his contributions to Postgres, definitely check out this blog post.