
Postgres 12 Released, Generated Columns, Parallel Estimates, Change Data Capture | Scaling Postgres 84


In this episode of Scaling Postgres, we discuss the release features of Postgres 12, generated columns, parallel estimates and change data capture.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is that PostgreSQL 12 is released. This is from the postgresql.org website, and it was released on October 3, 2019. Some of the main areas they talk about in terms of what's been updated are overall performance improvements. The first is significant performance and maintenance enhancements to the indexing system and partitioning. Indexing can see up to a 40% space reduction and gains in query performance, and it handles workloads better where the indexes are frequently modified. There are also partitioning improvements for tables with thousands of partitions, along with better INSERT and COPY performance into partitions.

They reduced the write-ahead log overhead when building GiST, GIN, and SP-GiST indexes. You can now create covering indexes (the INCLUDE clause) on GiST indexes. CREATE STATISTICS now supports most-common-value statistics. Just-in-time compilation happens by default. There are also enhancements to SQL/JSON path support. Common table expressions are now inlined by default, but you can control whether they are materialized or not. Then there are generated columns, which we'll discuss in a post a little bit later. Some other interesting things mentioned in terms of administration are REINDEX CONCURRENTLY, which is huge, and pg_checksums.
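As a rough sketch of a few of those features (the table and index names here are invented for illustration):

```sql
-- Covering index: v12 allows the INCLUDE clause on GiST indexes too
CREATE INDEX places_location_idx ON places USING gist (location) INCLUDE (name);

-- Multicolumn most-common-value statistics, new in CREATE STATISTICS
CREATE STATISTICS addr_city_state (mcv) ON city, state FROM addresses;

-- Rebuild an index without blocking writes
REINDEX INDEX CONCURRENTLY places_location_idx;
```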

Now there are a few other posts we're going to talk about that discuss Postgres 12. The next one is "8 Major Improvements in PostgreSQL 12". This is from enterprisedb.com. The first thing they talk about is partitioning performance: again, cases where you have a lot of partitions were improved (we saw a post on that in a previous episode of Scaling Postgres), as well as INSERT and COPY performance. They also go back over the B-tree index enhancements, the better space utilization, and some of the other items I just mentioned, plus the multicolumn most-common-value statistics, which let you define statistics across columns. For common table expressions, or CTEs, the WITH query is now inlined, and you can define whether you want it MATERIALIZED or not.

They mentioned prepared plan control: previous versions of PostgreSQL would use a custom plan five times and, on the sixth, generate a generic plan and use it if it's about as good as the custom ones. That behavior can now be controlled through a variable called plan_cache_mode, which allows you to use a generic plan right away. They say that this should benefit users who quote "...know their parameters are constant and know the generic plan will work". They mentioned just-in-time compilation and the checksum control, and that now you can change a cluster from no checksums to checksums without having to dump and reload the data. You must be offline to make this change, so it does require a stop and a start, but being able to turn them on or off is beneficial. And they mentioned REINDEX CONCURRENTLY. So again, a lot of great changes in version 12.
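To make those two features concrete, here is a minimal sketch (the table name and setting value are illustrative):

```sql
-- Skip the custom-plan trials and use a generic plan immediately
SET plan_cache_mode = 'force_generic_plan';

-- CTEs are inlined by default in v12; MATERIALIZED restores the old behavior
WITH recent AS MATERIALIZED (
    SELECT * FROM orders WHERE created_at > now() - interval '1 day'
)
SELECT count(*) FROM recent;
```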

The next post, again related to the new version 12 features, is "Generated Columns". This is from pgdash.io. The first thing they cover is what generated columns are. They create a table here with three columns: a width, a height, and an area that is always generated as width x height and stored. So the value is stored in the table, and whenever the width or height is adjusted, it will automatically update. You can see here that when you do an update setting the width to 40, the area is automatically regenerated. They mention a few points you should know about generated columns. One is persistence: currently, the value of a generated column has to be persisted and cannot be computed on the fly at query time, and the STORED keyword must be present in the column definition.

Two, the expression used to compute the value has to be immutable, so it can depend on other columns but not on other generated columns. They can be used in indexes, but the values of generated columns are omitted from the output of pg_dump and COPY table commands. Now they have an interesting example where they use a generated column to support full-text search. They have a scenes table with a body column containing the full text of a scene from a play, and they generate a tsvector column by converting the body to a tsvector. So now you can easily do full-text searches. That could be a really great use case for generated columns. So if that's of interest to you, definitely a blog post to check out.
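A minimal sketch of both examples, along the lines the post describes (table and column names are illustrative):

```sql
-- A stored generated column: area always tracks width * height
CREATE TABLE shapes (
    width  int,
    height int,
    area   int GENERATED ALWAYS AS (width * height) STORED
);
UPDATE shapes SET width = 40;  -- area is regenerated automatically

-- A generated tsvector column for full-text search
CREATE TABLE scenes (
    body     text,
    body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
);
CREATE INDEX ON scenes USING gin (body_tsv);
SELECT count(*) FROM scenes WHERE body_tsv @@ to_tsquery('english', 'daggers');
```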

The next post is "Freshly baked PostgreSQL 12 and changes are brought to pgCenter". This is from blog.dataegret.com. I believe this is a product for monitoring Postgres. I'm not that familiar with it, but they mention some of the changes that have been made. The first two they mention are more progress indicators. When you're doing an index creation, you can look at pg_stat_progress_create_index to check the status of the index build. And when you're doing a VACUUM FULL or a CLUSTER, you can use pg_stat_progress_cluster. It also mentions that two new columns have been added to pg_stat_database, checksum_failures and checksum_last_failure, so you can monitor the checksums you can now enable with a stop and start of your database. So again, more information in the system views for better monitoring. If you're interested in that, check out this blog post.
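For example, something like the following queries those new views (assuming a v12 server):

```sql
-- Watch an index build in progress
SELECT pid, phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;

-- Watch a VACUUM FULL / CLUSTER in progress
SELECT pid, command, phase
FROM pg_stat_progress_cluster;

-- Per-database checksum failure tracking
SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database;
```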

The next post is "HOW POSTGRESQL ESTIMATES PARALLEL QUERIES". This is from cybertec-postgresql.com. They're talking about parallel queries, which were introduced in 9.6. This particular post covers just parallel sequential scans. They set up a simple table with two columns: one with, I believe, 10 million different values, and another column containing only two different values. Now, first, they want to see what a non-parallel plan looks like, so they set max_parallel_workers_per_gather to 0 and then run the query. They get a sequential scan, and then they want to figure out how the planner calculates its estimates. For example, this is an EXPLAIN; it didn't run the query. So you're seeing the estimate here, and he asks, well, how did it come to that? So he looked at pg_relation_size to get the number of blocks.

Then he used this calculation: the current setting of seq_page_cost multiplied by the number of blocks, plus cpu_tuple_cost multiplied by the number of rows, plus cpu_operator_cost multiplied by the number of rows. And you get exactly the cost estimate they came up with for the non-parallel scan. Then he did the same thing for a parallel sequential scan and showed the formula used to estimate its cost, particularly how the number of rows per worker is estimated: the total estimated rows divided by the number of worker processes plus the leader contribution. The leader is the process that gathers the results from each worker, and its contribution shrinks as the number of workers grows. So this is a pretty interesting look at how PostgreSQL actually comes up with its estimates, which I was not too familiar with, but if you're interested in learning more about that, definitely a blog post to check out.
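To make the arithmetic concrete, here is a small sketch in Python using the default planner cost settings. The block and row counts are made-up example figures, and the parallel divisor mirrors, as I understand it, the logic of PostgreSQL's get_parallel_divisor():

```python
# Default planner cost settings (postgresql.conf defaults)
SEQ_PAGE_COST = 1.0         # seq_page_cost: cost to read one page sequentially
CPU_TUPLE_COST = 0.01       # cpu_tuple_cost: cost to process one row
CPU_OPERATOR_COST = 0.0025  # cpu_operator_cost: cost of one filter expression

def seq_scan_cost(pages, rows):
    # disk cost + per-tuple CPU cost + per-row cost of a single filter qual
    return SEQ_PAGE_COST * pages + (CPU_TUPLE_COST + CPU_OPERATOR_COST) * rows

def parallel_rows(rows, workers):
    # Rows each worker is expected to handle. The leader also scans, but its
    # contribution shrinks by 30% per worker and vanishes at 4+ workers.
    divisor = workers
    leader_contribution = 1.0 - 0.3 * workers
    if leader_contribution > 0:
        divisor += leader_contribution
    return rows / divisor

print(seq_scan_cost(44248, 10_000_000))     # non-parallel scan cost estimate
print(round(parallel_rows(10_000_000, 2)))  # rows per worker with 2 workers
```

The divisor for two workers is 2 + (1 - 0.6) = 2.4, which is why the row estimate in a parallel plan is noticeably less than half the table.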

The next post is "FIXING TRACK_ACTIVITY_QUERY_SIZE IN POSTGRESQL.CONF". This is from cybertec-postgresql.com. They're talking about an issue where some of the system views that tell you what queries are running for a particular activity, like pg_stat_activity or pg_stat_statements, sometimes show those queries cut off if they're very long. You can prevent that by increasing track_activity_query_size. But unfortunately, there are some costs to that, so you probably don't want to increase it by too much, and it does require a restart of the server. But again, if you have a big system, he says here, increasing it is generally a good idea because more information is generally better. So if you've run into an issue with queries you're looking at in system views being truncated, definitely check out this post and the track_activity_query_size setting.
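Changing it looks something like this (16384 is just an example value; the default is 1024 bytes):

```sql
ALTER SYSTEM SET track_activity_query_size = 16384;
-- restart the server, then verify:
SHOW track_activity_query_size;
```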

The next post is "PostgreSQL Change Data Capture With Debezium". This is from crunchydata.com. This post talks about using change data capture: basically using the WAL stream, and specifically logical decoding, to extract data as it's being produced, so you don't have to rely on triggers or going through logs. Debezium can be set up to use logical decoding to read the WAL stream and track what data has changed. The use case they're talking about here is that maybe you have a couple of relational databases: you can use Debezium to read the WAL stream and then send changes through a JDBC connector to update an analytics database. This is a very comprehensive post that goes through all the different commands to set this up and get it running. So if you're interested in implementing a change data capture method or using Debezium, definitely a blog post to check out.
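As a rough sketch of the moving parts on the Postgres side (Debezium normally creates the slot itself with its own output plugin; the built-in test_decoding plugin and slot name here are just for illustration):

```sql
-- postgresql.conf must have: wal_level = logical
SELECT * FROM pg_create_logical_replication_slot('cdc_demo', 'test_decoding');

-- Peek at the decoded change stream without consuming it
SELECT * FROM pg_logical_slot_peek_changes('cdc_demo', NULL, NULL);

-- Clean up when done, since an unused slot retains WAL
SELECT pg_drop_replication_slot('cdc_demo');
```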

The next post is "How to work with Postgres in Go". This is from medium.com. If you use Go, I think you would definitely find this beneficial for getting the most out of working with PostgreSQL. It's mostly about the interaction of Go with Postgres and general recommendations, and he summarizes a lot of his conclusions at the bottom. As you can see, it's a very comprehensive post, but he suggests which Postgres driver you should use, configuring limits for your connection pool size, collecting connection pool metrics, and logging what's happening in the driver. So a lot of recommendations on using Go with PostgreSQL. If you're using that tool, definitely a blog post to check out to make sure you get the most performance out of it.
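As a hedged sketch of the kind of connection pool limits he's referring to, using Go's standard database/sql package: the specific numbers and connection string are my own illustrative assumptions, not the article's recommendations, and a registered Postgres driver (e.g. github.com/lib/pq, imported for its side effects) is assumed but not included here.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// configurePool applies illustrative pool limits; tune these to your workload.
func configurePool(db *sql.DB) {
	db.SetMaxOpenConns(20)                  // cap concurrent Postgres connections
	db.SetMaxIdleConns(10)                  // keep some connections warm for reuse
	db.SetConnMaxLifetime(30 * time.Minute) // recycle long-lived connections
}

func main() {
	// sql.Open is lazy and does not dial the server; it fails here only if
	// no "postgres" driver has been registered via a driver import.
	db, err := sql.Open("postgres", "host=localhost dbname=app sslmode=disable")
	if err != nil {
		fmt.Println("no postgres driver registered:", err)
		return
	}
	defer db.Close()
	configurePool(db)
	fmt.Println("pool limits configured")
}
```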

The next post is "Transparent data encryption in PostgreSQL". This is from highgo.ca, and it was recently announced that this feature is being targeted for PostgreSQL 13. The author says, quote, "I've been working with the PostgreSQL community recently to develop transparent data encryption". The post goes through the rationale behind it and how it might be implemented. So if you're interested in that type of content, definitely a blog post to check out.

Another post from highgo.ca is "A Guide to Create User-Defined Extension Modules to Postgres". So again, if you want to do extension development, this blog post walks through how to get started creating your own extension in C. So if that's of interest to you, definitely a blog post to check out.

The final piece of content is "Trace Query Processing Internals with Debugger". This is from highgo.ca. It goes through the whole process of tracing a query through PostgreSQL. This is another very comprehensive blog post, describing each step a query goes through inside PostgreSQL. So if you want to look at the internals of how PostgreSQL processes queries, this would be another piece of content to check out.
