Postgres 12, Window Frames, Index Types, Synchronous Replication | Scaling Postgres 83
In this episode of Scaling Postgres, we discuss the release of Postgres 12, how to use window frames, the different index types, and when to use synchronous replication.
Content Discussed
- PostgreSQL 12 RC 1 Released!
- Webinar: New Features in Postgres 12 [Follow up]
- PostgreSQL 12: A Few Special-Case Performance Enhancements
- Advanced SQL - window frames
- What DjangoCon has to do with Postgres and Crocodiles. An interview with Louise Grandjonc from Microsoft
- Synchronous Replication is a Trap
- How to run short ALTER TABLE without long locking concurrent queries
- Implementing Transparent Data Encryption in Postgres
- Using “Row Level Security” to make large companies more secure
- New version of PostgreSQL Anonymizer and more...
- [YouTube] PGConf.Brasil 2019 Playlist
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is an announcement that PostgreSQL 12 RC 1 has been released. It looks like PostgreSQL 12 itself is set to be released on October 3, 2019.
Related to this release, there's also a timely webinar called "Webinar: New Features in Postgres 12 [Follow Up]". This is from 2ndquadrant.com, and they go through a number of topics: an intro to PostgreSQL 12, SQL/JSON, improvements to partitioning, REINDEX CONCURRENTLY, progress monitoring, generated columns, case-insensitive collations, and plans for Postgres 13. So, definitely a webinar to watch. I haven't had an opportunity to watch it yet, but I will be registering to look at the replay to get ready for Postgres 12. Definitely a timely piece of content.
The next post, also from 2ndquadrant.com, is "PostgreSQL 12: A Few Special-Case Performance Enhancements". These are short and to the point. The first is minimal decompression of TOAST values. TOAST is where values get stored when they're too large to fit in a single record; it's kind of a spillover area. Prior to 12, Postgres decompressed the entire value, whereas in 12, they say, quote "...we only decompress what we need, which can speed up accessing the TOASTed value". So that could be of benefit for very large values that you may be storing. Another improvement is faster float conversion to text. And the third one is parallel query with SERIALIZABLE, so it's now possible to run parallel queries at the SERIALIZABLE isolation level. So just a few additional performance enhancements for PostgreSQL 12.
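As a sketch of what the TOAST improvement means in practice (the table and column here are hypothetical, not from the post), a query that only needs a prefix of a large compressed value benefits from the partial decompression in Postgres 12:

```sql
-- Hypothetical table with a large text column; big values get
-- compressed and stored out of line in the TOAST table.
CREATE TABLE documents (id int PRIMARY KEY, body text);

-- Prior to Postgres 12, this decompressed the entire body just to
-- read the first 100 characters; in 12, only the needed prefix of
-- the compressed value is decompressed.
SELECT substr(body, 1, 100) FROM documents WHERE id = 1;
```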
The next piece of content is "Advanced SQL - window frames". This is from mjk.space. This is a great article and I definitely suggest you check it out. He notes there was a previous article on window functions, and this one talks about window frames, so areas within a window. He gives a quick overview of what window functions are, which basically allow you to do things like rolling averages and things of that nature. He has these great graphs that represent what he's talking about. For example, he's looking at windows by release year, so you can see a particular window, and it averages the values across that window to give you the year average. The same goes for the next window, which is 2016, and the next window, which is 2017. Then, to expand upon his window example, he wanted a window frame example: quote "For each film, find an average rating of all strictly better films in its release year".
So you're basically doing a comparison among the rows. He gives an example where you have this partition, this window, and then you're looking at the window frames within it. He goes over the syntax and then the three different modes. There's rows mode, which operates at the row level, and all the different ways you can define that frame. He looks at groups mode, again how to define the frame start and the frame end, and then follows up with range mode, and then he breaks that into some real-world examples. Now, I of course don't have time to cover all of this in depth, but this is a great post and I definitely suggest you take the time to check it out if you want to get better with window functions and/or window frames.
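As a rough sketch of the kind of query the article builds up to (the films table and its columns are hypothetical), the "strictly better films" frame can be expressed with an ordered window plus a frame exclusion:

```sql
-- For each film, average the ratings of all strictly better films
-- in its release year: the frame runs from the current row to the
-- end of the partition, and EXCLUDE GROUP drops the current row and
-- its peers (films with the same rating), leaving only higher-rated ones.
SELECT title,
       release_year,
       rating,
       avg(rating) OVER (
         PARTITION BY release_year
         ORDER BY rating
         RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
         EXCLUDE GROUP
       ) AS avg_of_better_films
FROM films;
```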
The next post is "What DjangoCon has to do with Postgres and Crocodiles. An interview with Louise Grandjonc from Microsoft". This is from the citusdata.com blog, and the interview is mostly about a presentation she's going to give called "Postgres index types and where to find them". It starts off with the first of three things she'll be covering in the presentation: what she wants the audience to understand is that Postgres indexes are useful for two reasons, performance and constraints. So you can use a unique constraint to ensure uniqueness and also get better performance when doing lookups.
Then she talks about a second thing: an overview of the different indexing options. You could use partial indexes, unique indexes, and multicolumn indexes as well as standard indexes, and also know when to use the different index types. There's the standard B-tree, but also GIN, which is generally used for JSONB or full-text search, GiST, and then BRIN indexes, which can give huge space savings, particularly for data that is more sequential in nature. She also talks about being a Django developer and the advice she gives to fellow developers in terms of learning Postgres. So there's a lot of valuable information for developers using PostgreSQL. If you're a developer, I definitely suggest you check out this piece of content to see how you could potentially improve working with PostgreSQL.
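To make those index types concrete, here is a hedged sketch (all table and column names are made up for illustration, not taken from the interview):

```sql
-- Standard B-tree, the default; a unique one also enforces a constraint:
CREATE UNIQUE INDEX idx_users_email ON users (email);

-- Partial index covering only the rows a hot query touches:
CREATE INDEX idx_orders_pending ON orders (created_at)
  WHERE status = 'pending';

-- GIN index for JSONB containment or full-text search:
CREATE INDEX idx_events_payload ON events USING gin (payload);

-- BRIN index for large, naturally ordered data such as append-only
-- timestamps; tiny on disk compared to a B-tree:
CREATE INDEX idx_logs_created ON logs USING brin (created_at);
```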
The next post is "Synchronous Replication is a Trap". This is from rhaas.blogspot.com. What does he mean by a trap? Don't just rely on synchronous replication to ensure that your data is safe on two systems; you need to take into account the holistic system you're developing to ensure that data doesn't get lost. He uses an example where a user inputs data into a web application that then talks to a database system, and there are different failure points along the way. Just implementing high availability or synchronous replication will not automatically make sure everything's safe. You have to think more holistically about the system to actually avoid losing data. He just wants to make sure people are using the features appropriately.
For example, one thing he says here is quote "I have few reservations about the use of synchronous replication for data consistency". That means setting the master's synchronous_commit to remote_apply and setting synchronous_standby_names to a value that causes it to wait for all standbys to respond to every COMMIT. This ensures commits are applied across all the synchronous replicas before returning, so you truly shouldn't lose any data. But there are, of course, downsides he discusses: if one of the replicas goes down, your whole system is down, because it cannot synchronously replicate to that replica. So you have to have a process in place to handle those conditions so that your processing can still continue.
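As a rough sketch of that configuration (the standby names are hypothetical, and this is not copied from the post):

```
# postgresql.conf on the primary -- a hedged sketch
synchronous_commit = remote_apply
# With two standbys, waiting on FIRST 2 means waiting for all of them
# to respond before a COMMIT returns:
synchronous_standby_names = 'FIRST 2 (standby1, standby2)'
```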
He also mentions, quote "I also don't think it's a big problem to use synchronous replication to control replication lag". Basically, you can get a smaller delay when replicating to a replica by using synchronous replication. But the issue he mostly covers, he says, is, quote "Where I think a lot of people go wrong is when they think of using synchronous replication for data durability", so basically a reliable system that doesn't lose data. That's when he goes into the discussion about looking at it as a holistic system, where this is just one feature that may or may not help you accomplish that overall goal. Overall this is a great blog post, as is almost every blog post he does, so I definitely suggest checking it out.
The next post is "How to run short ALTER TABLE without long locking concurrent queries". This is from depesz.com. This is a case he recently saw where an application had to add a column to a table. The application ran ALTER TABLE ADD COLUMN without a default and everything stopped for many minutes. Basically, there was some long-running query, and when the ALTER TABLE started running it requested an ACCESS EXCLUSIVE lock. But there's a lock queue, and things started backing up behind this lock waiting for it. So what you need to do in this case is use a timeout. Now, he talks about using statement_timeout, but as mentioned in the comments, and he agrees, what would be more appropriate is the lock_timeout setting. So generally, when you do these types of DDL changes, definitely use a lock_timeout of some number of milliseconds or seconds; then if the statement cannot acquire the lock, it just errors out and doesn't run the DDL. Definitely something to keep in mind when making changes to your database.
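A minimal sketch of that pattern (the table and column names are hypothetical):

```sql
-- Fail fast instead of queueing other sessions behind the
-- ACCESS EXCLUSIVE lock for minutes:
SET lock_timeout = '2s';

-- If the lock can't be acquired within 2 seconds, this errors out
-- and can simply be retried later, while concurrent queries keep running.
ALTER TABLE accounts ADD COLUMN note text;
```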
The next couple of posts are related to security. The first one is very brief, but it talks about "Implementing Transparent Data Encryption in Postgres". This is from momjian.us. He says this is something that has been discussed for a while, and now it looks like it's actually moving forward: they're going to implement a transparent data encryption method, and the first thing they're going to tackle is all-cluster encryption. The hope is, quote, "...this can be completed in Postgres 13". So we're already looking ahead to version 13 and what could potentially be in it. It looks like this is something they're going to try for.
The next post related to security is "Using 'Row Level Security' to make large companies more secure". This is from cybertec-postgresql.com. They set up two tables, a company table and a manager table, define a relationship between them, and then define a policy based on a query of those tables to determine who has access to what data in the system. They grant various roles and then show how different roles querying the same table see different outputs. Now, one thing they mention is that a query like this basically needs to be run to check the policy every time, so you could run into performance issues. It's just something to keep in mind if you're investigating row-level security.
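A hedged sketch of the pattern (the table and column names here are made up, not copied from the post):

```sql
CREATE TABLE t_company (id int PRIMARY KEY, name text);
CREATE TABLE t_manager (company_id int REFERENCES t_company (id),
                        manager    text);

ALTER TABLE t_company ENABLE ROW LEVEL SECURITY;

-- Each manager role sees only the companies assigned to it.
-- Note the subquery runs as part of the policy check on every row
-- access, which is where the performance concern comes from.
CREATE POLICY company_policy ON t_company
  USING (id IN (SELECT company_id FROM t_manager
                WHERE manager = current_user));
```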
The next post is "New version of PostgreSQL Anonymizer and more...". This is from blog.taadeem.net. This is talking about PostgreSQL Anonymizer, a tool they've developed that enables you to anonymize data. So if you have a production database, it lets you scrub it and create an anonymized version of it, blanking out or replacing names, email addresses, contact information, things of that nature. With rules like the GDPR in place, this could be important. The use case he envisions is that if you have a staging system and want to put production data in it, you can do that if you anonymize it first, so you don't have to worry about real production data sitting on a separate system. If you're interested in tools like this, it's definitely a post to check out.
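As a hedged sketch of how this style of declarative masking typically looks with the extension (exact function names may vary between versions, and the table and column are hypothetical):

```sql
-- Assumes the PostgreSQL Anonymizer extension is installed.
CREATE EXTENSION IF NOT EXISTS anon CASCADE;

-- Declare a masking rule: replace real emails with fake ones.
SECURITY LABEL FOR anon ON COLUMN customer.email
  IS 'MASKED WITH FUNCTION anon.fake_email()';

-- Rewrite the masked columns in place to produce the anonymized copy.
SELECT anon.anonymize_database();
```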
The last piece of content: there was PGConf.Brasil 2019, and they've just uploaded a playlist of what looks like 22 videos to the InfoQ Brazil YouTube channel. As far as I could tell, all of these presentations are in Portuguese, so I was unable to understand them. But I do know we have an international audience, so if that's of interest to you, these are definitely videos to check out.