Postgres Releases, Performance Discussion, Index Types, Index Power Use | Scaling Postgres 178
In this episode of Scaling Postgres, we discuss new releases for Postgres & PgBouncer, have a performance discussion, and describe the different index types and how to use them.
Content Discussed
- PostgreSQL 13.4, 12.8, 11.13, 10.18, 9.6.23, and 14 Beta 3 Released!
- PgBouncer 1.16.0
- Rubber Duck Dev Show Episode 8 | Database Performance
- Using Postgres CREATE INDEX: Understanding operator classes, index types & more
- Sergey Kuzmichev and Jobin Augustine - Power Use of Indexes in PostgreSQL - A User Perspective
- PostgreSQL vs Redis vs Memcached performance
- Impact of Network and Cursor on Query Performance of PostgreSQL
- Noset: PostgreSQL extension for blocking SET and RESET execution
- Migrating PostgreSQL to Kubernetes
- Horizontal Scalability Options in PostgreSQL
- Julian Markwort
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 13.4, 12.8, 11.13, 10.18, 9.6.23, and 14 Beta 3 Released!". This is from postgresql.org. Now, this release does include a security fix for a memory disclosure issue in certain queries. It looks like the known vulnerability only applies if you have max_worker_processes set to zero, but they say it's possible that other settings could be affected as well; that's just not known at this time. Basically, this is a security issue you should upgrade to address, along with over 75 bug fixes and improvements that were made across these different versions. The other thing they mention is that PostgreSQL 9.6 is End of Life as of November 11 of this year, so be prepared for that if you have not upgraded to at least version 10. So this is another opportunity to go ahead and upgrade to fix some issues with Postgres, as well as to test out Postgres 14 Beta 3; the final release should come sometime in the fall.
The next piece of content is that PgBouncer 1.16.0 has been released. This is from pgbouncer.org. The main feature is that you can now adjust TLS settings without having to restart PgBouncer; you can just do a reload, so it's not going to drop any connections. They also mention that the max length of passwords has been increased to accommodate some cloud services, as well as better cancel request handling. It has a number of bug fixes as well, and you can click on the changelog link to find out more.
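Just to give you a sense of what that reload looks like in practice, here's a minimal sketch; the port and admin user are the common defaults, so yours may differ:

```sql
-- Connect to PgBouncer's admin console (assuming the default port 6432
-- and an admin user named "pgbouncer"):
--   psql -p 6432 -U pgbouncer pgbouncer

-- Re-read the configuration files, picking up changed settings such as
-- TLS options, without dropping existing client connections:
RELOAD;

-- Verify the currently active settings:
SHOW CONFIG;
```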
The next piece of content is that we did our Rubber Duck Dev Show presentation on "Database Performance" this past week. We discussed database performance in general, but at about the 15-minute mark of the episode (it's about an hour in length), we go into a deep dive on PostgreSQL performance, although much of it is applicable to other relational database systems. I went through the process I tend to use to optimize database performance and the things to look out for, and spent a lot of time, of course, talking about indexes. So if you're interested in that, you can go ahead and check out this piece of content.
The next piece of content is "Using Postgres CREATE INDEX: Understanding operator classes, index types, & more". This is from pganalyze.com and they're talking about indexes in Postgres. So this is a good complement to the previous presentation. But they have a focus on operators in the first part of the post where they're talking about how the parser uses them to determine the best plan that it's going to follow. It says how index types are related to that, in that certain operators work with certain index types. So for example, a hash index is mostly for equalities, whereas a B-tree can do equal to greater than, et cetera, as well as you can't do any sort of contains with a B-tree index, but you can with a gin index. So the operator kind of dictates what kind of index you can use, and they give some examples of that.
Then they go into the different index definitions that are possible: a B-tree index on a single column or multiple columns, functional or expression indexes, such as taking the lower() of an email (which we also discussed in our presentation), and partial indexes, which only index part of the data in a table. They also talk about index-only scans and using the newer INCLUDE clause to add additional columns to an index. They follow up with advice that when you're adding and removing indexes on a large, active database, you definitely want to do it concurrently so you don't lock up the table. So if you're interested in learning more about that, you can check out this blog post.
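Here's a similar hedged sketch of those variations, again with hypothetical table and column names:

```sql
-- Hypothetical table for illustration:
CREATE TABLE users (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    email      text,
    first_name text,
    last_name  text,
    created_at timestamptz,
    active     boolean
);

-- Expression (functional) index, e.g. for case-insensitive email lookups:
CREATE INDEX CONCURRENTLY users_lower_email_idx ON users (lower(email));

-- Partial index: only index the subset of rows you actually query:
CREATE INDEX CONCURRENTLY users_active_idx ON users (created_at) WHERE active;

-- INCLUDE extra columns so a lookup can be an index-only scan:
CREATE INDEX CONCURRENTLY users_email_covering_idx
    ON users (email) INCLUDE (first_name, last_name);

-- Removing an index on an active table, also without locking writes:
DROP INDEX CONCURRENTLY users_active_idx;
```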
The next piece of content- "Power Use of Indexes in PostgreSQL - A User Perspective". This is a video on YouTube on the Percona YouTube channel and again more discussion about indexes and how they affect performance. This approximately hour-long presentation goes into all of the different index types and what they do, what their capabilities are from gin to BRIN, Bloom, GIST, the B-tree, etc. So if you want a great overview of all the different index types and how they can be used to help improve PostgreSQL performance, you can check out this blog post.
The next piece of content- "POSTGRESQL VS REDIS VS MEMCACHED PERFORMANCE". This is from cybertec-postgresql.com and he's talking about an article he recently discovered that talked about Redis versus Memcache performance and he asked the question on how storing this data in Postgres be compared to storing it in something like Redis and Memcached? He talked about his setup here he's doing a few things such as using an unlocked table and using the asynchronous mode to try and get a better performance to match what you would see in Redis or Memcached. The blog post that he looked at compared Memcached and Redis based on the number of records inserted and the speed both from a write and from a read operation perspective. So he replicated that for Postgres. Now, unfortunately, with these results, I have to agree that this is a little bit of an apple and orange comparison because he did not 100% know how they did the Redis and Memcache tests.
Frankly, what I think would have been more beneficial is if he had actually installed Redis and Memcached and used his exact same test script to test them in the same way as PostgreSQL. I think that would have been a much better comparison of reads versus writes, because you do see quite a disparity in the read times and how they're significantly better for Postgres. So I suspect something is different, because again, he's taking those test results from a blog post, whereas he produced the Postgres results himself. So I think there's some discrepancy there. But this blog post was interesting in terms of seeing how he did the test and how he tried to approximate what Redis and Memcached are doing, basically by removing Postgres features. When you have an unlogged table, you're not writing those changes to the WAL, so that automatically gives you a boost in performance. It also looks like he only inserted a row at a time; I wonder what would happen if he tried to insert more than one row at a time, because that should give better performance. So it's an interesting blog post, but I do think it's a bit of an apples to oranges comparison.
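To give you a sense of the approach, here's a rough sketch of the kind of stripped-down, cache-like setup he describes; the table itself is my own example, not his exact script:

```sql
-- An unlogged table skips WAL writes entirely; its contents are lost
-- after a crash, which is similar to a cache:
CREATE UNLOGGED TABLE kv (
    key   text PRIMARY KEY,
    value text
);

-- Asynchronous commit: don't wait for the WAL flush at commit time:
SET synchronous_commit = off;

-- Cache-style write (upsert) and read:
INSERT INTO kv (key, value) VALUES ('user:1', 'payload')
    ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value;
SELECT value FROM kv WHERE key = 'user:1';
```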
The next piece of content is "Impact of Network and Cursor on Query Performance of PostgreSQL". This is from percona.com and they're talking about how the network or the speed of the network can have an impact on your query times. So for example, based upon the amount of data you're transferring to a client or to your application servers from the database server is going to have an impact on performance. So for example, they were bringing over a large amount of data and they did an analysis of query times saying, 'Okay, what happens if we run this query on the database host itself?'. You can see most of the work here, they're saying it was done in about 1 second, around 1,069 milliseconds, but the log reported it was complete in about 1.6 seconds. So basically 0.6 seconds slower than what it actually took to run it looks like. They were trying to understand the discrepancy. When you look at executing from a remote application, the timings go from again around that 1 second to over 6 and a half seconds, so significantly longer.
So next, instead of pulling all the data, they just pulled the max value of one column, so the amount of data being transferred is much smaller. When running on the host itself, the runtime for actually executing the query and what's reported in the log are nearly identical, and from the application host it's still similar, nearly identical. So it looks like the time differences are all due to the amount of data having to be transferred to the client. They examined this using their pg_gather script, but this wait information is also available in the pg_stat_activity view. What they see is that the majority of the wait time is due to client writes, that is, writing data to the client. So all this discrepancy appears to be due to just the amount of data that has to be written to the application's client. Then they took a look at the impact of cursors, starting with client-side cursors. My understanding is that they're pulling all the data down, or a big subset of it, and iterating through that data on the client itself, so the database isn't impacted while they're going through each record.
But if you're using a server-side cursor, you get a little bit of a false sense of security, because what shows up in the log is just the initial fetch and not the subsequent fetches. So you want to be cautious of that. And again, with the cursors, they saw a lot of client reads happening. Basically, this post is about being aware of how much data you're transferring from the database to your application servers, what the distance is, and what the bandwidth is; the faster and lower-latency the connection, the better performance you'll get. A few other things they mention: try to avoid selecting all the columns from a table, to minimize data transfer, and only select the columns you need. Also avoid fetching a large number of rows at once, because that's going to impact how responsive the application can be. They also suggest avoiding server-side cursors whenever possible and doing the iteration on the application itself: pull back a set of records and iterate over it on the client. So, definitely an interesting blog post and things to take into account as you're working with your application.
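If you want to check for those client waits on your own system, the pg_stat_activity view is enough; here's a simple query along those lines:

```sql
-- Sessions currently waiting on the client, i.e. the server is either
-- sending result data (ClientWrite) or waiting for the next command
-- or fetch (ClientRead):
SELECT pid, state, wait_event_type, wait_event,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Client';
```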
The next piece of content- "Noset: PostgreSQL extension for blocking SET and RESET execution". This is from ongres.com. It is interesting that they're talking about a new extension that they created called NOSET. What this does is a user, when they're using a PostgreSQL session, can actually alter different parameters of that session, including increasing their work memory or maybe increasing max_parallel_work_per_gather. You could also impact the resources of the server and its availability. So there may be cases where you want to limit a user being able to set these things so you're not hogging all the resources of the server. So that's exactly what this extension does and they talk about how to install it and how to enable it for your database. So if you're interested in that, you can check out this blog post.
The next piece of content- "Migrating PostgreSQL to Kubernetes". So this is from percona.com and they're talking about how if you have a Postgres server, how you can convert it to using Kubernetes. Now they're using their own operator, the Percona distribution for PostgreSQL operators. Different organizations have their own Kubernetes operator, essentially, and they're using their Percona-based one. And the technique that they're using is basically log shipping. So you have a Postgres server set up and it's going to ship its logs to a bucket, in this case, Google Cloud Services.
Then the Kubernetes-managed cluster is going to pick up those WAL files and apply them to the database cluster or clusters running in Kubernetes, and that's how you get things in sync. Now, it's interesting that they didn't describe using, say, streaming replication or logical replication, but those should also be viable paths. This particular example uses log shipping via pgBackRest; maybe that's because the Percona operator has pgBackRest installed as a means to manage the cluster, and that's why they did it that way. But anyway, they go through the whole process and give you the documentation on how to do it and cut it over. So if you're interested in that, you can check out this blog post.
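As a rough sketch of the source-side piece of that log shipping (the stanza name here is hypothetical, and the pgBackRest repository pointing at the bucket would be configured separately in pgbackrest.conf):

```sql
-- On the source Postgres server, archive WAL via pgBackRest so the
-- Kubernetes cluster can restore from the shared repository:
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'pgbackrest --stanza=main archive-push %p';

-- Note: changing archive_mode requires a server restart;
-- archive_command only needs a reload:
SELECT pg_reload_conf();
```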
Next piece of content- "Horizontal Scalability Options in PostgreSQL". This is from highgo.ca. They're talking about the different ways you can horizontally scale Postgres. The first is read scalability: you can have one primary database with many replicas and scale your reads across those replicas. You can also do read-write scalability, and in this scenario they're using Postgres's Foreign Data Wrappers. So basically you're sharding your database into multiple instances and then using a Foreign Data Wrapper to read from and write to each one.
Now, there's not a convenient way to do this with Postgres yet, although we're moving towards that, so this is definitely something that takes a bit of work to set up. Another option, and I would say the most popular one out there right now, is the Citus extension, which allows easy horizontal scalability, although there are some costs if you want all of its features, of course. The other option they mention is logical replication, since it is possible to copy data from one database to another and logically replicate it. So if you want to learn more about how to horizontally scale Postgres, you can check out this blog post.
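Here's a minimal sketch of that Foreign Data Wrapper approach, assuming a hypothetical shard server; the hostnames, credentials, and table names are all my own examples:

```sql
-- postgres_fdw ships with Postgres as a contrib extension:
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point at a shard instance:
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.internal', dbname 'app');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'app', password 'secret');

-- Reads and writes against this local table are routed to shard1:
CREATE FOREIGN TABLE orders_shard1 (
    id bigint, customer_id bigint, total numeric
) SERVER shard1 OPTIONS (table_name 'orders');
```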
The last piece of content- the PostgreSQL person of the week is Julian Markwort. So if you're interested in learning more about Julian and his contributions to Postgres, definitely check out this blog post.