Performance Impressions, Mystery Solving, GiST Performance, K-Anonymity | Scaling Postgres 89
In this episode of Scaling Postgres, we discuss Postgres 12 performance impressions, mystery solving, GiST and SP-GiST performance and k-anonymity.
Content Discussed
- PostgreSQL v12 initial query performance impressions
- Making Mystery-Solving Easier with auto_explain
- PostgreSQL 12: Implementing K-Nearest Neighbor Space Partitioned Generalized Search Tree Indexes
- Achieving k-anonymity with PostgreSQL
- GZip in PostgreSQL
- PostgreSQL Query Plan Visualizer
- Streaming Replication Setup in PG12 – How to Do it Right
- Replication Failover with pg_rewind in PG12
- Configure HAProxy with PostgreSQL Using Built-in pgsql-check
- Introduction and How-To: etcd clusters for Patroni
- PostgreSQL 12 Generated Columns
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. The first piece of content is "POSTGRESQL V12 INITIAL QUERY PERFORMANCE IMPRESSIONS". This is from cybertec-postgresql.com. They've run the same kind of performance checks they've done for previous releases, this time for version 12. They're comparing version 11 to 12, predominantly using pgbench and, I believe, a number of analytical queries they've come up with. They go through their testing methodology, the different setups they used, the hardware and software, the configuration options they set, and then they get into the results. They also had a scale indicator, which tells you whether a particular workload was small enough to fit in shared buffers, whether it would consume more of the machine's RAM, or whether it was so large it would probably spill onto disk.
So those are the different scales they used. Then they show the mean time for version 11 and version 12, the mean time difference, as well as the standard deviations. There were some cases where 12 was faster, those would be the negative percentages here, but overall it was a little bit slower than 11. He took a total average, and version 12 came out about 3.3% slower. It was interesting looking at some of the queries, though. A lot of these are very fast, less than one millisecond, so it's hard to gauge speed at that resolution. But even among the analytical queries, one was 0.8% slower, one was 1% faster, one was 8% slower, and another 3% slower. So just looking at the analytical queries he developed, version 12 is a little bit slower. The other interesting thing mentioned was the inserts being slower: 20% slower for the pgbench default insert, and 12% slower for the pgbench extra-index insert.
Now, it was mentioned that perhaps this is due to the space improvements to the indexes in version 12, that maybe it takes a bit more time to get the data inserted. But what I also found interesting is that the standard deviation is huge, like 177%. Looking at the average time, it's 0.4 milliseconds for this insert, but the standard deviation was 0.02. The same thing with the extra indexes: it's 0.05 in terms of speed, but the standard deviation is 0.02. So there's a huge range in terms of the inserts, and I'm not quite sure what that is. He doesn't offer a theory from what I saw here. Now, he does mention these are predominantly artificial tests, so you have to take everything with a grain of salt. But definitely consider investigating insert performance as you move to version 12 to see if there are any differences from version 11. So overall, a very interesting blog post, and I encourage you to check it out.
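If you want to run a rough check like this on your own hardware, the basic shape of a pgbench comparison looks something like the sketch below. The scale factor, client count, and duration are placeholders, not the values from the post, and you would run the same commands against a version 11 and a version 12 instance to compare.

```
# Initialize the pgbench tables (scale factor here is just a placeholder)
pgbench -i -s 100 benchdb

# Read-only run: 8 clients, 2 threads, 5 minutes, with per-statement latencies
pgbench -S -c 8 -j 2 -T 300 -r benchdb

# Default TPC-B-like run, which includes the INSERT into pgbench_history
pgbench -c 8 -j 2 -T 300 -r benchdb
```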
The next post is "Making Mystery-Solving Easier with auto_explain". This is from richyen.com. So they had a situation where they were working with a customer and they tried all these variations to get better performance after migrating from Oracle to PostgreSQL. So they said they adjusted shared buffers, and tried swapping the JDBC drivers they were using. They perhaps thought it was the replica, but just using the primary database had no impact. So what they did is they set the log_min_duration_statement to 0 to log all the queries and get some timing. It looked like it was an UPDATE statement. So we took one of the queries and ran it through EXPLAIN ANALYZE and didn't see an issue. Now, I don't know if it's the exact query that was hitting the database as we'll see lower down here, but they took one of the example queries, ran it through EXPLAIN ANALYZE and it was using the index.
There was seemingly no problem. So what they did was use auto_explain, which is an extension you can put in your shared_preload_libraries and then configure. They set auto_explain.log_min_duration to 0 and turned on auto_explain.log_analyze, so it gives you the actual timings and row counts as well as the cost estimates. When doing that, they noticed that the filter being applied was converting the integers into doubles. It would cast the column to double precision, cast the input value to double precision as well, and do the same with the second column and second value. Of course this caused a sequential scan, and it was no longer using the index.
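To see why that cast is so costly, here's a minimal, hypothetical illustration (not the customer's schema): once the indexed integer column is cast to double precision in the WHERE clause, the btree index on it can no longer be used and the planner falls back to a sequential scan.

```sql
-- Hypothetical table for illustration only
CREATE TABLE accounts (id integer PRIMARY KEY, balance numeric);

-- Uses the primary key index
EXPLAIN SELECT * FROM accounts WHERE id = 42;

-- Casting the column, which is effectively what the third-party layer did,
-- forces a sequential scan because the index is on the integer value
EXPLAIN SELECT * FROM accounts WHERE id::double precision = 42::double precision;
```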
They finally traced it down to a third-party interface that added an execute update method for database connections, and, as they say here, it "...basically converted all the int arguments to double precision...", which prevented it from using the index that was there. So the additional software sitting between the application and the database was adding this and causing the performance problems. Definitely something to be aware of, and auto_explain is a tool, an extension, that can help diagnose it. I do wonder, though, whether with log_min_duration_statement it would have been possible to capture more detail and diagnose this earlier without having to use auto_explain. But definitely something to keep in mind if you suddenly run into performance problems.
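If you want to set auto_explain up the way they describe, a minimal postgresql.conf sketch looks something like this; logging every statement with analyze adds overhead, so you'd normally only run it this way while troubleshooting.

```
# postgresql.conf -- changing shared_preload_libraries requires a restart
shared_preload_libraries = 'auto_explain'

# Log the plan for every completed statement (0 = no minimum duration;
# raise this threshold in production to limit logging volume)
auto_explain.log_min_duration = 0

# Include actual timings and row counts, like EXPLAIN ANALYZE
auto_explain.log_analyze = on
```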
The next article is "PostgreSQL 12: Implementing K-Nearest Neighbor Space Partitioned Generalized Search Tree Indexes". This is from 2ndquadrant.com, and basically they're looking at performing a nearest-neighbor search, a simple linear distance using the distance operator, and seeing which indexes give you the best performance. First, they didn't use any indexes at all and just retrieved the five closest locations to a set of coordinates with the data they had on hand; it took what looks like about 2 seconds to return the results. Then they added a GiST index to the location column, which took about three minutes to build. But after they did that, the query ran in about one millisecond, so definitely a lot faster with the GiST index.
Then they tried an SP-GiST index, again on the location column. This one ran in 0.358 milliseconds, so more than twice as fast. They have a table of the comparisons: unindexed (I'm going to ignore the estimates) the actual query time was 2.5 seconds, with a GiST index it was about a millisecond, 0.939, and with the SP-GiST index it was 0.358. In addition, the index size was smaller with the SP-GiST index compared to the GiST index, and the creation time was about a minute and a half, so half the time of the GiST index. So if you're using K-nearest neighbor searches or looking for location distances, definitely consider an SP-GiST index as opposed to just a GiST index. If you want to learn more about how they set up the data, ran the test, and the results, it's definitely a blog post to check out.
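The rough shape of that kind of test, using a hypothetical table with a point column rather than the post's exact schema, looks like this; ordered (K-nearest-neighbor) scans on SP-GiST indexes are one of the version 12 additions.

```sql
-- Hypothetical table for illustration
CREATE TABLE geonames (id bigint, name text, location point);

-- Five nearest rows to a given point, using the <-> distance operator
SELECT name, location <-> point '(13.4, 52.5)' AS distance
FROM geonames
ORDER BY location <-> point '(13.4, 52.5)'
LIMIT 5;

-- GiST index on the point column
CREATE INDEX geonames_location_gist ON geonames USING gist (location);

-- SP-GiST index; in PostgreSQL 12 it can also satisfy the ORDER BY ... <-> above
CREATE INDEX geonames_location_spgist ON geonames USING spgist (location);
```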
The next post is "Achieving k-anonymity with PostgreSQL. This is from blog.taadeem.net. So this is talking about the PostgreSQL anonymizer. They say that they use various different strategies to anonymize data to make it easier to look at without being able to identify a single individual. They mentioned using randomization, faking, partial destruction, shuffling, and noise addition. Well, now they're talking about generalization. So in terms of generalization, it's taking certain data and making it more general. Like for example, in the case of a zip code, give a zip code range. So it's not specific, but it's a zip code range. They live somewhere within an area, or a birth date, maybe you give a range of years that that birth date would fall, but not the actual birth date. Or you could even say the year that they were born would be another way to give something more general.
So they give an example of this generalization technique where you have an SSN, a first name, zip code, birth date, and disease. What they did was get rid of the Social Security number and redact the first name, so it was removed. Then they used ranges for the zip code and ranges for the birth date, but kept the disease, so you can analyze that disease based on the zip code and birth ranges without being able to identify the specific individual who had it. Then they talked about k-anonymity, which, as they say here, is "...an industry-standard term used to describe the property of an anonymized data set". Basically, it measures how hard it would be for someone to re-identify an individual from the data set. You specify which columns are being used as indirect identifiers.
In this case that's the zip code and the birth range, and you can run a function that gives you a k factor. Here it's 3, which means for any combination of zip code and birth range you will find at least three individuals in the generalized patient table; any one grouping has at least three individuals. As the factor gets closer to one, it means you can identify a given individual from those combinations. The more columns you have in a table, for example, the more likely you are to drop your k down to where you could actually identify an individual from the data. So basically you want to keep that value above one to keep the data anonymized. So definitely an interesting blog post and an interesting tool, given the different laws being enacted with regard to privacy.
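As a sketch of what that looks like with the PostgreSQL Anonymizer extension, assuming the security-label syntax and k_anonymity function it documented at the time of the post (check the current docs before relying on this):

```sql
-- Assumes the anon extension is installed; table and column names follow
-- the post's generalized patient example
CREATE EXTENSION IF NOT EXISTS anon CASCADE;

-- Declare which columns are indirect identifiers
SECURITY LABEL FOR anon ON COLUMN generalized_patient.zipcode
  IS 'INDIRECT IDENTIFIER';
SECURITY LABEL FOR anon ON COLUMN generalized_patient.birth
  IS 'INDIRECT IDENTIFIER';

-- Returns the k factor: the smallest number of rows sharing any combination
-- of the indirect identifiers (3 in the post's example)
SELECT anon.k_anonymity('generalized_patient');
```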
The next post is "GZip in PostgreSQL". This is from the cleverelephant.ca blog by Paul Ramsey and he's talking about a very simple extension that he set up that supports doing gzip. Now this is not something to use to compress the data you're typically storing in PostgreSQL, but it just does, as he says here, a binding of zlib to SQL. So basically if you have a process where you need to extract data from PostgreSQL and then zip it up for consumption as a part of an API or you're receiving gzip data that you want to insert into PostgreSQL, this extension gives you a means to do that. So if you're interested in it definitely a blog post and a new extension to check out.
The next post is "PostgreSQL Query Plan Visualizer". This is from theartofpostgresql.com. So when you're analyzing a query plan, it gives the representation to you in text, but there's actually a PostgreSQL Explain Visualizer tool that's been developed and it's open-source. You could install it yourself, but on this actual blog post they did that and they're calling it the Explain Plan Visualizer. So you could paste your plan in here, and place the query that resulted in this plan. It will give you a graphical representation of it. So presumably something easier to read. So if you want to look into that, maybe check out this blog post as well as maybe the separate PEV tool that you can install on your website.
The next post is "Streaming Replication Setup in PG12 - How to Do it Right". This is from highgo.ca and this blog post just basically goes through how to set up streaming replication in PostgreSQL 12. Given the changes that have been made to the recovery.conf file. The fact that it's gone and they're now using the two different signal files. So this is another piece of content if you need to set that up.
The second related post is "Replication Failover with pg_rewind in PG12". This is from highgo.ca. Again, with the recent changes in PostgreSQL 12, this goes through the process of replication failover with pg_rewind, or what I call failing back. Once you've failed over and a replica has become the new primary, if you want to bring the old primary back in line behind it, pg_rewind helps you accomplish that process, and they go through how to do it in detail. So if you're interested in that, definitely a blog post to check out.
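The core of that failing-back step is a single pg_rewind run against the stopped old primary; paths and hostnames below are placeholders, and note that pg_rewind requires wal_log_hints = on or data checksums to have been enabled beforehand.

```
# Run on the old primary after it has been stopped; resynchronizes its data
# directory with the new primary so it can rejoin as a standby
pg_rewind --target-pgdata=/var/lib/postgresql/12/main \
    --source-server='host=new-primary.example.com user=postgres dbname=postgres' \
    --progress

# Then set primary_conninfo, create standby.signal, and start it back up
```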
The next post is "Configure HAProxy with PostgreSQL Using Built-in pgsql-check". This is from percona.com. Now, we've covered the previous article related to this HAProxy using Xinetd, and I believe they didn't use pgSQL check in that case. They use a custom script to do it, which to me looking at that post seems like a better way to do it compared to this one. But this one their objective was to use the built-in pgSQL check, but in order to do that, it doesn't really do authentication to verify that it can connect successfully in order to check whether a database system is in replication mode or not. So they actually set it up to use this tool using modifications to the pg_hba.conf file, which I'm not necessarily a fan of, but it is a way to use it. This blog post explains how to do these changes so that you can do failover switchover procedures using HAProxy and the built-in pgSQL check tool. So if you're interested in doing that is the blog post to check out.
Next post is "INTRODUCTION AND HOW-TO: ETCD CLUSTERS FOR PATRONI". This is from cybertec-postgresql.com. So if you use Patroni and want to use etcd for your consensus holder, this is a pretty long comprehensive blog post that explains etcd and how to set it up for that purpose.
Then the last post is "PostgreSQL 12 Generated Columns". This is from fluca1978.github.io. It's another blog post that talks about the new generated column features in PostgreSQL 12.
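As a quick flavor of the feature, a stored generated column in version 12 looks like this (this mirrors the example in the PostgreSQL documentation rather than the post itself):

```sql
CREATE TABLE people (
    height_cm numeric,
    -- Computed from height_cm on write; it cannot be inserted or updated directly
    height_in numeric GENERATED ALWAYS AS (height_cm / 2.54) STORED
);

INSERT INTO people (height_cm) VALUES (180);
SELECT * FROM people;  -- height_in comes back as roughly 70.87
```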