Planner Optimization, Fish to Elephants, Large Databases, Spatial Data | Scaling Postgres 213
In this episode of Scaling Postgres, we discuss ways you can optimize the planner, becoming part of the Postgres Community, considerations for large databases and mapping three different types of spatial data.
Content Discussed
- Optimizing with the PostgreSQL deterministic query planner
- 5mins of Postgres E15: Tuning Postgres' Deterministic Query Planner, Extended Statistics and Join Collapse Limits
- Teaching Elephants to Fish
- Working With Large PostgreSQL Databases
- Using Uber's H3 hex grid in PostGIS
- Mapping Fracking Wells in Colorado with PostgreSQL
- Analyzing historical flight data with MobilityDB
- Global deadlock in a distributed database cluster
- Formatting SQL code with pgFormatter within Vim
- Easy Postgres Major Version Upgrades Using PGO v5.1
- Philippe Beaudoin
- Rubber Duck Dev Show Episode 40 | To DRY or Not To DRY
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Optimizing with the PostgreSQL deterministic query planner". This is from aha.io, and they're talking about the perspective that quote "The PostgreSQL query planner will always discover the optimum query plan, given its statistical knowledge of your system and your data". Now, I usually don't deal with absolutes, so I'm a little hesitant to say always, but if you give the query planner as much information as you can, it will find the right plan the vast majority of the time. Does it do it 100% of the time? I'm not willing to go quite that far, but this is a great post that describes different ways you can give more information to the planner to optimize your queries.
So the first area he suggests changing is setting the right cost for page access. Here they're talking about the ratio between the sequential page cost and the random page cost. This is something you adjust in the configuration parameters of Postgres, and by default the random page cost is set to four while the sequential page cost is set to one. The reason it's set this way is the assumption that random reads from a magnetic hard disk are going to be about 40 times slower than a sequential read and that maybe 90% of your disk pages are cached in memory, which works out to that default 4-to-1 ratio. What they basically did is re-engineer this assumption using that formula and say, well, we're not using magnetic disks, of course; they're using NVMe drives or something equivalent, some sort of fast network-based storage.
So what is the appropriate ratio? Well, their cache hit rate is close to 99%, and they're assuming a roughly 5x ratio instead of 40x for the cost of a random access vs. a sequential access. Computing that into the same formula, the random page cost, instead of being 4, should be something like 1.05. So that's what they set it to. And they saw a marked decrease in full table scans happening, except on very small tables. That, of course, resulted in better performance, because the system now assumes that an index scan with its random page accesses will be faster than doing a full sequential table scan.
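As a rough sketch, if you wanted to try the same adjustment on your own instance, it would look something like this. The 1.05 value is the one derived in the post; you'd want to recompute it for your own hardware and cache hit rate.

```sql
-- Lower random_page_cost to reflect fast SSD/NVMe storage and a high cache
-- hit rate. 1.05 is the value derived in the post; the default is 4.
ALTER SYSTEM SET random_page_cost = 1.05;

-- Reload the configuration so the change takes effect without a restart.
SELECT pg_reload_conf();

-- Check the resulting ratio against seq_page_cost (default 1).
SHOW random_page_cost;
SHOW seq_page_cost;
```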
The second area he looked at was setting up statistics for dependent columns. They have a query here that filters a particular table on an account ID, product ID, and initiative ID. Postgres considers the values in these columns statistically independent, so when you filter on all three, it's potentially going to try to do a bitmap merge of all three indexes to get the answer. But the reality is these columns are not independent, and the data looks more like this from their perspective. In other words, everything is encompassed within an account ID, the product ID is a part of it, and the initiative ID is a significant part of the product ID. There's this overlap, and you can communicate this type of dependency to the planner so it can make better plans for how to run the query. You do that using CREATE STATISTICS. So they used CREATE STATISTICS to create feature stats and defined dependencies on the account ID, product ID, and initiative ID. Then once they analyze the table, of course, the statistics get updated.
And when querying it, they saw a 20,000-fold improvement. Now, I've seen improvements like this before; using, say, a compound index can give a similarly drastic improvement. But essentially, you're relying on the planner to choose the same path that a custom-crafted compound index would take. Basically, because the initiative ID is such a small subset, the planner can scan just that and use a filter to throw out any product IDs or account IDs that don't match. In their example, you won't have an account ID other than one for initiative three, and there are just a few product IDs that have to be filtered out. So that allows the planner to make a far more efficient plan.
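To give a sense of what that looks like in practice, here's a minimal sketch of the kind of extended statistics they describe. The table and column names are just stand-ins for their example, not the actual schema from the post.

```sql
-- Declare that these columns are functionally dependent on each other, so the
-- planner stops treating their selectivities as independent.
CREATE STATISTICS feature_stats (dependencies)
    ON account_id, product_id, initiative_id
    FROM features;  -- hypothetical table name

-- The extended statistics are only populated once the table is analyzed.
ANALYZE features;
```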
The next area they covered is being wary of the join collapse limit. Now, this is fascinating because a lot of it is actually new to me. The situation is that when you have a lot of joins, planning can really start slowing down. The reason is that Postgres uses a deterministic method to find the optimum plan to execute your query, and when you have too many joins, the possibilities grow exponentially and planning would take forever, or at least a really long time. So there was actually another organization that developed a genetic algorithm-based query optimizer that is based more on heuristics. It applies the genetic concept of evolution to pick the best plan, so there's a bit of randomness, or non-deterministic behavior, in how it seeks out the best plan. Now, it's not always going to find the best plan; there's a chance it chooses a poor one rather than the best, the way a deterministic planner would. But it allows you to avoid checking all the paths while still, most of the time, picking the optimum plan.
When Postgres switches from the deterministic planner to the non-deterministic one is determined by geqo_threshold, which by default is set at 12. How much work that heuristic planner then does is based upon geqo_effort, which ranges from 1 to 10 and defaults to 5. They've had good experience increasing the threshold to 20, which pushes back when that heuristic calculation gets used. And they're considering, when a query does go over that threshold and the heuristic algorithm kicks in, increasing the effort value to spend more time trying to get the best plan. Because for them, it's more important to get an accurate plan than to make queries a little bit faster by cutting down on planning time.
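For reference, a sketch of those settings looks like this. The threshold of 20 is what they used; the effort value here is just a hypothetical bump, since they were only considering raising it.

```sql
-- Raise the number of joined relations at which the genetic (heuristic)
-- optimizer takes over from the deterministic planner. Default is 12.
ALTER SYSTEM SET geqo_threshold = 20;

-- Optionally spend more planning effort when GEQO does kick in.
-- Range is 1-10, default 5; 7 here is a hypothetical value.
ALTER SYSTEM SET geqo_effort = 7;

SELECT pg_reload_conf();
```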
But this is a pretty interesting adjustment that has allowed them to get a bit more performance for their particular queries. Now, the other area they didn't mention, which you'll definitely want to look at, is increasing your default statistics target. By default, Postgres samples a certain number of rows from each table to collect statistics about it. As your data sizes increase, you're probably going to want to increase the number of rows sampled; I've had queries that suddenly started performing poorly, and to resolve the situation all we had to do was bump up tenfold the number of rows included in the statistics. So that's another area to consider, increasing it either cluster-wide or for specific columns on a table, so the planner has more relevant samples to work with when determining what the best plan should be.
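As a sketch, that can be done cluster-wide or per column. The table and column names below are hypothetical, and the target of 1000 just reflects a tenfold bump over the default of 100.

```sql
-- Cluster-wide: sample more rows during ANALYZE (default target is 100).
ALTER SYSTEM SET default_statistics_target = 1000;
SELECT pg_reload_conf();

-- Or raise it for a single column on a hot table (hypothetical names),
-- then re-analyze so the larger sample is actually collected.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;
```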
But this is a really great post I encourage you to check out. Now, related to this, this post was also covered in "5mins of Postgres E15: Tuning Postgres' Deterministic Query Planner, Extended Statistics and Join Collapse Limits". He reviews the post as well, and I definitely encourage you to check out his episode for his perspective on it.
The next piece of content- "Teaching Elephants to Fish. This is from timescale.com and they're talking about Babelfish. So if you're not familiar, Babelfish is an extension that was developed, I believe, by Amazon AWS that allows a PostgreSQL instance to be line-compatible with Microsoft SQL server clients. So basically you can have a Microsoft SQL server client that speaks its standard wire protocol to other Microsoft SQL servers. But now I can actually talk to Postgres and send the exact same types of data.
Now, with regard to this, this post is talking about community members. Looking at the DB-Engines ranking, you can see that over the last nine years Postgres has continued to increase, while the other relational database systems have been declining slightly, and Microsoft SQL Server has been declining more so. He's saying there's the potential for many more new community members to come into Postgres from Microsoft SQL Server because of this, and he basically says we need to have a plan for that. And he goes over five areas that we should
consider in terms of doing community management for potential new members. One is to lead with empathy and curiosity. Two is lower the bar for the entry-level #pghelp. The third is to support new members by cultivating more leaders. Fourth, seek leaders proactively as opposed to just having them come up. And then fifth, develop consistent messaging around the community. So if you're interested in keeping and developing a healthy community, definitely a blog post to check out.
Next piece of content- "Working with Large PostgreSQL Databases". This is from percona.com and they're talking about things you should do when you're working with a large database as opposed to a small database. Now that's kind of a loaded question. What is small? What is large? Some people may say a small is on the megabyte size, some say it's the gigabyte size where there are people that have large databases dealing with petabyte sizes. But the other thing to consider, because a lot of the things mentioned here I don't necessarily are all small vs. large, but it's also when you have a large database, you're going to have more customers.
Those customers are going to have more eyes on your product, and if you go down for any reason or have issues, it's going to be far more visible to more people. A large database also implies you're making a lot more money, so there's more money on the line if something happens to the database. And more money means you probably have more engineers working on the product and more people working with the database. That also pushes a lot of the things mentioned in this article, like starting to use more automated tools, which becomes more important the more people's hands are in the mix, since you don't want to be doing things manually. They also talk about consistently monitoring the database and going through an active tuning cycle.
Again, these become more important the more people are using the database. Now, from a technical perspective, as you get large, you will have more bloat to deal with and you'll have to do more autovacuum tuning compared to when the database was small. Optimized indexes also become much more important on large databases than on small ones. So there are definitely technical differences that need to be addressed. This post gives a pretty good overview of the things you need to be aware of as your database grows into that large territory, whatever you call it. If you're interested in learning more, you can definitely check out this blog post.
The next piece of content- "USING UBER'S H3 HEX GRID IN POSTGIS". This is from rustprooflabs.com. They're talking about an extension called H3 that was developed by Uber for developing hex grids. And it's actually an extension so you can create it in your PostgreSQL database. It allows you to create things like these geometry boundaries for hexes and even work from different perspectives. And here you loaded in some OpenStreetMap data to give a hex representation of different concentrations of traffic signals. So if you're interested in that, you can check out this blog post.
The next piece of content- "Mapping Fracking Wells in Colorado with PostgreSQL". This is from selectfrom.dev and he's talking about using PostgreSQL to analyze and draw maps for this fracking. So again, this uses PostGIS to be able to map these and present that data. So I encourage you to check out this blog post if you're interested.
Next piece of content- "ANALYZING HISTORICAL FLIGHT DATA WITH MOBILITYDB". This is from cybertec-postgresql.com and MobilityDB is actually an extension that sits on top of PostGIS to do spatial temporal analysis of the data. So he walks through getting this set up, shows the stack that he's using for doing this analysis and shows you the different commands to go ahead and do this type of analysis on historical flight patterns. So if you're interested in that, you can check out this blog post.
The next piece of content- "Global deadlock in a distributed database cluster". This is from highgo.ca. And of course, a deadlock is when one session takes a lock in a row and needs to lock another row, whereas another session has locked the row it wants to go to and locks the row that the first session already has. So basically, there's no way to resolve that locking solution. So you have to declare a deadlock and kill one of the sessions. Now this happens on the local PostgreSQL cluster side, but what happens when you're starting to try to do a distributive cluster of PostgreSQL systems?
Well, then you need some sort of global deadlock mechanism. This blog post walks through setting up a distributed configuration using the Postgres Foreign Data Wrapper. They create two coordinator nodes and two data nodes and show an example of a deadlock happening across them. And basically, it waits forever. So this is a scenario you can get into on the current path of trying to run a distributed cluster of PostgreSQL instances, and the question is how to solve this particular problem. They have a bit of a discussion about it; basically, this is still a work in progress, and the post highlights the issue at hand and some thoughts on how to address it. So if you're interested in that, you can check out this blog post.
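Before moving on, to make the single-node case concrete, here's a minimal sketch of how two sessions deadlock on an ordinary Postgres instance, using a hypothetical table. Postgres's local deadlock detector resolves this by aborting one of the transactions, which is exactly the mechanism that's missing once the rows live on different nodes behind foreign data wrappers.

```sql
-- Hypothetical table and values; run the two sessions side by side.

-- Session 1:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;  -- locks row 1

-- Session 2:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 2;  -- locks row 2
UPDATE accounts SET balance = balance + 10 WHERE id = 1;  -- blocks, waiting on session 1

-- Session 1 again:
UPDATE accounts SET balance = balance + 10 WHERE id = 2;  -- blocks, waiting on session 2
-- After deadlock_timeout (1s by default), the deadlock detector aborts one of
-- the two transactions with "ERROR: deadlock detected"; the other proceeds.
```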
The next piece of content- "Formatting SQL code with pgFormatter within Vim". This is from endpointdev.com and we've mentioned pgFormatter in previous episodes of Scaling Postgres, but here he's using it within the Vim editor to actually auto-format your SQL. So here's a quick little animated GIF that shows taking unformatted SQL and then showing the format of the SQL. And he mentioned a link for someone doing it in Emacs and he's using it with Vim. So if you're interested in that, you can check out this blog post.
The next piece of content- "Easy Postgres Major Version Upgrades using PGO v5.1". This is from crunchydata.com and this is using PGO, their Postgres operator for Kubernetes. So if you're using Kubernetes to manage your PostgreSQL cluster, their operator allows you to do a full version upgrade. Now it does require a stop and then a start of the Postgres instances, but it does seemingly allow you to do it quite easily using their operator. So if you're interested in that, you can definitely check out this blog post.
The next piece of content is the PostgreSQL person of the week, which is Philippe Beaudoin. If you're interested in learning more about Philippe and his contributions to Postgres, definitely check out this blog post.
The last piece of content: we did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This episode is about whether "To DRY or Not To DRY" your code. So if you're interested in more long-form, developer-based content, maybe you'd like to check out our show.