Scaling Out, Planner Estimation, Create Statistics, Stay Curious | Scaling Postgres 103
In this episode of Scaling Postgres, we discuss how to scale out, how the planner estimates, uses of create statistics and investigating PostgreSQL running environments.
Content Discussed
- Preparing your Postgres data for scale-out
- Planner selectivity estimation error statistics with pg_qualstats 2
- Webinar: All you need to know about CREATE STATISTICS [Follow Up]
- [YouTube] PostgreSQL at low level: stay curious!
- Researching PostGIS slowness (2019 edition)
- Optimizations in GROUP BY vs SELECT DISTINCT
- Creating a PostgreSQL procedural language – Part 3 – Executing User Code
- PostgreSQL Person of the Week
- Parallel Vacuum in Upcoming PostgreSQL 13
- On recursive queries
- PostGIS Day in STL
- Visualizing OSM data in QGIS
- Setting up SSL certificate authentication with Pgpool-II
- How to use the SVM Machine Learning Model with 2UDA – PostgreSQL and Orange (Part 2)
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Preparing your Postgres data for scale-out". This is from dev.to, and they're discussing the different preparations you would need to make if you want to potentially scale out your database. Now, they talk about two different types of scaling. Scale-up is basically putting your database on a larger instance, giving it more memory, CPU, disk, et cetera. Scale-out is when you add one or more instances and try to spread the load across multiple databases. They have a little table that compares the ease of doing each, the effectiveness, the unique benefits, and the gotchas. Basically, the easiest is definitely scale-up. It's relatively quick to do; the gotcha is a single point of failure, whereas scaling out requires more planning upfront. However, it does have some interesting benefits in terms of keeping the database online. Now, one of the first areas they cover is: do you want to do read-only or read-write?
Because read-only is just basically setting up some read replicas, which are relatively easy to do, and we have some tutorials at Scaling Postgres that discuss how to do that. Read-write is harder because then you actually have to shard your data across different databases. Now, going into sharding, they talk about different patterns you can use. You could do range-based partitioning, and they use the example here where you have a set of last names and you want to put certain ranges of last names into separate database systems. Now, it's a contrived example, but it still demonstrates what range-based partitioning means. But of course, the disadvantage is that you can get hotspots, and as they mention, there are tons of people with the last name Smith in the United States. The next type of partitioning you can consider is vertical partitioning.
So for example, this is basically a service-based approach. Maybe you have some microservices and each of those has a separate database. With that, your service can be scaled up, and each individual database responsible for part of that service doesn't have to be as large. But as you scale, you may eventually get to the point where the database of one of those services is too large and you need to scale it out. Then they go into hash-based partitioning. Basically, you take some value, hash it, and use that hash to assign the data to a particular partition, as in the sketch below. Now, I think a lot of solutions like Citus may be using something similar to this, or some sort of partitioning key, in order to do sharding. So I would say this is probably one of the more popular ones.
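To make the hash-based idea a little more concrete, here's a minimal sketch using Postgres's built-in declarative hash partitioning. The table and column names are hypothetical, and note this is single-node partitioning; true scale-out sharding would spread these buckets across separate servers, which is what tools like Citus handle.

```sql
-- Hypothetical table hash-partitioned on user_id.
CREATE TABLE users (
    user_id   bigint NOT NULL,
    last_name text
) PARTITION BY HASH (user_id);

-- Four buckets; a row lands in the partition where
-- hash(user_id) modulo 4 equals the given remainder.
CREATE TABLE users_0 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE users_1 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE users_2 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE users_3 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```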
And then lastly, he talks about directory-based partitioning. Basically, you have some service that sits between your app and the database and dictates where the data goes. So this is probably more complex because you have to run a separate service that determines where the data should go. And then, of course, there's the difficulty once you actually do decide to shard: eventually you may want to bring that data back together, and then how do you do cross-shard joins and things of that nature? But this blog post is a pretty good overview of the things you need to consider if you eventually get to the point where you actually want to scale out across multiple database instances. So if you're interested in that, it's definitely a blog post to check out.
The next piece of content is "Planner selectivity estimation error statistics with pg_qualstats2". This is from rjuju.github.io and he's talking about a Qualstat, specifically Qualstats2 is an extension that keeps statistics on the predicates found in where statements and join clauses. So generally, like for example, I believe pg_stat_statements does not include the exact values being used, whereas this is something that Qualstats tries to track. Now they had a previous post that they discuss here where it's talking about being able to do index prediction on what indexes would be good to add. This particular feature talks about being able to look at relationships between columns within a table in terms of estimating if they need additional statistics. So by default, statistics are tracked per column and they don't track relationships between columns. So for example, it doesn't track if there are so many zip codes in a state or any kind of relationship like that.
He developed an example where the planner, even after a vacuum analyze, believed it would come up with 12,500 rows, but the query actually returned 0 rows. So there were no rows that matched this query. This is an example of the planner basing its estimates on the statistical coverage of each column individually and trying to draw a conclusion about both of them, because the WHERE statement takes both into consideration. There is a relationship between these columns, but the planner doesn't know that. So it's estimating that, say, half of the rows match one condition and half match the other, looking at them separately to get some kind of estimate, but it's very inaccurate because there are actually 0 rows. He goes over the source code to show how it comes to this conclusion and looks at the pg_stats view to get that information.
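Here's a minimal sketch of the general problem; the table, column names, and row counts are mine, not the exact ones from the post. Two perfectly correlated columns are queried together, and the planner multiplies the individual selectivities:

```sql
-- Two columns that always carry the same value, so they're
-- perfectly correlated (the planner doesn't know this).
CREATE TABLE example AS
SELECT i % 2 AS val1, i % 2 AS val2
FROM generate_series(1, 100000) AS i;
ANALYZE example;

-- Each predicate alone matches ~50% of rows, so the planner
-- estimates 0.5 * 0.5 * 100000 = 25000 rows -- but since
-- val1 always equals val2, no row can satisfy both clauses.
EXPLAIN ANALYZE SELECT * FROM example WHERE val1 = 0 AND val2 = 1;
```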
And then it gets worse when you try to do a join. In the example here, it thinks it's going to find 313,000,000 rows where there are actually 0 rows. So it can get worse. Now, pg_qualstats, at least version 2, can help detect this problem. He shows an example of a query you can run against the pg_qualstats table that's part of the extension. When looking at a single column in isolation, the mean ratio, which indicates how accurate the statistics are, is close to one. So just looking at one column gives a good estimate. But looking at one column in relation to another, meaning ANDed with another (the val2 column, as he references here), the mean ratio is huge; it's very far from one.
So what that means is that these would be good candidates for CREATE STATISTICS, so you can define the relationship between these columns when they're looked at in combination. He covers this in a discussion of extended statistics, where you can actually use CREATE STATISTICS on these two columns so the planner understands the relationship. Now, when you do the query, you can see it expected 1 row and got 0 rows, which is much more accurate; it's not, say, hundreds of millions off. With that, he also talks about some slides that Tomas Vondra put together for a talk on the subject of CREATE STATISTICS.
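Against the hypothetical table sketched above, that fix would look something like this. The statistics name is arbitrary, and the post may have used a different statistics kind; the multivariate MCV list (PostgreSQL 12+) is what lets the planner see that a value combination never actually occurs:

```sql
-- Tell the planner about the relationship between the two
-- columns, then refresh statistics so it takes effect.
CREATE STATISTICS example_val1_val2 (dependencies, mcv)
    ON val1, val2 FROM example;
ANALYZE example;

-- The row estimate for the combined predicate should now be
-- close to reality (about 1 row instead of 25000).
EXPLAIN ANALYZE SELECT * FROM example WHERE val1 = 0 AND val2 = 1;
```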
Now, he actually did a webinar that was published, called "Webinar: All you need to know about CREATE STATISTICS [Follow Up]". This is from 2ndquadrant.com. It's an hour-long webinar about CREATE STATISTICS. You can just click the link here and register for it to get immediate access. So if you want more information about CREATE STATISTICS and how it can help you educate the planner on relationships between columns, it's definitely a webinar to check out.
The next piece of content is actually a YouTube video: "PostgreSQL at low level: stay curious!". This is on the All Systems Go! YouTube channel. What he covers here is not so much PostgreSQL itself, but how PostgreSQL runs on an operating system or in containers, and all sorts of different tools you can use to see what kind of behavior you're getting from your Postgres instance at the low level where it runs. Is it running in a Kubernetes cluster or a VM? He talks about all the different tools to analyze it, from perf to strace. So if you want to learn more about the actual environment you're running PostgreSQL in, it's definitely a YouTube video to check out.
The next piece of content is "Researching PostGIS slowness (2019 edition)". This is from 2ndquadrant.com. He's talking about a customer they were working with where they had a particular query that was 60 times faster when there was only a difference of 0.0000001 between two different values in terms of calculating a point. But again, it was 60 times faster without that present. They tried to replicate the environment but they used the exact version of PostgreSQL, PostGIS, and lib project and they couldn't see any differences. Then they tried Ubuntu 18.04 and finally they saw a difference and what it led to is actually a difference in glibc versions. So once they upgraded to the most recent glibc versions, both queries became fast. So this is just something to keep in mind. PostgreSQL is great, but it also relies on other libraries, as does PostGIS and other things in your environment. So be sure to keep aware and keep up on patches and upgrades for all the different libraries that you're using in your system to make sure you're getting optimum performance. So if you're interested in this story, go ahead and check out this blog post.
The next piece of content is "Optimations in GROUP BY vs SELECT DISTINCT". This is from thatguyfromdelhi.com and he's talking about differences in performance when doing SELECT DISTINCT, a certain number of columns from a table versus doing a GROUP BY. Sometimes GROUP BY can give you better performance and it has some specific optimizations that if there is a column present, it only uses that column for the Group Key. But he also made an interesting observation that a unique index wouldn't do it, but a nonunique one would and he looked over different variations of choosing different columns, and different keys to see what the differences were. So it's an interesting blog post to cover if you're interested in learning about specific optimizations you could be using with GROUP BY or SELECT DISTINCT.
Next piece of content is "Creating a PostgreSQL procedural language - Part 3 - Executing User Code". This is from2ndquadrant.com. So this is part three that we've covered in previous episodes. This is using Julia language and this is the next phase where they're actually going to be executing user code. So if you've been following along, the next blog post is available.
The next piece of content is actually a new website that seems to have been put up recently. It's called postgresql.life, and it looks like its intent is to interview a PostgreSQL person of the week. As of 27 February 2020, they've posted the first interview: Lætitia Avrot. So, quite interesting; maybe check it on a weekly basis to see all the different people working in Postgres.
The next piece of content is "Parallel Vacuum and Upcoming PostgreSQL 13". This is from highgo.ca. So this talks about the parallel vacuum coming in 13 that we've discussed before. They talk about how you can enable it, the different settings that will be available, and look a little bit at its behavior. It's a little early, so they had some unexpected results so far, but interestingly you have to give it a parallel command when you run vacuum. I wonder if autovacuum cannot be done in parallel yet and if you have to do it as a manual vacuum. I didn't see any reference to that, but it's just a question I'll have to keep track of as version 13 gets closer to release later in the fall.
The next piece of content is "On recursive queries". This is from habr.com, and it looks like it's been translated from Russian at postgrespro.ru. And it's all about recursive CTEs. So WITH RECURSIVE AS. So if you're interested in this content, definitely a blog post to check out.
The next series of posts are all about PostGIS. The first one is "PostGIS Day in STL", meaning St. Louis. There are a number of presentations about PostGIS that were given there and are presented here. This is from cleverelephant.ca.
Next piece of content is "VISUALIZING OSM DATA IN QGIS". This is from cybertec-postgresql.com. So if you have an interest in doing this mapping, definitely a blog post to check out.
The next post is "Setting up SSL certificate authentication with Pgpool-II". And this is from the highgo.ca. With version 4 of Pgpool, they added some additional, as they say, securing authentication features. One of them is talking about SSLs. So they talk about how you can generate an SSL certificate and start using it with Pgpool II.
The last piece of content is "How to use the SVM Machine Learning Model with 2UDA - PostgreSQL and Orange (Part 2)". This is from 2ndquadrant.com. So again, this is about machine learning. I don't have a background in this, but part two of this blog post is available.