Int to BigInt, Resistance is Futile, Optimizer Statistics, Advanced Query Optimization | Scaling Postgres 153
In this episode of Scaling Postgres, we discuss how to convert an integer primary key to a bigint, how container resistance is futile, how the optimizer uses statistics and how to set manual statistics.
Content Discussed
- Changing your huge table's data types in production [FOSDEM 2021] PostgreSQL
- Deep PostgreSQL Thoughts: Resistance to Containers is Futile
- PostgreSQL: ANALYZE and optimizer statistics
- PostgreSQL: CREATE STATISTICS – advanced query optimization
- Querying JSON Data in PostgreSQL
- Query Optimization in Postgres with pg_stat_statements
- Why Partition OpenStreetMap data?
- Partition OpenStreetMap data in PostGIS
- ArcGIS Feature Service to PostGIS: The QGIS Way
- Valeria Kaplan
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is actually a YouTube video called "Changing your huge table's data types in production". This is from FOSDEM 2021 in the PostgreSQL devroom, on the channel of Jimmy Angelakos. He's talking about changing data types on huge tables in a production database without impacting production. The example he uses for most of the scripts in this presentation is where you have an integer primary key and you need to make it a bigint because you're about to run out of integers. So he goes through the whole process and shows you these scripts.
Basically, you first add a new bigint column and then use triggers to start copying the data from the integer column to the bigint column. Then you set up a job or process to backfill the data in that bigint column to make sure it matches what's in the int column. Once everything is in sync, then as part of one transaction, you run a series of DDL statements to rename the column, drop the old column, set up the new primary key, et cetera. This technique can, of course, be used for other data types; he's just using the example where you have an integer that's about to run out and it's the primary key of a table. So if you want to learn more about the technique he's describing, definitely check out this YouTube video.
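To make the technique concrete, here is a minimal sketch of those steps against a hypothetical "orders" table with an integer "id" primary key. This is not his exact script: a production version would also build the new unique index concurrently before the swap (attaching it with ADD CONSTRAINT ... PRIMARY KEY USING INDEX), handle sequence defaults, foreign keys, and batching more carefully.

```sql
-- 1. Add the new bigint column.
ALTER TABLE orders ADD COLUMN id_new bigint;

-- 2. Keep it in sync on writes with a trigger (EXECUTE FUNCTION is PG 11+;
--    use EXECUTE PROCEDURE on older versions).
CREATE OR REPLACE FUNCTION orders_sync_id() RETURNS trigger AS $$
BEGIN
  NEW.id_new := NEW.id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_sync_id_trg
  BEFORE INSERT OR UPDATE ON orders
  FOR EACH ROW EXECUTE FUNCTION orders_sync_id();

-- 3. Backfill existing rows in batches (repeat over successive ranges).
UPDATE orders SET id_new = id
WHERE id_new IS NULL AND id BETWEEN 1 AND 100000;

-- 4. Once everything is in sync, swap the columns in one transaction.
BEGIN;
ALTER TABLE orders DROP CONSTRAINT orders_pkey;
ALTER TABLE orders DROP COLUMN id;
ALTER TABLE orders RENAME COLUMN id_new TO id;
ALTER TABLE orders ADD PRIMARY KEY (id);
COMMIT;
```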
The next piece of content is "Deep PostgreSQL Thoughts: Resistance to Containers is Futile". This is from blog.crunchydata.com in the first part of the post he's talking about. There's some people that think containers basically aren't ready for prime time, particularly for deploying your databases. This post talks about how trying to resist this trend is probably futile, and we're heading down that path anyway. He talks about what a container is and even running PostgreSQL on some operating systems, they've already placed certain things in C groups, which is kind of what containers use to do some of their containerization. So you're kind of already doing that. It's just making it more formal when you use a container.
Now, he does list some considerations you need to address when you're dealing with containers running Postgres. Number one is the out-of-memory killer; we covered a post on that last week, also from blog.crunchydata.com. The next is storage, because a lot of containers are designed to be ephemeral, so how do you handle persistent storage? Then, how do you handle restarts and promotions? This is basically talking about the orchestration layers, and work is being done in that area; they mention the Crunchy Data operator that they have developed. Then custom deployments.
That means someone has set up an automated container orchestration, but then they want to make tweaks to it, and sometimes those tweaks cause issues. Now, personally, I haven't deployed PostgreSQL in a container. For my use cases, and for a lot of the customers I work with, I don't see a reason to use containers. It seems like the push towards containers comes from organizations that are already using them, say for their application deployment, so they just want to normalize everything. But personally, I don't see a great pressing need to use containers with Postgres. Like anything, though, it depends on your use case. So if you're considering using containers with PostgreSQL, you may want to check out this post.
The next piece of content is "POSTGRESQL: ANALYZE AND OPTIMIZER STATISTICS". This is from cybertec-postgresql.com and he's talking about statistics and how they are used in regard to completing queries. So it goes through the general process of processing SQL in that it goes through a parser. It then goes to a traffic cop that sends utility commands this way but queries down to the rewrite system which then prepares the query for the optimizer. The optimizer consults statistics in order to determine the best path for the executor. Now, he has an example here where he creates a table using generate_series and he uses an EXPLAIN plan to see what the optimizer is planning to do.
Then he looks at the pg_stats view, which is where these statistics are exposed. He shows you all the different columns, then shows an example of what some of the data looks like, and these statistics are kept per column. So looking at the ID column here, it gives a histogram of bucket bounds as well as other information the planner uses to determine the best way to approach querying this column. Then you see a name column, and it's storing something a little bit different: it doesn't have a histogram with different bounds because pretty much every value is 'hans' in this column, so instead it presents the most common values and their frequencies from the statistics it analyzed. So if you want to learn more about statistics and how PostgreSQL uses them, you can definitely check out this post.
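If you want to poke at this yourself, here is a rough recreation of that kind of example. The table and column names are illustrative, not necessarily the ones from the article.

```sql
-- A small table where id is unique and name is almost always the same value.
CREATE TABLE t_test AS
  SELECT id, 'hans'::text AS name
  FROM generate_series(1, 1000000) AS id;

ANALYZE t_test;

-- The planner's per-column statistics are exposed through the pg_stats view.
SELECT attname, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 't_test';

-- EXPLAIN shows how those statistics drive the row estimates.
EXPLAIN SELECT * FROM t_test WHERE id < 150000;
```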
The next piece of content- "POSTGRESQL: CREATE STATISTICS - ADVANCED QUERY OPTIMIZATION". This is also from cybertec-postgresql.com. Continuing the discussion of statistics, this is where you can actually use the command CREATE STATISTICS to create dependencies between columns. So normally these statistics are per column, but the CREATE STATISTICS allows you to define relationships between columns. Now, some of those relationships could be between the country and the language that is spoken. Those are highly correlated. Or there could be a relationship between the zip codes within a state, the states within a country, or provinces within a country, those are correlated.
But in his example, he used generate_series and defined a correlation between the X and Y columns, in that the Y column is always just 50,000 greater than the X column. Then he did a query against this, and with an EXPLAIN it expected to get a million rows, but when he actually ran EXPLAIN ANALYZE, it only pulled out 10,000 rows. So there is a correlation between X and Y, and when the planner assesses statistics per column and calculates the number of rows it expects, you're going to get differences like this. But if you then use CREATE STATISTICS, give it a name, and specify the kind of statistics (there are a number of different kinds you can create; he's using ndistinct between these two columns), then when you do an EXPLAIN you get the properly estimated 10,000 rows, because the planner now knows about the correlation between these columns. So if you want to learn more about how to do that, you can check out this post.
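Here is a hedged sketch of extended statistics on two correlated columns that follows the same idea. The table, column names, and exact numbers are illustrative rather than copied from the post.

```sql
-- y is fully determined by x, so the two columns are perfectly correlated.
CREATE TABLE t_corr AS
  SELECT i % 10000 AS x, (i % 10000) + 50000 AS y
  FROM generate_series(1, 1000000) AS i;

ANALYZE t_corr;

-- Without extended statistics, the planner treats x and y as independent
-- and badly overestimates the number of groups (roughly the full row count).
EXPLAIN SELECT x, y FROM t_corr GROUP BY x, y;

-- Tell the planner the columns are related, then re-gather statistics.
CREATE STATISTICS s_corr (ndistinct) ON x, y FROM t_corr;
ANALYZE t_corr;

-- The group estimate should now be close to the real 10,000.
EXPLAIN SELECT x, y FROM t_corr GROUP BY x, y;
```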
The next piece of content is "Querying JSON Data in PostgreSQL". This is from aaronbos.dev and he's talking about using JSON data in Postgres. Now this is a relatively simple post. He has a basic set of JSON here that he's working with and he goes through how you can pull it out as part of a SELECT using the arrow (- >) syntax. This returns JSON and then he looks at the arrow notation with a double arrow (- >>) which basically returns text instead of JSON. With that, you can do WHERE queries to determine particular values in the JSON and also how you can nest responses to navigate through the JSON to return the exact data you want in a SELECT clause. He then covers the containment operator so you can use it in a WHERE clause to see if this record contains a particular type of JSON as well as the question mark operator, which helps you to determine if there are any keys that match within a particular set of JSON. So it's a pretty basic post, but if you want to get started using JSON with PostgreSQL, definitely check out this post.
The next piece of content- "Query Optimization in Postgres with pg_stat_statements". This is from blog.crunchydata.com and this is a pretty basic post again, and it's talking about pg_stat_statements. It talks about how to get it set up. Basically, it's part of the contrib module. So as long as you have that installed, you need to add pg_stat_statements to your shared preload libraries and then create the extension for each database you want to use it in. He shows what the view looks like, but once you get it up and running, you can start tracking the statements that are being run against Postgres. He gives an example of a report so you can look at your queries, taking the most time to run or queries that are being run very frequently.
He also covers how you can reset the statistics at a particular point if you want to look at fresh data after you've created an index or done some schema changes. He also covers how you can configure it, such as setting how many statements are recorded in pg_stat_statements, as well as the types of statements you want to track. Of course, once you have identified a statement you want to improve, you would then use EXPLAIN, EXPLAIN ANALYZE, et cetera to figure out how to optimize that query. Now, he also mentioned you can enable I/O statistics so you get data for block read time and block write time.
But this is not enabled by default, so you need to adjust the track_io_timing parameter in the postgresql.conf file. You need to be aware that this could be a performance burden, so you want to make sure it's not going to negatively impact your system. He also mentioned a utility that I actually hadn't heard of before, pg_test_timing. It can look at your timers and help you determine whether you're likely to be impacted by enabling this parameter on your system. So if you want to learn more about pg_stat_statements and how it works to analyze your queries, definitely check out this post.
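For reference, resetting the statistics and enabling I/O timing might look like this sketch (track_io_timing can also simply be set in postgresql.conf and reloaded; the ALTER SYSTEM route shown here assumes superuser access).

```sql
-- Clear accumulated statistics, e.g. after adding an index or changing schema.
SELECT pg_stat_statements_reset();

-- Enable I/O timing so blk_read_time / blk_write_time get populated.
ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();

-- From the shell, check whether timer overhead is likely to be a problem:
--   pg_test_timing
```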
The next piece of content is "WHY PARTITION OPENSTREETMAP DATA?". This is from rustprooflabs.com, and he's talking about a decision process he's going through on whether he should partition data that he's received from OpenStreetMap into his Postgres database. So he's using PostGIS to analyze this. Right now, he's juggling different schemas as he loads data in to do analysis. He's wondering if converting this to using partitions would make things easier. Now, this is the first post that kind of talks about how he's working now, basically loading data and identifying it in a particular schema and then potentially having to change those schemas. Then sometimes he wants to combine data, which requires unions, which kind of brings some different issues. So he's really wanted to look at partitioning to see if this can improve his workflow.
Now the second post is "PARTITION OPENSTREETMAP DATA IN POSTGIS". So this post goes into more detail about how he's planning to do it and how the implementation would follow. Now, he hasn't determined that he's 100% going to do this, but these are posts that are walking through the process about how he's considering doing it and if it would make sense for him.
Now, a third post is coming. It's not ready yet; perhaps next week's episode will cover it and reveal whether he's decided to go forward with this or not. But these two posts are a great review of what goes into deciding to partition. It is a burden to set up and maintain, so you want to make sure the benefits outweigh the cost. So definitely check out these posts if you're interested.
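As a purely illustrative sketch of the general idea (not the author's actual schema or design), list-partitioning a PostGIS table by source region could replace keeping each load in its own schema, so that single-region queries prune to one partition and multi-region queries no longer need manual UNIONs.

```sql
CREATE EXTENSION IF NOT EXISTS postgis;

-- Hypothetical roads table, list-partitioned by the region the data came from.
CREATE TABLE osm_roads (
    osm_id  bigint,
    region  text NOT NULL,
    name    text,
    way     geometry(LineString, 3857)
) PARTITION BY LIST (region);

CREATE TABLE osm_roads_colorado PARTITION OF osm_roads
    FOR VALUES IN ('colorado');
CREATE TABLE osm_roads_texas PARTITION OF osm_roads
    FOR VALUES IN ('texas');

-- A single-region query only touches its partition; a query without the
-- region filter transparently covers all loaded regions.
SELECT count(*) FROM osm_roads WHERE region = 'colorado';
```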
The next piece of content- "ArcGIS Feature Service to PostGIS: The QGIS Way". This is from blog.crunchydata.com. So this is a post about GIS and working with different services to get it integrated into PostGIS. So if you're interested in that, you can check out this post.
The last piece of content, the PostgreSQL person of the week is Valeria Kaplan. So if you're interested in learning more about Valeria and her contributions to Postgres, definitely check out this blog post.