
Explain Analyze, Sequence Counts, Best Books, Partition Migration | Scaling Postgres 112

In this episode of Scaling Postgres, we discuss how to use explain analyze, the problem with counting sequences, the best books to learn Postgres and how to migrate to declarative partitioning.

Transcript

All right. I hope you, your friends, family, and coworkers are doing well. Our first piece of content is "Optimizing SQL - Step 1: EXPLAIN Costs and Plans in PostgreSQL - Part 2". This is from highgo.ca. And as you can tell, this is part two of their post about EXPLAIN. We covered the first part a number of weeks ago; it explained how EXPLAIN works and even showed how you could build a query based upon an EXPLAIN plan. This one goes into ANALYZE and talks more in-depth about interpreting some of the output. In terms of their setup, they just have a big table and a small table, although the big table still only has 100 unique rows and the small table is just a little bit smaller. The first thing they talk about is costs. Basically, PostgreSQL uses a cost-based optimizer.

So it takes into account configuration settings, like seq_page_cost, as well as the statistics that exist for a given table and its indexes, and it comes up with a cost estimate for each operation. So, for example, it's looking at the different operations here: how long would it take to do a hash, a sequential scan, or a sort? And it comes up with a cost estimate for each one. You'll see that there are two values listed for each. The first part is the estimated startup cost: how long it takes to get to the point of starting this step, that is, whether anything needs to happen beforehand. In this plan, some of the first things it can start doing are the two sequential scans, so their startup cost is essentially 0.
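As a rough sketch of what that looks like (the table name and numbers here are made up for illustration, not taken from the post), each node of a plain EXPLAIN plan is reported in this format:

EXPLAIN SELECT * FROM big_table WHERE id < 100;
-- Example node:  Seq Scan on big_table  (cost=0.00..1943.00 rows=100 width=244)
-- cost=startup..total : estimated startup cost and estimated total cost
-- rows  : estimated number of rows this node returns
-- width : estimated average row size in bytes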

The second value is the estimated total cost for the action, along with the number of rows it anticipates and the width of the data it has to work with. So basically, you need to look and see where the greatest cost is coming from when you're doing an EXPLAIN or an EXPLAIN ANALYZE to see what should be the slowest part. In the next section, they talk about EXPLAIN ANALYZE, which actually runs the query. So when you add ANALYZE, it actually runs the query, whereas when you just use EXPLAIN, it doesn't run the query, it gives you an estimate. But I really like using ANALYZE, of course, because it actually runs the query and gives you actual results, so you see the actual rows and it gives you the timing.

So, how many milliseconds does it take to run a particular action? You get the actual timing for each stage, and again, it tells you how many rows and how many loops it has to go through, because there are certain types of queries where it has to do more than one loop. For example, it has to do this materialize step 100 times based on the query, the data, and how it's laid out. So with this information, you can then look at your EXPLAIN ANALYZE plan and see which step is taking the longest time; that is the area you would need to optimize. So if you want to learn more about EXPLAIN ANALYZE, definitely check out this post.
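Here is a minimal sketch of the extra information ANALYZE adds (the table names and numbers are again just illustrative, not from the post):

EXPLAIN ANALYZE SELECT * FROM big_table b JOIN small_table s ON s.id = b.id;
-- Each node now also shows something like:
--   (actual time=0.015..12.342 rows=100 loops=1)
-- actual time = actual startup..total time in milliseconds for this node
-- rows  = rows actually returned per loop
-- loops = number of times the node was executed; a Materialize node inside
--         a nested loop, for example, may show loops equal to the number of outer rows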

Now, a great complement to the previous post is "How to use EXPLAIN ANALYZE for Planning and Optimizing Query Performance in PostgreSQL". This is from the enterprisedb.com blog. They take it from the perspective of a customer who contacted them and said that a particular query wasn't using an index, it was doing a sequential scan, and wondered why that was. So they used EXPLAIN to figure out why it wasn't using the index. There was an existing index, and they basically just needed to rework the query in order to start using an index scan again. It then goes on to discuss how the planner uses a cost-based approach. What's interesting about this post is that it actually shows how the cost is derived when just using EXPLAIN. You can see the formula here: the number of blocks times the sequential page cost. Again, this is a PostgreSQL configuration setting.

So it takes the number of blocks that are going to be involved and multiplies it by that, adding the number of records times cpu_tuple_cost, another configuration variable, plus the number of records times the per-row filter cost, based on cpu_operator_cost, yet another configuration variable. Using these, it calculates the exact cost that's represented here for this sequential scan. So for those of you interested in the details, it shows you exactly how this cost-based optimizer comes up with the costs that it does. They also talk a bit about statistics and how they are involved in doing the estimates. Then they talk about EXPLAIN ANALYZE and how it actually runs the query, so you get actual timing for each step of the process that a query runs through. They also have a summary table of all the different scan types and joins that are available. So again, if you want to learn more about EXPLAIN ANALYZE and how it works in PostgreSQL, these are definitely two blog posts to check out.
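If you want to check that math yourself, here is a minimal sketch (the table name is hypothetical) that reproduces the sequential scan estimate from the planner statistics and the cost settings:

SELECT relpages::numeric  * current_setting('seq_page_cost')::numeric
     + reltuples::numeric * current_setting('cpu_tuple_cost')::numeric
     + reltuples::numeric * current_setting('cpu_operator_cost')::numeric AS seq_scan_cost_estimate
FROM pg_class
WHERE relname = 'big_table';  -- hypothetical table name
-- The cpu_operator_cost term applies when the scan has a WHERE filter;
-- without a filter, the estimate is just the first two terms.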

The next piece of content is "SPEEDING UP COUNT(*): WHY NOT USE MAX(ID) - MIN(ID)?". This is from cybertec-postgresql.com, and they had done a previous post, SPEEDING UP COUNT(*), and they said that in some of the comments, someone's saying, why don't you just do a max(id) or min(id)? But the problem is it will return the wrong answer. Now, I mentioned this in last week's episode of Scaling Postgres, talking about sequences and how sequences can have gaps. So if you try to insert a row and have an error or insert a row and do a rollback, the next sequence you get will be higher than what you expect. It will have a gap in it. Sequences are just guaranteed to be higher than the previous one. It's not guaranteed not to have gaps, so you're going to have gaps. 

So you can't use that to get an exact count, and they go through an example: they try to insert a row that would have gotten an ID of four, but they roll it back, and you can see the gap is present because of that rollback. Now, depending on how well you know your table, this could be used as a means to get an estimate, but it would be a very rough estimate, and it depends on how many errors and rollbacks you tend to have on that table. So I wouldn't necessarily advocate it, but it could give you a rough ballpark sense of how many rows are in the table if you wanted an estimate as opposed to having to count every row. But then again, it may be easier to just use the stats tables to get a rough estimate of how many rows there are. So if you want to learn more about this, definitely check out this blog post.
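As a quick sketch of both points (the table here is hypothetical, not the one from the post):

-- A rolled-back insert still consumes a sequence value, leaving a gap:
CREATE TABLE t (id bigserial PRIMARY KEY, val text);
INSERT INTO t (val) VALUES ('a'), ('b'), ('c');   -- ids 1, 2, 3
BEGIN;
INSERT INTO t (val) VALUES ('d');                 -- would have been id 4
ROLLBACK;
INSERT INTO t (val) VALUES ('e');                 -- gets id 5
SELECT count(*), max(id) - min(id) + 1 FROM t;    -- 4 versus 5
-- A cheap approximation from the planner statistics instead:
SELECT reltuples::bigint AS estimated_rows FROM pg_class WHERE relname = 't';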

The next piece of content is actually a YouTube video and it's called "My Favorite PostgreSQL Books". I believe it was a webinar that was done on the EnterpriseDB YouTube channel, and it goes over a set of books, each for a different type of person. If you're more of a developer, it suggests one book; if you're more of a beginner, it suggests another; or maybe you're looking for a cookbook of sorts. I think it's about four or five books that they mention, and they give a summary and their opinions on each one. So if you're looking for more educational material about PostgreSQL, maybe check out this webinar.

The next post is "How to migrate from inheritance-based partitioning to declarative partitioning in PostgreSQL". This is from 2ndquadrant.com and version PostgreSQL 10 introduced native partitioning or declarative partitioning versions. Prior used what they called inheritance-based partitioning. So this post walks you through the process of being able to migrate from one to another. And I'll just go over the overview here. First, they created a new table with the same columns as the parent table from the existing partitioning scheme. Then for each child table in the heritage scheme, they detached this child table from the parent using ALTER TABLE with NO INHERIT. Then attach this child table to the new parent using the ATTACH PARTITION, then create any needed new indexes on the new parent table. 

And they did mention in this post that if matching indexes already exist on the child tables, they won't be rebuilt, so that's definitely convenient and should happen pretty quickly. Also, you may need to add views, functions, or other objects, drop the original table used with inheritance partitioning, and then lastly rename the new parent table to replace the old one. They also mention that when doing this you would definitely want to take a backup of your tables and do it within a transaction. So they start off with BEGIN, then they do all the steps, and then they conclude with a COMMIT, of course. So if you need to do this, definitely check out this post for a technique to migrate from inheritance-based to declarative partitioning.
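Here is a minimal sketch of what that sequence of statements could look like; the table names, columns, and partition bounds are hypothetical, and the post itself should be followed for the details:

BEGIN;
-- New declarative parent with the same columns as the old inheritance parent:
CREATE TABLE measurements_new (logdate date NOT NULL, reading int)
  PARTITION BY RANGE (logdate);
-- For each child table of the old parent:
ALTER TABLE measurements_2020_01 NO INHERIT measurements;
ALTER TABLE measurements_new
  ATTACH PARTITION measurements_2020_01
  FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');
-- Create any needed indexes on the new parent, then swap it into place:
DROP TABLE measurements;
ALTER TABLE measurements_new RENAME TO measurements;
COMMIT;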

The next post is "Using PostgreSQL for JSON Storage". This is from crunchydata.com. Now, this is a very brief post, relatively short, but it talks about using JSON in PostgreSQL and it gives an example of JSON that say you have and storing it in a column that they just happen to call JSON underscore content and ideally with a gin index to make queries faster. And they show you how you can extract certain data from it. So this is a way to get the user's last name from this syntax and then also use the WHERE clause to actually pull out a particular piece of content from it or a particular record from it. So again very brief but if you want a quick intro to using JSON, check out this blog post.

The next post, again very brief, is "How to set up application_name for PostgreSQL command line utility?". This is from elephas.io, and it actually applies to other applications as well. In this example, if you try to set the application name as a variable when invoking psql, it doesn't work; it actually needs to be part of the connection string. So here you see application_name set to a specified name in the connection string, and then when you run SHOW application_name, it appears. They also mention that you can use the environment variable PGAPPNAME before making the connection. So if you have a particular application that you want identified by an application name within PostgreSQL, which is convenient for logging, you could use one of these two methods.
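As a small sketch of the two approaches (the database and application names are hypothetical):

-- Passed as part of the connection string when invoking psql:
--   psql "dbname=mydb application_name=my_app"
-- Or via the environment variable before connecting:
--   PGAPPNAME=my_app psql mydb
-- Once connected, you can verify it:
SHOW application_name;
-- It also shows up per connection in pg_stat_activity:
SELECT pid, application_name FROM pg_stat_activity;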

The next piece of content is "A tale of password authentication methods in PostgreSQL". This is from 2ndquadrant.com and it's walking through the evolution of passwords in Postgres. So at first, it started off with just a password in clear text. They said this isn't great, let's try to encrypt at least a little bit. So they used the crypt utility within Unix or Linux but that had disadvantages again because of some clear text issues. So they moved to MD5 which resulted in some benefits, but there were also some negatives. So now the one that they suggest people use is SCRAM and they talk about all the advantages and how it is superior to the previous methods. So if you want to learn a little bit more about the evolution of passwords in Postgres and the benefits of using SCRAM, definitely check out this blog post.

The next piece of content is "Oracle to PostgreSQL: ANSI outer join syntax in PostgreSQL". This is from 2ndquadrant.com as well and there is some specific syntax that Oracle is sometimes used for doing joins and this basically explains how you would do that in PostgreSQL. So for example, this would be the Oracle version using this (+), whereas you could just use the ANSI standard RIGHT JOIN or RIGHT OUTER JOIN syntax in PostgreSQL. How to do a full join, again, you would use the FULL JOIN syntax in PostgreSQL and even how you would do a cross-join. So if you have a need to migrate from Oracle to PostgreSQL and want some migration techniques, check out this blog post.

The next post is "My Favorite PostgreSQL Extensions - Part 1" and "My Favorite PostgreSQL Extensions - Part 2". These posts are from severalnines.com.  So the first one that they list is the Postgres Foreign Data Wrapper or Postgres FDW. So this allows you to connect to external Postgres servers and query and get data from them. The second one they mentioned is pg_partman. So if you want to partition tables, pg_partman makes the management of those partitions much easier. So this is another extension that you can use to do that. In the next post, the first one mentioned is pgAudit. So if you want to set up audit tables to track INSERTS, UPDATES, DELETES, et cetera, in your PostgreSQL instance, you could check out this extension. 

The next one they mention is pg_repack. This helps you reclaim space from bloated tables while the database is online. Normally, the only way to fully reclaim that space is with a VACUUM FULL, but that locks the whole table. pg_repack uses a technique that lets you rebuild tables while they remain online and reclaim the bloat in them. The last one they recommend is HypoPG. This lets you set up hypothetical indexes to see how advantageous they could be if you were to add them: you run an EXPLAIN and see whether the plan would use the hypothetical index and what the benefit would be. So it helps you get a sense of whether particular indexes help with particular queries. So if you want to learn more details about these extensions and how to use them, you can check out either of these two blog posts.
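For HypoPG, a minimal sketch (the table and column are hypothetical) looks like this:

CREATE EXTENSION hypopg;
-- Create a hypothetical index: nothing is written to disk, and it is only visible to this session:
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');
-- A plain EXPLAIN (without ANALYZE) will consider the hypothetical index:
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- List and clear hypothetical indexes when done:
SELECT * FROM hypopg_list_indexes();
SELECT hypopg_reset();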

The next piece of content is "How the Citus distributed query executor adapts to your Postgres workload". This is from citusdata.com. Now they're talking about Citus, which is an extension for Postgres that helps you shard across multiple Postgres servers. They're talking about the open-source version so this is included in it. Now, I'm not that familiar with using Citus, I've never used it. But I like this blog post because it kind of gives some insight into potentially where Postgres is going with its sharding and talking about some of the issues they were dealing with and some of the use cases and how they were adapting their solution to handle these use cases better. So if you want more insight into kind of sharding and kind of potentially where the future will lead, this is a very interesting blog post to check out.

The next piece of content, the PostgreSQL person of the week is Melanie Plageman. So if you're interested in learning more about Melanie and her contributions to Postgres, definitely check out this blog post.

The next piece of content is "A Deep Dive into PostGIS Nearest Neighbor search". This is from crunchydata.com and it is a deep dive talking about the Nearest Neighbor search using PostGIS. So if you are interested in learning more about that, check out this blog post.

The next post is "INTERSECTING TRACKS OF INDIVIDUALS - MOBILITYDB". This is from cybertec-postgresql.com and this is a follow-up on a post kind of talking about contact tracing that's been discussed recently. This post talks about setting up data structures within PostgreSQL so that they are MobilityDB enabled. This is an extension for PostgreSQL to make some of this easier. It's built on top of PostGIS, they mentioned here, setting up trips based on initial mobile points, and then discovering the intersection of infected individuals to retrieve possible contacts. So if you want to go more in-depth on this, definitely check out this blog post.

The last piece of content is "How to use Tree Machine Learning model with 2UDA - PostgreSQL and Orange (Part 4)". So this is part four of the machine learning posts from 2ndquadrant.com.
