Prepared Plans, Materialization, Recursive CTEs, Safe Migrations | Scaling Postgres 68
In this episode of Scaling Postgres, we discuss prepared plans, materialization, recursive CTEs and how to run safe database migrations.
Content Discussed
- Tech preview: How PostgreSQL 12 handles prepared plans
- Tech preview: Improving COPY and bulkloading in PostgreSQL 12
- Rethinking the Database Materialized View as an Index
- Checking the sequences status on a single pass
- FizzBuzz (in both plpgsql and SQL)
- A recursive CTE to get information about partitions
- [YouTube] rails db:migrate:safely by Matt Duszynski
- [YouTube] When it all goes Wrong (with Postgres) by Will Leinweber
- [YouTube] Postgres & Rails 6 Multi-DB: Pitfalls, Patterns, Performance by Gabe Enslein
- Indexes in PostgreSQL — 10 (Bloom)
- What's New in Crunchy PostgreSQL Operator 4.0
YouTube Video
Podcast Audio
Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about prepared plans, materialization, recursive CTEs, and safe migrations. My name is Creston Jameson and this is Scaling Postgres episode 68.
[00:00:21] All right, I hope you're having a great week. Our first piece of content is Tech preview: How PostgreSQL 12 handles prepared plans. This is from cybertec-postgresql.com, and they're talking about prepared plans, where you can basically prepare a plan ahead of time and then execute that plan. But some of the behavior is different from what people expect prior to 12, and they're making some adjustments in 12, so this post examines that. Basically, he creates a table here with just a serial column as the ID and a name as text. He inserts about a million rows named Hans and only two rows named Paul, so an index will be very efficient at finding Paul, but Postgres will use a sequential scan for Hans. So he looks at that here. He goes ahead and creates an index on the name, and when he looks for Hans, you'll see it does a sequential scan, even in parallel, to be able to pull out the data. So it does not use the index even though it exists, because it's more efficient to just scan the whole table to find all the instances of the name Hans. Whereas when he does Paul, it does use the index. This is the optimizer checking the statistics to determine whether there are too many rows matching a name like Hans for the index to be worthwhile. And then here he looks at the pg_stats table and shows that, as part of these statistics, one of the most common values is Hans, so there's a high probability it will not use an index when searching for that name; it'll just do a sequential scan. Now, in terms of prepared plans, you can prepare a plan and then execute it, passing in the value you want to test. So we see here that when it executes the plan with Paul, an explain shows it actually is going to do an index scan. And even when using Hans, it chooses a different plan: it still does the sequential scan. And he says, quote, even if we prepare explicitly, we will still get a fresh plan before a generic plan is created, and a generic plan is made assuming some constant parameters. So this is kind of a judgment call. And he points out here that you could create a new plan for every execution that considers the current parameter value, which will always lead to the best plan.
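If you want to try this behavior yourself, here is a minimal sketch along the lines of what the post describes; the table name, data and statement name are illustrative, not the post's exact code:

```sql
-- Hypothetical table mirroring the setup: many rows named 'hans', only two named 'paul'
CREATE TABLE t_test (id serial PRIMARY KEY, name text);
INSERT INTO t_test (name) SELECT 'hans' FROM generate_series(1, 1000000);
INSERT INTO t_test (name) VALUES ('paul'), ('paul');
CREATE INDEX ON t_test (name);
ANALYZE t_test;

-- The planner statistics list 'hans' as a most common value for the column
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 't_test' AND attname = 'name';

-- Prepare the statement once, then execute it with different parameter values
PREPARE name_query (text) AS SELECT count(*) FROM t_test WHERE name = $1;
EXPLAIN EXECUTE name_query('paul');  -- expected to use the index
EXPLAIN EXECUTE name_query('hans');  -- expected to fall back to a (parallel) sequential scan
```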
[00:02:48] But then the purpose of preparing that statement, to avoid plan calculation, kind of goes out the window, because you have to calculate the plan every time. Or you could use a generic plan that does not take the parameters into account, but then you may not come up with the best plan. And it says by default PostgreSQL chooses a middle road: it will generate a custom plan during the first five executions of the prepared statement that takes the parameter values into account, and from the sixth execution on it checks if the generic plan would have performed as well, and basically it will use that generic plan from that point on. But PostgreSQL 12 introduces a new variable that allows you to control the behavior more explicitly. So for example, you can set plan_cache_mode, and there are a couple of different ways you can set it. If you set it to force the generic plan, you'll see you always get a parallel sequential scan for this. So again, that's not the best plan for Paul, but you're forcing it to use its generic plan. You can also set it to force a custom plan, or auto, which is the default value. So definitely a new parameter to be aware of if you want to adjust the default behavior in PostgreSQL 12 once it comes out.
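Continuing with the prepared statement from the sketch above, a rough illustration of the new PostgreSQL 12 setting; the allowed values are auto (the default), force_custom_plan and force_generic_plan:

```sql
-- Force the generic plan: parameter values are ignored when planning,
-- so even 'paul' gets the plan chosen for the generic case
SET plan_cache_mode = 'force_generic_plan';
EXPLAIN EXECUTE name_query('paul');

-- Force a fresh custom plan on every execution instead
SET plan_cache_mode = 'force_custom_plan';
EXPLAIN EXECUTE name_query('paul');

-- Back to the default five-executions-then-compare behavior
SET plan_cache_mode = 'auto';
```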
[00:04:08] The next post is Tech preview: Improving COPY and bulkloading in PostgreSQL 12. This is from cybertec-postgresql.com as well, and it covers the COPY command, where you can copy data from a file into a table. Before, you had to load the whole file, but with version 12 they're offering a WHERE clause, so you can actually filter on one of the columns you're importing to define which subset of the data gets loaded. So for example, he has a file with 1,000 rows in it, one to 1,000, and he says give me all the rows where the column being loaded, x, is less than five, so it only loads four rows. So as opposed to having to load everything and then select from that table into another one, you can now use a WHERE clause to fine-tune exactly how much data you load into PostgreSQL 12. So definitely a great feature to be added.
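As a quick illustration of the new syntax, something like this should work in PostgreSQL 12; the file path and table name here are made up:

```sql
CREATE TABLE t_numbers (x int);

-- /tmp/numbers.txt is a hypothetical file containing the numbers 1 through 1000, one per line.
-- PostgreSQL 12 lets you filter rows as they are loaded, so only values below 5 end up in the table.
COPY t_numbers FROM '/tmp/numbers.txt' WHERE x < 5;
```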
[00:05:09] The next post is Rethinking the Database Materialized View as an Index, and this is from blog.timescale.com. This is about TimescaleDB, which is an extension for using time series data with PostgreSQL. They're comparing and contrasting indexes versus materialized views. Basically, indexes are updated in real time, so you can use them to track what's going on in the database, whereas materialized views are refreshed on a periodic basis. You usually use them when you're having to do aggregations over a lot of data, because if one new data point is added, it's easy to add a reference to an index, but it's harder to recalculate a whole sum or a whole average. There are techniques you can use to do it, but with a materialized view you generally refresh the whole thing. They're taking an approach to treat a materialized view more like an index because, as they say here, time series data is different. You don't have writes scattered all across the table; it's basically continuously being appended to. With that, you could have all of your materialized data up to a certain threshold, where it knows that everything before that point is materialized, and then you have the non-materialized data that's not part of the materialization yet, but you can add that in on the fly as you need, because you know where that threshold point is. Now, we've seen some other references to achieving this: there have been a few blog posts in older editions of Scaling Postgres where Citus Data had come up with a couple of different techniques to do essentially the same thing, where you're querying a summarization table, or a materialization of some sort, and then adding in the fresh data from a given point to give you essentially real-time analytics. They are talking about the same thing now, but it looks like it's something they're looking to build in as an actual feature in TimescaleDB.
[00:07:12] So it's a pretty interesting thought experiment. And if you're interested in this type ofthing, definitely a blog post to check out.
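The general trick they and Citus Data describe can be sketched in plain SQL: keep an aggregate table that is materialized up to some threshold, and union in the raw rows newer than that threshold at query time. The table and column names here are hypothetical, not anything from the post:

```sql
-- daily_totals holds pre-aggregated data up to a known threshold date;
-- anything newer is aggregated on the fly from the raw events table
SELECT day, total
FROM daily_totals
WHERE day < date '2019-06-01'                       -- materialized portion
UNION ALL
SELECT date_trunc('day', created_at)::date AS day,
       sum(amount) AS total
FROM events
WHERE created_at >= timestamp '2019-06-01'          -- fresh rows, not yet materialized
GROUP BY 1;
```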
[00:07:18] The next post is Checking the sequences status on a single pass, and this is from Luca Ferrari at fluca1978.github.io. Here he's talking about a relatively simple function he created called seq_check that basically tells him how long it will be before he runs out of IDs for a given sequence. It looks like he's using integers for the sequences, I believe. So here's his function, and it allows you to tell when your sequences are going to run out. So definitely a potentially useful tool.
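I haven't reproduced his exact function here, but the underlying idea can be sketched with the pg_sequences view available in PostgreSQL 10 and up:

```sql
-- Rough sketch: how far along each sequence is toward its maximum value
SELECT schemaname,
       sequencename,
       last_value,
       max_value,
       round(100.0 * last_value / max_value, 2) AS percent_used
FROM pg_sequences
WHERE last_value IS NOT NULL
ORDER BY percent_used DESC;
```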
[00:07:53] The next post is FizzBuzz (in both plpgsql and SQL). So FizzBuzz is that classic programmer test, and he explains exactly what it is: basically, print out the numbers from one to 100, and at multiples of three print Fizz, at multiples of five print Buzz, and at multiples of both three and five print FizzBuzz. He did an implementation using plpgsql, which was pretty straightforward programmatically, but then he did it with a recursive CTE, so it's pure SQL. I found this particularly interesting, so if you're wanting to learn more about recursive CTEs and potentially how they could be used, this is a blog post you may want to check out.
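His exact implementation is in the post, but a recursive-CTE FizzBuzz generally looks something like this:

```sql
WITH RECURSIVE numbers (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 100
)
SELECT CASE
           WHEN n % 15 = 0 THEN 'FizzBuzz'
           WHEN n % 3  = 0 THEN 'Fizz'
           WHEN n % 5  = 0 THEN 'Buzz'
           ELSE n::text
       END AS fizzbuzz
FROM numbers;
```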
[00:08:39] The next post is A recursive CTE to get information about partitions. And again, this is from Luca Ferrari's blog. Basically, he uses partitioned tables in his database and he wanted to know, say, which partition is growing more and which is growing less, and how each partition compares to the others. So you have a hierarchical relationship: you have parent tables and you have the child partitions, and he's using a recursive CTE to collect this information so it can output information such as this. It reports, for each partition, how many tuples there are, how many pages, and whether something is itself partitioned or not. So it gives him a way to get a status of his partitions. And again, this is another example of using the recursive CTEs that are provided through SQL, so if you want to keep more up to date on how to use recursive CTEs, definitely a blog post to check out.
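I'm not copying his query verbatim, but the general shape of a recursive CTE walking pg_inherits looks roughly like this; 'measurements' stands in for whatever the root partitioned table is actually called:

```sql
WITH RECURSIVE partition_tree AS (
    -- start from the root partitioned table (hypothetical name)
    SELECT c.oid AS relid, c.relname, 0 AS level
    FROM pg_class c
    WHERE c.relname = 'measurements'
    UNION ALL
    -- walk down to the child partitions
    SELECT c.oid, c.relname, pt.level + 1
    FROM pg_inherits i
    JOIN pg_class c ON c.oid = i.inhrelid
    JOIN partition_tree pt ON pt.relid = i.inhparent
)
SELECT pt.relname,
       pt.level,
       c.reltuples::bigint AS approx_tuples,
       c.relpages          AS pages,
       c.relkind = 'p'     AS is_partitioned
FROM partition_tree pt
JOIN pg_class c ON c.oid = pt.relid
ORDER BY pt.level, pt.relname;
```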
[00:09:37] The next piece of content is actually a YouTube video. Now, I viewed several YouTube videos from a RailsConf that recently happened, and some of them talked about databases and specifically PostgreSQL. So this one is, from RailsConf 2019, rails db:migrate:safely by Matt Duszynski. He basically talks about how to migrate your database safely. The first point is: don't add columns with a default value. Now, version 11 of PostgreSQL makes this almost safe (it handles static default values), but if you're using a version prior to 11, you definitely need to take this into account: you need to add the new column without a default, go ahead and backfill the data, and then add the default or constraint. The next consideration is: don't backfill data inside of a single transaction, basically because you can lock access to the table. If you're going to do it, you need to narrow down the scope of the rows that you're going to impact with that backfilling, maybe 100 rows at a time or a thousand rows at a time. So something to be aware of. He says don't make schema and data changes in the same migration; basically, the more piecemeal you do it, the safer it generally is. Do add Postgres indexes concurrently, and also drop them concurrently. They also had some downtime where they did a drop and a rebuild of an index in the same migration. He didn't mention this, but you definitely want to add indexes first: you can have more than one index on certain columns, so it's always best to add the new index concurrently and only then drop the one you're not using, also concurrently. And of course, always important is to monitor and test your database performance. So those were the main takeaways, but definitely a good presentation that talks about these considerations and, again, some real-life experience with things that actually impact production. If you want to get more insight into that, definitely a presentation to check out.
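In raw SQL terms (rather than the Rails migration DSL he uses), the safe pattern for a new column with a default on pre-11 versions boils down to something like this; table, column and index names are made up:

```sql
-- 1. Add the column without a default so it's just a cheap catalog change
ALTER TABLE users ADD COLUMN status text;

-- 2. Backfill in small batches rather than one giant transaction
UPDATE users SET status = 'active'
WHERE id IN (SELECT id FROM users WHERE status IS NULL LIMIT 1000);
-- ...repeat until no rows remain...

-- 3. Only then attach the default (and any constraint) for future rows
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';

-- 4. Build and drop indexes concurrently so writes aren't blocked
CREATE INDEX CONCURRENTLY users_status_idx ON users (status);
DROP INDEX CONCURRENTLY users_old_status_idx;
```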
[00:11:27] The next presentation, also from RailsConf 2019, is When it all goes Wrong (with Postgres) by Will Leinweber, who is from Citus Data. This presentation is about what to do when something has gone wrong. From his experience, generally 95% of the time it's something in the application that caused it, maybe 4% of the time it's autovacuum related, and 1% of the time it's something else, so maybe there is something wrong with the database itself or something of that nature. The vast majority of the time it's usually something the application is doing that's causing the problems you're seeing in PostgreSQL. Now, he had a very interesting section where he says: look at what your database is doing in terms of CPU, memory, disk and parallelism, how many connections are connecting to it and what's active. He basically highlighted each of these in various combinations to suggest what's probably going on. Like, if you have high disk and parallelism it could be this, or if you have just high CPU, it could be this. And here he's saying when your memory is high, well, maybe you have some large GROUP BYs going on, or you have high disk latency due to an unusual page dispersion pattern in the workload. So he goes over several of these scenarios and what the reason could be if you're seeing a given combination of these metrics running high. He also went over various tools that you can use to better monitor your PostgreSQL installation from an OS perspective, things like perf, iostat, pgrep and htop. So if you're interested in that, definitely a presentation to check out.
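For the connections-and-activity side of that checklist, a quick look at pg_stat_activity is usually the starting point; this is just a generic example, not something from the talk:

```sql
-- How many connections are in each state (active, idle, idle in transaction, ...)
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
```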
[00:13:07] The last presentation from RailsConf is Postgres & Rails 6 Multi-DB: Pitfalls, Patterns, Performance by Gabe Enslein. Now, he does talk about the multi-database support a little bit, and he says generally what you're probably going to use this for at first is replicas: splitting your writes to one primary and your reads to replicas. But then at some point you may want to consider sharding, where you could actually take whole tables and put them on their own database. Of course, if you do that you now need to think about how you're going to do joins; are you going to do joins in the application? That's going to start getting really difficult, and at that point you may want to consider something like Citus to actually shard your database across, say, accounts or customers or something of that nature. So not a lot of it was specifically about multi-DB, but he also went into a whole discussion about different tools you can use, such as EXPLAIN ANALYZE to find out why your queries are slow; if you're seeing slow replication or have a lot of data in a table, maybe you need to consider partitioning; and he even covered the importance of connection pooling. So it's a general performance-focused talk, not exclusively on multi-DB, but that's definitely a major part of it. So definitely an interesting presentation to check out.
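As a reminder of what that basic tool looks like, a hypothetical example rather than anything from the talk:

```sql
-- Show the actual plan, row counts, timings and buffer usage for a slow query
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
```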
[00:14:32] The next post is Indexes in PostgreSQL — 10 (Bloom), and this is from habr.com. Again, this looks like it was posted and translated from postgrespro.ru. They're talking about Bloom indexes in PostgreSQL. Now, I'm not that familiar with Bloom, but it looks like it's a highly compact way to do membership checks of an element in a set, but it does allow false positives, so it has some benefits and negatives. So if you're potentially interested in Bloom indexes, definitely a blog post to check out.
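Bloom ships as a contrib extension, so a minimal example looks like this; the table and data are made up:

```sql
CREATE EXTENSION IF NOT EXISTS bloom;

-- Hypothetical wide table where many different column combinations get queried
CREATE TABLE t_bloom (a int, b int, c int, d int);
CREATE INDEX t_bloom_idx ON t_bloom USING bloom (a, b, c, d);

-- A bloom index can serve equality checks on any subset of the indexed columns,
-- with false positives rechecked against the table
SELECT * FROM t_bloom WHERE b = 2 AND d = 4;
```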
[00:15:10] The last post is What's New in Crunchy PostgreSQL Operator 4.0, and this is from crunchydata.com. They've released version 4.0 of their operator, which basically, as they say here, quote, extends Kubernetes to give you the power to easily create, configure and manage PostgreSQL clusters at scale. So basically this is using Docker, so if you want to run PostgreSQL on Kubernetes, definitely a blog post to check out.
[00:15:38] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.