Query Planner, BRIN Indexes, Approximate Algorithms, Delayed Replication | Scaling Postgres 53
In this episode of Scaling Postgres, we review articles covering the Postgres query planner, BRIN indexes, approximate algorithms, and the uses of delayed replication.
Content Discussed
- Debugging the Postgres query planner
- PostgreSQL BRIN Indexes: Big Data Performance With Minimal Storage
- Approximation algorithms for your database
- How we used delayed replication for disaster recovery with PostgreSQL
- Webinar: Achieving High Availability with PostgreSQL [Follow Up]
- The Current State of Open Source Backup Management for PostgreSQL
- pgBackRest - Performing Backups on a Standby Cluster
- “Exclusive backup” method is deprecated – what now?
- Using parallelism for queries from PL functions in PostgreSQL 10
- Parallel queries in PostgreSQL
- Why you should use a relational database instead of NoSQL for your IoT application
Transcript
[00:00:00] In this episode of Scaling Postgres we talk about the Postgres query planner, BRIN indexes, approximate algorithms and delayed replication. I'm Kristen Jameson and this is Scaling Postgres episode 53. Alright, the first article this week is Debugging the Postgres query planner, and this is from the gocardless.com blog. In it they're basically talking about how they have a two terabyte database and they saw that a set of queries started operating slower, and apparently, based upon the amount of data, the statistics caused the planner to start using a different plan. So they say, quote, we saw a concerning number of queries that were very long running that normally execute under 50 milliseconds, and this was the query they were seeing the difficulty with. They give a little bit of background about how their data is laid out: they have a payments table that connects to a payment transitions table, and they make a number of observations, like approximately 20% of the payment transitions table will be marked with a payout ID. Now, the first thing they did to examine this was use EXPLAIN to get a sense of what plan the query is using, and they noticed it's using a nested loop, which is a little bit more inefficient compared to some other methods of doing the join. So basically they go through looking into it, trying to get a sense of what the statistics think are there and how many rows the planner should expect, and they were looking at the pg_stats view to get some of this information. Then they had a discussion about what their ideal plan would be, and they even drill into the source code trying to determine what went wrong in the planner. They think the major issue was that, with a LIMIT 1 in the query, the planner chose a plan that would scan through too many rows to find that one row, because the statistics were off; it thought that just looking for that row sequentially would be the fast way to do it. And then they talked about how they fixed it, which was basically adjusting the payout ID statistics on the payment transitions table so that the column gets sampled at a higher rate. Of course, they mention that you don't want to set the sampling rate too high, because then your analyzes, including autoanalyze, have that much more work to do. So I thought this was an interesting blog post, but one thing I tend to like to do, going back to the EXPLAIN plan, is an EXPLAIN ANALYZE. EXPLAIN on its own gives you an estimate of what the planner thinks the costs are and how many rows it expects to see; when you add ANALYZE it actually runs the query and reports what really happened, so you can more easily see discrepancies between what the planner thinks is there versus what it actually took to do the query. So in general I always like to do that.
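Just to illustrate that difference, here's a minimal sketch of the two commands against a hypothetical payment_transitions table; the table and column names are my assumptions for illustration, not the actual schema from the GoCardless post:

    -- Estimated plan only: shows planner costs and expected row counts
    EXPLAIN
    SELECT * FROM payment_transitions
    WHERE payout_id = 'PO00123' LIMIT 1;

    -- Actually runs the query, so you can compare estimated vs. actual rows per plan node
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM payment_transitions
    WHERE payout_id = 'PO00123' LIMIT 1;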
[00:03:06] The other thing I noticed is that they said approximately 20% of the payment transitions will be marked with a payout ID. That means 80% of them don't have one, and they have a simple index on that column. So my first inclination would be to do a partial index on it. Perhaps even before looking at adjusting the statistics, maybe put a partial index on the table where the payout ID is not null. That should give it many fewer rows to sort through and will probably increase the speed of the query in general. But anyway, I thought this was a good post that goes through how they diagnosed a slow query on a large table and how they chose to use statistics as a way to resolve what the query planner was doing.
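If you wanted to try both ideas, the statistics change they made and the partial index I'm suggesting, a rough sketch might look like this (again, the table and column names and the statistics target are assumptions for illustration):

    -- Sample the payout_id column more heavily so the planner's estimates improve
    ALTER TABLE payment_transitions ALTER COLUMN payout_id SET STATISTICS 1000;
    ANALYZE payment_transitions;

    -- Or index only the ~20% of rows that actually have a payout ID
    CREATE INDEX CONCURRENTLY payment_transitions_payout_id_partial
        ON payment_transitions (payout_id)
        WHERE payout_id IS NOT NULL;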
[00:03:52] The next post is PostgreSQL BRIN Indexes: Big Data Performance With Minimal Storage, and this is from the Crunchydata.com blog. Basically they're exploring BRIN indexes, and they're thinking about a use case where maybe you have some time series data. So they're talking about sensor readings: they create a table that has a unique ID, what was scanned by the sensor as a float, and a timestamp, and they want to be able to query by the timestamp. First they inserted 10 million rows into this table and then they did a query for a time range, grabbing a month of the data by day. With parallelism turned off, the query ran in about 1.6 seconds. When they turned on parallel queries and ran it, it actually ran in 460 milliseconds, so over three times as fast. But one thing they didn't mention in the blog post that I observed is that in the non-parallel case the sort of all the data being pulled back is happening on disk, which is going to really slow down the query, whereas the parallel runs are using in-memory sorts, I guess due to the size being lower than work_mem. So I wonder what the timing would be if work_mem were bumped up a little so the sort fits in memory; then what would the difference be between parallel execution and a single process? Now, this is all done sequentially with no indexes. So he goes ahead and adds a standard B-tree index, and with that it's at 1.1 seconds. So it's faster than a sequential scan but not as fast as doing it in parallel. But again, we're running into the sort being done on disk, so that may be the reason for the speed difference here. Then he made note of the index size, 214 megabytes, and then he tries a BRIN index, and that ran in 967 milliseconds, so just a little bit faster than the B-tree index.
[00:06:01] But look at the size: it went from 214 megabytes to 32 KB. So that's one thing he's mentioning here, that BRIN can be really good if you have a lot of data, especially if it's time series oriented. That's kind of where BRIN indexes can give you a little bit better performance, but they give you huge gains in terms of disk space savings.
[00:06:27] So in this case, it's far less than a hundredth of the space of the B-tree index. And because BRIN tends to excel with larger amounts of data, he then tried BRIN versus a B-tree index with 100 million rows. Again, with no parallel query, just a sequential scan doing its merge on disk, it ran in 10 seconds. Doing it in parallel, it keeps the sort in memory, doesn't go to disk, and does it in 2.6 seconds. Adding the B-tree index brings it down to 1.1 seconds, so definitely faster than the parallel sequential scan, and the size of that index is over 2GB. But with the BRIN index it runs in 975 milliseconds. So 1.1 seconds for the B-tree index and about 0.9 seconds for the BRIN index; again, the BRIN index is a little faster. But again, what's amazing is the disk space savings: whereas the B-tree index is over 2GB, the BRIN index is 184 KB. So what's interesting about BRIN indexes is that in certain cases they can give you better performance than a B-tree, which tends to be with sequential data when you have a lot of it, but they give you huge savings in terms of disk space. That could be advantageous because perhaps you're able to cache BRIN indexes more readily than a B-tree. So, definitely an interesting blog post; I definitely suggest you check it out.
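To make that test setup easier to picture, here's a minimal sketch of that kind of sensor table with both indexes and a size comparison; the table, column and index names, the row count and the work_mem value are my assumptions, not the exact ones from the Crunchy Data post:

    CREATE TABLE sensor_data (
        id bigserial PRIMARY KEY,
        reading float8 NOT NULL,
        recorded_at timestamptz NOT NULL
    );

    -- Load 10 million rows of time-ordered sample data
    INSERT INTO sensor_data (reading, recorded_at)
    SELECT random() * 100, now() - (g * interval '1 second')
    FROM generate_series(1, 10000000) AS g;

    -- Give sorts more room so they stay in memory instead of spilling to disk
    SET work_mem = '256MB';

    -- Standard B-tree index on the timestamp column
    CREATE INDEX sensor_data_recorded_at_btree ON sensor_data (recorded_at);

    -- BRIN index; works well here because recorded_at correlates with physical row order
    CREATE INDEX sensor_data_recorded_at_brin ON sensor_data USING brin (recorded_at);

    -- Compare the on-disk size of the two indexes
    SELECT pg_size_pretty(pg_relation_size('sensor_data_recorded_at_btree')) AS btree_size,
           pg_size_pretty(pg_relation_size('sensor_data_recorded_at_brin')) AS brin_size;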
[00:08:03] The next post is Approximation algorithms for your database, and this is from the Citusdata.com blog. Here they're covering two extensions you can use to get approximations, because they're talking about how, once you start getting a lot of data, getting accurate answers when you want to do things like a distinct count, a top-N (like a top five or top ten from a list), or maybe even a median can be hard to do. There are two extensions that can help with that. HyperLogLog (hll) is an extension you can add to PostgreSQL that gives you approximate distinct counts, and it does things like help you get a unique count of visitors. You may have some sort of analytics where an approximate count is okay, and that would be a use case for an extension like this. Then they also talk about TopN, where you can get the top N of a set of data, so in this case it does a top ten, for example. And again, these are approximations, so they're not 100% accurate, but they give you an approximation when you have a large amount of data. So if you're interested in doing something like that and you have a lot of data, definitely a blog post to check out.
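As a rough idea of what the HyperLogLog side of that looks like, here's a sketch of an approximate unique-visitor count; the page_visits table and its columns are assumptions, while hll_hash_integer, hll_add_agg and hll_cardinality are functions provided by the postgresql-hll extension:

    CREATE EXTENSION IF NOT EXISTS hll;

    -- Approximate number of distinct visitors per day
    SELECT date_trunc('day', visited_at) AS day,
           hll_cardinality(hll_add_agg(hll_hash_integer(visitor_id))) AS approx_uniques
    FROM page_visits
    GROUP BY 1
    ORDER BY 1;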
[00:09:17] The next post is How we used delayed replication for disaster recovery with PostgreSQL. So this is something I actually haven't implemented, but I've thought a lot about it. Basically, they're using delayed replication to have a relatively recent copy of their data without it being immediately in sync, and without having to do a full restore from a backup. What they're doing is setting up a replica server that stays 8 hours behind their primary database. So it's kept in sync, but it's always 8 hours behind. Now, where that's interesting is when something happens in your master database and you have a multi-terabyte database; this is where this comes in.
[00:10:02] If something happens, say a table is dropped erroneously or something of that nature, restoring a multi-terabyte database can take a really long time. And the problem is that if you have a replica that's in sync, that table drop generally happens immediately on the replica as well. The advantage here is that replication is delayed by a number of hours, so if that incident happens, you can pause replay on that replica, set a recovery target to the point in time right before the incident happened, replay the replica up to that point, and then correct or grab the data to be able to restore it in your production system. They talk about the main configuration option that does this in your recovery.conf file when you're setting up your replica, which is recovery_min_apply_delay. They have it set to 8 hours, so this particular replica is always following what the primary database is doing, but it's committing those changes 8 hours behind the primary. They even go over some of the code showing how PostgreSQL implements it and how they have put it in place to do data recovery without having to do a full backup restore. Now, they're not saying this takes the place of a backup; as they say, replication is not a backup mechanism. But if you have multiple terabytes of data and you need to do some sort of data restore event, having one of these delayed replicas can be a great way to restore data relatively quickly, without having to restore the entire multi-terabyte database, because you already have it there; you just reconfigure the recovery target to the exact time you want to recover the data to. So if you haven't considered using this, it's definitely something to investigate.
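To make that concrete, here's a minimal sketch of what such a delayed replica's recovery.conf might look like on the PostgreSQL versions of that era; the hostname, user and example target time are placeholder assumptions, and the exact recovery procedure in their post differs in detail:

    # recovery.conf on the delayed replica
    standby_mode = 'on'
    primary_conninfo = 'host=primary.example.com user=replicator'
    recovery_min_apply_delay = '8h'

    # After an incident: pause replay with SELECT pg_wal_replay_pause();
    # (pg_xlog_replay_pause() on 9.6), then drop the delay, set a target just
    # before the incident, and restart so the replica rolls forward only to that point:
    # recovery_target_time = '2019-02-14 09:55:00'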
[00:11:58] The next post is Webinar: Achieving High Availability with PostgreSQL [Follow Up], and this is from the 2ndQuadrant.com blog. Basically they did a webinar about achieving high availability with PostgreSQL.
[00:12:13] So you just click on the link, fill out some contact information, and they give you access to the webinar. They talk about availability in terms of recovery point objective and recovery time objective, setting up replication for your high availability, and how PgBouncer can be used. They talk about handling backups, and they cover some of the products that 2ndQuadrant offers, such as their replication manager, their Barman backup and recovery tool, as well as going the master-master route with their bi-directional replication. So if you're interested in that sort of thing and achieving higher availability for your PostgreSQL instance, definitely a webinar to check out.
[00:12:56] The next post is The Current State of Open Source Backup Management for PostgreSQL.
[00:13:02] So by default we have pg_dump, and we also have pg_basebackup for doing what I call physical file backups. But they also go over some other products that are available, such as Barman by 2ndQuadrant, pgBackRest, pg_probackup by Postgres Professional, and BART by EDB. Basically they go over each of their features and some pros and cons of each. So if you're interested in adjusting how you're doing your backups or you want to do something different, it's definitely a blog post to check out for a comparison of the different tools that are out there. Now, related to that, the Crunchydata.com blog has put out a post called pgBackRest - Performing Backups on a Standby Cluster. This is really talking about setting up two systems and getting pgBackRest up and running on them. It goes really in depth into all the commands you would need to use to set up two PostgreSQL instances and configure pgBackRest to back up using a standby system. So if you're interested in exploring pgBackRest, definitely a blog post to check out.
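As a rough illustration of the standby-backup setup that post walks through, a pgBackRest configuration might look something like the sketch below; the hosts, paths and stanza name are assumptions, and backup-standby=y is the option that tells pgBackRest to copy the data files from the standby rather than the primary:

    # /etc/pgbackrest.conf (sketch)
    [global]
    repo1-path=/var/lib/pgbackrest
    backup-standby=y

    [demo]
    pg1-path=/var/lib/postgresql/11/main
    pg1-host=primary.example.com
    pg2-path=/var/lib/postgresql/11/main
    pg2-host=standby.example.com

    # Then run: pgbackrest --stanza=demo backup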
[00:14:15] The next post is "Exclusive backup" method is deprecated - what now? So we're talking about the exclusive backup method, the method you would use to do physical backups before pg_basebackup came out: you execute pg_start_backup with some sort of label before you start taking copies of the files, and then do pg_stop_backup after you're all done. They call that exclusive because you can only have one of those going at a time per instance, since it actually creates a file in your PostgreSQL data directory saying that a backup is going on and what the label is. So it's exclusive in the sense that you can only have one at a time. Now, the problem they mention with the exclusive backup method is that if PostgreSQL crashes or the operating system crashes during a backup, that file is left around, and, quote, there's no way to distinguish the data directory of a server that crashed while in backup mode from one that's trying to be restored, and you'll get errors when you try to bring up a system after it's crashed during a backup. He mentions that pg_basebackup overcomes this: its backup label is not written to the data directory but added to the backup itself. They also mention that since 9.6 you can do a non-exclusive backup with pg_start_backup and pg_stop_backup. That basically keeps the database session open, it doesn't use a file, and in the same session where you started you tell it to stop the backup. But there can be some issues with it because that same database session has to stay open. They also mention that the documentation says the exclusive method will eventually be removed, and he says you probably have until 2021 before it's actually fully removed. That can be a problem because some backup software relies on a pre-backup command and a post-backup command, and those generally don't happen in the same database session, so the session isn't maintained. So how do you handle that? Basically, he has come up with a set of scripts to do this. So if you potentially want to use their scripts to do a non-exclusive backup of your PostgreSQL database system, definitely a blog post to check out.
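For reference, the non-exclusive variant he's describing looks roughly like this; both calls have to run in the same session, which must stay connected for the entire file copy:

    -- Start a non-exclusive backup: label, fast checkpoint = false, exclusive = false
    SELECT pg_start_backup('nightly', false, false);

    -- ...copy the data directory with tar/rsync while this session stays open...

    -- Finish; returns the backup_label and tablespace map contents to store with the backup
    SELECT * FROM pg_stop_backup(false);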
[00:16:40] The next post is Using parallelism for queries from PL functions in PostgreSQL 10, and they talk about how intra-query parallelism was introduced in Postgres 9.6; however, it didn't work within PL functions. So when you tried to do it within a function, you would actually get a sequential scan. But in version 10 it does parallelize it. So again, this is just a good reason to always try to keep up on your versions of PostgreSQL, because it can give you a lot of performance improvements.
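To illustrate, a function like the following would run its internal query with a parallel plan on PostgreSQL 10 but serially on 9.6; big_table and the worker setting are just assumptions for the example:

    SET max_parallel_workers_per_gather = 4;

    CREATE OR REPLACE FUNCTION count_big_rows() RETURNS bigint AS $$
    DECLARE
        n bigint;
    BEGIN
        -- The query runs to completion inside the function, so PostgreSQL 10 can parallelize it
        SELECT count(*) INTO n FROM big_table;
        RETURN n;
    END;
    $$ LANGUAGE plpgsql;

    SELECT count_big_rows();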
[00:17:21] The next post is Parallel queries in PostgreSQL, and this goes through kind of the current state of parallel queries, starting with 9.6 where parallel sequential scans were introduced, how it works in terms of the processes and how they communicate, and how to configure the number of workers and the cost estimates. Then it talks about nested loop joins, hash joins and merge joins, as well as partition-wise joins and parallel append. So if you want to get more in depth on parallel query execution in PostgreSQL, definitely a blog post to check out.
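For context, these are the main settings that kind of post typically walks through; the values shown here are just illustrative, not recommendations:

    -- Workers allowed under a single Gather node in one query
    SET max_parallel_workers_per_gather = 4;
    -- Total parallel workers across the instance (bounded by max_worker_processes)
    SET max_parallel_workers = 8;
    -- Cost knobs that influence when the planner picks a parallel plan
    SET parallel_setup_cost = 1000;
    SET parallel_tuple_cost = 0.1;
    -- Tables smaller than this aren't considered for a parallel scan
    SET min_parallel_table_scan_size = '8MB';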
[00:17:56] The next post is Why you should use a relational database instead of NoSQL for your IoT application, and this is from the Timescale.com blog. Timescale is an extension for PostgreSQL that lets you use it as a time series database, basically partitioning time series data into many small chunks. The post is talking about the advantages of using SQL over NoSQL, in that it offers joins, aggregations, window functions, common table expressions and roll-ups, and how schemas are actually a good thing as opposed to having no schema and no structure at all. It also talks about reliability. So it's basically advocating using SQL databases for your time series data.
[00:18:48] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.