Checking Vacuum, Fast Distinct Count, Group By Speed Up, Health Check | Scaling Postgres 57
Join Over 1,000 Engineers & Get New Episodes Weekly!
In this episode of Scaling Postgres, we review articles covering checking on vacuum, how to get a fast distinct count and speed up group by and a health check playbook.
Content Discussed
- Scaling PostgreSQL: Check your Vacuum!
- Best way to count distinct indexed things in PostgreSQL
- Speeding up GROUP BY in PostgreSQL
- A health check playbook for your Postgres database
- Indexes in PostgreSQL — 4 (Btree)
- Waiting for PostgreSQL 12 – REINDEX CONCURRENTLY
- Loading Data Into PostgreSQL
- Where and when you need a root.crt file
- GeoJSON Features from PostGIS
- Text search: a custom dictionary to avoid long words
- pgBackRest archiving tricks
YouTube Video
Podcast Audio
Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about checking vacuum. Fast distinctcount group up. Speed up in postgres health check. I'm Kristen Jamison and this is ScalingPostgres, episode 57.
[00:00:21] Alright, I hope everyone is having a great week. So our first piece of content isScaling PostgreSQL check your vacuum. And this is from Git. Engineering, I believe. Blog. Andin it they're talking about using a table. And they were doing a lot of inserts, like 5000 requestsper minute, 30GB inserted daily. And because it was mostly inserts and they didn't need some ofthe historical data in order to avoid void deletes, they went ahead and used partitions. So that'sgreat, they can just drop the oldest partition whenever they want to. And they don't have to havethe overhead of deletes because due to postgres's MVCC architecture, the deletes need to bevacuumed up. And they said, quote great, we had tables with daily partitions and we thought itall had it all figured out. Surprisingly though, after about two weeks in production, the responsetime suddenly and sporadically escalated to around 200 milliseconds from what was about 20milliseconds. And basically they discovered that it was an issue with auto vacuum because eventhough they were inserting this data, apparently they were doing some updates and deletes thatof course required auto vacuum to run to clean up the old dead rows. So basically they went withwhat they're calling the event store approach. So only doing inserts, keeping the data immutable,basically trying to avoid updates and deletes as much as possible. But again, they say, quote,like the previous solution, this ID was running just fine a couple of weeks and everything lookedgreat. Unfortunately, about three weeks into this variation, we suddenly started experiencing thesame issue as before, a sudden unexpected performance degradation. Now, the next thing theydid is move to postgres Ten because they were on 9.6. And one of the reasons they did that isbecause 9.6 uses a callback function which actually inserts a row into the parent table and thendeletes it, which essentially leaves a dead row and causes the vacuum process to kick off. Now,part of this is due to the fact that they were using an orm, and they say here since orms rely onPostgreSQL's returning statement for every database transaction, which can only be achievedby first inserting the row into a parent table. So there may have been another way around it, butbasically they're saying they could choose to move away from the orm or use the nativepartitioning solution in version ten, which is what they went with doing. Unfortunately, yet again,they ran into a problem with auto vacuum and that is due to the anti wraparound mechanism.And of course this is due to postgres's MVCC architecture with determining who can see whatrows are visible. And it uses this transaction ID, this XID, to be able to map what rows are visibleand what rows are not. If you hit 2 billion in your XID, it can essentially shut down your databasebecause it requires a manual vacuum to clean and freeze those rows so they aren't visibleanymore. And what was happening is they were putting about 200 million rows into the tableeach day, which happens to be at what point a required auto vacuum kicks off to avoid atransaction ID wraparound. So how they solve this issue is they actually adjusted their autovacuum freeze max age to 500 million transactions and that avoided the wraparound autovacuum from kicking off. So basically, as they say here quote, we figured it out. So I thought thiswas a very interesting story about how they discovered and explained how auto vacuum worksin different cases and for their use case what kind of issues they were running into. So, definitelya pretty great blog post for getting a story based education about using auto vacuum.
[00:04:03] The next post is best way to count distinct indexed things in PostgreSQL. Now theTLDR is that when you're wanting to do a distinct count, this is the wrong way. So from sumtabledo a count and then of the distinct text hashes in it. So what you'll get is a sequential scan eventhough you have an index on the text hash. And really what you want to do is select all thedistinct text hashes as part of a subquery or CTE and then do a select count of all of those rows.Then it can use the index and return data much faster. In his example here, it returned the resultin 5 seconds v --ersus 40 seconds. So if you're needing to do distinct count of items, definitely keep this inmind. And he also references this blog post from Citizens Data that you can examine that kind ofgoes into more detail about it.
[00:04:59] The next blog post is speeding up group by in PostgreSQL. Now, they describe thisas a way to squeeze the last couple of percentage points out of a query so to basically squeezea bit more performance and they classify it as a small optimization, but it can help. So basicallythey created a table and generated some data within it looks like about 10 million rows. And thenthey ran a simple aggregation, doing an average of a numeric column and grouping it by twoother columns and it will do a sequential scan and it says PostgreSQL will read the entire tablesequentially and perform a hash aggregate which is where it spends most of its time in the hashaggregate. And basically the only change is reordering the group by putting the value with themost distinct values first, grouped by the less frequent values later. And that will run the hashaggregate algorithm more efficiently. And here he shaved about 400 milliseconds off of what wasa 3300 millisecond query, so definitely more than a 10% improvement. Now, what's interesting isthat they mentioned it's very highly likely that future versions of PostgreSQL will already dosome of this change automatically. But in the meantime it's just something to keep in mind if youwant to make your queries as fast as possible. So definitely a blog post to check out.
[00:06:22] The next post is a health check playbook for your postgres database and this is fromSlitistdata.com and basically it's a playbook for you to what to consider and examine to maintainthe performance of your database. The first one is the cache rules everything around me whichis basically monitor your cache hit ratio and try to maintain it at 99% by doing things to maximizehow your cache is being hit. This could be getting more memory for the server, this could begetting rid of unused indexes. Whatever you would need to do to try and maximize the amount isdone using memory only and not having to go to disk.
[00:07:00] The next consideration is beware of dead tuples. So basically keep on top of yourauto vacuum settings to make sure that your Bloat is as small as possible as well as helps youavoid auto vacuum wraparound issues.
[00:07:15] And as you can see here, have different queries that you can run to determine thestate of the database for these different things. And then they talk about over optimizing is athing where they're talking about periodically you want to get rid of unused indexes because theywill slow down your insert update delete performance as well as cause you use more disk spaceand potentially might reduce your cache hit rate. And then of course they consider checking onyour query performance. So using things like PG stats statements to understand what queriesare running against your database, you could also turn on the slow query log so you can seewhat queries are running slowly in your database and get a sense of what queries you need toimprove to improve your overall database performance. So generally pretty good review of thethings you should keep on top of with regard to maintaining your postgres database.
[00:08:04] The next post is simply called Indexes in PostgreSQL for Btree and this is fromHabr.com and I believe it is from a previous post on Postgres Ru and it's been translated fromthis site which is in Russian. So this is the English version of it and it really goes into acomprehensive discussion of Btree indexes from how they work to sorting to how to handleorder of columns and nulls and even mentioning things such as covering indexes in postgreseleven and what's more than that. This is actually the fourth post with regard to this and theyhave number one was on the indexing engine of Postgres, the interface of access methods andthe hash index. So if you want to learn more about PostgreSQL and its indexing, these aredefinitely a series of blog posts to check out. The next post is waiting for PostgreSQL twelvereindex concurrently. Now, right now there is the reindex command, but it doesn't really workconcurrently. So usually that means large databases or any active database can't really use it.They have to actually do a create index concurrently and then drop the index concurrently. Andthen if you want to maintain your n --aming scheme, then you have to alter the index to give it a new name. But with this patch,presumably you can do a reindex and it even works on primary keys. So this looks to be a greataddition in PostgreSQL twelve.
[00:09:38] The next post is loading data into PostgreSQL and this is from a presentation thathappened at Pgcomf 2019. I'm not quite sure which one, but this is on the PostGIS US site andit's a PDF presentation and it talks about different forms of loading data into postgres. So fromusing the copy to copy, from talking about loading data into large binary objects, using foreigndata wrappers to load data in. So you can connect to say, an external postgres database orOracle database to load data in, as well as just reading directly from it, or even using the foreigndata wrapper file to be able to pull data in from a file. And they also covered the Http extension.If you want to reach out and contact an Http site to pull data into postgres, that's covered as wellas some command line tools such as a PG loader and some of the others mentioned here. So ifyou have a need to load data into PostgreSQL, definitely a presentation to check out.
[00:10:39] The next post is where and when you need root CRT file. And this is fromsecondquadrant.com blog and this is a discussion, this is a very short post, but it's talking aboutTLS, which is basically SSL. And in setting that up with PostgreSQL, it talks about the caseswhen you would need a root cert file. Basically if you want to do some things to authenticate,you're contacting the right server and where and when you would need this root CRT file. So ifyou use SSL connections or TLS connections with your PostgreSQL networking setup, definitelya blog post to check out the next post is GeoJSON features from PostGIS. So again, this isabout PostGIS and something I wasn't familiar with until I looked at this post. And GeoJSON, it'sbasically a standard to convey geometry or geographic information within JSON. And it sayshere quote, with a little scripting, it's possible to make a passable function that converts the rowsfrom postgres into GeoJSON. So he sets up a simple table here. He shows how you can convertit to JSON B from just querying the table. And then he shows this function that he developed. Itactually converts it into the GeoJSON format. He says, quote now we can turn any relation into aproper GeoJSON feature with just oneish call. So he shows you how to use it. So if you usePostGIS and want to convert things to GeoJSON, definitely a blog post to check out.
[00:12:14] The next post is Text Search, a custom dictionary to avoid long words. So this isabout text search and this is actually very in depth post that involves looks like a little bit involvedworking with C code. So it's definitely some low level stuff he's talking about. With Text search,basically text is converted into TS vectors and normally when you're converting into TS vectorsyou're going to be reading some sort of a document. He says what about undesirable pieces oftext? So he says quote, for instance, when indexing mail messages, badly formatted messagesmay have base 64 contents that slip into text parts and that these can be pretty big. So basicallyhe wants to avoid these types of super long terms when converting to a TS vector. One solutionthat he mentions here is filtering by length and he mentions several dictionaries that exist in thePostgreSQL source code that you can use as a reference to build your own custom index to dothis. And here's where they get into the different C functions for kind of setting this up and thenhow you could potentially use this. So if you have a need for this use case, this is an interestingblog post to check out.
[00:13:28] The last post is PG Backrest Archiving Tricks, and this post specifically coversasynchronous archiving of the wall files and it says with the PG Backrest info command forperformance reasons, PG Backrest doesn't check that all the needed wall segments are stillpresent, whereas Checkpg Backrest is built for that. And he covers setting up PostgreSQL PGBackrest and ways to potentially avoid that. So if you use PG Backrest, this is definitely a blogpost I suggest checking out.
[00:14:01] That does it. For this episode of Scaling Postgres, you can get links to all the contentmentioned in the show notes. Be sure to head over to where you can sign up to receive weeklynotifications of each episode. Or you could su --bscribe via YouTube or itunes. Thanks. --