Parallel Query, Real Time Analytics, Killed Index Tuples | Scaling Postgres 38
Join Over 1,000 Engineers & Get New Episodes Weekly!
In this episode of Scaling Postgres, we review articles covering parallel queries, real time analytics, killed index tuples and unexpected downtime.
Content Discussed
- Update on Basecamp 3 being stuck in read-only as of Nov 8, 12:41pm CST
- Killed index tuples
- Parallel Query in PostgreSQL: How not to (mis)use it?
- ANSI, Schmansi!
- Building a real time analytics API at scale
- WAL: Everything you want to know
- pgDay Paris YouTube Channel
- PGCon YouTube Channel
YouTube Video
Podcast Audio
Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about parallel query, realtime analytics,killed index tuples, and downtime. I'm creston. Jameson. And this is scaling postgres episode38.
[00:00:22] All right, we actually don't have a lot of content that was released this week, but thereis a lot of video content that I'm going to be presenting in this week's episode either way. Thefirst article is update on Basecamp Three being stuck in read only as of November eigth at 12:41P.m. Central Standard Time. So if you're not familiar with Basecamp, it is a software as a servicethat does essentially project management. So they had an outage that lasted a number of hoursand the reason was they ran out of integers for presumably an index. Now I'm not sure if they'reusing Postgres. I thought maybe they were using MySQL, but still, I felt that this blog post was agood reminder to people to be sure to check the size of your integers, particularly if they're beingused for primary keys, and you want to make sure you don't run out of those or really useessentially big integers for your primary keys. So basically they had some downtime and it goesthrough the process and what happened in terms of getting the system back up. So even thoughthis may not be postgres specific, it's always a good reminder to check and make sure you don'toverrun your integer sizes, particularly when it's the primary key for particular tables. Sodefinitely a blog post to check out. Oh, and I should say this is from the Signal Versus Noiseblog.
[00:01:51] The next article is killed index tuples and this is from the Cybertechn Postgresql.comblog. So basically this is a concept and it's actually a feature that was introduced in 8.1, but thisis a recent blog post that's describing it and it talks about the concept of different sessions beingable to see what tuples are active in the database. And you basically use x, min and max inorder to determine that. And once a tuple has been classified as dead, so for example, that tuplehas been deleted and no other session can see it, it can get vacuumed up. And this sectionbasically talks about the table heap in that respect. However, they're also talking about indexand they say a visibility information is not stored in the index. So where they use x, min and maxon the heap for each row or the data table for each row, that visibility information is not stored inthe index. And as a consequence, an index scan has to inspect the heap tuple to determine if itcan see an entry. And of course this access can result in random I O. And they mentioned thatthis architecture can be more expensive than other database systems, but they've done a fewdifferent things to minimize that. Like in 8.1, they introduced an bitmap index scan. So it'sbasically scanning a list of heap blocks to try to avoid some of that random I O. And they alsointroduced an index only scan as long as all the columns required are in the index, and thevisibility map shows that all the tuples in the table are visible. Now specifically here they'retalking about killed index tuples. Now they're saying there's an example where suddenly you canget very fast index responses, or at least variability in the response time of certain indexedqueries, and they believe some of that is due to these killed index tuples. So in their example,they created a table called Whole. They inserted about a million rows into it, did an analyze, andthen they created a hole in the table, essentially deleting all the IDs in this range, and then theydid another analyze. Now they ran the query and they got a hit in the shared buffers. So it wasessentially cached information and executed in 222 milliseconds. But when they ran it again, thatsame query again, they get a hit, the buffers that's less, but it completes much faster. So 14milliseconds. And what they say is, quote, what happened is that the first execution had to visitall the table blocks essentially in the heap, and killed all the index tuples that pointed to the deadtuples in the heap. The second execution didn't have to do that, which is the reason why it wasten times faster. So this is definitely an interesting behavior and might explain why you're gettingsome variable queries if you have cases where a lot of tuples are marked as dead and thenremoved and you get some better index performance. So if this is interesting to you, definitely ablog post to check out.
[00:05:02] The next --article is Parallel query in PostgreSQL how not to use it or misuse it. And this is actually apresentation done by Rafia Sabi and I believe Amit Kapila, they're from Enterprise DB, and Ihave the link to the presentation here and it talks about parallel query and PostgreSQL, and itgoes over how it works and recommendations for how to essentially configure it, whether you'reon 9.6 or ten or eleven. So if you're interested in doing some parallel query work, definitelypresentation to check out.
[00:05:40] So the next post is going back to the YouTube channel. Postgres opened SiliconValley in 2018, where they presented a bunch of different YouTube channels. I've watched abouteight more or six to eight more presentations, and I would say these are kind of the toppresentations that I would suggest you review if you haven't reviewed any already. So I wouldsay my top one for developers and DBAs is the ANSI Schmancy presentation. This is by sayrope sarkuni of Jackdb. Basically he's making the argument that a lot of application frameworkstry to treat the database as an agnostic data store, whereas he's saying that you can reallyexpand your capabilities and performance and do a lot of interesting things and make things alot easier if you actually use the power that's within the database. And of course, they're talkingabout PostgreSQL. So I did mention this presentation, not this exact one, but a previous one hegave at another conference. But I believe there's some additional information in this one. And ifyou haven't gotten a chance to review it yet, I would definitely suggest watching this onebecause there's a lot of good stuff in it.
[00:06:57] The next recommendation of the presentation again at the Postgres Open conferenceis building a real time analytics API at scale. And this is done by someone from Algolia, which isa search engine provider essentially. And he goes through and talks about how they'vedeveloped a real time analytics API basically using roll ups. Now they're using Cytus. So hisexample is using the Cytus, which is an extension that does sharding of your PostgreSQLinstances. And they have presented a number of blog posts and presentations from theCitusdata.com blog where they describe how they build roll ups and essentially this is animplementation of that. And they talk about other extensions that they use to be able to do reallyfast approximate distinct counts using Hyperlog log and also doing top end calculations. So ifyou're wanting to do real time analytics, this is definitely a presentation to check out.
[00:08:00] The next suggestion is a wall for DBAs, almost everything you want to know. Now thisisn't necessarily for developers, but it kind of gets into the nitty gritty of wall and how to use it,how to manage it. So I would say this is a suggested presentation if you wanted to check it out.Another one of the ones that I would say is pretty good. So if you want to know a little bit more indepth about the wall, definitely a presentation to check out.
[00:08:26] The next piece of content is actually another YouTube channel. And this is the PGDay Paris YouTube channel where they've posted about ten videos from PG Day Paris. Nowthese were posted months ago, but I actually just found them on the feed. So if you haven'tchecked these out, another source of content, the last piece of content is actually anotherYouTube channel. This is CEO saying there's a lot of video content this week is Pgcon and theyhave 30 to 40 different videos that have been posted about PostgreSQL. Now, some of thepresentations don't look so great. Like, for example, I was looking at this one and it never showsthe slides. You get the audio, but it's probably going to be really hard to watch this video to get alot of content out of it. Others are showing the slides and you get the audio with it. So some ofthese presentations may be hit or miss, but again, a lot of video content from presentations thisweek that does it for this episode of Scaling Postgres, you could get links to all the contentmentioned in the show. Notes be sure to head over to Scalingpostgres.com, where you can signup to receive weekly notifications of each episode. Or you can subscribe via YouTube or itunes.Thanks. --