
Real Time Analytics, Index Decrease Performance, work_mem | Scaling Postgres 17


In this episode of Scaling Postgres, we review articles covering real-time analytics with Postgres, how indexes can decrease performance, and how to configure work_mem.

Content Discussed

YouTube Video

Podcast Audio

Transcript



[00:00:00] In this episode of Scaling Postgres, we talk about real-time analytics, indexes decreasing performance, configuring work_mem, and window functions. I'm Creston Jameson, and this is Scaling Postgres episode 17.

[00:00:23] All right, our first article is Scalable incremental data aggregation on Postgres and Citus, and this is from the Citusdata.com blog. Now, that title is quite lengthy, but mostly what they're talking about is a new technique they've come up with for doing real-time analytics. The scenario is that you're loading events into a raw data table in batches, and then periodically you want to aggregate those new events into a roll-up table, creating a summary, for example, and select from that roll-up table to display a real-time analytics dashboard. So this is a new technique that they came up with, and the "new approach" hyperlink here takes you to the actual code they present in the post.

[00:01:16] And they tested it using their Citus data solution, which is basically an extension of Postgres that lets you set up nodes and direct queries to those specific nodes. So some of what they tested is not stock Postgres. They did test that, and the techniques they discuss here can be used with standard Postgres; however, they saw much better performance using their Citus extension for Postgres. But still, the rationale and the techniques could be beneficial to you as a developer using Postgres. The first thing they set up is a rollups table to manage what roll-up tables exist. Basically, they want to retain the sequence number for each row that has already been aggregated. So if you have a table that holds 1-minute roll-ups of a particular set of data, or 10-minute or hourly roll-ups, you would use this rollups table to store that table name and the last sequence value the aggregates were calculated for. Now, this particular technique works as of Postgres 10 because they are able to use the pg_sequence_last_value function.
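
As a rough sketch of what that bookkeeping might look like in plain Postgres (the table and column names here follow the description in the episode but are my own approximation, not the post's exact code):

-- sketch of the bookkeeping table; column names are approximate
CREATE TABLE rollups (
    name text PRIMARY KEY,                 -- e.g. 'page_views_1min'
    event_id_sequence_name text NOT NULL,  -- sequence backing the raw table's id column
    last_aggregated_id bigint DEFAULT 0    -- high-water mark of rows already rolled up
);

-- pg_sequence_last_value() (new in Postgres 10) reports the most recently
-- issued value of a sequence, which gives the roll-up job the upper bound
-- of the rows it may need to aggregate.
SELECT pg_sequence_last_value('page_views_id_seq');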

[00:02:33] So here they show the code they're using. They created a new function called incremental_rollup_window that actually drives the roll-up process. First it selects from the rollups table the roll-up name it's interested in, and then they do a little trick, as they call it, of taking a lock on the raw table to ensure that all writes have been committed before they do the aggregation. Here is where they're using pg_sequence_last_value, and they update the rollups table. The output of this function is the last aggregated ID, or the window end, as well as updating the rollups table, which is essentially the master list of what the last aggregated ID will be.

Now, the raw data table they're using is called page_views, because they're tracking page visits like perhaps a CDN would be doing, and they found that a BRIN index gave the best performance for looking up a range of sequence IDs. They have an additional step here that's possible with Citus, which you wouldn't do with standard Postgres of course: they actually create a table distributed across different shards, because in this scenario they're using a four-shard Citus data solution, or I'm sorry, that may not be the number of shards, but at least four instances of Citus. Then they created a table called page_views_1min that stores the roll-ups of the counts of views for every minute. If you're having thousands of views per minute, this allows the table to be queried much faster because it will contain far fewer rows. They insert this 1-minute roll-up table into the master list of roll-up tables that they have, so they just have the one 1-minute roll-up table. Then they create a second function that actually does the aggregation work. Relying upon the previous function we looked at, incremental_rollup_window, it grabs the next window they want to calculate, finds the start and the end for it, and inserts the new aggregates into the page_views_1min table, doing ON CONFLICT DO UPDATE instead. So basically it inserts the value for a new minute or updates the existing value. With this function in place, you can simply call SELECT * FROM do_page_view_aggregation() in order to run the aggregation and update the 1-minute view table. And they say here that by running the do_page_view_aggregation function frequently, you can keep the roll-up table up to date within seconds of the raw events table.

They also give an example of what the performance is like querying the raw data versus the aggregated data. They go from 869 milliseconds when querying the raw page_views table to five milliseconds for the same counts when querying the page_views_1min roll-up table, so over 100 times faster when you're aggregating data in this fashion. And then of course they compare performance between Citus and plain Postgres, and they saw dramatically better performance with Citus due to its ability to run INSERT ... SELECT commands across multiple server instances, because it can handle not only the insert load from the raw data but also the aggregation. Now of course they are invested in Citus, so some of these statistics may be a bit biased. But they did use a larger single instance here, and its performance didn't keep anywhere in line with a series of four smaller Citus instances.
So whenever I see this, I wonder if there are different techniques that could be used to maybe not achieve Citus's level of performance, but get close to it; for example, would partitioning the table data help in any way when using these techniques? But it's definitely an interesting blog post to check out for understanding the techniques and how they're doing it. And even though you may not want to use this exact technique, maybe you can take some of the ideas they use to figure out ways to scale your database more efficiently. So definitely a blog post to check out.
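
To make the moving parts more concrete, here is a simplified sketch of the pattern on a single plain Postgres node (no Citus). It builds on the rollups table sketched above; the names page_views, page_views_1min, incremental_rollup_window, and do_page_view_aggregation follow the episode's description, but the bodies are my own approximation rather than the post's verbatim code:

-- approximation of the incremental roll-up pattern, not the post's exact code

-- raw events table; a BRIN index on the sequence-generated id keeps
-- "rows since the last roll-up" range scans cheap
CREATE TABLE page_views (
    id bigserial,
    page_id int NOT NULL,
    viewed_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX page_views_id_brin ON page_views USING brin (id);

-- per-minute roll-up table; the primary key lets ON CONFLICT update in place
CREATE TABLE page_views_1min (
    page_id int NOT NULL,
    minute timestamptz NOT NULL,
    view_count bigint NOT NULL,
    PRIMARY KEY (page_id, minute)
);

-- register the roll-up in the bookkeeping table sketched earlier
INSERT INTO rollups (name, event_id_sequence_name)
VALUES ('page_views_1min', 'page_views_id_seq');

-- returns the [start, end] id window that still needs aggregating
-- and advances the high-water mark in the rollups table
CREATE OR REPLACE FUNCTION incremental_rollup_window(
    rollup_name text,
    OUT window_start bigint,
    OUT window_end bigint)
LANGUAGE plpgsql AS $$
DECLARE
    seq_name text;
BEGIN
    SELECT event_id_sequence_name, last_aggregated_id + 1
    INTO seq_name, window_start
    FROM rollups
    WHERE name = rollup_name
    FOR UPDATE;

    -- the "trick": lock the raw table so in-flight inserts with lower ids
    -- commit before we read the sequence (the lock is held until the
    -- surrounding transaction commits)
    LOCK TABLE page_views IN SHARE ROW EXCLUSIVE MODE;

    SELECT coalesce(pg_sequence_last_value(seq_name::regclass), 0) INTO window_end;

    UPDATE rollups SET last_aggregated_id = window_end WHERE name = rollup_name;
END;
$$;

-- aggregate the new window into the 1-minute roll-up table
CREATE OR REPLACE FUNCTION do_page_view_aggregation()
RETURNS void
LANGUAGE plpgsql AS $$
DECLARE
    s bigint;
    e bigint;
BEGIN
    SELECT window_start, window_end INTO s, e
    FROM incremental_rollup_window('page_views_1min');

    INSERT INTO page_views_1min (page_id, minute, view_count)
    SELECT page_id, date_trunc('minute', viewed_at), count(*)
    FROM page_views
    WHERE id BETWEEN s AND e
    GROUP BY 1, 2
    ON CONFLICT (page_id, minute)
    DO UPDATE SET view_count = page_views_1min.view_count + excluded.view_count;
END;
$$;

-- called frequently (e.g. from cron) to keep the roll-up within seconds of the raw data
SELECT * FROM do_page_view_aggregation();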

[00:06:54] The next post is Adding an index can decrease select performance, and this is from the Cybertec blog. They talk about how, when you add a new index, the cost you're usually thinking about is that your inserts and updates are going to be a little slower because you have to maintain that index, and generally when you add an index, you expect selects to go faster. Now of course, if the index covers the data you're selecting, it will be faster, but there are some cases where it can actually reduce the performance of some queries. He has an example query here on a three-column table where they select on two of those columns, one of which has an index on it, ordering by the sort column and then applying a limit. Things are fine with this query: it finishes in four milliseconds, given how much data he has in it, which looks like maybe a million rows. But when they add an additional index on the sort column, presumably to make sorting faster, the planner chooses a different plan and tries using this new index to find just the 20 values that match the criteria, which ends up reducing performance significantly, from four milliseconds up to 28 milliseconds. I've seen this on occasion when you're ordering and there's a limit, because sometimes if the limit is small the planner says, well, I can just start walking through this index, or even do a sequential scan, and I'll eventually run into the values of interest based upon the statistics, whereas a lot of times just using a direct index is much faster. And you can look for this too: if you, say, eliminate a limit clause from a query, sometimes it may run dramatically faster. Now, he mentions two ways to actually speed this type of query up. One is to drop the index that is misleading the planner. The second is to rewrite the query so that Postgres cannot use the offending index, and he uses some interesting tricks here: one is using an expression as the sort key, because the index can't be used for an expression, and the other is using a subquery with an OFFSET of zero. I also found it interesting that there were some other solutions in the comments that may be more effective: one proposes a partial index, or I'm sorry, a conditional index, and the other a multicolumn index. These scenarios may also give the most optimal performance. But if you run into this situation, I would definitely consider removing a limit clause, and if you need to, an order by clause, just to see what impact those have, and maybe try some of the different techniques the blog poster mentioned as well as those in the comments. And of course, using an explain plan, you should be able to find the most optimal way to write the query and/or structure your indexes.
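
As a rough illustration of those two query rewrites (the table and column names here are placeholders of my own, not the post's exact example):

-- suppose the filter column is indexed and a second index exists on "sort";
-- adding + 0 turns the sort key into an expression, so the planner cannot
-- walk the index on "sort" and goes back to the selective filter index
SELECT *
FROM items
WHERE category_id = 42
ORDER BY sort + 0
LIMIT 20;

-- alternatively, wrap the filter in a subquery with OFFSET 0, which acts as
-- an optimization fence so the LIMIT cannot be satisfied by scanning the
-- index on "sort"
SELECT *
FROM (SELECT * FROM items WHERE category_id = 42 OFFSET 0) AS sub
ORDER BY sort
LIMIT 20;

-- and the multicolumn index suggested in the comments, which can serve the
-- filter and the ordering at the same time
CREATE INDEX ON items (category_id, sort);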

[00:09:55] The next post is Materialized views with PostgreSQL for beginners. This is a little bit of an older post, but I saw it come through the PostgreSQL ecosystem this week, and it's talking about materialized views for beginners, written from a developer's perspective. They had a situation where real-time traffic was going into an object data store and a daily batch of data was being sent to Postgres, but the users were unhappy because they apparently wanted more frequent data. So basically they wanted to build an aggregation table. At first they did not use views at all; they just created an aggregate table and refreshed it by deleting rows that had been updated and inserting new versions every two hours. They describe this as a bad idea because the deletes and inserts were locking the aggregation table and blocking reads, so basically they were running into locking issues. So then they looked into materialized views. Just to remind you, a standard view is a virtual table: you specify precisely how you want it to look, but it goes to the raw tables to actually generate the data for you. A materialized view is no longer virtual; it's like a real table, so you are storing the data that comprises the view, and you can add indexes to it and treat it like a normal table in that respect. And indexes on a materialized view can give you a pretty large performance boost. So they started using this technique, but then they ran into another problem: because a materialized view is not dynamic, you have to tell it to refresh on a periodic basis, and the refresh was locking out users during the rebuild. What they landed on is refreshing the materialized view concurrently. Much like creating an index concurrently, it takes longer, but it doesn't impact general read, update, or insert access to the table. And the key thing she also mentions here is that to use refresh concurrently, you must define at least one unique index on your materialized view. So this post gives a good overview and shows you some of the pitfalls they ran into when they were setting up their materialized views and why they chose that particular solution. So it's definitely a blog post to check out.
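
A minimal sketch of that pattern (the view, table, and column names here are illustrative, not taken from the post):

-- a materialized view storing a daily aggregate of a hypothetical raw table
CREATE MATERIALIZED VIEW daily_traffic AS
SELECT date_trunc('day', viewed_at) AS day, count(*) AS views
FROM page_views
GROUP BY 1;

-- REFRESH ... CONCURRENTLY requires at least one unique index on the view
CREATE UNIQUE INDEX daily_traffic_day ON daily_traffic (day);

-- a plain REFRESH locks out readers while it rebuilds; CONCURRENTLY takes
-- longer but keeps the view readable during the refresh
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_traffic;
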
The next post is Configuring memory for Postgres, and this is again from the Citusdata.com blog. This particular post is focused on work_mem. Now, work_mem is a per-connection setting, so it's not a global setting for the whole database server; it's used per connection, and actually, depending upon the operations, multiple allocations of work_mem can be used for particular operations within a given session. They start off by saying that by default work_mem is set to four megabytes in Postgres, and that each Postgres activity, such as joins and certain sorts, can consume four megabytes before it starts spilling to disk. You can find out if it's spilling to disk by searching for "temporary file" in your PostgreSQL logs, if you have log_temp_files enabled. They go on to say that perhaps a sane default is something like 64 megabits, excuse me, 64 megabytes for work_mem, if you're looking for a one-size-fits-all answer. That seems pretty high from the experience that I have; they mention something a little bit lower, and increasing it gradually, further down in the post, which seems like a more sensible approach. So if you're going to consult this post, be sure to read it all the way through and not just look at the first suggestion. They then go into the point I mentioned, that the setting essentially applies per connection, and walk through calculations based on how much memory your server has versus how much is used for different operations, particularly parallel operations; parallel queries can have a big impact on how much work_mem each connection uses. And down here in the section "more work_mem, more problems" is guidance I agree with: start small, with say 16 megabytes, and gradually increase work_mem when you start seeing temporary-file logging indications. They go on to mention that if you start seeing out-of-memory errors in your logs, or even the out-of-memory killer in Linux starts being kicked off, the thing you want to do is actually reduce your work_mem, because you're using too much of the memory available to Postgres. And just to add a few guidelines on how I think about it: if you have a highly transactional database where the queries are relatively simple, but you've got a lot of them and you want a high number of connections, then you can keep your work_mem pretty low, because you're not going to need it for a lot of joins or sorting operations. Whereas if you have a more analytical load with very complex sorts, very complex joins, or a lot of them, pulling back a lot of data, and maybe not quite as many connections, then you can go ahead and bump up your work_mem a lot higher, because with fewer connections you can afford a larger work_mem to handle those types of loads without having to spill to disk. But generally this is a really good post with really good guidance for work_mem, so definitely check it out if you do Postgres database tuning.
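
For reference, here is how a few of the settings mentioned might be configured; the values are illustrative examples of mine, not recommendations from the post:

-- log queries that spill sorts or hashes to disk, so you can see when
-- work_mem is too small (log_temp_files is in kilobytes; 0 logs every temp file)
ALTER SYSTEM SET log_temp_files = 0;

-- raise work_mem cluster-wide; it applies per sort/hash operation per
-- connection, so total memory use can be many multiples of this value,
-- especially under parallel query
ALTER SYSTEM SET work_mem = '16MB';
SELECT pg_reload_conf();

-- or raise it only for a single analytic session or one heavy query
SET work_mem = '256MB';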

[00:15:44] The next post is Using window functions for time series Internet of Things analytics in Postgres-BDR. Now, I found this post interesting, not necessarily for the Postgres-BDR aspects, but because it gives a very good introductory overview of using window functions. In this example, they're collecting IoT temperature sensor data and using window functions to show how you can query and rank it. As you see in the right column, it shows the rank based upon the reading, so it has a very basic window function doing a RANK ordered by the temperature reading, how low or how high it is. Then they talk about what impact DENSE_RANK has on the data, and they even use the LAG function to show the difference from the previous value. So I know this post was talking about BDR, but generally it's a great introduction to window functions and how they work. So if you're interested in that sort of thing, definitely a good blog post to check out. Oh, and I should say this is from the 2ndQuadrant PostgreSQL blog.
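
As a quick illustration of those three window functions (the sensor table and column names here are hypothetical, not taken from the post):

SELECT sensor_id,
       reading_time,
       temperature,
       -- rank readings from hottest to coldest; ties share a rank and leave gaps
       rank()       OVER (ORDER BY temperature DESC) AS temp_rank,
       -- dense_rank also shares ranks on ties but leaves no gaps
       dense_rank() OVER (ORDER BY temperature DESC) AS temp_dense_rank,
       -- lag compares each reading with the previous one from the same sensor
       temperature - lag(temperature) OVER (PARTITION BY sensor_id
                                            ORDER BY reading_time) AS change_from_previous
FROM sensor_readings;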

[00:16:56] The next post is How Redshift differs from PostgreSQL, and this is from the Stitchdata.com blog. If you're not familiar, Redshift is basically a columnar analytical database provided by AWS, Amazon Redshift, and as they say here, it's the most popular cloud data warehouse available. And if you didn't know it, Redshift is actually a fork of PostgreSQL, so it has a lot of similarities in how you use it, but there are some key differences. This post is interesting because it goes over a little bit of the history in terms of the companies involved, so I found it a pretty interesting read, and then it goes into some of the differences in SQL that you're going to run into, for creating tables, inserting data, and vacuums. It even covers a lot of the unsupported features and data types, because Redshift was forked from Postgres a while ago, from my understanding, so there are a lot of features it doesn't have, but it has been built up to be optimal for data warehouse purposes. So definitely an interesting blog post to check out.

[00:18:10] The last post is Deploying PostgreSQL on a Docker container. Now, at this stage, when we don't have native sharding available in Postgres, I haven't really seen a lot of people putting Postgres in a Docker container, perhaps because storage in a Docker container is essentially ephemeral, so you have to be very careful with how you're using it. I haven't really done it, or don't really have that much experience with it, but if you're thinking about it or interested in it, here's an introductory blog post on how to deploy PostgreSQL on a Docker container, and this is from the Severalnines.com blog. But perhaps in the future, when PostgreSQL has more native sharding, which implies that you're going to have to manage multiple database instances to satisfy query requests, running those in Docker containers may make the most sense. But definitely a blog post to check out.

[00:19:16] That does it for this episode of Scaling Postgres. You can get links to all the content presented in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
