
Optimizing For Analytics | Scaling Postgres 334


In this episode of Scaling Postgres, we discuss how to optimize your database for analytics, how to speed up counts, improvements to TimescaleDB and why you should stop using serial.

Content Discussed

YouTube Video

Podcast Audio

Transcript

I'm going to try something different this week. Normally I put the most important or best blog post first and then the other blog posts down the line from there. This week I'm going with a theme instead: there was a set of three or four, maybe even five, blog posts that all covered optimizing Postgres for analytical performance, so that will be the main theme of this show. But of course we'll also cover other posts after that theme, so it's not necessarily the case that the best or most important blog post is first this week. But I hope you, your friends, family and coworkers continue to do well.

 

Our first piece of content is “Optimizing your Database for Analytics”, and this is from karenjex.blogspot.com. She gave a presentation of the same name at PyCon Italia 2024 and Euroco 2024, and while there's a video of the presentation, she has converted it into a blog post. So she's talking about ways you can optimize your database for running analytics. She says most applications use their database for transactional processing, or as an online transaction processing (OLTP) database: generally you have a lot of concurrent users, short queries, and multiple selects and updates going on at the same time. In contrast, when you want to use your database for analytical purposes, it's generally considered OLAP, or online analytical processing. That means a lot more reporting, decision support or even business intelligence. OLAP is characterized by more complex queries over multiple tables with many joins, the data set covered by these reports is generally much larger, and resource utilization on the server is high in order to perform all of those complex joins and queries. Historically, you had a dedicated OLTP database, or maybe more than one, that would send its data over to an OLAP database for the analytical processing. However, there is a newish term, hybrid transactional/analytical processing or HTAP, for trying to do everything in the same database. But of course there's a problem with that, in that you kind of shortchange each workload: transactional processing won't be optimal and neither will analytical processing, so you have to make some compromises. Your analytical queries maybe aren't as efficient, you may slow down your application's transactional processing, and things of that nature. She goes through a demo using pgbench to generate the transactional workload, writes an analytical query, and runs them together to show the effects. Now, in terms of what you can do, there are different ways you can adjust configuration parameters to accommodate both of these use cases. She talks briefly about connections, but I think the more important one in terms of segmentation is work_mem. Usually you want a smaller work_mem for the transactional processing, because those queries are generally super fast and small, so you don't need a lot of memory for joins or sorts, whereas you usually need a lot more work_mem for analytical processing. So maybe you set these configuration parameters at the user level: whichever user is doing transactional processing needs less, whereas the user doing analytical processing needs more work_mem. She also discussed setting statement timeouts differently, and you could set these at the user level too, maybe having shorter statement timeouts for transactional processing and longer timeouts for analytical processing. She talked about the importance of index strategy, in that with transactional processing you generally want fewer indexes so you're not impacting inserts, updates and deletes as much, whereas you can usually afford more indexes on an analytical database. Some other ways to speed up your analytical queries are pre-calculation, pre-sorting or pre-aggregation. This could be done using stored generated columns, which are a feature of recent versions of Postgres.
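To make the per-user configuration idea concrete, here is a minimal sketch of what those settings could look like, assuming hypothetical roles called app_user (transactional) and reporting_user (analytical); the specific values are illustrative, not recommendations from the post:

    -- Transactional role: small work_mem, short statement timeout
    ALTER ROLE app_user SET work_mem = '8MB';
    ALTER ROLE app_user SET statement_timeout = '5s';

    -- Analytical role: more memory for sorts and joins, longer timeout
    ALTER ROLE reporting_user SET work_mem = '256MB';
    ALTER ROLE reporting_user SET statement_timeout = '30min';

New sessions for each role pick these settings up automatically; existing sessions need to reconnect (or SET the values themselves) to see the change.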
Beyond generated columns, you could also use materialized views, or even build your own summary or aggregate tables that answer these analytical queries much faster. But of course you can also go back to having a dedicated database for analytical processing. The simplest way to do that, if you have a primary database and multiple replicas, is to set one replica aside as your analytical database. The disadvantage is that you have to deal with long-running queries getting canceled due to replication conflicts, you can't add different indexes, and you can't change the layout of the schema: basically you need to make all the changes on the primary and they just get replicated to the replica. Another solution is logical replication, so you could have a logical replica of the primary database that is kept up to date. The advantage of this is that you can replicate only certain tables if you want to, and you don't even have to run the logical replication against the primary: as of version 16, you can create a logical replica from a standby or replica server as well. So those are some different ways of dealing with analytical queries in your Postgres database environment. Check this out if you want to learn more.
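As a rough sketch of the logical replication approach, assuming hypothetical orders and customers tables and made-up connection details (this is not from the post, just the general shape of the commands):

    -- On the primary (publisher): publish only the tables needed for reporting
    CREATE PUBLICATION analytics_pub FOR TABLE orders, customers;

    -- On the analytics database (subscriber): the tables must already exist here,
    -- and since this is a separate read-write database, you can add extra indexes
    -- tuned for the analytical queries
    CREATE SUBSCRIPTION analytics_sub
        CONNECTION 'host=primary.example.com dbname=app user=replicator'
        PUBLICATION analytics_pub;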


Next piece of content: there was another episode of Postgres FM last week, and this one was on “Slow Count”. Nikolai and Michael discussed why counting can be slow and some things you can do about it. The main reason counting is generally slow in Postgres is, first of all, that it uses row storage, meaning that if you want to count how many users in a particular table joined last year, you have to go through each one of those rows, which may be spread across many pages on disk, and that just takes a long time to read and count every single row. Another disadvantage is the multi-version concurrency control that Postgres uses, meaning that visibility information is stored in the rows themselves, so it definitely has to look at every row to see which rows are valid for a given query. Other database systems may have it easier if they have a redo log, for example. So those are some reasons why counting is slow in Postgres. In terms of speeding it up, one way is of course column storage, so that a particular column's values are stored together on one or a few pages, as opposed to visiting many pages for many rows. There are some extensions that do that, and some productized extensions like TimescaleDB have that capability built in. Another thing that can help is having an index on what you're counting, so that maybe you can do an index-only scan and don't have to go to the heap. Visibility information still has to be checked, but that should make things a little faster. Further, in recent versions of Postgres you can do these scans in parallel, which helps speed up how long it takes to count something. Another possibility is using estimates: maybe you run an explain plan for the query and see how many rows it's projected to return, and use that as an estimate. Of course, you really need up-to-date statistics for this to be anywhere near accurate, and it doesn't really work that well for joins and things of that nature. Another option is the HyperLogLog extension, which can give pretty good estimates of distinct counts. And then they rounded out the episode talking about pre-aggregation, some of which was mentioned in the previous blog post: maintaining a manual summary of counts. Of course you will need some method to keep that up to date, so you're basically taking a hit at the point where you insert, update or delete something, in exchange for a very fast select query to get the counts. And if you're storing a count, you need to be very careful about how often it gets updated or changed, because you could run into locking issues. Nikolai definitely recommends, as do I, that these kinds of updates to counts be done asynchronously rather than synchronously. So maybe triggers aren't the best option; put these changes into some sort of job queue to be processed asynchronously. If you're interested in learning more, definitely listen to the episode here, or check out the YouTube video down here.
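As a quick sketch of the estimate idea, assuming a hypothetical users table whose statistics are reasonably fresh from autovacuum or ANALYZE:

    -- Planner estimate from the table statistics, essentially free
    -- compared to an exact count
    SELECT reltuples::bigint AS estimated_rows
    FROM pg_class
    WHERE relname = 'users';

    -- Exact count, which can use an index-only scan if there's an index
    -- on created_at and the visibility map is reasonably up to date
    SELECT count(*) FROM users WHERE created_at >= '2023-01-01';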


Next piece of content: “Making Postgres Faster: New Features for 7x Faster Queries and 500x Faster Updates”. This is from timescale.com, so I think this is a release week for Timescale, because they mention three major new features across these four blog posts. This first post is just a summary, so we'll look at the individual posts. The first feature and major blog post is Hyperstore, a hybrid row-columnar storage engine for real-time analytics. I think this capability already existed previously, but with the improvements they've made they've actually started calling it Hyperstore. Basically, they store recent rows in row storage and historical data in column storage, which lets them take advantage of their compression to get compression ratios of up to 95% or so for the columnar storage. They go on to explain their row-based disk layout, their column-based disk layout, and basically the whole Hyperstore in this diagram. They have their row storage, and at some point in time it gets automatically converted into the column store format: at the point of ingestion it's all row storage, over some period of time it becomes column storage, and eventually it gets purged, because this is time series data, for the most part. They talk about a number of the features available in it, including micro-partitioning, SIMD vectorization to take advantage of CPU capabilities, skip indexes, as well as compression. Now, the next blog post is “Boost Postgres Performance by 7x With Chunk-Skipping Indexes”. Postgres has the concept of partitioning; Timescale calls its micro-partitions chunks. Chunk skipping is a way to skip over particular partitions when determining which chunks or partitions need to be read to satisfy a given query. With the new version of Timescale, they're enabling developers to use metadata to dynamically prune and exclude partitions. Normally you have to use the partition key to do partition exclusion, excluding the partitions that aren't needed to answer a given query. They're adding a metadata layer on top of that, so you no longer have to rely only on the partition key to do partition exclusion. And they say they've seen seven times faster performance and an 87% smaller footprint due to the compression. Now, how this works takes advantage of correlation, so presumably this new metadata feature works best when there's correlation between the partition key and the other data you want to filter by that's not part of the partition key. In their example they enabled chunk skipping for a device ID, and basically Timescale tracks the minimum and maximum values for that column in each chunk, so you can think of it a little bit like a block range index. Timescale will know that for a particular chunk the smallest device ID is 100 and the largest is 200, so if you're looking for a device ID of 300, it knows that value does not exist in that chunk and it can exclude the chunk. And they say you can add as many chunk-skipping indexes on additional columns as you need for your use case. So this is a pretty interesting feature, and they go into more depth with an order ID example where chunks would be excluded. So I think that's pretty cool.
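This isn't Timescale's API, but as a plain-Postgres analogy to that block range index comparison, a BRIN index keeps similar min/max summaries per range of table blocks so that ranges which can't contain the value are skipped (table and column names here are hypothetical):

    -- BRIN stores min/max per range of table blocks rather than per row
    CREATE INDEX readings_device_id_brin ON readings USING brin (device_id);

    -- Block ranges whose min/max summary excludes device_id = 300 can be skipped
    SELECT count(*) FROM readings WHERE device_id = 300;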


And then the third blog post is “Bridging the Gap Between Compressed and Uncompressed Data in Postgres: Introducing Compression Tuple Filtering”. Timescale has had compression, including on-the-fly compression, for a while, and even the ability to insert, update and delete data in columnar compressed data. A lot of columnar solutions don't allow you to update that data at all; it's basically written once. But their solution does let you insert, update or delete into already compressed columns: basically they decompress what needs to be changed, then recompress it and store it. Because of how they've changed things, the most recent version brings significant improvements, up to 500 times faster updates and deletes and ten times faster upserts. So updating or deleting data when using columnar compression should be a lot faster. Now, the specifics of exactly how it works were a little vague to me, but my understanding is that previously they always decompressed the entire compression batch whenever they needed to touch it, whereas with the enhancement in the most recent version, during decompression it can better determine whether it needs to decompress all the batches, or just one, or ten, or however many, and that made it a lot faster. To me this seems like a pretty significant enhancement. Normally I don't include that many product-specific announcements on Scaling Postgres, I usually stick to open source Postgres, but in cases where I see features that are super cool, that gee, it would be nice if they could eventually get into Postgres, I do like to showcase them so you can see what's happening in the greater Postgres ecosystem. But if you want to learn more about this, feel free to check out these blog posts.


Next piece of content: “Stop using SERIAL in Postgres”. This is from naiyerasif.com, and it's a pretty comprehensive blog post about why you shouldn't be using serial and should be using identity columns instead. Both of these are ways to create auto-incrementing integer columns in Postgres: serial is the non-standard mechanism that Postgres developed very early on, while identity columns are the SQL standard. The first issue with serial is permissions. If you grant a particular user access to a table that was created with a serial column and that user tries to insert a value into it, they're going to get a permission denied error for the underlying sequence. Why? Because you need to separately grant usage on the sequence to that user as well, whereas an identity column is more a part of the table, and granting rights on the table is sufficient. The next issue is that serial lacks integrity guarantees. For example, you can insert a manual value into a serial column and Postgres will accept it, but if you then try to insert a default value at some point, you're going to get a constraint violation because the value already exists. In contrast, with an identity column, when you try to insert a manual value it throws an error right away, saying that inserting a non-default value into the column isn't possible, and you can see the hint that you can use OVERRIDING SYSTEM VALUE if you need to override that. A related issue is that you can actually drop the sequence connected to a serial column, which will definitely cause problems, whereas if you try this with an identity column, you'll get an error. The next issue is that serial is not a true type. When you create the table you specify serial, but the table definition you see afterwards has no reference to serial at all: if you do a \d on that table, nothing will say serial, it just shows the default-from-sequence definition, whereas with an identity column the table appears pretty much exactly as you created it. Next issue: serial has awkward ergonomics, meaning you have to reach out to the sequence if you want to change the numbering, whereas with an identity column you can just alter the column and tell it where to restart. And finally, like I mentioned before, serial is not part of the SQL standard, whereas identity columns are. Now, I definitely agree with everything in this blog post, and ideally you should use identity columns rather than serial columns. However, my issue is that a lot of the application frameworks I still use generate serial columns through their ORMs. So I think what we have to do is reach out to the ORM maintainers and say, hey, you should really switch from serial to identity columns. But check out this blog post if you want to learn more.
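Here's a minimal sketch of the identity-column behavior described above, using a hypothetical items table:

    -- SQL-standard identity column instead of SERIAL
    CREATE TABLE items (
        id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name text NOT NULL
    );

    -- Fails: manual values are rejected unless you explicitly override
    INSERT INTO items (id, name) VALUES (100, 'widget');

    -- Works: explicit override of the system-generated value
    INSERT INTO items (id, name) OVERRIDING SYSTEM VALUE VALUES (100, 'widget');

    -- Restart the numbering by altering the column, no separate sequence needed
    ALTER TABLE items ALTER COLUMN id RESTART WITH 1000;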


Next piece of content: “HYBRID SEARCH WITH POSTGRESQL AND PGVECTOR”. This is from jkatz05.com, and it's similar to a post covered a number of weeks ago on Scaling Postgres that combined similarity search using pgvector with full text search in Postgres. Hybrid search means searching with both at the same time and combining the results, hopefully giving you better search results than relying on either one alone, because one looks for the actual words used (full text search) while the other looks for semantic meaning. The method used to combine the results at the end is reciprocal rank fusion, which the previous blog post mentioned as well, and he goes into that and actually creates a function that uses this technique to combine the results. He builds this hybrid search example in Python, using a number of Python libraries, creates a table, and creates the function for reciprocal rank fusion. There's the Python code for generating the embeddings. He then creates a GIN index on a tsvector of the description column, and an HNSW index on the embedding column for the semantic search. He shows what the pgvector (semantic search) query looks like, then the full text search query and how he does the ranking, and then the query that puts everything together: the semantic search query, the full text search query, a UNION ALL between them, grouping the results, and then the reciprocal rank fusion score that it orders by, so you hopefully get search results that are better than either of these tools individually. If you want to learn more about this, you can definitely check out this blog post.
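The post builds this in Python, but as a rough SQL-only sketch of the shape of the final combined query, assume a hypothetical products table with a description column and a pgvector embedding column, where $1 is the query embedding and $2 is the search text (this is my simplification, not the author's exact code):

    WITH semantic AS (
        SELECT id, RANK() OVER (ORDER BY embedding <=> $1) AS rnk
        FROM products
        ORDER BY embedding <=> $1
        LIMIT 20
    ),
    keyword AS (
        SELECT id,
               RANK() OVER (
                   ORDER BY ts_rank_cd(to_tsvector('english', description),
                                       plainto_tsquery('english', $2)) DESC
               ) AS rnk
        FROM products
        WHERE to_tsvector('english', description) @@ plainto_tsquery('english', $2)
        ORDER BY rnk
        LIMIT 20
    )
    -- Reciprocal rank fusion: sum 1 / (k + rank) across both result sets,
    -- with k = 60 as a commonly used smoothing constant
    SELECT id, SUM(1.0 / (60 + rnk)) AS rrf_score
    FROM (
        SELECT id, rnk FROM semantic
        UNION ALL
        SELECT id, rnk FROM keyword
    ) combined
    GROUP BY id
    ORDER BY rrf_score DESC
    LIMIT 10;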


Next piece of content: “Window Functions for Data Analysis with Postgres”. This is from crunchydata.com, and window functions are definitely a capability you need to have in your back pocket as a developer, because they have particular use cases that let you write a lot less application code. They enable you to do things like calculate running totals, create rankings, do lag and lead analysis, and compute moving or rolling averages. Window functions basically work like this: you have a query that returns a set of rows, and they allow you to do operations over a window of that result set. Some examples are shown here: a running total, where the first column is a running total of all the values in the total amount; the first order and the last order for a given set of orders; a sales rank grouped by day and total number of sales; a lag analysis; a rolling average over the last seven days; and an NTILE function that lets you bucket your result set. So definitely check out this blog post if you want to learn more.
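As a small sketch of a couple of these, assuming a hypothetical orders table with order_date and total_amount columns:

    -- Running total of sales ordered by date
    SELECT order_date,
           total_amount,
           SUM(total_amount) OVER (ORDER BY order_date) AS running_total
    FROM orders;

    -- Day-over-day lag analysis and a 7-day rolling average
    SELECT order_date,
           daily_total,
           LAG(daily_total) OVER (ORDER BY order_date) AS previous_day,
           AVG(daily_total) OVER (
               ORDER BY order_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS rolling_7_day_avg
    FROM (
        SELECT order_date, SUM(total_amount) AS daily_total
        FROM orders
        GROUP BY order_date
    ) daily;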


The last piece of content is “Schrödinger's Backups: A Lesson in Uncertainty”. This is from drunkdba.medium.com, and it starts off with a story about backup and restore issues where the database files weren't available to do the restore. Ironically, this was with pgBackRest, and I remember there was a post a few weeks back saying that if you're doing backups, be sure to use a tool like pgBackRest. So even using these tools you can apparently get yourself in trouble. But what I thought was interesting about this blog post is actually the latter part, where he talks about how their backups were taking so much space and so much time, and then he says, quote, "enter PostgreSQL 17 to save the day". What he's referring to is the new incremental backup feature, and apparently this will allow them to save a lot of time on both the backup and restore procedures and make things easier to deal with. So I thought this was interesting as a use case of someone starting to use these new backup features and finding benefits from them. If you're interested in that, definitely check out this blog post.
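For reference, here's a minimal sketch of the server-side prerequisite for PostgreSQL 17's native incremental backups (my illustration, not from the post):

    -- WAL summarization must be on before incremental backups can be taken;
    -- a reload is typically enough to apply it (check the docs for your version)
    ALTER SYSTEM SET summarize_wal = on;
    SELECT pg_reload_conf();
    -- A full pg_basebackup is taken first; later backups pass the previous
    -- backup's manifest to pg_basebackup's --incremental option, and
    -- pg_combinebackup reconstructs a full backup from the chain at restore time.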



 
