
Easy Incremental Data Processing? | Scaling Postgres 347


In this episode of Scaling Postgres, we discuss the new pg_incremental extension for processing data pipelines, a set of Postgres monitoring queries, handling ALTER TABLE retries, and large object dump performance.

Content Discussed

YouTube Video

Podcast Audio

Transcript

Before we kick off the show this week, I want to wish you happy holidays, and I hope things are going well with you. I'm going to be taking a week off, maybe two; it depends on how much Postgres content comes out. So it'll be another week or two until you get the next episode of Scaling Postgres in the New Year. But I hope you, your friends, family, and co-workers continue to do well.

Our first piece of content is “pg_incremental: Incremental Data Processing in Postgres”. This is from crunchydata.com. I think it was a week or two ago that I talked about OrioleDB, and the reason I thought having that new storage system was important is because Postgres doesn't do that great with update-heavy workloads. So generally, if you see a use case where you have to do a lot of updates, you should either not do that on Postgres or change it to an append-only workload. Or, if you want to get really fancy, do an append-only workload and then periodically generate aggregates to store in a table. That way you can just query that summary table to give really fast displays of results without having a heavy update burden (I'll sketch the manual version of this pattern below). This capability has a name in other products; Timescale, for example, calls theirs continuous aggregates, and they basically do continuous aggregation of some append-only workload. Well, now, thanks to the folks at Crunchy Data, we have an open source extension that gets us something similar, called pg_incremental. It, quote, “helps you create processing pipelines for append-only streams of data”, such as IoT, time series, and even event data workloads. Some of the use cases are creation and incremental maintenance of rollup and interval aggregations, incremental data transformations, as well as periodic imports or exports of new data using SQL.
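To make that pattern concrete, here is a minimal, hypothetical sketch of the manual approach in plain SQL. This is not code from the post; the events and view_counts tables and their columns are made up for illustration.

```sql
-- Hypothetical append-only events table and a rollup/summary table.
CREATE TABLE events (
    event_id   bigserial PRIMARY KEY,
    event_time timestamptz NOT NULL DEFAULT now(),
    user_id    bigint NOT NULL
);

CREATE TABLE view_counts (
    day     date   NOT NULL,
    user_id bigint NOT NULL,
    count   bigint NOT NULL,
    PRIMARY KEY (day, user_id)
);

-- Periodic rollup: aggregate a batch of new events and upsert into the summary.
-- Done by hand, you have to track which event_ids were already processed;
-- that bookkeeping is exactly what pg_incremental is meant to automate.
INSERT INTO view_counts (day, user_id, count)
SELECT date_trunc('day', event_time)::date, user_id, count(*)
FROM events
WHERE event_id BETWEEN 1 AND 1000000   -- placeholder range of new events
GROUP BY 1, 2
ON CONFLICT (day, user_id)
DO UPDATE SET count = view_counts.count + EXCLUDED.count;
```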

So let's take a look at what this does. Imagine you have an events table here and data is constantly streaming into it, but you want to do some view counts. So this is a summary table of how many times something was viewed. Now, for the events table they are using a BRIN index to give you fast, efficient range scans, and since this is an append-only table, BRIN should do a pretty good job with it. They insert a million rows of data for this example and they create a pipeline using this extension; specifically, they're creating a sequence pipeline, so it's going to be looking at the sequence to know how to do the aggregations. They're calling it a view count pipeline and it's going to be looking at the events table as the source. And this is the work that needs to be done: basically insert into the view counts table, and it looks like they're truncating the timestamp by day, grouping by user ID, and taking the count, so you get a count of user events. And because this is sequence based, they're looking at a range of event IDs, and if there's a conflict, it goes ahead and does an update instead using the ON CONFLICT syntax (see the sketch below for roughly what that looks like). So now your view count table stays up to date: you have it by day here, so by day, by user, the count. Now, they do make note that with sequences there is a risk that a row with an earlier sequence value commits later and is perhaps missed, but they say, quote, “However, pg_incremental waits for those transactions to complete before processing a range, which guarantees that the range is safe.” And it not only supports sequences; you can also do time interval pipelines as well as file list pipelines. So it can look in a particular directory, and if a new file is there, it can process that file and then just continuously process files as they come in. They show some more examples here. One is a pipeline processing raw JSON data. Again, they're using sequences for this, but they're taking raw event data coming in that's JSON and converting it into a table format and inserting it into events. The next example is doing an aggregation with a time-based pipeline, so they're using time ranges in this case, and they show how an export to a Parquet file in S3 can be achieved; it looks like this is calling a function that actually writes out to S3. And then a fourth example shows files being processed in a pipeline, so it's looking in a particular directory, or an S3 bucket in this case, and loading in new data as it comes in. Now, this does have a dependency on pg_cron, so that is how you schedule how often the aggregates get run, or at least that's my understanding. And you can check the status of your different pipelines; it shows you a sequence pipeline here, the time interval pipeline here, and even the cron-based jobs here. So this seems to be a great tool that would allow you to avoid having to write a lot of application logic to do the same thing, because I've done different techniques like this over the years, but having it all in one tool is super convenient, and it's open source. It's pretty interesting how Crunchy Data has been releasing two or three small, focused extensions that help you get different jobs done, and this one appears no different. So if you're interested in this, definitely check out this blog post and the pg_incremental repo.
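For reference, here is roughly what I understand the sequence-pipeline setup from the post to look like. I'm sketching this from memory, so treat the function name, schema, and named parameters (incremental.create_sequence_pipeline, source_table_name, command) and the $1/$2 batch-range parameters as assumptions; check the blog post or the pg_incremental repo for the exact syntax.

```sql
-- Assumed API sketch (verify against the pg_incremental docs):
-- create a sequence-based pipeline that rolls new events into view_counts.
-- $1 and $2 stand for the range of event_ids handed to each batch.
SELECT incremental.create_sequence_pipeline(
    'view-count-pipeline',
    source_table_name := 'events',
    command := $$
        INSERT INTO view_counts (day, user_id, count)
        SELECT date_trunc('day', event_time)::date, user_id, count(*)
        FROM events
        WHERE event_id BETWEEN $1 AND $2
        GROUP BY 1, 2
        ON CONFLICT (day, user_id)
        DO UPDATE SET count = view_counts.count + EXCLUDED.count
    $$
);
```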

Next piece of content: “Some of My Favorite Things – Postgres Queries”. This is from stborden.wordpress.com. When you've worked with Postgres for a while, you inevitably copy down and record different scripts that you use, and maybe you modify them for your own use cases for doing different things with Postgres. This post has a ton of what I'd call monitoring scripts for Postgres, so it's definitely a handy resource to keep around. What I did find interesting about his set of scripts is that they seem to be three times longer than other scripts I've seen or other references that I've used. He's doing things like the top SQL by mean execution time, by total execution time, or by execution count, so clearly this is mainly looking at pg_stat_statements (a typical example of that kind of query is sketched below). The next one looks at object sizes, including TOAST, so that's always beneficial to have. The next is SQL statements using the CPU by percentages. I think the next is a stats / vacuum projection script, so that's interesting and quite long. Next is unused or rarely used indexes. The next is ranking wait events that are happening in the system, again with pg_stat_activity, then looking at tables with missing foreign key indexes, and finally closing out with a blocking lock tree, so what is causing particular locks to happen. And he says this is just a sample of what's available; at the very top he has a link to his GitHub site that has many more. So if you're interested in that, definitely check out this blog post.
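To give a flavor of the kind of query in that collection, here is a typical top-SQL-by-mean-execution-time query against pg_stat_statements. This is a generic example, not one of his scripts, and the column names assume Postgres 13 or later.

```sql
-- Requires the pg_stat_statements extension (and it being listed
-- in shared_preload_libraries).
SELECT left(query, 60)                    AS query_start,
       calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```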

Next piece of content: “How to ALTER tables without breaking application?”. This is from depesz.com. Now, there's nothing that can really help an ALTER TABLE command that actually has to do a table rewrite. Maybe you're changing a data type in a way that pretty much requires a table rewrite, although you can avoid long locks by, instead of just changing a column in place, creating a new one, creating some triggers to keep it up to date with the existing one, doing a backfill of the data, and then switching over to that new column. That will help avoid a long-running ALTER TABLE if you're just trying to change a column type. But what this post is talking about is when you have very fast ALTER TABLE commands that still require an exclusive lock on the table, even if only for a short time, which could result in a lock queue forming and queries waiting to be executed. Of course, the best way to handle this is to use a lock_timeout: if that ALTER TABLE command exceeds the lock_timeout, it will go ahead and stop itself, which protects the system from getting into a lock queue situation. But if that lock_timeout is triggered, you're going to need to retry, so a lot of what this post is talking about is how to do that retry. He has one example here using an anonymous DO block: basically he sets his lock_timeout, starts a loop, and runs the ALTER TABLE command. If it succeeds, it just exits and he resets the lock_timeout; but if the lock_timeout is triggered, there's an exception, he catches that, and then performs a sleep. So basically it will keep trying until it gets that ALTER TABLE command completed (a minimal version of this pattern is sketched below). The next way you can do it is using a procedure. It's very similar in structure: basically setting the lock_timeout, doing the loop, executing your ALTER TABLE command, and sleeping if there's an exception, but he's added the ALTER TABLE command as a parameter. So now you can simply call that procedure with whatever your ALTER TABLE command is, and it will keep trying it until it succeeds. So if you're interested in doing that, definitely check out this blog post.
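Here is a minimal sketch of that DO-block retry pattern. It's not depesz's exact code; the table name, column, timeout, and sleep interval are placeholders.

```sql
DO $$
BEGIN
    -- Fail fast instead of queueing behind other locks.
    SET lock_timeout TO '100ms';
    LOOP
        BEGIN
            -- Placeholder DDL: any fast ALTER TABLE that only needs a brief lock.
            ALTER TABLE my_table ADD COLUMN new_flag boolean;
            EXIT;  -- success, leave the loop
        EXCEPTION WHEN lock_not_available THEN
            -- The lock_timeout fired (SQLSTATE 55P03): wait a bit, then retry.
            RAISE NOTICE 'lock timeout hit, retrying...';
            PERFORM pg_sleep(1);
        END;
    END LOOP;
    RESET lock_timeout;
END
$$;
```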

Next piece of content: “Quick Benchmark: Improvements to Large Object Dumping in Postgres 17”. This is from credativ.de. In this one they're talking about large objects, or maybe you know them as blobs or binary large objects, because they are using the large object import option, so they're storing these large objects in the database. I actually don't have much experience using this particular feature of Postgres, but they wanted to check the performance changes from 16 to 17, because some enhancements were made to pg_dump. So they tested both 16 and 17 with these large objects, and what they discovered is that Postgres 17 actually uses three and a half times less memory than Postgres 16, which is pretty incredible. The dump files are also a lot smaller: in Postgres 17, the custom-format dump file was 7 times smaller than in 16, and the TOC file of the directory dump was 25 times smaller. So that's incredible. If you're using large objects and you're dumping those large objects out, it definitely looks like you should upgrade to Postgres 17 as soon as you can.
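If you haven't used this feature, here is a small, hypothetical illustration (not from the post) of what the large object workflow looks like: lo_import() loads a server-side file into pg_largeobject and returns an OID that you keep in a regular table. These are the objects whose pg_dump handling was benchmarked.

```sql
-- Hypothetical table holding references to large objects.
CREATE TABLE documents (
    id       bigserial PRIMARY KEY,
    filename text NOT NULL,
    blob_oid oid  NOT NULL
);

-- Server-side import: the file must be readable by the server process
-- (superuser or pg_read_server_files); lo_import() returns the new OID.
INSERT INTO documents (filename, blob_oid)
VALUES ('report.pdf', lo_import('/tmp/report.pdf'));
```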

Next piece of content: “JSON in PostgreSQL : a query tuning case”. This is from dbi-services.com. He was dealing with a very inefficient ORM-generated query; apparently the original query was over 1,500 lines. I can't fathom that type of query. Now, the one he presents here is only 84 lines, but even at 84 lines it took 90 minutes to run, I guess on the actual database. The key issues were a lack of some indexes and, I think he said, some LIKE patterns that of course won't use an ordinary index (see the sketch below for one common workaround). There was also some inefficient looping going on, high-cost sorting, and even some full table scans. So basically he went through the process of optimizing this, and you can see a fair amount of JSON involved as well. Through his optimization he was able to get it down to three minutes, so definitely a lot faster. If you want to learn some lessons from that, I definitely encourage you to check out this blog post.
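As a hypothetical aside, and not necessarily the fix used in the post: a leading-wildcard LIKE on a JSON field can't use an ordinary btree index, but a pg_trgm GIN expression index often helps. The table and JSON key below are made up.

```sql
-- Enable trigram support, then index the extracted JSON text field.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_orders_customer_trgm
    ON orders
    USING gin ((payload ->> 'customer_name') gin_trgm_ops);

-- A leading-wildcard search like this can now use the trigram index.
SELECT *
FROM orders
WHERE payload ->> 'customer_name' LIKE '%smith%';
```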

Next piece of content: there was another episode of Postgres FM last week. This one was on “jOOQ”, and here Michael and Nikolay hosted Lukas Eder, who's the creator of jOOQ. jOOQ is basically a Java, I guess, object-oriented query tool; not a language, I guess more of a DSL. So basically it's an ORM, although a lot of the descriptions don't say it's an ORM, but it looks like that's basically what it is. Looking at the website, though, what it appears to focus on more is being SQL-centric, so it's definitely more database-centric than a lot of the ORMs that exist, which sounds pretty interesting to me. And I think Lukas said it basically feels like SQL as you're writing it. Now, even though it sounds interesting, I don't think I'm going to get started with Java unless a client needs me to. But if you want to learn more, I definitely encourage you to listen to the episode or watch the YouTube video that's posted down here.

Next piece of content: “PG Phriday: What's Our Vector Victor”. This is from bonesmoses.org, and this post basically reflects a presentation he gave at the Postgres conference in Seattle in 2024, where he focuses on pg_vectorize and how it has all the tools to put together all sorts of different AI use cases, in terms of semantic search or doing generative AI work. And he shows the process of working with it here, because he says the typical case of building a RAG app means you have to do all of these steps. Now, they're not too complex to do, but compared to doing all of that, simply vectorizing a table and then calling the RAG function for the user interaction is, he finds, a much easier process. But if you want to learn more, definitely check out this blog post.

Next piece of content: “A Sneak Peek Into the State of PostgreSQL 2024”. This is from timescale.com, and it's a blog post announcing that the State of PostgreSQL 2024 survey results are now available. It runs through some highlights, but there is a link to the actual report right here. It describes all the things in this year's survey, so feel free to check it out. In addition, they had a separate blog post on the “State of PostgreSQL AI in 2024”, so if you're interested in AI-related survey questions, you can definitely check that out here.

And the last piece of content is the next “PostgreSQL Hacking Workshop - January 2025”. This is from rhaas.blogspot.com. He'll be hosting discussions of Andres Freund's talk NUMA versus PostgreSQL, and he says you can sign up using this form here. Over the next couple of months they will also be doing a deep dive into Postgres statistics as well as the wire protocol. So if you want to learn more about that, definitely check out this blog post.
