
OLTP To Parquet For Analytics | Scaling Postgres 353
In this episode of Scaling Postgres, we discuss how data can flow from your OLTP database to Parquet files for analytics, as well as parallel queries, view inlining, and partitioning advice.
Content Discussed
- Incremental Archival from Postgres to Parquet for Analytics
- Postgres Parallel Query Troubleshooting
- VIEW inlining in PostgreSQL
- Dealing with the PostgreSQL error "found xmin ... from before relfrozenxid ..."
- Scaling with PostgreSQL without boiling the ocean
- Return of the BUFFERS
- PostgreSQL 18: Virtual generated columns
- How about trailing commas in SQL?
- On Writing a Book About PostgreSQL
- Why mentor Postgres developers with Robert Haas
- Distribute PostgreSQL 17 with Citus 13
- PostgreSQL 18: Introduce autovacuum_vacuum_max_threshold
- Waiting for PostgreSQL 18 – Support RN (roman-numeral format) in to_number().
- Top 8 PostgreSQL Extensions
- PostgreSQL planner development and debugging
- The Early History of Spatial Databases and PostGIS
- Using Cloud Rasters with PostGIS
- pgroll 0.9.0 update
- Learning PostgreSQL from AI and JSON exploration
- Contributions for the week of 2025-02-03 (Week 5 overview)
- PgPedia Week, 2025-02-09
- FOSDEM 2025: A PostgreSQL Community Tradition
- PostgreSQL Berlin February 2025 Meetup
- PGDay Napoli 2025
- IoT Renewable Energy Models: Building the Future With Time-Series Data
- Automate Partial Data Dumps with Postgres and GitHub Actions
YouTube Video
Podcast Audio
Transcript
For many years we've had the terms OLTP, or Online Transaction Processing, and OLAP, or Online Analytical Processing. Transaction processing is basically your line-of-business functions: you want to process some type of transaction in the database, and that's when you're doing OLTP. But if you want to run a lot of reports and analysis, that's generally OLAP. To do that type of analysis, it typically required exporting the data from your transactional systems to another database system. Maybe that's a data warehouse, a data mart, a data lake. And a lot of times it took a long time to extract the data and load it into the other system; if you were lucky, it was a day. But that's been getting faster with different techniques over the years. Right now people are doing things like change data capture to try to keep that analytical data as up to date as possible. Well, the first blog post this week talks about an incremental way to do it. Also, be sure to stay till the end, where in my consulting corner we'll talk about partitioned tables and optimization. But I hope you, your friends, family and coworkers continue to do well.
Our first piece of content is “Incremental Archival from Postgres to Parquet for Analytics”. This is from crunchydata.com. They previously announced the extension pg_parquet, which basically allows you to export and import data to and from Parquet files using essentially COPY commands. They also created pg_incremental, which lets you process a never-ending series of time intervals or files, and it depends on the pg_cron extension for scheduling. This blog post talks about using these together to build yourself an incremental data warehouse: something that is kept up to date, not in real time, but probably real enough for a lot of use cases. They go through the process of creating an events table that you're tracking, inserting some data into it, and they give an example of the function, along with the extensions you need, to get this running and generate these Parquet files in S3. Now once you have the Parquet files in S3, it's up to you how you want to analyze them. They of course mention their solution, the Crunchy Data Warehouse, but you could use any tool to read and analyze these Parquet files. Maybe you could use, I think it's pg_lakehouse, one that embeds DuckDB, an analytical database engine, within Postgres. Or there's also pg_duckdb. You could use those or the solution they're mentioning here. They also show the performance of an analytical query running on 100 million rows: using the raw data on the original server, it took over five seconds to process the query, whereas if you use Parquet files, or an even more optimized form of that, an Iceberg table, you get ten times better performance from the optimized column storage. So if you need to do some analytical processing and are looking for a way to extract data from your operational data stores, you may want to check out this option.
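As a rough illustration of the pg_parquet side of this (the exact pg_incremental pipeline setup is shown in the post itself, and the table, columns, and bucket name here are just placeholders), exporting a slice of a table to a Parquet file in S3 is essentially a COPY:

```sql
-- Enable the extension (S3 credentials also need to be configured for the server)
CREATE EXTENSION IF NOT EXISTS pg_parquet;

-- Export one day's worth of events to a Parquet file in S3
COPY (
    SELECT event_id, event_time, payload
    FROM events
    WHERE event_time >= '2025-02-01' AND event_time < '2025-02-02'
) TO 's3://my-analytics-bucket/events/2025-02-01.parquet'
  WITH (format 'parquet');
```

pg_incremental's job is then to run a parameterized version of this COPY for each new time interval on a pg_cron schedule, so the Parquet files in S3 stay close to current without re-exporting old data.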
Next piece of content, “Postgres Parallel Query Troubleshooting”. This is from crunchydata.com again. Postgres allows some queries to run in parallel, but certain things may trigger a downgrade of a parallel query to a serial, non-parallel one. They talk a little about how parallel queries work and some of the reasons you may not be getting a parallel query. You could have exhausted the worker pool, which is capped by the max_parallel_workers parameter. Or you may get fewer workers than you expect, perhaps because max_parallel_workers_per_gather is lower than you want; that setting determines how many workers each query can use. Or maybe there are so many queries running at the same time that there aren't enough worker resources for each of them, so some have to run serially. The other reason is optimizer statistics: maybe the planner has determined it's more efficient to just run the query serially rather than in parallel. To simulate the issue, they created a large table and set max_parallel_workers_per_gather to four so you can see what a parallel plan looks like, and then set it to zero so you can see the explain plan for a serial run of that query. The other interesting thing is how to detect this in production. Of course you can see it using EXPLAIN ANALYZE, but pg_stat_activity also gives you an indication of the parallel workers in use if you look at the backend_type column. And in the last paragraph they created a procedure, linked here, that simulates an Oracle feature called parallel statement queuing: it checks how many workers are available at the time a query is ready to run, and if there are insufficient workers, meaning you might get a slow query because it downgrades to serial execution, it pauses that query for a delay and then rechecks, and only runs the query once sufficient workers are available. So that seemed pretty interesting. And if you want to learn more, you can check out this blog post.
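A quick, hedged sketch of the two detection techniques mentioned (big_table is a placeholder):

```sql
-- Was the query actually parallelized? Compare planned vs. launched workers.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM big_table;
-- In the output, look at the Gather node: "Workers Planned: 4" with
-- "Workers Launched: 0" means the query was downgraded to serial execution.

-- How many parallel workers are busy right now, cluster-wide?
SELECT count(*)
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';
```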
Next piece of content, “VIEW inlining in PostgreSQL”. This is from notso.boringsql.com, and he's talking about views, which I usually think of as virtual tables: really the system has a stored query that you can reuse. He shows a good example of what view inlining is: if you have defined a view and then run a query against it, it's as if the planner replaces the view reference with a subquery containing the view's definition. So essentially it inlines the view into the query, and by doing this the planner can consider the whole query at once; it doesn't have to plan the view and then your query on top of it separately. He shows an example of that in the explain plan. But he says there are some things that can prevent inlining, such as DISTINCT ON operations, window functions, set operations like UNION, INTERSECT and EXCEPT, certain complex aggregations, materialized CTEs, volatile functions, or complex subqueries. If the planner runs into these, it may choose to do a subquery scan, materialize certain nodes, or even do separate aggregation or sorting steps. But if you want to learn more about this, definitely check out this blog post.
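As a rough sketch of what inlining means (using a made-up orders table), a query against a simple view is planned as if the view body were written inline as a subquery, so the outer predicate can be considered together with the view's own filter:

```sql
-- A simple view: just a stored query over a base table
CREATE VIEW active_orders AS
SELECT id, customer_id, total
FROM orders
WHERE status = 'active';

-- When the view can be inlined, the planner effectively treats this...
EXPLAIN SELECT * FROM active_orders WHERE customer_id = 42;

-- ...like this, combining both predicates against the base table
EXPLAIN SELECT *
FROM (SELECT id, customer_id, total
      FROM orders
      WHERE status = 'active') AS v
WHERE customer_id = 42;
```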
Next piece of content, “Dealing with the PostgreSQL error "found xmin ... from before relfrozenxid ..."”. This is from cybertec-postgresql.com, and he says this error is reported often enough; thankfully I've never seen it, because it's definitely not good. First the blog post talks about xmin and xmax, the two transaction IDs on every row that help Postgres determine its visibility. But the ultimate visibility of a row is settled once it's been frozen, and this freezing is important so that transaction IDs can be reused without a wraparound problem, where something that was in the past suddenly appears to be in the future. The problem with an xmin that is older than the table's relfrozenxid is that this xmin should already have been frozen and wasn't, so it's an indication of data corruption. Now, I'm simplifying this greatly; if you want more detail, definitely read this blog post, it is very good. He says the main reason this can happen is some sort of corruption, so basically you need to address the corruption, and he actually simulates it here by updating the system catalog, causing the error. It only happens during vacuum operations, so it's not something you'll hit with ordinary queries; this particular error only shows up during vacuum, which is why it's so important to track what Postgres errors are being logged. Now, the ways to fix the error. The best way, he says, because it ensures any corruption is gone, is dumping and restoring the table, but that can be very inconvenient and take a long time. Another alternative is to update relfrozenxid in the pg_class system catalog, but that's dangerous: the main thing, he says, is that if you pick a bad value for relfrozenxid, you could end up with worse problems. Another option is using pg_surgery; that's a tool I've never used, and I think if you have to use it, you'd want to classify yourself as a Postgres surgeon. But he did mention an interesting technique: updating the corrupted rows, so their transaction IDs get refreshed and freezing can move forward. And someone made a comment down here: they actually addressed the situation by updating the table, setting a column to itself, essentially an update where nothing really changed, and only updating the rows where the xmin was less than relfrozenxid. So this was a pretty interesting technique to address this issue. But if you want to learn more, definitely check out this blog post.
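As a hedged sketch of that commenter's workaround (mytable and some_column are placeholders, and you'd want backups and careful testing before trying anything like this on a corrupted table), the idea is a no-op update restricted to the rows whose xmin is older than the table's relfrozenxid, so they get a fresh xmin that vacuum can then freeze normally:

```sql
UPDATE mytable
SET some_column = some_column          -- no-op change, just rewrites the row with a new xmin
WHERE age(xmin) > (
    SELECT age(relfrozenxid)           -- rows older than the table's freeze horizon
    FROM pg_class
    WHERE oid = 'mytable'::regclass
);
```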
Next piece of content, “Scaling with PostgreSQL without boiling the ocean”. This is from shayon.dev, and it's quite a long blog post with basically eleven sections, so I'm not going to go into detail; I'll just mention the sections in case you want to check it out. The first covers optimizations with regard to lock contention. He also talks about advisory locks and how they can be used, foreign keys and the issues they can bring, index bloat, and TOAST, the oversized-attribute storage technique. He covers vertical versus horizontal scaling solutions, append-only versus update-heavy tables and how to address those, and DDL changes under load, where setting lock_timeout is, in my opinion, the most important thing. There are zero-downtime major version upgrades and how to do them, how to address counting issues where counting is slow, and finally multi-tenancy and different techniques for dealing with it. But if you want the details, definitely check out this blog post.
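For the DDL-under-load point, the usual pattern (the table name here is just an example) is to cap how long the DDL statement will wait for its lock, so a blocked ALTER TABLE can't queue up behind a long-running transaction and stall everything else behind it:

```sql
SET lock_timeout = '2s';                      -- give up on acquiring the lock after 2 seconds
ALTER TABLE orders ADD COLUMN notes text;     -- if it times out, retry later instead of blocking traffic
```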
Next piece of content: there was another episode of Postgres FM last week. This one was on “Return of the BUFFERS”. This is their third revisit of this topic, prompted by Postgres 18 presumably turning BUFFERS on by default for EXPLAIN: when you run EXPLAIN ANALYZE it will give you the buffer statistics by default. And they are huge fans of this, definitely want to have it, because they are buffer lovers, I guess. But they did talk about the future of buffers, and why buffer counts may be larger than you would expect; maybe there's bloat involved, so more data is being accessed than expected. They also talked about performance optimization and measuring cold versus hot cache. But if you want to hear everything they covered, you can definitely listen to the episode here or watch the YouTube video down here.
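For reference, on releases before 18 you still have to ask for buffer statistics explicitly (orders is a placeholder table); the change they're discussing would make this part of EXPLAIN ANALYZE's default output:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
-- Look for lines like "Buffers: shared hit=120 read=35" on each node:
-- "hit" blocks came from shared buffers, "read" blocks had to be fetched from outside them.
```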
Next piece of content, “PostgreSQL 18: Virtual generated columns”. This is from dbi-services.com, and this is super great in my opinion. We've had generated columns since Postgres 12, but those generated columns are stored, so the computed value is actually written into that column of the table, whereas 18 looks like it's going to have virtual generated columns. So you don't have to use disk space for storing these additional columns, but you can query them like any other column. They show an example here: you just omit the STORED option when creating the generated column and it gives you a virtual one. So that's definitely awesome. You can check out this blog post if you want to learn more.
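Here's a small sketch of the difference, assuming a simple prices table; the stored form works today, and the virtual form reflects how the post describes Postgres 18, where omitting STORED gives you a virtual column:

```sql
-- Stored generated column (available since Postgres 12):
-- gross is computed on write and occupies space on disk.
CREATE TABLE prices (
    net   numeric NOT NULL,
    tax   numeric NOT NULL,
    gross numeric GENERATED ALWAYS AS (net + tax) STORED
);

-- Postgres 18 virtual flavor: leave off STORED and the value is
-- computed when you read it, taking no table storage.
CREATE TABLE prices_virtual (
    net   numeric NOT NULL,
    tax   numeric NOT NULL,
    gross numeric GENERATED ALWAYS AS (net + tax)
);
```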
Next piece of content, “How about trailing commas in SQL?”. This is from peter.eisentraut.org. I know when I found that various languages support trailing commas, I thought, thank goodness, this is awesome. And now whenever I use other languages that don't have them, I always want them. So I would love for SQL to have trailing commas, but this blog post talks about how we should handle it. Is this something Postgres should do? Should the SQL standard do it first and then Postgres? There's a lot of discussion about how to do it: do we allow them everywhere, or only in certain places? But if you want to learn more, check out this blog post.
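To make the idea concrete (users is a made-up table), this is the kind of statement the discussion is about:

```sql
-- Valid today: no comma after the last select-list item
SELECT
    id,
    name,
    created_at
FROM users;

-- What trailing-comma support would permit (currently a syntax error in Postgres):
-- SELECT
--     id,
--     name,
--     created_at,
-- FROM users;
```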
Next piece of content: Ryan Booz and Grant Fritchey have released a book called “Introduction to PostgreSQL for the Data Professional”, and the announcement is at softwareandbooz.com; the actual title of the blog post is “On Writing a Book About PostgreSQL”. It looks to cover more administration-related topics from the overview I saw, and there's an announcement about it on the podcast as well. So if you're interested, you can check this out.
Next piece of content: there was another episode of Talking Postgres. This one was “Why mentor Postgres developers with Robert Haas”. So you can check this out if you're interested.
Last piece of content, “Distribute PostgreSQL 17 with Citus 13”. It seems Citus 13 has been released with support for Postgres 17. So if you use Citus, or are thinking about it, it now supports Postgres 17.
And now for this week's consulting corner, I'm going to talk about partitioned tables and optimization. If you choose to partition your tables, you need to think carefully about how you want to do indexing. How I generally like to do it is make sure the primary key of the partitioned table has whatever column you're partitioning by, the shard key, as the first column, and then a second column, maybe an identifier such as a bigint, a date, or something else. So I'm usually using a multi-column primary key on partitioned tables, because I want to include that partition key. And in any other indexes on that partitioned table, I usually include the partition key as the first column too, so most of my indexes are multi-column indexes when using partitioned tables. If you design it that way, you also need to make sure your queries always include the partition key along with the other columns in the indexes, so the full indexes can be used. I have seen cases where clients were not including the partition key, particularly as the leading column of the index, so the index didn't get used efficiently, which of course can lead to slower queries. And even if you query one of the child partitions directly, you should still include all the indexed columns in your query so the planner can find the most optimized plan for executing it. The other thing to keep in mind is that an index created on the parent partitioned table cannot be dropped concurrently. You can effectively add it concurrently: first you build it concurrently on each of the child tables, then you add the index to the parent, which happens very quickly. But when it comes to removing that index, the only way is to drop it non-concurrently. So you do need to decide whether you're going to create all your indexes on the parent table, because once they're there, it's kind of hard to remove them. But those are just a few tips when working with partitioned tables.
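A hedged sketch of this pattern (table, column, and partition names are illustrative): lead both the primary key and secondary indexes with the partition key, and add new indexes by building them concurrently on the children before attaching them to the parent:

```sql
-- Partitioned table with the partition key leading the primary key
CREATE TABLE events (
    account_id bigint      NOT NULL,   -- partition key
    event_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    PRIMARY KEY (account_id, event_id)
) PARTITION BY HASH (account_id);

CREATE TABLE events_p0 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ...and so on for the remaining partitions.

-- Secondary index, again leading with the partition key. To avoid long locks:
-- create an (initially invalid) index on just the parent, build matching
-- indexes concurrently on each child, then attach them to the parent index.
CREATE INDEX events_acct_created_idx ON ONLY events (account_id, created_at);

CREATE INDEX CONCURRENTLY events_p0_acct_created_idx
    ON events_p0 (account_id, created_at);

ALTER INDEX events_acct_created_idx
    ATTACH PARTITION events_p0_acct_created_idx;
```

Dropping events_acct_created_idx later can only be done with a plain DROP INDEX, not concurrently, which is why it's worth deciding up front which indexes really belong on the parent.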