
Top Ten Analytics with pg_mooncake | Scaling Postgres 355
In this episode of Scaling Postgres, we discuss how Postgres & pg_mooncake achieve top ten Clickbench results, a comparison of Oracle and Postgres transactions, sharding to scale, and connection counts.
Content Discussed
- pg_mooncake: (another) Columnar storage for PostgreSQL
- pg_mooncake: Fast Analytics in Postgres with Columnstore Tables and DuckDB
- Clickbench says Postgres is a great analytics database
- Comparison of the transaction systems of Oracle and PostgreSQL
- Dave's PostgreSQL Stuff
- While Postgres Redlined, Robinhood Sharded to Scale
- Postgres as a Graph Database: (Ab)using pgRouting
- Wish I Was Wrong About TDE
- EdgeDB is now Gel and Postgres is the Future
- PostgreSQL Hacking + Patch Review Workshops - March 2025
- Timeseries Data in Postgres
- Semantic Search With Cohere and PostgreSQL in 10 Minutes
- Building RAG systems in Go with Ent, Atlas, and pgvector
- PgPedia Week, 2025-02-23
- Registration Is Open!
- Call for Lightning Talks
- Postgres Extensions Day 2025 Montreal: Registration and Call for Speakers
- EDB's Global Postgres Leadership
- Beyond Docker Compose: An Alternative for Deploying Postgres for Testing
- How Notion Runs PostgreSQL at Scale on Amazon RDS with pganalyze
Transcript
Many people use Postgres for their transactional processing, but for analytics, sometimes it's hit or miss given its performance. And some people even say Postgres is slow for analytics. There have been a number of extensions that have come out trying to improve that. Maybe they add column storage, like TimescaleDB supports, or maybe they give you access to parquet files to run queries from, like pg_duckdb. Well, we've got another extension called pg_mooncake that is trying to accelerate analytics performance in Postgres. Also, be sure to stay till the end for my consulting corner, where I talk about some of the defaults that are set by database hosting providers. But I hope you, your friends, family and coworkers continue to do well.
Our first piece of content is “pg_mooncake: (another) Columnar storage for PostgreSQL”. This is from dbi-services.com and this is talking about pg_mooncake. It actually uses some other extensions, like pg_duckdb, to make what it's doing a reality. So he goes through and installs the extension, then creates the extension in his database. When you create a table, you want to state USING columnstore so it will use the pg_mooncake format. You'll see that the access method of a normal table is heap, while the new columnstore table's access method is columnstore. He inserts a million records into each one, and when he looks at the size of the files on disk, you'll see a pretty normal size for the heap table. But for the column storage table, it lists zero bytes. And if you try to look up this file based upon the relation file path, the file doesn't exist. That's because the data is being stored in parquet files in the directory listed here. And then if you run a query against it, you can see a very different explain plan, because what pg_mooncake does is embed DuckDB in Postgres and use that processing engine to process the query. He does show some of the performance differences when you're doing inserts or querying the data. As long as you have data that compresses well and is a good fit for column storage, you could see about four times faster writes with the column storage format and about six times faster queries. And I'm sure these performance differences would grow based upon the amount of data you have as well. But if you're like me and you wanted more detail about what pg_mooncake is, well,
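If you want to try this yourself, here's a minimal sketch of the kind of commands the post walks through. The table and column names are my own, not the article's, and it assumes the pg_mooncake extension has already been created in the database.

```sql
-- Hypothetical tables; assumes CREATE EXTENSION pg_mooncake has already run.
CREATE TABLE events_heap (id bigint, payload text);                    -- default heap storage
CREATE TABLE events_cs   (id bigint, payload text) USING columnstore;  -- pg_mooncake columnstore

-- Compare the access methods of the two tables.
SELECT c.relname, a.amname
FROM pg_class c
JOIN pg_am a ON a.oid = c.relam
WHERE c.relname IN ('events_heap', 'events_cs');

-- Load a million rows into each and compare plans.
INSERT INTO events_heap SELECT g, 'row ' || g FROM generate_series(1, 1000000) g;
INSERT INTO events_cs   SELECT g, 'row ' || g FROM generate_series(1, 1000000) g;

EXPLAIN SELECT count(*) FROM events_cs;  -- should show the DuckDB-backed plan
```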
I'm going to look at a few previous blog posts that have been posted on the mooncake.dev website. The first one here is actually from November, and it says “pg_mooncake: Fast Analytics in Postgres with Columnstore Tables and DuckDB”. I'm going to scroll down to the “how pg_mooncake was built” section here, because the data for pg_mooncake is actually stored in parquet files. This could be locally on your database system, or it could be on S3. But the table metadata about the parquet files, and whether data gets inserted, updated or deleted, is all stored in heap tables. So that helps it handle transactional integrity. Unlike a lot of column storage formats, you can insert, update and delete data in this column storage format with transactional integrity. And they say whenever you do an insert or a copy, it will create new parquet files, but if you do an update or delete, it will remove parquet files and create new ones with the modifications. Of course this has a ramification: as they say here, deleting one row from a 1 million row parquet file could mean writing a new one with a million rows minus one. So they definitely want to make this more efficient. And I should mention that pg_mooncake at this point is version 0.1, so it's definitely early days. I felt this was an insightful paragraph as well. At its core, pg_mooncake is a TAM, a table access method for columnstore, plus a storage extension for parquet files. So it stores the data in parquet files and uses the DuckDB engine to do the querying, as we saw in the explain plan in the previous blog post. And table metadata, including the addition and deletion of parquet files, is stored inside a Postgres table for transactional consistency. And lastly, queries involving columnstore tables are routed from Postgres to DuckDB, and the results are streamed back to Postgres. And they have this chart here. So your SQL comes in and, if it's targeting a columnstore table, some query modifications need to be done, but it goes to the DuckDB engine and it accesses the columnstore table on the file system or S3 through pg_mooncake. But it does also use the pg_duckdb extension if it needs to merge in data from heap tables. Unfortunately, I couldn't really find any blog posts about performance comparisons yet.
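To make the transactional-integrity point concrete, here's a small hedged sketch. These are ordinary Postgres statements against the hypothetical columnstore table from earlier; the extension is what handles rewriting the underlying parquet files behind the scenes.

```sql
-- Updates and deletes on a columnstore table participate in normal transactions.
BEGIN;
UPDATE events_cs SET payload = 'corrected' WHERE id = 42;
DELETE FROM events_cs WHERE id = 43;
-- Behind the scenes, pg_mooncake rewrites the affected parquet files and records
-- the file additions/removals in heap metadata tables, all within this transaction.
ROLLBACK;  -- nothing becomes visible to other sessions; the metadata changes are undone
```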
But they do have a blog post on mooncake.dev that says “Clickbench says Postgres is a great analytics database”, and they were able to get it into the top 10 on ClickBench. And this is PostgreSQL with pg_mooncake, so it is rivaling dedicated analytics databases. So if you want to stick with Postgres but need more analytical power, pg_mooncake may give it to you. The post does a quick review of what it is: the storage format is the columnstore, it uses the DuckDB engine, which offers vectorized execution of queries against the column storage, and the table metadata and management are handled directly in PostgreSQL, I believe through the heap tables. So if you're looking to boost your analytics performance, maybe you want to check out pg_mooncake as well.
Next piece of content: “Comparison of the transaction systems of Oracle and PostgreSQL”. This is from cybertec-postgresql.com, and even though both Postgres and Oracle are relational database systems and they both comply with a lot of the SQL standard, there are still some subtle differences in how they implement things transactionally. This blog post first covers some of the ways that they are similar, but then starts highlighting the differences. Number one is autocommit. Whenever you start running a statement, Oracle implicitly starts a transaction, and you must explicitly commit or roll back for whatever you've run to actually be committed. Whereas with Postgres, each statement is its own transaction by default, and if you want to run multiple statements in a single transaction, you have to explicitly say BEGIN and then COMMIT or ROLLBACK. The other difference is statement-level rollback. In Oracle, if you encounter an error during a transaction, it doesn't automatically abort; it only rolls back the failed statement, and you can continue with the transaction and then choose to ultimately commit or roll it back. Whereas with Postgres, if you're in a transaction and you encounter an error, the only option is to roll back. The only exception to that is if you start using subtransactions. But as the link here notes, subtransactions can severely affect performance, so you do need to be careful about that, particularly at scale. Another thing is transactional DDL. Making schema changes in Oracle, there's no way to roll those back, whereas with Postgres you can, with the very few exceptions listed here. But I'm not going to continue and read through the entire blog post. There is a fair amount of information in here, and if you are looking to migrate from Oracle to Postgres, I definitely encourage you to check this out to be aware of how transactions differ between Oracle and PostgreSQL.
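As a quick illustration of the statement-level rollback difference on the Postgres side, here's a hedged sketch (the accounts table is made up and assumed to have a primary key on id): without a savepoint the whole transaction is aborted after an error, while a savepoint lets you recover just the failed statement, similar to Oracle's default behavior.

```sql
-- In Postgres, an error aborts the whole transaction...
BEGIN;
INSERT INTO accounts (id) VALUES (1);
INSERT INTO accounts (id) VALUES (1);   -- duplicate key error
INSERT INTO accounts (id) VALUES (2);   -- ERROR: current transaction is aborted
ROLLBACK;

-- ...unless you wrap the risky statement in a savepoint (a subtransaction).
BEGIN;
INSERT INTO accounts (id) VALUES (1);
SAVEPOINT before_risky;
INSERT INTO accounts (id) VALUES (1);   -- fails
ROLLBACK TO SAVEPOINT before_risky;     -- only the failed statement is undone
INSERT INTO accounts (id) VALUES (2);   -- transaction continues
COMMIT;
```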
Next piece of content is actually a new blog called “Dave's PostgreSQL Stuff”, and he's posted seven or eight micro blogs on various PostgreSQL topics. So I'm not going to talk about one in particular, but if you are looking for more Postgres content, you can definitely check to see if any of these are of interest. This can be found at tokerpostgresql.blogspot.com.
Next piece of content: “While Postgres Redlined, Robinhood Sharded to Scale”. This is from tomlinford.com and this is a blog post by someone who, I believe, used to be at Robinhood, talking about scaling that platform and how they chose to go about sharding their Postgres database to reach the levels of performance they were looking for. Now, there's not a lot of detail in this post, it is pretty brief, but it was still kind of interesting seeing some of the decisions they made. For example, they were choosing between application sharding, where someone contacting the application gets routed to a particular shard of the application, versus data sharding, meaning actually sharding the PostgreSQL database but still having one monolithic application that would choose which data to access. They actually went with application sharding, so the routing was done before reaching the application and the whole application was sharded. But if you want to learn more, definitely check out this blog post.
Next piece of content: “Postgres as a Graph Database: (Ab)using pgRouting”, although you could say “using” as well. This is from supabase.com. The pgRouting extension is usually used with PostGIS for calculations such as shortest path, or maybe network analysis to solve complex routing problems, like the shortest distance traveling across the country, for example. To use some of these functions, you basically need a graph representation of the data. So you'll have a series of nodes, which are the entities, and then the edges, which are, say, the relationships. But they show a few examples of using the pgRouting extension for things other than GIS work. In the first example, imagine you have a project that has many tasks and those tasks have dependencies. Well, what is the critical path? What is, say, the longest path through the project, to see what the earliest end date could be? Here, each task is a node and the edges are the dependencies between them. So they set up a task table and then a dependencies table, added the pgRouting extension, and then used the pgr_dijkstra function to find the shortest or the longest path through the network of tasks, as shown in the sketch below. You can see the aggregate cost being reported down here. The next example he shows is routing based upon resource allocation. Imagine you have a set of servers with particular latencies between them. So the servers are the nodes and the server latencies are the edges. Here, using the pgr_aStar function, you're able to find the most efficient path through the graph. And then the final example is doing recommendations. Imagine recommendations on YouTube. He actually has two different types of nodes, categories and videos, and the edges are the video relationships and the interactions. Using pgr_dijkstra, you could come up with the shortest or most relevant path between a user and new videos for them to watch. So I thought this was particularly interesting. You can check it out if you want to learn more.
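As a rough sketch of what that task-dependency example might look like (the table and column names here are my own, not the article's, and it assumes CREATE EXTENSION pgrouting has already run), you can feed pgr_dijkstra an edge query where the dependency duration acts as the cost. This only shows the shortest-path call; deriving the longest (critical) path takes an extra step, which the article works through.

```sql
CREATE TABLE tasks (
    id   bigint PRIMARY KEY,
    name text
);

CREATE TABLE dependencies (
    id       bigint PRIMARY KEY,
    source   bigint REFERENCES tasks(id),  -- task that must finish first
    target   bigint REFERENCES tasks(id),  -- task that depends on it
    duration numeric                       -- used as the edge cost
);

-- Cheapest chain of dependencies from task 1 to task 5.
SELECT seq, node, edge, cost, agg_cost
FROM pgr_dijkstra(
    'SELECT id, source, target, duration AS cost FROM dependencies',
    1,                  -- start task id
    5,                  -- end task id
    directed := true
);
```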
Next piece of content is actually a micro blog from Bruce Momjian at momjian.us, “Wish I Was Wrong About TDE”. Basically, reading the reply on the discussion thread about TDE in Postgres, my takeaway is that it feels more like a “when it will happen” feature as opposed to an “if”. But there's still a lot of uncertainty given how complex this type of change will be and the number of areas that it touches. I do know there are extensions that support TDE today, so you could use those as an option if you're looking for that.
Next piece of content: “EdgeDB is now Gel and Postgres is the Future”. This is from geldata.com. I think we have shown some EdgeDB blog posts on Scaling Postgres here, but I really wasn't 100% sure what EdgeDB was. Well, now they're Gel, and if you look at the main website, they're actually a data interaction layer that sits on top of Postgres. So they offer all sorts of services in that layer between your application and Postgres. If you feel this layer might be beneficial for you, you can check this out.
The last piece of content: “PostgreSQL Hacking + Patch Review Workshops - March 2025”. This is from rhaas.blogspot.com. The next hacking workshop is coming up, and they'll be watching and discussing Louise Leinweber's talk, A Deep Dive into Postgres Statistics. If you want to join the discussion, you can use the sign-up form down here. He also mentions a set of patch review workshops that Paul Jungwirth is organizing, and he really likes this because, first, patch reviewing is a great way for people to get started hacking on Postgres, and secondly, he says we desperately need more skilled patch reviewers. There's a lot of proposed Postgres code that needs review and improvement and not enough people with the skill and time to do that. So definitely check this link out as well if you're interested. And lastly, he did say that the PostgreSQL Hacker Mentoring Discord server has just been renamed to PostgreSQL Hacking. So it's a Discord server not meant just for educating newcomers, but for anyone and all levels of discussion. So if you want to learn more, definitely check out this blog post.
And now it's time for the Consulting Corner. This week I'm going to be talking about a default that I've been a little confused by, specifically with AWS and their RDS hosting platforms. Many of the server configuration options for Postgres are set based upon the size of the instance, so shared_buffers and, I think, effective_cache_size are all based upon the instance size that you're running your RDS database on. But what's strange to me is that max_connections always seems to be a really high value. No matter the instance size I look at, it seems to be in the thousands, normally 5,000. And in my consulting I've never recommended setting thousands of connections. So part of me wonders how much of this is just a convenience, so that users don't have to restart their database to change the configuration, versus whether setting it so high is actually a good practice. Especially in older versions of Postgres, I think you're going to run into more problems, because I know you can hit lock manager contention at higher levels of activity on the database. And I know one database system with an application where, once it hits 2,000 connections, you start seeing a lot of lock manager contention just due to the amount of activity that database is producing. Frankly, whenever I see someone setting their max_connections at 1,000 or more, that's a sign that they really need to start using a pooler of some sort so they don't have to set it quite that high. That could be PgBouncer, or if you're in AWS, for example, you could use RDS Proxy. But I'm actually curious if anyone thinks many thousands of connections is fine for a Postgres system, because that hasn't been my experience. I usually like to keep it in the high hundreds or very low thousands, if that. But let me know what you think.
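If you want to check where your own instance stands before reaching for a pooler, here's a quick hedged sketch of the kind of queries I'd run; nothing here is RDS-specific, and the 500 is just an illustrative number, not a recommendation for your workload.

```sql
-- What is max_connections set to, and how many connections are actually in use?
SHOW max_connections;

SELECT count(*) AS total_connections,
       count(*) FILTER (WHERE state = 'active') AS active_connections
FROM pg_stat_activity;

-- On a self-managed instance you could lower it directly (takes effect after a restart);
-- on RDS you would change it in the parameter group instead.
ALTER SYSTEM SET max_connections = 500;
```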