Splicing Elephant & Duck DNA | Scaling Postgres 330
In this episode of Scaling Postgres, we discuss the merging of Postgres and DuckDB via the pg_duckdb extension, how this can help the analytics story for Postgres, some ways to improve Postgres analytics, and how to build a search engine in Postgres.
Content Discussed
- Splicing Duck and Elephant DNA
- Putting DuckDB in Postgres to Query Iceberg
- Postgres Powered by DuckDB: The Modern Data Stack in a Box
- A TimescaleDB analytics trick
- Postgres as a search engine
- Getting started with benchmarking
- Good Benchmark Engineers and Postgres Benchmark Week
- How Postgres stores data on disk – this one's a page turner
- CloudNativePG – Connecting external applications
- CloudNativePG Recipe 11 - Isolating PostgreSQL Workloads in Kubernetes with Kind
- PostgreSQL Hacking Workshop - September 2024
- How to pick the first patch?
- Postgres Foreign Data Wrappers with Wasm
- Postgres query re-optimisation in practice
- Authentication monitoring in PostgreSQL
- Authentication with SSL Client Certificates
- Boosting Query Performance with Memoize in PostgreSQL
- Yet Another JSON parser for PostgreSQL
- vec2pg: Migrate to pgvector from Pinecone and Qdrant
- PostgreSQL Internals Part 4: A Beginner’s Guide to Understanding WAL in PostgreSQL
- Running Postgres inside Node-RED via WASM and PGlite
- Build a custom HTTP client in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL: An alternative to Oracle’s UTL_HTTP
- Vector Database Options for AWS
YouTube Video
Podcast Audio
Transcript
It's crazy. In the last year I've heard a ton about DuckDB being an analytical database that gives incredible performance for analytical workloads. But in the last few months I've heard of people trying to integrate it with Postgres to shore up areas where Postgres isn't that great at analytical workloads. Well, this week we're going to talk about a potentially significant integration that's happening, but I hope you, your friends, family and coworkers continue to do well.
Our first piece of content is “SPLICING DUCK AND ELEPHANT DNA”. This is from motherduck.com. That's quite a name for a company. Basically this is an announcement of pg_duckdb, an open source Postgres extension that embeds DuckDB's analytical engine within Postgres, allowing you to do super fast analytical queries. Now, back in June I did cover a post by paradedb.com where they were integrating DuckDB as well. But this extension is going to be fully open source with a permissive MIT license, the IP is owned by the DuckDB Foundation, and it's actually even being hosted in the DuckDB GitHub organization. So this is definitely going to be a first-class citizen in the DuckDB ecosystem. And what's even more impressive is the number of organizations behind it. DuckDB Labs, which created and is stewarding DuckDB, is of course participating. MotherDuck, which has a lot of experience running DuckDB and hosts DuckDB data, is where this announcement comes from. There's also Hydra, a company that started doing analytical Postgres work a year or two ago, the serverless Postgres company Neon, as well as Microsoft. Now, I did mention that others have integrated DuckDB into their analytics solutions: Crunchy Data has a commercial version, their Crunchy Data Analytics, I think. And like I mentioned, ParadeDB built what they were calling pg_analytics, although in their most recent post back in June they changed the name, I think, to pg_lakehouse, so I don't know if pg_analytics is still a thing or if it's all just pg_lakehouse now. The announcement of this extension is super early because they're still building it, so they have a lot of objectives they want to achieve. Someone from Hydra at DuckCon 5, which happened this week, did show off some of its capabilities, but in the future they want to do things like support Postgres-native storage so that DuckDB can integrate with features such as backup and replication, full type compatibility with Postgres so that all the data types can interoperate, full function compatibility so that Postgres functions work within DuckDB as well, and also semantic compatibility, because they said there are some differences with regard to rounding or precision, and they basically want to make those identical as well. So this is super interesting. I encourage you to check out this blog post as well as the repository if you want to learn more about it.
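To make the idea concrete, here is a rough sketch of what querying through pg_duckdb could look like. The extension was still very early at the time of the announcement, so treat the function name (read_parquet, borrowed from DuckDB's own API), the bucket path and the column list as illustrative assumptions rather than the final interface.

```sql
-- Hypothetical sketch: the exact syntax may differ in the shipped extension.
CREATE EXTENSION pg_duckdb;

-- Read a Parquet file on S3 through DuckDB's engine; a column definition
-- list is supplied because the source has no Postgres catalog entry.
SELECT customer_id, sum(amount) AS total_spend
FROM read_parquet('s3://my-bucket/payments/*.parquet')
     AS (customer_id int, amount numeric)
GROUP BY customer_id
ORDER BY total_spend DESC;
```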
Now, as a follow-on to that, there was a post on crunchydata.com talking about “Postgres Powered by DuckDB: The Modern Data Stack in a Box”. So again, this is with regard to their Crunchy Data Analytics product, and it's a deep-thinking post about where Postgres fits in with regard to analytics. For example, Postgres is very good at online transaction processing, but maybe not so much at online analytical processing, and the post asks how those differences could potentially be rectified. Maybe there's some sort of hybrid system, or you follow the scenario of embedding an analytical engine within Postgres, like they're talking about doing with DuckDB. It even brings into question situations where people are reading Parquet files on S3 for analytical processing: could you get better performance having the data on local NVMe drives? So feel free to review this post if you want to cover some of those topics.
Related to that is another post, “A TimescaleDB analytics trick”. This is from kmoppel.github.io, and he mentions the challenge some people have had with processing analytics in Postgres, as well as the announcement by MotherDuck about the new pg_duckdb extension. But he says, if you want to do analytics with Postgres today, there are a couple of ways you could do that. One, you could do, as he says, careful data modeling. So basically, how can you pre-sort or pre-aggregate data in a way that allows you to answer analytical queries very quickly? Basically, you're pre-processing the data in some way. Another way is using TimescaleDB with its compression capabilities: they offer a column-store solution which allows you to significantly compress the data size, he's saying up to 10x for example. Another solution is using the ZFS file system with compression enabled, which might be able to give you better performance. Although I've said this before, definitely be careful with your replicas, because I have seen cases where replicas struggled to keep up when using compressed ZFS volumes. And lastly he mentions a foreign data wrapper accessing compressed static data. I don't know how performant that would necessarily be, but it is another solution you can use. And he closes out the post giving an example of a smallish server trying to do analytics on a particular data set using pure Postgres and then using TimescaleDB with the column-based store and compression, and it basically shows a tenfold improvement in runtime. But if you want to learn more, definitely check out this blog post.
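As a rough illustration of the TimescaleDB route he describes, the sketch below enables columnar compression on a hypothetical hypertable; the table and column names are made up, and the right compression settings depend on your query patterns.

```sql
-- Hypothetical hypertable of device metrics.
CREATE TABLE metrics (
    ts        timestamptz      NOT NULL,
    device_id int              NOT NULL,
    value     double precision
);
SELECT create_hypertable('metrics', 'ts');

-- Turn on TimescaleDB's columnar compression, segmenting by device so
-- per-device analytical scans read fewer, better-compressed batches.
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby   = 'ts DESC'
);

-- Compress chunks once they are older than seven days.
SELECT add_compression_policy('metrics', INTERVAL '7 days');
```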
Next piece of content: “Postgres as a search engine”. This is from anyblockers.com, and it's a pretty comprehensive post about different ways you can do text search within Postgres. Right now Postgres basically offers three ways to do what I'll call text-based searching, that is, not using a B-tree index: you can do full text search using tsvector, semantic search using pgvector, or fuzzy matching using the pg_trgm extension. What this post does is take all of these methods and put them together in one query, trying to get you the best of all worlds. Full text search gives you very good lexical search, basically identifying words or the roots of words. Semantic search shows you things with similar meanings: pasta is similar to pizza not because the words are similar, but because maybe they both have tomato sauce in them, for example. And then fuzzy matching takes misspellings or a missing letter into account; with a trigram search you can still find those matches. So he has a documents table with a title, a tsvector column for the full text search and a vector column for the semantic search. What he does is a CTE that covers the full text query and the semantic query, without the fuzzy search yet, and then he does a join technique using reciprocal rank fusion, basically a way to take these results and merge them together appropriately to give you the best search results. With that base he then adds the fuzzy search on top, and it's essentially just another part of the CTE that gets combined with the other two. Then he goes into a section about how to debug and understand how things are being ranked and why, as well as tuning different things, like the full text search, to give you the best results. I found this a super interesting blog post for building the most comprehensive search capability possible using just Postgres. Now, given that you are doing three different types of searches at the same time, performance is not mentioned; I imagine it's not super great, so you may have to lean on other solutions. He didn't include any performance benchmarks in the post that I could see. But basically this is a way to do as good a search as you can in Postgres. And if something like this doesn't fit your needs, then you're probably going to have to look at another search solution like Elasticsearch or something else. Or, you know, we mentioned ParadeDB earlier in this episode; they have a Postgres extension that actually embeds a text search engine with similar capabilities to Elasticsearch. So maybe that's another option you would want to explore. But check out this blog post if you want to learn more.
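To give a flavor of the technique, here is a simplified sketch combining just two of the three legs (full text and trigram fuzzy matching) with reciprocal rank fusion. The documents table, its columns and the search term are hypothetical, the pgvector leg is left out for brevity, and the post's own query is more complete.

```sql
-- Requires: CREATE EXTENSION pg_trgm;  (for the % operator and similarity())
WITH full_text AS (
    SELECT id,
           row_number() OVER (
               ORDER BY ts_rank(search_vector,
                                websearch_to_tsquery('english', 'tomato pasta')) DESC
           ) AS rnk
    FROM documents
    WHERE search_vector @@ websearch_to_tsquery('english', 'tomato pasta')
    ORDER BY rnk
    LIMIT 50
),
fuzzy AS (
    SELECT id,
           row_number() OVER (ORDER BY similarity(title, 'tomato pasta') DESC) AS rnk
    FROM documents
    WHERE title % 'tomato pasta'
    ORDER BY rnk
    LIMIT 50
)
SELECT id,
       -- Reciprocal rank fusion: sum of 1 / (k + rank); k = 60 is a common choice.
       coalesce(1.0 / (60 + full_text.rnk), 0)
     + coalesce(1.0 / (60 + fuzzy.rnk), 0) AS score
FROM full_text
FULL OUTER JOIN fuzzy USING (id)
ORDER BY score DESC
LIMIT 20;
```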
Next piece of content: there was another episode of Postgres FM last week. This one was on “Getting started with benchmarking”, and here Michael and Nikolay were joined by Melanie Plageman, a database internals engineer at Microsoft, to talk all about benchmarking, because a lot of her Postgres work is writing patches and then benchmarking them, hopefully to see performance improvements, but definitely to avoid performance regressions. They basically felt the main reason for benchmarking is during development, to make sure you see performance gains, but also, pretty importantly, that you don't make things worse. They also discussed whether benchmarking is useful during upgrades; for example, a particular company may want to know, hey, if we go from Postgres 14 to Postgres 16, is anything going to be worse? Could we do benchmarking to find that out? And I think everyone on the panel agreed that that's really hard to do with something like pgbench, because replicating a production workload is hard when there's always so much going on. What Nikolay volunteered is that one way he's helped people make this assessment is to look at, say, the top hundred queries on a particular system, run them with a production-like data set on the current version to see what the costs and buffers look like, then upgrade the database, ask the same question for those hundred queries, and see whether any regressions have happened. He felt that was a much better way to make that type of assessment. The other way I've seen benchmarking used is for a proof of concept: if you have two or three ideas for how you could structure data for a given feature, you could use pgbench testing to try the different variations and see which performs better. They also talked a lot about the observability tools Melanie uses in her benchmarking work: a lot of the pg_stat tables, because those are basically statistics on what the system is doing, various OS metrics captured at the same time, and even some additional extensions that give you greater insight into the running system. And because she had a major hand in getting the new pg_stat_io view into Postgres, they talked about that as well. But this is a great episode. You can definitely listen to it here or watch the YouTube video down here.
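For reference, both of those observability pieces are easy to poke at yourself. A couple of illustrative queries, assuming pg_stat_statements is installed and you're on Postgres 16 or later for pg_stat_io:

```sql
-- The "top queries" approach: pull the heaviest statements from
-- pg_stat_statements to re-run and compare before and after an upgrade.
SELECT queryid, calls, total_exec_time, mean_exec_time, query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 100;

-- pg_stat_io (Postgres 16+): I/O broken down by backend type, object and context.
SELECT backend_type, object, context, reads, writes, extends, hits, evictions
FROM pg_stat_io
WHERE reads > 0 OR writes > 0
ORDER BY coalesce(reads, 0) + coalesce(writes, 0) DESC;
```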
Next piece of content is “Good Benchmark Engineers and Postgres Benchmark Week”. This is from ardentperf.com, and he definitely saw that episode of Postgres FM, because this blog post is partly a review of it and talks about benchmarking in general from his perspective. So if you want to learn more about that, you can check out this blog post.
Next piece of content: “How Postgres stores data on disk – this one's a page turner”, pun intended. This is from drew.silcock.dev, and he goes through the process of showing you how Postgres actually physically stores your data on disk. Normally when you interact with Postgres, you just say insert this data into a table, and that's primarily a virtual representation. What is actually physically happening on the disk, where are things being stored in the directories and files, and how is the data laid out within a file? This blog post answers some of those questions.
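If you want to poke at this yourself alongside the article, a few quick queries expose the physical side; 'my_table' is a placeholder, and pageinspect is a contrib extension you may need to install.

```sql
-- Path of the table's data file, relative to the data directory
-- (something like base/<database oid>/<relfilenode>).
SELECT pg_relation_filepath('my_table');

-- ctid is each row's physical address: (page number, item number).
SELECT ctid, * FROM my_table LIMIT 5;

-- pageinspect lets you examine a raw 8kB page and its line pointers.
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lp, lp_off, lp_len, t_xmin, t_xmax
FROM heap_page_items(get_raw_page('my_table', 0));
```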
Next piece of content: “CloudNativePG - Connecting External Applications”. This is from dbi-services.com. Again, CloudNativePG is a Kubernetes operator, and Daniel has had many posts recently on setting it up. In this one he shows you how to connect an external application to Postgres running within a Kubernetes cluster. So if you're interested in that, you can check out this blog post.
Next piece of content: “CloudNativePG Recipe 11 - Isolating PostgreSQL Workloads in Kubernetes with Kind”. This is from gabrielebartolini.it, and here he's discussing how to run your Postgres workloads in Kubernetes on their own nodes, preventing application containers from running on those nodes, and further preventing the Postgres containers from landing on the same node as each other. Because if you set up three Postgres instances, you don't want them running on the same node within the cluster; you want them on separate nodes. He talks about all the configuration you can do to ensure that doesn't happen. But if you want to learn more, check out this blog post.
The last piece of content is “PostgreSQL Hacking Workshop - September 2024”. This is from rhaas.blogspot.com. The upcoming Postgres hacking workshop is happening in September, and it's a walkthrough of implementing a simple Postgres patch from sources to CI. He talks about how you can sign up to join, as well as reminders of what's expected when you join and participate. So check this out if you're interested.