Postgres Graph Queries | Scaling Postgres 327
Join Over 1,000 Engineers & Get New Episodes Weekly!
In this episode of Scaling Postgres, we discuss using Postgres for graph queries, the fastest way to copy data from one table to another, dealing with linux memory overcommit and compression.
Content Discussed
- Graph Retrieval using Postgres Recursive CTEs
- The fastest way to copy data between Postgres tables
- What you should know about Linux memory overcommit in PostgreSQL
- Compression
- PostgreSQL Hacking Workshop - August 2024
- Designing a Prototype: Postgres Plan Freezing
- ClickHouse acquires PeerDB for native Postgres CDC integration
- Splitting and Merging Partitions in PostgreSQL 17
- Use Anthropic Claude Sonnet 3.5 in PostgreSQL With Pgai
- Building a RAG application with Llama 3.1 and pgvector
- How to Build Applications With Asyncpg and PostgreSQL
- PostgreSQL Connection Poolers
- Enhancing PostgreSQL Performance Monitoring: A Comprehensive Guide to pg_stat_statements
- Ensuring Safe Data Modifications in PostgreSQL – Part 2
- Master PostgreSQL in Kubernetes with CloudNativePG at the PostgreSQL European Conference
- How to keep a community alive
- Delightful, production-grade replication for Postgres
- Introducing pause and rename for your Tembo Cloud instances
YouTube Video
Podcast Audio
Transcript
Things are definitely slowing down a little bit. There's not as much Postgres content as there normally is, probably again because of the summer and people taking vacations and whatnot. But we do have some blog posts to cover this week, and I hope you, your friends, family, and coworkers continue to do well.
Our first piece of content is “Graph Retrieval using Postgres Recursive CTEs”. This is from sheshbabu.com, and it's a relatively short post, but it gives you insight into exactly what the title says: taking a graph, storing it in Postgres, and then retrieving data from it, looking at connections using recursive CTEs. So here's the graphical representation, and he expressed it as a set of relationships in an array: A is related to B, A is related to D, etcetera, with each relationship being one element of the array. But how he implements it in Postgres is he just has an edges table with two text columns, u and v. I don't know if u and v have particular meanings in the graph database world, but he uses the columns u and v, and each row is just one relationship. So he inserts all of the relationships into the table. And if you want to find all nodes that are connected to A, you do a recursive query like this: you use WITH RECURSIVE, give it a name, and say AS, and then you do a UNION query. The first part of the UNION is what gets executed first: select u and v from edges where either u or v equals 'A'. Then the second part of the UNION uses this result: it selects u and v from edges again, joining back to the working result set and looking for matches between u and v in either direction, so essentially four comparisons, as you can see here. Then you just SELECT * FROM the CTE, and it outputs all the different relationships. He even has an interesting diagram that shows how it works visually: first it selects A, because that's the non-recursive part, then it does the next set, and the next set, and the next set. So it does four iterations to walk the whole graph.
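A minimal sketch of the approach described above (the table layout follows the post; the sample edge values are illustrative, so the exact data may differ from his):

```sql
-- Each row is an undirected edge between two nodes
CREATE TABLE edges (u text, v text);

-- Sample relationships
INSERT INTO edges VALUES ('A', 'B'), ('A', 'D'), ('B', 'C');

-- Find every edge connected, directly or transitively, to node 'A'
WITH RECURSIVE cc AS (
    -- Non-recursive part: edges touching 'A' directly
    SELECT u, v FROM edges
    WHERE u = 'A' OR v = 'A'
  UNION
    -- Recursive part: edges touching anything already found.
    -- Four comparisons: each of e.u/e.v against each of cc.u/cc.v.
    SELECT e.u, e.v
    FROM edges e
    JOIN cc ON e.u = cc.u OR e.u = cc.v
            OR e.v = cc.u OR e.v = cc.v
)
SELECT * FROM cc;
```

Note that it's UNION rather than UNION ALL: the deduplication is what makes the recursion terminate once no new edges are found.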
Now, he doesn't mention indexes here, and when you see all these OR conditions, you can imagine it's potentially going to have some efficiency issues. But he says he's used this with millions of connected components and so far it's been working great. Still, I imagine if you wanted to implement something like this, you would probably put some sort of limit in place so that you only do, say, five or six scans of the graph, to cap how many iterations you have to do. You can also add additional columns to this table to narrow the search: maybe you're only looking at this graph for a particular customer or a particular account, or something of that nature. That might help alleviate some performance issues you could run into. But if you want to learn more, feel free to check out this blog post.
Next piece of content: “The fastest way to copy data between Postgres tables”. This is from ongres.com, and I'm going to go to the very bottom of the blog post because it has the nice resulting graph, so I'll talk about it from that perspective. They tried several different ways to speed up copying one table to another within a Postgres database. The first thing he tried was INSERT INTO SELECT (that's what IIS stands for). So basically you have a table with data in it and another table you want to transfer it to, and you just insert into the destination, selecting from the origin table. I think it was about 20 million rows or so, and it took 12.8 seconds. Then he said, okay, let's do INSERT INTO SELECT, but into an UNLOGGED table, and after that is done, set it to be logged. As you can see in the graph, that took 14.2 seconds, so it was slower to insert into an unlogged table and then set it to logged. You really don't gain anything, and in fact you lose something. Just inserting into an unlogged table, without setting it to logged, was around 6 seconds, so if you don't need to persist that data, that's one option. The next one he looked at was CREATE TABLE AS, and that ran in 11.9 seconds, so that was actually what I'd consider the fastest of all of these. We'll talk about the one you'll see further down, but this one actually seemed to be the fastest.
Next he looked at COPY: basically he used the COPY command to copy the data out to a pipe and then COPY it back into Postgres again. That overhead resulted in almost 21 seconds, so that was definitely slower. He also looked at pg_bulkload, which has a direct mode that I think bypasses the WAL, so essentially that's the same thing as an unlogged table. Even though it shows as the fastest on the graph, the fact that it's unlogged means I wouldn't consider it comparable to the other methods; if you actually had to persist the data, it would probably land in the same 12-to-14-second range as the rest. pg_bulkload in buffered mode was again about 12.9 seconds. And lastly he tried PG_File_Dump, which took over 30 seconds. So really, to me, the fastest one here is CREATE TABLE AS (CTAS) if you can use it, or INSERT INTO SELECT is fine too. I mean, you're only losing about a second, so maybe 5 or 6% slower. But if you want to learn more, check out this blog post.
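The main in-database methods from the benchmark look roughly like this (the src/dst table names are illustrative, not from the post):

```sql
-- INSERT INTO SELECT (IIS): destination table already exists
INSERT INTO dst SELECT * FROM src;

-- IIS into an unlogged table, then make it durable afterwards.
-- SET LOGGED has to WAL-log the whole table, which is why this
-- variant came out slower overall in the benchmark.
CREATE UNLOGGED TABLE dst_unlogged (LIKE src INCLUDING ALL);
INSERT INTO dst_unlogged SELECT * FROM src;
ALTER TABLE dst_unlogged SET LOGGED;

-- CREATE TABLE AS (CTAS): creates and populates in one statement,
-- the fastest fully-logged method in their results
CREATE TABLE dst_ctas AS SELECT * FROM src;
```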
Next piece of content: “What you should know about Linux memory overcommit in PostgreSQL”. This is from cybertec-postgresql.com, and he's talking about configuring the operating system where you're running Postgres. What Linux actually does is overcommit the memory that's available: it hands out more memory than physically exists, in the hope that processes won't use everything they asked for. As a consequence, if the system starts running out of memory, it invokes something called the out-of-memory (OOM) killer, which starts eliminating processes. Why this is a problem for Postgres, he mentions, is that because Postgres uses a shared memory model, if any of those processes gets killed, the shared memory could be corrupted, so Postgres does a full restart and crash recovery to bring itself back to a fully operational state. So basically you don't want Linux to overcommit memory when you're running Postgres on it. The way to disable it is to set vm.overcommit_memory = 2 with sysctl. But when you set this, there's another setting you should also change, vm.overcommit_ratio, because by default it's set to 50, which means the system will only commit approximately 50% of the available memory, and clearly you want it to use more than that.
He has a formula you can use to set this, taking into account swap, huge pages, the huge page size, and things of that nature. You can also use vm.overcommit_kbytes instead, which you'd basically set to your available RAM minus swap. Either value, overcommit_kbytes or overcommit_ratio, configures how much memory your Linux system is allowed to commit, so if you disable overcommit, definitely change these settings so that you're utilizing all of your memory. Now, even though this protects your system from a process being killed inadvertently and the whole instance going down, you can still get out-of-memory errors for individual processes. He says the way to avoid that is to set shared_buffers and work_mem appropriately, and he has a formula for roughly what those should be. There have also been previous blog posts about setting work_mem, and even using logging to help track and tune what it should be set to. Lastly, he mentions that when you're running Postgres in a containerized environment, be aware that, “If you disable memory overcommit in the kernel, that will affect all containers on the host machine”. So he actually advocates dedicating a host machine to Postgres and not running other types of application containers on it because of this memory overcommit setting. But if you want to learn more, definitely check out this blog post.
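As a sketch, the settings look like this in sysctl configuration (the ratio value here is illustrative only; derive your actual number from the formula in the post, based on your RAM, swap, and huge pages):

```ini
# /etc/sysctl.conf on a dedicated Postgres host
# Disable memory overcommit entirely
vm.overcommit_memory = 2
# Illustrative value: allow ~90% of RAM plus swap to be committed;
# compute the real ratio from the post's formula for your machine
vm.overcommit_ratio = 90
```

Apply the changes with `sysctl -p` (or reboot); alternatively, set vm.overcommit_kbytes to an absolute amount instead of using the ratio.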
Next piece of content: there was another episode of Postgres FM last week. This one was all about “Compression”, and Nikolay and Michael talked about how to handle it with native Postgres features and some other options that are available. The reason you use compression is, number one, it reduces your storage space, but it can also give you better performance in a lot of cases, because you're dealing with smaller data files that take up less room in memory, etcetera. The thing to be aware of is that compression is also a tax of sorts: it takes CPU power to compress and decompress things, so you need to make sure you have those resources available if you're going to use compression in your particular use case. They talked about WAL compression, a feature I definitely like turning on on my systems to minimize the amount of WAL being produced, and Michael mentioned that in Postgres 15 more options became available for how it's compressed. So there is the default pglz, but they also offer two other options, LZ4 and Zstandard. In my experience, Zstandard can give you really high compression, whereas LZ4 is less of a burden on the CPU but doesn't compress quite as much.
They talked about the native compression available in TOAST: when a column gets toasted, it can be compressed as well, and you can configure that to use the default pglz compression or LZ4. They did mention there aren't a lot of options for compressing the actual data or indexes in Postgres; if you wanted to do that, you'd probably have to rely on a file system that does compression, like ZFS, so running Postgres on ZFS. But definitely be cautious and test it, because I have seen a case where a replica was struggling to keep up with a primary due to the volume of changes and the extra burden that compression added, so that's just something to be cautious of. They also talked about other areas that can do compression, either extensions or other types of products, like column stores. For example, TimescaleDB has columnar storage, and there are other vendors and extensions that do column storage, which generally compresses things down well. You can think of a column as an array of values, and when you're compressing a single data type, you can get really high compression ratios: if everything's an integer, or everything's a particular type of text, you can compress it much more than row storage, where a given row has all sorts of different data types. And lastly, they talked about backup compression. When you take backups with things like pg_dump and pg_restore, you can compress them, and those compression options are available there as well. I was super thankful when they added Zstandard as an option for pg_dump, because I found it to perform really well, both in speed and in compression ratio. But if you want to learn more about compression, definitely listen to their episode here or watch the YouTube video down here.
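A quick sketch of turning these knobs on (the documents/body table and column are hypothetical examples, not from the episode):

```sql
-- WAL compression: 'zstd' and 'lz4' require Postgres 15+;
-- plain 'on' uses the default pglz
ALTER SYSTEM SET wal_compression = 'zstd';
SELECT pg_reload_conf();

-- TOAST compression for a specific column (pglz is the default; lz4
-- is the lighter-on-CPU alternative)
ALTER TABLE documents ALTER COLUMN body SET COMPRESSION lz4;

-- Or change the default for newly created columns
ALTER SYSTEM SET default_toast_compression = 'lz4';
```

Existing toasted values are not rewritten when you change a column's compression; only newly stored values use the new method.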
Next piece of content: “PostgreSQL Hacking Workshop - August 2024”. This is from rhaas.blogspot.com, and this is an update that they are having their first hacking workshop in August. It looks like you had to sign up by August 1, so my apologies, as this episode probably comes out after that date. The first talk is about the PostgreSQL optimizer methodology, and the next talk, in September, is a walkthrough of implementing a simple Postgres patch from sources to CI. Again, if you want to be a part of the communication flow, you should get on the PostgreSQL Mentoring Discord. There are some hoops you have to jump through to do that: you actually have to run a query to find the link to join the Discord. But if you're interested in hacking on Postgres, definitely check out this blog post.
Next piece of content: “Designing a Prototype: Postgres Plan Freezing”. This is from danolevo.substack.com, and he's developing a prototype for a way to freeze plans. Postgres backends, or server processes, can store query plans for reuse by prepared statements or extended-protocol queries, so you save on planning time: you can just re-execute the stored plan with new parameters. He's developing a prototype to see whether, once these plans are built, they can be frozen for a particular period of time. One disadvantage today is that each backend must go through the process of building and retaining its own plan, so he's exploring whether a plan can be frozen and shared by all backends as part of this prototype. This is a very long and pretty technical discussion of the prototype he's looking at, so if you want to learn more, definitely check out this blog post.
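To make the per-backend plan caching he's contrasting against concrete, here's a small sketch using standard prepared statements (the users table is hypothetical):

```sql
-- A hypothetical table for illustration
CREATE TABLE users (id int PRIMARY KEY, name text);

-- PREPARE stores the parsed query, and eventually a cached plan,
-- in the current backend only
PREPARE get_user (int) AS
  SELECT name FROM users WHERE id = $1;

-- Re-executing skips parse/analyze; after repeated executions,
-- Postgres may switch to a cached generic plan and skip planning too
EXECUTE get_user(42);
EXECUTE get_user(43);
```

The point of the prototype is that this cached plan lives only in the one backend that prepared it; every other backend has to repeat the work, which is what freezing and sharing plans would avoid.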
Next piece of content: “ClickHouse acquires PeerDB for native Postgres CDC integration”. This is from blog.peerdb.io, and exactly as it says, ClickHouse, which predominantly does data warehousing and analytics, has acquired PeerDB, which predominantly does change data capture (that's what CDC stands for) for Postgres. This seems pretty significant; we'll have to see what becomes of it, but if you want to learn more, definitely check out this blog post.
Next piece of content: “Splitting and Merging Partitions in PostgreSQL 17”. This is from pgdash.io, and they're talking about the new Postgres 17 feature where, with range partitions, you can now split them apart and merge them together. We discussed this in previous episodes of Scaling Postgres, but if you missed those blog posts, you can check out this one on how to split and merge partitions with range partitioning.
Next piece of content: “Use Anthropic Claude Sonnet 3.5 in PostgreSQL With Pgai”. This is from timescale.com, and it looks like Timescale has added a new LLM to their pgai extension. This basically lets you call external LLM APIs from within your Postgres AI solutions, and now they've added Claude Sonnet 3.5 as an option. So check this out if you want to learn more about that.
And the last piece of content: “Building a RAG application with Llama 3.1 and pgvector”. This is from neon.tech, and I know we've covered a number of these so far, but this is another way to build a RAG application using the open-source Llama 3.1 model. So check this out if you're interested.