23 Or 1.1 Times Faster Performance? | Scaling Postgres 284
In this episode of Scaling Postgres, we discuss whether something is 23 times faster or 1.1 times faster. We also discuss the release of Timescale Vector and whether you can build a queue on Postgres.
Content Discussed
- ClickBench — a Benchmark For Analytical DBMS
- ClickBench on Github
- How We Made PostgreSQL a Better Vector Database
- A Python Library for Using PostgreSQL as a Vector Database in AI Applications
- Choose Postgres queue technology
- Choose Postgres queue technology on HackerNews
- All the ways to capture changes in Postgres
- We had no choice but to build a Postgres proxy
- Evolution of Logical Replication
- Postgres 16 - a quick perf check
- Postgres 16
- Rolling the Dice with the PostgreSQL Random Functions
- Functions to Validate User’s Input
- 4 improvements in Postgres 16 that you might not know about yet
- EXPLAIN (ANALYZE, BUFFERS) and interpreting shared buffers hit in Nested Loops
- Setting Per-User Server Variables
- Using traditional calendars with ICU
- Monitoring Performance for PostgreSQL with Citus
- Understanding Postgres IOPS: Why They Matter Even When Everything Fits in Cache
- My Postgres Story: Internationalization
- My Postgres Story: GIN and FTS
- Postgres 16: The exciting and the unnoticed
- Getting Postgres logs in a GitHub Action
- Chaos testing Kubernetes Operators for Postgres: StackGres
- Exploring Various Ways to Manage Configuration Parameters in PostgreSQL
- psql \watch improvements
- Adding Postgres 16 support to Citus 12.1, plus schema-based sharding improvements
YouTube Video
Podcast Audio
Transcript
So, you know, in last week's episode I showcased a really large performance improvement, but I did say to take it with a grain of salt. Yeah, you should do that. But before we get into that, I hope you, your friends, family and coworkers continue to do well. Before we start, I do want to thank everyone who gave their feedback with regard to my changing of the intro, basically removing the intro or coming up with a faster one. And it seems pretty much universal that it was good to get rid of the intro. So thank you. But the first piece of content is actually looking back at what I covered last week, which was the general availability of Hydra, and specifically at the ClickBench benchmark again. Thanks to a viewer, Guillermo, so shout out to him, who sent me an email saying that seemed like an enormous performance difference, I now know you can go in here and click this PostgreSQL (tuned) button to see what tuned Postgres looks like. So if I uncheck it, this is the result that was presented last week, where Hydra pretty much exceeded the performance of all these other database systems in this benchmark. But once you click PostgreSQL (tuned), Postgres becomes the second fastest. Now, it's not ahead of Hydra, but it's a mere ten, maybe 11% difference in performance, so it's not as huge a performance gain. And looking at the rest of these, I wonder, are these untuned as well? I don't see a tuned option up here. But then the question is, well, what does tuned actually mean? If you go to the ClickBench source code, you can see all they're doing for tuned PostgreSQL is apparently increasing shared_buffers to 8GB, which is something everyone should be doing to match the memory of your system; increasing max_parallel_workers to 16, which is just fine for this type of workload; increasing max_parallel_workers_per_gather to eight, which is fine; and increasing max_wal_size to 32GB. So nothing too extreme. And if we look at the details for each query, you can see them by hovering over it here, and it tells you what was done.
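To make those four settings concrete, here is a minimal sketch of that tuning expressed as ALTER SYSTEM commands; the benchmark may well apply them through postgresql.conf instead, so treat the exact form as an assumption rather than ClickBench's actual script.

```sql
-- A hedged sketch of the ClickBench "tuned" settings described above.
ALTER SYSTEM SET shared_buffers = '8GB';               -- sized to the machine's memory; requires a restart
ALTER SYSTEM SET max_parallel_workers = 16;            -- more parallel workers overall
ALTER SYSTEM SET max_parallel_workers_per_gather = 8;  -- more workers per parallel query
ALTER SYSTEM SET max_wal_size = '32GB';                -- fewer checkpoints during heavy writes
SELECT pg_reload_conf();                               -- pick up the reloadable settings
```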
You can see there are certain queries where Hydra really performed poorly in this first column here versus PostgreSQL (tuned), which is what this one is, whereas there are other queries where Hydra outperforms what PostgreSQL is doing. So it really comes down to, on a case-by-case basis, what the query is asking for and what performs better or worse. But looking at the overall performance of this benchmark, the difference is only maybe around 11%, which is why you always have to take performance gains with a grain of salt. Because I feel like reporting performance is a bit like finding an accountant. Supposedly, when you're looking for an accountant, you should ask them what two plus two equals, and the correct answer is not four, it's "well, what do you want it to equal?" So definitely there are so many ways to alter performance findings for something, and even here, I'm sure there are issues with this benchmark. The truth is you have to do your own proof of concept: use your own data and your own queries and see how different solutions perform to see what works best for you. But again, shout out to Guillermo for contacting me about this.
But the true first piece of content for this week is "How We Made PostgreSQL a Better Vector Database". This is from timescale.com and as a heads up, this is a 24-minute read, so there's no way I could cover everything in this particular blog post appropriately. But I guess this is a new product they're calling Timescale Vector. I don't see that they have released an extension for this, or an open-source extension for this, so I don't know if it's open source or if you can only get it through their cloud-based Timescale product. That wasn't clear to me. But what's interesting is that they say with this solution they've developed a different index type they're calling DiskANN, an approximate nearest neighbor index, that can even outperform pgvector's HNSW indexes. HNSW is Hierarchical Navigable Small Worlds, and this is the index type that saw a huge improvement with the most recent release of pgvector. Well, they're indicating from their tests, which as you know you have to take with a grain of salt, that it's about 30% faster than that one. So this blog post goes over this particular product, and they say it's also been optimized for hybrid time-based vector search queries. And so what is this new type of index? It is focused on being disk-based.
So many of these indexes are built to work essentially like an in-memory database, but this one has been optimized to be used with SSDs, and it was created based upon Microsoft's DiskANN. And they indicate it has different trade-offs and unique advantages compared to HNSW. So again, there may be certain use cases where HNSW wins and certain use cases where DiskANN wins. They also say that the index allows you to do quantization optimization, which is a way of shrinking the actual index size. So as an example, they show that the pgvector HNSW index and the Timescale Vector index were approximately the same size, but with quantization, it drops to, I guess, about 10% of its size. So that definitely helps. They talked a little bit about the benchmarking and the methodology that they used, and how they saw this performance benefit both in a single-threaded mode and in a multi-threaded mode. And lastly, they also talked about index build time. As was mentioned in a previous episode of Scaling Postgres, where we covered the pgvector HNSW index, it took a really long time to build, whereas IVFFlat indexes were quicker. The Timescale Vector index seems to be kind of in the middle, and quantization adds a little more time to that index build as well. And then they cover basically how this new index type works, going over some graphs and describing it.
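For reference, the pgvector side of that comparison, the HNSW index they benchmark against, looks roughly like this. The table name, column, dimensions, and parameter values here are illustrative placeholders, not Timescale's benchmark setup, and the DiskANN index itself is created through their product rather than with this syntax.

```sql
-- A minimal pgvector HNSW sketch (illustrative names and tiny dimensions).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    embedding vector(3)   -- real embeddings are much larger, e.g. 1536 dimensions
);

-- The HNSW index type added in recent pgvector releases; m and ef_construction
-- are its build-time tuning knobs.
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Approximate nearest-neighbor search ordered by cosine distance.
SELECT id
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 10;
```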
Now, related to this post is another from timescale.com where they're talking about "A Python Library for Using PostgreSQL as a Vector Database in AI Applications". So if you're looking for a library to help you work with this, they've released this as well. But if this is of interest to you, definitely check out these two blog posts.
Next piece of content- "Choose Postgres queue technology".This is from adriano.fyi and he is endorsing using the NOTIFY and LISTEN capabilities of Postgres as well as FOR UPDATE SKIP LOCKED to use your queues. And you don't necessarily need to use these other sophisticated queuing products. And that if you're familiar with a given technology, use Postgres for this but this kind of repeats, the know Postgres is good enough to do a lot of these things. Now, he talks a little bit about the background job landscape and the many solutions that I'm familiar with. They use Redis, there's also Apache Kafka Rabbit NQ, and Amazon SQL, which is a service of course. And if you're looking at different libraries to do background jobs, you can see Sidekiq for Ruby, resque for Ruby, RQ using Python, Handgire using C#, and asyncq for Go. I think all of these pretty much use Redis, and he knows Redis is great, but it's another piece of technology you have to integrate into your environment if you're not already using it.
So he's advocating you can just use Postgres again using the LISTEN, NOTIFY, and the FOR UPDATE SKIP LOCKED. But the other thing he mentions is that he's actually built his own background job library in Go that he's calling Neoq. But what he chose to do is not tie it directly to a particular backend, meaning you don't have to use Redis. But he set it up so that, as he says here, you can switch between queues without changing any application code so you could choose to do it in memory. You can choose to do a Postgres implementation or a Redis implementation. So personally, I rather like this approach where you can choose your queuing technology. Personally, for my own application, I haven't started using Sidekick because I didn't want to introduce Redis as another dependency and have to worry about it and do backups and be able to test your stores and all of that. So I just use a Postgres table for most of my queuing, which we'll look at in another post in a second here talking about that, but I thought this was a great post and encourage you to check it out.
Also, this post was mentioned on Hacker News and there were a lot of discussions and a lot of people mentioned just doing simple implementations in Postgres. Of course, there was also the counterpoint and pushback against that. But you can check out this piece of content too if you're interested in learning more.
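Before moving on, here's roughly what the FOR UPDATE SKIP LOCKED pattern the post advocates looks like. This is a minimal sketch with made-up table and channel names, not code from the article.

```sql
-- Illustrative job queue table.
CREATE TABLE jobs (
    id         bigserial PRIMARY KEY,
    payload    jsonb       NOT NULL,
    status     text        NOT NULL DEFAULT 'pending',
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Producer: enqueue work and optionally wake up listeners.
INSERT INTO jobs (payload) VALUES ('{"task": "send_email"}');
NOTIFY jobs_channel, 'new job';

-- Consumer: claim one pending job; rows locked by other workers are skipped
-- instead of waited on, so many workers can pull from the queue safely.
BEGIN;
SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- ...do the work, then mark the claimed row done (using the id returned above)...
UPDATE jobs SET status = 'done' WHERE id = 1;
COMMIT;

-- Workers can also run LISTEN jobs_channel; to avoid tight polling loops.
```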
Next piece of content- "All the ways to capture changes in Postgres". This is from blog.sequin.io, and this is a product that basically allows customers to interact with a local database while it sends those changes to a separate API, specifically the Salesforce API or the HubSpot API. So if you have one of these products, as opposed to integrating directly with the API, you can actually talk to a database in your environment and it will send those changes to your accounts on these services. You can pull data down, insert data, update data, and it translates those database changes into API calls to these services. And this blog specifically describes them exploring different solutions for how to build this. The first one they tried out was Listen/Notify along with FOR UPDATE SKIP LOCKED. But the issue they had with it was that it's at-most-once delivery and a listener only "...receives notifications from that moment forward".
So if your session breaks, you may miss messages that way. They also had some issues with the payload size only being 8,000 bytes. The next solution they looked at was polling the table, so checking for changes in the table, but they couldn't really handle deletes that way. Now, they could get around it by doing an on-delete trigger to update a separate table, but even handling it that way, you're still not going to get good diffs of the changes. So for these reasons, it's not something they went with. They looked into doing replication, so basically doing logical decoding from the WAL stream to get the information that they needed. Now, it looks like this could have worked because they didn't have anything that stopped them in their tracks, but they didn't like the complexity and having to learn a new replication protocol, essentially. The next one they looked at was capturing changes in an audit table. So for any change to the main tables, you just update an audit table with the information, and that way you can get the diffs and whatnot. So this potentially could have worked.
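As an aside, the audit-table approach they describe usually looks something like this minimal sketch; the table names and columns here are illustrative, not taken from their post.

```sql
-- Illustrative audit table capturing old and new row images as JSON.
CREATE TABLE audit_log (
    id         bigserial PRIMARY KEY,
    table_name text        NOT NULL,
    operation  text        NOT NULL,
    old_row    jsonb,
    new_row    jsonb,
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION record_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO audit_log (table_name, operation, new_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO audit_log (table_name, operation, old_row, new_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD), to_jsonb(NEW));
    ELSE  -- DELETE
        INSERT INTO audit_log (table_name, operation, old_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

-- "accounts" is a placeholder for whatever table you want to track.
CREATE TABLE accounts (id bigserial PRIMARY KEY, name text);

CREATE TRIGGER capture_changes
AFTER INSERT OR UPDATE OR DELETE ON accounts
FOR EACH ROW EXECUTE FUNCTION record_change();
```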
They also explored foreign data wrappers, but again, this was something that did not fulfill all of their requirements. But this blog post is a great summary of all the different change data capture options available in Postgres. Now, they have a second blog post that they link to here called "We had no choice but to build a Postgres proxy", and this basically fulfilled all the requirements they were looking for. So, much like PgBouncer is a proxy that can essentially speak Postgres, they built their own solution that can speak Postgres and is essentially wire-compatible with the Postgres protocol. Now, as they go into this, they explain how things were working when they started. At that point, they were essentially using change data capture, which was what the previous blog post was about, and that captures things after the fact. So essentially, when they write things to the API or make changes through the API, they're sending them in an asynchronous fashion, and they questioned whether that was good enough. It can basically create a whole host of problems, because how do you roll back something that's already committed in the database?
For example, if the API has issues doing an update, there's no clean way to undo that commit. So they decided they wanted to do something synchronous. But what would that look like? They had a few different requirements. They needed to support insert, update and delete. They wanted to support RETURNING clauses for their inserts. They wanted a commit to translate to a single API request. And errors must surface during the commit, so if the API operations fail, the client should receive a Postgres error. So the first thing they looked at was synchronous replication between two Postgres database systems, where a change is only committed if it's committed on both systems. So they said, well, could we set up an application that acts like Postgres?
One that's just doing synchronous replication with the database they're updating data in. But they had some issues with this with regard to supporting the RETURNING clause, and in a failure case it would require them to really figure out how Postgres emits errors for some of these things. The next option they looked at was using foreign data wrappers, but they would have to write their own foreign data wrapper, and because a lot of customers use hosted Postgres, you can't just install any foreign data wrapper there, so that really wouldn't work for them. So they decided on the proxy option: something that sits in the middle, receives the change, and sends the API request; if it's successful, then it updates the database. So essentially it had to learn to speak the PostgreSQL wire protocol. And in the end, that's the solution they were looking for. So again, I found these last two blog posts really fascinating in terms of figuring out how to build this solution with Postgres, and I encourage you to check them out.
Next piece of content- "Evolution of Logical Replication". This is from amitkapila16.blogspot.com. This is another great post talking about how Logical Replication has evolved over the years, as well as talking about the changes in Postgres 16. And just as a quick review, it's being able to prevent loops and bi-directional replication. So this is supporting this Active Active logical replication that's possible now. Allowed logical decoding to be performed from the standby server. So that's huge. Perform operations with table owners' permissions. So as opposed to using the privileges of whoever created the subscription, you can actually make it the table owner who does the operations on a table.
Next, nonsuper users can create subscriptions. So this is great for people using Postgres as a hosted service. You can create subscriptions as nonsuper users. Large transactions can be applied in parallel, so you can enable this with streaming equal parallel and they have seen performance gains of 25% to 40% with this. Next is logical replication. You can copy tables in binary format, but if you use it, you need to have both the publisher and subscriber beyond version 16 or higher. And then lastly, indexes other than the primary key or the replica identity can be used on the subscriber so you can actually define those. But another good post talking about all the different features in Logical Replication.
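To make those options concrete, a subscription using the Postgres 16 features mentioned above might look roughly like this; the connection string, publication, and names are placeholders, and which options you actually want depends on your setup.

```sql
-- A hedged sketch of a Postgres 16 subscription using the new options.
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=primary dbname=app user=replicator'
    PUBLICATION my_pub
    WITH (
        streaming = parallel,  -- apply large in-progress transactions with parallel workers
        binary = true,         -- copy tables in binary format (publisher and subscriber on v16+)
        origin = none          -- only replicate locally originated changes, which prevents
                               -- loops in bi-directional (active-active) setups
    );
```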
Next piece of content- "Postgres 16- a quick perf check". This is from kmoppel.github.io and he typically does a quick performance review of new releases of Postgres. And he did it as well for 16. And the results were, he said, not really interesting because maybe the performance was a 1.2% difference. I mean, some things were higher, some things were a little lower, but on the whole not much of a difference. But you can check this blog post out if you want to learn more.
Next piece of content- there was another episode of Postgres FM last week. This one was all about "Postgres 16". They discussed the most important new features, what they mean for users, and whether and when you should upgrade. You can listen to the podcast or watch the YouTube video down here.
Next piece of content- "Rolling the Dice with PostgreSQL Random Functions". This is from crunchydata.com. They're talking about a random function that returns essentially a float data type between zero and 1.0. So you can use this along with a generate_series function here, to create several random numbers. And you can do things to create numbers in a given range by multiplying or adding different numbers to it or even subtracting. And you can even produce random words as long as you provide it with some sort of a dictionary of words that it can pull from based upon, say, an index array, which he's doing here. As well as even ordering by different random values. He also talks about a new feature in Postgres 16 which is random_normal. So you can actually get a normalized distribution of data that is random. So he shows an example of doing that here with a random normal function. And you can see the actual distribution of the data here, that it follows a normal distribution. So if you want to learn more about this, definitely check out this blog post.
Next piece of content- "Functions to Validate User's Input". This is from fluca1978.github.io. He's talking about two new functions in Postgres 16. One is pg_input_is_valid. So it helps you validate timestamps to make sure that they're valid or not. And it returns true or false. And if something is false, you can actually get the input error info using the pg_input_error_info function. So you just give it the same input parameters and it tells you essentially what is wrong. These are great if you're building functions or procedures. So check this out if you want to learn more.
Next piece of content- "4 improvements in Postgres 16 that you might not know about yet". This is from pganalyze.com and this is Lukas's "Five minutes of Postgres", so you can check out this content to learn more about the different content covered. Lukas also produced "EXPLAIN (ANALYZE, BUFFERS) and interpreting shared buffers hit in nested Loops". This is from last week, and he covers how you can actually get a really high hit count for shared buffers when doing EXPLAIN (ANALYZE, BUFFERS) when a nested loop join is used. Because however many times that nested loop runs, it's going to hit that memory multiple times. So it doesn't necessarily result in more disk access, but more memory is set multiple times. So check out this piece of content if you want to learn more about that.
Next piece of content- "Setting Per-User Server Variables". This is from momjian.us and he shows three different ways that you can set parameters for users. The first way is that you can set it on login by using the Alter Role Set for a given user role. The second way is actually creating a SECURITY DEFINER function, and this is where you can actually explicitly say what value someone can set a particular parameter to. So he shows how to do that in this function here. Third, as of Postgres 15, you can GRANT SET ON PARAMETER to a user. So you can allow them to change the log statement parameter, for example, in this case. But check out this quick post if you want to learn more.
The last piece of content is "Using traditional calendars with ICU". This is from postgresql.verite.pro. He's talking about how Postgres uses the standard Gregorian calendar. However, the ICU library that's available with the icu_ext extension allows you to work with many more traditional calendars, such as the Buddhist calendar, Chinese, Coptic, et cetera. So if you're interested in learning more about how to do that, definitely check out this blog post.