pgbouncer Released, Query Parameter Types, Lateral Joins, Security Invoker Views | Scaling Postgres 208
In this episode of Scaling Postgres, we discuss a new version of pgbouncer, query parameter types, a performance improvement from a lateral join and security invoker views.
Content Discussed
- PgBouncer 1.17.0
- Query parameter data types and performance
- Waiting for PostgreSQL 15 – Add support for security invoker views.
- 5mins of Postgres E10: max_wal_size, Postgres full page writes and UUID vs BIGINT primary keys
- Installing Pgpool-II on Debian/Ubuntu
- Michael Christofides
- Rubber Duck Dev Show Episode 35 | Organizing Your Tests
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. We don't have a lot of content this week, so I'm doing something slightly different and including what is basically a video blog post covering lateral joins. But there's still a handful of posts this week. Our first piece of content is that "PgBouncer 1.17.0" has been released. This is from pgbouncer.org, and it includes some new features as well as fixes. Database definitions can now be added as a list, with those databases contacted in round-robin fashion to handle statements, which is interesting. They also increased the maximum password length and fixed support for OpenSSL 3. You can definitely click the link here to review the full changelog.
The next piece of content- "QUERY PARAMETER DATA TYPES AND PERFORMANCE". This is from cybertec-postgresql.com and they're talking about parameters that you can set up for your SQL statements. Now, normally you do this within a language and basically, you parameterize a section of the where statement typically. So if you're looking at account one, you'll put ID one. If you're looking at count two you put ID two. Now the benefit of it is it does advanced planning of this, so it just has to execute it when it comes in. So that can result in a speed boost as well as security reasons because that helps avoid SQL injection when you separate your parameters from the actual query statement. Now for the frontend/backend protocols they're talking about here, it actually sends a prepare message, a bind message and an execute message.
But you can easily do this in two statements within psql by doing a PREPARE, giving it a statement name, the data types you expect to pass in, and the parameterized query with $1, $2, $3, et cetera for the parameters. Then you just run EXECUTE with the name you declared, pass in the value, and it will execute your statement. They also show an implementation using Java, where you prepare the statement, including the parameters, set what each value should be, and then execute it.
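Here's a minimal sketch of that two-statement psql approach; the table and column names are just placeholders I've chosen, not taken from the post:

```sql
-- Prepare a named statement, declaring the parameter type for $1
PREPARE find_account (bigint) AS
    SELECT * FROM accounts WHERE id = $1;

-- Execute it by name, passing the value for $1
EXECUTE find_account(1);

-- Remove the prepared statement when you're done with it
DEALLOCATE find_account;
```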
Now, why this can become important for performance is data types. For example, you'll notice the data type is specified as a long in Java. Well, if that doesn't match what the index expects, you're not going to get an index scan; he shows an example where a bigint is considered different from a numeric type. So in this example, the column is a bigint, and you can pass in an integer: it converts the integer to a bigint just fine and does the index scan. There's also an unknown type, so you can leave the parameter type as unknown and the system will try to infer it. Using that method, it's able to identify that, yes, this number is some type of integer, so it does the index scan. But when you specify the parameter as numeric, and in Java they're using a BigDecimal type, it interprets those as not matching and converts both sides to numeric, which there's no index for. So you get a sequential scan instead of an index scan, which can definitely result in poor performance. Their recommendation is to either set the data type appropriately or leave it unknown and let the system infer it; it usually does a pretty good job of that. But if you want to learn more, you can definitely check out this blog post.
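A rough psql sketch of the type behavior described above, assuming an accounts table with a bigint primary key; the plans noted in the comments are what you'd typically expect, but they will vary with your data:

```sql
-- Declared bigint: matches the column type, so the index can be used
PREPARE by_id_bigint (bigint) AS
    SELECT * FROM accounts WHERE id = $1;
EXPLAIN EXECUTE by_id_bigint(42);    -- expect an index scan

-- Declared numeric: the comparison happens as numeric, defeating the bigint index
PREPARE by_id_numeric (numeric) AS
    SELECT * FROM accounts WHERE id = $1;
EXPLAIN EXECUTE by_id_numeric(42);   -- expect a sequential scan

-- No declared type: Postgres infers the parameter type from the column
PREPARE by_id_inferred AS
    SELECT * FROM accounts WHERE id = $1;
EXPLAIN EXECUTE by_id_inferred(42);  -- expect an index scan
```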
The next piece of content is actually something I have done myself, using a lateral join to improve performance, so you can think of this as a video blog post. In the system I was working with, there is a set of gifts, and payments can be applied to those gifts; that is, a gift can be paid off over multiple payments. So there's a gifts table with an amount and a month, stored as a simple integer to keep this example simple. Then there's a payments table that has the gift ID as a foreign key, along with the amount of the payment and the month the payment came in. I inserted five rows into the gifts table: a $1,000 gift, a $2,000 gift, and $3,000, $4,000 and $5,000 gifts, all in the same month. Then I inserted a number of payments, five payments per gift. So the $1,000 gift had five $100 payments, the $2,000 gift had five $200 payments, et cetera, and they were made over different months.
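Something like the following schema and sample data roughly matches what's described; the exact column names and payment amounts are my own guesses for illustration:

```sql
CREATE TABLE gifts (
    id     bigserial PRIMARY KEY,
    amount numeric NOT NULL,
    month  integer NOT NULL
);

CREATE TABLE payments (
    id      bigserial PRIMARY KEY,
    gift_id bigint  NOT NULL REFERENCES gifts (id),
    amount  numeric NOT NULL,
    month   integer NOT NULL
);

-- Five gifts in month 1: $1,000 through $5,000
INSERT INTO gifts (amount, month)
SELECT g * 1000, 1 FROM generate_series(1, 5) AS g;

-- Five payments per gift, spread over months 1 through 5
-- (each payment is a tenth of the gift amount, matching the five
-- $100 payments on the $1,000 gift described above)
INSERT INTO payments (gift_id, amount, month)
SELECT g.id, g.amount / 10, p
FROM gifts AS g, generate_series(1, 5) AS p;
```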
Now, the report that needs to be generated looks at all gifts by month, adding up the total of those gifts as well as the payments that have been made against them. A simple implementation, where you just take the gifts table, do a left outer join to the payments, group by the gift month and sum the gift amount, actually won't work. It's going to double count your gift amount, because the left outer join to the payments duplicates the gift rows. You also can't do something like a simple DISTINCT on the amount. It would work in this case, but not when you're talking about thousands and thousands of gifts in a system with similar amounts, so that's not going to work either.
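To make that concrete, the naive version looks roughly like this with the guessed schema above; the join fans out each gift row once per payment, so the gift total comes out five times too high:

```sql
-- Double counts: each gift row is repeated for every matching payment
SELECT g.month,
       sum(g.amount) AS gift_total,
       sum(p.amount) AS payment_total
FROM gifts AS g
LEFT OUTER JOIN payments AS p ON p.gift_id = g.id
GROUP BY g.month;
```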
The solution I had originally come up with, which gave the correct answer but ultimately ended up being slower, was to use a subquery in place of the payments table. It queries the payments table, groups the rows by the gift ID, and sums the amount, so you get one row from the payments side that can be matched to each row in the gifts table. When you do it this way, you don't duplicate the gift rows and you get the correct amounts. But this started getting slow. The main reason it got slow in the actual production implementation is that there were hundreds of thousands of payments. Even if you're selecting by a particular account, you can't really narrow the payments by a date range, because payments come in at any time and you want all payments for the gifts. So calculating that value and then joining it to the gifts table started getting really slow.
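A sketch of that derived-table approach, again with the guessed schema: the payments are aggregated down to one row per gift before the join, which keeps the totals correct but means the whole payments table gets aggregated up front:

```sql
SELECT g.month,
       sum(g.amount)        AS gift_total,
       sum(p.payment_total) AS payment_total
FROM gifts AS g
LEFT OUTER JOIN (
    -- Collapse payments to one summarized row per gift
    SELECT gift_id, sum(amount) AS payment_total
    FROM payments
    GROUP BY gift_id
) AS p ON p.gift_id = g.id
GROUP BY g.month;
```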
The solution I eventually came up with is a lateral join. As you can tell, the query is very similar: I'm doing a lateral join to something close to the same subquery, so it's not significantly different. But what this does is essentially run a for loop over every gift row being selected, going in and adding up the payment rows for each gift row analyzed. If we look at the subquery plan, it basically runs a query on the payments table, runs a query on the gifts table, and then does a hash join between them on the hash condition to match everything up and give you the results. In the lateral plan, it queries the gifts table, but then it queries the payments table once for each gift row. That's why in this plan you can see it doing loops of 5. If I had 10 gifts, it would do loops of 10; if I had 100 gifts, loops of 100, to pull all the payments for each particular gift ID.
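Here's a sketch of that lateral version: because the subquery references g.id, Postgres re-runs it once per gift row instead of aggregating the whole payments table first:

```sql
SELECT g.month,
       sum(g.amount)        AS gift_total,
       sum(p.payment_total) AS payment_total
FROM gifts AS g
LEFT OUTER JOIN LATERAL (
    -- Executed once per gift row, summing only that gift's payments
    SELECT sum(amount) AS payment_total
    FROM payments
    WHERE gift_id = g.id
) AS p ON true
GROUP BY g.month;
```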
Now, ultimately this ended up being faster: the lateral version of the query gave me 10x better performance in my implementation. That's not always going to be the case, and sometimes the difference may be even more significant; it depends on how your data is structured, how many rows you have to loop through and how narrowly you can define the subquery. But this lateral join was an alternative that worked for me. Let me know if you have a different idea for how to make this performance even better.
The next piece of content- "Waiting for PostgreSQL 15- Add support for security invoker views". This is from depesz.com. So this is a patch and in terms of functions, they talk about security definer functions where you can basically determine if the security is going to be set by the person who's running the function or the person who created the function. But this actually applies to views and in version 15, it looks like they added a feature where you can choose to run the view as the person who created the view, which is how it works today, or you can do it by the person who actually invoked the view.
So for example, with a standard, old-style view, the data is accessed as the person who created the view. You don't need to give table permissions to the querying user, and you're able to query the view even if you're not the person who created it. But you can now set security_invoker=true for the view, and then it checks the table permissions of the person who is selecting from that view; in this case, it says permission denied for the source data table. So this is a pretty interesting addition and you can check out this blog post if you want to learn more.
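As a rough sketch of the new option; the view and table names here are just for illustration, not necessarily the ones from the post:

```sql
-- Default behavior: the view accesses source_data with the permissions
-- of the view's owner
CREATE VIEW old_style_view AS
    SELECT * FROM source_data;

-- New in PostgreSQL 15: permissions are checked against the user
-- selecting from the view
CREATE VIEW invoker_view WITH (security_invoker = true) AS
    SELECT * FROM source_data;

-- A user who can select from the views but has no grant on source_data
-- can query old_style_view, but invoker_view will fail with
-- "permission denied for table source_data"
```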
Next piece of content- "5mins of Postgres E10: max_wal_size, Postgres full page writes and UUID vs BIGINT primary keys". This is from pganalyze.com, and they covered the post we discussed last week on Scaling Postgres about max_wal_size. They go into a little more depth about what happens after the checkpoint process, how full-page writes end up in the WAL, and how this can impact performance, particularly if you're using UUIDs. The reason is that, with full-page writes enabled, the first time any part of a page is changed after a checkpoint, the whole page is written to the WAL.
Now, the issue with UUIDs is that they're inserted randomly across the pages, so you're going to get a lot more WAL writing with UUIDs, whereas an incrementing integer is written to the same page and then the next page, et cetera. So you get less WAL volume with bigints, and the post he was referencing showed a 20 times difference in data size between bigserial incremental IDs and random UUIDs. But if you want to learn more about that, definitely check out this episode.
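If you want to see the effect yourself, a simple comparison along these lines would do it; the table names are my own, and gen_random_uuid() is built in as of PostgreSQL 13:

```sql
-- Monotonically increasing keys: new index entries land on the same
-- rightmost page, so relatively few pages are dirtied between checkpoints
CREATE TABLE events_bigint (
    id      bigserial PRIMARY KEY,
    payload text
);

-- Random UUID keys: new index entries scatter across the whole index,
-- dirtying many pages and generating far more full-page writes
CREATE TABLE events_uuid (
    id      uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    payload text
);

-- Load the same number of rows into each and compare the WAL generated
INSERT INTO events_bigint (payload)
SELECT 'x' FROM generate_series(1, 100000);

INSERT INTO events_uuid (payload)
SELECT 'x' FROM generate_series(1, 100000);
```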
The next piece of content- "Installing Pgpool-ll on Debian/Ubuntu". This is from b-peng.blogspot.com and now it appears that the official Postgres repository for Debian/Ubuntu, the app repository now has Pgpool II. So this blog post runs through how you can get it installed using this new app repository.
The next piece of content: the PostgreSQL person of the week is Michael Christofides. If you're interested in learning more about Michael and his contributions to Postgres, definitely check out this blog post.
The last piece of content: we had another episode of The Rubber Duck Dev Show this past Wednesday. This episode was on how to best organize the tests for your applications. So if you're interested in that type of content, we definitely welcome you to check out our show.