PSQL GEXEC, Delete Duplicates, Postgres Podcast, Puny to Powerful | Scaling Postgres 225
Join Over 1,000 Engineers & Get New Episodes Weekly!
In this episode of Scaling Postgres, we discuss /gexec available in psql, how to delete duplicates using a window function, a new Postgres podcast and a performance talk about going from puny to powerful.
Content Discussed
- gexec in psql: PostgreSQL poweruser practice
- Postgres Format Function
- Delete Duplicate Rows
- Introducing a new Postgres podcast
- [YouTube] RailsConf 2022 - Puny to Powerful PostgreSQL Rails Apps by Andrew Atkinson
- 5mins of Postgres E27: Waiting for Postgres 16 - Make subquery alias optional in FROM clause
- RailsConf 2022 - A Rails Performance Guidebook: from 0 to 1B requests/day
- PostgreSQL Performance Puzzle
- Laura Ricci
- Rubber Duck Dev Show Episode 51 | Andrew Mason & Collin Jilbert of Ruby Radar
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "GEXEC IN PSQL: POSTGRESQL POWERUSER PRACTICE ''. This is from cybertec-postgresql.com. And they're talking about using gexec. So when you use psql, it has a number of different slash commands. One is \G, which just repeats the command that was just executed. It runs it again, but gexec actually takes the output from the previous command and actually runs it as statements within psql. So that's what this post explores. So the first exercise is if you're going to be wanting to craft commands to run within psql, you're going to need to craft those statements. So the first example he's using is the double pipe operator, which is basically a concatenation of two things. So here he grabs a number of different tables just by selecting from pg_tables. Then he writes a statement just as a literal, concatenates it to the table name that's being returned up here, and then concatenates the remainder of the command, including, of course, the semicolon.
That produces an output as you see here. Now he also talks about a safer way to use this other than that, but we'll continue with what he's doing at this point. Then you simply call gexec on that, a psql command, and it will actually run the statements from the last command's output. So it runs this command and this command, and this command and this command, as you can see down here, all the grants were run. Now throughout this post, he's using the double pipe operator. But if you are unsure of the input you're going to be passing to this, you are vulnerable to an SQL injection attack. So you want to use some different functions in order to build out these commands. In this particular example, he could have used a quote_ident(), because this is an identifier he's passing in essentially. Or you could use format() and it has a number of different params you can use. This is the identifier param. There's also just being able to put a literal string for literal string replacement.
He uses the format example here where he's writing out the full query and then within it saying, 'Okay, this is an identifier that needs to be replaced and it's going to be replaced with the table name'. So this should be familiar to you if you've used other interfaces that protect against SQL injection. Basically, you have a replacement variable in here that gets replaced by what you want to pass in. So that's basically what gexec does. Now he further describes some different exercises you can do. One example is if you actually need to run certain commands against the tables you're identifying, in this case, you can actually use a cross-join to explicitly state the values here and here. He's able to say, I want to do an INSERT, UPDATE, and DELETE for each table and then you can actually build that out.
Again, you should use the format() or quote_ident() functions instead, but it actually says GRANT INSERT, GRANT UPDATE, or GRANT DELETE on the particular table of interest. Now, he also mentions here you could also add double quotes for different purposes if they're required. But again, you shouldn't necessarily use the concatenation operator. It's easier to use format(). Actually, you should look at the Postgres documentation, I'll include this link as well. When you get down to the format() function and they're talking about the SQL Identifier replacement, it actually double quotes it if necessary. So you don't even have to worry about explicitly doing double quotes within it if you use format() because it'll do it automatically. But if you want to learn more about that, definitely check out this blog post.
The next piece of content- "Delete Duplicate Rows". This is from sqlfordevs.io and this is a slightly different take on how you can delete some duplicate rows. Now he gives both a MySQL example and a PostgreSQL example, but he first talks about usually how people handle this is by making a group and then they give the MIN(id) to retain and delete everything other than that MIN(id) maybe using some subqueries to do it. I myself have done this frequently, but this is a pretty elegant way to do it, using a window function to do it. So basically, he has a CTE where he is identifying all of the duplicates.
He pulls back the ID then he wants to print out the row number as a part of the window function, and he does it over these partitions. So every distinct first name, last name, and email will get its own row number and in the case where they're identical, they'll get however many duplicates exist and then he orders it by age. Now, if this was say the CREATED AT for that record, or if it's the age of the record, basically you would want to order it by which record you want to retain. So if you wanted to keep the oldest record, you probably wouldn't do it by descending. You would want to do it ascending by the CREATED AT date, for example.
But that'll give you all the duplicates and show you how many duplicates there are. Then you can use this delayed statement. So DELETE FROM the table using duplicates, using the CTE as the reference point, and saying where the ID equals the ID and the duplicate's row number is greater than one. So you're basically going to keep the first one that you've identified by this ordering, however you want to do that. So I found this to be a pretty great way to delete duplicates and if you want to learn more, you can check out this blog post.
The next piece of content- "INTRODUCING A NEW POSTGRES PODCAST". This is from pgmustard.com and he's announcing the start of Postgres FM, which is also going to be posted on Postgres TV, their YouTube channel. So this looks to be a weekly format, roughly 30 minutes for each episode. So if you're interested, I believe they have an episode or two up so you can go ahead and get started.
The next piece of content is actually a YouTube video and it's "RailsConf 2022- Puny to Powerful PostgreSQL Rails Apps by Andrew Atkinson". Now, this was from a Ruby on Rails conference, but it all covers areas of Postgres performance. Particularly talking about doing safe migrations and how to handle connection pooling. How to do query optimization using things such as pg_stat_statements and enabling that for your database. As well as index maintenance such as vacuum and following up with partitioning. Now this is more beginner-level information he's covering, but it's great for developers and I thought this was a pretty good talk about those subjects. So if you're interested, you can check that out.
The next piece of content- "5mins of Postgres E27: Waiting for Postgres 16- Make subquery alias optional in FROM clause". This is from pganalyze.com. Now this is something coming in Postgres 16, which is actually about a year and three months away. But this particular feature many, many people will appreciate. Lukas asks the question about why this doesn't work when you don't have an alias when doing a select from a subquery. That question shows up over 200,000 times on stack overflow. And I myself do this all the time, where I'm selecting from a subquery and I forget the alias and it gives me an error and I have to put the alias in. Well, basically, it looks like in Postgres 16 you won't have to do that anymore because it has been merged in as a patch. But if you want to learn more about that and what you can look forward to, definitely check out this piece of content.
The next piece of content is another YouTube video. This one is also from RailsConf 2022 and its "A Rails Performance Guidebook: from 0 to 1B requests/day". Now, this wasn't all about databases. There are some general performance topics covered here, but there are a few database items covered. But this is from a performance engineer at Zendesk and basically, they have billions and billions of tickets and terabytes and terabytes of data. So I always find it interesting getting insight into how these larger applications run so you may find it of interest as well.
The next piece of content- "PostgreSQL Performance Puzzle". This is from ardentperf.com and this is actually a puzzle that someone proposed. They set up a table with just a few columns, did a query and got dramatically different results for data that one would potentially expect to be similar. Now, I'm not going to give anything away because you can actually go through the puzzle, see if you can figure it out, and at the end of the post, they do give the explanation. So if you're interested in that type of performance puzzle, check out this blog post.
Next piece of content. The PostgreSQL person of the week is Laura Ricci. If you're interested in learning more about Laura and her contributions to Postgres, definitely check out this blog post.
The last piece of content. We did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was with Andrew Mason and Colin Gilbert of Ruby Radar. So if you're interested in a developer discussion, we definitely welcome you to check out our show.