
Implement Get Or Create | Scaling Postgres 328


In this episode of Scaling Postgres, we discuss how to implement get or create, mention a new pgBouncer version, how to alter default privileges, and performing six degrees of separation with Postgres.


Transcript

It's interesting how a lot of application frameworks have features that work great on small data sizes but kind of fall apart once you start getting larger data sizes, particularly when you're working with an ORM. One of those features is something like Rails' find_or_create: you want to find a record, but if it isn't there, go ahead and create it for me. Well, our first blog post talks about that and how to make it more efficient and less harmful to Postgres in general, particularly if you have a highly active or just larger database. But I hope you, your friends, family and coworkers continue to do well. 


Our first piece of content is “How to Get or Create in PostgreSQL”, with the subtitle “And why it is so easy to get wrong”. This is from hakibenita.com. Now, forewarning, this is a super long blog post, and if you read it you'll probably spend 15-20 minutes on it. As another indication, any blog post with a table of contents you know is going to be quite long as well. So he's basically talking about how to efficiently implement getting a record and, if it's not there, creating it and returning that value. In his example he's using a tags table, and he goes through many different implementations, seeing where they fall down, what problems they potentially cause, etcetera. The first objective he wants is for it to be idempotent. In other words, you should be able to run the command repeatedly and it shouldn't error out, whereas if you just try to insert a value that already exists, you're going to run into an error. So to get idempotency, the function does a SELECT, and if it found some data, it returns it; if not, it inserts the data with a RETURNING clause and returns that. That gives you idempotency. But now you've essentially created a potential race condition, where a different process trying to SELECT and INSERT the same thing could run into an error, and he lists that case right here. So then maybe you add some exception handling, and that will help prevent the race condition, but depending on how fast things are happening, you could be causing a lot of bloat in your system. So this exception method basically generates bloat. He then tried sub-statements, basically doing a SELECT within your INSERT statement, but that is still subject to a race condition, meaning it doesn't support concurrency very well. So finally he gets to the thing I was expecting him to talk about: using ON CONFLICT DO NOTHING when doing an INSERT. That way you can always just tell it to insert something, and if it already exists it will simply not give you an error. So it's idempotent in that respect, and it helps to avoid bloat as well. 
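
To make the core idea concrete, here is a minimal sketch of the ON CONFLICT DO NOTHING approach; this is not the article's exact code, and the table and column names are assumptions.

```sql
-- Minimal sketch of get-or-create with ON CONFLICT DO NOTHING.
-- Table and column names are assumptions, not the article's exact schema.
CREATE TABLE IF NOT EXISTS tags (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL UNIQUE
);

-- Safe to run repeatedly: a duplicate name no longer raises an error.
-- Note the caveat discussed next: RETURNING only produces a row when a
-- new tag was actually inserted, not when the conflict path was taken.
INSERT INTO tags (name)
VALUES ('postgres')
ON CONFLICT (name) DO NOTHING
RETURNING id;
```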

 

But the thing he didn't like about ON CONFLICT DO NOTHING is that the RETURNING clause only returns rows that were actually inserted, that is, data that was changed. So he said, hey, maybe instead of DO NOTHING I'll do an update and just set the ID to itself, so I get rows returned in both cases, even though one or both of them may not have been inserted. But he calls this section taking a wrong turn, because of course now you're essentially bloating your system again, and there has to be an index on that ID, so no more heap-only tuple (HOT) updates. Then he has another consideration, which is avoiding schema changes, because the thing about ON CONFLICT DO NOTHING is that it does need some sort of unique constraint to work. He wondered if he could do it without the unique constraint, although the benefit of the unique constraint is that it gives you an index to make things more efficient for large datasets. So he decided to use MERGE instead of INSERT ... ON CONFLICT DO NOTHING, and using the MERGE command he was able to achieve the same results, and it can do so without having a unique constraint on the table. Although the point of the tags is that you want them to be unique, so you still do want a unique constraint on the tags. As well, the index will really help with quick lookups to make sure there are no duplicates when you're looking to insert new tags, so I guess I don't really see it as a detriment, particularly if you have a larger database. But as you can tell, this is a very comprehensive blog post examining different ways, some paths working out better than others, of building a get-or-create implementation in Postgres: making it idempotent, supporting concurrency without race conditions, avoiding bloat, and being flexible on constraints. But if you want to learn more, definitely check out this blog post. 
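
As a rough illustration of the direction he ends up going, a MERGE-based get-or-create might look something like this. This is only a sketch, reusing the same hypothetical tags table as above; MERGE requires PostgreSQL 15 or later, and RETURNING on MERGE requires 17.

```sql
-- Sketch of a MERGE-based get-or-create; not the article's exact query.
MERGE INTO tags AS t
USING (VALUES ('postgres')) AS input(name)
    ON t.name = input.name
WHEN NOT MATCHED THEN
    INSERT (name) VALUES (input.name);

-- Unlike ON CONFLICT, MERGE itself does not require a unique constraint
-- on tags.name, though you likely still want one for correctness and speed.
```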


Next piece of content is a very quick one: “PgBouncer 1.23.1” has been released. The reason is two fixes for crashes that were happening with the most recent version, 1.23. So if you've recently installed that, you're probably going to want to install this patched version to avoid any kind of crash issues. And you can see the name of this release is “everything is put back in order”, so it's basically just a bug fix release. This is from pgbouncer.org. 


Next piece of content: “Dispelling Myths About PostgreSQL Default Privileges”. This is from percona.com. The purpose of ALTER DEFAULT PRIVILEGES is that when you run a GRANT statement, it only affects existing objects; if you grant SELECT on tables, it only works for tables that already exist, whereas if you alter the default privileges for tables, you can say that a particular user should have SELECT access to any new table created. He has an example here where, as the postgres user, he created a schema, granted usage on that schema to a user, and created a new table; then he had another user try to select from it, and it fails because the default privileges have not been changed. You can grant SELECT on that table to that user and now they can use it, but the way to fix this so any new tables will allow that user access is to run ALTER DEFAULT PRIVILEGES. In this case he runs ALTER DEFAULT PRIVILEGES IN SCHEMA test, GRANT SELECT ON TABLES to the given user. Now when a new table is created, that user can SELECT from it because it has the privileges; a separate grant was not necessary. But the problem comes when a different user creates a table in that same schema and the reading user tries to access it. They get permission denied again, because there's another clause you need to add to ALTER DEFAULT PRIVILEGES so that objects created by one user can be read by others. That is ALTER DEFAULT PRIVILEGES IN SCHEMA test, same as before, but you add FOR ROLE obj_creator (in this case the name of the creating user), GRANT SELECT ON TABLES to the reading user. So it's that FOR ROLE you need to specify to make sure that any objects created by this creator role will still allow the other user to access them. In addition, he shows some other examples of commands: beyond granting SELECT, INSERT, UPDATE, DELETE on tables, you can also grant USAGE on types, EXECUTE on functions, or SELECT on sequences. And finally he wraps up by showing how to look at the current default privileges and how they're set, using psql's \ddp command as well as a specific query you can run to get all this information in a more readable way. But if you want to learn more, definitely check out this blog post. 
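
A condensed sketch of the two statements described above might look like this; the reader role name app_reader is my placeholder, while obj_creator and the test schema come from the article's example.

```sql
-- Default privileges for tables the *current* role creates in schema test:
ALTER DEFAULT PRIVILEGES IN SCHEMA test
    GRANT SELECT ON TABLES TO app_reader;

-- Default privileges for tables that obj_creator creates in schema test,
-- which is the FOR ROLE clause the article highlights:
ALTER DEFAULT PRIVILEGES FOR ROLE obj_creator IN SCHEMA test
    GRANT SELECT ON TABLES TO app_reader;

-- In psql, \ddp shows the default privileges currently in effect.
```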


Next piece of content: “Six Degrees of Kevin Bacon - Postgres Style”. This is from crunchydata.com. This is similar to the graph traversal that was mentioned in last week's Scaling Postgres, but this one is doing it using the six degrees from Kevin Bacon. It's related to the concept that there are six degrees of separation between any two people, so you can traverse a person's relationships up to six times to be matched with anyone else. In this game, you compare an actor to the movies they were in and eventually find the other actor. So there's one degree of separation between Kevin Bacon and Steve Martin because they were both in this particular movie, and in this example here there are two degrees of separation. So they wanted to grab this data and build a shortest-path algorithm to find how many degrees of separation there are. They downloaded the IMDb data, and it has an actors table, a movies table, and then a movie_actors table that links the actors to the movies. Essentially, the actors will be the nodes of the graph, and you want the movies to be the edges or relationships of the graph. So they take the raw data here, do a join across all of the data, and place some indexes on it so that it ultimately looks like this: you have an edge_id, one actor, the movie that they were in, and then every other actor in the next column. As you can see, for this particular movie the same actor-and-movie pair appears five times, because there are five other actors that appeared in the movie as well. Now, the first technique they used is pgRouting, which I think is normally used for geographic information system queries, but they're using its Dijkstra algorithm to find the shortest path. So they basically create the pgrouting extension, find the actor IDs for Kevin Bacon and who they want to match him to, Timothée Chalamet, and use the pgr_dijkstra function, passing in the query as well as the two IDs. For the 11 million record table it returns in about 5 seconds, and it shows you the path between them. The next thing they looked at was using a recursive CTE, which is what was used in last week's Scaling Postgres episode for doing a graph traversal. So again, it's a recursive CTE: it has two parts, the starting node and then the recursion part. You place the origin actor here and the destination actor here, and it outputs the paths in an array. Then he has another query to make it look more presentable, basically unnesting this array and doing JOINs on it, so you can see the links between the actors and the movies. And he wrapped this up in a convenient PL/pgSQL function, so you can just run SELECT * FROM bacon, put in the other actor's name, and it will give you the results here. Oh, and with regard to performance, he said this solution took just a few hundred milliseconds compared to the 5 seconds of the pgRouting function. So the recursive CTE for this particular use case was definitely more performant. But if you want to learn more, check out this blog post. 
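
For a sense of what the recursive CTE traversal looks like, here's a simplified sketch; the column names, placeholder IDs, and the six-hop cap are my assumptions, not the post's exact query.

```sql
-- Simplified sketch of a shortest-path search over an edges table shaped
-- roughly like edges(actor_id, movie_id, costar_id); IDs are placeholders.
WITH RECURSIVE paths AS (
    SELECT e.costar_id,
           ARRAY[e.actor_id, e.costar_id] AS path,
           1 AS degrees
    FROM edges e
    WHERE e.actor_id = 1                    -- starting actor id (placeholder)
  UNION ALL
    SELECT e.costar_id,
           p.path || e.costar_id,
           p.degrees + 1
    FROM paths p
    JOIN edges e ON e.actor_id = p.costar_id
    WHERE e.costar_id <> ALL (p.path)       -- don't revisit actors
      AND p.degrees < 6                     -- six degrees cap
)
SELECT path, degrees
FROM paths
WHERE costar_id = 2                         -- target actor id (placeholder)
ORDER BY degrees
LIMIT 1;
```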


There was another episode of Postgres FM last week. This one was on “Why Postgres?”, so Nikolai and Michael discuss, as they say here, why they chose Postgres as users of it for their businesses and their careers, as well as some doubts. One of the doubts they mentioned is that the project is not as transparent with regard to its governance. But Michael asked, well, what other open source project is doing better, and from what I recall, I don't remember them identifying another open source project doing it better. They also discussed some history of their usage of Postgres. So if you want to learn more, you can listen to the episode here or watch the YouTube video down here. 


Next piece of content: “Bluefin for PostgreSQL: Revolutionizing Immutable Storage in Database Management”. This is from enterprisedb.com. This seems to be a new feature of EnterpriseDB, so I don't know if it's available anywhere else, but it's basically a new table access method. So as opposed to using the standard heap, it's another storage engine you can use with Postgres, and it only does immutable storage. You're only allowed to do INSERTs or SELECTs, so UPDATEs and DELETEs are disallowed. They say the tuples in Bluefin are stored as compressed deltas of other tuples, and pages are eagerly frozen once full. Now, this new type of storage method allows them to really increase the density of the tuples per page, which accelerates read operations. But as a consequence of being immutable, the only way you can delete data is by dropping partitions, so partitioning is basically a required element of using this storage system. What are some of the advantages? Because you have a higher density, the tables, as they say here, are 57% smaller than a conventional heap table, and you get higher read performance. In a pgbench run, which looks like it was select-only, the Bluefin table was 14 times more performant than a heap table. So I thought this was super interesting, but if you want to learn more, you can look at the link here. I believe this is part of EnterpriseDB's advanced storage pack.
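
If Bluefin behaves like any other table access method, using it would presumably look like standard partitioning plus a USING clause. This is purely a hypothetical sketch based on the post's description, not EDB documentation; the "USING bluefin" clause and the table layout are assumptions.

```sql
-- Hypothetical sketch: "USING bluefin" and the table layout are assumptions.
CREATE TABLE metrics (
    recorded_at timestamptz NOT NULL,
    value       double precision
) PARTITION BY RANGE (recorded_at);

CREATE TABLE metrics_2024_07
    PARTITION OF metrics
    FOR VALUES FROM ('2024-07-01') TO ('2024-08-01')
    USING bluefin;

-- With immutable storage, old data is removed by dropping whole partitions
-- rather than running DELETE:
DROP TABLE metrics_2024_07;
```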


Next piece of content: “Enhanced Security: Logging PostgreSQL Connections”. This is from cybertec-postgresql.com. He discusses the importance of logging connections, mainly with regard to security, because then you can see where people have connected from, the username they used, the authentication method that was used to log in, and even the line (from pg_hba.conf) that ultimately allowed them to log in that way. So if you don't have this feature enabled, you may want to turn it on for your environment.
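
If you want to turn this on, the main setting is the standard log_connections parameter, and it can be enabled without a restart; a minimal sketch is below. The log_disconnections line is a related setting I'm adding here, not necessarily something the article covers.

```sql
-- Enable connection (and, optionally, disconnection) logging.
ALTER SYSTEM SET log_connections = 'on';
ALTER SYSTEM SET log_disconnections = 'on';  -- related setting, useful for auditing
SELECT pg_reload_conf();                     -- picked up on reload, no restart needed
```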


Next piece of content: “Posting Your Patch On pgsql-hackers”. This is from rhaas.blogspot.com, and this is basically about the best way to post your patch to pgsql-hackers, get responses to it and actual discussion, and then ideally have it ultimately committed. So this is the process of getting it posted once it's written, and basically, my interpretation is that it's kind of like a sales job. You need to be very focused and to the point about the objective of the patch, what it does, and why it's important. He also discusses how to handle a single patch versus a patch series, as well as the importance of comments within the code highlighting what's being done. So if you want to learn more about that, you can check out this blog post. 


Next piece of content: “The State of the Postgres Community”. This is from vondra.me, and this was a presentation given at Swiss PGDay 2024 that he basically converted into a blog post. He talks about commits per year, the number of committers, how many are actively contributing, different patch statuses, as well as mailing list activity. So if you want to see what the trends of Postgres development have been, you can check out this blog post. 


The last piece of content: “Does PostgreSQL Respond to the Challenge of Analytical Queries?”. This is from danolevo.substack.com. He goes through and looks at what Postgres features are available now and what progress has been made toward handling complex analytical queries, with regard to extended statistics support, incremental sorting, memoization, pulling up subqueries, and ordering by distinct aggregates, as well as a brief discussion of what more could be done in future Postgres development. But if you want to learn more, definitely check out this blog post.


 
