
JSON Goodies In Postgres 17 | Scaling Postgres 338


In this episode of Scaling Postgres, we discuss the JSON goodies in Postgres 17, trigger recursion, pg_dump as backup and pg_timeseries columnar performance.


Transcript

Last week when I started this episode, Hurricane Milton was getting ready to head right over my house and I didn't know what the impact would be. Well, thankfully we had minimal to no exterior damage, so that's good. But my neighborhood has plenty of downed trees and some neighbors had fence damage and things of that nature. But all in all not too terrible. I hope you, your friends, family and coworkers continue to do well.


Our first piece of content is SQL/JSON is here! (kinda “Waiting for Pg 17”). This is from depesz.com, and he frequently titles his blog posts “Waiting for PostgreSQL 17” when he mentions a new feature, but this one covers features that are in the already released Postgres 17. I don't know about you, but I don't work with JSON in Postgres on a regular basis. I definitely use JSON in my applications, and I have clients that use JSON, but it's not something I use every day. Therefore, my knowledge of it tends to come and go pretty easily, so I'm always looking for resources to remind myself how to do a particular thing in JSON, and I think this is a great post for that purpose. It's a reference for all the new goodies that come in Postgres 17 with regard to SQL/JSON. The blog post is super long, so I'm not going to go into all the detail, but we'll highlight some of the things he covers, and he gives examples in the code blocks. He mentions constructors such as the new JSON constructor, as well as JSON_SCALAR, which properly quotes a particular scalar value you will be inserting into JSON. There's JSON_SERIALIZE, which serializes JSON into a text or byte array format. He covers the JSON_ARRAY function, the JSON_ARRAYAGG aggregate function, JSON_OBJECT for building JSON objects, as well as JSON_OBJECTAGG. He talks about some of the testing functions and then gets to the more sophisticated features like JSON_QUERY as well as JSON_VALUE. And finally, the big one, which he says, quote, "this is the big thing", is JSON_TABLE: being able to take a snippet of JSON you receive, build a table out of it, convert it to different data types, etcetera. So this is a blog post I'm definitely going to be bookmarking so I can reference it in the future when I need to use JSON functions again.
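
To make that more concrete, here is a minimal sketch of a few of the Postgres 17 SQL/JSON functions mentioned above. These are my own toy examples rather than the ones from the post:

-- constructor: turn text (or bytea) into a json value
SELECT JSON('{"a": 1}');
-- quote a scalar so it becomes a valid JSON value: returns "O'Reilly"
SELECT JSON_SCALAR('O''Reilly');
-- serialize json/jsonb back into text (or bytea with a RETURNING clause)
SELECT JSON_SERIALIZE('{"a": 1}'::jsonb);
-- testing function: does this path exist? returns true
SELECT JSON_EXISTS('{"a": [1, 2, 3]}'::jsonb, '$.a[2]');
-- JSON_QUERY returns a JSON fragment: [1, 2, 3]
SELECT JSON_QUERY('{"a": [1, 2, 3]}'::jsonb, '$.a');
-- JSON_VALUE returns a single scalar, optionally cast: 2 as an integer
SELECT JSON_VALUE('{"a": [1, 2, 3]}'::jsonb, '$.a[1]' RETURNING int);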


Next piece of content “Convert JSON into Columns and Rows with JSON_TABLE”. This is from crunchydata.com, and it goes more in depth on the JSON_TABLE capability. Here they're grabbing JSON of earthquake occurrences, and because it's behind an API, they actually install the http extension and build a CTE that retrieves the JSON, then combine it with a JSON_TABLE command to extract values from within the JSON and build a dedicated table with specific data types. You can even store some of the JSON you're retrieving in its own JSONB column. He shows an example here, and I'm sure this is hard to read on the screen, but here's what the JSON document looks like, so you can see how the syntax is able to grab each of these values. Basically, it looks at the set of features, which is essentially an array here, and for each element within it builds the different column values. He even shows how you can do additional transformations depending on the type of table or query you want to build. So if you want to learn more about that, definitely check out this blog post.
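
As a rough illustration of the shape of such a query, here is a simplified sketch against GeoJSON-style data; the column names and paths are my own and not the exact ones from the post:

SELECT jt.*
FROM json_table(
  '{"features": [
     {"id": "ak123", "properties": {"mag": 4.2, "place": "Alaska"}},
     {"id": "us456", "properties": {"mag": 5.1, "place": "Chile"}}
   ]'::jsonb,
  '$.features[*]'
  COLUMNS (
    id    text    PATH '$.id',
    mag   numeric PATH '$.properties.mag',
    place text    PATH '$.properties.place',
    props jsonb   PATH '$.properties'  -- keep part of the raw JSON in its own jsonb column
  )
) AS jt;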


Next piece of content “Dealing with trigger recursion in PostgreSQL”. The first thing he talks about is a beginner's mistake leading to trigger recursion. He has a table with an updated_at column, which is a timestamp that defaults to the current timestamp. But how do you get this updated_at to be updated every time the row gets changed? So he creates a trigger for that purpose. He first creates a function that updates the updated_at column to the current timestamp for the given id that's been changed, then creates an AFTER UPDATE trigger on the table that executes that function for each row. But of course the problem is you've essentially created infinite recursion, because the AFTER UPDATE trigger issues another UPDATE, which fires the trigger again, and so on in a loop. So how can you avoid that? Well, in this example, don't use an AFTER UPDATE trigger, use a BEFORE UPDATE trigger. For each row you don't do another UPDATE within the function; because it's happening before the update, you can simply set updated_at on the new row to the current timestamp. That changes the data before the actual update is executed, and it's the best way to avoid infinite recursion in this case. Now he has what he calls a more serious example, where workers may be placed in quarantine, and basically everyone at the same address may need to go into quarantine too. You get a similar problem: after a row of interest is updated in this table, you do an UPDATE that changes other rows with the same address, so again you can cause infinite recursion. The way to get around this is to add to the WHERE condition so that it won't update a worker that's already been quarantined, which avoids doing double or multiple updates. He says another solution for dealing with recursion problems is the function pg_trigger_depth(), which tracks how deep the trigger recursion currently is, and you can use it in your function. Here he says only run the UPDATE statement if the trigger depth is less than two, which works for his particular implementation, but it's a way to control and prevent infinite recursion. And lastly, there is also the WHEN clause, which is kind of like a guard clause to prevent the function from ever being executed, and you add the WHEN clause on the trigger itself. So you have AFTER UPDATE on worker, FOR EACH ROW, and before you get to EXECUTE FUNCTION it says WHEN the new row is quarantined and the trigger depth is less than two. That prevents the function from even being called if the trigger depth is exceeded. If you ever have to deal with any of these issues, I think these are good solutions.
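
Here is a minimal sketch of the BEFORE UPDATE approach he describes; the table and column names are my own illustration rather than the exact code from the post:

CREATE TABLE t (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload    text,
    updated_at timestamptz NOT NULL DEFAULT current_timestamp
);

CREATE FUNCTION set_updated_at() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- modify the row in flight instead of issuing another UPDATE,
    -- so the trigger never fires itself recursively
    NEW.updated_at := current_timestamp;
    RETURN NEW;
END;
$$;

CREATE TRIGGER t_set_updated_at
    BEFORE UPDATE ON t
    FOR EACH ROW
    EXECUTE FUNCTION set_updated_at();

Because the function only assigns to NEW, there is no second UPDATE statement and nothing for the trigger to recurse on; pg_trigger_depth() and the WHEN clause only become necessary when the trigger body itself has to issue further UPDATEs, as in his quarantine example.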


Next piece of content “Is pg_dump a Backup Tool?”. This is from rhaas.blogspot.com, and it's pretty funny: a number of weeks ago I did an episode called "Is pg_dump a backup tool?" or something like that, I think it was episode 331, but this is Robert's take on whether it's a backup tool. He had a lot of the same perspective I did, in that at the end of the day, if it gets your data back, it could be perceived as some type of backup tool. For example, he says, quote, "if by chance my database went down and I lost the contents of my data directory, I could restore from one of my saved dumps and get the system back up". And that's what matters. He says he's dealt with very small databases at times where just using pg_dump was sufficient, and I mentioned that as well. The key thing is, whatever tool you choose to back up your data, can you restore it? So test out those restores and make sure they work. But he does agree these are tools, not full backup solutions, so once you become more sophisticated they're not the best choice, and you may want to use a tool like Barman or pgBackRest, because they have many more features that give you enhanced backup functionality. And once you start using physical backups, rather than pg_dump and pg_restore, you get access to point-in-time recovery and things of that nature. And I'll mention something I was saying back in that previous episode: at this point, pg_dump might more appropriately be named pg_export, and maybe pg_restore should be pg_import, because that's essentially what they're doing: exporting all of your data and importing all of your data. But if you want to learn more about Robert's perspective, I definitely encourage you to check out this blog post.


Next piece of content “Comparing Columnar to Heap Performance in Postgres with pg_timeseries”. This is from tembo.io, and pg_timeseries is their new extension for time series data. They have incorporated Hydra columnar storage to provide column storage for this extension, and they did a presentation at the Chicago Postgres user group where people were asking questions about columnar performance. So they said all right, let's go ahead and do a benchmark. They explain all the parameters they used for setting up this benchmarking test, and it's available on GitHub for you to review, along with all the code used. But we'll take a look at the results down here. They basically inserted, updated and deleted into either heap storage or columnar storage, and from their testing the two were pretty much equivalent up to about 3 million rows. At that point the columnar storage started doing better, so the heap storage was actually a bit slower. They said, "As row density increases, new column data has more chances to integrate into an existing compression histogram", so that apparently gives a bit of a performance gain as the amount of data increases. In terms of the updates, it was a little different, but columnar was still a little faster compared to heap, as you can see here. I don't know about the statistical variance because they don't mention it, but it still appears to be a bit faster. And lastly, the deletes: it's clearly apparent that the columnar storage is much faster than the heap storage. He says he wasn't quite sure what algorithm was used here, but the deletes are definitely more efficient in columnar storage compared to heap. So those are some benchmarks of the columnar storage in the new pg_timeseries extension. Check this out if you want to learn more.


Next piece of content: there was another episode of Postgres FM last week. This one was on “Advanced EXPLAIN”, and in it Nikolay and Michael discussed EXPLAIN. I thought they would go a bit more into SERIALIZE and MEMORY, the new options in Postgres 17, but their discussion actually ran the gamut of EXPLAIN. Two naming things they mentioned that I thought were very interesting, and that I would definitely endorse: Nikolay said that as opposed to using ANALYZE to actually run the query, it should be renamed to EXECUTE, so you'd do EXPLAIN EXECUTE, because it actually executes the query; it does what it says it does. So I would definitely plus-one that. In addition, Michael's recommendation is that the VERBOSE option to EXPLAIN should actually give you all the different parameters, because right now it picks and chooses what it shows. So if you do EXPLAIN VERBOSE, you still need to list out a number of other parameters if you want all the information EXPLAIN offers. I'll put my little plus-one there as well. They did talk a little bit about the testing results with regard to partition planning time; this is from experiments Nikolay was doing at Postgres.AI, which found that caching plays an important role in planning time when you're dealing with a lot of partitions. That was mentioned in last week's Scaling Postgres, where we covered the blog post from Postgres.AI discussing it, so you can reference that, but basically it highlights the importance of using a pooler if you're using a lot of partitioned tables in your application. They also emphasized the importance of buffers and how you should always add BUFFERS whenever you're running EXPLAIN and not just rely on the timings, as well as being sure to run a query more than once, because a lot of times you'll have a cold cache on the first run, depending on the system, and the second or third time you'll have essentially hot cache results. And they discussed a bit of a wish list of additional capabilities they would like EXPLAIN to have. So if you're interested in listening to this episode, you can click here or watch the YouTube video down here.
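
For reference, running EXPLAIN with the options they discuss might look like this on Postgres 17; the query and table name are just placeholders:

-- ANALYZE actually executes the query, BUFFERS reports buffer hits and reads,
-- and SERIALIZE and MEMORY are the new Postgres 17 options that measure the
-- cost of serializing the result set and the memory used during planning
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SERIALIZE, MEMORY)
SELECT * FROM some_table WHERE id = 42;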


Next piece of content “PostgreSQL 18: Tweaking relation statistics”. This is from dbi-services.com, and apparently this is a feature that allows you to change the statistics the optimizer knows about. There's a new function called pg_set_relation_stats that lets you take a table or relation and define, for example, how many rows it has. So he created a table with 100 rows, and when doing a query it did a sequential scan even though there's an index present, because with so few rows it's not going to try an index scan. But once he used pg_set_relation_stats to say, no, this table has a million rows, it did an index scan for the same query. However, if another operation like ANALYZE runs, it resets those statistics back to what they should be, so it's back to 100 tuples after ANALYZE was run. And there's also a pg_clear_relation_stats function. I'm not quite sure how this will be used, but it's definitely an interesting addition.
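
If you want to try this on a PostgreSQL 18 development build, the flow looks roughly like the sketch below. This is my own example, and the exact function signature described in the post may change before the final release:

CREATE TABLE t (id int PRIMARY KEY);
INSERT INTO t SELECT generate_series(1, 100);
ANALYZE t;

-- pretend the table has a million rows so the planner prefers the index
-- (signature as described in the post; it may differ in the final release)
SELECT pg_set_relation_stats('t'::regclass, reltuples => 1000000);
EXPLAIN SELECT * FROM t WHERE id = 42;

-- a later ANALYZE (or autovacuum) puts the real numbers back, and
-- pg_clear_relation_stats('t'::regclass) resets them explicitly
ANALYZE t;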


Next blog post, also from dbi-services.com, is “PostgreSQL 18: REJECT_LIMIT for COPY”. In Postgres 17 we got the ON_ERROR option for COPY, which lets the copy keep going even when there are errors. With this addition you can also set a reject limit, so COPY will stop once it hits that limit of rejected rows. He shows the behavior with different reject limit settings and the point at which COPY stops. So that's something to look forward to in 18.
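
A quick sketch of how that reads; the table and file path are placeholders, and REJECT_LIMIT requires ON_ERROR ignore:

-- Postgres 17: skip malformed rows and keep going
COPY t FROM '/tmp/data.csv' WITH (FORMAT csv, ON_ERROR ignore);

-- Postgres 18: same, but fail the COPY once more than 5 rows have been rejected
COPY t FROM '/tmp/data.csv' WITH (FORMAT csv, ON_ERROR ignore, REJECT_LIMIT 5);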


Next piece of content “Live Share: Connect to in-browser PGlite with any Postgres client”. This is from supabase.com. As a reminder, PGlite is an in-browser database that essentially runs in single-user mode, but they've now set up the capability to connect to it via psql. When I first mentioned this, they called it postgres.new, but now the name has changed to database.build, or I guess that's at least the URL. Basically, once you turn on live share, it gives you a URL that you put into your psql client, and you can connect via psql and do whatever operations you want on the database. My understanding is that this happens through a new open source library called pg-gateway that speaks the Postgres wire protocol from the server side. So that's pretty interesting. Check out this blog post if you want to learn more.


And the last piece of content “PostgreSQL Anonymizer 2.0 - Generating Fake Data”. My understanding of this particular extension is that it offers a way to anonymize data in, say, a production database that you're transferring to a test environment, so you don't have real production data there. But it seems what they're doing with version two is introducing faker data, the ability to fabricate data, and it looks like they're actually merging an existing extension, PostgreSQL Faker, into PostgreSQL Anonymizer. I thought these would be two different tools, but now it looks like they're combining them into one. I've never used the PostgreSQL Faker tool; I usually use language-based tools in whatever language I'm working with, but check this out if you want to learn more.

 
