Pagination Solutions, Return Modifications, Insert Deletions, How To JSON | Scaling Postgres 248
In this episode of Scaling Postgres, we discuss different pagination solutions, how to return modifications, soft deletion alternatives and how to use JSON with PostgreSQL.
Content Discussed
- Pagination and the problem of the total result count
- RETURNING Modified Rows
- Easy, alternative soft deletion: `deleted_record_insert`
- How to JSON in PostgreSQL®
- Beyond Joins and Indexes
- PostgreSQL largest commits
- Supabase Wrappers, a Postgres FDW framework written in Rust
- Using the window function row_number to remove duplicates in PostgreSQL
- Horizontally Scaling PostgreSQL
- Upgrading PostgreSQL Extensions
- A PostgreSQL Developer's Perspective: Six Interesting Patches From November's Commitfest
- Using pgbench to load data faster, and the random_normal function in PG16
- Waiting for PostgreSQL 16 – Invent random_normal() to provide normally-distributed random numbers.
- Rethinking buffer mapping for modern hardware architectures
- OrioleDB Github
- How to Solve Advent of Code 2022 Using Postgres - Day 11
- How to Solve Advent of Code 2022 Using Postgres - Day 12
- Timezone Transformation Using Location Data & PostGIS
- Avoiding Constraint violations while migrating Oracle to PostgreSQL – DATE data type
- SUBSTR Functionality Differences Between Oracle and PostgreSQL: What You Need to Know
- Unleashing Boolean datatype casting in PostgreSQL
- Copying a database
- Francesco Tisiot
- Rubber Duck Dev Show Episode 69 | Adventures in Note Taking with Seb Wilgosz
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "PAGINATION AND THE PROBLEM OF THE TOTAL RESULT COUNT". This is from cybertec-postgresql.com. They're talking about a situation where a lot of programming libraries have the ability to do pagination, but nearly all of them pull the full result set back to get the full count of records to calculate how many pages would potentially need to be rendered. They also use a specific technique, LIMIT and OFFSET, to render each page, but that results in some performance problems. So in this example, they generated some fake data that they wanted to page through. They wanted to look at values between zero and ten ordered by the created_at date. So they created an index on created_at and value. Now, normally for the best performance, you would swap these out, but I guess if you want the earliest record to always show, it would make more sense to have the created_at column first.
As I mentioned, most of these libraries use OFFSET and LIMIT and they show exactly how that works here. When you're looking at the first page, and then when you're looking at the 100th page, you'll see an offset of 4,950 records, limiting to 50 for each page. Now, the big disadvantage of this is that as you get to later pages, the performance gets worse and worse because it has to pull all of these 5,000 records and then only show the 50 of interest. But it's pretty simple and this is what most libraries tend to use. He mentioned another option using cursors. Now, a cursor generally takes place within a transaction, so that's not necessarily going to help you, particularly with a web application, but you can use WITH HOLD cursors and that basically materializes the whole result set that you have. Now, the huge disadvantage of this is it takes a long time to pull back all of this data depending upon how large it is. So rendering your first page could be pretty slow.
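To make that concrete, here's a rough sketch of what those first two approaches look like in SQL, using an illustrative table named data with created_at, id and value columns (the names are assumptions based on the article's description, not its exact code):

```sql
-- OFFSET/LIMIT: page 1, then page 100. The later page still has to
-- scan and discard the first 4,950 rows before returning 50.
SELECT * FROM data WHERE value BETWEEN 0 AND 10
ORDER BY created_at, id
LIMIT 50 OFFSET 0;

SELECT * FROM data WHERE value BETWEEN 0 AND 10
ORDER BY created_at, id
LIMIT 50 OFFSET 4950;

-- WITH HOLD cursor: the whole result set is materialized at COMMIT,
-- so the first page is slow but later pages are cheap. The cursor
-- outlives the transaction, so it must be closed explicitly.
BEGIN;
DECLARE page_cur SCROLL CURSOR WITH HOLD FOR
    SELECT * FROM data WHERE value BETWEEN 0 AND 10
    ORDER BY created_at, id;
COMMIT;

FETCH 50 FROM page_cur;          -- page 1
MOVE ABSOLUTE 4950 IN page_cur;  -- jump to the start of page 100
FETCH 50 FROM page_cur;          -- page 100
CLOSE page_cur;
```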
Secondly, you also need to do a fair amount of management of the cursor and close it properly as well. The third disadvantage is that this doesn't really work well when you're doing transaction-level pooling because this depends upon having a consistent session to manage the cursor and have it present. So this has a fair number of disadvantages as well. And then the third option is keyset pagination, where basically you keep track of the last value of the page, or I should say the last row of the page that's rendered. And then you use that for subsequent queries when wanting to go to the second page, the third page, et cetera. So it gives you a new starting point and generally, you're going to want to use, say, the created_at date and an ID, which is some sort of a unique identifier, just in case something has the same date time.
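Here's a hedged sketch of keyset pagination against that same illustrative table, where the application remembers the created_at and id of the last row it rendered; the literal values below are placeholders for whatever the previous page ended on:

```sql
-- First page
SELECT * FROM data
WHERE value BETWEEN 0 AND 10
ORDER BY created_at, id
LIMIT 50;

-- Next page: start strictly after the last row already shown.
-- The row comparison breaks ties on created_at using id.
SELECT * FROM data
WHERE value BETWEEN 0 AND 10
  AND (created_at, id) > ('2023-01-05 12:34:56+00', 4217)  -- placeholder values
ORDER BY created_at, id
LIMIT 50;
```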
And then with that, it does require a slight modification to the index, because you're going to want to have created_at, the ID, and then the value. Now this has a lot of advantages. The only disadvantage is needing to modify the index a bit for this specific query. So this may be less appropriate for more dynamic data you're trying to pull. And then they have this great chart here that shows the different comparisons when trying to render pages 1, 10, and 100, and how OFFSET/LIMIT went from one millisecond for the first page to 15 milliseconds for the 100th page. If you have more pages, it's going to increase from there. Using the WITH HOLD cursor, the first page took 642 milliseconds. So that's a huge amount of time to pull all of that data. But the 10th page was able to show in 0.34 milliseconds, so much faster than even the first page with OFFSET and LIMIT.
And the 100th page was 0.51 milliseconds. So very quick once you have all that data loaded in. And then the keyset pagination was pretty consistent, 1.4 milliseconds rounded up for pages 1, 10, or 100, it didn't matter. So usually if I'm going to do pagination, keyset pagination is my go-to. But in general, I don't like using pagination at all if I can help it. I try to purposefully limit the amount of data that is returned or generate a report in the background for it, at least for web applications. So the next area they cover is the problem with the total result count. And basically, you need the total result count to enumerate all of the different pages that you could potentially click through.
But this is a huge resource hog and as they say, here, quote "...the query to calculate the total result set count is usually more expensive than all other queries together". That's definitely the case that I've seen. So basically you don't want to run the whole result count. His recommendation, which I agree with, is either don't show it and just say next page, previous page, don't show all the different pages, or do an estimate. And you can do this using an EXPLAIN plan for what you're querying. So use the database statistics to give a rough count of how many rows you have. But anyway, these are all different methods that you can handle pagination with Postgres and you can check out this blog post if you want to learn more.
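One common way to get such an estimate, in the spirit of what the post describes, is to pull the planner's row estimate out of EXPLAIN output with a small helper function. This is a sketch for trusted input only (the query text is interpolated directly), not the article's exact code:

```sql
CREATE OR REPLACE FUNCTION estimated_count(query text) RETURNS bigint
LANGUAGE plpgsql AS $$
DECLARE
    plan jsonb;
BEGIN
    -- Ask the planner for its row estimate without executing the query.
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
END;
$$;

-- Rough page count from statistics rather than a full count(*)
SELECT ceil(estimated_count('SELECT * FROM data WHERE value BETWEEN 0 AND 10') / 50.0) AS pages;
```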
Next piece of content- "RETURNING Modified Rows". This is from sqlfordevs.com and he's talking about potentially rethinking how you perform the UPDATE to the database. So in this case, he's doing the update that he wants done, but then returning all of those rows. You get the returned result set of this update task and you get all of the columns back for the rows that were touched by this. So basically you don't have to do a secondary SELECT or do a SELECT first and then do an update. You could also do this with DELETE statements and even INSERT statements. So basically whatever rows are the result of these statements, you can return specific columns or all columns from that result set. So this was just a quick blog post explaining how to do that.
Next piece of content- "Easy, alternative soft deletion: 'deleted_record_insert'". This is from brandur.org and we covered a post in a previous episode of Scaling Postgres on "Soft deletion probably isn't worth it". He said he's seen numerous cases supporting soft deletion where say you added a deleted_at column and when you need to delete data, you don't delete the row, but you actually update the deleted_at column to a value, something other than NULL and then that record is considered soft deleted. When you query the table, you now need to say where deleted_at is NOT NULL to make sure you're not getting those deleted rows. But this causes problems with foreign keys and potentially querying the data directly, getting those deleted records when it wasn't intended.
That post walks through those issues, and this is a follow-up that walks through the result of making these changes, where he's not doing soft deletion. Instead, he created a dedicated deleted_record table, and if data needs to be deleted, it gets deleted but then inserted into this deleted_record table, so you still have a record of what was deleted. It's just in its own dedicated table. You don't have to worry about foreign key issues and you don't have to worry about querying the table and accidentally pulling deleted records. He even made an enhancement here where you could use a simple function such that when you have a delete, it inserts into this deleted_record table and stores all the columns in a JSON field. To use it, you just create a trigger and say AFTER DELETE ON a particular table, for each row, execute the function. So I really like this solution and I may actually consider it the next time I'm looking to implement a soft-delete solution, but definitely check out this blog post if you want to learn more.
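Here's a minimal sketch of that pattern; the deleted_record columns and the invoices table are simplified assumptions rather than his exact schema:

```sql
-- A single table that receives a JSON copy of every deleted row
CREATE TABLE deleted_record (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    deleted_at timestamptz NOT NULL DEFAULT now(),
    table_name text NOT NULL,
    data       jsonb NOT NULL
);

CREATE FUNCTION deleted_record_insert() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- Store the deleted row's columns as JSON, tagged with its source table
    INSERT INTO deleted_record (table_name, data)
    VALUES (TG_TABLE_NAME, to_jsonb(OLD));
    RETURN OLD;
END;
$$;

-- Attach it to any table whose deletions you want to keep a record of
CREATE TRIGGER deleted_record_insert
    AFTER DELETE ON invoices
    FOR EACH ROW EXECUTE FUNCTION deleted_record_insert();
```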
Next piece of content- "How to JSON in PostgreSQL". This is from dev.to in Francesco's area, and this is a summary blog post on how to use JSON with Postgres and they're basically links to more comprehensive blog posts about how to use all the different JSON features in Postgres. So if you're interested in learning about that, definitely check out this blog post.
Next piece of content, Bruce Momjian did a new presentation, and this is a PDF of it called "Beyond Joins and Indexes", which can be found at momjian.us. This is a follow-up presentation to his "Explaining the Postgres Query Optimizer" talk. This covers everything not related to joins and indexes, basically all of these 42 topics here. So it's definitely quite comprehensive, and I believe it's 69 pages. So if you want to learn more about Postgres and query or statement optimization, definitely check out this piece of content.
Next piece of content- "PostgreSQL largest commits". This is from peter.eisentraut.org and he looked through the git log for each version of Postgres to see what was the largest commit for it. So as of the most recent release, 15, it was the MERGE command, which was definitely a huge endeavor. For Postgres 14, it was extended statistics on expressions. For Postgres 13, it was allow partition wise joins in more cases, so more parallel processing, which is great. Version twelve, the biggest one was the partial implementation of SQL/JSON path language. That's great. Version eleven was replacing the catalog format mostly. This is for internal developers. Postgres ten was the logical replication feature, which was definitely huge and so far for 16, it's on the initial version of the meson based build system. So definitely something more internally related for this one, but you can check out this blog post if you want to learn more.
Next piece of content- "Supabase Wrappers, a Postgres FDW framework written in Rust". This is from supabase.com and they're talking about a new framework that they developed that's been reworked to use foreign data wrappers not only against databases but even third-party APIs. So the first release of this supports querying Firebase and Stripe. So querying the Stripe API from your database. So not getting a library for your language of choice and communicating with Stripe, but just doing it directly in the database, creating a foreign data wrapper that you could use a direct data call to the API. This is fascinating to me and they have plans for writing wrappers for ClickHouse, BigQuery and Airtable. So if you want to learn more about this and how to implement it, definitely check out this blog post.
Next piece of content- "Horizontally Scaling PostgreSQL". This is from pgdash.io and this is a comprehensive blog post on all the different ways that you can horizontally scale Postgres. So they talk about Read Replicas, where you can get streaming replication set up and be able to read from multiple replicas. They talk about logical replication and how that enables you to logically replicate individual tables or even specific data by setting filters on columns and rows. They talk about multi master and how that's not really a thing for Postgres, although there are third-party tools that offer that option.
Then they talk about synchronous replication. So by default, replication is asynchronous, but you can make replicas synchronous and ensure that once you commit to the primary, it's actually committed in some other databases as well. And then they follow up with foreign data wrappers and discuss being able to communicate with other types of data sources with your database. So check out this blog post if you want to learn all the different ways where you could do horizontal scaling.
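To make the synchronous option concrete, it's controlled by the synchronous_standby_names setting on the primary; the standby names here are placeholders:

```sql
-- Require at least one of the named standbys to confirm the WAL flush
-- before COMMIT returns to the client. Takes effect after a reload.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby_a, standby_b)';
SELECT pg_reload_conf();

-- Check which standbys are currently considered synchronous
SELECT application_name, sync_state FROM pg_stat_replication;
```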
Next piece of content- "Upgrading PostgreSQL Extensions. " This is from percona.com and they're talking about when you have an extension you can actually check its version and even upgrade it to a new version when it comes out and they give you the commands and show you how to do it here.
Next piece of content- "A PostgreSQL Developer's Perspective: Six Interesting Patches From November's Commitfest". This is from timescale.com and some of the commits they are talking about are two security improvements. One is using regular expressions for usernames in pg_hba.conf. The second is allowing particular users to vacuum and analyze tables. The next two cover transparent column encryption capabilities. The next one covers the groundwork to move from 32-bit transaction IDs to 64-bit ones. Again, this doesn't get a 64-bit transaction ID, but it starts laying the groundwork for getting to that point. And then the last one of note is doing logical replication for DDL. So it already replicates INSERTs, UPDATEs, DELETEs, AND TRUNCATEs, but this one would also potentially do table changes, so actual schema changes as well. So definitely check out this blog post if you want to learn more.
Next piece of content- "Using pgbench to load data faster, and the random_normal function in PG16". This is from pganalyze.com and this is their "Five minutes of Postgres" episode. The post they discussed is "Generating lots of test data with Postgres, fast and faster". This is something we covered on last week's Scaling Postgres, but you can definitely watch this blog post to get his insight into this blog post.
As well as that, he covers this post here, "Waiting for PostgreSQL 16- Invent random_normal() to provide normally-distributed random numbers". And this is from depesz.com and you can see when you're using this function you can actually create a normal distribution of data. This could be advantageous if you want to fabricate data using generate_series to do testing. So you can check this out if you want to learn more about it.
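Here's a quick sketch of what that looks like on a build that includes it (PostgreSQL 16):

```sql
-- random_normal(mean, stddev); both arguments default to 0 and 1
SELECT random_normal();           -- standard normal distribution
SELECT random_normal(100, 15);    -- mean 100, standard deviation 15

-- Fabricate a normally distributed test column with generate_series
SELECT g AS id, random_normal(100, 15) AS score
FROM generate_series(1, 10000) AS g;
```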
Next piece of content- "Rethinking buffer mapping for modern hardware architectures". This is from orioledata.com. OrioliDB is actually a new storage engine for PostgreSQL, so it tries to take a modern approach to handling data that results in a number of advantages, a lot of which are performance-related. So the first part is the buffer work and that is what this blog post covers "Rethinking buffer mapping for modern hardware architectures". And it goes through the process of how they've redesigned it to achieve, in a lot of cases, four times greater throughput for read-only scalability tests.
But they also have reworked MVCC to use an UNDO log concept. So as opposed to marking a row for deletion, which happens now in a table in Postgres, this actually maintains an UNDO log. And then the third is supporting copy-on-write checkpoints. So it's all about giving Postgres better performance. So this is the first blog post that I've seen them do on their website. And you should know that this is alpha software. At this point, it's not even in beta yet, but if you want to learn more about it, definitely check out this blog post.
Next piece of content is two blog posts about Advent of Code: "How to Solve Advent of Code 2022 Using Postgres", days eleven and twelve. These are from crunchydata.com. You can check these blog posts out to see their solutions.
Next piece of content- "Timezone Transformation Using Location Data & PostGIS". This is from crunchydata.com. They're talking about a scenario where they were receiving some data and the timestamp did not have the time zone, but they had geolocation data. So they used the geolocation data to look up what zone it should be in and then modified the timestamp from a UTC timestamp to one in the proper time zone. So if you want to learn how to do that, you can check out this blog post.
Next piece of content- "Avoiding Constraint violations while migrating Oracle to PostgreSQL- DATE data type". This is from migops.com, so definitely something valuable to know if you're looking to migrate from Oracle to Postgres.
The next two posts are also related to Oracle to Postgres migrations. This one is "SUBSTR Functionality Differences Between Oracle and PostgreSQL: What You Need to Know" and this is from databaserookies.wordpress.com. And the next post, also from them, is "Unleashing Boolean datatype casting in PostgreSQL". So that's definitely three pieces of content to check out if you're thinking about an Oracle to Postgres migration.
Next piece of content, there was another episode of Postgres FM this week. This one was on "Copying a database". So if you're looking for all different ways to move your database from one cluster to another, definitely check out this piece of content. You can listen to the episode or watch it on YouTube.
The next piece of content, the PostgreSQL person of the week is Francesco Tisiot. If you're interested in learning more about Francesco and his contribution to Postgres, definitely check out this blog post.
And the last piece of content: we did have another episode of The Rubber Duck Dev Show this past Thursday afternoon. We've actually changed the time on that, but this one was on "Adventures in Note Taking with Seb Wilgosz". So this wasn't developer-heavy content, but it was definitely all about learning and how to retain what you learn. And we talked all about different tools that you can use to help learn and retain the information that you need. So if you're interested in that type of content, we definitely welcome you to check out our show.