Aurora vs. Postgres, Surprising Transactions, Write-Only & Read-Only, Indexing Advice | Scaling Postgres 195
In this episode of Scaling Postgres, we discuss Aurora vs. Postgres, surprising transaction behavior, write-only & read-only database connections and indexing best practices.
Content Discussed
- Is Aurora PostgreSQL really faster and cheaper than RDS PostgreSQL – Benchmarking
- What developers find surprising about Postgres transactions
- Taking Advantage of write-only and read-only Connections with pgBouncer in Django
- Some indexing best practices
- Postgres and JSON
- Surviving Without A Superuser - Part One
- Extracting and Substituting Text with Regular Expressions in PostgreSQL
- kill that backend!
- A Look Inside PostgreSQL’s Extended Query Protocol
- pg_rewrite: PostgreSQL Table Partitioning
- pg_graphql: A GraphQL extension for PostgreSQL
- Emre Hasegeli
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Is Aurora PostgreSQL really faster and cheaper than RDS PostgreSQL – Benchmarking". This is from migops.com, and they're talking about how they were hearing from some of their customers that they were seeing pretty big CPU utilization spikes on their Aurora instances. Maybe not spikes, but just high CPU utilization that caused them to get bigger instances, which of course has a budgetary impact. So they wanted to do an analysis to benchmark Aurora PostgreSQL versus RDS PostgreSQL. Now, RDS PostgreSQL is basically Postgres that Amazon hosts on your behalf. Aurora PostgreSQL is not plain PostgreSQL. To my knowledge, it doesn't use the Postgres storage engine underneath; that part is Amazon's own proprietary technology.
But it provides a PostgreSQL interface, so you can configure it that way. Still, to my knowledge it is a separate engine underneath, so it can behave differently than Postgres. They mention that Amazon indicates Aurora can give you three times the performance of Postgres, but of course the question is under what circumstances that happens. Because in their analysis, they actually saw RDS Postgres outperform Aurora given a similarly sized instance, and the resource utilization of Aurora was much higher. The tool they used to run these benchmarks is pgbench, and they found what I think are some interesting results. They used a db.r5.xlarge for both RDS and Aurora, in the same zone, so the same instance type.
The storage is handled a little bit differently: it looks like Aurora auto-scales its storage, whereas with RDS you specify it. They initially started with 900 IOPS, based on the size of their storage, but they increased it to 3,000 because when they initially ran their pgbench test, RDS Postgres was about half the performance of Aurora. When they started to dig in to see what was causing this, they were seeing a number of wait events related to data file reads, WAL syncs, and WAL writes. So it clearly looks IO-bound. They then looked at what their billing was, and they say the Aurora billing here was equivalent to $170. So they went ahead and upgraded the RDS instance to 3,000 IOPS to try to make the pricing equivalent, I'm assuming.
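As an aside, if you want to see this kind of wait event picture on your own system while a benchmark is running, you can sample pg_stat_activity. This is just a generic query, not something from the post:

```sql
-- Sample what active backends are currently waiting on (run it repeatedly
-- during a benchmark). Wait events like DataFileRead, WALSync, and WALWrite
-- showing up frequently suggest the workload is IO-bound.
SELECT wait_event_type,
       wait_event,
       count(*) AS backends
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC;
```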
But as soon as they bumped the IOPS up, RDS PostgreSQL started outperforming Aurora. At the four, eight, and twelve client loads across different runs, it outperformed Aurora, and the transactions per second were relatively consistent across the tests. Something else that was interesting is that the CPU utilization was quite different: RDS CPU utilization was maybe around 30% during the runs, whereas it reached almost 70% on Aurora. It was a similar story with IOPS utilization on RDS versus Aurora. RDS looks like it was peaking around 4,000 IOPS, which is interesting because I thought they had set it at 3,000, whereas Aurora was dramatically higher, looking like 40,000 versus 4,000. So roughly tenfold higher IOPS utilization on Aurora, which of course may result in increased costs.
Then they looked at the IO queue depth, where RDS seems to stay relatively consistently under ten except for some spikes, while Aurora was spiking all over the place, with some spikes over 100 and others around the 75 mark. So, definitely an interesting set of tests comparing Aurora versus RDS, or essentially Postgres. I frequently tell my clients that Aurora is not Postgres; it's a separate engine, so you should expect to see performance differences. The question is whether those performance differences are going to be better or worse than PostgreSQL. Clearly, using pgbench in this example, it looks to be worse, perhaps. But maybe there are other benchmarks that would show Aurora outperforming Postgres, based upon how its engine has been designed and where it differs from Postgres. Definitely an interesting piece of content. I highly suggest you check it out.
The next piece of content- "What developers find surprising about Postgres transactions". This is from blog.lawrencejones.dev and he's talking about transactions and the case of repeatedly reading something. So a lot of transactions are used to ensure that you can do an atomic update of, say, two tables. Like you want to decrement a balance in one account and increment it in another. You want those to happen within a transaction and those updates to either happen or not. On the whole, you don't want, say, money decremented but not incremented in the other account. In this example, he's looking at doing, within a transaction, a select to get some information and that even though you're doing a transaction in Postgres another connection has changed the information where the ID equals bananas in the organization's table. When you do this next query, it's possible you're going to get different information.
The reason is that the default isolation level in Postgres is read committed. So as soon as any transaction is committed, you can read it, and it doesn't matter whether you're in a transaction or not. As soon as another connection updates the information you selected, if you go to select it again, you're going to get that up-to-date information. It's not going to return the same data just because you are in a transaction, because it shows you anything that has been committed. Now, this post talks about one way to avoid this, which is to use locks. He shows an example where you essentially take a lock when selecting the row, and then you'll get the same information with the line three query. So basically, that is an explicit row lock, as he says. However, there is another way to do this: changing the isolation level. In addition to read committed, there's also repeatable read.
That would enable you to query this data as many times as you want, and you will get the exact same information as it existed at the start of the transaction. So it's going to keep a version of the data around for this transaction to allow you to retrieve the same information: even if another session has committed changes to that record, it keeps the old version around for you to reuse. Now, one disadvantage of changing the isolation level to repeatable read is that there is a chance the query has to be canceled. If you look at the Postgres documentation on the repeatable read isolation level, it is possible to get an error that says "could not serialize access due to concurrent update". So if another session is updating the same row and you're doing updates or changes within your transaction, Postgres could have to roll back your transaction.
Basically, you have to start the process over again. So locking, as he's suggesting here, won't result in this possible failure, but you're potentially reducing your concurrency because no one can make changes to this record while the transaction is going on. Whereas with repeatable read, all the updates can still happen, so there's no blocking from a lock; however, there are cases where you will hit a serialization error and have to redo your work if you choose this path. Definitely an interesting post. If you want to learn more, I encourage you to check out this piece of content.
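As a rough sketch of the two approaches (again, not his exact code; FOR SHARE is shown here, though FOR UPDATE also works if you intend to modify the row):

```sql
-- Option 1: explicit row lock. Other sessions cannot UPDATE or DELETE this
-- row until the transaction finishes, so repeated reads stay consistent.
BEGIN;
SELECT * FROM organizations WHERE id = 'bananas' FOR SHARE;
-- ... later reads of the same row within this transaction see the same data ...
COMMIT;

-- Option 2: a stronger isolation level. All reads in the transaction see the
-- snapshot taken at the first query, but concurrent changes can cause
-- "could not serialize access due to concurrent update", requiring a retry.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM organizations WHERE id = 'bananas';
-- ... reading the same row again returns the same data ...
COMMIT;
```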
The next piece of content- "Taking Advantage of write-only and read-only Connections with pgBouncer in Django". This is from enterprisedb.com and they're talking about setting up an environment where your application is specifically Django, in this case. But there are other application frameworks that can do this as well. Basically, the fact that they can speak to multiple databases. So they've set up an environment here where your application framework or your application server can speak to two different databases, a write-only and a read-only database. Then what they do is they send them through pgBouncer and this acts as a proxy to point to the primary database. The reads then go through an HAProxy to speak to one or more standbys so that you're going to get a read from one of the standby databases. So this post doesn't go into how to set up the high availability here, but it talks about configuring your application with pgBouncer and HAProxy to be able to send writes to one database and reads to another.
So the pgBouncer configuration simply sets up a reads database that talks to HAProxy and a writes database that talks to the primary. HAProxy then receives the connections and sends them to one of the two standby servers using round-robin balancing. Then you configure your application framework to speak to both databases: basically, your primary goes to the writes database and your standby goes to the reads database. In Django in particular, you can set up database routing so it automatically knows to send reads to the reads database and writes to the writes database, but of course you could do this with other frameworks as well. So this is a very simple implementation showing how you can separate your reads and your writes, send them to different databases, and then scale out at least your reads. Definitely an interesting blog post. I suggest you check it out.
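One simple sanity check once something like this is wired up, and it's just a generic Postgres call rather than anything from the post, is to ask each pgBouncer database which side it landed on, since a standby reports that it is in recovery:

```sql
-- Against the "writes" connection this should return false (the primary);
-- against the "reads" connection it should return true (a standby).
SELECT pg_is_in_recovery();
```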
The next piece of content- "SOME INDEXING BEST PRACTICES". This is from pgmustard.com and this is just a numbered set of some best practices. Number one, don't index every column. Definitely true. Every index has a price to be maintained when you're inserting or updating records. And you would avoid a lot of heap-only Tuple updates because those indexes have to be updated. The second recommendation is to index columns that you filter on. So anything in a WHERE clause, you're generally going to want to have an index that can access it. The third is to only index data that you need to look up. Now specifically, they're talking about utilizing partial indexes when necessary. So if you look at rows in the table where deleted it is NULL. So when it gets deleted, this column is updated with the date of the deletion. Generally, it is a good practice to create the indexes as a partial index WHERE deleted_at is NULL.
You could also do this with statuses: maybe you only want to index those rows where the status is complete, or where the status is active or inactive. The fourth consideration is to think about other index types. The B-tree index is the standard index type, but if you have text search to do, or you have a JSONB column, you're generally going to want to use GIN indexes for that. Or there are certain use cases with a lot of data where you may want to use a block range index, or BRIN index. And of course, spatial data uses GiST indexes. So you can check out the different index types to see how they can benefit you (a few of them are sketched a bit further down). Tip five is to use indexes to presort data. I was a little confused by this, but I think what they're talking about is that if you presort the data you're going to index, you can get some advantages, as long as that data doesn't change after the fact.
They have a link about loading the data in a sorted fashion, and you can avoid a bitmap index scan if the order of the index matches the order of the data in the table. That can work very well for things like data warehouses, data marts, and things of that nature. But if you're constantly updating and inserting data, it may not work out as well; still, it's definitely something you can take advantage of. Number six is to use multicolumn indexes, but sparingly. I see a lot of performance benefits from using multicolumn indexes, so I may not necessarily agree with "sparingly", but when I do a first pass of indexing a table, I usually don't use multicolumn indexes unless I absolutely know I need them.
I basically put indexes on the foreign keys and other columns I know I'm going to be using in a WHERE clause. The other thing not mentioned here is that an index can also be used for ordering, so you should consider indexes for columns you ORDER BY, because that can definitely lead to some performance improvements. Number seven, look after your indexes. So basically, yes, if they get bloated, it's good practice to reindex them periodically, and in particular, always use REINDEX CONCURRENTLY. These were some pretty good recommendations, and if you want to learn more, you can check out this blog post.
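To pull a few of these together, here's a rough sketch of those other index types and the ordering and maintenance tips; the tables and columns are made up for illustration:

```sql
-- GIN index for containment queries on a JSONB column.
CREATE INDEX idx_events_payload ON events USING gin (payload);
-- e.g. WHERE payload @> '{"type": "signup"}'

-- BRIN index for a very large, naturally ordered column like a timestamp.
CREATE INDEX idx_measurements_recorded_at
    ON measurements USING brin (recorded_at);

-- GiST index for spatial data (with PostGIS, for example).
CREATE INDEX idx_places_geom ON places USING gist (geom);

-- A multicolumn index that serves both the filter and the ORDER BY,
-- avoiding a separate sort step.
CREATE INDEX idx_orders_customer_created
    ON orders (customer_id, created_at DESC);

SELECT *
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 20;

-- Rebuild a bloated index without blocking writes (Postgres 12+).
REINDEX INDEX CONCURRENTLY idx_orders_customer_created;
```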
Next piece of content- "Postgres and JSON". This is from aklaver.org and the title wasn't necessarily 100% clear to me what it was discussing. I think a more appropriate title would have been Creating an Audit Trail Using Postgres and JSONB Fields. Because that's essentially what this post is. So they created an archive test table that has a number of columns in it and then they created a companion DELETE table. So this is going to hold any deletions that happen to this table. Then they created a function that takes the old data and uses the function row to JSON. So it converts that row of data that's being deleted into JSON and then it stores it into the delete version of the table, the archive test delete table. Then it creates the trigger and it executes that procedure, the archive record procedure that was created after a delete. So the advantage of this is that this can now persist through schema changes.
So if you change the schema of the archive test table, it doesn't matter, because the whole row is stored in a JSONB field that they call the record field. They give an example where they insert three rows into the table and then delete a record, and you can see what gets inserted into the delete table: you basically have a record field with all the values as JSONB. They also showed that you can add a column and it still works as normal; it just adds additional information to the JSONB. Now, what's interesting is their technique for getting data back out of the audit table. They do a SELECT from the table, but then they do a lateral join and use jsonb_populate_record on the record field, which takes all that information and turns it back into a SQL record. So, definitely an interesting way to set up auditing for particular tables.
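Here's a rough sketch of that pattern as I understand it from the description, with simplified names rather than their exact archive test tables:

```sql
-- Original table and a companion table that stores deleted rows as JSONB.
CREATE TABLE items (
    id    int PRIMARY KEY,
    name  text,
    price numeric
);

CREATE TABLE items_deleted (
    deleted_on timestamptz DEFAULT now(),
    record     jsonb
);

-- Trigger function: convert the deleted row to JSON and archive it.
CREATE OR REPLACE FUNCTION archive_record()
RETURNS trigger AS $$
BEGIN
    INSERT INTO items_deleted (record)
    VALUES (row_to_json(OLD)::jsonb);   -- to_jsonb(OLD) would also work
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_archive
AFTER DELETE ON items
FOR EACH ROW EXECUTE FUNCTION archive_record();

-- Later, turn an archived JSONB record back into typed columns.
SELECT d.deleted_on, r.*
FROM items_deleted d,
     LATERAL jsonb_populate_record(NULL::items, d.record) AS r;
```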
The next piece of content- "Extracting and Substituting Text with Regular Expressions in PostgreSQL". This is from blog.crunchydata.com, and this describes exactly what it's doing. It's learning to use regular expressions in Postgres. So they go over a quick refresher of how to use regular expressions and then how you can do searches within the database using them using the tilde operator, as well as how you can extract specific text using substring and regular expressions, as well as do substitutions. So if you want to learn more, you can definitely check out this blog post.
The next piece of content- "PG_REWRITE: POSTGRESQL TABLE PARTITIONING". This is from cybertec-postgresql.com, and this is a new open-source tool and extension that helps you rewrite a table into a partition structure. So they show you how to get access to the code for this, how to install it, and how to enable it. You need to make it a part of your shared preload libraries, and you do need to set the wal_level to give it enough information to set up the partitioning scheme. Then you create the extension within your database because you do need access to the partitioning command. They give an example of a table that they wanted to convert and they made two partitions out of it.
Essentially, you just run one command, SELECT partition_table(...), where you specify the table you want to partition, the partitioned table you want the data moved into, and what the old table should be renamed to. After you run it, you end up with a structure where the old table has been renamed: it used to be t_number and is now t_old_number, and the new partitioned table is no longer named t_part_number because it has been renamed to t_number, the original name of the old table. So essentially it's a swap-in replacement. They said this utility should work without heavy locking and you should be able to do it online, but they didn't go into a lot of detail about how it works. Since the source code is open, you can definitely check that out. So if you're looking to convert a table into a partitioned scheme, maybe you want to check out this new utility.
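And the command itself would look something like this given the description above; treat the exact argument order as something to verify against the extension's documentation:

```sql
-- Arguments, per the description in the post: the source table, the
-- partitioned table to move the rows into, and the new name for the old table.
SELECT partition_table('t_number', 't_part_number', 't_old_number');

-- Afterwards, t_number is the partitioned table and the original,
-- non-partitioned table lives on as t_old_number.
```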
Next piece of content- "pg_graphql: A GraphQL extension for PostgreSQL". This is from supabase.com. They're talking about a new open-source tool release called pg_graphql, and they do say it is a work in progress, but it's something that they're working on and they've just released it. Basically, it uses Postgres to provide a GraphQL interface. So if you're interested in something like that, you can definitely check out this blog post.
The last piece of content, the PostgreSQL person of the week is Emre Hasegeli. So if you're interested in learning more about Emre and his contributions to Postgres, definitely check out this blog post.