Companion Databases? | Scaling Postgres 292
Join Over 1,000 Engineers & Get New Episodes Weekly!
In this episode of Scaling Postgres, we discuss using companion databases, multi-tenancy database design, whether SQL is good, and different transaction isolation levels.
Content Discussed
- Ludicrous Speed Postgres
- Companion databases
- Designing Your Postgres Database for Multi-tenancy
- Is SQL Good?
- Transaction Block Isolation Levels
- River: a Fast, Robust Job Queue for Go + Postgres
- Exploring a Postgres query plan
- Postgres + Citus + Partman, Your IoT Database
- Citus live query inspection using citus_stat_activity
- Teach Kelvin Your Thing (TKYT) — High Performance PostgreSQL for Rails
- Learn PostgreSQL - second edition
- Explore Table Access Method Capabilities: Sequential Scan Analyzed
- Experiencing WAL REDO in PostgreSQL
- An Introduction to Indexing and Partitioning in PostgreSQL
- PostgreSQL on s390x
- Fast switchovers with PgBouncer on Amazon RDS Multi-AZ deployments with two readable standbys for PostgreSQL
- PASS Session: Postgres Extensions Shape the Future
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Now, before we get to the first post this week, as I've mentioned in previous episodes, Scaling Postgres, my course is available for early adopters and as you can see, the title is "Ludicrous Speed Postgres". So basically this is my performance optimization course. This is for early adopters because the course is not complete yet, it's actively being worked on. The projected release date is January 29, 2024. So if you want to join, you do have a limited time to do that. So you can use the link or you can just go to the Courses section and it will be available here as another course by the time this video is posted.
But this week's first piece of content is "Companion databases". This is the episode of Postgres FM that happened last week and Nikolay and Michael were talking about databases that you use alongside your Postgres database. So maybe you have a Redis cluster that you've set up for doing job queues, or maybe you've set up an Elastasearch engine for doing searches, or an analytical database, or a vector database for storing embeddings. So they talked about, does this make sense, does it not make sense? Their conclusion is, which I definitely agree with, that you can probably go longer than you think just using Postgres for all of these different features. Because Postgres can do full-text search, it can store your embeddings and be able to query them. It can do some analytical workloads. It's just when you're really pushing the envelope of what a single instance can do, that you need to start choosing a different path.
So maybe that's using a separate type of data store for your analytics events or maybe you need help scaling out and using something like Citus. Maybe you have a lot of time-based data you're working with and want to use something like TimescaleDB, you just don't have all the search features that you want, or you need a little bit more performance. Maybe you do need to set up an Elasticsearch cluster. The downside of doing that is that you're going to have to have more management, you're going to have to learn a whole new set of technologies and how to use them, as well as keep up with data synchronization across all these different database systems. But I thought this was a really good episode and encourage you to check it out. You can listen to the podcast here or you can watch the YouTube video here.
Next piece of content- "Designing Your Postgres Database for Multitenancy". This is from crunchydata.com. They're talking about the three typical approaches for doing multitenancy in a database. So that basically means you have multiple accounts that are going to be using the same database and their data needs to have firewalls between them. So data should never cross over one another and get mixed up. The first way to do it is to have one database per customer. And this is definitely the hardest to do because imagine you have ten customers and then 100, and then 1,000, and then 10,000 if you are so lucky. Well, now you have to manage 10,000 databases, which is no easy feat. So this probably only makes sense if you're going to have gigantic enterprise-based customers. The next option is a customer per schema. So you set up a schema for each customer. This is only a little bit better than every customer getting their own database because you still have to run separate migrations in every schema for every customer.
They said if you're using Citus, it does have a few tools to kind of help with some of it. They say you can join across customers for cross-tenant analytics. But again, I would think that would be really hard with a lot of schemas. But basically at scale, this breaks down as well. The third way is to have what they're calling a tenant discriminator within each table. And that is to say, having an account ID, a customer ID or a tenant ID. In their example here, they're showing an org_id, just whatever identifier you use for each independent customer of your service. That Identifier needs to be placed in every table so that you can identify who owns what data. This is a technique that I've seen in every software as a service application I've worked on. I have not seen one that uses a schema or a database per customer.
They've all used this technique and they recap the approaches down here. Basically, a database per customer, maybe you have tens of tenants, but it's a very high overhead for management. Database per schema; you have hundreds of tenants is maybe the max you want to go, but you still have the overhead of schema migrations. The tenant discriminator; you can have millions of tenants. The disadvantage is the lack of strong tenant isolation. Now, they do say there are a number of libraries that kind of help with this tenant isolation, like Ruby on Rails and Django, which have different libraries to help you handle multitenancy. And then lastly, the blog post covers Citus as a way to scale out your database once you reach a certain level. But check out this blog post if you want to learn more.
Next piece of content- "Is SQL Good?". This is from momjian.us, and he's referencing a 77-email thread from two years ago that was titled "The Tragedy of SQL." That was saying we could be much further along if we didn't have to use SQL and we had another alternative query language. His proposal was using Datalog. Now, with that, I did look up Datalog and here's the Wikipedia entry, and I started looking at the syntax of it and I was like, 'Is this supposed to be easier?'. Because colon dash(:-) is supposed to mean "if" so, it's a function that says if X is a parent of Y, then X is an ancestor of Y. Then I looked down at some other syntax and looked at something like this, I hate to say it, but I'll take SQL any day just for readability. So SQL may have its problems as they're proposing, and maybe this is a better alternative for particular use cases or for certain reasons, but I would think that the usability definitely has to increase significantly to have a shot at approaching SQL. But they had other people comment on it. I thought they were some pretty interesting perspectives, but check out this post if you want to learn more.
Next piece of content is also from momjian.us is "Transaction Block Isolation Levels". And here he's talking about how isolated your particular transactions are as they're running. So by default, Postgres uses READ COMMITTED. That means for every READ that's been committed from the time you start a query, you can see everything that's been committed thus far. But in a transaction block, you may not want that to happen. You may want to have a consistent view of the database at the point the transaction started. If that's the case, you can use REPEATABLE READ, so you can change the isolation level to be able to achieve that. Or if you want to ensure that everything happens serially and avoid transactions stepping on each other's toes, you could use SERIALIZABLE ISOLATION. Just be aware it has the potential to cancel particular queries. So you need a retry setup in this case. But this is a quick post about different isolation levels you can check out if you're interested.
Next piece of content - "River: a Fast, Robust Job Queue for Go + Postgres". This is from brandur.org. At one time while using a job queue on Postgres, he ran into a lot of issues, but apparently, it was on version 9.4. So a lot has changed since then. But he's been working with queues recently and has been frustrated with a Redis-based queue because of its inability to handle transactions. So things would get in different states where a job was admitted to Redis, but the data needed for it hadn't been completed yet, so the job would fail and retry, or you have transaction rollbacks. That means the job will never be able to start, so it's continuously failing when trying to start and numerous other issues with that.
So he actually wanted to create a transaction-based job queue to avoid some of these problems. So he created River for this purpose, and it is based upon Go using pgx driver. He discusses a little bit about it here, how it works, as well as some of the enhancements he appreciates in Postgres that have happened to make it easier to create queues in Postgres. Number one being SKIP LOCKED, being introduced in 9.5, having REINDEX CONCURRENTLY in 12 or even partition tables in Postgres 14. Postgres 13 handles B-tree deduplication better and Postgres 14 is able to skip B-tree splits. So if you want to learn more about this queue, definitely check out this blog post.
Next piece of content- "Exploring a Postgres query plan". This is from notes.eatonphil.com and this is another tour de force post where he's walking through how a Postgres query plan works. So he goes really into detail on this and shows how things work. So if you want this type of depth, definitely encourage you to check out this blog post.
Next piece of content- "Postgres + Citus + Partman, Your IoT Database". This is from crunchydata.com and he's coming up with a scenario saying 'Hey, you want to set up an Internet of Things data set?'. So basically you have a bunch of sensors you want to monitor, in this case, cars. So he's tracking cars. You have the sensor data and the location data and the first thing he does, he wants to shard the data set. So he's going to be using Citus to do this. Now, he does reference device_id in the text but in the actual code he's saying car_id. There's no device_id, so I think they're essentially equivalent. But he runs these two commands to create distributed tables across the different shards. They show an example query here, but of course, this will run across all 32 shards, which is not ideal. So you need to do the same technique you would use in partition tables by whatever key you partition by, you need to query on. Same thing with sharding, however, if you're sharding the data, you need to use that same key.
So when you specify the specific car_id you're looking at, it will then be localized to that particular shard and ideally on the particular node that that shard is running on. And as long as your same data is collocated for this particular shard, you can do joins intra-node, within the same node, and not have to do cross-node joins which would be much less performant. Then they add a time series component. They've added pg_partman, although you don't have to use this extension, you can just partition them however you like. And here they're partitioning by a timestamp. The reason is that they want to be able to archive older data. They showed an example of setting this up and then followed up with being able to do columnar compression with some of the old data to really compress down the older data. But when using Citus for columnar compression, the thing you need to keep in mind is that you can't do updates and deletes on it and logical replication or decoding doesn't work with it. But in terms of compression, they got about a 10x compression when they went to column or storage for the older data. But check out this post if you want to learn more.
Next piece of content- "CITUS LIVE QUERY INSPECTION USING CITUS_STAT_ACTIVITY". This is from cybertec-postgresql.com and if you know anything about Postgres, of course, it has a system view called pg_stat_activity to look at the current running queries on the system. Well, Citus has citus_stat_activity, so it does the same job, but it does it across the nodes. So you can query I think it's the coordinator node and be able to determine what queries are running on which nodes. So he shows an example of that here. So check this blog post if you want to learn more about that.
Next piece of content- "Teach Kelvin Your Thing (TKYT)- High Performance PostgreSQL for Rails". This is from andyatkinson.com. Andrew was on the YouTube channel "Teach Kelvin Your Thing", and he talked about his upcoming book, "High Performance PostgreSQL for Rails". So you can watch the YouTube episode here, it's about an hour in length, or you can look at the Q&A questions that he posted down below.
Last piece of content- "Learn PostgreSQL- second edition". This is from fluca1978.github.io, and this is the second edition of his book that Luca wrote along with Enrico. So if you want to check out his new editions, you can check out this blog post and even the link for the book.