Database Decomposition, Fast Text Search, Understanding pg_stat_activity, Window Functions | Scaling Postgres 223
In this episode of Scaling Postgres, we discuss the process of decomposing your database, how to search text in Postgres quickly, how to understand the output of pg_stat_activity and how to use window functions to solve a use case.
Content Discussed
- We are splitting our database into Main and CI
- Decompose GitLab.com's database to improve scalability
- Multiple Databases
- Rubber Duck Dev Show Episode 49 | Fast Text Search In Postgres
- Understanding pg_stat_activity
- Handling Bonus Programs in SQL
- Choosing a PostgreSQL Number Format
- Data To Go: Postgres Logical Replication
- PostgreSQL Replication and Conflicts
- State of PostgreSQL 2022—First Findings
- 5mins of Postgres E25: Postgres lock monitoring, LWLocks and the log_lock_waits setting
- How to log selects from specific table?
- Queries in PostgreSQL: 5. Nested loop
- Improved PostgreSQL support, performance improvements and more...
- What Is SQL Used For? To Build Environments Where Devs Can Experiment
- Julien Tachoires
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "We are splitting our database into Main and CI". This is from gitlab.com, and it's an announcement on their blog that they're splitting their database. A lot of times, when you reach the limits of how far you can scale up your database, you look for alternative solutions. One way to handle this is to start developing microservices. Generally, a microservice has its own independent database, and the services communicate with each other, typically over HTTP. Another route you could take is a scale-out solution, something similar to the Citus extension, where you essentially have multiple primary databases and the data is sharded across them. You could also do the sharding from the application perspective.
Then the application itself knows where to find the data that it needs. And with that sharding, maybe you have a certain set of customers on shard 1 and a certain set of customers on shard 2, but the schema of each database is typically the same. What they're doing in this case is something they're calling functional decomposition. They're taking their monolithic database and splitting off functionality into its own database. So, for example, they have a diagram here where they have their application. All writes go through PgBouncer to the main database, and then they have read-only replicas set up that they get their reads from. Now they want to decompose it and break off their continuous integration (CI) services into their own database cluster. So they're going to be removing approximately 50 tables from the main database and putting them into this dedicated database.
But as you can tell, the structure will be the same: writes will still go through PgBouncer to this new database cluster, which has multiple read-only replicas that the application can pull information from. Now, they are using Ruby on Rails, which has added features in the last version or two, I believe, that make it easier to speak to multiple databases, so this is probably why they chose this particular path. This is interesting because I have other clients who have chosen a similar route rather than sharding their database. Because they're dedicating databases to particular workloads, they anticipate they'll be able to increase the overall performance of the system and increase stability.
So maybe CI services aren't going to impact other parts of the system, and they're also trying to be able to ship just as fast. This does introduce some changes that need to be taken into account: before, you operated against a single monolithic database, but now you're dealing with essentially two databases. It looks like they developed some tools for this, one called loose foreign keys and another for handling data migrations across multiple databases. So when you do a migration, say adding an index or a column, it allows you to target which database that command needs to be sent to.
Now, I'm posting some other links as well. This one is "Decompose GitLab.com's database to improve scalability", which is actually an issue they created within GitLab that covers a lot of the work that was done for this. It was opened almost a year ago, so it's taken them about a year to get to this point.
In this other post on gitlab.com, they talk about multiple databases. This is essentially advice given to their engineers about how to work with this split database scenario. They talk about configuration, and one thing that is a priority is removing joins between tables in the CI database and tables in the main database. Their code could have joins across this functional boundary, but now that the tables are in separate databases, those joins of course won't work anymore. One way of dealing with this is to simply remove the code that was doing these joins: do you really need that join at all? Another option is that Rails has a capability where you pull back IDs to the application and then send another query with those IDs to the other database.
Presumably, you could do that in their scenario to avoid a join, as in the sketch below. Another is to denormalize some foreign keys onto the table itself, so the data you're looking for is stored locally. This would result in some duplication, but it prevents crossing over to the other database. They also talk about denormalizing into an extra table, so maybe you just maintain a separate copy of that table. But this is a fascinating look at one path to take when you hit a scaling ceiling. So yes, you could do microservices, or maybe you could use Citus as a scale-out extension, but this is another solution: decomposition, separating functional units out of your database into their own database. If you want to learn more, definitely check out these three pieces of content.
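Here's a minimal sketch of that "two queries instead of a join" approach. The table and column names are hypothetical, just to illustrate the idea, not GitLab's actual schema:

```sql
-- On the main database: fetch only the IDs you need
SELECT id
FROM projects
WHERE namespace_id = 42;

-- On the CI database: the application passes those IDs into a second query,
-- so no cross-database join is required
SELECT *
FROM ci_pipelines
WHERE project_id IN (101, 102, 103);
```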
The next piece of content- "Rubber Duck Dev Show Episode 49 | Fast Text Search in Postgres". This is a show that Chris and I did last Wednesday on fast text search. Basically, when you have a query where you're using LIKE or ILIKE or some regular expression to find an entry in a lot of text, what's the most efficient way to do that? Because a LIKE or ILIKE with a leading wildcard is just going to scan the table; it's not going to use a standard B-tree index. Specifically, I show how you can use Postgres's full text search features to find those text instances really quickly. For example, one solution was 1,000 times faster and another was 10,000 times faster. And basically, the more data you have, the better an index will perform in terms of finding some text. So if you want to learn more about that, you can definitely check out this piece of content.
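As one hedged example of the kind of technique we discussed (the table and column names here are assumptions, not from the episode), an expression-based GIN index lets full text queries avoid a sequential scan:

```sql
-- Build a GIN index over the tsvector of the text column
CREATE INDEX articles_body_fts_idx
    ON articles
    USING GIN (to_tsvector('english', body));

-- This query can use the index instead of scanning every row
SELECT id, title
FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & scaling');
```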
The next piece of content- "Understanding pg_stat_activity". This is from depesz.com, and this post basically goes over every single column in pg_stat_activity and explains what it means and what it does. With regard to the rows, he further explains that there are different types of backends. When someone is making a query, generally that is a client backend: a session connected to the database that's pulling data. But you're also going to find other Postgres processes within pg_stat_activity as well, such as the checkpointer, the autovacuum launcher, archiver, WAL writer, and things of that nature. He explains all of these different processes as well. Now, I'm not going to go through this whole thing, but he does a very good job of explaining each of these columns, what they mean, and how they could be useful. So if you want to learn more, definitely check out this blog post.
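If you want to see those backend types on your own system, a couple of quick queries against the standard view will show them:

```sql
-- Count processes by backend type (checkpointer, walwriter, client backends, etc.)
SELECT backend_type, count(*)
FROM pg_stat_activity
GROUP BY backend_type
ORDER BY count(*) DESC;

-- Look at just client sessions and what they're currently doing
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE backend_type = 'client backend';
```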
Next piece of content- "HANDLING BONUS PROGRAMS IN SQL". This is from cybertec-postgresql.com, and he's talking about a scenario where you have a bonus card that earns points, potentially by making purchases. He has a table here with a card number, a date, and the number of points earned. As you can see, there are two card numbers, and points are being assigned to them on particular dates. The question is, how many points does a particular person have? And as a wrinkle, he says those points expire after six months. So how can you get the total at a particular point in time? Basically, a running total. He's using window functions to do this. First he uses a window function with a RANGE frame running from six months preceding up to the current row.
In this first iteration he actually does an array aggregation, so you can see which points would be included in the sum he's eventually going to do. And you can see that after six months, all of those points essentially expire and are no longer there, and then it starts incrementing again from that point. So basically it's a six-month sliding window that presents the total. The next phase of the query actually does the sum so you can see what the results are. Then he does it for every card number using PARTITION BY to partition by card number. So this is a great example of using window functions to answer a real-world question. If you're interested in learning more, you can check out this blog post.
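A minimal sketch of that kind of query, assuming a hypothetical bonus_points table (card_number, d, points) rather than the exact schema from the post:

```sql
-- Running total of points per card, only counting points earned
-- within the six months leading up to each row's date
SELECT card_number,
       d,
       points,
       sum(points) OVER (
           PARTITION BY card_number
           ORDER BY d
           RANGE BETWEEN '6 months'::interval PRECEDING AND CURRENT ROW
       ) AS active_points
FROM bonus_points
ORDER BY card_number, d;
```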
Next piece of content- "Choosing a PostgreSQL Number Format". This is from crunchydata.com, and he's talking about how, when you're choosing a number format, ideally you want to store the data using the smallest amount of space that makes sense, represent your data with the smallest amount of error, which of course is critically important because you don't want to give the wrong answer, and manipulate your data using the correct logic. The main way to store numbers is of course integers, and that's basically the go-to format for whole numbers. There are smallint, integer, and bigint at two, four, and eight bytes respectively.
Postgres also offers floating point numbers, which I've only really used for latitude and longitude when the accuracy wasn't that important. Floating point is, of course, inexact, but it allows very, very fast computations. If you want more accuracy and can accept slower computation, then you use numeric. And if you're going to be storing currency, generally this is the type to use. Definitely don't use a float, although I have heard of people using integers instead, storing amounts in the smallest unit to get rid of any decimals. He has a little bit of discussion about the difference between numerics and floats, particularly with regard to rounding. But feel free to check out this blog post.
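As a quick illustration of why float is a poor fit for money while numeric stays exact (this is my own example, not one from the post):

```sql
-- float8 accumulates binary rounding error; numeric does not
SELECT 0.1::float8  + 0.2::float8  AS float_sum,    -- typically 0.30000000000000004
       0.1::numeric + 0.2::numeric AS numeric_sum;  -- exactly 0.3
```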
The next piece of content- "Data To Go: Postgres Logical Replication". This is from crunchydata.com, and this is a post about logical replication, comparing it to streaming replication. Basically, they lay out the different use cases. Streaming replication, as they say here, is used for high availability: it copies all the data from the primary database, replicating the whole schema and everything in place. It doesn't do any partial copying of information, and you have to be on the exact same version, so it has to be essentially binary compatible. Whereas with logical replication, you can copy all the data, but a lot of the use cases call for just copying a subset of the data on a regular basis, and you can do it between two different operating systems or two different Postgres versions. They walk through the process of actually setting up logical replication, creating a publication, and creating a subscription for it. So you can check out this blog post if you want to learn more.
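The basic shape of that setup looks something like the following; the publication, subscription, table, and connection details here are made up for illustration, and the post walks through its own example:

```sql
-- On the publisher (source) database
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On the subscriber (destination) database
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=source-host dbname=shop user=replicator password=secret'
    PUBLICATION orders_pub;
```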
Next piece of content- "PostgreSQL Replication and Conflicts". This is from percona.com, and they're talking about Postgres replication and its history, but then they go into replication conflicts. As you use a replica database, there's the potential for running into conflicts. You can check on these by looking at the pg_stat_database_conflicts view, because that gives you an indication of how often they're occurring, and generally they'll appear in the log as well. He goes over the different conflicts you can encounter, such as tablespace, lock, snapshot, buffer pin, and deadlock conflicts. So if you want to learn more about that, you can check out this blog post.
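For reference, checking that view on a replica is just a plain select; these columns are the standard ones Postgres provides:

```sql
-- Counts of queries cancelled on this standby, broken down by conflict type
SELECT datname,
       confl_tablespace,
       confl_lock,
       confl_snapshot,
       confl_bufferpin,
       confl_deadlock
FROM pg_stat_database_conflicts;
```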
Next piece of content- "State of PostgreSQL 2022—First Findings". This is from timescale.com, and they're providing an introduction to some of the results from the State of PostgreSQL survey. They talk about the demographics of those who filled out the survey: their rough geographical location, number of years using Postgres, and their current position. They present a few of the community questions as well as popular tools. Apparently the three most popular tools for queries and administration tasks are psql, pgAdmin, and DBeaver. But you can check this out for all the details.
The next piece of content- "5mins of Postgres E25: Postgres lock monitoring, LWLocks and the log_lock_waits setting". This is from pganalyze.com, and they cover a post about when you should be concerned about locking, which we covered last week on Scaling Postgres. In this piece, he goes more in-depth on those locks and also suggests enabling log_lock_waits, because that will log lock waits that exceed your deadlock_timeout, which defaults to 1 second. So definitely feel free to check out this piece of content as well for more insight into that.
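If you want to turn that setting on, it doesn't require a restart; here's a sketch of doing it from psql (it can also simply be set in postgresql.conf):

```sql
-- Log any lock wait that lasts longer than deadlock_timeout
ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();

-- The wait threshold is deadlock_timeout, which defaults to 1s
SHOW deadlock_timeout;
```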
Next piece of content- "How to log selects from specific table?". This is from depesz.com, and this blog post covers exactly what it says. His recommendation is to use the pgAudit extension, and he shows you how to set it up. Basically, on Ubuntu you can just install it and add it to shared_preload_libraries. You do have to restart your database because you altered shared_preload_libraries. Then you create the extension, define an audit role, and grant select on the table of interest to that role, and it pretty much works, logging the information to the Postgres log. There was also an alternate, kind of hacky, way mentioned: creating a function and then a view that calls the function. But he says really you should just use the pgAudit extension. Again, if you want to learn more details, definitely check out this blog post.
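A minimal sketch of that object-auditing setup, assuming pgaudit is already in shared_preload_libraries and the server has been restarted (the role and table names here are just placeholders):

```sql
CREATE EXTENSION pgaudit;

-- A role used only to define which objects get audited
CREATE ROLE auditor NOLOGIN;
ALTER SYSTEM SET pgaudit.role = 'auditor';
SELECT pg_reload_conf();

-- Any SELECT against this table is now written to the Postgres log
GRANT SELECT ON important_table TO auditor;
```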
The next piece of content- "Queries in PostgreSQL: 5. Nested loop". This is from postgrespro.com, and they're talking about joins in general, but also how nested loop joins work at a very, very detailed level. So if you're interested in that, definitely check out this blog post.
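If you've never looked at one, a nested loop shows up right in a query plan; a hypothetical query like this will often produce one when the outer side is small:

```sql
-- EXPLAIN shows the join strategy the planner picked (e.g. Nested Loop)
EXPLAIN
SELECT o.*
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.id = 42;
```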
The next piece of content- "What Is SQL Used For? To Build Environments Where Devs Can Experiment". This is from timescale.com and this is another interview with someone from the Postgres community, Nikolay Samokhvalov. So if you're interested in learning more about Nikolay and his contributions to Postgres, definitely check out this blog post.
The last piece of content. The PostgreSQL person of the week is Julien Tachoires. If you're interested in learning more about Julien and his contributions to Postgres, definitely check out this blog post.