ENUMs vs Check Constraints, Faceting With Roaring Bitmaps, Better Scaling, In DB Business Logic | Scaling Postgres 245
In this episode of Scaling Postgres, we discuss ENUMs vs. check constraints, querying table facets with roaring bitmaps, a better way to handle scaling and whether you should store your business logic in Postgres.
Content Discussed
- Enums vs Check Constraints in Postgres
- Faceting large result sets in PostgreSQL
- pgfaceting Extension
- Roaring Bitmaps and pgfaceting: Fast counting across large datasets in Postgres
- ISO Better Scaling, Instacart Drops Postgres for Amazon DynamoDB
- Rubber Duck Dev Show Episode 68 | Should You Store Business Logic In Your Database?
- Debugging Postgres WAL events with pg_walinspect
- The bountiful world of Postgres indexing options
- PGSQL Phriday #003: What is the PostgreSQL Community to me
- Postgres support for JSON is 10 years old!
- Timestamps
- Aleš Zelený
Transcript
All right, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Enums vs Check Constraints in Postgres". This is from crunchydata.com, and they were discussing an issue that was brought up in their internal Slack channel, where someone asked if anyone actually uses enums out in the wild. And the surprising conclusion was that they basically don't. If you're not familiar, they show you how to create an enum. You basically make your own type: you do CREATE TYPE, say order_status, AS ENUM, and then you can use that as a data type within your tables. You can see that they created a status column with the data type order_status. So that's how you can do enums.
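A minimal sketch of the enum approach they describe (the specific status values and surrounding table are my assumptions, following the post's order_status example):

```sql
-- Create the enum type, then use it as a column's data type.
-- The value list here is illustrative.
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status order_status NOT NULL DEFAULT 'pending'
);

-- Values outside the enum are rejected at the type level:
-- INSERT INTO orders (status) VALUES ('lost');  -- fails with an input error
```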
But apparently, they're not used that much in the wild, and frankly, I have never used them in Postgres either. Now, I've handled this in different ways, but what they propose as the easiest way is to use check constraints. So you can quite easily define your table and just say, these are the valid values for this particular column. I know the framework I tend to develop in, Ruby on Rails, has its own way of creating an enum. But essentially it just creates an integer column in the table for, say, the status field. So there's no way to run a SQL report and get the actual data; you have to go through the object model in Ruby on Rails to do it. And that's not a good solution. Whereas I think using a check constraint is a very good solution compared to these different methods of using enums. But if you want to learn more about that, you can check out this blog post.
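The check-constraint alternative can be sketched like this (again, the column and value names are illustrative, and the constraint name assumes Postgres's default naming):

```sql
-- Same set of allowed values, enforced with a CHECK constraint on a
-- plain text column instead of a custom enum type.
CREATE TABLE orders (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status text NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'shipped', 'delivered'))
);

-- Changing the allowed values later is a simple constraint swap:
ALTER TABLE orders DROP CONSTRAINT orders_status_check;
ALTER TABLE orders ADD CONSTRAINT orders_status_check
    CHECK (status IN ('pending', 'shipped', 'delivered', 'returned'));
```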
Next piece of content- "Faceting large result sets in PostgreSQL". This is from cybertec-postgresql.com. They're talking about a situation where, say, you have an online store, or you're searching an inventory of something like vehicles with different attributes associated with them, and you want to see the counts. So, say, how many hats are blue, red, or green, or what types of clothing are available? They created a schema with a documents table where you can place particular items in a category, and each document has tags you can associate with it, along with some other data. Basically, you want to see the distribution of all of this by type, category, size, tags, and start and end timestamps. Now, they have a bit of discussion about facets: there are categorical variables, which typically have a small, limited set of values.
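A hypothetical schema along the lines the post describes (the table and column names here are my assumptions, not the post's exact DDL):

```sql
-- Documents with a type, category, size, tags, and timestamps,
-- each of which is a candidate facet for counting.
CREATE TABLE documents (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    type        text        NOT NULL,
    category    text        NOT NULL,
    size        int         NOT NULL,
    tags        text[]      NOT NULL DEFAULT '{}',
    created_at  timestamptz NOT NULL,
    finished_at timestamptz
);
```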
There are continuous variables with a wide range of values, in which case you may want to bucket particular ranges. And they also talk a little bit about composite variables. But let's say you wanted to render this type of information for each of these facets of the search. One solution is to do multiple queries where you, say, count all the rows in the documents table and group them by type. But you would need to run a query like this for every facet that you wanted to measure, which is not very efficient. One way to handle this is to use a lateral join so that, with one query, you can get all those results back. But it's still not very fast, because essentially you have to read through the entire table to pull back the data that you need. But they said there is a way to make this faster. The key is roaring bitmaps, which I haven't heard a lot about, but apparently there's a Postgres extension for them called pg_roaringbitmap.
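The two query shapes can be sketched as follows (the documents table and its type and category columns are assumptions based on the post's description):

```sql
-- Naive approach: one GROUP BY query per facet, so N facets
-- mean N scans of the table.
SELECT type, count(*) FROM documents GROUP BY type;
SELECT category, count(*) FROM documents GROUP BY category;

-- Single-pass alternative: one scan, with a lateral join fanning
-- each row out into (facet, value) pairs, then one aggregation.
SELECT f.facet, f.value, count(*)
FROM documents d
CROSS JOIN LATERAL (
    VALUES ('type', d.type),
           ('category', d.category)
) AS f(facet, value)
GROUP BY f.facet, f.value;
```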
What this post actually did is take that and build their own extension on top, called pgfaceting, to make the implementation easier. Then they rewrote the query, and it resulted in a significant improvement in performance because it uses an inverted index to index all of these different values. If you look at the GitHub page for pgfaceting, under the "How fast is it" section, they show that a parallel scan with the lateral join approach took 18 seconds, and if you went down to a single core, 222 seconds. But using the pgfaceting extension on a single core, equivalent to that 222-second case, the results returned in 155 milliseconds. So that's an enormous performance improvement using this new type of roaring bitmap indexing. If you need to present data in this fashion, I would definitely encourage you to check this out.
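To give a flavor of the approach, here is a purely hypothetical sketch of what declaring facets with an extension like pgfaceting might look like; the function names below are illustrative assumptions, not the verified API, so consult the pgfaceting README for the real calls:

```sql
-- HYPOTHETICAL sketch, not the confirmed pgfaceting API:
-- declare which columns of a table should get roaring-bitmap-backed
-- inverted facet indexes...
SELECT faceting.add_faceting_to_table(
    'documents',
    key    => 'id',
    facets => ARRAY[
        faceting.plain_facet('type'),
        faceting.plain_facet('category')
    ]
);

-- ...then ask for counts across all declared facets in one call.
SELECT facet_name, facet_value, cardinality
FROM faceting.count_results('documents');
```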
As well as the next piece of content, which is "Roaring Bitmaps and pgfaceting: Fast counting across large datasets in Postgres". This is from pganalyze.com and this is the article that Lukas covered this week, and he includes a lot more information on it. So if you want to learn more about this, I definitely encourage you to check out his episode covering this topic as well.
Next piece of content- "ISO Better Scaling, Instacart Drops Postgres for Amazon DynamoDB". This is from thenewstack.io, and as I was looking through this article on some of the reasons why they decided to make the transition, I noticed a few points. One where they said, quote, "As busy as the daytime/evening hours are is equivalent to how quiet the nighttime/very early morning hours are". So basically, there's one time of day when they're very, very busy and one time when they're not busy at all. The main workload driving this exploration is notifications. So presumably they have a ton of notifications during the daytime/evening hours, and those drop off a cliff at nighttime/very early morning.
And the key is that Postgres didn't scale based on demand. So that's something that DynamoDB could do. The other point is that Instacart's main concern about DynamoDB was cost, not latency or scaling requirements. So basically it seemed to be not a performance reason, but how can we get the lowest cost and how can we scale down resources that are not used during particular times of the day? Now, upon reading this, I also thought about some other providers that are trying to separate compute resources from storage resources. And if you were able to do that, you could keep your storage resources consistent, but then scale up and down your compute resources to save money, potentially.
So at this time, it seems like only third-party providers are doing this. But I wonder whether, in the future, Postgres will have an open source solution for this type of concept, to dynamically scale the resources your database needs up and down over the course of a day, without having to pay for one large server sized for peak capacity. At all the other times, when it's not operating at peak, you're not utilizing all the resources of that system and you're spending more money. So I wonder whether some open source project will implement something like that in the future. But if you want to learn more details, you can check out this article.
The next piece of content is The Rubber Duck Dev Show episode 68, "Should You Store Business Logic In Your Database?". This is our weekly show, and we discussed a database topic: should you store business logic in your database? Some people consider the database just a dumb data store, and other people want to put all sorts of business logic, functions, and procedures in the database. Not so much today, but historically that was done a lot in application development, although the pendulum seems to be swinging back if all you want to write is front-end application code and have the Postgres database consume and push out JSON. We discuss that continuum, from no business logic to the majority of the business logic living in the database, and what makes the most sense from a performance perspective and from a maintenance perspective. So if you're interested in learning more about that, we definitely welcome you to check out our show.
Next piece of content- "Debugging Postgres WAL events with pg_walinspect". This is from ongres.com. There have long been utilities called pg_xlogdump and, later, pg_waldump that let you take individual WAL files and analyze them. Well, now there's a new tool as of Postgres 15 called pg_walinspect, an extension that allows you to use SQL to query the contents of the WAL. You give it a start LSN and a stop LSN, and you can analyze the contents of the WAL just by running SQL queries against the database. This is super interesting because I've used pg_waldump before to analyze WAL files, and having this right in the database, where you can apply a load and then do the analysis in SQL, would have been very beneficial. But if you want to learn more about that, you can check out this blog post.
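Basic usage looks something like the following; the LSN values are placeholders you would substitute with positions from your own server:

```sql
-- pg_walinspect ships with Postgres 15+ as a contrib extension.
CREATE EXTENSION pg_walinspect;

-- The current WAL insert position is a handy end LSN.
SELECT pg_current_wal_lsn();

-- Per-record detail between two LSNs (placeholder values shown).
SELECT start_lsn, resource_manager, record_length, description
FROM pg_get_wal_records_info('0/1E847D00', '0/1E84F520');

-- Aggregate statistics over the same range.
SELECT * FROM pg_get_wal_stats('0/1E847D00', '0/1E84F520');
```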
Next piece of content- "The bountiful world of Postgres indexing options". This is from kmoppel.github.io, and he's talking about a post that we covered last week in Scaling Postgres, which compared using two individual indexes for a query on two columns vs. using a composite (compound) index covering both columns, and how the composite index gave much better performance. He tried to replicate that and then said, "Hey, let's try using different index types". So he asked, "What about using a covering index to pull in the second column's data?". He also examined hash indexes, and he published his results here. The covering index came fairly close to the composite index's results, but merging hash indexes was even worse than merging two B-tree indexes. But if you want to learn more about these tests, you can definitely check out this blog post.
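The index variants being compared can be sketched like this (the table t and columns a and b are illustrative stand-ins, not the post's actual schema):

```sql
-- Composite index: both columns in the key, best for WHERE a = ? AND b = ?.
CREATE INDEX ON t (a, b);

-- Covering index: key on one column, with the second column carried
-- along in the index leaf pages via INCLUDE (usable for index-only scans,
-- but b cannot be used as a search condition within the index).
CREATE INDEX ON t (a) INCLUDE (b);

-- Two separate single-column indexes, which the planner can combine
-- with a BitmapAnd at query time.
CREATE INDEX ON t (a);
CREATE INDEX ON t (b);
```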
Next piece of content. There was a final blog post posted for PGSQL Phriday. This one is "What is the PostgreSQL community to me?". This is from andreas.scherbaum.la. So if you want to learn his perspective on the Postgres community, we definitely welcome you to check out this blog post.
Next piece of content- "Postgres support for JSON is 10 years old!". This is from crunchydata.com, and it is a rough overview of JSON within Postgres, with some practical advice on storage, querying, and filtering, as well as how query performance differs based upon the indexes you use. So if you want to learn more about that, you can definitely check out this article.
Next piece of content. There was another episode of Postgres FM this week. This one was on "Timestamps". So if you want to learn all about timestamps in Postgres, you can definitely check out this episode.
The last piece of content. The PostgreSQL person of the week is Aleš Zelený. So if you're interested in learning more about Aleš and his contributions to Postgres, definitely check out this blog post.