Entity-Attribute-Value Design, JSON Subscripting, mysql-fdw Push-down, New Regex | Scaling Postgres 193
In this episode of Scaling Postgres, we discuss using entity-attribute-value designs, new JSON subscripting capabilities, mysql_fdw support for aggregate push-down, and new regex functions in PG 15.
Content Discussed
- Entity-attribute-value (EAV) design in PostgreSQL – don’t do it!
- What’s new with JSON in PostgreSQL v14
- Aggregate Push-down in mysql_fdw
- Waiting for PostgreSQL 15 – Add assorted new regexp_xxx SQL functions.
- Using TimescaleDB with PGO, the open source Postgres Operator
- How to run a specific regression test
- Pavel Luzanov
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "ENTITY-ATTRIBUTE-VALUE (EAV) DESIGN IN POSTGRESQL - DON'T DO IT". This is from cybertec-postgresql.com, and they're talking about the design pattern using entity attribute values. Basically, this means using a generic entity table for all objects. You don't have an address table, a name table, or an orders table, but basically, you put everything in an entity and you attach attributes to it via different tables. So like in this example, you have your objects table, which would be, essentially, the entities. Then you have a dedicated attribute table for strings, and it just has a single string container. You have a dedicated table for integers and it has a dedicated integer column. Now, there are also designs where you have a separate attribute table that lists all of the attributes and then references the individual values.
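As a rough sketch of that kind of layout (the table and column names here are just illustrative, not copied from the post), it looks something like this:

```sql
-- A generic "objects" table stands in for every kind of entity,
-- with one value table per data type holding its attributes.
CREATE TABLE objects (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
);

CREATE TABLE attstring (
    objectid bigint NOT NULL REFERENCES objects ON DELETE CASCADE,
    attname  text   NOT NULL,
    attval   text,
    PRIMARY KEY (objectid, attname)
);

CREATE TABLE attint (
    objectid bigint NOT NULL REFERENCES objects ON DELETE CASCADE,
    attname  text   NOT NULL,
    attval   integer,
    PRIMARY KEY (objectid, attname)
);
```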
So it's really an entity table, an attribute table, along with one or more value tables. While you may think this is not too bad on the surface, or that it has some appeal to certain developers, it can really get you into a lot of trouble, which they talk about in this post. I mean, the one thing that he mentions here that a lot of people talk about is that it's flexible. Well, yes, it is flexible in terms of not having to create dedicated columns in tables for particular sets of data. But it does have huge performance issues, which he talks about here. So if you're going to do a simple INSERT, in his design, it essentially requires inserting four rows into these different tables where you just want to store a name, a start, and an end. That's quite inefficient compared to a single-row insert into a dedicated table with dedicated columns. When you're deleting something, that DELETE has to cascade through everything.
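To make the insert cost concrete, here is a hedged sketch using the illustrative tables above, compared with a conventional events table (also assumed for illustration):

```sql
-- EAV: storing one logical record means several INSERTs across tables.
INSERT INTO objects DEFAULT VALUES;   -- suppose RETURNING id gave us 42
INSERT INTO attstring (objectid, attname, attval) VALUES (42, 'name',  'Phone call');
INSERT INTO attint    (objectid, attname, attval) VALUES (42, 'start', 1630000000);
INSERT INTO attint    (objectid, attname, attval) VALUES (42, 'end',   1630003600);

-- Conventional design: the same data is one row in one table.
INSERT INTO events (name, start_time, end_time)
VALUES ('Phone call', '2021-08-26 18:00+00', '2021-08-26 19:00+00');
```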
So you can do this one DELETE, but it still is going to remove all of these row references. Now, UPDATEs can have an advantage because it's just a very small row in a single table where you're just updating a value compared to a larger table. So there could be some advantages from an UPDATE perspective, but then when you get to the SELECTs, it is join city, where depending on the data you have to pull back, you're going to do a lot of joins. This is just a simple example. In addition, column aggregates are more complicated, like when you want to sum a given value, as are more complex queries. Again, it's join city, with so many joins and tables joining to themselves, just to find relatively simple data that you're pulling out. It's much easier to properly lay out data with a dedicated, say, person table and address table to be able to pull out the same information.
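For example, pulling back just the name, start, and end for a single object already takes one join per attribute. A sketch against the illustrative tables above, versus an assumed conventional events table:

```sql
-- EAV: one join per attribute just to reassemble a single record.
SELECT s.attval  AS name,
       i1.attval AS start,
       i2.attval AS "end"
FROM objects o
JOIN attstring s  ON s.objectid  = o.id AND s.attname  = 'name'
JOIN attint    i1 ON i1.objectid = o.id AND i1.attname = 'start'
JOIN attint    i2 ON i2.objectid = o.id AND i2.attname = 'end'
WHERE o.id = 42;

-- Conventional design: a single-table lookup.
SELECT name, start_time, end_time FROM events WHERE id = 42;
```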
Now, one of the benefits that was mentioned in this post is the flexibility of this particular design, because you don't have to create new columns or new tables for new entities or objects you need to add. But one way to handle that is just to go ahead and create the tables when you need them or create the columns as you need them. It's not that hard to do. I mean, a lot of frameworks have the ability to migrate the database to handle this. What you can also do, which I do as well, is use a JSONB column to give you more flexibility, where you don't necessarily have to create new columns. But he does have some very good caveats here: if you're going to be putting attributes in a dedicated JSONB field, don't put in ones that you are going to be using for joins, because that'll just make your life more difficult. Make it a dedicated column.
Don't use it when you need a dedicated database constraint. So for example, if you're working with date times, probably don't put that in the JSONB, but use a dedicated column for it so that you can enforce the fact that it is a date time in that column. Then thirdly, if you're going to use a WHERE condition that is not going to be equality, like if you're doing greater than or less than, then probably give it a dedicated column for that purpose. But this is great advice to follow and I definitely agree that EAV or entity attribute value designs cause bad performance. But if you want to learn more, you can definitely check out this blog post.
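As a minimal sketch of that hybrid approach (the names here are my own, not from the post): dedicated, constrained columns for anything you join on, constrain, or range-filter, plus a JSONB column for the genuinely flexible, equality-only attributes.

```sql
CREATE TABLE person (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),  -- date/times stay out of JSONB
    extras     jsonb NOT NULL DEFAULT '{}'          -- ad-hoc attributes only
);

-- A GIN index covers equality/containment lookups into the JSONB column.
CREATE INDEX person_extras_idx ON person USING gin (extras);

SELECT id, name
FROM person
WHERE extras @> '{"favorite_color": "green"}';
```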
The next piece of content- "What's new with JSON in PostgreSQL v14?". This is from blog.logrocket.com. This particular post talks about subscripting. So, for example, you have a data column that is JSONB and you insert information like a title into it, an author with a first name and a last name. In order to pull that out, where the author's first name is "Ada", you would have to use the single arrowhead or double arrowhead syntax to be able to pull out the data. The important point is the syntax is not the easiest and it's not common in the programming world, you would use subscripting to do it. Well, with version 14 you can do that now. So for example, you can say WHERE data ['author'] [ 'first_name'] = "Ada". The only thing that's a little bit unusual is that you do have to use double quotes around the text when doing that. But if you want to grab the title, you just do data with a subscriptive title and it works.
Then they go into updating the JSON. Before version 14, you had to use jsonb_set, whose syntax is even less appealing than the arrow syntax. But with version 14, you can use an update statement with subscripting and just say SET data['author']['first_name'] = '"Sarah"'. So that is much easier to interpret than the jsonb_set function. Now, they did say there are some caveats with regard to that. In particular, they say quote "If the value of the JSON key that is being updated is null, assignment via subscripting will act as if the value of the key is an empty object or array". So that may not necessarily be what you want. So you will have to drop back to using jsonb_set for certain use cases, because it does have an optional create parameter. But again, there's another post describing this great enhancement to Postgres, and I'm sure there'll be more. But if you want to find out more about how this works, you can definitely check out this blog post.
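A quick sketch of both forms of the update, on the same assumed books table:

```sql
-- PostgreSQL 14: update a nested value with subscripting.
UPDATE books
SET data['author']['first_name'] = '"Sarah"'
WHERE data['author']['first_name'] = '"Ada"';

-- Pre-14 style with jsonb_set; the optional fourth argument controls
-- whether a missing key gets created.
UPDATE books
SET data = jsonb_set(data, '{author,first_name}', '"Sarah"', true)
WHERE data -> 'author' ->> 'first_name' = 'Ada';
```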
Next piece of content- "Aggregate Push-down in mysql_fdw". This is from enterprisedb.com. They're talking about an enhancement that's been done to the mysql_fdw where it can now push down aggregates. So as opposed to just pulling all the data back into Postgres from mysql_fdw and then processing the aggregate on Postgres can actually ask the MySQL server to do that aggregate. They point out that the performance impact here is that doing a particular query goes from 6 seconds to deliver the result down to 380 milliseconds. So that's a great improvement using this aggregate push-down. They say you can also use a join and an aggregate push-down together to get responses to queries from MySQL and they give an example of pushing it down to a partition table. So if you want to learn more about this enhancement, you can definitely check out this blog post.
The next piece of content- "Waiting for PostgreSQL 15- Add assorted new regexp_xxx SQL functions". This is from depesz.com, and this is an additional post discussing the new regular expression functions that are coming to Postgres 15. So if you want to learn more about their use and their capabilities, you can definitely check out this blog post.
Next piece of content- "Using TimescaleDB with PGO, the open source Postgres Operator". This is from blog.crunchydata.com, and this post just basically describes how you can use their Postgres operator to get TimescaleDB running on Kubernetes. So if you're interested in that, you can check out this blog post.
The next piece of content- "How to run a specific regression test". This is from highgo.ca and when doing Postgres development, of course, you have regression tests to make sure that any changes to the code don't impact other features of Postgres. Well, this post describes how you can run a specific regression test if you're testing a particular area of Postgres. So if you're interested in learning more about that, you can check out this blog post.
The last piece of content. The PostgreSQL person of the week is Pavel Luzanov. So if you're interested in learning more about Pavel and his contributions to Postgres, definitely check out this blog post.