
Fantastic Data Types, Domain Data Types, Insert Performance, Row Differences | Scaling Postgres 121

In this episode of Scaling Postgres, we discuss fantastic data types, custom domain data types, improving insert performance and how to query row differences.

Transcript

All right, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Webinar: Fantastic Data Types and Where to Use Them [Follow Up]". This is from 2ndquadrant.com, and it's a webinar that was put on by 2ndQuadrant; a recording is available here, and you can click the link and register to view it. What was really great about this presentation is that I had heard of using ENUMs, but I had also heard there were numerous disadvantages. This presentation really put ENUMs in perspective: how advantageous they can be and how to avoid some of the disadvantages, the biggest one being that removing a value from an ENUM is difficult if you ever need to do that. Otherwise, there are a lot of advantages to using them compared to, say, a separate lookup table or storing raw text in a text field that has to be managed elsewhere.

So for this part alone, I definitely suggest you check out this webinar to see their discussion of ENUMs. Their coverage of arrays and JSON isn't as in-depth. For arrays, they say to use them for a consistent data type. So if you need to store a set of integers or a set of text values, say tags, like user tags or something of that nature, an array is a good use case for that. But don't store an array of multiple different data types that have been converted to text; that really doesn't make much sense. Also, reach for JSON, or ideally JSONB, when you have a variety of data you need to store, but be cautious about overusing that particular data type. They also cover timestamps and their usage as well. So overall, a really good presentation I suggest you check out.
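
To make the ENUM and array ideas concrete, here is a minimal sketch, assuming hypothetical table and type names (the webinar's own examples may differ):

    -- An ENUM type constrains a column to a fixed set of labels.
    CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

    CREATE TABLE orders (
        id     bigserial PRIMARY KEY,
        status order_status NOT NULL DEFAULT 'pending',
        tags   text[]  -- an array of a single, consistent data type
    );

    -- Adding a value later is easy; removing one is not, which is the
    -- main drawback mentioned above.
    ALTER TYPE order_status ADD VALUE 'cancelled';

    INSERT INTO orders (status, tags)
    VALUES ('shipped', ARRAY['priority', 'gift']);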

The next piece of content is "POSTGRESQL: USEFUL NEW DATA TYPES". This is from cybertec-postgresql.com. They're actually talking about DOMAINs, in which a domain is a concept that you essentially create a new data type that has a specific check constraint. So in this example, he wants to create a color code data type. It's made as a text field, but it has this check constraint that's added to it to validate the color codes. Then this DOMAIN just becomes essentially a data type. So he's creating a table here with column C that has the type of color code, and when you insert a valid color code, it accepts it, and when it violates that constraint, it presents an error message. 

Now, of course, you don't have to create DOMAINs. You can put check constraints on existing tables and leave the type as text, but DOMAINs generally make the database easier to work with and to reason about when looking at different tables and how the data is laid out. He also looked at some other use cases, like a text string that should only be alphanumeric; a DOMAIN is one way to handle that. You can also create a DOMAIN for URLs, and he presents a check constraint that ensures a column only accepts URLs. So if you're interested in setting up these types of DOMAINs in your database, maybe you want to check out this post.
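
A rough sketch of the idea (the exact names and regular expressions in the original post may differ):

    -- A DOMAIN is a reusable data type with a check constraint attached.
    CREATE DOMAIN color_code AS text
        CHECK (VALUE ~ '^#[0-9a-fA-F]{6}$');

    CREATE TABLE t_color (c color_code);

    INSERT INTO t_color VALUES ('#00ff00');      -- accepted
    INSERT INTO t_color VALUES ('not a color');  -- rejected by the check constraint

    -- The same pattern works for other formats, e.g. a simple URL check:
    CREATE DOMAIN url AS text
        CHECK (VALUE ~ '^https?://');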

The next piece of content is "13 tips to improve PostgreSQL Insert performance". This is from the timescale.com blog. Now, Timescale is an extension of Postgres that does time series data very well. So some of the recommendations are related to Timescale, but the vast majority apply to vanilla PostgreSQL as well. So I'm just going to run through some of these. The first one is to use indexes in moderation. So the more indexes you have, the slower INSERTs happen because you have to update all those indexes. Number two reconsider foreign key constraints, because generally when you have a foreign key constraint, the other table must be checked or validated that that value exists before the INSERT can happen. So reconsider using those for faster INSERT performance. Three avoid unnecessary unique keys because again, that's a case where it has to check if this data already exists in the table before it does the INSERT. 

So that is another validation step that, if you don't need it, you can skip to increase speed. Four is to use separate disks for WAL and data; depending on your workload, separating the WAL from the data can give you better performance. Five is to use performant disks, so don't use spinning hard disks; use SATA SSDs, NVMe drives, or a high-speed SAN. Now, starting from suggestion six, they're talking about things related to TimescaleDB. However, I find that nearly all of these are applicable to Postgres as well. Number six is to use parallel writes. So if you're doing an INSERT or a COPY, use parallel processes to insert that data, because a single INSERT or COPY is done as one transaction in a single-threaded fashion. If you parallelize it, you can potentially get better write performance. It looks like TimescaleDB actually has a parallel copy command, but you could do this yourself using your own language or techniques with Postgres.

Seven is to insert rows in batches; it's always more advantageous to do one multi-row insert rather than many single-row inserts (see the sketch after this section). Number eight is to properly configure shared buffers, always important. Nine is to run their Docker images on Linux hosts. This is mostly TimescaleDB related; it's basically about the overhead of virtual machines, and you need to take the same considerations into account in how you set up Postgres. Ten says to write data in loose time order. Basically, they're saying you want to try to keep as much as possible in memory. Now, TimescaleDB does things related to partitioning. They don't necessarily call it partitioning, but it's time series data that's chunked up by a timestamp or something similar. They're saying to INSERT your data in time order; if you're collecting metrics from servers, for example, don't insert all the data from server A, then all from B, and then all from C.

Do it by time instead, so that the indexes and things of that nature for the most recent data are kept in memory as much as possible. And that's applicable to Postgres as well. Eleven talks about avoiding chunks that are too large, and twelve about avoiding chunks that are too many or too small. This relates to partition sizes: if you're using Postgres, you don't want too little data in a partition and you don't want partitions that are too large; you want them to be just right, essentially. Then thirteen is to watch the row width. So, again, if you want to maximize insert performance, fewer columns generally do better, and you want to avoid inserting very wide rows at one time. So a lot of great recommendations in terms of maximizing your insert performance. Go ahead and check out this post from TimescaleDB if you want more details.
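
As a sketch of tip seven, here is what batching looks like against a hypothetical metrics table; one multi-row INSERT (or a COPY) is a single statement and a single transaction instead of many round trips:

    -- Many rows in a single statement and transaction:
    INSERT INTO metrics (recorded_at, host, value)
    VALUES
        ('2020-07-20 10:00:00', 'a', 1.2),
        ('2020-07-20 10:00:00', 'b', 3.4),
        ('2020-07-20 10:00:00', 'c', 5.6);

    -- COPY is typically faster still for bulk loads from a file:
    COPY metrics (recorded_at, host, value)
    FROM '/tmp/metrics.csv' WITH (FORMAT csv);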

The next post is "CALCULATING DIFFERENCES BETWEEN ROWS IN SQL". This is from cybertec-postgresql.com and this is a post about window functions. So how do you compare one row to the next row when you're doing a query? The first example that they're reviewing here is lag. So you want to say consult the previous year of a set of data you're looking at. You can use the lag window function and look back one row, but of course, when you're doing this you want to order it. So you do an over and ORDER BY the year so you can look at the previous information for the previous year to compare it to the first year. And then he says what do you want to do if you want to look at the difference between this year and the last year? 

You can actually take this year's number and subtract the value from that window function to get the difference. Other than lag, he also talks about first_value. So if you want to compare every year against the first year of data, you can use the first_value function. Here, he's showing how you can compare oil production to the first year for which data is available, so you can compare each row to that first year. He also shows how you can partition the data in your queries, say, by country in this example. Here he's showing window functions that are broken out by country, so you can look at Canada's output relative to Mexico's output. If you're looking to learn more about window functions and how to use them, definitely check out this blog post.
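
A minimal sketch of the lag() and first_value() patterns described above, using a hypothetical production table (the post itself works with oil production data by country and year):

    SELECT country,
           year,
           production,
           production - lag(production) OVER w         AS diff_from_previous_year,
           production - first_value(production) OVER w AS diff_from_first_year
    FROM   production
    WINDOW w AS (PARTITION BY country ORDER BY year);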

The next post is "Indexing JSONB columns in PostgreSQL". This is from vsevolod.net. Now, the title is interesting because he talks about an issue where he was indexing a JSONB field trying to track multiple different IDs for a given document. So apparently different sources have different IDs for it. So he wanted to track it all in one document as opposed to using a separate table. However, he ran into a number of issues, particularly when he needed to query based on multiple IDs. Now, I should say this is kind of how the post starts, but the post is a story that weaves in a lot of different information so it's not strictly on indexing JSONB columns. But after he looked and investigated more, his ultimate conclusion was that he had a domain modeling failure and ideally he should set up a separate table, not using JSONB columns. 

So it's very interesting: the title of this post is about indexing JSONB, whereas ultimately the solution was not to use JSONB columns at all. But the story and the process he went through are interesting if you want to learn more about that. Given that he ran into a situation where multiple IDs could reference a single document, the separate table seemed to be the more efficient solution. He also covers some additional pitfalls he ran into, and he references other people running into problems with the JIT compiler that is enabled by default in Postgres 12. So there's some good information with regard to that as well. Again, this post is more of a story about a path that he went on and the things he discovered with Postgres. If you're interested in learning more about that, check out this blog post.
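
For reference, the usual way to index a JSONB column for containment queries is a GIN index. This is only a generic sketch with hypothetical names, not the author's final design, which moved the IDs into a separate table:

    CREATE TABLE documents (
        id           bigserial PRIMARY KEY,
        external_ids jsonb  -- e.g. {"source_a": "123", "source_b": "abc"}
    );

    -- A GIN index supports @> (containment) lookups on the JSONB column:
    CREATE INDEX documents_external_ids_idx
        ON documents USING gin (external_ids jsonb_path_ops);

    -- Finds documents whose external_ids contain this key/value pair:
    SELECT *
    FROM   documents
    WHERE  external_ids @> '{"source_a": "123"}';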

The next post is "Should You Ever Use Synchronous Replication in PostgreSQL?". This is from enterprisedb.com. Now this is actually curious because I actually think the title of this presentation is how best to use synchronous replication in PostgreSQL. Now it starts off again from a story perspective of a customer that was having issues, that ran into some data loss due to a disk issue and how because they had a setting of synchronous_commit, their standby did not have all the data that it was supposed to have and it was because of how synchronous_commit was set up. They go over the different settings that are possible for synchronous_commit from off, remote_write, on, local, and remote_apply. They describe how you can set this up to get better performance or better protection of your data. 

So essentially it's a continuum. If you want the best performance you could turn synchronous_commit off, but you're potentially risking some data loss by doing that, of course, because as mentioned here, quote "...commits are sent to the application as soon as the transaction is handled by the back end process, but before the transaction is flushed to the WAL". So essentially, the transaction is somewhere in memory but not yet committed to disk on the primary database server; this doesn't even involve the replicas. At the other end, the setting that gives the most safety, but the least performance because it has to wait, is remote_apply.

Because "...commits are sent by the primary to the application only after the standbys defined in synchronous_standby_names have confirmed that the transactions have been applied beyond the WAL, to the database itself". So this post talks about essentially the trade-off. If you want more speed but are less concerned about safety, you could go with the synchronous_commit off route, but if preserving data is the most important and performance not so much, then you could go with a synchronou_commit with a remote_apply. But generally, if you want to learn more about synchronous replication and how best to configure some of your settings, definitely check out this post.

The next post is "Oracle to PostgreSQL - Binary Objects". This is from 2ndquadrant.com. Now I read this title and I was expecting to look at a comparison between them, but actually, most of this post talks about PostgreSQL in general. It's a really good weighing of pros and cons on where you should store binary objects. So if you store binary objects in the database, he mentions three advantages: they get backed up with appropriate transaction integrity, they get transported to physical streaming replication, and they don't require a separate connection stream to the caller. But then they list eleven different issues that are disadvantages with storing them in the database, primarily related to their size and performance. 

So if they're small it's less of an issue. I won't run through all of these right now, but it's definitely a great list of potential disadvantages. And then, of course, they give you some recommendations on how best to store binary objects given these pros and cons. So if you are storing binary objects in your PostgreSQL database and you're not sure you're doing it right, or you want to consider a different method, definitely check out this blog post, because it's not just about Oracle; it covers, in general, best practices for storing binary objects and working with them in your database.
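
As a minimal sketch of the in-database option, a bytea column keeps the binary data inside the same transactions, backups, and replication stream as the rest of your data (names here are hypothetical; the post also weighs large objects and external storage):

    CREATE TABLE attachments (
        id        bigserial PRIMARY KEY,
        file_name text NOT NULL,
        content   bytea NOT NULL  -- the raw binary data
    );

    -- bytea values can be supplied as hex literals:
    INSERT INTO attachments (file_name, content)
    VALUES ('tiny.bin', '\x0001020304'::bytea);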

The next post is actually a YouTube video, "A look at the Elephant's Trunk - PostgreSQL 13". This is from the Warsaw PostgreSQL Users Group YouTube channel. It is a presentation, about an hour and 17 minutes in length, going over the new features in PostgreSQL 13. Magnus Hagander has given this talk a number of times for different versions of PostgreSQL, and this is his most recent one, covering version 13. So if you want to learn more about Postgres 13 and all the different features coming in it, definitely check out this video.

The next post is "Webinar: PostgreSQL Partitioning Roadmap [Follow Up]". This is from 2ndquadrant.com and they're going over the roadmap for partitioning updates and improvements to PostgreSQL. Now, they covered some of the historical changes with PostgreSQL 11 and version 12, and also what's coming in 13, and potentially some thoughts about 14. So again, they continue to push as much as they can to fill out the performance, make management easier, and look into the future. For sharding, again, this still seems a fairly long way away, but again, it's something being considered. And then they also talk about their bi-directional replication product for a small portion of the webinar. So if you're interested, you could just click this link here to get registered to view it.

The next piece of content is "How Secure is your Postgres? Part 1: Access". This is from enterprisedb.com, and they're talking about different levels of access to the PostgreSQL instance in terms of setting up security from physical access, how you can connect to it, whether you're using a Unix domain socket locally or using TCP/IP connection. Then once you allow a TCP/IP connection, what kind of firewalls are in place to prevent access from other areas? Also, how to ensure that the data is encrypted in transit by using transport encryption such as TLS to do that, and different ways of configuring these SSL parameters. So if you're interested in learning more about secure access to PostgreSQL, definitely check out this blog post.

The last piece of content, the PostgreSQL person of the week is Andrew Dunstan. So if you're interested in learning more about Andrew and his contributions to PostgreSQL, definitely check out this blog post.
