UUIDs, Write Spikes, Distinct, SQL Standard | Scaling Postgres 23
In this episode of Scaling Postgres, we review articles covering UUIDs, the source of write spikes, uses of DISTINCT, and SQL standard support.
Content Discussed
- UUIDs and Compressibility
- Troubleshooting Daily Write Spikes on Cermati’s PostgreSQL RDS
- High performance distributed DML in Citus
- The many faces of DISTINCT in PostgreSQL
- How PostgreSQL’s SQL dialect stays ahead of its competitors
- Batch Updates and Concurrency
- Troubleshooting PostgreSQL Streaming Replication Synchronization
YouTube Video
Podcast Audio
Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about UUIDs, write spikes, DISTINCT and the SQL standard. I'm Creston Jameson, and this is Scaling Postgres episode 23.
[00:00:21] Alright, the first piece of content this week is UUIDs and Compressibility, and this is from Richard Startin's blog at richardstartin.uk. In it he talks about using UUIDs with your database. He does mention Postgres a bit, but he also covers other NoSQL databases, the different ways they use UUIDs, and potentially how best to use them. So UUIDs are good to use when you essentially have more than one system creating data. If you can rely on your primary database to generate IDs, typically you can just use a serial data type, which uses an incrementing integer. However, if you have IDs being generated by multiple systems, potentially geographically distributed, then UUIDs are a good fit for that situation. Unfortunately, UUIDs can be quite large, and he looks at different storage formats for UUIDs and how well they compress, using the typical binary format as his baseline with a ratio of one. His conclusion is that you should definitely try to use a binary format when you can, and he mentions that Postgres has the ability to do that with its native uuid type. Other formats, once you start going into text, can be two to five times larger, and that's going to have an impact on your system's performance. He then looked at how well each format compresses, and essentially with a binary format there's no difference in the level of compression: it's pretty much all around 15 megabytes for a million UUIDs. The other formats do shrink under compression, but nowhere near to the size of binary. So this is an interesting blog post to check out, and it's a pretty short post to review. Overall, the main point is that if you're using UUIDs, try to use a binary format as opposed to a text format in your system.
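To make that concrete, here's a minimal sketch in Postgres comparing the native 16-byte uuid type against storing the same identifiers as text. The table and column names are just made up for the illustration:

    -- Native uuid column: stored as 16 bytes per value
    CREATE TABLE events_binary (
        -- gen_random_uuid() is built in from Postgres 13;
        -- on older versions, install the pgcrypto extension first
        id uuid PRIMARY KEY DEFAULT gen_random_uuid()
    );

    -- Text column: the canonical 36-character form takes several times the space
    CREATE TABLE events_text (
        id text PRIMARY KEY
    );

    -- After loading the same identifiers into both, compare on-disk sizes
    SELECT pg_size_pretty(pg_total_relation_size('events_binary')) AS binary_size,
           pg_size_pretty(pg_total_relation_size('events_text'))   AS text_size;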
[00:02:30] The next post is Troubleshooting Daily Write Spikes on Cermati's PostgreSQL RDS. This is from the Cermati Tech blog on Medium, and it's a story-based scenario. They had an issue where early in the morning they were seeing write spikes essentially seven times higher than normal, and they were trying to determine the cause. So they looked at traffic: was there any significant traffic coming in? The answer was no. They looked for any background workers that were potentially doing something and couldn't find anything. They looked in the query logs and saw nothing that jumped out at them. They looked at the database snapshot schedule, which I'm assuming is the backup, and again nothing from that. One thing they did try was changing when certain jobs got run, to see if the spike moved to a different time in the evening.
[00:03:34] And finally they found something related to their ETL process. Ultimately they narrowed it down to a query with an ORDER BY, because with all the data being sorted, it was doing an external merge that used the disk. So this is something to keep in mind when you're looking at your database system's statistics and you see high write activity: it is not necessarily due to inserts, updates and deletes, it can be due to selects. And I've seen this frequently, where an ORDER BY clause doing the sorting and ordering can generate a lot of disk activity. If it's a large sort, it's going to spill to disk as opposed to being done in memory, and you're going to get a lot of write activity as that query runs. They determined that they ultimately did not need that ORDER BY, and of course the write spike subsided; it's no longer quite as high. So this was a good blog post if you want to get the story behind the different things they checked. But ultimately it was a query with a large amount of data that had to be sorted, spilling to disk and causing write activity for what was actually a select. So just something to keep in mind when you're looking at potential write spikes in your system.
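If you want to check whether one of your own sorts is spilling to disk, a minimal sketch looks like this, assuming a hypothetical orders table; the Sort Method line in the plan output is the giveaway:

    -- work_mem caps how much memory a single sort can use before spilling to disk
    SET work_mem = '4MB';

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders ORDER BY created_at;

    -- A plan node like:
    --   Sort Method: external merge  Disk: 102400kB
    -- means the sort spilled to disk and generated write activity;
    --   Sort Method: quicksort  Memory: ...
    -- means it stayed entirely in memory.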
[00:05:01] The next post is High Performance Distributed DML in Citus, and this is from the citusdata.com blog. Now, that's the title of this blog post, but what I found most interesting is that it's really a follow-on from a post describing another data aggregation technique, one we covered in a previous episode of Scaling Postgres: scalable incremental data aggregation on Postgres and Citus. And again, it mentions Citus, but you don't have to use Citus to use some of these techniques.
[00:05:37] So this is kind of a follow-on post to that. In this example they are using movies: you have one row per movie with a review score for that movie, and they wanted to record the score in the movies table, updating it based upon an average of however many reviews come in. Now, you don't want to just do an update, because as they say here, an update will take a row-level lock, and if you have multiple reviews being submitted or calculated at the same time, you're going to run into blocking and lock contention. So basically, the way you can get around that is doing inserts instead. They created a second table called movie_reviews, so all the movie reviews go in there, and then on a periodic basis you update the movies table and set the review score and the review count based upon the rows in the movie_reviews table. Because this runs on a periodic basis, perhaps once an hour or once every 15 minutes, you're not going to get a bunch of locks, because all new movie reviews are going into the movie_reviews table. So this is a good practice to follow if you have a use case for it. Now, one downside of this approach they mention is that every time you aggregate, you re-aggregate the whole movie_reviews table for each movie. So even if you get one more review, you're still re-aggregating, say, 1,000 reviews, and depending upon your use case, if you have a lot of rows, that can be a bottleneck. So this is where it follows on from the previous post I mentioned: you can do an incremental update instead, and they show the function they created, again using the rollups table described in that previous blog post, to incrementally roll up those reviews, which should be a lot more efficient. Again, you don't need Citus to use this particular technique, so I thought it was a great post that describes how you could potentially do this. At the end they do include a benchmark comparing RDS and Aurora, a large instance type for each, against four smaller Citus database clusters, and they saw really fast performance. So again, these are solid techniques you can use whether you're using Citus or not, and if you have a use case that could potentially use some of these techniques, it's definitely a blog post to check out.
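Here's a minimal sketch of the basic insert-then-aggregate pattern they describe, before the incremental rollup optimization; the table and column names are my own, not necessarily the ones in the post:

    -- Reviews are appended here, so no row-level locks are taken on movies
    CREATE TABLE movie_reviews (
        movie_id bigint      NOT NULL,
        score    int         NOT NULL,
        created  timestamptz NOT NULL DEFAULT now()
    );

    CREATE TABLE movies (
        movie_id     bigint PRIMARY KEY,
        review_score numeric,
        review_count bigint
    );

    -- Run periodically (e.g. every 15 minutes) to refresh the aggregates
    UPDATE movies m
    SET review_score = r.avg_score,
        review_count = r.cnt
    FROM (
        SELECT movie_id, avg(score) AS avg_score, count(*) AS cnt
        FROM movie_reviews
        GROUP BY movie_id
    ) r
    WHERE m.movie_id = r.movie_id;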
[00:08:11] The next post is The Many Faces of DISTINCT in PostgreSQL, and this is from the Statuscode blog on Medium; the subtitle is Three Powerful Uses of DISTINCT in Postgres. So what is DISTINCT? The quote is: "SELECT DISTINCT eliminates duplicate rows from a result." So you can select distinct department from employees: there will be multiple departments represented in the employees table, and this shows each one just once. Now, you could do the same thing with a GROUP BY, just grouping by the department and showing the department. But then he talks about DISTINCT ON. This is a special non-standard clause to find the first row in a group. In this example, he finds the person with the highest salary in each department by doing DISTINCT ON (department) from the employee table, ordering by the department and then by salary descending, so it pulls the highest-salary row from the employee table for each department. I can think of some really good use cases for this. And the third option is IS DISTINCT FROM, which is basically a way of safely comparing nullable fields. So for example, in this scenario, A IS DISTINCT FROM B: one and one are the same, so it's false; one and two are different, so it's true; NULL and one are different, so it's true, whereas with a regular equals comparison the result would just be NULL; and NULL compared with NULL is false, because they are not distinct from each other. So again, an interesting clause that you could potentially use in your SQL queries.
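To make those three forms concrete, here's a quick sketch against a hypothetical employees table:

    -- Plain DISTINCT: one row per department
    SELECT DISTINCT department FROM employees;

    -- DISTINCT ON: the highest-paid employee in each department
    SELECT DISTINCT ON (department) department, name, salary
    FROM employees
    ORDER BY department, salary DESC;

    -- IS DISTINCT FROM: NULL-safe comparison
    SELECT 1 IS DISTINCT FROM 1,        -- false
           1 IS DISTINCT FROM 2,        -- true
           NULL IS DISTINCT FROM 1,     -- true
           NULL IS DISTINCT FROM NULL;  -- false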
[00:09:49] And lastly, he talks about array_agg, which aggregates values into an array. We've seen this before in previous episodes, but he likes to use it for getting a sample of the data. So here you can see which employees are part of each department with a query that aggregates the employees by department. So this was a relatively short post, but it gave a lot of short tips on some great clauses that you can use in your SQL queries.
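A sketch of that kind of query, again against the same hypothetical employees table:

    -- One row per department, with its employees collected into an array
    SELECT department, array_agg(name) AS members
    FROM employees
    GROUP BY department;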
[00:10:22] The next post is actually the PDF from a presentation called How PostgreSQL's SQL Dialect Stays Ahead of Its Competitors. This is by Markus Winand, and it's from his modern-sql.com site. In it he compares the SQL standard to what features are available in other databases, with a focus on how PostgreSQL has handled them. I should say this presentation is quite long, at 97 slides, so I'll just go over some of the highlights of what he discusses so you can see if you're interested in this presentation. He covers grouping sets, and he covers TABLESAMPLE, a way to get samples of data in a particular table.
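As a quick illustration of those two features, here's a sketch using a hypothetical sales table:

    -- GROUPING SETS: several GROUP BY aggregations in a single pass
    SELECT region, product, sum(amount)
    FROM sales
    GROUP BY GROUPING SETS ((region), (product), ());

    -- TABLESAMPLE: read roughly 1% of the table's pages
    SELECT * FROM sales TABLESAMPLE SYSTEM (1);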
[00:11:06] He covers XMLTABLE, a way to work with XML and relational data together; the new identity column feature, which is similar to but more powerful than serial columns; new features added for window functions using OVER and PARTITION BY; and lastly, JSON functionality. So this was a great presentation to check out, just to look at the SQL feature support across different database platforms and where PostgreSQL sits amongst them.
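For example, the standard identity column syntax, available since PostgreSQL 10, looks like this as an alternative to serial; the table here is just an example:

    -- Standard SQL identity column instead of the Postgres-specific serial
    CREATE TABLE accounts (
        id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name text NOT NULL
    );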
[00:11:35] The next post is Batch Updates and Concurrency, and this is from the tapoueh.org blog, continuing his PostgreSQL concurrency series. Here he's talking about batch updating: you get a dump of data from some source, in this case the Museum of Modern Art collection, and then you need to update that batch on a periodic basis. So basically the scenario is: if a row exists, you want to update it; if not, do an insert.
[00:12:08] Now, he talks about the classical way you would typically do that, but you run into concurrency problems if a row already exists and you try to insert it again. One solution he suggests is locking the target table. But if you have PostgreSQL 9.5 or later, you can use the ON CONFLICT clause on the insert, for example ON CONFLICT DO NOTHING. So this is just another post that rounds out his concurrency blog series.
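A minimal sketch of that upsert pattern, with a hypothetical artworks table standing in for his museum collection data:

    -- Insert new rows; silently skip any that already exist
    INSERT INTO artworks (id, title)
    VALUES (1, 'The Starry Night')
    ON CONFLICT (id) DO NOTHING;

    -- Or update the existing row instead of skipping it
    INSERT INTO artworks (id, title)
    VALUES (1, 'The Starry Night')
    ON CONFLICT (id) DO UPDATE SET title = EXCLUDED.title;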
[00:12:38] The last post is Troubleshooting PostgreSQL Streaming Replication Synchronization. This describes the scenario where you have a primary database and it's synchronizing to one or more replicas, and what happens if the replicas fall behind. Maybe in the log you see an error to the effect of "requested WAL segment has already been removed". So how do you catch it back up? Basically, you need to get those WAL files from wherever you're storing or archiving them and make them accessible to the replica so it can catch up. He discusses different settings you can change on the database to keep those WAL segments or WAL files around longer, to minimize the chance of this happening. If you don't have those WAL files anymore, then you need to take another backup and restore the replica from scratch. And lastly, he covers using replication slots to prevent this problem from happening at all: the WAL files won't be deleted until the replicas have all received that WAL segment. However, the key thing he mentions that you need to keep in mind is that if you do that, you need to make sure those replicas can always reach the primary. Because if not, you're going to have unlimited growth of WAL files on the primary database, to the point where you could run out of disk space. So basically, you're trading the risk of a replica falling behind for the risk of filling up the disk or storage system where you're keeping all of the WAL files. So this blog post gives a great review of those scenarios.
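Here's a minimal sketch of the replication-slot approach; the slot name is made up, and the configuration comments reflect Postgres versions of that era:

    -- On the primary: create a physical replication slot for the replica
    SELECT pg_create_physical_replication_slot('replica_1');

    -- On the replica, in recovery.conf (pre-v12), point at that slot:
    --   primary_slot_name = 'replica_1'

    -- Alternatively, keep more WAL around without a slot, in postgresql.conf:
    --   wal_keep_segments = 128   -- renamed wal_keep_size in v13

    -- Monitor slots so an unreachable replica doesn't fill the primary's disk
    SELECT slot_name, active, restart_lsn FROM pg_replication_slots;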
[00:14:16] That does it for this episode of Scaling Postgres. You can get links to all the content presented in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you could subscribe via YouTube or iTunes. Thanks.