Data Skew, Monitoring, Remote Access CSV, Documentation | Scaling Postgres 270
In this episode of Scaling Postgres, we discuss how to handle data skew, tools for Postgres monitoring, how to load remote CSV data, and the Postgres documentation.
Content Discussed
- Data Skews in Postgres
- PGDay Belgium 2023 reflections (Monitoring Presentation)
- Holy Sheet! Remote Access CSV Files from Postgres
- The PostgreSQL Documentation and the Limitations of Community
- Use HOT, so CLUSTER won’t rot in PostgreSQL
- How to use HOT Updates in Postgres to make CLUSTER work better
- The Do's and Don'ts of Postgres High Availability Part 1: Great Expectations
- pgBackRest File Bundling and Block Incremental Backup
- Build hypothetical indexes in Amazon RDS for PostgreSQL with HypoPG
- Overview of ICU collation settings, part 2
- Manipulating Data In PostgreSQL: Learning PostgreSQL with Grant
- Extensions
- Oleksii Vasyliev
- The Future of Dragon Ruby with Amir Rajan | Rubber Duck Dev Show 89
YouTube Video
Podcast Audio
Transcript
Alright. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Data Skews in Postgres". This is from crunchydata.com. And by data skew, they mean that the distribution of your data doesn't follow a normal distribution or bell curve; certain values are highly overrepresented. So maybe this is a large account in your software-as-a-service application that makes up the majority of the data in a table compared to your other customers, for example.
Now they have an example here showing a distribution where all the data points for a particular day amount to 30%. So here the data is skewed toward that particular day. As a consequence, once a value reaches 30% or more of the table, Postgres may decide it's cheaper not to use an index when selecting that data. It may think "Oh, I'll just do a sequential scan as opposed to an index scan" because it estimates that will be more efficient. So data skew of 30% or higher can have a performance impact.
Now to get this information, they have a query here looking at the system catalogs, primarily the pg_statistic table, to pull back the stats and give you a representation of the different data values for a column in a table. And they even show what to modify so you only see the values that make up 30% or more of a particular column. So this could be highly beneficial for finding where you have data skew in your database.
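The post has its own query against pg_statistic, but just to sketch the idea, here's a rough version against the friendlier pg_stats view (the schema and the 30% threshold are placeholders you'd adjust):

```sql
-- Sketch: list column values whose estimated frequency is 30% or more.
-- This relies on planner statistics, so run ANALYZE first for fresh numbers.
SELECT s.tablename,
       s.attname,
       v.value,
       v.freq
FROM pg_stats AS s,
     LATERAL unnest(s.most_common_vals::text::text[],
                    s.most_common_freqs) AS v(value, freq)
WHERE s.schemaname = 'public'
  AND v.freq >= 0.30
ORDER BY v.freq DESC;
```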
But of course, the question comes "Well, how do I handle it?", and they have an excellent answer for this. Because Postgres may not use the index on values that are highly skewed or represent a significant portion of the table, you can use a partial index. So you add an additional index alongside the one that already exists, but restrict it to the skewed value, and then queries for that value can use this index to pull out the data efficiently. So a partial index is a great way to still use an index when querying this type of data. Definitely check out this blog post if you want to learn more.
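As a rough illustration of that partial index approach (the table, columns, and skewed value here are all hypothetical, not the ones from the post):

```sql
-- The existing general-purpose index stays in place.
CREATE INDEX idx_events_account_id ON events (account_id);

-- Partial index just for the skewed account, so the planner has a small,
-- cheap index path even when that value dominates the table.
CREATE INDEX idx_events_big_account
    ON events (created_at)
    WHERE account_id = 42;
```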
Next piece of content- "PGDAY BELGIUM 2023 REFLECTIONS". This is from cybertec-postgresql.com. Now we don't normally cover these types of posts that reflect things that happened at a conference, but he does include his slides and his video. And that's what I found particularly interesting because it was a really good look at all the different monitoring tools that are available.
His presentation was "Professional PostgreSQL monitoring made easy", and I really liked it. Now he did talk a lot about pgwatch2, which is produced by CyberTec, but he still covers a lot of the different monitoring tools for Postgres in the community. So I definitely encourage you to check out this post as well as, of course, his monitoring presentation.
Next piece of content- "Holy Sheet! Remote Access CSV Files from Postgres". This is from crunchydata.com. And he's talking about a scenario where you want to pull some remote data into Postgres and it's located as a CSV file in the cloud. And basically, he's using Google Sheets for that purpose and there's actually a command you can use to be able to pull down that CSV file. He's just using curl here with a particular URL to pull down that information. So how do you get this in Postgres?
The first option he suggests is COPY FROM PROGRAM, and the program you copy from is basically that curl command that pulls down the information. So you copy into the table that you've already defined, from the program, putting the command you want to run in quotes, and he specifies CSV format with a header and UTF-8 encoding. And then that loads the data. Now the disadvantage of this is that you need superuser privileges to run COPY FROM PROGRAM. So that's the downside, but it works great.
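A minimal sketch of that approach might look like this (the table layout and sheet URL are placeholders, not the ones from the post):

```sql
-- Target table must already exist with columns matching the CSV.
CREATE TABLE sheet_data (name text, amount numeric);

-- COPY ... FROM PROGRAM runs the command on the database server and reads
-- its stdout. Requires superuser (or membership in pg_execute_server_program
-- on newer versions).
COPY sheet_data
FROM PROGRAM 'curl -s "https://docs.google.com/spreadsheets/d/<sheet-id>/export?format=csv"'
WITH (FORMAT csv, HEADER true, ENCODING 'utf8');
```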
The next option he talks about is using the http extension. With that, you can use the http_get function, specify the URL, and it will pull the data down. Unfortunately, it comes back as one big string, but he wants it as, of course, 22 rows of data. So there's a whole process of separating the data out into columns and rows. It's definitely not the cleanest thing to do, but it is possible.
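To give a feel for why it's messier, here's a rough sketch with the http extension, splitting the returned blob into rows and naively splitting columns on commas (the column layout and URL are hypothetical, and this won't handle quoted commas):

```sql
CREATE EXTENSION IF NOT EXISTS http;

-- http_get() returns the whole response; content is one big text value.
SELECT split_part(line, ',', 1) AS name,
       split_part(line, ',', 2) AS amount
FROM http_get('https://docs.google.com/spreadsheets/d/<sheet-id>/export?format=csv') AS resp,
     LATERAL unnest(string_to_array(resp.content, E'\n'))
             WITH ORDINALITY AS t(line, n)
WHERE n > 1            -- skip the header row
  AND line <> '';      -- drop any trailing blank line
```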
The next option is actually using PL/Python. And of course, when you're working with a programming language, it's much easier to pull down data and then manipulate it at that point. So he has some code on how to do that here (a rough sketch of the idea follows below). And the last option he covers is remote access with a foreign data wrapper. Specifically, he's using the OGR Foreign Data Wrapper, which I think is predominantly used for geospatial systems, but apparently it can pull down CSV files and allow you to load them into Postgres. So he describes how to do that here. If you're interested in learning more, definitely check out this blog post.
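For the PL/Python route, a sketch might look something like this (it's an untrusted language, so creating the function requires superuser; the function name and column set are made up for illustration):

```sql
CREATE EXTENSION IF NOT EXISTS plpython3u;

-- Fetch a remote CSV over HTTP and return it as rows.
CREATE OR REPLACE FUNCTION fetch_remote_csv(url text)
RETURNS TABLE (name text, amount text)
AS $$
    import csv
    import urllib.request

    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode('utf-8')

    reader = csv.reader(body.splitlines())
    next(reader, None)                      # skip the header row
    return [(row[0], row[1]) for row in reader if row]
$$ LANGUAGE plpython3u;

-- Example call:
-- SELECT * FROM fetch_remote_csv('https://docs.google.com/spreadsheets/d/<sheet-id>/export?format=csv');
```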
Next piece of content- "The PostgreSQL Documentation and the Limitations of Community". This is from rhaas.blogspot.com. And he's talking about the state of the Postgres documentation. And he says it exists the way it is now basically because it's the developers of Postgres, people who are actually doing the Postgres development that writes the documentation. So when they add a new feature, they add the accompanying documentation in the same repository as the source code.
So as a result, the documentation is very accurate. All the options tend to be well-documented. And I would argue the documentation of Postgres is more like a comprehensive reference manual. So for that reason, it may be a little bit difficult to wrap your head around certain concepts, or when you're first coming to Postgres, there's not an easy ramp towards understanding it. Basically, they give you the raw reference manual when learning Postgres. But he says that's the basic nature of how they do their documentation.
And the problem is that even when someone proposes changes, it's very hard to actually get those changes through, because it's kind of always been done this way; there's a bit of inertial resistance to it. But he says there's not necessarily anything wrong with that. It's just the way it is. I did find it interesting to read and think about, so I definitely welcome you to check out this blog post as well.
Next piece of content- "USE HOT, SO CLUSTER WON'T ROT IN POSTGRESQL". This is from cybertec-postgresql.com. And he's talking about CLUSTER primarily. And what CLUSTER does is it enables you to rearrange your physical data pages on the table to match that of an existing index. So by default, an index is definitely an ordered structure, but when you query and an index is used, it then gives you a pointer to the different rows. Now those may be located in entirely different data blocks.
So pulling back the data may actually take a bit of time. The most efficient way to do it is to have the physical pages of the heap, the actual table, ordered the same way as the index. That's what CLUSTER does for you. Now one disadvantage of CLUSTER is that you can only cluster the data by one index. So if you have five indexes, you can only pick one index for the table to match; you can't, of course, do all five.
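For reference, here's what the command itself looks like (table and index names are placeholders); the cost is in the rewrite and the locking discussed next:

```sql
-- Rewrite the table in the physical order of one chosen index.
CLUSTER orders USING orders_created_at_idx;

-- Postgres remembers the chosen index, so later re-clustering is just:
CLUSTER orders;
```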
The other problem, and I would argue the bigger one, is that it needs an ACCESS EXCLUSIVE lock to do its work. That means the table is locked for all reads and writes, so it's essentially like doing a VACUUM FULL. For that reason, I've maybe used CLUSTER once; I hardly ever use it because of that disadvantage. And he also mentions it temporarily takes about twice the amount of disk space, because you're rewriting the whole table.
What he mentions here is that if you have a table that gets some performance benefit from being clustered, and all you're doing are HOT updates (basically, you're only updating non-indexed columns), your table will remain clustered because the rows aren't being moved around. But if you have active inserts going on, your clustering will eventually fall apart; it will fall out of alignment with the index.
If you have updates that are not HOT updates, where indexed columns are being changed, your clustering again falls out of order. That's another reason why I don't typically use it: you have to keep re-clustering, and that's incredibly expensive, blocking reads and writes, every time you need to do it. But if all you have are HOT updates, and I believe deletes as well, the clustered order maintains itself.
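If you want to check whether your workload is actually getting HOT updates, the statistics views track them; a quick sketch (the table name is a placeholder):

```sql
-- Share of updates on a table that were HOT (no index entries touched).
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'orders';
```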
So this is a specific set of conditions that you may find a use case for. But again, in my experience, I haven't used CLUSTER all that much because of these disadvantages. This post also goes into actually setting this up, experimenting with it, and showing you some results. And the other important thing he mentions is that if you're going to try this and actually get HOT updates, you need to have the right fillfactor: make sure the table's pages are not completely full, so there's space to do those HOT updates.
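As a sketch of that setup (the value 70 is just illustrative; the post experiments with several fillfactors):

```sql
-- Leave free space in each heap page so updated row versions can stay on
-- the same page, which is what makes HOT updates possible.
ALTER TABLE orders SET (fillfactor = 70);

-- Existing pages only pick up the new fillfactor on the next rewrite,
-- e.g. the initial CLUSTER (or a VACUUM FULL).
CLUSTER orders USING orders_created_at_idx;
```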
And he has a table where he analyzed different fillfactors as well. Now as I was reading this post, the other thing I would consider is that another way to get your data physically co-located is actually using a covering index. Now he didn't mention covering indexes, but basically this is where you put a data payload on the index itself, which gives you an index-only scan with the data right there in the index.
Now there are, of course, disadvantages with that, but that's another tack I would potentially take if physical co-location of the data is important to you (there's a sketch of it below). But definitely check out this blog post if you're interested, as well as this week's "Five minutes of Postgres" from pganalyze.com, where they talk about how to use HOT updates in Postgres to make CLUSTER work better. So you can check that post out as well.
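To make that covering-index aside concrete, a minimal sketch (names are hypothetical) would be:

```sql
-- INCLUDE stores the extra column in the index leaf pages, so a query that
-- only touches these columns can be answered by an index-only scan without
-- visiting the heap (assuming the visibility map is reasonably up to date).
CREATE INDEX orders_created_at_covering
    ON orders (created_at)
    INCLUDE (total_amount);

-- e.g. SELECT created_at, total_amount FROM orders WHERE created_at >= '2023-01-01';
```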
Next piece of content- "The Do's and Don'ts of Postgres High Availability Part 1: Great Expectations". This is from enterprisedb.com. And they're talking about when you're wanting to set up a highly available system. Number one thing you need to think about is expectations. And with regard to that, they're talking about things like RPO, which is the Recovery Point Objective, and RTO, which is the Recovery Time Objective.
So as they paraphrase here, what's the acceptable amount of data loss with RPO, and what's the acceptable amount of downtime with RTO? And he says that in order to keep RPO as low as possible, you would want to use synchronous replicas. So not just asynchronous replicas, but truly synchronous ones. Ideally, have more than one and require a quorum. In this example, he's using three replicas, where any two must be available to confirm a synchronous write. Because that's what synchronous means: if a write can't be confirmed on enough replicas, the commit doesn't complete on the primary.
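A sketch of that quorum setup, expressed as configuration on the primary (the standby names are placeholders):

```sql
-- Require acknowledgement from any 2 of these 3 standbys before a commit
-- returns to the client. If fewer than 2 can confirm, commits wait rather
-- than completing on the primary alone.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (standby1, standby2, standby3)';
SELECT pg_reload_conf();
```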
So that's definitely something to be cautious of when setting up synchronous replication. In addition, he talks about streaming the WAL for backups as well. In terms of lowering the RTO, they mention being able to get an old primary back up and running as a replica as quickly as possible, basically by using the pg_rewind utility. And to do that, you need to make sure you have wal_log_hints set to on and full_page_writes on. And of course, they also recommend enabling data checksums as well.
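A sketch of the settings that make pg_rewind viable (data checksums themselves are enabled at initdb time, or with the pg_checksums utility while the cluster is stopped; the paths below are placeholders):

```sql
-- Needed so pg_rewind can tell which blocks changed; requires a restart.
ALTER SYSTEM SET wal_log_hints = on;

-- On by default, but must not be disabled for pg_rewind to be safe.
ALTER SYSTEM SET full_page_writes = on;

-- After a failover, on the cleanly shut-down old primary, something like:
--   pg_rewind --target-pgdata=/var/lib/postgresql/data \
--             --source-server='host=new-primary user=postgres dbname=postgres'
```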
And their last recommendation is using replication slots. And of course, they do recommend setting max_slot_wal_keep_size, because if you don't set this, it's possible to run out of disk space on your primary if you have an orphaned replication slot. But with this set to, say, a hundred gigabytes, WAL retention will only grow up to that limit. If you want to learn more about high availability settings, you can definitely check out this blog post.
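As a sketch, matching the hundred-gigabyte figure mentioned:

```sql
-- Cap how much WAL a replication slot can force the primary to retain.
-- Slots that fall further behind than this are invalidated instead of
-- filling up the disk.
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
SELECT pg_reload_conf();
```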
Next piece of content- "pgBackRest File Bundling and Block Incremental Backup". This is from crunchydata.com. So these are two new features that have been added to pgBackRest, the file bundling and Block Incremental Backup. So if you're interested in learning about that, you can definitely check out this blog post.
Next piece of content- "Build hypothetical indexes in Amazon RDS for PostgreSQL with HypoPG". And this is from aws.amazon.com. And they're talking about the HypoPG extension, which enables you to create a hypothetical index. Now, it doesn't actually allow you to do actual queries, but allows you to explain a query you want to do and then the planner can take into account, this hypothetical index, to see what impact it would have. So it's a way to quickly iterate and discover what the best indexes are for a particular table. But if you want to learn how to use it, you can definitely check out this blog post.
Next piece of content- "Overview of ICU collation settings, part 2". This is from peter.eisentraut.org. So if you want to deep dive into more ICU collation settings, you can definitely check out this blog post.
Next piece of content- "Manipulating Data In PostgreSQL: Learning PostgreSQL with Grant". This is from red-gate.com. And Grant is covering inserts, updates, deletes, as well as the returning clause, and merge, which is in Postgres 15. So if you want to learn about data changes and Postgres, definitely check out this blog post.
Next piece of content. There was another episode of Postgres FM this week. This one was on “Extensions”. It covers what they are, how they affect your decisions around Postgres, and some things to keep in mind while using them. You can listen to the episode here or check out their YouTube video here.
Next piece of content- the PostgreSQL person of the week is Oleksii Vasyliev. If you're interested in learning more about Oleksii and his contributions to Postgres, definitely check out this blog post.
And the last piece of content- we did have another episode of the Rubber Duck Dev Show this past Thursday afternoon. This one was on "The Future of Dragon Ruby with Amir Rajan". DragonRuby is actually a game development toolkit where you can use Ruby to write games. If this is something that's of interest to you, we welcome you to check out our show.