
Cleaning Up, Function Performance, 11 Million IOPS, Change Data Capture | Scaling Postgres 151


In this episode of Scaling Postgres, we discuss cleaning up your database, function performance, 11 million IOPS and change data capture.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "The Unexpected Find That Freed 20GB of Unused Index Space". This is from hakibenita.com and he's talking about how they were running out of disk space, which they would normally address by just adding more. However, this time they started looking at where they could potentially clean up certain areas of the database to free up space. They show a graph of free space over time, which is a little unusual in that the higher it goes, the more free space there is. Basically, a few operations brought it part of the way up, then a bit more growth, and then something additional got them to about 70 GB of free disk space. The first thing they did was look at unused indexes, and they included the query they used.

So you could just take that query and run it to see if you have any indexes that aren't being scanned, with no tuples read and no tuples fetched. The thing to keep in mind is that this relies on Postgres statistics, so you may want to reset them periodically to make sure you're working with up-to-date data on whether indexes are actually being used, and he gives the command you can use to do that.
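As a rough sketch (not necessarily the post's exact query), that kind of unused-index check against pg_stat_user_indexes looks something like this:

-- List indexes that have never been scanned, largest first
SELECT relname AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

-- Statistics are cumulative, so reset them periodically before re-checking
SELECT pg_stat_reset();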

The next thing they looked at was index bloat. One way to address index bloat is by doing a reindex, ideally with REINDEX CONCURRENTLY, which was added in Postgres 12, so make sure you're on a version that supports it. Failing that, you can also use pg_repack, but REINDEX CONCURRENTLY is the preferred way if your version includes it. Again, with recent versions of Postgres you're going to get some benefits in terms of more compact indexes; they mention that in Postgres 13 a lot of deduplication can happen if you're indexing duplicate values. Now, in the process of a concurrent reindex, the new indexes are suffixed with _ccnew as they're being created, and if any kind of failure happens you may need to go in and clean them up; they provided a query for doing that, and they also took a look at some of the deduplication capabilities in 13.
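To give a flavor of that, here is a minimal sketch; the index name is hypothetical and the invalid-index query is a generic one, not necessarily the article's:

-- Rebuild a bloated index without blocking writes (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY some_bloated_index;

-- A failed concurrent reindex can leave behind invalid indexes (suffixed _ccnew/_ccold);
-- this lists any invalid indexes so they can be dropped
SELECT c.relname AS index_name
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisvalid;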

Then they moved on to table bloat. One way to address table bloat is by manually recreating the table and switching over to it. There are techniques for that, but it's a very manual process and you probably have to use triggers to keep things up to date. The other option is to do a VACUUM FULL; the disadvantage is that it locks the entire table. Another tool you can use is the pg_repack extension to repack tables. That's another option, but keep in mind it uses more than twice the amount of space of the table you're working on, potentially including the indexes as well, so that's something to be aware of. Now, the thing they actually found that reduced their index size was that they were indexing a column that mostly contained NULLs. For example, a cancelling user: only a very small number of rows in their database represented a cancellation, so most of the values were NULL.

Well, Postgres indexes those NULLs too, so the way they were able to save a lot of space was with a partial index, essentially only indexing rows where the column is NOT NULL. In doing that, they said 99% of the dead weight was shaved off the index: it went from 769 megabytes down to 5 megabytes. You can also use partial indexes if, say, you have a status column and you never query for particular statuses; you can index only the values that you do query to save space. They even give you a query to look for columns with a lot of NULLs that could be candidates for a partial index. So if you want to learn more about how to help clean up your database, you can check out this blog post.
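Before moving on, a hypothetical version of that kind of partial index, plus a pg_stats query for spotting mostly-NULL columns (both are sketches, not the article's exact code):

-- Only index the rows that actually have a value, since nobody queries the NULLs
CREATE INDEX CONCURRENTLY orders_cancelled_by_ix
    ON orders (cancelled_by_user_id)
    WHERE cancelled_by_user_id IS NOT NULL;

-- Columns with a very high fraction of NULLs are candidates for partial indexes
SELECT schemaname, tablename, attname, null_frac
FROM pg_stats
WHERE null_frac > 0.9
ORDER BY null_frac DESC;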

The next piece of content is "Cleaning Up Your Postgres Database". This is from blog.crunchydata.com and he's talking about techniques you can use to monitor and then potentially clean up your database. The first thing he suggests looking at is your cache hit ratio, predominantly how often indexes are being hit and also how often tables are being used. He gives the exact query to check your index hit rate as well as your table hit rate. The next thing he looked at was a pg_stat_user_tables to determine what percentage of the time an index is being used for queries. He presents it by table. 

So for a small table where the index isn't used that much, maybe you don't need to worry about it, but a very large table where the index isn't being used a significant amount of the time is probably something you want to check out and potentially add some indexes to. The other thing you can do is clean up unused indexes, and he has a query to look for those that aren't being used, similar to the previous post.
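Cache hit rate and index usage checks along those lines look roughly like the following (a sketch, not necessarily the post's exact queries):

-- Index and table cache hit rates; you generally want these close to 0.99
SELECT sum(idx_blks_hit)::numeric / nullif(sum(idx_blks_hit + idx_blks_read), 0) AS index_hit_rate
FROM pg_statio_user_indexes;

SELECT sum(heap_blks_hit)::numeric / nullif(sum(heap_blks_hit + heap_blks_read), 0) AS table_hit_rate
FROM pg_statio_user_tables;

-- Percentage of scans on each table that use an index, alongside an estimate of table size in rows
SELECT relname,
       100 * idx_scan / nullif(idx_scan + seq_scan, 0) AS pct_index_used,
       n_live_tup AS rows_estimate
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;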

The other thing he mentions is to keep certain things outside Postgres. He's proposing that if you have things like messages, logs, events, and things of that nature, you don't put them in Postgres, because that kind of data will inevitably take over most of your storage. Now, I actually do store things like that in Postgres, but I tend to use partitioned tables and then drop the partitions when they're no longer needed. I prefer doing that rather than running a separate data store, but this is the proposal that he has. The last thing he mentions is to dig into performance with pg_stat_statements to determine how well your statements are running and whether anything needs to be optimized. So if you're interested in this content, you can check out this post.
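For what it's worth, a minimal sketch of that partition-and-drop approach and the pg_stat_statements check; the table, partition, and column names are illustrative, pg_stat_statements must be installed, and its timing column is total_exec_time in Postgres 13 and later (total_time in older versions):

-- Range-partition high-churn data by time
CREATE TABLE events (
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2021_02 PARTITION OF events
    FOR VALUES FROM ('2021-02-01') TO ('2021-03-01');

-- Dropping an old partition is far cheaper than DELETE plus the vacuuming that follows
DROP TABLE events_2021_02;

-- Top queries by total execution time
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;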

The next piece of content is "Boost your User-Defined Functions in PostgreSQL". This is from ongres.com and they're giving some advice with regard to using user-defined functions within Postgres in your application. Now, they're not necessarily saying you should or shouldn't put logic in the database, but if you do, you should follow these steps. First, they're saying use PL/pgSQL functions for simple SQL statements. So for example, you can put logic such as this section of a case statement and make it a function to call. When you do it, you want to use SQL to do it because this is basically just SQL as opposed to using PL/pgSQL if you don't need it because the optimizer is able to optimize SQL in queries versus using a psql. So it's pretty much doing the same thing, but the pure SQL is about four times faster compared to the PL/pgSQL function. 

The second item is avoiding unnecessary use of the SELECT INTO clause. Basically, you should use the assignment operator in your functions rather than SELECT INTO because it gives you better performance, again about four times better in the example they show here. The third is overusing the RAISE clause. Like in any other language, exceptions and exception handling are a burden on the function's processing; here you can see it's about four or five times slower with that exception handling in place.
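Going back to that second item, a hypothetical PL/pgSQL fragment showing both styles side by side:

CREATE FUNCTION add_one(i int) RETURNS int
LANGUAGE plpgsql AS $$
DECLARE
    result int;
BEGIN
    -- Slower: pushes a trivial expression through the SQL executor
    SELECT i + 1 INTO result;

    -- Faster: plain assignment for simple expressions
    result := i + 1;

    RETURN result;
END;
$$;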

Fourth is overusing a high-level programming style for work that is better expressed as SQL. So basically it's best to express your queries in SQL rather than building the equivalent yourself. For example, in their comparison they use a FOR loop to do what could have been done with a LATERAL join, and performance roughly doubles by using the LATERAL statement instead of the FOR loop in PL/pgSQL. The next set of recommendations is with regard to function properties. The first one is to use PARALLEL SAFE whenever possible.

This mostly applies to read-only functions that don't access sequences, and there are a number of other restrictions, but if you can mark a function PARALLEL SAFE, it's going to give you better performance. Secondly, use IMMUTABLE when possible: if your function is truly immutable, in other words the same input always results in the same output, go ahead and mark that function as IMMUTABLE. The last recommendation is to monitor your functions' performance using pg_stat_user_functions so you can check on how they're doing. So if you're interested in learning more, you can check out this post.
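As a sketch of how those properties are declared, along with the statistics view he mentions (the function is hypothetical, and pg_stat_user_functions only collects data when track_functions is enabled):

-- A read-only, deterministic helper marked accordingly
CREATE FUNCTION normalize_email(addr text) RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE AS $$
    SELECT lower(trim(addr));
$$;

-- Check per-function timing (requires track_functions = 'pl' or 'all')
SELECT funcname, calls, total_time, self_time
FROM pg_stat_user_functions
ORDER BY total_time DESC;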

The next piece of content is "Achieving 11M IOPS & 66GB/s IO on a Single ThreadRipper Workstation". This is from tanelpoder.com. Now, this is not database-specific. He doesn't mention databases in terms of analysis at all. But what he is saying is that modern hardware has become super, super high-performant. You may want to consider a very recent, very fast set of hardware versus setting up a cluster of hundreds or thousands of lower-powered machines because that may result in better performance. I actually took a look at this from AWS's perspective, and right now the max IOPs they offer is about 64,000 IOPs. 

They have a preview offering 256,000 IOPS, and you can attach about 30 volumes to one of their metal instances, but even with that you're still not going to get near the 11 million IOPS of this single workstation. So it's an interesting read on what is possible. Now, this is a fairly complex blog post that doesn't mention databases; it's all about the hardware, its capabilities, and the optimizations needed to get around different bottlenecks. But in this era of distributed workloads and seeking to scale out versus scale up, here's a post that talks about how far you can scale up, essentially. So if you're interested in that, you can check out this post.

The next piece of content is "Change Data Capture in Postgres with Debezium". This is from blog.crunchydata.com and essentially this enables you to use your write-ahead log to track changes. So you don't have to put any triggers on Postgres, but it just reads the write-ahead log stream and sends those data changes off, say, to another database or another way to capture the data that is of interest. So they were saying this is a way you could potentially put logs and messages and things of that nature. If you don't want to put them in Postgres, you can send them off to a separate data store. This is using Debezium to do it. And they go through the process of setting this up. Now they're setting this up with their Crunchy Bridge, but you can use this information to be able to set up Debezium for your Postgres instance. So if you're interested, you can check out this blog post.

The next piece of content is "POSTGRESQL: WHAT IS A CHECKPOINT?". This is from cybertec-postgresql.com, and this walks through the process of how INSERTS happen. Basically, an INSERT needs to happen. The first step is actually to write what's being inserted into the WAL. So it commits it to disk that this is going to happen. It then writes that information to the shared buffers so it's in memory.On a periodic basis, a background writer writes it to the actual data files and that is the checkpointing process. It's basically making sure that all the dirty buffers get written to the disks and then that checkpoint is written to the WAL. And they mentioned two ways of doing the configuration. One is the checkpoint_completion_target, which by default is 0.5. That means with whatever duration the checkpoint is supposed to happen, it should be completed basically 50% into that time frame. 

Interestingly, you usually bring this up to around 0.9. He comments here, quote "In Postgres 14, this parameter will most likely not exist anymore. The hardcoded value will be 0.9, which will make it easier for end users". So that leaves two other variables you can adjust. One is checkpoint_timeout. By default it's five minutes, but normally you make that 20 minutes, maybe 30 minutes, or an hour at most. Then max_wal_size determines how much WAL can accumulate before a checkpoint is forced. So basically these two parameters determine how often checkpoints happen. If checkpoints are being triggered faster than your timeout because too much WAL is being generated, you'll generally get a warning (assuming checkpoint_warning is configured), and in that case you may want to increase your max_wal_size. But if you want to learn more about checkpoints, you can check out this post.
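As a sketch, those settings can be adjusted like so; the values are illustrative, not recommendations for every workload:

ALTER SYSTEM SET checkpoint_completion_target = 0.9;
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET max_wal_size = '8GB';
ALTER SYSTEM SET checkpoint_warning = '5min';  -- log a warning if checkpoints come faster than this
SELECT pg_reload_conf();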

The next piece of content is actually a YouTube channel. About 20 or so videos from Postgres Build 2020 have been posted to the EDB YouTube channel. So there are all sorts of Postgres-related videos, each approximately 30 minutes in duration, posted there. So if you are interested in video content, definitely check out this YouTube channel.

The next piece of content is "LOAD AND QUERY PI-HOLE DATA FROM POSTGRES". This is from rustprooflabs.com and Pi-hole is an application that lets you block ads at the network level and it stores information about its operation into a SQLite database. Well, he actually wanted to analyze it. So what he did is he exported the database and then used the SQLite Foreign Data Wrapper to be able to query that data from Postgres. He talks about his setup and getting it set up to be able to query the data. The thing he was noticing was that he had a little bit of a poor performance. So then he goes into actually loading the data from specific tables he was interested in into Postgres. So he used the Foreign Data Wrapper to actually load the data into tables he created in Postgres. So if you're interested in that, you can check out this blog post.

The next piece of content is "PostgreSQL Database Security: Authentication". This is from percona.com and it's the second in a series of posts on authentication and they're talking about the Postgres internal authentication. In other words, offering Trust, Reject, MD5, SCRAM, and Cert authentication methods. With Trust and Reject, basically, Trust: come in without any password at all, or Reject: without asking for a password at all. MD5 does an MD5 hash for password authentication SCRAM uses SCRAM authentication and then uses certificates. So if you want to learn more about setting up authentication with Postgres, definitely check out this post.

The next piece of content is "Postgres and Crypto Supply Chain of Integrity". This is from blog.crunchydata.com and they're talking about how you get PostgreSQL to use FIPS 140-2 crypto and basically this is the Federal Information Processing Standard 140-2. So if you're interested in that, you can check out this post.

The next piece of content is "Waiting for PostgreSQL 14- SEARCH and CYCLE clauses". This is from depesz.com. So, these are clauses that are part of the SQL standard that can be used for recursive queries. He's using an example of a directory structure and you can actually use the SEARCH keyword to do a search by depth. So basically it's going to show you the data laid out in depth. Or you could search by breadth first, so it gives you the breadth of the directories and then it goes further down. Then he also talks about CYCLE and that is where you're trying to determine all the different routes between different cities. Say you're planning a flight, once you load up all the data and all the possibilities, you can then use a recursive query to find all the different paths that you can go on. What's great about CYCLE is that it actually halts it once a potential duplicate trip is stopped. So if you're interested in learning more about that, you can check out this post.

The next piece of content is "How PostgreSQL Inserts A New Record With The help of Table Access Method API and Buffer Manager". This is from highgo.ca. If you want to learn about the internals of PostgreSQL and how it does inserts using the Table Access API, definitely check out this blog post.

The next piece of content- "Using GitOps to Self-Manage Postgres and Kubernetes". This is from blog.crunchydata.com. So if you want to use GitOps to help manage your PostgreSQL database system using Kubernetes, definitely check out this post.

Next piece of content is "Various Ways to Retrieve Pgpool-II's Statistics". This is from b-peng.blogspot.com. So this is monitoring Pgpool II. They present three ways. One is through SHOW SQL commands you can execute from a database connection. The other is using PCP commands which my understanding communicates directly with Pgpool II to give you information back. Or you can use a pgpool_adm extension to do so. So if you're interested in learning more about that, you can check out this post.

The next piece of content is "OverlayNG and Invalid Geometry". This is from lin-ear-th-inking.blogspot.com, and he's talking about some improvements that were made to PostGIS 3.1 and GEOS 3.9. In referencing a previous post we mentioned on Scaling Postgres, PostGIS was still reporting errors with invalid geometry. He goes into a little bit of a clarification on why that is and the rationale behind it. So you can check out this post if you're interested in that.

Next piece of content is "PG_TIMETABLE: ASYNCHRONOUS CHAIN EXECUTION". This is from cybertec-postgresql.com. So this is an update to the pg_timetable extension for scheduling tasks in Postgres. If you're interested in that, you can check out this blog post.

The last piece of content, the PostgreSQL person of the week is Alexander Sosna. So if you're interested in learning more about Alexander and his contributions to Postgres, you can definitely check out this blog post.
