Combining Indexes, Regular Expressions, Parallelism, PostGIS | Scaling Postgres 75
In this episode of Scaling Postgres, we discuss combining indexes, regular expressions, parallelism in Postgres and PostGIS content.
Content Discussed
- Combined indexes vs. separate indexes in PostgreSQL
- PostgreSQL: Regular expressions and pattern matching
- Parallelism in PostgreSQL
- How to Adjust Linux Out-Of-Memory Killer Settings for PostgreSQL
- PostgreSQL: Simple C extension Development for a Novice User (and Performance Advantages)
- Using plpgsql_check to Find Compilation Errors and Profile Functions
- Postgres 12 highlight - Replication slot copy
- Waiting for PostgreSQL 13 – Add support for --jobs in reindexdb
- PostGIS Overlays
- Simple SQL GIS
- Waiting for PostGIS 3: ST_AsMVT Performance
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "COMBINED INDEXES VS. SEPARATE INDEXES IN POSTGRESQL". This is from cybertec-postgresql.com, and they're basically talking about indexes on single columns versus indexes covering multiple columns. For this discussion, they create a table with three columns, A, B, and C, and insert a million rows with random values in each column. Then they create an index that covers all three columns, so a multicolumn index on A, B, and C. Now, the first thing they start with is something pretty basic: PostgreSQL will rearrange filters for you. So it doesn't matter how you order your WHERE clause; it'll still use the conditions and scan through the index in the proper order. It'll do that job for you. It even follows transitive relationships and rewrites the conditions into the form they should be.
So even if you write B equals A, A equals C, C equals 10, it knows this means A equals 10, B equals 10, C equals 10. Also, with a multicolumn index, it can use the leading part of the index for searches. Since A is the first column in this multicolumn index, you can search A equals 10 and it will use the index to scan those rows. The key thing is that you can use the index in the WHERE clause as well as in the ORDER BY clause. So they did a WHERE clause on A and then ordered by B and C. This is a natural fit for the index: it filters by the A column and then orders by B and C, which are the next columns in the multicolumn index. But of course, they also discuss cases where a composite or multicolumn index won't work. If you're searching only on the second column or the third column, it's not going to use the index; you're going to get a parallel sequential scan, not an index scan.
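As a rough sketch of the post's setup (table and column names are assumed here, not taken verbatim from the article), the leftmost-prefix and ORDER BY behavior looks like this:

```sql
-- Hypothetical recreation of the setup described above
CREATE TABLE t (a int, b int, c int);
INSERT INTO t
  SELECT (random()*100)::int, (random()*100)::int, (random()*100)::int
  FROM generate_series(1, 1000000);
CREATE INDEX idx_t_abc ON t (a, b, c);
ANALYZE t;

-- Uses the index: leading column a in WHERE, b and c in index order
EXPLAIN SELECT * FROM t WHERE a = 10 ORDER BY b, c;

-- Condition order and transitive equalities are normalized by the planner:
EXPLAIN SELECT * FROM t WHERE b = a AND a = c AND c = 10;
```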
Now they say that on occasion it may scan through the index anyway, essentially doing a sequential scan of the index, under certain conditions. They simulated this by setting seq_page_cost relatively high. When they ran the query again, it did scan through the index, but it was also a lot slower: 63 milliseconds using the second column, whereas the next closest example was 4 milliseconds. So it's definitely a lot slower doing this kind of scan. In other words, it can't effectively search on only the last column or a middle column; it needs at least the first column, or the first and second columns, of the multicolumn index. Then they go into understanding bitmap index scans. This is where PostgreSQL can combine the results of index scans. In their example, it does a bitmap scan where A equals 4, another bitmap scan where A equals another number, and then a BitmapOr to combine the results.
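A minimal sketch of the BitmapOr case they describe (assuming a table t with an index whose leading column is a):

```sql
-- Each branch gets its own bitmap index scan; the bitmaps are then
-- combined with a BitmapOr before the heap is visited once.
EXPLAIN SELECT * FROM t WHERE a = 4 OR a = 7;
```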
So that's also something it can do. In the last set of examples, they created single-column indexes on all three columns, so no longer the multicolumn index, and they ran the same search. It can use those indexes and combine them, but if you'll notice, it believes it's more efficient not to use all three indexes at once. It uses what it feels are the two most efficient indexes, combines them with a BitmapAnd, and then just applies a filter for the column whose index it did not use. So the optimizer used the indexes on B and C but applied a plain filter on A. Lastly, they follow up with optimizing min and max in SQL queries. This is an additional case: when you're looking for the min or the max of a value, it will also use the indexes. So this post gives a good overview of how indexes basically work, and maybe some of these ideas can help you decide whether you want single or multicolumn indexes as you're developing your tables.
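With single-column indexes instead (again assuming a hypothetical table t with integer columns a, b, c), the plan shapes they describe could be reproduced like this:

```sql
CREATE INDEX idx_t_a ON t (a);
CREATE INDEX idx_t_b ON t (b);
CREATE INDEX idx_t_c ON t (c);

-- The planner may BitmapAnd only the two indexes it considers most
-- selective and apply the remaining condition as a plain filter:
EXPLAIN SELECT * FROM t WHERE a = 10 AND b = 20 AND c = 30;

-- min/max can be answered by reading one end of an index:
EXPLAIN SELECT min(a), max(a) FROM t;
```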
The next post is "PostgreSQL: Regular expressions and pattern matching". This is from 2ndquadrant.com. It says Postgres supports multiple different types of pattern matching: POSIX-style regular expressions, SIMILAR TO, which is an operator added in SQL:1999, and the SQL LIKE operator. Now, I usually tend to use the LIKE operator, but this particular post talks about POSIX-style regular expressions: the tilde matches a regular expression, case-sensitive; the tilde asterisk matches a regular expression, case-insensitive; the exclamation point tilde does not match the regular expression, case-sensitive; and the exclamation point tilde asterisk does not match the regular expression, case-insensitive. It then goes through some different examples. So this is a pretty easy post to look at if you want to start using regular expressions in your queries.
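The four operators in action; these results follow directly from how the operators are defined:

```sql
SELECT 'PostgreSQL' ~  'Post';   -- true:  matches, case-sensitive
SELECT 'PostgreSQL' ~* 'post';   -- true:  matches, case-insensitive
SELECT 'PostgreSQL' !~ 'mysql';  -- true:  does not match, case-sensitive
SELECT 'PostgreSQL' !~* 'POST';  -- false: it does match case-insensitively
```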
The next post is "Parallelism in PostgreSQL". This is from percona.com, and actually the next four posts this week are from percona.com, so they've definitely posted quite a few this week. First, they talk about the components of parallelism, mostly the parallelism features introduced in 9.6. They cover the concept of PostgreSQL being process-based: generally you have one process per connection, so one process handles a particular query. But with the parallelism features they introduced, you can get additional worker processes. So you have a gather node that coordinates multiple workers: if PostgreSQL believes a piece of work can be done in parallel, it spawns multiple workers that send their tuples up to the gather node in order to parallelize that query.
They talk about the different capabilities this enables. The first is parallel sequential scans. If you're scanning a whole table looking for data, it makes total sense to do it in parallel with as many cores as you have. In their first example, with just a single process it executed in about 4,343 seconds, whereas in parallel with ten workers it completed in 1,628 seconds. They post a graph here, and what they note is that there is eventually a diminishing set of returns. So once you go parallel, number one, the speedup is not linear, and number two, at some point the cost of coordinating all of the different workers exceeds the benefit of adding another worker.
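A sketch of how you might compare serial and parallel scans yourself (the table and column names here are assumptions, not from the post):

```sql
-- Serial baseline: disable parallel workers for this session
SET max_parallel_workers_per_gather = 0;
EXPLAIN ANALYZE SELECT * FROM big_table WHERE some_col = 1;

-- Allow up to ten workers; the plan gains a Gather node over a
-- Parallel Seq Scan if the planner judges it worthwhile
SET max_parallel_workers_per_gather = 10;
EXPLAIN ANALYZE SELECT * FROM big_table WHERE some_col = 1;
```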
But they do show how this definitely improves performance. Next, they talked about parallel aggregates. This is when you're trying to count the number of rows in a table, for example, and how optimized that can get. Again, you have a gather node working with partial aggregates computed over different parts of the table, leading up to a finalize aggregate node in the leader. Now, here they didn't see quite as much of a difference: they got a performance boost of just over 16% with ten workers compared to a single process. So that's not as dramatic, but it does give you a bit more performance. Then they talked about parallel B-tree index scans and how those work, and discussed a little bit about parallel bitmap heap scans and parallel joins.
So if you want to learn a bit more about parallelism in PostgreSQL, this is definitely a blog post to check out. The thing that I always like to keep in mind is that if your workload is more transactional and you want a lot of connections doing work, then maybe parallelism isn't that much of a benefit, because you're already asking it to handle many, many connections doing small queries. However, if you have more of an analytical workload, where you'll be doing a lot of sums and counts with fewer users, and you want each query to use more cores of the machine, then parallelism is definitely a benefit. So just something to keep in mind if you're looking at optimizing your parallelism settings.
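If you do want to tune this, these are the main settings involved (the SET value is purely illustrative):

```sql
SHOW max_parallel_workers_per_gather;  -- workers one Gather node may use
SHOW max_parallel_workers;             -- total parallel workers at once
SHOW max_worker_processes;             -- overall background-worker cap
SHOW min_parallel_table_scan_size;     -- smallest table considered for parallel scan

-- e.g., for an analytical session:
SET max_parallel_workers_per_gather = 4;
```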
The next post, also from percona.com, is "How to Adjust Linux Out-Of-Memory Killer Settings for PostgreSQL". They go into a lot of detail about the out-of-memory killer and why it exists. Basically, you don't want a rogue process or program taking the entire memory of a machine. Unfortunately, PostgreSQL is usually the primary thing running on its server, so it tends to be a candidate for being killed. The recommendation is to set the vm.overcommit_memory kernel setting to 2 to reduce the probability of this happening. That means, quote, "...that kernel is not supposed to overcommit memory greater than the overcommit_ratio. This overcommit_ratio is another kernel setting where you specify the percentage of memory the kernel can overcommit". This is pretty much the safest option recommended for PostgreSQL. So if you want to learn a little about the out-of-memory killer in Linux and some of the settings suggested for PostgreSQL, definitely a blog post to check out.
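As a sketch, the kernel settings they discuss would be applied roughly like this (the ratio value here is illustrative, not a recommendation):

```
# /etc/sysctl.conf on a dedicated PostgreSQL host
vm.overcommit_memory = 2
vm.overcommit_ratio = 80   # percentage of RAM the kernel may commit

# apply without a reboot:
#   sysctl -p
```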
The next post is "PostgreSQL: Simple C extension Development for a Novice User (and Performance Advantages)". This is from percona.com. Here they're talking about developing a C extension, and they go through the process of how you do it. So if you're interested in developing additional features for PostgreSQL, this is definitely a great post to check out. They created a very simple extension called addme that just adds two arguments together, written in C. Then, for comparison, they wrote a PL/pgSQL function to see how its performance compared. Running the timings, the plain SQL + operator, since you're just adding two parameters together, came in at 0.65. The C function also came in at 0.65. However, the PL/pgSQL function call took 1.4, so roughly twice as slow, and they graphed it out to show the reduced performance. So creating a C extension definitely gets you more performance if you have custom functions you want to develop. Definitely an interesting blog post to check out if you're looking for a way to squeeze more performance out of PostgreSQL.
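A sketch of the SQL side of such a comparison (function names are assumed for illustration; the C shared library itself would be built separately, e.g. with PGXS):

```sql
-- Wrapper for the compiled C function
CREATE FUNCTION addme(int, int) RETURNS int
  AS 'addme' LANGUAGE C STRICT;

-- PL/pgSQL equivalent used for the timing comparison
CREATE FUNCTION addme_plpgsql(a int, b int) RETURNS int AS $$
BEGIN
  RETURN a + b;
END;
$$ LANGUAGE plpgsql;

-- \timing in psql, then:
SELECT addme(1, 2);           -- C version
SELECT addme_plpgsql(1, 2);   -- interpreted version
SELECT 1 + 2;                 -- plain SQL operator
```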
The next post, if you're interested in doing more PL/pgSQL development, is "Using plpgsql_check to Find Compilation Errors and Profile Functions". This is from percona.com. Basically, this is an extension that allows you to check for compilation errors in function code, find dependencies in functions, and also profile functions. Now, if you use the profiling features, you do need to add this extension to the shared_preload_libraries setting in the postgresql.conf file. They have an example from a previous post where creating a function generated no error, but when they ran it through the plpgsql_check_function command, it found that a relation the function needs does not exist. So this could be a great benefit if you're doing a lot of PL/pgSQL development, and as they point out, it can also help you find dependencies as well as profile functions. So if you're doing a lot of PL/pgSQL development, definitely a blog post to check out.
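Hypothetical usage, assuming plpgsql_check is installed (and in shared_preload_libraries if you want profiling); the function name here is made up:

```sql
CREATE EXTENSION plpgsql_check;

-- Report errors that CREATE FUNCTION alone would not catch,
-- such as references to relations that do not exist:
SELECT * FROM plpgsql_check_function('my_function()');
```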
The next post is "Postgres 12 highlight - Replication slot copy". This is from paquier.xyz, and they're talking about replication slots, which are used in streaming replication as well as logical replication and logical decoding. Now there's a new feature where you can actually copy a slot: pg_copy_logical_replication_slot and pg_copy_physical_replication_slot. As for use cases, maybe you want to do a backup with a permanent slot and then make a copy of it; that gives you the same restart LSN, so two replicas or standbys can each get their own slot to follow for a restore. That's pretty interesting. They also talk about scenarios where you're doing logical decoding or logical replication and you can create a new slot at the LSN of an existing slot to test decoding, maybe to inspect the output you're getting from something like pg_recvlogical. So definitely an interesting feature being added to PostgreSQL 12, and if you want to learn more, definitely check out this blog post.
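A sketch of the new Postgres 12 functions (the slot names are assumptions for illustration):

```sql
-- Copy a physical slot so a second standby can follow
-- from the same restart LSN
SELECT pg_copy_physical_replication_slot('base_slot', 'standby2_slot');

-- Copy a logical slot, e.g. to test decoding from the same position
SELECT pg_copy_logical_replication_slot('decode_slot', 'decode_test_slot');
```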
The next post is "Waiting for PostgreSQL 13 – Add support for --jobs in reindexdb". This is from depesz.com. So they're already talking about version 13, and there is a new parameter, --jobs, added to parallelize certain work, in this case for reindexdb. They show that when they parallelize it with eight jobs, it completes in about 6.8 seconds, whereas without parallelization it completes in 24 seconds. So definitely an improvement. I know 12 hasn't been released yet, but they're already working on version 13.
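The new option would be used roughly like this (the database name is an assumption):

```shell
# Postgres 13+: rebuild indexes with eight parallel jobs
reindexdb --jobs=8 --dbname=mydb
```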
The last set of posts are all related to PostGIS. The first post, from cleverelephant.ca, is "PostGIS Overlays". Basically, if you have a particular overlay you want to compute, this goes through how you can compute that combined result. Now again, I'm not an expert on PostGIS, but this post covers how to do it. So if you're interested, definitely a post to check out.
The next post is "Simple SQL GIS". This post is also from cleverelephant.ca. This is a case where they had a particular area that included water, and they were able to use additional data on where the water exists, along with some SQL computing the difference, to create this type of representation of the data. So if you have a use case for this, definitely a blog post to check out.
The last post is "Waiting for PostGIS 3: ST_AsMVT Performance". This is from crunchydata.com. They are talking about greater performance for this PostGIS function. So again if you're interested in the upcoming PostGIS 3, definitely a blog post to check out.