Indexing With tsearch, Tuples Not Yet Removable, Wicked Problems, Pattern Matching | Scaling Postgres 205
Join Over 1,000 Engineers & Get New Episodes Weekly!
In this episode of Scaling Postgres, we discuss the best ways to index with tsearch, what to check if tuples are not yet removable, wicked problems with Postgres and different pattern matching techniques.
Content Discussed
- How to index data with tsearch?
- 5mins of Postgres E7: Autovacuum, dead tuples not yet removable, and the Postgres xmin horizon
- Solving PostgreSQL wicked problems
- OrioleDB Github
- How to Cheat at WORDLE with PostgreSQL
- migra is a schema comparison tool for PostgreSQL.
- Pgpool-II Configuration Parameters - auto_failback
- Cloud Data Sources for the Postgres Operator for Kubernetes
- A snippet to acquire a Lightweight lock
- Gianni Ciolli
- Rubber Duck Dev Show Episode 32 | Reviewing the OWASP Top 5
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "How to index data with tsearch?". This is from depesz.com and he's talking about an issue that he was looking at on a Slack discussion where they said should you build a tsvector in a field and then index that or just make an expression index that has the tsvector in it. So in general, he says he doesn't like having an additional tsvector column in the table because it doesn't really serve any purpose if you're using expression index and it's just taking up space. Although my personal opinion is I actually like the cleanliness of it when querying, he actually has a solution to that.
So we'll take a look at what he's talking about. So he created a post table and he created an expression index using gin and a tsvector where he's concatenating the title, headline, and main body. Of course, in order to deal with search issues, you want to put a space between them and take into account if any of these happen to be NULL by using the COALESCE function. Now the issue is that every time you do a query, you're going to have to include this in the WHERE clause so it matches the expression in the index for the index to be used. But he has a technique that he suggests using. So he creates this function here that basically returns this value of the concatenation and you can create the index with that function.
And then when you do a query, you can actually use it and it will use the index. So this is definitely cleaner to use. He also says there's actually an obscure capability in Postgres that's not necessarily well documented, but you can use some dot notation to be able to do it. So you can actually say "your table name, dot, the function name" as long as you're passing in the table as the parameter and that will still work. So that's definitely pretty readable. Now I don't know if I'll necessarily use this technique, but it's definitely an interesting one if you have a lot of text search needs. But go ahead and check out this post if you want to learn more.
The next piece of content- "5mins of Postgres E7: Autovacuum, dead tuples not yet removable, and the Postgres xmin horizon" This is from pganalyze.com. So in this episode of Five minutes of Postgres, he's talking about a friend who had an issue where autovacuum was not cleaning up tuples and he referenced two posts from 2018 that serve as a good reference for finding the issue. So the main problem is that something is holding on to an older transaction ID that won't get recycled, and therefore autovacuum cannot clean up rows in, say, a particular table. And there are basically three main areas you want to check for that. One is long-running transactions because a long-running transaction will prevent those transaction IDs from being recycled. So you definitely want to be cautious of long-running transactions.
The second area is replication slots. So if you have a replication slot but nothing is consuming that slot, the WAL builds up. Now, a lot of times you see this as a disk space problem. So your log volume, for example, keeps expanding because nothing is essentially consuming the WAL and allowing it to be recycled. But this also can cause transaction ID wraparound problems too. So you want to make sure that none of your replication slots are orphan.
And the third one, which is more rare, is prepared transactions. And this typically happens when you're using two-phase commits. And that's definitely a rarity that I've seen. But these are the issues that you want to be aware of if you see that vacuum is not fully vacuuming in tables, or it's taking a long time to run because it's not clearing out dead tuples, there's other reasons autovacuum can take a while to run. That's usually if you have a ton of data in indexes, but if it's not actually clearing out those dead tuples, that's something to look at. But if you want to learn more, definitely check out this episode.
The next piece of content is actually a presentation that was actually given in Postgres Build 2021, and you can get access to the videos online, but this happens to be the presentation that was done, and it's called "Solving PostgreSQL wicked problems". And they're listing ten different problems that Postgres has. And their solution to try and alleviate these issues is a new extension that they've developed called Oriole, I believe. With these changes in this extension development, what they've seen in their testing is a 4 to 50 fold improvement of different metrics from IOPS improvements to transaction performance. And some of the things that they've done, they've listed here is moving the transaction IDs that need to be wrapped around from a 32-bit value to a 64-bit value. So that drastically reduced the risks of that happening. Inefficient replication that spreads corruption. I believe they're talking about block-level replication, whereas this new extension does row-level replication.
Relying on MVCC, having the heap contain all versions of the rows they're moving to an undo log. So an undo log should result in some performance improvements. The primary key index is a space hog. Basically, you could avoid a primary key index if you actually order the tables by the primary index. So you can avoid that and not do block-level compression of the data, whereas this implements a block-level compression of the data to get some space savings. So everything in green has already been implemented in the initial release of this extension. That happened a short time ago. So this post was actually on February 19, 2022. The presentation I think was at the end of 2021, but this is available to try out and the other areas are things that they're working on, but if you want to learn more about that, you can check out this post as well as the presentation online.
Next piece of content- "How to Cheat at WORDLE with PostgreSQL". This is from blog.crunchydata.com. This post does exactly what it says, loading up a bunch of different words into Postgres and then trying to find word matches given a certain set of words. So it's definitely an experiment in pattern matching in Postgres. So if you want to take a look at all the different techniques they use to try and do some of this word matching, you can definitely check out this post.
The next piece of content- "Migra is a schema comparison tool for PostgreSQL''. This is from dtabaseci.com. Frequently, you have file comparison tools that show you the difference between one file and another. Well, this actually does it for Postgres schemas, so you just run the command migra at a command prompt. It's a Python library, I believe, and you give it 2 PostgreSQL URLs. It connects to those databases and then tells you, okay, this is how you match the schemas up or what needs to change to match one schema to another. So it literally gives you the alter table commands to do that. So definitely an interesting tool. It says that they cover tables, partition tables, constraints, views, functions, and indexes, so a lot of different objects and notes you may need to be aware of are listed here as well. But definitely an interesting tool. I encourage you to check out if there's interest.
Next piece of content- "Pgpool- II Configuration Parameters - auto_failback". And this is from b-peng.blogspot.com. For connection pooling, I tend to use PgBouncer. I haven't actually used Pgpool II yet, but it has a lot of capabilities for handling a cluster of PostgreSQL servers. One of those areas is automatic failback of a Postgres server. Like if it goes online and comes back, how do you fail back to it? This post talks about that configuration and process, so if you're interested in that, you can check out this blog post.
The next piece of content- "Cloud Data Sources for the Postgres Operator for Kubernetes". This is from blog.crunchydata.com and they're talking about an enhancement to their Crunchy Data Postgres operator, where for building a new cluster of Postgres systems, you can actually rely upon an existing backup that is stored in AWS, Google Cloud Storage, or even Azure. So if you're interested in doing that, you can definitely check out this blog post.
The next piece of content- "A snippet to acquire a Lightweight lock". This is from highgo.ca, and if you're interested in learning more about the internals of how lightweight locks are set up in Postgres, you can definitely check out this blog post.
Next piece of content. The PostgreSQL person of the week is Gianni Ciolli. And if you're interested in learning more about Gianni and his contributions to Postgres, definitely check out this blog post.
And last piece of content we did have another episode of The Rubber Duck Dev Show this past Wednesday. This one was on reviewing web application security risks as defined by OWASP. So if you're a developer and interested in reviewing those, you can definitely check out our show.