20 Times Faster Text Search & Ranking? | Scaling Postgres 286
In this episode of Scaling Postgres, we discuss a new extension that promises substantially faster text search and ranking, an AI content storm, how to work with money and the fastest way to stop Postgres.
Content Discussed
- pg_bm25: Elastic-Quality Full Text Search Inside Postgres
- pg_bm25 Benchmark
- pg_bm25 Installation
- Introducing ParadeDB
- AI ignites the rain: Content Storm is back
- Postgres: the next generation. Investing in the next generation of committers.
- Working with Money in Postgres
- pgvector vs Pinecone: cost and performance
- Stop and start Postgres faster
- HOT Updates and BRIN indexes in Postgres 16
- pg_resetwal: When to reset the WAL in PostgreSQL
- Explore Practical PostgreSQL Substring Use Cases with Examples
- Setup Geocoder with PostGIS and Tiger/LINE
- Geocode Addresses with PostGIS and Tiger/LINE
- Null Presentation Updates
- Data Gravity
- How the name of the initial superuser is chosen in PostgreSQL
- PostgreSQL Indexes: What They Are and How They Help
- Commitfest throughput
- Overview of PostgreSQL Foreign Data Wrapper (FDW)
- How to Reduce Your PostgreSQL Database Size
- Getting Started with Psycopg2: Python’s PostgreSQL Adapter
- TRAKTOR, true multimaster replication for PostgreSQL
YouTube Video
Podcast Audio
Transcript
Could Postgres work like Elasticsearch and get you 20 times faster searching and ranking? We'll explore that in this episode of Scaling Postgres. But before we do that, I hope you, your friends, family and coworkers continue to do well. So the first thing I wanted to mention this week is that I am working on a Postgres course and it's going to focus on performance optimization because, you know, Scaling Postgres. Basically, I'll be teaching the framework I use when I engage with clients to help them make Postgres faster or reduce the query or statement burden on their database so it operates more efficiently. And I'm mentioning this because I'm looking for some feedback. If this course sounds like something you might be interested in, I'd like to have a Zoom call with you. So if you're interested in doing that, you can head over to scalingpostgres.com/zoom and just send me your email and we can coordinate a time to have about a 15 or 20 minute discussion.
But let's get started with this week's content. And that is "pg_bm25: Elastic-Quality Full Text Search Inside Postgres". This is from docs.paradedb.com and they're talking about a new extension that provides a new type of index called BM25, which is the algorithm used by modern search engines to calculate relevancy scores for search results. So presumably this is the same type of algorithm used in Elasticsearch. Now they're focusing on full-text search, which uses the tsvector type, and they're saying it has two issues. One is performance: when tables grow to millions of rows, they say a single full-text search can take several minutes. I haven't really seen that unless they're talking hundreds of millions of rows up to a billion. But then, of course, I'm always looking for a faster way to do things. The second thing they mention is that Postgres has no support for operations like fuzzy search, which it does, just not in the full-text search area. But it doesn't do the kind of relevancy tuning or relevancy scoring that BM25 provides.
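For reference, the built-in approach they're comparing against looks roughly like this on a hypothetical articles table: a GIN index over a tsvector expression, with ts_rank providing Postgres's built-in relevance ranking.

```sql
-- Built-in Postgres full-text search on a hypothetical "articles" table.
CREATE INDEX articles_fts_idx
    ON articles
    USING gin (to_tsvector('english', title || ' ' || body));

-- Search and rank with ts_rank; websearch_to_tsquery parses a Google-style query.
SELECT id,
       title,
       ts_rank(to_tsvector('english', title || ' ' || body),
               websearch_to_tsquery('english', 'postgres performance')) AS rank
FROM articles
WHERE to_tsvector('english', title || ' ' || body)
      @@ websearch_to_tsquery('english', 'postgres performance')
ORDER BY rank DESC
LIMIT 10;
```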
So already I'm kind of like, okay, we'll see what the rest of this covers. But this new extension and index type will be, as they say here, 100% Postgres native with no dependencies on some sort of external search engine. That's great. It's built on top of Tantivy, which is a Rust-based alternative to Apache Lucene. It says it's faster than the full-text search that exists now and it also supports fuzzy search, aggregations, highlighting and relevancy tuning. And it does real-time search as well. In other words, you don't have to re-index things. Now apparently what made this possible is that Tantivy was created as a Rust-based alternative to Apache Lucene and then Postgres got pgrx, which made it possible to build extensions in Rust. So you put those two together and they came up with this new extension to be able to do Elasticsearch-like searches. They're tying into Postgres using the index access method API as well as the operator API. As a result, you can search for query strings within a table and they even designed a way to pass instructions to Tantivy's mini query language. They did some benchmarking on their site and they indicate that the index build times are faster than the equivalent tsvector-based setup, which is shown in red.
But also the performance looks very impressive, although I haven't heard of minutes to return a query from a full-text search when indexes are properly set up. But this one here at a million rows is returning in just over 3 seconds with the tsvector-based search. You can see some of the benchmarks here, I took a look at them. Nothing jumped out at me, I'd probably have to install it and play around with it. But if this relationship holds as you get to many millions of rows, then the ParadeDB results look quite significant. Now in terms of installing it, you can get the ParadeDB Docker image, and you may be wondering what ParadeDB is. Well, that's basically their Postgres solution dedicated to searching, kind of like Citus is dedicated to scaling out Postgres and Timescale is dedicated to time series databases.
This one is focused on search. But going back to installing it, you can get it as a Docker container or you can do it self-hosted. So you basically install Rust and cargo-pgrx using the method they show here. Then you clone the code, initialize what I think is called a crate, install it and then create the pg_bm25 extension. So I find this super interesting, but what do you think? Is this something you think you would like to check out? Do you think it would work better for full-text search cases? I do have a portion of my clients that are also using Elasticsearch alongside Postgres, but if you can put it all in one database, that would seem really advantageous to me because I know some clients are only using it for a couple of simple searches. But check out this content if you want to learn more.
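Once the extension is installed, usage looks roughly like the examples in the ParadeDB docs. Treat the sketch below as an approximation: the index definition, its options and the @@@ operator syntax are based on their documentation and may differ in the version you install, and the mock_items table is just a placeholder.

```sql
-- Rough sketch of pg_bm25 usage; check the ParadeDB docs for the exact syntax.
CREATE EXTENSION pg_bm25;

-- Build a BM25 index over the rows of a hypothetical "mock_items" table.
CREATE INDEX idx_mock_items ON mock_items USING bm25 ((mock_items.*));

-- Search with Tantivy's mini query language (field:term) via the @@@ operator.
SELECT * FROM mock_items WHERE mock_items @@@ 'description:keyboard';
```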
Next piece of content- "AI ignites the rain: Content Storm is back". This is from supabase.com and apparently, they have a Content Storm where they coordinate with a number of content producers to say "Hey, let's create content on this one subject" and this one was dedicated to AI or artificial intelligence. So they actually had 30 creators participate in this Content Storm and all the content was published this past Tuesday, October 10. And you can see all the different content here. Now, all of this content is of course related to Subabase who did this blog post along with primarily pgvector and how that's being used together. But if you are looking for AI content, definitely bookmark this blog post to go over all this content.
Next piece of content- "Postgres: the next generation. Investing in the next generation of committers". This is from redmonk.com and they're talking about the committers to the Postgres code base and particularly how a lot of the core development team has an average age of probably around 50. And basically, even though that's fine and I know I'm around that range, it's definitely important to keep building the community, particularly for the future. And I just thought it was pretty interesting, this discussion on the community's status and things to potentially improve about it. So you can check out this blog post if you're interested in that.
Next piece of content- "Working with Money in Postgres". This is from crunchydata.com and they're talking about things you should or should not use when you're working with money. The first thing not to use they say is basically the money data type, primarily because it's tied to the database locale setting. So that could cause you a lot of problems down the line. So it's not really the best data type to use. Also floats. You never want to use floats because they're highly inaccurate. I mean, it's fine for graphical representation like games, or maybe geospatial work because the accuracy does not have to be super, super high. But with money, you don't want to lose a cent or even a partial cent, so you don't want to use those. The two candidates you should use are integers. And now it's normally used if you have an amount that's $33.50, they store 3,350 in the integer column.
Basically, you move the decimal point however many places over and store the full amount in the integer. But you need to have the currency as a separate column along with it so you know what calculations are needed. Now, the downside of that is, of course, you must always do a calculation to know what the true amount in the database is. Meaning if it's 3,350, you know you need to divide it by 100 to get the proper dollars and cents if you're dealing with United States dollars. The other option is numeric, where you can actually specify the precision and the scale, so the total number of digits and how many of them sit to the right of the decimal point. And be sure to choose a scale that's appropriate for whatever fractions you want to support. They also talk about different functions. So if you're interested in this, you can definitely check out this blog post.
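As a quick sketch of those two options, with hypothetical table and column names:

```sql
-- Option 1: store the smallest currency unit in an integer plus a currency code.
CREATE TABLE orders_integer (
    id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    amount_cents bigint NOT NULL,       -- $33.50 is stored as 3350
    currency     char(3) NOT NULL       -- e.g. 'USD', needed to interpret the amount
);

-- Option 2: numeric with an explicit precision and scale.
CREATE TABLE orders_numeric (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    amount   numeric(12,2) NOT NULL,    -- 12 digits total, 2 to the right of the decimal point
    currency char(3) NOT NULL
);

-- Converting the integer form back to dollars and cents for display:
SELECT amount_cents / 100.0 AS amount_dollars FROM orders_integer;
```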
Next piece of content- "pgvector vs Pincone: cost and performance". This is from supabase.com. They're comparing the pgvector extension to Pinecone, which is a dedicated vector database, and they did some performance tests, which of course you have to take with a grain of salt. But using the HNSW index on pgvector compared to Pinecone, the performance looks like it was over ten times faster. So that's a significant difference. Even the ivfflat performed better than Pinecone in their example here. They also increased the number of pods, but pgvector in the HNSW index still performed better than Pinecone. Then they tried the fastest pod type, but it resulted in decreased accuracy. So 0.94 compared to 0.99 or 0.98, I think even higher. But once they match the accuracy of the HNSW index, it looks like it still performed well, just a little bit less than twice as fast. So another good post to check out if you're considering using Postgres for AI work, so you can see how all the different knobs can be tuned to get the best performance out of different indexes.
Next piece of content- "Stop and start Postgres faster". This is from Postgres FM. This was the episode they had last week. And here Nikolay did a solo show where he was talking about this and he talked about when you're stopping, you could have issues with ongoing connections that haven't been dropped or maybe WAL that has not been archived yet. But the main reason, I think, is checkpoints. So basically whenever you shut down the database, you need to do a checkpoint to take everything that's in memory and then to the WAL and reach a consistent state on the data files on the disk. And I like what he said in this because he suggested taking a manual checkpoint before the shutdown procedure and maybe even doing one or two checkpoints.
Because maybe you do a checkpoint and it takes ten minutes to do. Well, do that checkpoint, then immediately do another one, so you get as much work done as possible before you start the shutdown procedure. I haven't really seen a lot of content mentioning that, but it's something I've done, running checkpoints before shutdown to make sure as much of the checkpointing process is done as possible. And then similarly, to speed up starts, it's basically a matter of making sure that you shut down gracefully, which will allow you to start faster, because if you have a crash scenario, you have to recover from it, you have to read all this WAL to get back up to a consistent state. So as long as you shut down in a clean state, the startup procedure should be pretty fast. But you can play the episode here or watch his video on YouTube.
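A minimal sketch of that idea (the shutdown command itself is a shell command, shown here only as a comment):

```sql
-- Run a manual checkpoint, or two back to back, so most dirty buffers are
-- already flushed before the shutdown checkpoint runs.
CHECKPOINT;
CHECKPOINT;
-- Then shut down cleanly, e.g. from the shell: pg_ctl -D "$PGDATA" stop -m fast
```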
Next piece of content- "HOT Updates and BRIN indexes in Postgres 16". This is from pganalyze.com, and apparently, another update to Postgres 16 is that HOT updates can be done even with BRIN indexes. So HOT updates are basically heap-only Tuple updates, so you're only updating the heap, you don't have to change any indexes. And apparently, BRIN indexes were blocking heap-only Tuple updates, whereas, in Postgres 16, HOT updates can happen even though you have a BRIN index on the table. So check out this piece of content if you want to learn more.
Next piece of content- "PG_RESETWAL: WHEN TO RESET THE WAL IN POSTGRES". This is from cybertec-postgresql.com. Basically, I would say probably don't ever run it. So, as he says here, the WAL helps you reach a consistent state along with the data files. Or if you have shared buffers in the data file, you can reach a consistent state. So anytime you're resetting WAL files, you're going to be throwing away some of that consistent state. And I like his very important rules here. The first is always back up what you have before running this. And really I would have probably put this first, although the backups are important, but use it as a means of last resort. In other words, this is the last thing you should do. The third is do not call pg_resetwal carelessly, definitely not. And expect data loss. So this is for when you just can't get your database running, in my opinion. And maybe it can allow you to get access to some of the data. But if you want to learn more about it, definitely check out this blog post.
Next piece of content- "Explore Practical PostgreSQL Substring Use Cases with Examples". This is from techbuddies.io and they're talking about the substring function in Postgres and all sorts of different ways you can use it, especially with a lot of the different regular expression options that they cover here. So a lot of different examples of how to use it. So if you're interested in that, you can check out this piece of content.
Next piece of content- "SETUP GEOCODER WITH POSTGIS AND TIGER/LINE". This is from blog.rustprooflabs.com and he's talking about setting up the US Census Bureau's Tiger/Line data set along with PostGIS to do geocoding of addresses. So this is the setup portion and then the second blog post actually does the geocoding. So if you're interested in that, you can definitely check out this blog post.
And the last piece of content is "Null Presentation Updates", which was done by Bruce Momjian. The presentation's title is "Nulls Make Things Easier?", and I like the question mark and how he says, quote "Humorously my major recommendation was to liberally use NOT NULL column constraints where appropriate to prevent their existence. Fewer nulls, fewer problems". And I agree, I try to use as few nulls as possible. But if you want to learn more, you can definitely check out this blog post.
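In practice that just means declaring NOT NULL on columns that should always have a value, for example:

```sql
-- Prefer NOT NULL constraints (with defaults where sensible) so nulls never
-- appear in columns that should always be populated.
CREATE TABLE accounts (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email      text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);
```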