Do You Need A Vectorizer? | Scaling Postgres 340
In this episode of Scaling Postgres, we discuss whether you need a vectorizer, different ways to bin or bucket timestamps, addressing a bad plan and advanced psql.
Content Discussed
- Vector Databases Are the Wrong Abstraction
- How to Automatically Create & Update Embeddings in PostgreSQL—With One SQL Query
- Which OpenAI Embedding Model Is Best for Your RAG App With Pgvector?
- Which RAG Chunking and Formatting Strategy Is Best for Your App With Pgvector
- 4 Ways to Create Date Bins in Postgres: interval, date_trunc, extract, and to_char
- Case study: optimization of weirdly picked bad plan
- Advanced psql
- Using CTID Based Pagination for Data Cleanups in PostgreSQL
- New way to search PostgreSQL documentation
- “Gentlemen, this is a 🏈”: Glyphs, Encodings, Collations, and Locales
- Speaking in Tongues: PostgreSQL and Character Encodings
- An Interlude: Why you do not want to use C/POSIX locale.
- PostgreSQL Hacking Workshop - November 2024
- PostgreSQL Conference Europe 2024
- PGConf & Extension Ecosystem Summit EU 2024
- Pure CYBERTEC power in Athens!
- 2024.pgconf.eu – Bigger than ever, the family is growing fast
- Recap PostgreSQL Europe Conference 2024
- Open source community insights at PGConf.EU 2024
- Request-reply in Postgres
- Transitioning from Oracle to PostgreSQL: PL/SQL vs PL/pgSQL
- Prague PostgreSQL Meetup on October 29th
- ClickHouse Partnership, improved Postgres Replication, and Disk Management
- Introducing Neon Authorize: Simplifying Row-Level Security For Postgres
- Our Approach to High Availability
YouTube Video
Podcast Audio
Transcript
When using Postgres and all of the different extensions that Postgres has, a big decision you often have to make is how much you really want to do in the database. Because there are so many extensions offering so much functionality, you can build a lot of whatever solution you're working on within Postgres itself and rely less on the application framework you're using. For example, instead of using an application framework's ORM, you could use direct SQL calls or write procedures or functions within Postgres. I wouldn't necessarily advise that, but you can also do things like making HTTP calls from Postgres as opposed to going through an application server to do it. So Postgres itself, with an HTTP extension, can reach out and contact an API. Well, the first set of content we discuss this week is actually a set of extensions that allow Postgres to keep your embeddings up to date. But I hope you, your friends, family and co-workers continue to do well.
Our first piece of content is “Vector Databases Are the Wrong Abstraction”. This is from timescale.com, and from the title, it seems to espouse doing things in Postgres, that you don't need a dedicated vector database to do these things. That's what I thought this blog post was about, but it's actually a lot more than that. They're describing something they're calling a vectorizer abstraction, basically a way for you to automatically keep your vectors or your embeddings up to date. This is a new open source tool called pg_vectorizer, and it works with the already open source pgvector and pgvectorscale. The vectorizer is a set of tools that allows you to keep embeddings up to date. So for example, you have a blogs table that you want to get embeddings for. You define the embedding model you want to use and even your level of chunking, and this tool will go out to the OpenAI API and keep your embeddings up to date. They really want you to think about this sort of like an index. In other words, you don't have to think about and maintain indexes whenever an insert, update or delete happens; they want to do the same thing for your vector embeddings, so they are always kept up to date. You can kind of think of it like a generated column. What makes this a little more difficult is the fact that you're reaching out to a third-party API to do it.

So what this tool does, and I'll scroll down to the diagram here, is you have your PostgreSQL database and your source data table, for example a blogs table. Whenever an insert, update or delete happens on this table, a trigger creates a new row in what they're calling a work queue table. Separately, there's a worker, and they demonstrated an example of it in a Docker container, so they are delivering some sort of separate application as part of this tool, but it is open source. This mini application or worker checks the work queue to see what new embeddings need to be retrieved, reads configuration information from the database, chunks the text for embedding, and then sends those text chunks to whatever service you're using, the OpenAI API in this example. It retrieves the embeddings and writes them to the embeddings table within the main database. So basically this handles the embedding maintenance for you, and whenever data like a blog post gets updated, deleted or changed, it'll automatically get new embeddings for you.

The blog post goes into a lot more depth about how to customize it, and it gives you a relatively simple API to create these vectorizers. Again, you're specifying the source table, what embeddings you want to use, how you want to chunk them, and it looks like even a degree of formatting. Now, this is something you could build yourself, but you would have to define the triggers you want, set up your queue table, and have whatever application framework you're using go through the same process of a worker doing this work. So at this stage I look at it as a fast on-ramp to getting started using AI, but because it's open source, you can also read it to see how they're doing it and develop your own solution. Because if you're doing semantic search, your application will still have to take whatever query the person is searching for and send it to OpenAI to get the embeddings back to query against the embeddings table, so you're already going to have to be doing some of that work yourself.
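To make that pattern a bit more concrete, here is a rough, simplified sketch of the trigger-plus-queue idea described above. The table, trigger and function names are all hypothetical, and the real tool does a lot more (configuration, chunking, retries); this only shows the enqueue side that a separate worker would consume.

```sql
-- Hypothetical, simplified version of the pattern described above:
-- a trigger enqueues changed rows, and a separate worker process reads
-- the queue, calls the embedding API, and writes the embeddings table.
CREATE TABLE blog_embedding_work_queue (
    blog_id   bigint NOT NULL,
    queued_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION enqueue_blog_for_embedding() RETURNS trigger AS $$
BEGIN
    -- Deletes would need their own handling (e.g. enqueue OLD.id for cleanup).
    INSERT INTO blog_embedding_work_queue (blog_id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER blog_embedding_queue
AFTER INSERT OR UPDATE ON blogs
FOR EACH ROW EXECUTE FUNCTION enqueue_blog_for_embedding();
```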
Now, maybe an enhancement to this extension would be that if you send a select query to the embeddings table, it automatically knows it has to send a query to the OpenAI API to get the embedding for that query and then compare it against the embeddings table. But that's not something that exists yet, to my knowledge. Even if you don't think this is something you'll use, I think it's a great educational tool for understanding how all of this works if you want to start building a semantic search solution, for example. And I definitely encourage you to check out the YouTube video here where he explains how everything works. It's about 10 to 15 minutes or so. So check out this blog post if you want to learn more.
Also, there are three other blog posts from Timescale describing this vectorizer solution, so you can get more details about each part if you want. The second one is “How to Automatically Create & Update Embeddings in PostgreSQL—With One SQL Query”. The next post is “Which OpenAI Embedding Model Is Best for Your RAG App With Pgvector?”, and the last one is “Which RAG Chunking and Formatting Strategy Is Best for Your App With Pgvector”. So if you're interested in any of this, feel free to check out these blog posts.
Next piece of content: “4 Ways to Create Date Bins in Postgres: interval, date_trunc, extract, and to_char”. This is from crunchydata.com. The first way to bin is basically within your where clause, and he's using an interval. So he says, from this point in time, this timestamp, subtract 90 days; basically you get the last 90 days. Another variation uses current_date, which gives you the start of the current day, and then subtracts an interval of 90 days from that. He also shows you can use intervals for binning with case statements, so he's using a case statement to define a 30-to-60-days-ago bucket and a 60-to-90-days-ago bucket, and with this query you can see what the results look like. The next thing he looks at is the function date_trunc, and this is the easiest way to do date binning because you just specify which component of the date you want. So you can do it by month with whatever date time you're comparing, or by week or year; it's your choice how you want to pull that out. It basically gives you the timestamp that's the start of that week, month or year. So with the start of the month, it will always be the first of the month. You can also extract certain components from dates. Here he's extracting an epoch to be able to send date values to JavaScript, or you can pull out the hour or the day of the week or whatever you want, and it returns a numerical representation of what you're pulling out of the date time. So you can use that to bin data in Postgres. And lastly, he describes to_char, basically converting the date time into a particular string of characters that you can then choose to bin or manipulate as you want. So if you want to learn more about this, definitely check out this blog post.
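Here's a quick illustration of the four approaches. The events table and its created_at column are made up for the example; the functions themselves are standard Postgres.

```sql
-- 1. interval arithmetic in a WHERE clause (last 90 days)
SELECT count(*)
FROM events
WHERE created_at > now() - interval '90 days';

-- 2. date_trunc: group everything into monthly buckets
SELECT date_trunc('month', created_at) AS month, count(*)
FROM events
GROUP BY 1 ORDER BY 1;

-- 3. extract: pull a numeric component, e.g. day of week (0 = Sunday)
SELECT extract(dow FROM created_at) AS day_of_week, count(*)
FROM events
GROUP BY 1 ORDER BY 1;

-- 4. to_char: format the timestamp as a string bucket
SELECT to_char(created_at, 'YYYY-MM') AS month, count(*)
FROM events
GROUP BY 1 ORDER BY 1;
```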
Next piece of content: “Case study: optimization of weirdly picked bad plan”. This is from depesz.com, and this was interesting. He did say this issue cropped up on PostgreSQL 14, so not the most recent version, but they had a query looking for a max with a where clause that executed in a couple of milliseconds on the primary server. But on the replicas they were running it on, it got killed after three minutes. And unfortunately, throughout this blog post they weren't able to explain what the problem was. They tried running an analyze to see if that might change things, but it really didn't, because all the stats were the same on the primary as on the replica. The only thought he had is that maybe it had something to do with different configuration, because the replica server was actually larger than the primary server; you can see the memory size differences, and therefore they had altered shared_buffers and effective_cache_size. But if you happen to have this problem, how can you potentially work around it? What was happening is that the primary server was using the columns in the where clause to filter and was choosing an index based on that, but the replica was doing a scan trying to find the max. So what he did to essentially hint or force Postgres was to use a materialized CTE: first find the data using the where clause, then pull the max value out of it. With this implementation, both the primary and the replica returned relatively quickly. The other alternative is that he created a separate dedicated index that was more efficient at pulling out the data, which resulted in very fast query times. So these are a couple of solutions to speed things up when you're getting slow query responses like this.
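As a rough sketch of that CTE workaround, with made-up table and column names, the idea is to force the filter to happen first so the planner can't flip to walking an index looking for the max:

```sql
-- Hypothetical names; the point is the MATERIALIZED keyword, which keeps
-- the planner from folding the CTE back into the outer query.
WITH filtered AS MATERIALIZED (
    SELECT some_value
    FROM some_table
    WHERE account_id = 123
      AND status = 'active'
)
SELECT max(some_value) AS max_value
FROM filtered;
```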
Next piece of content: there was another episode of Postgres FM last week. This one was on “Advanced psql”. First off, they talked about why people should learn psql, and it's basically because it is the main Postgres client maintained by the Postgres core team, so you should expect it to be pretty accurate and reliable with regard to interfacing with Postgres. The other thing is that it's pretty much ubiquitous; typically when you install Postgres, it's installed alongside it, so you'll find it on just about any machine. The thing I like about it, being a consultant, is that I can use it on a remote server if I have to, like some sort of bastion host used to access a database system, or when I'm granted credentials to the database system itself and can use a psql prompt there. Those are some of the reasons I like it. The other thing I like about psql is that if you're going to be automating any jobs with Postgres, a lot of the time you are going to be running them through psql, so there's an advantage to mirroring what you do interactively with what you're going to be putting into some sort of batch job. They talked a whole lot about the different features of psql: different ways to use the watch command, the many backslash commands, different ways you can format and page through data, how you can use set commands, and even conditional logic. So if you're going to be working frequently with Postgres and you're not 100% up on psql, I definitely encourage you to spend some time learning it, and this episode is a great way to do that. You can listen to the episode here or watch the YouTube video down here.
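As a small taste of the kinds of features they covered, here are a few meta-commands you can try in a stock psql session (the variable name and the 5-second interval are just examples):

```sql
-- Expanded output for wide rows, plus per-query timing:
\x auto
\timing on

-- Client-side variables and conditional logic:
\set check_connections true
\if :check_connections
    SELECT count(*) AS connections FROM pg_stat_activity;
\endif

-- Re-run the most recent query every 5 seconds until interrupted:
\watch 5
```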
Next piece of content: “Using CTID Based Pagination for Data Cleanups in PostgreSQL”. This is from shayon.dev. He was trying to delete some data from a table, and he was using an IN clause to limit it to a batch of 10,000 rows where the created date was less than a given date. But the problem is it kept timing out after 10 minutes. So instead of that technique, he wanted to use the ctid, which is a unique identifier for each row giving the page number and row number of where the data is physically located. So he actually wanted to delete the data physically, page by page. Here's the select clause: he's still querying by created_at, but he's also restricting it to a ctid range and then deleting that data from the table. He says the benefit of this technique is that these lookups are super quick and don't put a lot of strain on the system like the previous query, but he also says it's not an ideal solution. I don't know if I would personally use this technique. Usually what I do if I need to delete rows and it takes too long is either limit how many rows I'm deleting at a time or create an index, even a temporary one, to identify the rows to delete. That way at least the select component is super fast, and then the deletes take however long they take. The other thing I'd probably do within the session is set a lock_timeout, so that if the delete statement waits too long on a lock it will cancel itself, to help protect your system when deleting so many rows at a time. But if you want to learn more about this, you can definitely check out this blog post.
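For reference, here is a rough sketch of what a ctid-range batch delete looks like, with made-up table and column names; comparing ctid values like this relies on the tid range scan support added in Postgres 14, and the lock_timeout is the session-level guard mentioned above.

```sql
-- Hypothetical names; delete one physical page range at a time.
SET lock_timeout = '5s';

DELETE FROM events
WHERE ctid IN (
    SELECT ctid
    FROM events
    WHERE created_at < '2024-01-01'
      AND ctid >= '(0,1)'::tid
      AND ctid <  '(5000,1)'::tid
);
```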
Next piece of content: “New way to search PostgreSQL documentation”. This is from depesz.com. He created a tool he's calling pgdoc.link, and it's an actual website. Basically this was born out of his frustration trying to find the right page in the documentation. So he has a search tool where you put in a Postgres feature or function, or maybe a system view, and it brings you right to the page you queried about. It doesn't show you a Google-style search where you have to choose an option; it takes you right to the specific page. In addition, you can use direct links. For example, if you wanted to go to max_connections, you could just go to pgdoc.link and specify what you want. You can even go to a specific version of the documentation by putting the version number in front of what you're searching for. So this is pretty interesting, and I definitely encourage you to go to pgdoc.link and give it a try.
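Based purely on that description, the link patterns would look something like this; I haven't verified the exact paths, so treat them as illustrative:

```
pgdoc.link/max_connections       jumps straight to the docs page for max_connections
pgdoc.link/16/max_connections    the same lookup pinned to the version 16 documentation
```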
Next piece of content: ““Gentlemen, this is a 🏈”: Glyphs, Encodings, Collations, and Locales”. This is from thebuild.com, and it's a discussion about exactly what it says: glyphs, character encodings, collations and locales, all the pain typically associated with them, and some recommendations. I'm not going to go into this in depth, of course, because it is long, and he actually has two other blog posts related to it.
The second blog post is “Speaking in Tongues: PostgreSQL and Character Encodings” where he goes into depth about that.
And then the last one is “An Interlude: Why you do not want to use C/POSIX locale”, where he definitely endorses using UTF8. So check out these posts if you want to learn more.
Next piece of content: “PostgreSQL Hacking Workshop - November 2024”. This is from rhaas.blogspot.com, and this is the upcoming hacking-on-Postgres workshop. They're going to be discussing a talk given by Andy Pavlo from his Intro to Database Systems course at CMU. He mentioned this is not Postgres-specific, but a lot of the topics are super important for Postgres's design and for understanding its architecture. If you're wanting to join, you have to fill out this form in order to do so.
And the last piece of content: last week was the PostgreSQL Conference Europe 2024, and a number of posts were written discussing people's experiences, approximately six different blog posts. So if you want to learn more about that, you can click the link in the description below and go to scalingpostgres.com, where you can find all of those links.