PG Edge Cache, Postgres & OpenAI, citext to Collations, Compression Options | Scaling Postgres 254
In this episode of Scaling Postgres, we discuss PG Edge Cache, using Postgres & OpenAI, migrating citext to case-insensitive collations and PG16 compression options.
Content Discussed
- PostgreSQL Edge Cache (PGEC)
- Cache consistency with logical streaming replication
- Consistent caching with PostgreSQL logical replication and a Redis API
- What's Postgres Got To Do With AI?
- How to migrate from Django’s PostgreSQL CI Fields to use a case-insensitive collation
- More compression options for pg_dump in PostgreSQL 16
- Unlogged tables in Postgres: Is their data truly gone after a crash?
- PostgreSQL Basics: Object Ownership and Default Privileges
- Store Trees As Materialized Paths
- PostgreSQL: Speeding up recursive queries and hierarchic data
- PostgreSQL: ltree vs. WITH RECURSIVE
- Announcing PgManage 1.0a
- PostgreSQL Patch of Interest (to me): Transaction Timeout
- ER diagrams with SQL and Mermaid
- [YouTube] Postgres Architecture Explained
- [YouTube] Postgres System Columns Explained (ctid, xmin, xmax)
- Postgres Raster Query Basics
- JSON
- Frits Hoogland
- Ruby IDE Showdown With Adrian Marin, CJ Avilla & Eric Berry | Rubber Duck Dev Show 75
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Cache consistency with logical streaming replication". This is from shortishly.com, and it's talking about a new tool, the PostgreSQL Edge Cache. It's a utility that maintains an in-memory cache, and what feeds it is logical replication from a PostgreSQL server. It offers three different APIs. As you can see here, it offers a memcached API, so you can talk to it as if it's a memcached server, as well as a REST API and a Redis API. And what makes this particularly interesting to me is that it's also primarily written in Erlang, so it should be a very stable application. Now, this first blog post covers this tool.
It explains how it avoids lazy loading. Typically, with a caching layer, once something in the cache expires, the request has to go through to the ultimate destination, the database, to pull back fresh data to cache, so you have to worry about cache expirations. But this tool is constantly being updated from the logical replication stream, so it's essentially doing something like change data capture. This is an in-memory cache; nothing is preserved on disk. And he shows an example of setting it up and querying it here.
The next blog post covers much of the same ground but focuses on the Redis API, and it mentions that the tool also supports PostgreSQL 15's column lists and row filters, so you can filter down to exactly the data you want to publish to logical replication subscribers. So instead of building a change data capture solution, potentially using logical decoding, maybe you could just use logical replication and this tool to set up an in-memory cache that gets updated in near real-time. Ultimately, we'll see how much interest this tool attracts, but I'm really interested in seeing where it goes. And if you want to learn more, I encourage you to check this out.
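Just to give a flavor of those PostgreSQL 15 features, here's a minimal sketch of a publication using a column list and a row filter; the table and column names are made up for illustration:

```sql
-- Publish only selected columns and rows to logical replication
-- subscribers, such as a cache fed by the replication stream.
CREATE PUBLICATION cache_feed
    FOR TABLE products (id, name, price)  -- column list (PG15+)
    WHERE (discontinued = false);         -- row filter (PG15+)
```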
Next piece of content- "What's Postgres Got To Do With AI". This is from crunchydata.com and they're talking about using Postgres along with the OpenAI API. It discusses the process flow here. You have some data you want to classify and in this example, they're using recipes. So they have a whole recipe database they want to pass to the OpenAI API and receive back tokenized attributes which are called embeddings. That's when you're able to compare to see correlations between the different items within the data and you take these tokenized attributes or embeddings and store them in Postgres.
So to get set up to do this, you first need to install an extension called pgvector, which is actually a relatively new extension, but it helps you deal with the embeddings you're going to be getting back from the OpenAI API. Now for sample data, they loaded up the ArmedForcesRecipes.xml file, and they have some Ruby code in here that extracted the recipe data, inserted it into this database table here, and queried OpenAI to get back those vector embeddings. They inserted them into the database, and they kind of look like this. Then they used the following queries to do similarity searches. So basically they had a pizza recipe and wanted to see what recipes are similar to pizza. So they use this query to find the pizza recipe and say, all right, what other recipes are close to that based upon the embeddings? And then they got back a number of different recipes here.
Basically, three different pizza ones and chicken parmesan. But they said, well, this is too much pizza, so we're going to add a NOT LIKE pizza condition. So basically, if pizza is in the name, they don't want that recipe. They ran it again, and now they have a little bit more variation. There's lasagna, eggplant parmesan, and things of that nature. You can also query for the opposite: what is the furthest distance from something? They asked what is the most different from a corn dog, and they basically got salads, so that pretty much makes sense. This is a pretty brief blog post, but it shows you how you can use the OpenAI API to do comparisons between different items to see what's close to or far away from one another. If you want to learn more about how to do that, definitely check out this blog post.
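To give a rough idea of what these similarity queries look like, here's a minimal, hypothetical sketch using pgvector's distance operator; the table, column, and recipe names are assumptions, not the post's exact code:

```sql
-- pgvector provides a vector type and distance operators.
CREATE EXTENSION IF NOT EXISTS vector;

-- OpenAI's text-embedding-ada-002 model returns 1536-dimensional vectors.
CREATE TABLE recipes (
    id        bigserial PRIMARY KEY,
    name      text NOT NULL,
    embedding vector(1536)
);

-- Five recipes closest to the pizza recipe by Euclidean distance (<->),
-- excluding anything with pizza in the name.
SELECT name
FROM recipes
WHERE name NOT ILIKE '%pizza%'
ORDER BY embedding <-> (SELECT embedding FROM recipes WHERE name = 'Pizza')
LIMIT 5;
```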
Next piece of content- "How to migrate from Django's PostgreSQL CI Fields to use a case-insensitive coalition". This is from adamj.eu. Now this is Django specific but a lot of it covers PostgreSQL. Because seemingly with the new version of Django 4.2 using case-insensitive text fields, CI fields, or CITextFields are deprecated. So if you're using these, he says okay well, what's the plan? Basically, you need to move to case-insensitive coalitions of these text fields and he goes through how to do that. Now this is a very detailed blog post that explains a lot about what to do in terms of Postgres but also Django as well. So if you are using CITextFields and you want to eventually move to case-insensitive collations, you may want to check out this blog post.
Next piece of content- "More compression options for pg_dump in PostgreSQL 16". This is from dbi-services.com. He's discussing that pg_dump will allow LZ4 compression in addition to gzip compression. Now he didn't mention zstd, which I know they've supported in other realms of compression. But that would be the one I would be more interested in than looking at. But in this example, he compared gzip and LZ4 and basically he found that gzip gives you better compression, but LZ4 gives you faster compressions. So if you're using pg_dump if you want faster backups probably use LZ4. If you want the smallest size possible, you'll want to use gzip But check out this blog post if you want to learn more.
Next piece of content- "Unlogged tables in Postgres: Is their data truly gone after a crash?" This is from pganalyze.com. This is the "Five minutes of Postgres" episode and we covered this blog post in last week's episode of Scaling Postgres. So this is Lukas's review on it and he also talks about hacker news where there was a discussion on it. And how if you don't actually start the database, the data will still be there because it's actually the startup of the PostgreSQL server that truncates the table. But someone else said yeah but you could have data in memory that is lost because of the crash. Again, there's no crash recovery for these tables. So you don't know what data's there, what or not if it's corrupt or not. Maybe you could retrieve some but not much and basically it goes back to if you care about your data, don't use unlogged tables. But if you want to learn more, you can check out this piece of content.
Next piece of content- "PostgreSQL Basics: Object Ownership and Default Privileges". This is from red-gate.com. This is a very comprehensive overview of these security considerations. So if you're interested in learning more about Postgres object ownership and how the default privileges work, you can definitely check out this blog post.
Next piece of content- "Store Trees As Materialized Paths". This is from sqlfordevs.com. He's talking about hierarchical trees. So in the example here, he shows you have food and you have fruits that are part of that food. And you have cherries and bananas that are part of the fruits that are part of the food. So basically, for this type of tree structure, you should consider storing it materialized like this. The other way to store it is you store the parent as food. And then you create another record that says fruit is apparent of food, it will say that fruit is a child of food. And a cherry is a child of the parent fruit. But he says for performance reasons, you're probably going to want to store it materialized in this fashion. He discusses some reasons why; he shows you how you can get all the children of a particular tree leaf as well as how to get the parents of a particular tree leaf. His blog post does say some of the advantages and disadvantages of this. But what he doesn't talk about is actually using a materialized view.
Now there are two blog posts that cybertec-postgresql.com did on the subject. The first is "POSTGRESQL: SPEEDING UP RECURSIVE QUERIES AND HIERARCHIC DATA". Here they propose storing the data in a parent-child relationship, but then using a materialized view to construct those tree paths so it looks like this. That way, it's relatively easy to update the data, and the materialized view handles re-rendering the paths. They even did some optimization to get it to perform well on queries.
The second post that covers this is "POSTGRESQL: LTREE VS. WITH RECURSIVE". This digs more into the performance side and how to handle it. So if you need to deal with some hierarchical data, I would definitely encourage you to check out the original blog post as well as these two others.
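For comparison, the parent-child model they start from is queried with WITH RECURSIVE, roughly like this (table and column names assumed):

```sql
-- Parent-child storage: each row points at its parent.
CREATE TABLE categories (
    id        bigserial PRIMARY KEY,
    parent_id bigint REFERENCES categories (id),
    name      text NOT NULL
);

-- Walk down from 'food', collecting every descendant.
WITH RECURSIVE tree AS (
    SELECT id, name FROM categories WHERE name = 'food'
    UNION ALL
    SELECT c.id, c.name
    FROM categories c
    JOIN tree t ON c.parent_id = t.id
)
SELECT * FROM tree;
```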
Next piece of content- "Announcing PgManage 1.0a". This is from commandprompt.com. What's happened is the OmniDB project was abandoned, but the command prompt has taken it over and renamed it PgManage. This enables you to more easily manage multiple PostgreSQL servers from a single interface. They highlighted the major changes that they've done to OmniDB as well as major bug fixes, also UI improvements. So if you're looking for a tool like this, definitely check out this blog post.
Next piece of content- "PostgreSQL Patch of Interest (to me): Transaction Timeout". This is from softwareandbooze.com. He's highlighting a patch that calls for a transaction timeout. So we have statement timeouts and we have idle transaction timeouts or session timeouts. Well, this is addressing a transaction timeout. And they said this can be beneficial because maybe you have a long open transaction and statement timeouts aren't going to reflect that. But this could be beneficial. Now, this is ready for a committer so nothing's happening yet, but if you want to learn more about this patch, you can check out this blog post.
Next piece of content- "ER DIAGRAMS WITH SQL AND MERMAID". This is from cybertec-postgresql.com. This blog post is about using quote "...Mermaid is a JavaScript based diagramming and charting tool that renders Markdown-inspired text definitions to create and modify diagrams dynamically". So basically what you see here is being rendered by Mermaid and loaded with a Pagila example database. They generated the script where they explained the reasons that they've done it the way they've done it and it generated this ER diagram. So if you're interested in generating something like this, you can definitely check out this blog post.
Next piece of content is actually a YouTube video, and it's "Postgres Architecture Explained" from Hussein Nasser. This predominantly covers Postgres from the perspective of its different processes. You can check out this YouTube video if you want.
Next piece of content is also a YouTube video and it's "Postgres System Columns Explained". He talks about ctid, xmin and xmax. So you can definitely watch both of these videos if you want more video-based content.
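If you want to poke at these yourself, the system columns don't show up in SELECT *, but you can ask for them explicitly:

```sql
-- ctid is the row's physical location (page, item number); xmin and
-- xmax are the transaction IDs that created and deleted the row version.
SELECT ctid, xmin, xmax, relname
FROM pg_class
LIMIT 5;
```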
Next piece of content- "Postgres Raster Query Basics". This is from crunchydata.com. Now this was a good post but it went a little bit over my head given I don't use Postgres for spatial queries and analysis. But it looked like a really good introduction if I ever needed to get into understanding rasters and using them. So you can definitely check out this blog post if you need to do that.
Next piece of content. There was another episode of Postgres FM this week. This one was on "JSON". So if you want to determine whether or when you should use JSON in your Postgres database, you can definitely check out this episode or watch the YouTube video.
Next piece of content, the PostgreSQL person of the week is Frits Hoogland. If you want to learn more about Frits and his contributions to Postgres, definitely check out this blog post.
And the last piece of content: we did have another episode of The Rubber Duck Dev Show this past Thursday afternoon. This one was "Ruby IDE Showdown With Adrian Marin, CJ Avilla & Eric Berry". We compare and contrast different IDEs such as Sublime Text, VSCode, Vim, and even RubyMine. So if you're a developer looking for a comparison of IDEs, we definitely welcome you to check out our show.