Postgres Release Warning! | Scaling Postgres 343
In this episode of Scaling Postgres, we discuss new Postgres releases and an out-of-band release to address issues, increasing insert performance by 2X, pg_search benchmarks and fast OpenStreetMap loading.
Content Discussed
- PostgreSQL 17.1, 16.5, 15.9, 14.14, 13.17, and 12.21 Released!
- A change to ResultRelInfo - A Near Miss with Postgres 17.1
- Out-of-cycle release scheduled for November 21, 2024
- Boosting Postgres INSERT Performance by 2x With UNNEST
- Benchmarking ParadeDB's pg_search: Postgres as your search engine
- Loading the World! OpenStreetMap Import In Under 4 Hours
- Append-only tables
- Easy Totals and Subtotals in Postgres with Rollup and Cube
- Farm to TABLE: Local(e) Providers
- The Doom That Came To PostgreSQL: When Collations Change
- Grouping data into array of sums – fun with custom aggregates
- PostgreSQL Hacking Workshop - December 2024
- Stop Building Data Pipelines: Cross-Database Queries With PostgreSQL Foreign-Data Wrappers
- Postgres on Spot VMs - only for the crazy?
- dbicdump: using PostgreSQL schemas as package separator in produced Perl classes
- What Happens Behind the Scenes When You Modify a Row in PostgreSQL?
- Combining Semantic Search and Full-Text Search in PostgreSQL (With Cohere, Pgvector, and Pgai)
- Parsing All the Data With Open-Source Tools: Unstructured and Pgai
- Scaling Document Data Extraction With LLMs & Vector Databases
- PostgreSQL 17 Unveiled: Key highlights in data backup, data warehousing and more
- Ayse Bilge Ince
- Contributions for the week of 2024-11-11 (Week 46 overview)
- PostgreSQL 18: part 1 or CommitFest 2024-07
- PGConf.EU 2024 Lightning Talks
- Recap of PGConf.EU 2024 in Athens
- Prague PostgreSQL Meetup on November 18th
- EDB at PGDay MED 2024: Sharing Postgres Expertise with the Open Source Community
- On CVEs, BoMs, and Releases
- Crunchy Data Warehouse: Postgres with Iceberg for High Performance Analytics
- Understanding DiskANN
- Boost Your PostgreSQL Performance With Auto Chunk Size Recommendations
- Scale PostgreSQL via Partitioning: A Dev’s Intro to Hypertables
- How WaterBridge Uses TimescaleDB for Real-Time Data Consistency
- Is Postgres RLS for Everything and Everyone?
YouTube Video
Podcast Audio
Transcript
The typical advice for upgrading Postgres is that if you're doing a major version upgrade, you need to take your time, do testing and validation, and run through a very rigorous process to upgrade the database to the next version. In contrast, when you need to do a minor version upgrade, it's typically much easier: you just have to update the binaries and restart the system, and generally everything works as expected. However, with the most recent releases that we'll be talking about this week, an issue was actually discovered with regard to extensions when simply upgrading to the next minor release, so we'll definitely be talking about that. But I hope you, your friends, family and co-workers continue to do well.
Our first piece of content is “PostgreSQL 17.1, 16.5, 15.9, 14.14, 13.17, and 12.21 Released!”. This is from postgresql.org and the first thing they're making note of here is that Postgres 12 is now officially end-of-life, so they're not going to be doing any more releases for it, although we'll see about one that may be coming up. This particular set of releases addressed four security vulnerabilities, which is higher than we normally see, as well as fixing 35 different bugs. In terms of the security issues, the CVSS base scores were 4.2, 3.1, 4.2, and 8.8 for one of them. The first one has to do with row-level security: basically, incomplete tracking in PostgreSQL of tables with row security allows a reused query to view or change different rows from those intended. So this vulnerability has been patched. The next one is that libpq can retain an error message from a man-in-the-middle, so that's definitely something that should be avoided. The third one is that PostgreSQL SET ROLE and SET SESSION AUTHORIZATION can be reset to the wrong user ID, so this was resolved as well. And the final one, with the highest severity score, is that PL/Perl environment variable changes can execute arbitrary code, so that's definitely something good to get patched. Now again, even with minor releases, you should take a look at the update steps required. In this case there's one that probably affects a lot of people using partitioned tables: if you have foreign key constraints on partitions where you've run ATTACH or DETACH PARTITION, you need to follow the guidelines shown here to make sure that things are okay after the update. And there are also some cases where you may need to re-index certain indexes based upon the collation provider, libc in particular. So you'll have to be aware of that with regard to certain collations you may be using. But as I said, there was an issue with some extensions in these releases.
So the next piece of content is “A change to ResultRelInfo - A Near Miss with Postgres 17.1”. This is from crunchydata.com and with the release of 17.1 there was a breaking change made to the application binary interface (ABI), which is the interface that a lot of extensions use for interacting and working with Postgres. Apparently a new boolean was added to the ResultRelInfo struct that caused its size to change, which can impact some extensions. Now there's a link here, under "warning sirens", that goes to the mailing list and talks about some of the issues caused by this change, and people started testing different extensions to see which ones were affected. From Crunchy Data it looks like all of these different extensions were unaffected, but the ones they found to be affected were Apache AGE and TimescaleDB. This blog post goes into what the application binary interface is, how extensions work with it, and goes into a little bit of depth about why this particular issue became a problem for certain extensions. But you should keep in mind that it's not just moving from 17.0 to 17.1: even if you're on an earlier major version and you move up to the next minor release in that series, you could be impacted, because this change was introduced in every release, not just the 17 major version. And what they're mentioning in terms of Timescale in particular is that “Timescale is recommending that users do not perform the minor version install at this time”. Basically things are okay if you recompile the extensions against 17.1, but the problem is if you've compiled them against a previous version and then are trying to use them with the more recent one, because you're going to hit that struct size difference. But this has more details if you want to explore it.
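If you want a quick inventory of what you'd need to verify before applying the minor update, you can list the extensions installed in each database. This is just a generic catalog query of mine, not something from the Crunchy Data post:

```sql
-- List installed extensions and their versions in the current database,
-- so you can check each one's compatibility notes before a minor upgrade.
SELECT extname, extversion
FROM   pg_extension
ORDER  BY extname;
```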
But as a result of this, there was another post on postgresql.org that an out-of-cycle release is coming: “Out-of-cycle release scheduled for November 21, 2024”. Now as I'm recording this, that will be tomorrow, which is Thursday. So by the time you get this show, these new releases should already be out, so 17.2, 16.6, etc. They're incrementing each one, including one for PostgreSQL 12. So even though it's officially end-of-life, they are doing an out-of-cycle release for it as well, and specifically the changes are to address this application binary interface change as well as a regression that happened with one of the security fixes, which actually prevented ALTER USER SET ROLE from having any effect. So that's going to be included in this next set of releases. So hopefully this next set of releases will be good and no outstanding issues will be discovered. But it is still the general recommendation to upgrade to these as soon as you can, particularly if you're impacted by one of the security vulnerabilities mentioned. But of course you should always test this in a testing environment, staging environment, sandbox environment, whatever it is, just to make sure that everything works as expected with the new version of Postgres. But check out these posts if you want to learn more.
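For reference, this is the kind of statement the regression affected, namely setting a default role for a user. The role names here are hypothetical, not from the release notes:

```sql
-- Configure app_user so that its session defaults to the reporting role;
-- in the affected releases this setting silently had no effect.
ALTER USER app_user SET ROLE reporting;
```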
Next piece of content: “Boosting Postgres INSERT Performance by 2x With UNNEST”. This is from timescale.com and normally if you want fast insert performance, the fastest way to do that is to use COPY. But if you actually need to do inserts, because you need certain features with regard to them, generally you want to use multi-row inserts. So in this example you're inserting a particular set of columns into a sensors table and you insert multiple rows with a single INSERT statement. That will generally give you the best INSERT-level performance. However, not quite, because you can use UNNEST to improve performance further, and he demonstrates that in this blog post. So as opposed to sending multiple rows, what you do is you submit an array for each column to Postgres. So the timestamp column gets a timestamp array of all the values you want to insert, the sensor ID column gets a text array of everything you want to insert, and the value column gets a float array of everything you want to insert. And using this technique gets you about a 2x performance improvement based on how he did his tests. Now he tested batch sizes of 1,000, 5,000 and, I think it's supposed to be, 10,000, and he really didn't see any differences between the batch sizes, but it was pretty consistent that there was a 2x improvement inserting arrays that are unnested versus a multi-row insert. And what he did say is that the primary savings in time, or the speed, came from query planning. In other words, with the multi-row insert there was a lot more planning required for each value, whereas just saying "this column, put this array of values in it" resulted in much lower planning times. He didn't show that here, but that's what he indicates. And he said this does add a little bit of complexity, but if you're looking to squeeze the most performance out of it, this is a technique that is faster than multi-row inserts.
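To make that concrete, here's a rough sketch of the two insert shapes being compared. The table and column names are just illustrative, not taken from the post, and the $n placeholders stand for the parameters you'd bind:

```sql
-- Classic multi-row insert: one set of parameters per row.
INSERT INTO sensor_readings (ts, sensor_id, value)
VALUES ($1, $2, $3),
       ($4, $5, $6),
       ($7, $8, $9);

-- UNNEST variant: one array parameter per column, expanded into rows.
INSERT INTO sensor_readings (ts, sensor_id, value)
SELECT *
FROM   unnest($1::timestamptz[],   -- all timestamps
              $2::text[],          -- all sensor ids
              $3::float8[]);       -- all values
```

The second form keeps the parameter count fixed regardless of batch size, which is where the lower planning overhead he describes comes from.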
Next piece of content: “Benchmarking ParadeDB's pg_search: Postgres as your search engine”. This is from tembo.io. And he is basically benchmarking ParadeDB's pg_search, which is an extension that embeds an Elasticsearch-like search engine in Postgres; it's built atop Tantivy, the Elasticsearch-like search library, and I think BM25 is the index type. And what he did is he compared full-text search with a GIN index to pg_search and its BM25 index. He used a tool developed by ParadeDB to generate workloads and queries, so I don't know if that has an impact on the accuracy of the search comparison, but he generated a corpus of around 10 million rows. And he noticed that even the index build time differed: building the GIN index took five minutes, whereas building the pg_search index took less than a minute. So it's already a lot faster to do an index build. And then in terms of queries, the average search time was 1.66 seconds with this distribution, whereas with pg_search it came in at 6 milliseconds with a much narrower distribution, making it 265 times faster. So that is significant. Now, I forgot to mention this is not just finding records, this is actually doing a ranking, and the indexes for Postgres full-text search can't optimize the ranking component that well. So a lot of the performance difference may come from how the pg_search index can optimize ranking queries whereas the GIN index on the full-text search cannot. So that may be most of the difference. But then he looked at 100 million rows, and the ranking query he used took 14 seconds to run, whereas pg_search took 28 milliseconds, so 500 times the performance of Postgres. So definitely, if you need to list search results by rank or relevance as they're describing here, it looks like the BM25 indexes that pg_search uses are definitely superior to GIN indexes and full-text search. So if you're looking for better search performance, maybe you would want to check this out.
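For context, this is roughly what the Postgres-native side of that comparison looks like: a GIN index over a tsvector plus a ranked query. I'm not showing the pg_search syntax since I'd only be guessing at its API, and the table and column names here are hypothetical:

```sql
-- Full-text search baseline: GIN index plus a ranked query.
CREATE INDEX ON documents USING gin (to_tsvector('english', body));

SELECT id,
       ts_rank(to_tsvector('english', body), query) AS rank
FROM   documents,
       to_tsquery('english', 'postgres & search') AS query
WHERE  to_tsvector('english', body) @@ query
ORDER  BY rank DESC
LIMIT  10;
```

The WHERE clause can use the GIN index, but the ts_rank in the ORDER BY has to be computed per matching row, which is the part a BM25-style index can optimize.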
Next piece of content: “Loading the World! OpenStreetMap Import In Under 4 Hours”. This is from crunchydata.com and the OpenStreetMap database, which I think covers the whole world, has almost a terabyte, 750 gigabytes, of location data, and he says it notoriously takes a full day to run. That's loading all the data and applying all the indexes necessary. And I think he's benchmarked this in the past, but he did a recent benchmark using Postgres 17, the most up-to-date hardware, as well as the most up-to-date osm2pgsql loader. Compared to 2022, when it loaded in just under eight hours with Postgres 14, with the new hardware and all new versions of the software including Postgres 17, it went down to four hours, so double the performance. He said a lot of it was due to the GiST index building improvements that took place from PG 14 to PG 15, as he shows here, but he said there were also hardware improvements and other things as well. So if you want to learn more about this, definitely check out this blog post.
Next piece of content: there was another episode of Postgres FM last week. This one was on “Append-only tables”. So Nikolai and Michael talked about how to work with append-only tables in your database. These could be event-based tables that are capturing events or sensor data, or even just audit logs of sorts. Basically you're not going to be updating or deleting that data in one-off cases; you're constantly collecting it. Maybe at some point you will archive it, but generally you're just collecting a bunch of inserted data. Now of course one advantage of this is that you don't have to really worry about bloat, because you're not updating anything and you're not deleting anything, so that's a benefit. But you still need to run vacuum to keep statistics up to date, as well as the visibility map and other features of Postgres, so you do still need to optimize vacuum for append-only tables. And of course another great feature to use with Postgres is partitioning, because that enables you to have just one active partition at any one time, generally the most recent one, and all the other data can just be kept in essentially a cold state. You don't need to change those tables a lot, and then when the time comes to archive the data, you can easily just detach those partitions or drop those tables. They talked about how BRIN indexes could be a fabulous fit for these types of append-only tables because there's no updating and deleting going on. Because a BRIN is a block range index, it indexes ranges, and the problem comes if you're updating and deleting data: data can fall outside those ranges and you can get really poor performance. An append-only table is usually where you see BRIN work best. And Michael mentioned something he didn't take credit for (he said it was probably from somewhere else), which is that if you're going to be partitioning a table, the best way to partition it is by how you plan to delete that data. So in other words, if it's event-based data, you probably want to partition it by time to be able to archive or delete those old partitions. Or if you have a B2B SaaS company, maybe you're partitioning by customer because you eventually want to remove customers who are no longer using your service. And then finally they closed out the episode talking about how to deal with all of this data, because if it's append-only, these are probably going to be your largest tables and the data size is constantly growing. How do you deal with it? So they talked about maybe you want to move it into column storage, which can be more efficient, or compress that data, and they said TimescaleDB does some of these things for you. Or maybe you want to offload it to, say, something like S3 or some type of cold storage to keep those older partition tables accessible but not on your high-speed disks, because the queries to them are infrequent. But if you want to learn more, I definitely encourage you to listen to the episode here or watch the YouTube video down here.
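As a rough sketch of the pattern they're describing, here is a time-partitioned, append-only events table with a BRIN index, where archiving is just detaching a partition. All the names are made up for illustration:

```sql
-- An append-only events table partitioned by time.
CREATE TABLE events (
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_11 PARTITION OF events
    FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');

-- BRIN works well here because created_at only ever grows.
CREATE INDEX ON events_2024_11 USING brin (created_at);

-- Archiving later is just detaching (or dropping) the old partition.
ALTER TABLE events DETACH PARTITION events_2024_11;
```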
Next piece of content: “Easy Totals and Subtotals in Postgres with Rollup and Cube”. This is from crunchydata.com and they're imagining you have some data that you're grouping and ordering by a given month and a particular category, and it would look something like this. But what if you wanted to add subtotals to it? Well, there's the complicated way that they show here, doing a UNION ALL to calculate a total, but you can just use ROLLUP. So you just add ROLLUP after the GROUP BY and it will give you essentially a subtotal for each month. Similarly, if you want to show subtotals and a grand total of all the data, you can use CUBE. So you just put CUBE after the GROUP BY and now it will show you the subtotals by month as well as the totals for each of the categories and then finally the grand total, and they show a little representative diagram of that here. And then finally they say, if you're not satisfied with NULL in these spaces, you can actually rename it to something using COALESCE. So here they renamed it to a grand total and then subtotals for the categories. But check out this post if you want to learn more.
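Here's a minimal sketch of the two variants, using a hypothetical sales table rather than the post's exact data:

```sql
-- ROLLUP: per-month subtotals plus a grand total row,
-- with COALESCE relabeling the NULL category in the subtotal rows.
SELECT month,
       COALESCE(category, 'Subtotal') AS category,
       sum(amount)                    AS total
FROM   sales
GROUP  BY ROLLUP (month, category);

-- CUBE: subtotals by month, subtotals by category, and a grand total.
SELECT month, category, sum(amount) AS total
FROM   sales
GROUP  BY CUBE (month, category);
```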
Next piece of content: “Farm to TABLE: Local(e) Providers”. This is from thebuild.com and this is the next post in a series about locales and collations. In this one he's talking about the different locale providers, namely libc, which has been the default implementation for Postgres forever; the ICU collation provider, which breaks the dependency on the underlying operating system; and finally the new built-in collation provider that comes with Postgres 17, which I'm super happy about because it should be much more stable, although it only offers a C locale and a C.UTF-8 locale, although that is the one I would probably be using. And then if you need language-based sorts on top of that, you could add specific indexes with those types of sorts or collations.
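As a sketch of what that combination might look like on Postgres 17: a database on the built-in C.UTF-8 locale, with an ICU collation layered on for a language-aware index. The database, table and collation names are my own, not from the post:

```sql
-- A Postgres 17 database using the built-in locale provider.
CREATE DATABASE appdb
    TEMPLATE        template0
    ENCODING        'UTF8'
    LOCALE_PROVIDER builtin
    BUILTIN_LOCALE  'C.UTF-8';

-- Language-aware sorting only where it's needed, via an ICU collation.
CREATE COLLATION german (provider = icu, locale = 'de-DE');
CREATE INDEX ON customers (name COLLATE german);
```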
And then he did a follow-up post called “The Doom That Came To PostgreSQL: When Collations Change”. Here he's talking about the change that happened in, I think, a version of Ubuntu, where the glibc version changed and basically required re-indexing all of your text-based indexes. So it's definitely something to be aware of when you're doing upgrades: what locale provider are you using, and what changes are going to be necessary when you do an upgrade?
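On Postgres 15 and up you can at least detect when a database's recorded collation version no longer matches what the operating system now provides, with something roughly like this (the database name is hypothetical):

```sql
-- Find databases whose recorded collation version differs from the
-- version the operating system currently reports.
SELECT datname,
       datcollversion,
       pg_database_collation_actual_version(oid) AS actual_version
FROM   pg_database
WHERE  datcollversion IS DISTINCT FROM
       pg_database_collation_actual_version(oid);

-- After rebuilding affected text indexes (connected to that database),
-- record the new collation version.
REINDEX DATABASE appdb;
ALTER DATABASE appdb REFRESH COLLATION VERSION;
```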
Next piece of content: “Grouping data into array of sums – fun with custom aggregates”. This is from depesz.com. This is a little bit of a mind-bender, but he wanted to store data more efficiently. So as opposed to storing counts individually, he wanted to store them in arrays where each array has 24 elements, so one aggregated sensor count per hour. He goes through a way to fabricate that data in order to test his implementation, and he created a custom aggregate to do this. So as opposed to just using sum, he created a "sum per hour" aggregate that you give a particular set of inputs. And he designed functions to be able to store and load this data, to the point where, although he didn't show it, he said you can now use MERGE to add to those aggregates as new sets of data come in. So if you're interested in something like this, definitely check out this blog post.
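His actual implementation is more involved, but a stripped-down version of the idea, an aggregate that accumulates bigint sums into a 24-element hourly array, could look something like this. The function and table names are mine, not his:

```sql
-- State transition function: add val into the slot for the given hour.
CREATE FUNCTION sum_per_hour_state(state bigint[], hour int, val bigint)
RETURNS bigint[] LANGUAGE plpgsql IMMUTABLE AS $$
BEGIN
    -- Postgres arrays are 1-based, so hour 0 lands in element 1.
    state[hour + 1] := state[hour + 1] + val;
    RETURN state;
END;
$$;

CREATE AGGREGATE sum_per_hour(int, bigint) (
    SFUNC    = sum_per_hour_state,
    STYPE    = bigint[],
    INITCOND = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);

-- Usage: one row per sensor per day, with a 24-element array of hourly sums.
SELECT sensor_id,
       ts::date AS day,
       sum_per_hour(extract(hour FROM ts)::int, reading) AS hourly_sums
FROM   sensor_counts
GROUP  BY sensor_id, ts::date;
```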
And the last piece of content: ”PostgreSQL Hacking Workshop - December 2024”. This is from rhaas.blogspot.com and the next hacking workshop is coming up, where they're going to be discussing a talk by Melanie Plageman, "Intro to the Postgres Planner". So for those of you who hack on Postgres, you may want to check this out to get a better understanding of how the planner works so you could potentially add enhancements to it. So check out this blog post if you want to learn more.