Broken Indexes, Trademark Issues, Percentile vs. Average, Logical Improvements | Scaling Postgres 183
In this episode of Scaling Postgres, we discuss one cause of broken indexes, Postgres trademark issues, percentiles vs. averages and logical replication improvements.
Content Discussed
- Broken indexes after Linux upgrade
- Trademark Actions Against the PostgreSQL Community
- Postgres Core Team launches unprecedented attack against the Postgres Community
- Respecting the majority, questioning the status quo as a minority
- How percentile approximation works (and why it's more useful than averages)
- Logical Replication Improvements In PostgreSQL-14
- PostgreSQL: Create indexes after bulk loading
- A quick test for postgres_fdw batch insertion
- PostgreSQL schedulers: comparison table
- Best PostgreSQL GUIs in 2021 (Updated)
- pglog, a flexible and simple log analyzer
- Tuple Duplicates Support in Incremental View Maintenance
- What’s new in the Citus 10.2 extension to Postgres
- Soumyadeep Chakraborty
- Rubber Duck Dev Show Episode 13 | REST vs. GraphQL APIs
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Broken indexes after Linux upgrade". This is from elephanttamer.net, and he's talking about an issue where putting Postgres on a new operating system version can sometimes break your indexes. The reason is that a lot of times Postgres uses the operating system's default glibc library, and if a new glibc version changes the collations, that can break your text-based indexes. You'll run into issues such as unique indexes no longer keeping things unique, duplicates appearing, or any number of other inconsistencies. Now, he specifically says he encountered this issue when going to Debian 10, Ubuntu 18.10, or the 20.04 LTS version.
So if you ever do an upgrade where the collation changes, you're going to need to reindex any text-based columns you have to make sure that your indexes remain consistent and accurate. Now, he mentions a number of symptoms you may see, such as crashes, restarts, certain errors, or certain jobs failing, so it's just something to keep in mind. This can even happen if you bring up another replica that's on a different operating system version than the primary; you could encounter similar issues there as well. Now, the way you fix the collation issue is you do a reindex. This blog recommends using amcheck to look for invalid indexes, but I would plan ahead. If you're changing operating system versions, go ahead and do a reindex of your text fields for your database and plan the time to do it, just to be sure.
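Just as a quick sketch from me, not from the post itself, checking an index with amcheck and then rebuilding it looks roughly like this (the index and database names here are made up):

```sql
-- Load the amcheck extension and verify a btree index on a text column.
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('users_email_idx');    -- raises an error if the index is corrupt

-- Rebuild the index without blocking writes (REINDEX CONCURRENTLY requires Postgres 12+).
REINDEX INDEX CONCURRENTLY users_email_idx;

-- Or, during a planned maintenance window, rebuild everything in the current database.
REINDEX DATABASE mydb;
```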
I have actually kept my systems on Ubuntu 18.04 because I'm waiting for Postgres 14, which can do reindexes of partitioned tables. I'm waiting for that enhancement before I upgrade my operating system version because I didn't want to run into this issue. Now, another way to sidestep this problem is to use ICU collations, so you're not relying on the glibc version that gets upgraded with the operating system. ICU collations give you a separate library that you can upgrade independently of the operating system. They talk a little bit about that in the hyperlinked post here by Citus, and we did cover that post back when it came out. But it is a good review of this issue and gives you the option of using an ICU collation. So you can go ahead and check out this post if you want to learn more.
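For context, creating and using an ICU collation looks something like this; the collation and table names are just an illustration, not from either post:

```sql
-- Define a collation backed by the ICU library instead of glibc.
CREATE COLLATION en_icu (provider = icu, locale = 'en-US');

-- Use it on a text column so sorting and indexing no longer depend on the OS glibc version.
CREATE TABLE customers (
    name text COLLATE "en_icu"
);
```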
The next piece of content is "Trademark Actions Against the PostgreSQL Community". This is from postgresql.org, and they're talking about an issue where there have been some trademark applications made by different organizations. This is separate from the PostgreSQL core team or the PostgreSQL Community Association of Canada. This goes into the area of legal and trademark issues and the Postgres and PostgreSQL name. So I'm definitely not an expert in this area, but if you use Postgres, I think it is appropriate and important for you to read this and just be familiar with the current activity on what's taking place. There are also two other posts.
The next one is "Postgres Core Team launches unprecedented attack against the Postgres Community". This is from the postgresql.fund website and this shows another side of the issue with regard to trademarks and PostgreSQL and the overall community.
Then there is a third post- "Respecting the majority, questioning the status quo as a minority", again from the postgresql.fund website, which goes into depth on this issue as well. Now, again, I'm definitely not an expert in this area, but I would encourage you to check out these posts just so you're familiar with what's currently transpiring with a database that you're probably using if you're watching this content.
Next piece of content is "How percentile approximation works (and why it's more useful than averages)". This is from blog.timescale.com. Now this is a quite significant blog post that's very, very long. The first half of it talks about the differences between medians, averages, and percentiles and how percentiles can actually be quite advantageous for tracking, say, performance numbers. Particularly, as it relates to taking an average instead or a median. Because a lot of times medians or averages may not give you the best insight into how well your systems are performing.
Percentiles are much better in this case, and you can compute them because PostgreSQL has a built-in function to do it, the percentile_disc function, where you define the given percentile you want to get information about. But of course, this is from Timescale, and they're talking about how they've also developed a specialized function that gives you an approximate percentile. So basically, it only takes a sample of the data and still gives you pretty accurate percentiles, but does it much faster. They talk a little bit about those functions and their availability within the Timescale extension for PostgreSQL. That's essentially what the second half of the post covers. So if you're interested in that, you can check out this blog post.
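For reference, the built-in exact percentile function he mentions is used like this; the table and column names here are hypothetical:

```sql
-- 99th percentile of response times, computed exactly by core Postgres.
SELECT percentile_disc(0.99) WITHIN GROUP (ORDER BY response_time_ms) AS p99
FROM requests;
```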
The next piece of content- "Logical Replication Improvements in PostgreSQL-14". This is from amitkapila16.blogspot.com. They're talking about all the different changes for logical replication in Postgres 14, including decoding of large transactions. So that's where the decoding process happens when the transaction is still in process as opposed to waiting till it's committed. That allows replicas to keep closer to the current state of the primary. Talking about the performance of logical decoding and how that's improved, particularly if you have a lot of DDLS that are hitting the system. Talking about improving the performance or time of the initial table sync and avoiding errors that would cause the whole thing to be retried.
So there have been a number of improvements in this area. There's also logical decoding of two-phase commits as a new feature, and being able to monitor logical decoding through the pg_stat_replication_slots system view, which could definitely be beneficial, as well as allowing publications to be easily added and removed. Before, you had to respecify every publication, whereas it looks like you can now just add one to a given subscription. He also talks about adding a binary transfer mode and allowing messages to be received via pgoutput. So if you're interested in any of these features coming in Postgres 14 with regard to logical replication, definitely check out this blog post.
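As a refresher on the feature these improvements build on, here's what a minimal logical replication setup and the new monitoring view look like; the object and connection names are made up for illustration:

```sql
-- On the publisher:
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On the subscriber:
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary dbname=shop user=replicator'
    PUBLICATION orders_pub;

-- Postgres 14 lets you add or drop a single publication on an existing subscription...
ALTER SUBSCRIPTION orders_sub ADD PUBLICATION customers_pub;

-- ...and monitor logical decoding activity through a new system view.
SELECT * FROM pg_stat_replication_slots;
```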
The next piece of content- "POSTGRESQL: CREATE INDEXES AFTER BULK LOADING". This is from cybertec-postgresql.com, and this is a good piece of advice. Generally, you're going to always want to create your indexes after you've bulk-loaded a lot of data because it's much faster to load the data. He's got an example here. He loaded the same amount of data in two different ways. The first way, he loaded all the data with no indexes, and then he applied the one index that existed and it happened in less than half the time of keeping the index on the table and loading it that way.
So it's far more efficient to load the data and then add the indexes as a second step. Now sometimes, of course, you're going to need to load a lot of data into an active table with active indexes, and there's simply nothing you can do about that. This is more applicable when you're doing an initial load of a lot of data. You can take this into account, although you may need to keep certain indexes on during the load if they're referenced as part of the loading procedure, like if you're referencing data from another location to do a secondary update or something of that nature. But overall, you want to keep the indexes off during any big initial load and then apply them later. If you want to learn more, you can check out this blog post.
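A minimal sketch of the load-then-index pattern, with hypothetical table and file names rather than the ones from the post:

```sql
CREATE TABLE measurements (id bigint, reading numeric);

-- Bulk load first, with no indexes on the table.
COPY measurements FROM '/tmp/measurements.csv' WITH (FORMAT csv);

-- Then build the index once, over the fully loaded data.
CREATE INDEX measurements_id_idx ON measurements (id);
```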
The next piece of content- "A quick test for postgres_fdw insertion". This is from highgo.ca. They did a quick test to see how much faster insertions are with the postgres_fdw in version 14. So they basically set up a foreign server on the same server using a different port and tried inserting different amounts of data: 1,000 rows, a million rows, and 100 million rows, and examined their performance across different batch sizes. The first one, there's no batches, it just does one row at a time, the next does ten rows at a time, and the next does 100 rows at a time.
What he found was that the performance, particularly with larger record sets, was about five to ten times faster using the batches. So for example, no batching with 100 million rows took 53 seconds, whereas it took 5 seconds with a batch size of 100. So that's ten times faster with a batch size of 100. The thing to keep in mind is that this was on a local server. Imagine the performance difference if you were interacting with a server across the network. I would imagine that increasing the batch size would give you even greater performance compared to doing one row at a time. So definitely a great improvement if you're using foreign data wrappers.
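The batching behavior is controlled by the batch_size option on the foreign server or foreign table. A hypothetical setup mirroring this kind of test might look like the following; the server, mapping, and table names are assumptions for illustration:

```sql
-- Foreign server pointing at a second Postgres instance, with batched inserts enabled.
CREATE SERVER remote_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', port '5433', dbname 'target', batch_size '100');

CREATE USER MAPPING FOR CURRENT_USER SERVER remote_pg
    OPTIONS (user 'postgres', password 'secret');

-- Inserts into a foreign table on this server are now sent 100 rows per round trip.
CREATE FOREIGN TABLE remote_items (id int, payload text)
    SERVER remote_pg OPTIONS (table_name 'items');
```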
The next piece of content- "POSTGRESQL SCHEDULERS: COMPARISON TABLE". This is from cybertec-postgresql.com, and they're doing a comparison of all the different schedulers that to their knowledge exist for Postgres. So they're talking about pg_timetable, which is a scheduler created by CyberTech. They also discuss pg_cron, pgAgent, jpgAgent, and pgbucket. They do the comparison charts of checks and X's, and then they have descriptions of each area below. So this is a great post if you're looking for a scheduler for Postgres, although generally I still just use Cron to do all my scheduling, so I haven't ever used any of these particular tools. But if you want to learn more, you can check out this blog post.
The next piece of content- "Best PostgreSQL GUI's in 2021 (Updated)". This is from retool.com, and Retool does do a PostgreSQL GUI, so that's something to keep in mind with this list. But they do list, other than their tool, ten other tools that allow you to use a GUI to interact with Postgres. The number one being, and probably the oldest and most popular, is pgAdmin, but they list others like Navicat, which is their paid-for service. There's DBeaver and HeidiSQL, DataGrip, OmniDB, Beekeeper Studio, TablePlus, QueryPie, and SQLGate. So those are all sorts of different GUI tools that you can use with Postgres.
The next piece of content- "pglog, a flexible and simple log analyzer". This is from mydbanotebook.org, and this post talks about an extension that's explicitly used to analyze log entries from Postgres. The idea is you take logs from one database source, you load them into a separate PostgreSQL instance that you can then use SQL to analyze those logs of the other server system. So there are particular requirements that you should set on the server you're going to be monitoring, and you're going to want to collect those csvlogs here. You load them into tables in a separate instance, and then you can use this pglog extension to be able to run some reports looking at the data in there. So like an autovacuum report, checkpoint report, temp file usage, et cetera. So if you do a lot of analysis or want to use SQL to analyze the log output of your database systems, you can check out this post and extension.
The next piece of content- "Tuple Duplicates Support in Incremental View Maintenance". This is from yugonagata-pgsql.blogspot.com, and this is another post on Incremental View Maintenance, where they want to design a way to maintain a materialized view without having to refresh it. Basically, it stays fresh by reading the data tables and doing the updates in real-time. And here they're talking about being able to support duplicates in a table. So if you want to learn more about their progress with this feature, you can definitely check out this blog post.
Next piece of content- "What's new in the Citus 10.2 extension to Postgres?". This is from citusdata.com and there have been a number of enhancements, including PostgreSQL 14 Beta 3 support. Basically making it Postgres 14 ready. They've added some partition management functions for doing time series as well as index support for columnar tables. This is the one that I find pretty interesting, but if you're interested in Citus, which is an extension that does scale out Postgres, you can definitely check out this blog post.
The next piece of content: the PostgreSQL person of the week is Soumyadeep Chakraborty. So if you're interested in Soumyadeep and his contributions to Postgres, you can definitely check out this blog post.
The last piece of content: the latest episode of The Rubber Duck Dev Show covered "REST vs. GraphQL APIs". So if you're looking for more long-form developer content, you can check that out.