State Of Postgres 2023 | Scaling Postgres 296
In this episode of Scaling Postgres, we discuss the State of Postgres 2023 survey, a repository of Postgres how-tos, ways foreign keys can break and a custom SQL playground.
Content Discussed
- State of Postgres 2023
- postgres-howtos
- Broken foreign keys: how can that happen in PostgreSQL?
- The Postgres Playground: BYOS (Bring Your Own SQL)
- Zero downtime Postgres upgrades and how to logically replicate very large tables
- Starting PostgreSQL logical replication for some tables based on a dump
- Hash indexes
- What you should know about PostgreSQL minor upgrades
- Global search inside a database
- Highlights from podcast episode about Postgres monitoring with Lukas Fittl and Rob Treat
- Extracting SQL from WAL? (part 2)
- Data Science is Getting Ducky
- Working with GPS data in PostGIS
- PostGIS Clustering with DBSCAN
- Transaction Isolation in Postgres, explained
- Understanding Triggers In PostgreSQL
- Explore Table Access Method Capabilities: How Data Insertion is Handled
- Key PostgreSQL Extensions Every Developer Must Know When Migrating from Oracle.
- New in PostgreSQL 16: Bi-Directional Logical Replication
- Experiencing a specific table recovery using file-level backup
- Trying to be Barbie in Ken's Mojo Dojo Casa House
- Introducing Read Replicas
- Fly Postgres, managed by Supabase
- pgAdmin for All of Your Postgres Kubernetes Clusters
- Supabase Wrappers v0.2: Query Pushdown & Remote Subqueries
- Supabase Auth: Identity Linking, Hooks, and HaveIBeenPwned integration
- Quickly Building SQL Dashboards for Time Series With Continuous Aggregates
- Reflections on PGConf.eu 2023
- 2023.pgconf.eu – recap
YouTube Video
Podcast Audio
Transcript
Let's check out the gifts the Postgres community gave to us in this week's content, but first, I hope you, your friends, family, and coworkers continue to do well. I was thinking, you know, with this week being in the holiday season, there probably wasn't going to be a lot of content. Boy, was I wrong; there's a lot of linked content here. But as I typically do, this is the top ten content according to me: I usually cover the top-rated pieces first and then follow up with the rest, although it's not always exactly ten, maybe it's eleven or nine. The first piece of content is "The state of PostgreSQL 2023". This is from timescale.com. They have released the results of the survey they ran, I think, in August and September. I probably shouldn't have shown the preview, because a lot of this information is repeated, but we'll take a look at some of it.
So I did mention some of it last week. One thing that wasn't mentioned is why people chose Postgres. The number one reason is the open-source license, and that's far and away the most popular reason. I believe the breakdown here is by how long people have been using Postgres. Next came the feature set, although it's not as popular a reason for those who've recently adopted it, which is kind of interesting. Next is its reliability, and interestingly, to me, the older you are, the more you value reliability. So lesson learned, everyone: reliability is important. After that it looks like it's pretty much extensions, then performance, and then it falls off from there. So that's some pretty interesting information. The primary use case for Postgres is application development, which is not too surprising.
It also asked what kind of industry you're in: information technology, software, SaaS, finance, and fintech were the top ones. Some sections covered the community and the extensions, and there's one on the tools used to connect to PostgreSQL most often, where psql wins out. That's the only thing I use; I don't use any of the others such as pgAdmin, DBeaver, DataGrip, or IntelliJ. As for what other third-party tools are used, most people said they don't use any third-party tools, or they use PgBouncer, although there are a few others like Depesz explain, pgBackRest, pganalyze, et cetera. It also looked at the different visualization tools being used, and then there's the section about AI that we discussed last week. So that's pretty much the survey; go ahead and check it out if you want to look at it in more detail.
The next piece of content is actually a repository on GitLab, under Postgres AI's postgresql-consulting group, called postgres-howtos. This is from Nikolay, who's one of the hosts of Postgres FM and, I think, the founder of Postgres AI. Apparently, according to the readme, he started a Twitter thread he calls a PostgreSQL marathon, where his goal is to create 365 daily posts, and he's currently on number 79. Each one covers how to do different things in PostgreSQL, and it looks to be from a consultant's perspective: one is how to troubleshoot and speed up Postgres restarts, another is how to troubleshoot startup, and there are a few on pg_stat_statements. So I figured this would be a great reference to have, and I would definitely keep this link as he keeps it updated. I don't know if he's eventually planning to do this in a different format, but it's available here now, and if you click on an entry, it gives you more detail in a blog post format. So I definitely encourage you to check this out, or at least bookmark it for when you have problems so you can investigate it.
Next piece of content- "BROKEN FOREIGN KEYS: HOW CAN THAT HAPPEN IN POSTGRESQL?". This is from cybertec-postgresql.com. They're talking about issues where foreign keys can become broken. So foreign keys actually use system triggers, and that a single foreign key constraint will create an AFTER INSERT and an AFTER UPDATE trigger on the referencing table, and an AFTER UPDATE and an AFTER DELETE trigger on the referenced table. This is to make sure that you have no orphan records that get created based on changes. Now the system triggers, as he mentions here, are written in C and ignore the ordinary MVCC rules to avoid race conditions. But he says if you try to use user-defined triggers to do something similar to what foreign keys do, you're going to be running into race conditions unless you use serializable isolation or implement some sort of locking.
But you can take a look at these foreign key triggers: the system triggers are stored in the pg_trigger catalog, and he shows how to query it to see what they look like when you set up a foreign key. So one way foreign keys can get broken is, of course, through database corruption. Thankfully, data corruption tends to be a rare occurrence with Postgres. The second way they can become broken is through the session_replication_role setting. He says, quote "When logical replication replays data modifications on the subscriber, the order of these changes might conflict with foreign key constraints on the subscriber". One way to get around that is to set session_replication_role to replica. Then triggers won't fire, including the triggers behind foreign keys. This happens on the subscriber, so it is possible to create orphan records that way.
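To make that concrete, here's a minimal sketch with made-up table names (not the post's exact example) showing the system triggers a foreign key creates and how the replica session_replication_role lets an orphan row slip through:

```sql
-- Hypothetical tables, for illustration only.
CREATE TABLE parent (id bigint PRIMARY KEY);
CREATE TABLE child  (id bigint PRIMARY KEY,
                     parent_id bigint REFERENCES parent (id));

-- The constraint's internal triggers show up in pg_trigger:
-- two on child (INSERT/UPDATE) and two on parent (UPDATE/DELETE).
SELECT t.tgrelid::regclass AS table_name, t.tgname, t.tgisinternal
FROM pg_trigger t
JOIN pg_constraint c ON c.oid = t.tgconstraint
WHERE c.conname = 'child_parent_id_fkey';

-- Logical replication replay effectively runs with this setting,
-- which skips those triggers (changing it requires superuser).
SET session_replication_role = replica;
INSERT INTO child (id, parent_id) VALUES (1, 42);  -- no parent with id 42
RESET session_replication_role;
```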
So you need to be cautious of that. The next way they can become broken is by disabling triggers with the ALTER TABLE command: disabling all triggers disables your user triggers as well as the system triggers that maintain the foreign key constraints, so that's something to be aware of. A final way you can break foreign keys is through cascades. With a foreign key you can define ON DELETE CASCADE, so deletes cascade from the referenced table to the referencing table. The problem is that triggers can also fire on that referencing table, and if you set up a BEFORE trigger on it that cancels the operation, which you can do by just returning NULL as in the example he shows, the referencing row isn't deleted and you're left with an orphan row. So those are some of the ways you can break Postgres's foreign key constraints, and you probably want to review them to make sure you avoid doing that.
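In the same hedged spirit, here are sketches of those two scenarios on the same hypothetical tables: disabling all triggers, and cancelling a cascaded delete from a BEFORE trigger.

```sql
-- 1. Disabling ALL triggers also disables the foreign key's system
--    triggers, so this orphan insert is not rejected (requires superuser).
ALTER TABLE child DISABLE TRIGGER ALL;
INSERT INTO child (id, parent_id) VALUES (2, 99);  -- no parent with id 99
ALTER TABLE child ENABLE TRIGGER ALL;

-- 2. With ON DELETE CASCADE, a BEFORE DELETE trigger on the referencing
--    table that returns NULL cancels the cascaded delete, leaving an
--    orphan row behind. (Assume parent_id was declared with
--    REFERENCES parent (id) ON DELETE CASCADE for this to apply.)
CREATE FUNCTION cancel_delete() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  RETURN NULL;  -- skip the delete of this row
END $$;

CREATE TRIGGER keep_children
  BEFORE DELETE ON child
  FOR EACH ROW EXECUTE FUNCTION cancel_delete();
```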
Next piece of content- "The Postgres Playground: BYOS (Bring Your Own SQL)". This is from crunchydata.com. The Postgres playground is something that Crunchy data set up in their developer area that allows you to run an instance of Postgres in your web browser. And they're saying, hey, you can bring your own SQL files that can create tables, create data within this instance of Postgres because it's just running on your web browser. So they have an example here where they're just creating an example table and inserting two values into it.
They say that because it's its own instance, you can even set all sorts of parameters as part of the SQL you send to it. Basically, their recommendation is to create a gist on GitHub containing the SQL you want to run, then build your playground URL, passing it an SQL parameter that links directly to your gist. He has an example right here, and when you run it, it loads up their playground, Postgres in the browser runs your SQL file, and the schema or data you've prepped is all loaded and ready to go. So this is pretty cool; go ahead and check it out if you want to learn more.
Next piece of content- "Zero downtime Postgres upgrades and how to logically replicate very large tables". This is from pganalyze.com and this covers the post that we covered last week. I think it was the second piece of content we covered where the knock.app site was logically replicating from Postgres 11 to 15. And they use logical replication to do it. Well, Lukas covers that in this episode of "Five minutes of Postgres", and he also includes a number of references to other ways that you could replicate that data as well. So if you plan on doing any types of logical replication upgrades, I definitely recommend checking out his piece of content as well. Then secondly, I included in this the post I mentioned last week. I didn't actually show the post, but this is the actual post where you can do, as it says, "logical replication for some tables based on a dump". This post is from dbi-services.com. This is a different technique than I think even the ones mentioned in "Five minutes of Postgres", where you start a PSQL session using the replication database, using the syntax here, and then you create a replication slot within that session. You keep that session active, and then you can place a dump using the snapshot that you're given as part of the output, and that basically holds the position of that slot while this pg_dump is going on. Once you've copied the data over, you use this command to create the subscription specifying the slot name, telling it not to create the slot. Make sure the enabled is false and copied data is false, because if you already copied the data over using pg_dump in this example, then you alter the subscription to enable it. Now how this can help with large tables is that I've actually used this technique in parallel. So instead of using pg_dump, I think I used COPY. You could probably use pg_dump as well, but for a single table I parallelized it across eight or ten sessions or processes to the database, and that gave me the maximum amount of throughput on the system to convert the data as fast as possible. So this is another technique you want to review in case you're considering doing this type of logical replication upgrade.
Next piece of content: there was another episode of Postgres FM last week, and this one was on "Hash indexes". They talked a little bit about the history and the pros and cons of them, and basically the pros are not that compelling while the cons are pretty significant. There may be specific use cases, though. The one they mentioned was a really large text field: a hash index doesn't store the value itself, only the four-byte hash of whatever value you're indexing. B-trees, in contrast, have finite limits on how much you can store in an index entry. So if you have huge text values you want to test equality against, a hash is a great way to do that, because you just hash the value and that hash key is all that's stored in the index. That smaller size may give some advantage on lookups and queries, but the big downside is that hashes only work for equality.
So you can't do range lookups; greater than and less than don't work, you can't do any ordering like you can with a B-tree index, and you can't do index-only scans because no value is stored with it. However, an interesting thing I thought about during the episode is that Nikolay mentioned how he created his own hash-style index by using a B-tree index over a hash function, basically indexing the hash. Well, you could make that a covering index and include certain payload columns along with it. That could give you a hash-like index with the potential for an index-only scan, depending on the data you want to bring along, as in the sketch below. But this is another interesting episode you can check out if you're interested.
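Here's a minimal sketch of that idea with made-up names, comparing a native hash index to a B-tree over a hash expression with a covering column:

```sql
-- Hypothetical table for illustration.
CREATE TABLE documents (id bigint PRIMARY KEY, body text);

-- Native hash index: equality-only lookups, no index-only scans.
CREATE INDEX documents_body_hash ON documents USING hash (body);

-- Do-it-yourself alternative: B-tree over md5() of the value, with a
-- covering column so equality lookups can use an index-only scan.
CREATE INDEX documents_body_md5 ON documents ((md5(body))) INCLUDE (id);

-- The query has to use the same expression for the planner to pick it up.
SELECT id FROM documents WHERE md5(body) = md5('some long text value');
```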
Next piece of content- "WHAT YOU SHOULD KNOW ABOUT POSTGRESQL MINOR UPGRADES". This is from cybertec-postgresql.com. Basically, the minor upgrades are when you're looking at a version of Postgres, the first number before the dot is the major version, and the second number is the minor version. Minor versions only contain bug fixes, so there should not be any new features that'll change something unexpectedly. The minor upgrades are always binary compatible, which means you should just be able to install the new software, restart Postgres, and everything works with the data. There are no data changes needed, although caveat with that in a second. And he says minor upgrades won't introduce any bugs, although he said that's happened.
So I would say it's rare for a minor release to introduce new bugs. The general recommendation is to install these as soon as they come out, because usually the benefits outweigh any disadvantages, and if you have any concerns, maybe wait a week or two to see if anyone reports issues before you go ahead and upgrade. Now, how do you know a new release has happened? Well, of course, I announce these on Scaling Postgres, but there is also a PostgreSQL release roadmap, linked here, that tells you when each of the point releases is planned; unplanned releases usually only happen if there's a really severe issue, such as a data corruption bug or a big security problem.
But otherwise, they tend to follow the plan. He also says you can get alerted by subscribing to the "pgsql-announce" mailing list. Now, back to the caveat about no data changes being needed: you do have to read the release notes, because not infrequently there are specific things you have to do. Reindexing certain types of indexes is probably the number one thing I see. So follow the recommendations, check whether you're using the affected index types, and if so, plan a reindexing operation after the minor upgrade. But if you want to learn more, check out this blog post.
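When the release notes do call for it, you can usually rebuild the affected indexes without blocking writes. A small example with made-up object names:

```sql
-- Rebuild a specific index without blocking writes (PostgreSQL 12+).
-- The index name here is a placeholder; use whatever the release notes call out.
REINDEX INDEX CONCURRENTLY my_table_some_col_idx;

-- Or rebuild every index on a table if that's easier:
REINDEX TABLE CONCURRENTLY my_table;
```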
Next piece of content- "Global search inside a database". This is from postgresql.verite.pro and this is following a process where you know you have a value, but you don't know which tables or columns that value may be stored in. So this blog post has created a database-wide search in the form of a function so that you can do a global search with a particular search term, and then you can narrow down what parts of the database you want to focus on through these different parameters. And it basically searches all columns or rows to find those. So it goes into a lot of detail about the different parameters. But actually here's the 85 lines of source code that does it. So it looks through the different schemas and does this to find searches of the data. It has some examples down here where you can see what a search of “Foo” results in. And it says 'Okay, in this schema for this table name and this column, this column value was discovered and here's the row ctid'. So if you have a need for something like this, definitely check out this blog post.
Next piece of content- "Highlights from podcast episode about Postgres monitoring with Luka Fittl and Rob Treat". This is from citusdata.com and they're talking about a podcast episode, and this is basically the text version of it. But I'm sure you can get the link to the podcast if you want to learn more about this topic.
Next piece of content- "Extracting SQL from WAL? (part 2)". This is from rjuju.github.io, and this is the second part where he needed to extract data from a wall stream, but it wasn't set for logical replication, so it didn't have what you would normally expect to see here. However, he goes through the process of analyzing and pulling out the data that he needs. So you can check out this blog post if you want to learn more.
The last set of posts are about PostGIS. The first one is "Data Science is Getting Ducky". This is from blog.cleverelephant.ca, and he's talking about how he just got a new M2 Pro Mac, where a GEOS compile that used to take 20 minutes went down to 45 seconds. So that's impressive. As a result, he believes you could use more in-process databases on your own system to do GIS tasks, and he specifically mentions DuckDB here. It's basically a thought experiment about this and the evolution of column-oriented and hybrid storage layouts and working with Parquet files. So if you're interested in that, you can check out this blog post. The next one is "WORKING WITH GPS DATA IN POSTGIS". This is from rustprooflabs.com, and they show you how to get that set up in PostGIS. Finally, there's "PostGIS Clustering with DBSCAN" from crunchydata.com, where they discuss how to cluster data with these functions in Postgres and PostGIS. So check this content out if you'd like to learn more.
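For a taste of what the DBSCAN post covers, PostGIS exposes density-based clustering as a window function. A minimal sketch with made-up table and column names:

```sql
-- Assign a DBSCAN cluster id to each point: points within 100 units of
-- each other, in groups of at least 5, share a cluster; outliers get NULL.
-- (Table and column names are hypothetical.)
SELECT id,
       ST_ClusterDBSCAN(geom, eps := 100, minpoints := 5) OVER () AS cluster_id
FROM gps_points;
```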