
Is pg_dump A Backup Tool? | Scaling Postgres 331

In this episode of Scaling Postgres, we discuss whether pg_dump is a backup tool, the pgMonitor extension, Postgres malware, and application uses for the MERGE command.

Transcript

So what do you use as your tool to back up Postgres? Well, apparently there are a lot of people in the community who don't think that tool should be pg_dump, arguing that it is an insufficient backup tool. So that is something we're going to be covering this week. But I hope you, your friends, family and coworkers continue to do well. 


Our first piece of content is “If pg_dump is not a backup tool, what is it?” This is from xata.io. So for the longest time, when you looked up pg_dump, the documentation said that pg_dump is a utility for backing up a PostgreSQL database, and that it makes consistent backups even if the database is being used concurrently. And that is true, but people have had an issue with calling it a backup, because if you look at it, it's called pg_dump, not pg_backup. So basically it is a way to dump your data. And I know that if my company had a database, the database system went down, and the only data I had was from pg_dump, I would consider that my backup, because I know with it I could get all my data back. But pg_dump lacks a lot of the tooling that a backup solution for database systems generally has. That's why there's a separate utility called pg_basebackup, which actually has backup in the name. Now, I'm not going to get into the discussion of, well, when you restore from a pg_dump you use pg_restore, so maybe they should have chosen a different name for pg_restore for that reason. But the new documentation coming in Postgres 18 replaces the word backup with export. So in retrospect, maybe pg_dump should have been called pg_export, and maybe pg_restore, which restores pg_dump output, should have been called pg_import. But naming things is a hard problem in computer science, so we have what we have. Now, one thing this article does mention as a disadvantage of pg_dump is that global objects in the entire Postgres cluster are not exported. That's true for pg_dump, because it's a per-database dump, or per-database export. But if you want the whole cluster, you use pg_dumpall, and I didn't see that mentioned here. So if you want to back up the whole cluster, meaning every database and all the global database objects, use pg_dumpall. So what do you think, is pg_dump a backup tool? I used to think, of course it's a backup tool, but now that I have thought about it more, I think it is more of an export/import tool, because you're exporting the data at a certain time, it's consistent as of that time, you can similarly import it, and you can do individual tables, et cetera. But if you want to learn more about this, you can check out this blog post. 


Next piece of content “Announcing an Open Source Monitoring Extension for Postgres with pgMonitor”. This is from crunchydata.com, and for a while Crunchy Data has had a tool they call pgMonitor. It's available as an open source tool that you can use, and I believe it relies on Grafana dashboards and things of that nature, and they've been using it for their Kubernetes and self-managed Postgres deployments. But they wanted to create a separate extension, and the reason they designed it is to do more work on the database system itself, to try to get quicker metrics as well as better version compatibility with whatever version of the database is being used. Now, this new extension just generates the data for consumption by another monitoring solution like pgMonitor, so of course it doesn't have the Grafana dashboards and everything; it's just a data production tool running on the database, as I understand it. And what it does is run a background worker that queries the Postgres internal tables and stores all the metrics data in materialized views, so those materialized views get refreshed on some periodic basis. Now, they said the pgMonitor extension can handle slower metric queries efficiently, keeping the overall metric scrape process fast. I don't know how this would make things faster overall, because you still have to query the internal tables. Maybe they're thinking that the metric scrape from the external tool just looks at the materialized views; I'm a little unclear on that. Or maybe, if it took time to transfer all the data from the database system to the pgMonitor tool running on a separate server, it was faster to run that work in the background worker and just send particular queries to pgMonitor. I'd basically have to look at how this is actually engineered; I don't fully understand how it implements things, and it wasn't clear to me from reading the blog post either. But clearly it's an overhaul of how they are collecting the metrics, and from their viewpoint it's definitely much more efficient, and it supports collecting custom metrics as well. And here are some of the metrics and materialized views that are included. So if you're looking for a monitoring solution, maybe you'd like to check out this option. 
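To make the described pattern a bit more concrete, here is an illustrative SQL sketch, not the extension's actual schema: an expensive metrics query over a system view is materialized so an external scraper only reads precomputed rows, with a background worker (or scheduled job) refreshing it periodically. The view name is made up for this example.

```sql
-- Hypothetical illustration of the pattern: cache a metrics query in a
-- materialized view so scrapes read precomputed rows instead of the catalogs.
CREATE MATERIALIZED VIEW IF NOT EXISTS metrics_database_stats AS
SELECT datname,
       numbackends,
       xact_commit,
       xact_rollback,
       blks_read,
       blks_hit
FROM pg_stat_database
WHERE datname IS NOT NULL;

-- A background worker (or cron-style job) refreshes the snapshot periodically...
REFRESH MATERIALIZED VIEW metrics_database_stats;

-- ...and the monitoring agent's scrape becomes a cheap read:
SELECT * FROM metrics_database_stats;
```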


Next piece of content “PG_MEM: A Malware Hidden in the Postgres Processes”. This is from aquasec.com, and they're talking about new malware that was discovered in the wild. So basically it looks like they had Postgres instances running on the Internet as honeypots with very weak usernames and passwords to see how they would be attacked. And basically how the attack happens is someone connects with this username and password and tries to run something like COPY ... FROM PROGRAM to execute arbitrary code on the database system. Now, they call this a misconfigured PostgreSQL in the chart here, so I'm not sure what else was misconfigured other than having a weak username and password, having it accessible from the Internet, and presumably maybe even MD5 authentication. I'm not sure what else was wrong, but basically someone was able to connect as this user and exploit it, and they show the process the attacker went through; they basically installed cryptocurrency miners on the system, from what I understand. So one of the main lessons here is, first, keep your Postgres versions up to date. Second, don't have it directly accessible from the Internet. And if you do have to do that, use a strong authentication system such as SSL certificates, even client SSL certificates, to secure your connection to the database when you're connecting. But if you want to learn more, definitely check out this blog post. 
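For context, this is the feature being abused, shown as a hedged sketch: COPY ... FROM PROGRAM runs a shell command on the database server and loads its output into a table, and it is only available to superusers and members of the built-in pg_execute_server_program role. That is exactly why application logins should be unprivileged and why weak, Internet-reachable superuser credentials are so dangerous.

```sql
-- Runs a shell command on the *server* and ingests its output; requires superuser
-- or membership in pg_execute_server_program. Shown only to illustrate the attack
-- surface the article describes.
CREATE TEMP TABLE cmd_output (line text);
COPY cmd_output FROM PROGRAM 'id';
SELECT * FROM cmd_output;
```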


Next piece of content “Beyond Simple Upserts with MERGE in PostgreSQL”. This is from notso.boringsql.com, and this is another example of using the MERGE command, which was new in version 15 and is getting enhanced in 17 with the RETURNING clause. He says historically people were using things like INSERT ... ON CONFLICT DO UPDATE for upserts. Well, MERGE can do that and more: you can choose to delete data as well. Basically it's a set of rules defining how data should change based upon incoming data. Now, he's using it here to set up a scoring system for a mobile game, so there are all sorts of rules that must be adhered to when changing data based upon players and their status. He covered part of the requirements with an upsert using ON CONFLICT DO UPDATE, but it didn't give all the functionality needed. So first he used MERGE to replicate handling upserts, and then added deletions as well to give you the final implementation. And this is a key thing to keep in mind with features like MERGE. I remember back when I needed to build a player ranking system, and that could have taken a lot of code in Ruby on Rails, the application framework I was using at the time, but I was able to do all the logic required with window functions. Well, it wasn't a simple window function, but using a Postgres feature, window functions, I was able to build it in the database. Now that MERGE is available, complex data transformations like this can be done more easily. So if you have a complicated data transformation you need to do, maybe you can use MERGE instead of a lot more code in your application. But if you want to learn more, check out this blog post. 
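As a rough idea of what that kind of final implementation can look like, here is a minimal MERGE sketch with hypothetical table and column names (not the article's exact schema): upsert incoming scores, delete banned players, and, on Postgres 17 or later, report what happened with RETURNING.

```sql
-- Hypothetical schema: player_scores is the target table, incoming_scores is a
-- staging table holding the latest data for each player.
MERGE INTO player_scores AS t
USING incoming_scores AS s
    ON t.player_id = s.player_id
WHEN MATCHED AND s.status = 'banned' THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET score = GREATEST(t.score, s.score),
               updated_at = now()
WHEN NOT MATCHED THEN
    INSERT (player_id, score, updated_at)
    VALUES (s.player_id, s.score, now())
RETURNING merge_action(), t.player_id;  -- RETURNING with MERGE is new in Postgres 17
```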


Next piece of content: there was another episode of PostgresFM last week, and this one was on “Get or Create”. In this episode Michael and Nikolay were joined by Haki Benita, because he wrote the article “Get or create in PostgreSQL”, which we covered on a previous episode of Scaling Postgres. Basically he was implementing tags and wanting to get a tag if it exists and, if not, go ahead and create it. And he was showing all the different issues you can run into depending on how you implement that: are you going to run into race conditions, are you going to cause bloat, or are you going to run into idempotency issues? So he goes through all the different options for how to do it, and he discusses and does a deep dive with Michael and Nikolay in this episode, and he still says he very much likes INSERT ... ON CONFLICT DO NOTHING to handle this type of scenario. But with regard to MERGE, he did mention he ended up getting a constraint violation when using it, unlike with INSERT ... ON CONFLICT, which was pretty surprising to him. But if you want a deep dive on this, I definitely encourage you to check out their episode. You can listen to it here or watch the YouTube video down here. 
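For reference, the insert-first approach he favors looks roughly like this (a hypothetical tags table with a unique constraint on name): try to create the row, and if ON CONFLICT DO NOTHING suppressed the insert, RETURNING yields no row and you fall back to a plain SELECT.

```sql
-- Step 1: attempt the insert; if the tag already exists, the conflict clause
-- suppresses the write and no row comes back from RETURNING.
INSERT INTO tags (name)
VALUES ('postgres')
ON CONFLICT (name) DO NOTHING
RETURNING id;

-- Step 2: if step 1 returned no row, the tag already existed, so just read it.
SELECT id FROM tags WHERE name = 'postgres';
```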


Next piece of content, “CloudNativePG 1.24.0 and 1.23.4 Released!”. This is from postgresql.org. Again, this is an operator for running Postgres on Kubernetes; there are others, but this one seems to be getting a lot of traction recently. Among the updates in these versions is the ability to handle distributed PostgreSQL topologies. Basically this is having separate Kubernetes clusters in different data centers and syncing the data between them. Now, you only have one cluster that's operational at any one time, it doesn't do bidirectional replication, but you have one active cluster in some data center and you can fail over to another data center if you need to. The next feature is managed services. Basically this is a way of setting up services to interact with your Postgres running within the cluster, so this could be granting database access from outside of the Kubernetes cluster, which is not something it does automatically, and they say this is ideal for database-as-a-service scenarios. The next one is an enhanced synchronous replication API, so basically additional features for doing synchronous replication within the Kubernetes cluster. And finally, a WAL disk space exhaustion safeguard. Apparently there were some cases where, if your WAL files exhaust the disk space, you could get into an unrecoverable state where the systems keep failing over to one another. Well, they have a solution for dealing with that in the new version. But if you want to learn more, check this out. 


The next post, also related to CloudNativePG, is “CNPG Recipe 12 - Exposing Postgres outside Kubernetes with Service Templates”. So this uses one of the features that I just mentioned. Basically it shows how you can set things up more easily than you used to be able to, in order to, say, run a psql command and connect directly to Postgres running in a Kubernetes cluster. I know when I first started experimenting with CloudNativePG, the first thing I wanted to do was connect to my cluster through psql, and I couldn't do it, because the default way it's set up you actually need to be within the Kubernetes cluster to connect to the database systems. I had to scour the Internet to find out how to actually set that up correctly, but presumably it's easier to do now. So you can check out this blog post if you want to learn more about that. 


Next piece of content “Handling Cancellation Request”. This is from dataegret.com, and a cancellation request is what happens when, say, you're sitting at a psql prompt and you press Ctrl+C: it cancels the query that's currently being executed. What actually happens is that the client creates a separate connection and uses a special secret key for the session so the server knows which query to cancel. And they show an example of how this works here: a client connects, it has its secret key, it starts a query, and if you need to cancel it, maybe you hit Ctrl+C or something like that, it then opens a new connection, sends that key to identify the session, and the query gets cancelled. But this can have problems if you have things like a load balancer in front of Postgres. In that scenario, when you establish this new connection with the cancel request, you're not guaranteed that it's going to be sent to the same database system, so you could run into errors trying to do that. Or if you have multiple PgBouncers, again, there's no guarantee that the same PgBouncer will receive this additional connection carrying the cancel request, so you could run into issues with that too. However, there is a solution, because as of PgBouncer 1.19 there is support for peering. Basically, you set up your configuration to define a peer_id for each of the PgBouncer processes that are running, so this is peer_id one, this is peer_id two and peer_id three down here. And then in the peers section you define who the other peers are by their id, so you can find this host here and this host here. And what it does is, if a PgBouncer receives a cancellation request, it knows to forward it to the other PgBouncers it knows about, so your cancellation request can reach the right place and the query gets cancelled. So if you want to learn more about PgBouncer and peering, feel free to check out this blog post. 
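As an aside, and not the protocol-level cancel request the article is about, there is also a purely server-side way to cancel a query that sidesteps the routing problem, since any session that reaches the right Postgres server can do it: look up the backend in pg_stat_activity and call pg_cancel_backend(). The query text filter and pid below are placeholders.

```sql
-- Find the backend running the query you want to stop (placeholder filter).
SELECT pid, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE '%long_running_report%';

-- Cancel just the current query on that backend (use pg_terminate_backend
-- instead if you want to drop the whole connection).
SELECT pg_cancel_backend(12345);  -- replace 12345 with the pid found above
```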


Next piece of content “PGConf.dev 2024 - New logical replication features in PostgreSQL 17”. This is from postgresql.fastware.com, and they mention specifically two new logical replication features coming in 17 that they helped develop. One is pg_createsubscriber. This is the ability to take a streaming replica and convert it into a logical replica, because what takes the longest time with logical replication is the initial data sync: copying all the initial data can take a really long time. But if you already have a physical replica, basically a WAL streaming replica, you can now use this capability to convert it into a logical replica. So that's awesome. The other feature they added is being able to start up logical replication after a system has been upgraded with pg_upgrade. So basically they improved pg_upgrade so that additional objects such as replication slots and replication origins get carried over, making it much easier to start up logical replication after an upgrade. So that's great. 


Next piece of content, kind of related to logical replication, is “pg_failover_slots 1.1.0 is released”. This is from postgresql.org, and failover slots address the situation where you have a primary that's doing logical replication and then you need to fail over to a replica. The problem is that logical replication is then lost and you have to restart from the beginning, because the logical slot on the primary is not carried over to the replicas, at least not today. But apparently it is with this new capability, pg_failover_slots, which is open source and available on GitHub. So if you need that capability, definitely check out this extension. 


And the last piece of content “Why I Always Use PostgreSQL Functions For Everything”. This is from Vedran on medium.com, and this post will definitely trigger some people, because I remember back when everybody used to create stored procedures on, say, Microsoft SQL Server or Oracle, and that's where all the business logic resided. Well, he's advocating a similar tactic today, and it kind of reminds me of making a microservice for your database. Basically you have one interface: you call a function, something like get_latest_device_measurement, and it provides the data that you need. But to me this starts to look like maintaining two different code bases, much like a microservices architecture might. I definitely haven't seen this done a lot in the applications I've looked at, but it's definitely his preferred way to do things. I know personally, as I work on different projects, I tend to just follow what the typical Rails engineer or the typical Python engineer would do, because for me it's more important that the next person who comes in after me is able to interpret and understand how things are structured, and that it's not too different from any other application built with that framework. But if his perspective is appealing to you, definitely check out his blog post. 
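To give a flavor of that style, here is a minimal sketch based on the function name mentioned in the post; the table, columns, and signature are assumptions, not his actual code. The idea is that the application only ever calls the function and never queries the underlying tables directly.

```sql
-- Hypothetical implementation: the API the application sees is just this function.
CREATE OR REPLACE FUNCTION get_latest_device_measurement(p_device_id bigint)
RETURNS TABLE (measured_at timestamptz, value numeric)
LANGUAGE sql
STABLE
AS $$
    SELECT m.measured_at, m.value
    FROM device_measurements AS m
    WHERE m.device_id = p_device_id
    ORDER BY m.measured_at DESC
    LIMIT 1;
$$;

-- Application code then only runs:
-- SELECT * FROM get_latest_device_measurement(42);
```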
