Community Acquisition, Space Saving Terabytes, WAL Archiving, Vacuum Analyze Tips | Scaling Postgres 134

In this episode of Scaling Postgres, we discuss a company acquisition within the Postgres community, how to save terabytes of space, setting up WAL archiving and vacuum/analyze tips.

Transcript

All right. I hope you, your family, friends, and coworkers continue to do well. Our first piece of content is "EDB Completes Acquisition of 2ndQuadrant; Becomes Largest Dedicated Provider of PostgreSQL Products and Solutions Worldwide". So, this is an announcement from the enterprisedb.com website, or EDB. As you can tell, they have acquired 2ndQuadrant, and this post goes over some of the details as to why they did it. They give you some highlights of the deal in terms of the number of employees, customers, locations, things of that nature.

There's also an announcement on their blog, "How EDB Became the Leader in the Postgres Market", which has some discussion about why they decided to do this now. So I find this definitely a noteworthy item, even though it's not specifically about how to scale. You know, I believe this has an impact, probably both positively and negatively, on the Postgres community. Having a larger organization advocating for Postgres is beneficial. But I can't help but think that we lose a little bit of the community's diversity when something like this happens. Either way, feel free to check out these two posts.

The next piece of content is "PostgreSQL, a community project". This is from 2ndquadrant.com, and he's talking about how PostgreSQL is truly a community open-source project. He says open-source governance models usually come in three different structures: one, the project is run by an individual, or maybe two, but not many more; two, it's ruled by one company; three, it's actually community-led. So he talks about each of these models and how one can transfer into another and back. But it's very hard to actually transfer to a community-led one, because it's hard to go from one company to suddenly having multiple companies and individuals supporting it.

It's equally hard, starting from one or two individuals, to grow a community where multiple organizations as well as people assist on the project. He talks about how Postgres is definitely this community-led version and that it has a lot of benefits. He mentioned this because, when Postgres 13 came out, he noticed a lot of people discussing it as a successful community-led open-source project. So definitely an interesting perspective, and it's kind of why I mentioned that one of the potential negatives of the acquisition of 2ndQuadrant by EDB is that maybe we lose some of that community diversity. But definitely some interesting news items happening this week.

The next piece of content is "PostgreSQL at Scale: Saving Space (Basically) for Free". This is from Braintree Product Technology on medium.com, and basically, they're talking all about ordering columns to get space savings. Now, he says with this technique, they've been able to save about 10% of their disk space, and they have over 100 terabytes of data, so essentially it looks like they've saved maybe about ten terabytes of disk space. As he mentions, quote, "This technique isn't revolutionary: it's been well documented by 2ndQuadrant...", EDB, GitLab, et cetera. A lot of it comes down to the padding Postgres inserts between columns to preserve type alignment boundaries. And they give a good example of this: a bigint wants eight-byte alignment, so if you have an integer followed by a bigint, Postgres will only start the bigint at the next eight-byte boundary.

So if you have a simple integer before that, it has to pad the space in between, which essentially becomes wasted space. Whereas if you order it with the bigint first followed by the integer, as you can tell, we've saved about a quarter of the space. So it's using these techniques as well as a few others that they mention here, such as handling text and binary types in a certain way, and handling defaults, NOT NULLs, and primary keys a little bit differently, such as putting primary keys first. Basically, columns that are likely to contain data get placed ahead of the others. So this is how they've been able to achieve these space savings. They actually packaged it up into a Ruby gem, because they use Ruby, called pg_column_byte_packer.
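As a rough sketch of the alignment rule being described (table and column names here are hypothetical, and the per-column sizes are the standard Postgres ones):

```sql
-- integer (4 bytes) placed before bigint (8 bytes): Postgres inserts
-- 4 bytes of alignment padding before the bigint in every row.
CREATE TABLE events_padded (
    id      integer,  -- 4 bytes
    user_id bigint,   -- starts at the next 8-byte boundary: 4 bytes wasted
    flag    boolean   -- 1 byte
);

-- Widest-aligned columns first: no padding between columns.
CREATE TABLE events_packed (
    user_id bigint,   -- 8 bytes, naturally aligned
    id      integer,  -- 4 bytes
    flag    boolean   -- 1 byte
);
```

On a narrow table like this, the per-row savings compound quickly across hundreds of millions of rows, which is where the roughly 10% figure comes from.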

For those people who use Rails as their application framework, what it actually does is it helps order the migrations that you do to the database so that when you do a new one, it places things in the proper order to try and conserve space. It says they even went one step further and basically used pg_dump to rewrite tables for their whole database to reorder their data in it. Now, normally I wouldn't anticipate wanting to do this for just the 10% savings, but when you're talking about hundreds of terabytes and you're going to save tens of terabytes doing it, then maybe it makes sense to do that. But definitely an interesting way to save space in your database. So if you're interested in learning more, you can check out this post.

The next piece of content is "PostgreSQL WAL Archiving and Point-In-Time-Recovery". This is from highgo.ca, and it's a post about exactly that: how the write-ahead log records all the database activity to support crash recovery and maintain durability and consistency. For backup purposes, you're going to want to archive those WAL files, because they enable you to restore the database to a specific point in time. They talk about the different settings you need for that: archive_mode, which is generally set to on or off (there's also an always setting); archive_command, where you specify a command or shell script that saves the WAL files somewhere; and wal_level, which controls how much WAL to write, whether just a minimal level, enough for replicas, or enough to do logical replication from that WAL stream as well. Then they go through setting that up as well as doing a backup and a point-in-time recovery. So if you want to learn more about how to do that, definitely check out this post.
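A minimal sketch of those three settings, assuming a hypothetical archive directory (in production you'd normally use a proper archiving tool rather than plain cp):

```sql
-- wal_level 'replica' writes enough WAL for backups and physical replicas;
-- use 'logical' if you also want logical replication from the stream.
ALTER SYSTEM SET wal_level = 'replica';

-- archive_mode requires a server restart to take effect.
ALTER SYSTEM SET archive_mode = 'on';

-- %p expands to the path of the WAL file, %f to its file name.
ALTER SYSTEM SET archive_command = 'cp %p /mnt/wal_archive/%f';

SELECT pg_reload_conf();  -- apply the reloadable settings
```

The archive_command must return a zero exit status only when the file is safely stored, since Postgres will recycle the WAL segment once the command succeeds.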

The next piece of content is "PostgreSQL VACUUM and ANALYZE Best Practice Tips". This is from 2ndquadrant.com. They're talking about vacuum and ANALYZE: basically, vacuuming up dead tuples after an UPDATE or DELETE, as well as keeping statistics up to date with ANALYZE. The first tip is don't run manual VACUUM or ANALYZE without reason, because basically they just waste resources. They didn't mention any reason other than resource usage as to why you wouldn't want to run them too frequently. The second is to fine-tune the autovacuum threshold. So basically, define how often you want autovacuum kicked off for a particular table. The default values are good for smaller tables, but once your tables start getting larger, you are going to want to adjust these.

Next, do the same thing for the autoanalyze threshold to determine how often tables are analyzed for statistics purposes. Then he mentions fine-tuning the autovacuum workers, that is, how many workers to have. The thing to keep in mind with autovacuum workers is that they share a pool. By default there are three, and you can increase that, but that's not going to actually make your autovacuuming faster, because there is a common cost limit, set by autovacuum_vacuum_cost_limit (which falls back to vacuum_cost_limit by default), that is shared across all those workers. So just increasing the number of workers won't make things go faster; raising the limit is the best thing you can do to actually make vacuums run faster or more frequently. So if you want to learn some more best practice tips on VACUUM and ANALYZE, definitely check out this post.
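A sketch of what that tuning can look like (the table name and the specific values are illustrative, not recommendations from the post):

```sql
-- On a large table, the default 20% scale factor means autovacuum rarely
-- fires; trigger it after roughly 1% of rows change instead.
ALTER TABLE big_table SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);

-- More workers alone won't speed things up, because they all share one
-- cost limit; raise the limit along with the worker count.
ALTER SYSTEM SET autovacuum_max_workers = 6;           -- restart required
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;  -- shared by all workers
```

The per-table storage parameters take effect immediately, while autovacuum_max_workers only changes after a restart.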

The next piece of content is "postgres_fdw Authentication Changes in PostgreSQL 13". This is from percona.com, and they're mentioning two changes that were made to the Postgres Foreign Data Wrapper. The first is that a superuser can now permit non-superusers to establish a passwordless connection over the Postgres Foreign Data Wrapper. Previously only superusers could do this, but now the superuser can allow it for non-superusers as well. Secondly, you can authenticate via an SSL certificate, and they even go through the process of generating a certificate and a key to set this up in Postgres. So if you're interested in that, you can check out this post.
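A sketch of what those two changes look like on a user mapping, assuming a hypothetical server name, role, and certificate paths:

```sql
-- PG 13: a superuser can allow a non-superuser's mapping to connect
-- without a password.
ALTER USER MAPPING FOR app_user SERVER remote_pg
    OPTIONS (ADD password_required 'false');

-- PG 13: client certificate authentication options on the user mapping.
ALTER USER MAPPING FOR app_user SERVER remote_pg
    OPTIONS (ADD sslcert '/etc/certs/app_user.crt',
             ADD sslkey  '/etc/certs/app_user.key');
```

Setting password_required to false is itself restricted to superusers, which is what keeps the passwordless path from being a privilege escalation.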

The next post, also from percona.com, is "PostgreSQL 13 New Feature: dropdb --force". So normally you can't drop a database while there are connections to it. However, a force option has been added so that now, even if there are existing connections, it will go ahead and drop that database. Definitely a high-risk thing to do, but this new feature is now available in Postgres 13.
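The SQL form of the same feature looks like this (the database name is hypothetical); the CLI equivalent is `dropdb --force mydb`:

```sql
-- PG 13: terminate any existing sessions and drop the database in one step.
DROP DATABASE mydb WITH (FORCE);
```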

The next piece of content is "A QUICK LOOK AT POSTGRESQL 13 RC1 QUERY PERFORMANCE". This is from cybertec-postgresql.com. So he ran a bunch of queries that he had, and he describes his test setup here, looking for performance differences between 12.4 and Release Candidate 1 of Postgres 13. Basically, there were positives and negatives; on balance, nothing was dramatically faster in the queries he was testing, and essentially his conclusion was "So in the end- some test items were a bit slower, others faster, and most importantly it seems like there are no grave problems". So that's basically some good news.

Now similarly, there's another post called "PostgreSQL 13 Upgrade and Performance Check on Ubuntu/Debian: 1.6GB/s random reads". This is from blog.crunchydata.com, and he did an analysis running a bunch of different queries that he has against 13. It looks like he used the actual release rather than the Release Candidate, and he checked both 13 and 12. He found one example where some things were a little bit slower, but for the most part, all the queries he checked seemed to come to the same conclusion: some were a little faster, some a little slower, but overall everything looked pretty good. So no great improvements from either of these two posts, but basically there were little to no regressions. So, good news to know. And if you want to learn more about the details of these posts, definitely check them out.

The next piece of content is "PostGIS vs. Geocoder in Rails". This is from pganalyze.com. So they're talking about Ruby and Rails, which is a Ruby application framework. There is a Ruby gem called Geocoder that helps you do things like geocode cities, states, and countries into latitude and longitude and then calculate distances between them. And they were comparing it to actually using PostGIS, because Geocoder does not use PostGIS for its calculations. So they ran through the process of installing PostGIS into their environment and added the activerecord-postgis-adapter gem. This enables the Rails database layer to support Postgres-specific data types, such as point data types.

So PostGIS tends to use points and things of that nature, whereas the Geocoder gem just uses floats for longitude and latitude. Then they talked about building some helper classes to be able to do some of these calculations and finding nearby records. They compared both using Geocoder and PostGIS and they got pretty much the same performance, although it seems like PostGIS will be more accurate because it actually takes into account the curvature of the earth, whereas Geocoder does not. But that should only matter if you're doing really long-distance calculations, I would imagine. 

So for shorter distances, the difference between them won't matter as much. Then they looked at finding records within a bounding box with PostGIS and Geocoder and compared those. Then they looked at some things that actually require PostGIS. So if you're looking for records inside of a specific polygon, as opposed to just a bounding box, that's not something the Geocoder gem does; you have to reach for PostGIS for that. They have an example of doing that type of query, as well as finding related nearby records. So if you use Ruby on Rails and you're interested in learning more about actually using PostGIS as opposed to something like Geocoder, definitely check out this blog post.
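The kinds of queries being described can be sketched like this (the places table and the coordinates shown are hypothetical, not from the post):

```sql
-- Hypothetical schema: places(name text, coordinates geography(Point, 4326)).
-- Nearby records: places within 5 km of a point, on the spheroid.
SELECT name
FROM places
WHERE ST_DWithin(
    coordinates,
    ST_MakePoint(-122.4194, 37.7749)::geography,  -- lon/lat of search center
    5000                                          -- distance in meters
);

-- Records inside an arbitrary polygon, which is the case where
-- a bounding-box approach like Geocoder's no longer suffices.
SELECT name
FROM places
WHERE ST_Covers(
    ST_GeomFromText(
      'POLYGON((-122.5 37.7, -122.3 37.7, -122.3 37.8, -122.5 37.8, -122.5 37.7))',
      4326)::geography,
    coordinates
);
```

Using the geography type is what gives the curvature-aware distances mentioned above; the geometry type would compute planar distances instead.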

The next piece of content is "pgpool II 4.2 features". This is from highgo.ca, and it basically covers all the list of features that are coming in the next major release of Pgpool II which is 4.2. So they talked about things like the logging collector, support for log_disconnections, health check improvements, health check stats, different pool back-end stats, the LDAP authentication support, the snapshot isolation mode, as well as others. So if you're interested in using Pgpool II, definitely check out this post.

The next post, also related to Pgpool II, is "How to Configure SCRAM and MD5 Authentication in Pgpool-II". This is from b-peng.blogspot.com. Now, there was a previous post talking about authentication in general in Pgpool II, whereas this one focuses on how to set up SCRAM and MD5 authentication with it. So check out this post if you're interested in doing that.

The last piece of content, the PostgreSQL person of the week is Thomas Munro. So if you're interested in learning more about his contributions to Postgres, definitely check out this blog post.
