Active-Active & Federated, Memory Speed, Table Rewrite Causes, PostgreSQL Shibboleth | Scaling Postgres 146
In this episode of Scaling Postgres, we discuss setting up active-active instances with federation, the impact of memory speed, the causes of table rewrites and the Postgres shibboleth.
Content Discussed
- Active-Active PostgreSQL Federation on Kubernetes
- PostgreSQL 13 Benchmark: Memory Speed vs. TPS
- What Postgres SQL causes a Table Rewrite?
- The shibboleth of PostgreSQL
- [YouTube] PgDay San Francisco Channel
- Checkpoints In PostgreSQL
- How to install and configure PostgreSQL Debian/Ubuntu – for developer use – part 1
- How to install and configure PostgreSQL Debian/Ubuntu – for developer use – part 2
- Waiting for PostgreSQL 14 – pg_stat_statements: Track time at which all statistics were last reset.
- Waiting for PostgreSQL 14 – Multirange datatypes
- Waiting for PostgreSQL 14 – Allow subscripting of hstore values.
- Postgres, PL/Python and SciPy/NumPy for Processing Images
- What is new in PostgreSQL 13 RPMs?
- LLVM issues with PostgreSQL YUM repository on CentOS 8
- RPM repository for unsupported PostgreSQL releases / distros
- Installing PostGIS 3.1 and PostgreSQL 13 on CentOS 8
- Loading Data into PostGIS: An Overview
- Query Load Balancing in Pgpool-II
- Timeouts in Pgpool-II connections
YouTube Video
Podcast Audio
Transcript
All right. I hope that your friends, family, and coworkers continue to do well. Our first piece of content is "Active-Active PostgreSQL Federation on Kubernetes". This is from blog.crunchydata.com, and they're talking about a process for creating a set of federated PostgreSQL instances where every instance is a primary, or master. Now, for this post you can mostly ignore the Kubernetes part, because the interesting part is how the active instances stay in sync: through a combination of logical replication and partitioning. They set up three different PostgreSQL clusters; their example happens to use Kubernetes, but you don't have to use it that way.
They called one hippo-east, one hippo-central, and one hippo-west. Now, I'm going to skip ahead to the partitioning scheme, because that helps you understand what this post is doing. They created a hippos table that uses a UUID for its ID. They're doing this to avoid sequences, because sequence state doesn't get carried over with logical replication; if you're doing this type of federation, it's easier to just use UUIDs. They then set up a default partition, hippo-default, plus hippo-east, hippo-central, and hippo-west partitions. So each geographical server gets its own partition, and each instance only inserts data into its own partition: east only inserts into the east partition.
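To make that concrete, here's a minimal sketch of what such a partitioning scheme could look like. The hippos table name comes from the post, but the columns and partition values are my own illustration, not the exact DDL they used:

```sql
-- Partitioned parent table; the UUID default avoids sequences,
-- whose state is not carried over by logical replication.
CREATE TABLE hippos (
    id uuid NOT NULL DEFAULT gen_random_uuid(),
    node_name text NOT NULL,
    value jsonb,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (id, node_name)  -- must include the partition key
) PARTITION BY LIST (node_name);

-- One partition per node, plus a default catch-all.
CREATE TABLE hippo_default PARTITION OF hippos DEFAULT;
CREATE TABLE hippo_east    PARTITION OF hippos FOR VALUES IN ('east');
CREATE TABLE hippo_central PARTITION OF hippos FOR VALUES IN ('central');
CREATE TABLE hippo_west    PARTITION OF hippos FOR VALUES IN ('west');
```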
The central server only inserts data into the central partition, and the west server only inserts into the west partition. Now, they did some additional steps where they set up a way to identify each cluster, insert the data into the default partition, and then use a trigger to move it into the proper partition based upon the node name. But if your application already knows which node it's inserting into, say hippo-east, I don't think you need those additional steps. Either way, each instance inserts its data into its own partition. Then you set up the sharing by creating a publication on each of the nodes.
So for example, you create a publication on hippo-east for the hippo-east partition and a publication on hippo-central for the hippo-central partition, and then you have every node subscribe to every other node's publication. So hippo-central subscribes to east and west, hippo-west subscribes to east and central, and of course hippo-east subscribes to central and west. Once this is all working, the data is synchronized across all the nodes and lands in the proper partition on each one. He shows an example where he queries hippo-east and you can see the data from all the nodes, because he's querying the parent table and each row has been placed in the correct partition.
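As a sketch, the publish/subscribe wiring for one node might look something like this; the publication and subscription names and the connection string are hypothetical:

```sql
-- Run on hippo-east: publish only this node's own partition.
CREATE PUBLICATION pub_hippo_east FOR TABLE hippo_east;

-- Run on hippo-central and hippo-west: pull in east's data.
-- Each node ends up with one subscription per remote node.
CREATE SUBSCRIPTION sub_hippo_east
    CONNECTION 'host=hippo-east dbname=hippo user=replicator'
    PUBLICATION pub_hippo_east;
```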
Adding another node is straightforward: define a new node name for it, add a partition for that node to the partition scheme that exists across all the nodes, set up a publication for its partition, and subscribe to it from the other nodes. They describe how to do that here. They also discuss some alternatives. For example, you could use sequences instead of UUIDs by having each node's sequence start at a different offset, say one, two, and three, each incrementing by three, as sketched below. But that becomes very complicated very quickly once you try to add new nodes. The only downside they mention for UUIDs is that you could run into a collision, so you'll have to plan for how to address that. But this was a very interesting post on how you could set up federated PostgreSQL with multiple master nodes, so if you're interested, definitely check it out.
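For illustration, here's what that interleaved-sequence alternative might look like; the sequence name is made up:

```sql
-- Each node gets the same sequence name but a different offset,
-- all incrementing by 3, so generated IDs never collide.
-- Adding a fourth node later means re-spacing every sequence.
CREATE SEQUENCE hippos_id_seq START WITH 1 INCREMENT BY 3;  -- on hippo-east
CREATE SEQUENCE hippos_id_seq START WITH 2 INCREMENT BY 3;  -- on hippo-central
CREATE SEQUENCE hippos_id_seq START WITH 3 INCREMENT BY 3;  -- on hippo-west
```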
The next piece of content is "PostgreSQL 13 Benchmark: Memory Speed vs. TPS". This is from blog.crunchydata.com and they're examining increasing the memory speeds of a particular instance from 2,133 MHz to 3,000 MHz to 3,200 MHz. So basically asking the question, does memory speed matter in terms of transactional throughput? Looking through this scaling up with a different number of clients, you can see the more clients you have, the better it gets. But on average from his analysis, it looks like about a 3% improvement for memory speeds. So not a huge improvement, but it can give you a little bit. So if you're interested in this analysis, go ahead and check out this post.
The next piece of content is "What Postgres SQL causes a Table Rewrite?". This is from thatguyfromdelhi.com and he has presented this great table here that shows you what operations require a table rewrite. So essentially you're going to be doubling the size of the table if you execute these commands on the table. So for example, if you alter the table and set it to a new table space, it's going to rewrite that whole table. So essentially you're going to use up twice the disk space of that particular table. Or if you're going to alter the table and set it to be logged or unlogged, both of those operations require a full table rewrite. He's also listed this by versions from 9.5 to 13.
As you can see, the big improvement in version 11 is that adding a column with a DEFAULT no longer requires a table rewrite, which was a great enhancement. So as of right now the rewrite cases include setting a tablespace, setting LOGGED or UNLOGGED, CLUSTER (which reorders the data and so rewrites the whole table), and TRUNCATE. Though with TRUNCATE, it doesn't really use up double the disk space; as he says, it essentially just creates a new relfilenode. He also lists VACUUM FULL and, of course, altering column types: going from, say, an int to a bigint requires a table rewrite, as does changing a column's type to text. So this was a great table and blog post on SQL that can cause a table rewrite.
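To make the list concrete, here are statements of the kind he covers; the table, index, and tablespace names here are made up:

```sql
-- Each of these rewrites the entire table, so plan for roughly
-- double the table's disk space while it runs.
ALTER TABLE measurements SET TABLESPACE fast_ssd;
ALTER TABLE measurements SET UNLOGGED;
ALTER TABLE measurements ALTER COLUMN id TYPE bigint;  -- e.g. int -> bigint
CLUSTER measurements USING measurements_pkey;
VACUUM FULL measurements;
```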
The next piece of content is "THE SHIBBOLETH OF POSTGRESQL". This is from cybertec-postgresql.com and he's talking about shibboleth, which as Wikipedia defines it as any custom or tradition, usually a choice of phrasing or even a single word that distinguishes one group of people from another. They're basically talking about the Postgres community and really how to write and pronounce PostgreSQL. The way to do it you can just do Postgres or the other way to do it is POST-gres-cue-ell. So those are the accepted ways to pronounce it. Ways you should not pronounce it are Postgre or POST-gres-SEE-quel or POST-gres-ESS-cue-ell. So if you're interested in learning more about that, you can check out this post.
The next piece of content is a YouTube channel: the PgDay San Francisco channel. Over the last month or so they've been posting videos, roughly one per week, and they have five up now with presumably more coming. It has content such as explaining EXPLAIN, advanced data types, vacuum through pictures, et cetera. So if you're interested in this type of video content, you can check out this channel on YouTube.
The next piece of content is "Checkpoints in PostgreSQL". This is from highgo.ca. They're talking about what a checkpoint is. Basically, a checkpoint is when data is changed in Postgres, it's written to the WAL, and on a periodic basis a checkpoint is done that takes all the different changes that have happened in the WAL or that still reside in memory and flushes those changes to the actual database file. So it's making a consistent checkpoint that the data on the disk essentially matches what's in memory. He says what a checkpoint does is that a WAL record of this checkpoint is recorded in a log. All dirty pages of data are written to disk, and the checkpoint is written to the pg_control file.
Then it synchronizes all the data files. As for what triggers a checkpoint: reaching max_wal_size, the limit on how much WAL you want to maintain; the checkpoint_timeout expiring; a backup being started via pg_basebackup or pg_start_backup; a database server shutdown; or an explicit CHECKPOINT command executed by an administrator. Then it goes over the settings that control checkpointing: max_wal_size and checkpoint_timeout, which were mentioned above; checkpoint_completion_target, which controls how long the checkpointing process takes by spreading the writes out; and checkpoint_warning, which lets you know if checkpoints are happening too frequently. So if you're interested in learning more about checkpoints, check out this post.
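As a quick illustration, those settings can be adjusted with ALTER SYSTEM; the values below are illustrative, not recommendations from the post:

```sql
ALTER SYSTEM SET max_wal_size = '2GB';
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- spread the writes out
ALTER SYSTEM SET checkpoint_warning = '30s';
SELECT pg_reload_conf();  -- apply without a restart

CHECKPOINT;  -- an administrator can also force a checkpoint manually
```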
The next piece of content is "How to install and configure PostgreSQL Debian/Ubuntu - for developer use". There are two parts, 1 and 2. These posts are from depesz.com. Part one covers general installation. Now, of course, you could just download the package manager from the distribution and install it. But this post walks through actually using the Postgres repository, which gives you access to all the different Postgres versions that exist for doing your installation. As well as discusses how to get it set up and how to connect to it using psql and the different configurations that you might want to do to be able to connect as a user.
The second post covers configuring the system: altering log settings, setting wal_level, enabling archive_mode along with an archive_command, and using pg_stat_statements and auto_explain to understand what your queries are doing. Again, this is aimed at development servers, not necessarily production, but if you're interested, you can check out these posts.
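As a rough sketch of the kinds of settings involved (illustrative values; see the posts for depesz's actual choices):

```sql
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = '/bin/true';  -- placeholder; archives nothing
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements,auto_explain';
-- shared_preload_libraries and archive_mode require a server restart.
```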
Also from the same site, depesz.com, three posts were done on what's coming in version 14 of Postgres. The first is "pg_stat_statements: Track time at which all statistics were last reset". Previously, when you reset pg_stat_statements, the time of the reset wasn't recorded anywhere; in 14 you can get it from the stats_reset column of the new pg_stat_statements_info view. So that's a great addition.
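For example, on Postgres 14:

```sql
-- Returns the timestamp of the last pg_stat_statements reset.
SELECT stats_reset FROM pg_stat_statements_info;
```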
The next post is "Multirange datatypes". In addition to supporting ranges within a data type, you can now do multiple ranges that are separated by commas using a curly brace. So then that's an interesting implementation. Now they did mention it doesn't work for indexes yet, but it's still a ways off for 14, so that may get implemented before version 14 goes live.
The last post is "Allow subscription of store values". So in addition to using syntax with the Hyphen and the greater than symbol to be able to pull out a value, you can now use this subscripting where you use square brackets. Now, he said this doesn't actually work with indexing, but you can use it to pull out values. So if you're interested in learning more about that, you can check out these posts.
The next piece of content is "Postgres, Pl/Python and SciPy/NumPy for Processing Images". So this post from blog.crunchydata.com is about using a Django app that uses some of these Python tools in order to process images along with Postgres. So if you're interested in that, you can check out this post.
The next set of blog posts are all from Devrim Gunduz's PostgreSQL Blog, and they cover the RPM packages used on Red Hat, Fedora, and CentOS. The first post is "What is new in PostgreSQL 13 RPMs?", which covers the changes made for the new packages.
The second post covers "LLVM issues with PostgreSQL Yum repository on CentOS 8". Basically ways to address that.
The third post is "RPM repository for unsupported PostgreSQL releases/distros". Basically, how to get access to those.
Finally, the fourth is "Installing PostGIS 3.1 and PostgreSQL 13 on CentOS 8". So if you're interested in any of these posts about running Postgres on CentOS, Red Hat Enterprise Linux, or Fedora, definitely check them out.
The next piece of content is "Loading Data into PostGIS: An Overview". This is from blog.crunchydata.com and it covers numerous ways that you can import data. Namely, they put vector data into PostGIS and, of course, Postgres. It covers using the command line tools such as shp2pgsql, or ogr2ogr as well as covering using raw SQL as well as some GUI options such as QGIS. So if you're interested in that you can check out this post.
The next piece of content is "Query Load Balancing in Pgpool-II". This is from b-peng.blogspot.com. So this discusses how to do query load balancing in the load balancer Pgpool II. So if you're interested in that you can check out this post.
The last piece of content is "Timeouts in Pgpool-II connections". So this is from pgsqlpgpool.blogspot.com and describes how you can set up these timeouts.