
Postgres Migration, Bulk Data Loading, Updates vs. Upgrades, UUID Benchmarks | Scaling Postgres 144


In this episode of Scaling Postgres, we discuss a Mongo to Postgres migration, the best way to bulk load data, running an update vs. an upgrade and benchmarks for UUIDs.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Seamless MongoDB to PostgreSQL migration". This is from The Coinbase Blog at coinbase.com, and basically they migrated a lot of the major transactions they track within Coinbase from MongoDB to Postgres. It looks like this happened in 2019, and they go through the process they used to do it. Now, they wanted to do it very accurately because it involved transactions, but they also wanted to do a lot of iteration, so they made it a very repeatable process for what they were testing out. This is the general process they used. They had their legacy system and the clients against it. What they chose to do is continue to read and write from this legacy database, but also do parallel writes to the new Postgres system.

Then, during the period when they were doing those parallel writes, essentially between phase two and phase three, they backfilled the data. In phase two, they were only reading from the legacy database, but they were writing to both. In phase three, they were still writing to both, but reading from the new database system. Then finally, phase four is the cutover. Now, they described their repeatable backfill process: basically, they exported from MongoDB, sent files to S3, and then imported them into the Postgres RDS system from there. Because the backfill took a while, they actually loaded it into temporary tables and recreated all of the indexes afterward.

Then during the cutover process, they used a transaction to essentially rename the tables. Now, they also mentioned they used triggers that tracked the INSERTs, UPDATEs, and DELETEs to keep the data in sync between the live and the backfill tables. So that's something else that they did. They had a list of the steps they went through to do the backfill and transfer process, and they said the automated data transfer took about four to six hours. But with the process they worked out here, this was done without any downtime of their system at all. So if you're interested in how they did this, definitely check out this blog post because I found it very insightful.
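As a rough illustration of the pattern they describe (a hypothetical sketch, not Coinbase's actual code; the table, function, and trigger names are made up), a trigger can mirror changes from the live table into the backfill table, and the cutover rename can happen in a single transaction:

    -- Mirror INSERTs, UPDATEs, and DELETEs from the live table into the
    -- backfill table while the backfill and dual writes are running.
    CREATE OR REPLACE FUNCTION mirror_to_backfill() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            INSERT INTO transactions_backfill VALUES (NEW.*);
        ELSIF TG_OP = 'UPDATE' THEN
            DELETE FROM transactions_backfill WHERE id = OLD.id;
            INSERT INTO transactions_backfill VALUES (NEW.*);
        ELSIF TG_OP = 'DELETE' THEN
            DELETE FROM transactions_backfill WHERE id = OLD.id;
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER transactions_mirror
    AFTER INSERT OR UPDATE OR DELETE ON transactions
    FOR EACH ROW EXECUTE FUNCTION mirror_to_backfill();

    -- At cutover, the swap is one transaction, so clients never see a
    -- missing table. The mirror trigger rides along with the renamed old
    -- table and can be dropped afterward.
    BEGIN;
    ALTER TABLE transactions          RENAME TO transactions_old;
    ALTER TABLE transactions_backfill RENAME TO transactions;
    COMMIT;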

The next piece of content is "Bulk loading into PostgreSQL: Options and comparison". This is from highgo.ca, and they compared a couple of different processes of bulk loading data. They described the system they're using here, and they used a CSV file that had 5 million rows, 14 columns, and was about 624 megabytes in size. Then they show you essentially the schema here. So the first thing they used, of course, was the COPY command. They also tried using the client copy command that's a part of a psql. They tried using a file Foreign Data Wrapper to do it as well as a pg_bulkload tool. Now, I haven't used this tool, but it says it skips the shared buffers and WAL logging process to be able to upload data fast and here are essentially the results. 

The methods compared are COPY, the client copy, file_fdw, and pg_bulkload, and the top row of the table is a standard table without an index. Now, as you can see here, pg_bulkload wins out in every instance. So bypassing shared buffers and the WAL is a very efficient way of loading data if you need the most speed. But I would still probably resort to just using the general COPY command myself, particularly if you care about your data and you want to be able to handle any sort of failure. Now, they also tried this with an unlogged table, and interestingly it shaved a little bit off of the times for the COPY commands and the Foreign Data Wrapper, but it was actually a little bit longer for pg_bulkload, so it didn't make much of a difference.
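For reference, here is a minimal sketch of the first three methods, assuming a hypothetical two-column table standing in for the post's 14-column schema and a CSV file at /tmp/data.csv:

    CREATE TABLE IF NOT EXISTS t (id int, country text);

    -- Server-side COPY: the file is read by the Postgres server process.
    COPY t FROM '/tmp/data.csv' WITH (FORMAT csv, HEADER true);

    -- Client-side copy (run inside psql): the client reads the file and
    -- streams it to the server.
    -- \copy t FROM '/tmp/data.csv' WITH (FORMAT csv, HEADER true)

    -- file_fdw: expose the CSV as a foreign table, then insert from it.
    CREATE EXTENSION IF NOT EXISTS file_fdw;
    CREATE SERVER IF NOT EXISTS csv_server FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE IF NOT EXISTS t_csv (id int, country text)
        SERVER csv_server
        OPTIONS (filename '/tmp/data.csv', format 'csv', header 'true');
    INSERT INTO t SELECT * FROM t_csv;

    -- An unlogged table skips WAL for its own writes:
    -- CREATE UNLOGGED TABLE t_unlogged (LIKE t);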

Then he compared what it's like to load data with even just one simple index, I believe on the country column, and you can see it's suddenly over five times slower. Even with an unlogged table and one index, there still wasn't an appreciable difference. But the well-known process of removing the indexes from a table, loading the data, and then reapplying the indexes came in at around 25 seconds, which was about double the time of loading without an index, but less than half the time of leaving the index in place. So this well-accepted standard of dropping indexes, loading the data, and then reapplying the indexes (sketched just below) is still the fastest way, even from this analysis. But it looks like if you want the most performant way to load data, it is pg_bulkload, which bypasses shared buffers and WAL logging. So, definitely an interesting blog post, and if you want to find out more about the numbers and the process used, definitely check out this post.
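In SQL terms, that drop-and-rebuild pattern looks roughly like this, continuing the hypothetical table from the sketch above:

    -- Drop the index, bulk load, then rebuild the index once at the end.
    DROP INDEX IF EXISTS t_country_idx;

    COPY t FROM '/tmp/data.csv' WITH (FORMAT csv, HEADER true);

    CREATE INDEX t_country_idx ON t (country);

    -- Refresh planner statistics after a large load.
    ANALYZE t;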

The next piece of content is "UPGRADING AND UPDATING POSTGRESQL". This is from cybertec-postgresql.com. So they're making a distinction between upgrades and updates. So an update they're classifying as moving to the next point release. To do that, all you have to do is upgrade your binaries. You don't need to alter the data files at all, so you don't need to do a literal upgrade process. What they're calling an upgrade is where you're moving to a major version release. So you're going from say 9.5 to 9.6 to 10, 11, 12, or 13. Then you need to go through an upgrade process. The simplest ones just do a pg_dump and then reload the data. 

But on a large database, a dump and reload takes a long time. The next technique you can use is pg_upgrade, which basically copies the data, as they say here, on a binary level. But if you use the hard link option, you're able to keep all the data files in place and the upgrade happens very fast; a terabyte or multi-terabyte database can be done on the order of seconds. Now, one thing they did mention here is, quote, "What is important to note here is that pg_upgrade is never destructive. If things go wrong, you can always delete the new data directory and start from scratch". But there is a caveat with the link option, because if you're using it, you essentially don't have a separate copy of the data files.

What the Postgres documentation says is if you use the link mode for the pg_upgrade, the upgrade will be much faster, and use less disk space, but you will not be able to access your old cluster once you start the new cluster after the upgrade. So things definitely change and you can't go back. Generally, I do use the link mode when I do pg_upgrades, but I always have a replica in place. So essentially I'm upgrading that replica or I upgrade the master and keep the replica offline as a backup to be able to fall back if I need to. Then the rest of the post gives an example of going through the process of doing an upgrade using pg_upgrade. So if you're interested in that, you can check out this post.

The next piece of content is "Benchmark v4 UUID generation in Postgres". This is from shussion.info. They describe two different functions that exist for generating UUIDs in Postgres. One is the uuid_generate_v4 which is available via the uuid-ossp extension. But as of version 13, there is a gin random UUID function that's built-in. So we actually did a test of this using version 12 of Postgres because he actually used the pgcrypto extension and he determined how fast you could generate 10,000 UUIDs across a set of different hardware and different platforms. 

What he found was much better performance from the gen_random_uuid function for generating UUIDs. So that's definitely the one you're going to want to reach for, and since it's built into Postgres 13, that makes sense. In particular, the Windows server was awfully slow with the uuid_generate_v4 function, so if you run Postgres on Windows, you definitely want to be using gen_random_uuid. If you want to learn more about this, feel free to check out this blog post.
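As a quick illustration (a minimal sketch, not the author's benchmark harness), both functions can be exercised like this; on Postgres 12 and earlier gen_random_uuid comes from the pgcrypto extension, while on 13 and later it's built in:

    -- uuid_generate_v4() from the uuid-ossp extension.
    CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
    SELECT uuid_generate_v4();

    -- gen_random_uuid(): built in from Postgres 13, from pgcrypto before that.
    CREATE EXTENSION IF NOT EXISTS pgcrypto;   -- only needed on 12 and earlier
    SELECT gen_random_uuid();

    -- A crude comparison: generate 10,000 of each with psql's \timing on.
    SELECT count(uuid_generate_v4()) FROM generate_series(1, 10000);
    SELECT count(gen_random_uuid())  FROM generate_series(1, 10000);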

The next piece of content is "Using Pl/pgSQL to Calculate New Postgres Columns". This is from blog.crunchydata.com and they're talking about analyzing information from a data science perspective and that generally, they want to calculate zscores right next to the data that they're analyzing. To do this, he actually generated a PL/pgSQL function to generate those extra columns containing those zscores. So he has a link to the full function on Git Hub right here, but then he walks through each portion of it. 

So he has a function where he specifies what schema is going to be changed, what table name, what prefix to prepend to the new column names, as well as an array of columns that he wants to do the calculations for. It also takes the primary key, to be able to update the data easily. The function does a FOREACH loop through that array of column names, generates what the new column name should be, alters the table to add that new column as a numeric, and then populates the column using a statement with two CTEs that does the z-score calculations. So if you're interested in using functions to assign z-scores and do calculations on particular tables, definitely check out this blog post.
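Here is a simplified sketch of that overall pattern. It is not the author's exact function (his version is on GitHub and uses two CTEs plus the primary key for the update), and the table and column names in the example call are hypothetical:

    -- Hypothetical sketch: add a z-score column for each listed numeric
    -- column of one table and populate it as (value - mean) / stddev.
    CREATE OR REPLACE FUNCTION add_zscore_columns(
        _schema  text,
        _table   text,
        _prefix  text,
        _columns text[],
        _pk      text          -- kept for parity with the post; unused here
    ) RETURNS void AS $$
    DECLARE
        _col     text;
        _new_col text;
    BEGIN
        FOREACH _col IN ARRAY _columns LOOP
            _new_col := _prefix || _col;

            -- Add the numeric column that will hold the z-score.
            EXECUTE format('ALTER TABLE %I.%I ADD COLUMN IF NOT EXISTS %I numeric',
                           _schema, _table, _new_col);

            -- Populate it in one statement: (value - mean) / stddev.
            EXECUTE format(
                'WITH stats AS (
                     SELECT avg(%3$I) AS mean, stddev_pop(%3$I) AS sd
                     FROM %1$I.%2$I
                 )
                 UPDATE %1$I.%2$I AS t
                 SET %4$I = (t.%3$I - s.mean) / NULLIF(s.sd, 0)
                 FROM stats s',
                _schema, _table, _col, _new_col);
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    -- Hypothetical usage:
    -- SELECT add_zscore_columns('public', 'measurements', 'z_', ARRAY['height', 'weight'], 'id');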

The next piece of content is "What's the special for logical level of PostgreSQL WAL". This is from highgo.ca. So this post is asking the question of when you have a WAL level of replica and then move it to WAL level of logical. It says it records more information. Well, what is that more information? That's what this blog post covers. Now predominantly it talks about differences with what it records, with whether the WAL record is a full-page image or not. Full-page images have to happen after a checkpoint to make sure it has all the necessary information to do a recovery. So if you're interested in determining the differences, you can check out this post.

The next piece of content is "Waiting for PostGIS 3.1: GEOS 3.9". This is from blog.crunchydata.com. It's talking about PostGIS and how it has additional libraries it works with. The one that they're mentioning here is GEOS which have some enhancements that have been added specifically to the overlay engine function that allows for greater capabilities and more performance. They're mentioned two times faster than the original. So if you use PostGIS and specifically GEOS, you may be interested in this content.

The next piece of content is "HANDS ON WITH OSM2PGSQL'S NEW FLWX OUTPUT". This is from rustprooflabs.com. So this is talking about the OpenStreetMap data and being able to get it into Postgres. Apparently, it has a new output that can serve as an input for loading into Postgres and it has a lot more flexibility and you can use JSONB fields. So if you work with the OpenStreetMap data, perhaps you want to check out this post.

The last piece of content, the PostgreSQL person of the week is Carole Arnaud. So if you're interested in Carole and her contributions to Postgres, definitely check out this blog post.
