Weekly News, Full-Text Search Performance, pg_cron, Bulk Data Loading | Scaling Postgres 139

In this episode of Scaling Postgres, we discuss Postgres weekly news, full-text search performance, enhancements to pg_cron and the best way to bulk load data.

Transcript

All right, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "New PostgreSQL Core Team Members". This is the notification that's been placed up at postgresql.org, and Andres Freund and Jonathan Katz have joined the core team. Presumably this is to balance out the number of core team members from EDB, or EnterpriseDB: two new members were added so that no more than 50% of the membership comes from EDB. So congratulations to Andres and Jonathan.

The next piece of content is also from postgresql.org: they have started a PostgreSQL Weekly News. It looks like some general news, a few bits of product news, and then a lot of patch information about all the different changes that have happened to Postgres. So if you want a lot of detail on what's going on with the Postgres code itself, this could be a good resource for you to keep track of. It says weekly news, so presumably this will be coming out every week.

The next piece of content is "Full-text search since PostgreSQL 8.3". This is from 2ndquadrant.com, and this is a follow-up to two previous posts talking about performance since Postgres 8.3 that looked at OLTP workloads and analytical or BI queries. This one focuses on full-text search. Now, they talk a little bit about the history of it and the fact that there's not really a great benchmark for it, but he had some search queries against posts to the PostgreSQL website or searches against the PostgreSQL website. So we thought that would be a real-world benchmark. Now, what he tested was gin indexes and GIST indexes. Gin indexes are generally accepted to be faster for searches, whereas GIST are lossy and require recheck during searches. 

Therefore they're a little bit slower. He mentions the hardware used and then goes into the results for each version of Postgres. For the data loading, there's not an appreciable difference between versions; early on it took a little bit longer, but nothing appreciable. The GIN index sizes became more favorable starting around 9.4, and the queries got significantly faster around 9.4 as well, generally trending up slightly over the versions since. That wasn't the case for GiST indexes: the size stayed pretty consistent, but the performance has slowly fallen since about 8.3, and that's pretty universal across all the different performance tests that he did. So this was an interesting post of benchmarking data you may want to check out, but the overall conclusion is that you should generally reach for GIN when you're doing full-text search, in terms of performance.
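
If you haven't used full-text search before, a minimal sketch of the GIN approach being benchmarked looks something like this (the table and column names are hypothetical, not taken from the post):

```sql
-- Hypothetical table of documents to search.
CREATE TABLE messages (
    id      bigserial PRIMARY KEY,
    subject text,
    body    text
);

-- GIN index over the tsvector form of the text; this is what makes
-- full-text searches fast compared to scanning every row.
CREATE INDEX messages_fts_idx ON messages
    USING GIN (to_tsvector('english',
                           coalesce(subject, '') || ' ' || coalesce(body, '')));

-- A query that can use the index: find rows containing both terms.
SELECT id, subject
FROM messages
WHERE to_tsvector('english',
                  coalesce(subject, '') || ' ' || coalesce(body, ''))
      @@ to_tsquery('english', 'logical & replication');
```

Note that the expression in the WHERE clause has to match the indexed expression exactly for the index to be used.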

The next piece of content is "Evolving pg_cron together: Postgrees 13, audit log, background workers, & job names". This is from citusdata.com and in 2016, they released an extension called pg_cron. So a way for you to set up scheduled tasks to take place within the database. So you can see CALL this procedure, vacuum this table, or alter this extension based upon a cron tab-like entry of when these activities should occur. Now this was an open source project of course, and actually, they've been working with others. So Citus Data is a part of Microsoft and they were working with Amazon's RDS team to add some additions to this. 

Now, generally, I just use cron at the OS level; I don't use a hosted solution such as RDS or Azure Database for PostgreSQL. However, if you are using one of those, then having an extension such as pg_cron is advantageous. So they've made a number of additions to make this easier. First of all, it gives you an audit log of your jobs inside database tables themselves, so that you can see what jobs have run, whether they succeeded, and what their current status is.

They've also added an option to use background workers, so you don't have to configure connection information or make changes to your pg_hba.conf file to be able to use this extension. They also added the ability to give names to particular jobs, as well as Postgres 13 support. So it's a very interesting tool for running scheduled database jobs when your database is hosted at a cloud provider such as Microsoft or Amazon. So if you're interested in this extension, definitely check out this blog post.
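
As a rough sketch of what using these features looks like (the job and table names are hypothetical, and the exact functions and audit-log columns available depend on the pg_cron version you have installed):

```sql
-- Schedule a nightly vacuum at 03:00, giving the job a name
-- (named jobs are one of the additions described in the post).
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM my_table');

-- Inspect the audit log of job runs: what ran, whether it succeeded, and when.
SELECT jobid, command, status, return_message, start_time, end_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 10;

-- Remove the job by name when it's no longer needed.
SELECT cron.unschedule('nightly-vacuum');
```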

The next piece of content is "Webinar: Best Practices for Bulk Data Loading in PostgreSQL [Follow Up]". This is from 2ndquadrant.com. You can click here to gain access to the webinar and it talks about the fastest way to bulk upload data into Postgres. Basically, the way to do this, make sure you don't have triggers, make sure you don't have foreign keys or indexes on the table, and add them after you've loaded the data. That COPY is a much better way to upload data rather than doing insert statements because the general rule is doing a single iINSERT statement with a COMMIT is one of the slowest. But if you insert multiple rows and then do a COMMIT, that'll be faster, and then COPY, of course, is the fastest. There are possibilities for using parallelism. They even talked about a technique here where you could use an unlogged table to actually load the data and even do an ANALYZE on it and then do an INSERT SELECT statement to actually place it in its final destination. So if you're interested in speeding up your bulk data loads, definitely check out this webinar.

The next piece of content is "HAVING A 2ND LOOK AT POSTGRES V13 IMPROVEMENTS". This is from cybertec-postgresql.com, so it's further Postgres 13 improvements. Now we've covered some of these, but they cover generating a backup manifest file for base backups and then the pg_verify_backup tool to be able to verify that those files exist. You have the WAL files to do the restore, allowing pg_rewind to use the target cluster's restore command to retrieve needed WAL. So this way you can point at a WAL destination so it can find any WAL it needs to do the restore process. Add an option to pg_rewind to configure standbys. 

Basically, this is being able to write the recovery configuration and create the signal file necessary to set up a standby. Next, allow reindexdb to operate in parallel, so parallel reindexing across tables. Then, use the directory of the pg_upgrade program as the default --new-bindir setting when running pg_upgrade; what they say here is that for small terminal windows this can make your pg_upgrade command a bit shorter. And allow dropdb to disconnect sessions using the target database, allowing the drop to succeed; so this is a command to be aware of, since it lets you drop a database even if it has active connections.

There are some changes to pgbench to make it easier to work with. And then there's max_slot_wal_keep_size, so if a replica with a replication slot falls too far behind, its slot can be invalidated rather than letting WAL pile up, and it basically drops out of being a replica. They also covered improved retrieval of the leading bytes of TOASTed values for better performance, and incremental sorting, which we've discussed previously in terms of speeding up multicolumn sorts. There's also control over how much memory is used by logical decoding before it is spilled to disk. Lastly, they mentioned adding the backend type to the CSV log output, so that's something to be aware of; for example, they mentioned the tool pgBadger was caught off guard by this. So if you're using the CSV log output, be aware of that. If you're interested in reviewing some of these changes, check out this blog post.
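
As a small illustration of two of those changes in SQL form (the size and the database name here are just example values, not from the post):

```sql
-- Postgres 13: cap how much WAL a replication slot may retain; if a
-- replica falls further behind than this, its slot is invalidated
-- instead of WAL filling up the disk. (The size is just an example.)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();

-- Postgres 13: drop a database even if sessions are still connected,
-- the SQL counterpart of dropdb --force. (Database name is hypothetical.)
DROP DATABASE reporting_old WITH (FORCE);
```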

The next piece of content is actually a YouTube channel, the Percona YouTube channel. They've held Percona Live Online and have posted a few presentations related to PostgreSQL: one is a look at the elephant's trunk (PostgreSQL 13), and another is on the democratization of databases. So there are a few presentations up, and there may be some more Postgres-related ones in the following days.

The next piece of content is "Provisioning a PostgreSQL cluster with TerraForm and Ansible". This is from enterprisedb.com. So this talks about using TerraForm and Ansible for setting up a Postgres infrastructure and it has links to the saved deployment scripts for TerraForm and the playbooks for Ansible. So if you're interested in using these tools, you can check out this post.

Related to it is a YouTube video that discusses the same thing, called "Automating a PostgreSQL High Availability Architecture with Ansible". This is also from the EDB channel, and it talks about using both Terraform and Ansible to deploy a high-availability PostgreSQL cluster. They said that when they surveyed the DevOps tools their clients were using, over 50% of the responses were Terraform and Ansible, so that's probably why they're using these tools. If you're interested in doing this, definitely check out these two pieces of content.

The next piece of content is "Random numbers". This is from 2ndquadrant.com. They're talking about a way to generate random data to create data in a database for testing purposes like say, billions of rows of data. They want it to be reproducible so random, but reproducible by using some sort of seed. So he's using a pseudo-random number generator to be able to do that and tested a number of different generators and showed some of their performance here. So if you're interested in doing the same thing, maybe you want to check out this post.

The next piece of content is "Election Night Prediction Modeling using PL/R in Postgres''. This is from blog.crunchydata.com. So there was a Tweet that talked about election night forecasting using R. So basically, this blog post took that R code and put it into Postgres to be able to do a direct analysis because PL/R enables you to run R on Postgres as part of a function. So they set that up here. And there's a lot of code with this, but it helps you do this election night prediction. So if you're interested in that, check out this post.

Next piece of content is "Monitoring PostgreSQL Clusters via Pgpool-II with Prometheus". This is from b-peng.blogspot.com. So they're talking about a Pgpool II exporter which sends monitoring data to Prometheus so you can analyze it along with your Postgres instance. So if you're interested in setting this up for Pgpool II, check out this blog post.

Next piece of content is "How to Analyze a PostgreSQL Crash Dump File". This is from highgo.ca. So they show how you can use GDB to be able to debug a crash file from Postgres. If you're interested in that, definitely check out this blog post.

Next piece of content is "FIND YOUR LOCAL SRID IN POSTGIS''. This is from a rustprooflab.com, so if you're interested in doing these types of geographic analyses, definitely check out this blog post.

The next piece of content is "Announcing Swarm64 DA 50". This is from swar64.com. This is an extension for Postgres and they've actually done some major changes recently where now for doing their data warehousing analysis workloads, they're actually changing to using a compressed column store index to be able to give you high-performance analysis of analytical queries. It also includes Postgres 12 support. So if you're interested in that, maybe you want to check out this post.

The last piece of content, the PostgreSQL person of the week is Elein Mustain. So if you're interested in learning more about Elein and her contributions to PostgreSQL, definitely check out this blog post.
