Full Text Search, Query Optimization, Exception Blocks, Procedural Language | Scaling Postgres 100
In this episode of Scaling Postgres, we discuss full text search, a process for query optimization, caution with exception blocks and adding a procedural language.
Content Discussed
- [YouTube] "The State of (Full) Text Search in PostgreSQL 12" by Jimmy Angelakos
- The State of (Full) Text Search in PostgreSQL 12 (slides)
- Introduction to Profiling and Optimizing SQL Queries for Software Engineers
- The strange case of the EXCEPTION block
- PL/PgSQL Exception and XIDs
- Creating a PostgreSQL procedural language – Part 1 – Setup
- An Overview of Job Scheduling Tools for PostgreSQL
- Can PgBouncer session survive everything we throw at it?
- How To Migrate From Trigger-Based Partitioning To Native in PostgreSQL
- Guard Against Transaction Loss with PostgreSQL Synchronous Replication
- [YouTube] Oracle to Postgres Schema Migration Hustle
- How to use the KNN Machine Learning Model with 2UDA – PostgreSQL and Orange (Part 1)
- How to run a clustering algorithm within PostgreSQL
- New in pg12: New leader_pid column in pg_stat_activity
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Welcome to episode 100. So after approximately two years of doing this, we are at the 100th episode. Our first piece of content is a YouTube video, "The State of (Full) Text Search in PostgreSQL 12" by Jimmy Angelakos. This is actually on the Peter Cooper YouTube channel, and it does exactly what it says: it explains different ways of searching text, especially the full-text search capabilities of Postgres, how you can use them, and the different features available. You may also find it beneficial that the slides are available, which I will provide as a separate link; they cover the different contents of the presentation in terms of operators, functions, dictionaries, examples, indexing, and the types of indexes to use.
Interestingly, he saw something like a 4,000-fold improvement by adding a GIN index in one of his examples. He also talks about non-natural text searching, collations, and other text types, as well as maintenance: vacuum becomes very important, particularly with certain index types. Now, one thing I liked about the presentation was that with the advent of Postgres 12 you can use a generated column to build a tsvector of the data you want to search on, so it's automatically maintained, and then you can just index that column for full-text search purposes. If you're interested in full-text search, I highly encourage you to check out this presentation because it has a lot of great information.
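To make that concrete, here is a minimal sketch of the generated-column approach he describes, assuming Postgres 12 or later (table and column names are illustrative):

```sql
-- A tsvector column that Postgres maintains automatically on every
-- INSERT and UPDATE, built from the title and body columns.
CREATE TABLE articles (
    id     bigserial PRIMARY KEY,
    title  text,
    body   text,
    search tsvector GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
    ) STORED
);

-- A GIN index on the generated column keeps full-text queries fast.
CREATE INDEX articles_search_idx ON articles USING GIN (search);

-- Query with the match operator and a tsquery.
SELECT id, title
FROM articles
WHERE search @@ to_tsquery('english', 'postgres & scaling');
```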
The next post is "Introduction to Profiling and Optimizing SQL Queries for Software Engineers". This is from the Scope blog on medium.com. Now this is not meant to be Python or Postgres specific, but those are the examples he's using. But he also speaks more generally about how these techniques can be used and it's mostly from a developer's perspective. So you want to speed up a slow query. How would you do that? And the first thing he discusses is how to define them. So you can look at the database's slow query log, that's one method to do it. You could also use pg_stat_statements in Postgres to look for some of the different queries there. And he talks about other places you can look for it in certain middleware or application logs. And he has some Python or Django-specific tools you can use, or even application performance management platforms to be able to track and identify slow queries. He shows how you can do it. So once you find it, how do you profile it? He talks about EXPLAIN and EXPLAIN ANALYZE and the difference between them.
Once you've found a slow query, how do you profile it? He talks about EXPLAIN and EXPLAIN ANALYZE and the difference between them: EXPLAIN ANALYZE actually runs the query. He has a pretty comprehensive example query and shows what its EXPLAIN ANALYZE output looks like. Now, someone could be quite intimidated by that output, but the query is pretty intimidating in and of itself. He shows a visual representation and talks through how to read the EXPLAIN ANALYZE output to determine what's going on and which component is slow, using the website explain.depesz.com to get a graphical representation of the plan. Then he discusses a particular setting to adjust, or an index to add, that would improve performance, and even a way to test the change inside a transaction to see what the differences are. So this is on the more basic level, depending on your level of database knowledge, but it's definitely good information for developers who want to learn how to optimize slow queries in their database.
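A minimal sketch of that profile-then-test workflow, with hypothetical table and column names:

```sql
-- Profile the query; EXPLAIN ANALYZE actually executes it.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;

-- Try a candidate index inside a transaction, re-profile, then roll
-- back so the index never persists.
BEGIN;
CREATE INDEX orders_customer_id_idx ON orders (customer_id);
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;
ROLLBACK;
```

Note that CREATE INDEX takes a lock on the table while it builds, so this rollback trick is better suited to a staging copy than a busy production system.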
The next post is "The strange case of the EXCEPTION block". This is from pgdba.org and he's talking about Postgres functions and EXCEPTION blocks where you begin something and then if something has an EXCEPTION, you can define what EXCEPTION will trigger it and then you're going to execute some other code instead when this code here errors out. So he's talking about EXCEPTION blocks within it. He had a very unusual thing happen where the XIDs were being incremented. So he actually replicated this like setting up a unique constraint, and then said when you hit it, basically you don't do anything. But what he noticed is that when this is running, I believe the loop is 1,000, it actually increments because he's looking at the frozen XID age here.
It's actually incrementing the XIDs even though absolutely nothing has happened, no data has been added, basically just burning XIDs. And he says, quote "Whether the exception is handled or not, the DML consumes an XID every time it's executed". He also makes note of here "...this kind of behavior that I'm struggling to find documentation for is what caused the one and only emergency shutdown to prevent XID wraparound I had to deal with in my career". So this is definitely something to be aware of, that this can happen with exception blocks within functions in PostgreSQL. Actually, what seems to be happening is that these are subtransactions and transactions are essentially running with these exception blocks when that code is executed.
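A minimal sketch of the behavior he describes (table name is illustrative): each failed INSERT, even though the exception is swallowed, still consumes a transaction ID, which you can observe through the table's growing frozen XID age.

```sql
CREATE TABLE t (id int PRIMARY KEY);
INSERT INTO t VALUES (1);

-- Run the check below before and after this block: the age advances
-- by roughly the loop count even though no data has changed.
DO $$
BEGIN
    FOR i IN 1..1000 LOOP
        BEGIN
            INSERT INTO t VALUES (1);  -- always violates the primary key
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- swallow the error; the subtransaction XID is burned anyway
        END;
    END LOOP;
END;
$$;

SELECT age(relfrozenxid) FROM pg_class WHERE relname = 't';
```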
This is explored further in a second post, "PL/PgSQL Exception and XIDs". This is from fluca1978.github.io. He mentions, quote, "I think PL/pgSQL is using subtransactions or savepoints to handle EXCEPTION", so whenever those EXCEPTION blocks fire, they're going to consume a transaction ID. He does a different implementation, setting up a function that outputs and tracks what the XID does using txid_current_if_assigned and txid_current. His assessment is that EXCEPTIONs are quite clearly implemented in PL/pgSQL, and possibly in other languages, by means of subtransactions. So if you're using Postgres functions with EXCEPTION blocks, this is something to be aware of, because a function that runs very frequently and errors out could burn through XIDs quickly.
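For reference, the difference between the two functions he uses can be seen in a plain transaction (reusing the hypothetical table t from the sketch above):

```sql
BEGIN;
SELECT txid_current_if_assigned();  -- NULL: read-only statements don't assign an XID
INSERT INTO t VALUES (2);           -- the first write forces an XID assignment
SELECT txid_current_if_assigned();  -- now returns the assigned XID
SELECT txid_current();              -- would itself force an assignment if none existed
ROLLBACK;
```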
The next post is "Creating a PostgreSQL procedural language - Part 1 - Setup". This is from 2ndquadrant.com. So this is a setup to add a new procedural language to use with Postgres. So by default, as part of the standard distribution, there's PL/pgSQL, PL/Tcl, PL/Perl, and PL/python. Well, they actually wanted to add a PL/Julia programming language, so the first step they want to do is actually create an extension for it. So they wanted to create a control file that defines the basic properties of the extension, a SQL file that creates the extension's objects, a C file for the extension itself, and a makefile to build the extension. He has the exact code used here for setting up PL/Julia as a new procedural language, the SQL for the functions, the C code, as well as the make file. So these are basically the setup, the bones for getting this set up. We'll have to consult part two in order to see how this moves forward.
The next post is "An Overview of Job Scheduling Tools for PostgreSQL". This is from severalnines.com. Now, last week I believe we talked about a pg_timetable as one means of scheduling, and they're discussing three others listed here just using a general Crontab in Linux, which is what I tend to do. There's also an agent called a pgAgent, which I believe requires pgAdmin. So that's something to take into account. The last thing they mentioned is the extension pg_ron, but this looks like it can only operate on things within the database itself. So store procedures, SQL statements, and PostgreSQL commands. So I don't believe it can run jobs outside of Postgres, but these are three options for managing jobs for your PostgreSQL installation.
The next post is "Can PgBouncer session survive everything we throw at it?". So this is the fourth in a series of blog posts from enterprisedb.com covering PgBouncer and its connection and pooling capabilities. Predominantly, using the EDB Failover Manager, which is kind of like a virtual IP manager and has some other features. This goes through in-depth on how to set this up with the Failover Manager for PgBouncer and doing different tests of failure to see what survives and what doesn't. And at the bottom here they have the general conclusions of what is possible. So if you want to learn more about handling the failover of PgBouncers, definitely a blog post to check out.
The next post is "How To Migrate From Trigger-Based Partitioning To Native in PostgreSQL". This is from crunchydata.com. So this is going from the trigger-based partitioning to basically the declarative partitioning that was set up in Postgres 10, improved in 11, and then of course 12. Now, it looks like a lot of these instructions are with regard to pg_partman, so it doesn't look like it's using just the trigger base, but a lot of the instructions are if you've used pg_partman and moved to it. It looks like the basic process is creating new parent tables and then detaching the child partitions from the current primary, attaching it to the new declarative partition scheme, and then doing some renames. But they go through the whole process here in this pretty long blog post. So if you are interested in doing that, definitely a blog post to check out.
Also from crunchydata.com is "Guard Against Transaction Loss with PostgreSQL Synchronous Replication". Now, this does discuss synchronous replication, so there is some knowledge to be gained about what it is and how it works. Generally, when you set up replication, it's asynchronous by default, meaning transactions are committed and saved on the primary database and then streamed to the replicas. With synchronous replication, the write doesn't get fully acknowledged to the client until it's been written to two systems, or possibly more. And they have a great graph here that displays this.
So the client sends some data to the primary, but the primary, before acknowledging, sends the data to the synchronous replica. The synchronous replica acknowledges the primary, and only then does the primary acknowledge success to the client, so the data is always in sync: the client doesn't get an acknowledgment until the write exists in both places. Now, there's a performance hit for doing this, of course, and with stock synchronous replication in PostgreSQL, if your replica goes down, essentially no writes happen on the primary, because it can't complete the synchronous write. So that's something to be aware of.
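For reference, stock synchronous replication is driven by a couple of settings on the primary; a minimal sketch, assuming a standby whose application_name is replica1 (a hypothetical name):

```sql
-- On the primary; 'replica1' must match the standby's application_name
-- as set in its primary_conninfo connection string.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica1)';
SELECT pg_reload_conf();

-- Verify: sync_state should read 'sync' for the standby.
SELECT application_name, sync_state FROM pg_stat_replication;
```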
Now, what this blog post actually discusses is their PostgreSQL Operator 4.2, so the whole post is written from that perspective; that's something to keep in mind if you want to look at it. It's not about setting up synchronous replication on a standard installation, it's about doing it through the Postgres Operator, which sets up Postgres in Kubernetes. So you can see here they're using the Postgres Operator commands, pgo, et cetera, to set things up. If you want to learn a little more about synchronous replication, but especially if you want to use their PostgreSQL Operator for Kubernetes, then it's definitely a blog post to check out.
The next piece of content is a YouTube video, "Oracle to Postgres Schema Migration Hustle". This is from the EnterpriseDB YouTube channel, and it covers considerations for migrating from Oracle to Postgres with regard to schemas and all the different data and objects that need to be carried over. So if you have a need for that, it's definitely a webinar you'll probably want to check out.
The next post is "How to use the KNN Machine Learning Model with 2UDA - PostgreSQL and Orange (Part 1)". This is from 2ndquadrant.com. Well, this post is about machine learning capabilities with something I've never heard of before, 2UDA. But if you're interested in machine learning using PostgreSQL and Orange, this is a blog post to check out.
The next post is "HOW TO RUN A CLUSTERING ALGORITHM WITHIN POSTGRESQL". This is from cybertec-postgresql.com. This clustering algorithm is essentially, they say here the well-known K-means. So if you have a need to use that in PostgreSQL, you can check out this blog post.
The final blog post is "New in pg12: New leader_pid column in pg_stat_activity". This is from rjuju.github.io. So there's a new leader_pid column in pg_stat_activity that, quote, "...tracks the PID of the group leader used with parallel queries". If you have an interest in that, it's definitely a blog post to check out.
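In versions that have this column, a quick query shows parallel workers grouped under their leader (leader_pid is NULL for non-parallel backends):

```sql
SELECT pid, leader_pid, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE leader_pid IS NOT NULL;
```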