
Connection Queuing, Connection Pool Size, Simple Auditing, Fuzzy Matching | Scaling Postgres 154


In this episode of Scaling Postgres, we discuss connection queuing, estimating connection pool size, how to do simple auditing and how to run queries using fuzzy matching.

Content Discussed

YouTube Video

Podcast Audio

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Connection Queuing in pgBouncer: Is it a Magical Remedy?". This is from percona.com and it's talking about PgBouncer, specifically the connection queuing it does as a connection pooler. This means that if you have a query that needs to run, but all of the server processes or server sessions are being used, it puts those connections in a queue to be run when a server connection is available. Now, of course, most people use PgBouncer so that they can get more connections without using a lot of memory and resources on the server. But this post talks about how the queuing component is equally beneficial. So if you are not using some sort of connection pooler, you could experience a number of problems.

Problem number one, a spike in load can jam and halt the server. The reason is that there's a one-to-one mapping between active PostgreSQL sessions and the running processes on the box, so every new connection to Postgres spawns a new process. In addition, with parallel execution, you can actually get many more processes running than sessions exist. So you can suddenly get a ton of different processes running on the box as you're adding more connections. They simulated an example on a two virtual CPU server: with just two active sessions, connecting and running a command was quick, but with ten active sessions it took up to 15 seconds. So that server is struggling to operate efficiently.

They say, quote "...we could see that as the number of active sessions approaches double the number of CPU cores, the performance penalty starts increasing heavily". So that's something to keep in mind as a problem that you can experience if you have a lot more connections than CPU cores, for example. The second problem is "too many clients already" errors. This is basically when you've run out of connections and you're trying to make a new one, but it gives a fatal message: "sorry, too many clients". The third problem that you can experience is a big max_connections value and the overhead that it causes. There have been a number of posts, which we've covered in Scaling Postgres and they link here, about how just having a lot of max_connections defined still consumes resources even though the connections aren't being used.

So you don't want to set your max_connections value in your PostgreSQL configuration too high all the time. You want it optimized at a particular level. Now, a solution to help you with all of this is using a connection pooler like PgBouncer. But it can also assist with connection queuing, which is what this post is about. So for the example where you try to connect and there are too many clients because you've run out of connections, PgBouncer allows more client connections than there are server connections. For example, here he's showing PgBouncer's pools, and you can see that only one client connection is active because there is only one server connection, and the other four client connections are waiting.

So this is the queue he's talking about. They're waiting for their opportunity to use that one existing server session. That's what PgBouncer can do for you. If you actually bump up the number of server connections to 2 and run those same five clients, you can see that more are active and fewer are waiting. He then tried an example with 20 direct server connections, not using PgBouncer, and you can see that the load average went up to about 17. Similarly, when trying to connect to that box using SSH, it took 17 seconds to connect. I assume this is the two virtual CPU server. But when they implemented PgBouncer and set the pool size to four on the server side, the load average dropped to 1.73, so about ten times lower. Connecting took half a second via SSH, so it was dramatically better in terms of general performance.
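For reference, here's a minimal sketch of how you could inspect this queuing yourself on PgBouncer's admin console; the port and database name below are the common defaults, not taken from the post:

```sql
-- Connect to the PgBouncer admin console, e.g.: psql -p 6432 -U pgbouncer pgbouncer
SHOW POOLS;
-- cl_active  = client connections currently paired with a server connection
-- cl_waiting = client connections queued, waiting for a free server connection
-- sv_active / sv_idle = server connections in use / available
```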

Now the question comes: does this actually affect your throughput, because you're having these client connections wait to do work? He actually did a test of database throughput, and with the queuing it was actually 5% better. So it wasn't worse. This waiting, allowing the server to run in an optimized state with PgBouncer handing off new connections when the server is ready, actually resulted in better performance, not worse, even though you have some waiting going on. Then when he switched to transaction-level pooling, performance got even better. So it's just something to keep in mind that PgBouncer is not only for lowering your connection count to save on memory in Postgres, but can actually give you more throughput by optimizing the number of sessions allowed to connect to the database relative to how many CPU cores you have. So if you're interested in that, you can check out this post.

The next piece of content- "ESTIMATED CONNECTION POOL SIZE WITH POSTGRESQL DATABASE STATISTICS". This is from cybertec-postgresql.com, and he's talking about how there have been some new connection statistics added to version 14 of Postgres, mainly to the pg_stat_database view. It covers things like session_time, which is the total time spent by sessions, the total active_time of sessions, time spent in idle_in_transaction, the total cumulative number of sessions, the number abandoned, the number of fatalities, and the number of killed. So this is great information and more statistics to have. 

But where he actually finds this useful is for estimating how large you should make max_connections on the database server if you're doing transaction-level connection pooling. So, exactly the problem the previous post was talking about: setting max_connections at an optimal value, taking into account CPU cores, and not setting it too large. Here he actually has a formula, where his recommendation is that the number of connections should be less than the maximum of the number of cores and your parallel_io_limit (how many queries you can run at the same time given the I/O storage solution you have), divided by the session_busy_ratio (how busy your sessions are) multiplied by the average parallelism.

Some of these numbers are a little loose; I can imagine it being difficult to really nail down the parallel_io_limit or the average parallelism. But he says you can actually calculate the session_busy_ratio by using some of these new statistics with the query he gives. It will give you that ratio to factor into the formula and get an estimate of the optimum setting for your max_connections. So if you're interested in this, definitely check out this post.
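As a rough sketch of the idea behind that ratio (my paraphrase, not necessarily the exact query from the post), it can be derived from the new PostgreSQL 14 columns in pg_stat_database and then plugged into the formula:

```sql
-- Estimate how busy sessions are: time spent doing work vs. total session time.
-- Assumes PostgreSQL 14+; whether idle-in-transaction time counts as "busy"
-- is a judgment call, so adjust to taste.
SELECT datname,
       (active_time + idle_in_transaction_time) / NULLIF(session_time, 0)
           AS session_busy_ratio
FROM pg_stat_database
WHERE datname = current_database();

-- Then, per the recommendation described above:
--   max_connections < max(num_cores, parallel_io_limit)
--                     / (session_busy_ratio * avg_parallelism)
```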

The next piece of content- "Simply auditing your database changes". This is from mydbanotebook.org and this post is talking about a technique where you can do auditing of statements where you don't have to use triggers, you don't have to use a special extension, you don't have to use third-party tools. You're just basically using the built-in functionality of Postgres to do this. I found this particularly interesting. So number one is you need to log the statements that you want to track. So, for example, they recommend setting the log destination to a csvlog. She said that we'll see why later. Turning on the logging collector, as well as setting what statements you want to log. 

So you could log DDL, or mod, or all; whatever you want to log, you need to set the logging at that level. Then she ran a pgbench to get some data going into the logs and then used a Foreign Data Wrapper: create the file_fdw extension, create the server for the Foreign Data Wrapper, and then create the foreign table, pointing its definition at the log file that's being produced. So essentially, PostgreSQL logs the statements to the file, and then the Foreign Data Wrapper is used to read the very log file it is generating. Then for convenience, she adds a number of views that let you look at the DDL or the DML. So they're basically going into the logs and pulling out particular data to be able to view it more conveniently.

So, for example, when querying the DDL view, you could look at the bench database for queries having to do with pgbench_tellers, and you can see all the operations that happened: a drop table, create table, truncate table, alter table, so all the DDL statements. Similarly for the DML, you can look at a particular database name and table and see all the different statements being run against that table. So that's a pretty interesting technique, using the database to query and see what's happening in the logs of the very database doing the logging. So if you want to learn more about this, definitely check out this post.
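To make the technique concrete, here's a minimal sketch of the moving parts, assuming PostgreSQL 13's csvlog column layout; the file path, server, table, and view names are illustrative rather than copied from the post:

```sql
-- 1. Log the statements you care about to CSV (the collector needs a restart).
ALTER SYSTEM SET logging_collector = on;
ALTER SYSTEM SET log_destination = 'csvlog';
ALTER SYSTEM SET log_statement = 'ddl';   -- or 'mod' / 'all'

-- 2. Read the CSV log back in through file_fdw.
CREATE EXTENSION file_fdw;
CREATE SERVER pglog_server FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE pglog (
    log_time timestamp(3) with time zone,
    user_name text,
    database_name text,
    process_id integer,
    connection_from text,
    session_id text,
    session_line_num bigint,
    command_tag text,
    session_start_time timestamp with time zone,
    virtual_transaction_id text,
    transaction_id bigint,
    error_severity text,
    sql_state_code text,
    message text,
    detail text,
    hint text,
    internal_query text,
    internal_query_pos integer,
    context text,
    query text,
    query_pos integer,
    location text,
    application_name text,
    backend_type text          -- PostgreSQL 13 layout; 14+ adds leader_pid, query_id
) SERVER pglog_server
OPTIONS (filename 'log/postgresql.csv', format 'csv');

-- 3. A convenience view in the spirit of the post's DDL view.
CREATE VIEW ddl_audit AS
SELECT log_time, user_name, database_name, message
FROM pglog
WHERE message LIKE 'statement:%'
  AND message ~* '\m(create|alter|drop|truncate)\M';
```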

The next piece of content is "Fuzzy Name Matching in Postgres". This is from blog.crunchydata.com. So to get started, they actually created a table with 50,000 names using the fake name generator. So they give an example of some of the data that's produced here and basically, they want to do fuzzy matching against these names. The first example they came up with to do a prefix search is just using the LIKE operator and they got results in about 11 milliseconds. They tried adding an index, but of course, it still came back in about 11 milliseconds because LIKE by default is not going to use that index. But if you index it using text_pattern_ops, then it will use the index and I believe only for prefix matching. But there's also a case issue. 

So you can do case-insensitive searches, but if you do that and want an index, you're going to have to use a functional index. So you actually define an index with that lower function on it to be able to efficiently pull out those records. Now, getting on to the fuzzy matching, they actually used the fuzzystrmatch extension, added that in, and it provides Levenshtein distance between strings. So they did a query using this levenshtein function to pull out records where one character was missing from the search. But the problem was it took over 100 milliseconds. So pretty darn slow. A solution to that is using the Soundex function. This algorithm reduces a word to a phonetic code.

So for example, even though each of these Harrington names is spelled differently, they all return the same phonetic code. Out of 50,000 records, doing a Soundex check against Harrington returns only 46 records. So it greatly reduces the number of records to examine. To take advantage of this, you create a functional index using the soundex function against the name. Then when you use the same query, but with this new index in place and of course adding the soundex condition, it actually returns data in just one millisecond. So very efficient fuzzy matching. If you need to support fuzzy match searching, you may want to check out this post.
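Here's a condensed sketch of that progression, assuming a hypothetical people table with a last_name column; the names, spellings, and threshold are illustrative, not copied from the post:

```sql
CREATE EXTENSION fuzzystrmatch;

-- Case-insensitive prefix search needs an expression index:
CREATE INDEX people_lower_name_idx ON people (lower(last_name) text_pattern_ops);
SELECT * FROM people WHERE lower(last_name) LIKE 'harring%';

-- levenshtein() alone scans the whole table, so pre-filter with soundex(),
-- which an expression index can serve:
CREATE INDEX people_soundex_idx ON people (soundex(last_name));

SELECT *
FROM people
WHERE soundex(last_name) = soundex('Harington')                -- index narrows to a few rows
  AND levenshtein(lower(last_name), lower('Harington')) <= 1;  -- then the fine-grained match
```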

The next piece of content- "POSTGRESQL: HOW TO WRITE A TRIGGER". This is from cybertec-postgresql.com which tells you how to set up a trigger. So they created a table for tracking temperatures, which has a few different columns here. Then the next thing you need to do is to define a function and then create a trigger that's going to use that function. Now, before creating the function, they mentioned that triggers can be either BEFORE or AFTER. So you can have as many BEFORE triggers as you want before the actual INSERT or UPDATE, and then as many triggers as you want AFTER. Actually, it does it in a defined order by the name of the trigger. So you want to name it appropriately so your triggers happen in a particular order. So as mentioned before, the first step is to create a function. 

So he creates a function called f_temp, presumably for "function temperature". Basically, if the new value is less than zero, then set that new value to negative one. So that's the basic function: it's going to change the value of the row being inserted if it's less than zero. Then you create a trigger and specify BEFORE plus whatever operation you want (BEFORE UPDATE, BEFORE DELETE, BEFORE INSERT); here it's done BEFORE INSERT, and for each row it executes the procedure that was defined up here, the f_temp procedure. So with this trigger in place, he did some tests and the results came out as expected. When trying to insert a -196, it returned a -1; otherwise, it returned what was inserted.
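Here's a minimal sketch of that pattern; the table and column names are stand-ins rather than the post's exact ones:

```sql
CREATE TABLE t_temperature (
    id      serial PRIMARY KEY,
    tstamp  timestamptz DEFAULT now(),
    value   numeric
);

CREATE FUNCTION f_temp() RETURNS trigger AS $$
BEGIN
    -- Clamp out-of-range readings before the row is stored.
    IF NEW.value < 0 THEN
        NEW.value := -1;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trig_temp
    BEFORE INSERT ON t_temperature
    FOR EACH ROW EXECUTE PROCEDURE f_temp();

-- INSERT INTO t_temperature (value) VALUES (-196);  -- stored as -1
```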

Now, they use the keyword NEW here, and that refers to the new row being inserted. But there also exists an OLD, and which keyword is available depends on what operation is happening. NEW is present in INSERTs and UPDATEs, OLD is present in UPDATEs but also DELETEs, and neither is present in TRUNCATEs. In addition to these keywords, there are some other variables you can use, such as TG_RELID, TG_TABLE_SCHEMA, TG_TABLE_NAME, and TG_RELNAME, as well as TG_OP, which says whether the trigger is firing because of an INSERT, DELETE, UPDATE, or TRUNCATE, plus other options that you can use in your triggers. So if you want to learn more about using triggers with Postgres, definitely check out this post.
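As a small, hypothetical illustration of those trigger variables (not taken from the post):

```sql
CREATE FUNCTION f_log_change() RETURNS trigger AS $$
BEGIN
    RAISE NOTICE 'trigger % fired: % on %.%',
        TG_NAME, TG_OP, TG_TABLE_SCHEMA, TG_TABLE_NAME;
    IF TG_OP = 'DELETE' THEN
        RETURN OLD;    -- only OLD exists for DELETE
    END IF;
    RETURN NEW;        -- INSERT and UPDATE have NEW
END;
$$ LANGUAGE plpgsql;
```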

The next piece of content is "Postgres Logical Replication Advantage and Step by Step Setup". This is from highgo.ca, and they're talking about the reasons why you would want to set up logical replication. They mentioned a few here. One is sending incremental changes in a single database. Two is consolidating multiple databases into a single one. The third is replicating between different major versions of Postgres. Like maybe you want to migrate data or do an upgrade. Then lastly, share a subset of the database between various databases. So a lot of use cases for moving data around. Then they go through the process of setting up a basic logical replication setup where you set up the server to be the publisher, set up the server to be the subscriber, create the publication on the publisher, and create the subscription on the subscriber in order to start syncing data over. So if you're interested in that, you can check out this post.

The next piece of content- "Faster data migrations in Postgres". This is from citusdata.com. He's talking about fast ways to migrate data from one database to another, primarily from Postgres to Postgres. He says the number one tool to do that is pg_dumps and pg_restores, because they basically logically let you back it up. You can do individual tables or individual schemas to be able to transfer data from one database to another. Some of the keys to doing it efficiently is using the jobs option to specify the number of processes that can run in parallel to do that export and then import job. But he says if you have very small tables that may not benefit from parallelism, you can actually string a pg_dump and pipe it straight into a pg_restore. That can actually transfer data pretty quickly. So I haven't actually seen this technique. 

So that was interesting if you want to check out that method. But if you have a lot of large tables, another solution they have is parallel exporting and loading. There's a Python script he mentions called Parallel Loader (there may be versions in other languages as well). Basically, it can take one large table and use multiple threads to execute COPY commands in parallel, so it's using PostgreSQL's COPY to actually transfer the data. Using this parallel method of transferring data, a migration went from more than one day with pg_dump and restore to 7 hours and 45 minutes. So it's definitely a lot faster being able to do it in parallel, because with pg_dump only one process can work on a given table at a time. With a large table, being able to break it up and run the process in parallel gives you a lot of performance. So if you're interested in learning more, definitely check out this post.
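To illustrate the idea behind that parallel approach (a sketch of the concept only, not the Parallel Loader script itself; the table, key, and ranges are hypothetical):

```sql
-- Split a large table on its key and let separate sessions each stream a chunk.
-- Session 1:
COPY (SELECT * FROM big_table WHERE id <  5000000) TO STDOUT;
-- Session 2:
COPY (SELECT * FROM big_table WHERE id >= 5000000) TO STDOUT;

-- On the target, each stream feeds its own:
-- COPY big_table FROM STDIN;
```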

The next piece of content is "FIRST REVIEW OF PARTITIONING OPENSTREETMAP". This is from rustprooflabs.com, and this goes over his review of whether he should partition the OpenStreetMap data that he works with and import it into Postgres. Again, he was using a lot of schemas for this management, but he's wondering if he could partition it and make data management easier.

Now, he covers some of the issues here, but the next post, "ROUND TWO: PARTITIONING OPENSTREETMAP", describes how he ultimately did it, and here are some highlights from his results. The bulk imports that he does on a regular basis now generate 17% less WAL, so that's good. The bulk DELETEs generate almost 100% less WAL, because deleting data no longer has to actually do a DELETE; you just drop the partition table. Then he mentions a simple aggregate query runs 75% faster. Now, he did mention that not all queries are running faster. I believe he said that, quote, "...partitioned tables are generally fast, if not faster...".

So it looks like the speed is generally as good or a little bit faster; not everything is that blazing 75% faster, which I believe was one query that he mentioned. But he goes through a lot of his testing process and his thoughts about the best way to optimize this. And this is really how you have to handle partitioning, because it is a burden to manage, so you want to make sure that the benefits outweigh the costs. This has been a great series of posts, four I believe, looking at the decision process of whether you should partition or not. So if you're interested in that, check out this blog post.
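As an aside on the bulk-DELETE point above, here's a minimal sketch of why dropping a partition generates almost no WAL compared to deleting rows; the table name and ranges are illustrative, not from his posts:

```sql
CREATE TABLE osm_load (
    load_date date NOT NULL,
    payload   jsonb
) PARTITION BY RANGE (load_date);

CREATE TABLE osm_load_2021_03 PARTITION OF osm_load
    FOR VALUES FROM ('2021-03-01') TO ('2021-04-01');

-- Instead of DELETE FROM osm_load WHERE load_date < '2021-04-01',
-- which writes WAL for every removed row, the whole slice is simply discarded:
DROP TABLE osm_load_2021_03;
```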

The last piece of content, the PostgreSQL person of the week is Takayuki Tsunakawa. If you want to learn more about his contributions to Postgres, definitely check out this blog post.
