Max Connections, SQL Tricks, Fast Text Search, Security & Compliance | Scaling Postgres 110
In this episode of Scaling Postgres, we discuss tuning max connections, different SQL tricks, setting up fast text search and handling security & compliance.
Content Discussed
- Tuning max_connections in PostgreSQL
- 6 SQL Tricks Every Data Scientist Should Know
- SQL Optimizations in PostgreSQL: IN vs EXISTS vs ANY/ALL vs JOIN
- Full Text Search in Milliseconds with Rails and PostgreSQL
- [YouTube] Webinar: Security and Compliance with PostgreSQL by Boriss Mejías
- Waiting for PostgreSQL 13 – Add logical replication support to replicate into partitioned tables
- The Origin in PostgreSQL logical decoding
- Replicate multiple PostgreSQL servers to a single MongoDB server using logical decoding output plugin
- Intersecting GPS-Tracks to identify infected individuals
- Ilaria Battiston
- Waiting for PostgreSQL 13 – Allow autovacuum to log WAL usage statistics.
- Waiting for PostgreSQL 13 – Add the option to report WAL usage in EXPLAIN and auto_explain.
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. This week, our first piece of content is "TUNING MAX_CONNECTIONS IN POSTGRESQL". This is from cybertec-postgresql.com. They're talking about how best to tune max_connections, which is the maximum number of concurrent connections PostgreSQL will accept. They mention the default setting is 100, but three of those are reserved for superuser connections, so essentially you only have 97. They caution against setting max_connections too high. Now, they first talk about why you would set it too high. First, because it requires a restart to change it, you probably want to be on the safe side and keep it higher. Or you may be getting the error message: "remaining connection slots are reserved for non-replication superuser connections".
So you just raise the limit. Or the application developers convince the DBA that they need more database connections for better performance. Then they go into the problems of setting it too high. The first is overloading the database: the more active connections you have, the more of the server's resources they will be using. Now, the thing that I've run into is people putting in 100, 200, or 400 connections. But what you have to keep in mind is that whether you have an 8-core, a 16-core, or even a 32-core server, each of those connections is a process. And if those processes are very actively doing work, you can't have the number of connections too far removed from the number of cores. Now, there is wait time associated with network activity, disk access, and things of that nature, during which the cores can be doing something else, so they can handle more connections than just one per core.
But you just have to keep in mind that all of those connections are going to be using the resources of that database server, so you want to be cautious and not overload it or overload the memory. With regard to memory, they talk about insufficient resources for each database connection: each connection can use the work_mem you have allocated, or even more than that, depending upon the statements that need to be run. So each connection consumes memory, and you want to make sure you don't run out of memory by setting max_connections too high. Now, they give a formula you can use to derive roughly how many max_connections you should have.
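I won't reproduce their exact formula here, but as a rough, hedged illustration of that kind of memory-based sizing (all of the numbers below are made-up assumptions, not recommendations):

```sql
-- A rough, illustrative ceiling from memory alone (not necessarily the article's
-- exact formula): assume 64 GB of RAM, 16 GB for shared_buffers, 8 GB left for the
-- OS and filesystem cache overhead, and roughly 100 MB of working memory per
-- connection (work_mem plus per-backend overhead).
SELECT ((64 - 16 - 8) * 1024) / 100 AS rough_connection_ceiling;  -- ~409
-- In practice the useful number of *active* connections is usually constrained
-- further by the number of CPU cores, as discussed above.
```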
But mostly I've had to determine it empirically: test out the server and different connection counts to see what the workload needs. Still, this is definitely one way to look at it and examine it. Then, of course, they say that when you want to start having really high connection counts, you're probably going to want to investigate a connection pooler. Now, a lot of application frameworks have built-in poolers, but there are also advantages to having a separate connection pooler. The one that I use the most is PgBouncer, but there's also Pgpool. They talk about how you can have many clients connected through PgBouncer while utilizing a smaller number of actual database connections; you typically achieve this using transaction pooling (a minimal configuration sketch follows below). So if you want to learn more about max_connections and setting it, here's a blog post to check out.
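Here is that sketch: a minimal, hypothetical pgbouncer.ini showing the transaction-pooling pattern, where many client connections are funneled into a much smaller number of actual Postgres connections. The database name, paths, and limits are purely illustrative.

```ini
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction    ; a server connection is held only for the duration of a transaction
max_client_conn = 1000     ; client connections PgBouncer will accept
default_pool_size = 20     ; actual database connections per database/user pair
```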
The next piece of content is "6 SQL Tricks Every Data Scientist Should Know". This is from towardsdatascience.com and they're talking about some SQL tricks you can use if you are a data scientist, but this is equally applicable if you use SQL in your daily work. The first trick they talk about is COALESCE to basically recode NULL values into some other type of reference. Like for example here they COALESCE a NULL value into something called MISSING. So when they query it, it says MISSING. Now I should say this, this is for Oracle, so it's not PostgreSQL specific, but I thought a lot of the tips that they gave here would be beneficial. The next one they talk about how you would do a rolling total and cumulative frequency. Well, they show it using window functions and a number of these tricks use window functions. So of course they're using this to get the window functions.
Then, as part of the subquery they built, they're able to calculate the cumulative frequency, partitioning by the join ID they set up. The next trick is finding the record with extreme values without self-joining; again, they use a window function to achieve that. Next is a conditional WHERE clause, using a CASE statement and a formula to determine what gets selected. Fifth is using LAG() and LEAD() to work with consecutive rows; here they use the previous row's value as a reference to calculate differences (a quick sketch of that follows below). And the sixth has to do with Python and R, so if you use those, it's potentially relevant. So if you want to learn some more tricks and tips for working with SQL, particularly window functions, definitely check out this blog post.
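That LAG()/LEAD() sketch, using a hypothetical daily_readings table rather than the article's data:

```sql
-- Compare each row with its neighbours in date order.
SELECT reading_date,
       value,
       value - LAG(value) OVER (ORDER BY reading_date) AS change_from_previous,
       LEAD(value)        OVER (ORDER BY reading_date) AS next_value
FROM daily_readings;
```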
The next piece of content is "SQL Optimizations in PostgreSQL: IN vs EXISTS vs ANY/ALL vs JOIN". This is from percona.com. So they're looking at different ways to write particular queries. Here they are looking at some inclusion queries and no matter which way you write it, whether it's using IN, ANY, EXISTS, or an INNER JOIN to get the answer that they're looking for, the PostgreSQL planner is intelligent enough to give the exact same execution plan. So no matter which way you write it, when you do an EXPLAIN plan, it'll give you the same result. So that is good news, it is able to find the most efficient path to give the answer that you need. But then they looked at some exclusion queries.
That is, using NOT IN, NOT ALL, NOT EXISTS, and a LEFT JOIN with an IS NULL check. Here they got a couple of different variations in the plans, depending on what they were interested in. It looks like NOT EXISTS, or the LEFT JOIN where the joined data is NULL, tends to give close to the best performance. One variation did report something faster here, but they said that once they increased the number of records, the NOT EXISTS/LEFT JOIN approach won. The other thing of note was down in the summary, where they gave some pretty good advice for developers writing SQL statements: first, make a list of the tables from which the data should be retrieved; then think about how to join those tables; and then think about how to have the minimum number of records participating in the JOIN condition.
So always think about the minimum number of records you need to pull to do the job; that will generally lead to the most efficient query. Then, of course, quote "Never assume that the query is performing well with a small amount of data in the table". So it's always advantageous to work on a copy of production, or production itself, in order to gauge whether a query is good, because the statistics are different and the query planner will make different decisions based upon those statistics. So you need something as close to production as possible. This is another good post about SQL to check out; a sketch of the exclusion-query variants they compare appears below.
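Here is that sketch, using hypothetical orders and customers tables rather than the article's schema:

```sql
-- Exclusion written two ways that usually plan well and, unlike NOT IN,
-- are not tripped up by NULLs in the subquery column.
SELECT o.*
FROM   orders o
WHERE  NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id);

SELECT o.*
FROM   orders o
LEFT   JOIN customers c ON c.id = o.customer_id
WHERE  c.id IS NULL;
```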
The next post is "Full Text Search in Milliseconds with Rails and PostgreSQL". This is from pganalyze.com. Now this is talking about the Ruby on Rails framework, but most of this post covers PostgreSQL and full-text search. So they give some comparisons of using LIKE and ILIKE versus the trgm extension for similarity searches and then full-text search and they show how you can use tsvectors and then tsqueries in order to query and do a full-text search. And they're even including rankings here. Now once they look at some of that, they then use a full text search within the Rails application framework. So here they're using a library that makes it easier to work with, and this library is a ruby gem and it's called pg_search. They show you how to put it into your application, how you can do a basic search, what the query actually runs to give you the results of that search, and then how you can actually configure it so you can define what columns you're searching against, what dictionary you're using, and even doing a weighting.
Now, in terms of optimizing it, they're using Postgres 12 and generated columns. They show a database migration that adds a GENERATED ... STORED column which converts the title and description to a tsvector using the English dictionary, weighting the title as A and the description as B. So they're doing as much work as possible within the generated column without having to resort to triggers. Then, of course, they applied a GIN index to it and adjusted the application configuration to use that particular column. After that optimization, a query that took 130 milliseconds now takes 7 milliseconds. I'd be interested to know how much of the difference came from the generated column versus the GIN index, but it's still quite a significant speed boost. A DDL sketch along the lines of what they describe appears below. So if you're interested in learning more about full-text search, this is definitely a blog post to check out.
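This is only an approximation of their migration; the table and column names are assumed:

```sql
-- Generated tsvector column plus GIN index, along the lines the post describes.
ALTER TABLE articles
  ADD COLUMN searchable tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(description, '')), 'B')
  ) STORED;

CREATE INDEX index_articles_on_searchable ON articles USING gin (searchable);
```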
The next piece of content is actually a YouTube video: "Webinar: Security and Compliance with PostgreSQL by Boriss Mejías". This is from the 2ndQuadrant PostgreSQL YouTube channel. The webinar is about an hour in length and covers all sorts of different aspects of setting up the security of your PostgreSQL installation, as well as compliance, namely PCI DSS compliance. So if you have an interest in that topic, definitely check out this webinar on YouTube.
The next piece of content is "Waiting for PostgreSQL 13 - Add logical replication support to replicate into partition tables". This is from depesz.com. and they're talking about how you can actually do logical replication into a partition table. So that's particularly interesting. And this is another feature that's been added to capabilities with regard to logical replication. So definitely an interesting new addition.
The next piece of content is "The Origin in PostgreSQL logical decoding". This is from highgo.ca. So this is talking about logical decoding, basically decoding of the wall files by some other application. So it reads the wall files and decodes logically what changes have been made for say, doing a change data capture solution and they're talking about the origin, which is defining the origin and where the data came from. So it was logically replicated from what origin and they go through a bit of the internals describing it and how it works. So if you're interested in that, check out this blog post.
The next post, also from highgo.ca, is "Replicate multiple PostgreSQL servers to a single MongoDB server using logical decoding output plugin". So that's definitely a specific use case, but if you're interested in it, again dealing with logical decoding, check out this blog post.
The next post is "INTERSECTING GPS-TRACKS TO IDENTIFY INFECTED INDIVIDUALS". This is from cybertec-postgresql.com. Now this is with regard to the sickness that is happening and using GPS tracking, they're showing a solution to be able to essentially do contract tracing, it looks like. So setting up data structures within PostgreSQL, set up sample tracks via QGIS segment, sample tracks to retrieve individual track points, and then do the intersection of infected individuals to retrieve possible contacts. So if you're interested in this type of content, definitely check out this post.
Next post, the PostgreSQL person of the week is Ilaria Battiston. If you want to learn more about Ilaria and her work in PostgreSQL and contributions, definitely check out this blog post.
The last two pieces of content are from depesz.com, and they're talking about things we've mentioned in previous episodes of Scaling Postgres: features that are coming in version 13. The first post is "Allow autovacuum to log WAL usage statistics". Definitely advantageous.
The second is "Add the option to report WAL usage in EXPLAIN and auto_explain". So if you're interested in those topics, definitely check out those blog posts.