DBMS of the Year, Better Data Migrations, Idle Connection Impact, Security Implementation Guide | Scaling Postgres 147
In this episode of Scaling Postgres, we discuss PostgreSQL as the DBMS of the year, running better data migrations, the impact of idle connections and a security implementation guide.
Content Discussed
- Single User Mode and -P flag
- PostgreSQL is the DBMS of the Year 2020
- Stop worrying about PostgreSQL locks in your Rails migrations
- Resources consumed by idle PostgreSQL connections
- Performance impact of idle PostgreSQL connections
- How to limit rows to at most N per category – fix
- Announcing the Crunchy Data PostgreSQL Security Technical Implementation Guide
- PostgreSQL Database Security: What You Need To Know
- Using R in Postgres for Logistic Regression Modeling
- Improved OpenStreetMap data structure in PostGIS
- Amit Langote
Transcript
All right. I hope you, your family, friends, and coworkers continue to do well. Our first piece of content is "PostgreSQL is the DBMS of the Year 2020". This is from db-engines.com, and they have ranked it based on its gain in popularity over the last year. They define popularity according to their methodology, which is basically about how many people were communicating about it; their list includes job offers, professional profile entries, and citations on the web. This is actually the third time that PostgreSQL has been named DBMS of the Year, so that's another great achievement for Postgres. If you're interested in learning more, you can check out this site.
The next piece of content is "Stop worrying about PostgreSQL locks in your Rails migrations". This is from medium.com. Now, this is talking about Rails migrations and it does cover some Ruby and Rails code, but the information in this post is valid for anyone using Postgres. If you use Django, Python, or some other language, you can still find important things to do and be wary of when working with a large database. They have a database that handles, they say, up to 15,000 requests per second, and, I believe, over 100 full-stack developers working on it, so it seems to be a pretty large application. They have refined their procedures to perform migrations, which are essentially DDL changes to their database, safely. The first thing they say you want to be aware of is adding an index. Their example uses a Rails migration, so it's Ruby code, but it basically just adds an index.
Of course, this is a locking operation until that index build completes. What you want to do instead is create that index concurrently. You can do that in the Ruby syntax by specifying the algorithm option as concurrently when you add the index. So you definitely want to do that. However, the other thing they apparently have set for their application is a statement_timeout for the user running these migrations. To avoid the index operation being timed out, as part of their migration they first store the current statement_timeout in a variable, then set statement_timeout to zero (so basically no statement timeout for the session they're in), add the index, and then reapply the old value of statement_timeout. One other thing they have to do is disable the DDL transaction wrapper that Rails provides, in order to run this concurrent index migration.
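In SQL terms, the steps described above boil down to something like the following sketch. The table and index names here are hypothetical, not from the post:

```sql
-- Lift the statement timeout so the (potentially long) concurrent
-- index build is not cancelled mid-way.
SET statement_timeout TO 0;

-- Build the index without taking a lock that blocks writes.
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
-- which is why the migration's DDL transaction wrapper must be disabled.
CREATE INDEX CONCURRENTLY index_users_on_email ON users (email);

-- Restore the previous timeout for the rest of the session
-- (the post saves the old value first; '5s' is just an example).
SET statement_timeout TO '5s';
```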
So, a few steps, but the main one here, beyond just creating your indexes concurrently, is altering the statement timeout so that the index build can run; that's something they have to do in their environment. The next area they talk about is adding a foreign key, and with the Ruby DSL you can just do add_foreign_key with the table and the reference. Again, this takes locks that block writes on both tables. One way they could handle it is to set a statement timeout to bound that locking: set the statement timeout to 5 seconds, try adding the foreign key, and if that's successful, revert the statement timeout to what it was before. However, the better way, which they mention here, is to add the constraint without validation. Basically, any rows added going forward are validated, but not pre-existing rows.
Then, as a second step that is not going to block operations, you validate that constraint. So here's the process they go through. They save the old value of the statement timeout and set the statement timeout to 5 seconds to keep it relatively short. They add the foreign key with validation set to false, an operation that should run relatively quickly, in less than 5 seconds. Once it's successful, they set the statement timeout to 0, turning it off. Then they run the command that validates the foreign key for this table, which goes through and checks all the existing values to ensure they're valid. As a final step, they revert the statement_timeout to what it was before. It's a little complex, but you can see the logic: first you create the foreign key in an invalid state, and then you validate it later. The next area they talk about is adding a column.
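The two-step foreign key pattern looks like this in plain SQL, with hypothetical table and constraint names:

```sql
-- Keep the timeout short: adding the constraint NOT VALID only needs
-- a brief lock, since existing rows are not checked.
SET statement_timeout TO '5s';

ALTER TABLE orders
    ADD CONSTRAINT fk_orders_user_id
    FOREIGN KEY (user_id) REFERENCES users (id)
    NOT VALID;

-- Validation scans existing rows but does not block normal reads and
-- writes, so lift the timeout and let it run as long as it needs.
SET statement_timeout TO 0;
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user_id;

-- Restore the original timeout.
SET statement_timeout TO '5s';
```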
You can add a column like this and normally that happens very quickly, but where you can run into problems is with the lock queue, because the operation still needs an ACCESS EXCLUSIVE lock on that table. If other locks are already held, this gets put in the lock queue and everything behind it has to wait. They give a graphical representation of this, where operations that require a lock have to wait until the add column query has completed before they can start processing, so everything starts backing up in the queue. The way to avoid this is to set a lock_timeout. Basically, if the operation times out while waiting for the lock, it gets cancelled and the column isn't added. Of course, if that happens, you need a process to retry it to make sure it's eventually successful.
They don't show that retry in this particular example; they just set the lock timeout to a short value and then revert it. They also mention add_reference, which is basically a combination of the things already shown: adding a new column, adding an index, and adding the foreign key, so there are multiple places it can cause locking. To get around that, instead of using this very simple DSL call, you do one step to add the column, one to add the index, and one to add the foreign key. Now, it would be great if Rails offered a built-in way to do these updates safely.
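The lock_timeout pattern for adding a column can be sketched like this (hypothetical table and column; the timeout value is illustrative):

```sql
-- Give up quickly if the ACCESS EXCLUSIVE lock cannot be acquired,
-- rather than queueing behind other locks and blocking everything
-- that arrives after us.
SET lock_timeout TO '2s';

ALTER TABLE users ADD COLUMN nickname text;

-- Restore the default. If the ALTER was cancelled by the lock timeout,
-- it needs to be retried later.
SET lock_timeout TO DEFAULT;
```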
Unfortunately, it doesn't. There may be other ORMs that offer a more sophisticated way to do this safely, and a number of libraries have been created for Ruby to do safer migrations. They've actually written their own library, a gem called safe-pg-migrations, that takes the standard, simple DSL syntax and does all of these things behind the scenes. So this post is definitely beneficial for those using Rails or Ruby who want safer migrations. But even if you're not using Rails, following these steps and thinking through these processes can help you write better DDL statements for migrating your database. If you're interested, you can check out this post.
The next piece of content is "Resources consumed by idle PostgreSQL connections". This is from aws.amazon.com, and they're talking about connections that aren't really doing anything and the resources they use. This is the first part of a two-part post; the second goes into the performance impact. This one takes a very empirical, test-driven approach to measuring the impact of these idle connections. First, they test connections with no activity: they open 100 connections, leave them idle for ten minutes, and see what the impact is. They also try issuing a DISCARD ALL statement to see if that frees up any resources. Unfortunately, it really doesn't in any of the tests shown here. You can see the number of database connections going from 0 to around 100 as tested.
You can see the freeable memory drops from about 5.27 to 5.12 GB, which is about 150 megabytes, so each connection uses about 1.5 megabytes. For the next test, they actually did some work with temporary tables, creating one and dropping one, and then left those connections idle for ten minutes to see what the impact was. Here you can see the number of connections going up and down, and then a second phase with DISCARD ALL. With this one, the memory dropped from 5.26 GB to 4.22 GB. So that workload definitely uses more memory, and those idle connections don't give any of it back until the connections are shut down.
So those idle connections are still holding memory resources. Next, he looked at different SELECT queries: run some SELECT queries and then wait ten minutes in an idle state to see what happens. Again, the memory drops and is only released back once those connections are closed; for the ten minutes they are idle, they're still using these resources, and they are not released back to the operating system, for example. Then, combining both temporary tables and SELECT queries, you see the same impact: memory drops while both activities are done and is only released once those connections are dropped.
A discard statement still does not free them up; only stopping the connections does. He notes that you even get a CPU impact from these idle connections. Here's an example where he compared CPU utilization across test runs with different numbers of connections. It goes up to maybe 1% for 100 connections, about 3% for 500 connections, about 5% at 1,000 connections, about 6-7% at 1,500 connections, and around 8% at 2,000 connections, with really nothing going on in those connections. So even connections just sitting around doing nothing still use CPU resources on the system.
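If you want to check for idle connections on your own system, a standard catalog query (not from the post) against pg_stat_activity will show them:

```sql
-- Count client connections by state to spot idle sessions
-- holding memory and CPU resources.
SELECT state, count(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY count(*) DESC;

-- List the sessions that have been idle the longest.
SELECT pid, usename, application_name,
       now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state = 'idle'
ORDER BY idle_for DESC
LIMIT 10;
```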
Now, the next post talks about the performance impact of this. He measures transactions: a test with 100 clients, and then the same test with 1,000 idle connections added. Using a standard pgbench test, adding 1,000 idle clients drops the transaction throughput from 1,249 to 1,140 TPS, about an 8.7% drop. He then says you'll probably see more of an impact with select-only workloads. So he ran a select-only pgbench, again with 100 clients, and then added 1,000 idle connections on top of those 100 clients; he saw an 18.2% drop in TPS. So that starts becoming really significant. Then he looked at a custom pgbench workload with very large queries against the 100 clients, and then the same 100 clients again with 1,000 idle connections added to the database at the same time.
Testing that, he saw a 46% performance drop in TPS; transactions per second went from 378 to 206. That's a huge performance drop from adding 1,000 idle connections. Then he gets to the main point: the way around this is a connection pooler. He tested using PgBouncer to see what impact it had, configuring PgBouncer to allow 5,000 client connections and a maximum of 200 server connections. So up to 5,000 application clients could connect, but at most 200 connections would be made to the database. Then he did a pgbench run with 100 clients, so you can see 100 clients and 100 server connections as a baseline. Then he opened up 1,000 connections and left them idle. Just adding those 1,000 idle connections doesn't really use any server connections.
When he ran that 100-client pgbench again, he got the exact same performance. So with PgBouncer, adding 1,000 idle connections really didn't have an impact; it's the exact same transactions per second. For example, during the 100-client pgbench run with 1,000 idle connections, you can see the 1,000 idle connections under client active, with 100 of them running pgbench and using up the 100 active server connections. Again, the performance was identical whether you had 1,000 idle connections or not. So using a connection pooler definitely helps your performance when you have a lot of idle connections.
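A PgBouncer setup along the lines described (5,000 client connections funneled into 200 server connections) might look like the following config fragment. This is a sketch with assumed database names and paths, not the author's actual configuration:

```ini
; pgbouncer.ini (illustrative values and paths)
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; up to 5,000 application clients...
max_client_conn = 5000
; ...sharing at most 200 real server connections
default_pool_size = 200
```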
Then he tried an interesting thing: the system he was using only had two virtual CPUs, and running 100 processes in parallel could result in a lot of context switching. So he tried dropping the number of server-side connections down to 20 and ran his benchmark again. This time he actually got an increase in throughput, to 426 transactions per second compared to 377. So this is an interesting case: you drop the number of server connections, but you increase performance. With two virtual CPUs, handling only 20 connections at once means far less context switching than trying to juggle 100.
The connections become much more efficient and can give you better transaction throughput. So that's definitely something to keep in mind, as he mentions: sometimes you think just adding connections will give you more performance, but not necessarily. They close out the post talking about a tool available in AWS, RDS Proxy, which is essentially a pooling tool like Pgpool. So, definitely interesting insights and a lot of tests with regard to connections. If you want to learn more, I highly encourage you to check out these two blog posts.
The next piece of content is "Announcing the Crunchy Data PostgreSQL Security Technical Implementation Guide". This is from blog.crunchydata.com. This guide was first released in 2017 in conjunction with the United States Defense Information Systems Agency, and they've come out with a new version as of, I think, November 2020. So here's the link to the Security Technical Implementation Guide that you can use to help secure your PostgreSQL installations. If you want to learn more and get a link to the guide, go ahead and check out this post.
The next piece of content, also security-related, is "PostgreSQL Database Security: What You Need To Know". This is from percona.com. This is a very short post, but it's the start of a larger series, so it's kind of the 50,000-foot view. They've broken security down into six areas; more specifically, what Postgres deals with regarding authentication (identifying who is allowed to access the database), authorization (what they are allowed to see), and accounting (logging who accessed what data, et cetera). They then go through the different features available within Postgres to handle those authentication, authorization, and accounting steps. If you're interested in this post, go ahead and check it out.
The next piece of content is "How to limit rows to at most N per category - fix". This is from depesz.com. This references a previous post, also on depesz.com, where he limited the number of addresses allowed per user. Someone noted that with a race condition, multiple people adding addresses at once could overrun his previous implementation, because it just ran a query to count how many addresses were there and raised an exception if there were already three. But with concurrent transactions going on, you could easily overrun that and end up with five or six or whatever. The way to avoid that is to introduce locking.
The first implementation he shows uses a FOR UPDATE lock, which locks the individual row, but that locks it for everything going on, so, as we mentioned earlier, you can get stuck in a lock queue with things blocking behind it. So you may not necessarily want to do that. The implementation he ultimately settled on uses an advisory lock scoped to this specific purpose. It's kind of an application-level lock, even though it lives at the database level, and it's used a lot in application development. Here you have a specific lock for this check-address feature, applied with reference to the user ID in question. Then it does the count, and if the count is three or more, it skips the insert and raises the exception listed here. If you're interested, you can check out this blog post.
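The pattern he lands on can be sketched like this. The table, function, and lock key here are hypothetical; the post's actual code differs in detail:

```sql
-- Trigger function limiting each user to at most 3 addresses.
-- The advisory lock serializes concurrent inserts for the same user,
-- closing the race condition; pg_advisory_xact_lock is released
-- automatically when the transaction ends.
CREATE OR REPLACE FUNCTION check_address_limit() RETURNS trigger AS $$
BEGIN
    -- 123 is an arbitrary namespace for this feature's locks.
    PERFORM pg_advisory_xact_lock(123, NEW.user_id);
    IF (SELECT count(*) FROM addresses
        WHERE user_id = NEW.user_id) >= 3 THEN
        RAISE EXCEPTION 'User % already has 3 addresses', NEW.user_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_check_address_limit
    BEFORE INSERT ON addresses
    FOR EACH ROW EXECUTE FUNCTION check_address_limit();
```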
The next piece of content is "Single User Mode and -P flag". This is from fluca1978.github.io, and they're talking about the case where you have damaged system indexes in Postgres. There is a way to rebuild them, but the problem is that with this type of corruption you can have trouble connecting to the database at all. The solution is to start the cluster in single-user mode (he has a link on how to do that here), start a backend process that ignores the system indexes, and then run REINDEX SYSTEM against the database that's having issues. Then you restart the cluster to get it back. If you ever run into this issue, it's definitely a blog post to consider checking out.
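As a rough sketch of those recovery steps (the data directory path and database name are placeholders):

```sql
-- From the shell (not psql), with the cluster stopped, start a
-- single-user backend; -P makes it ignore system indexes when
-- reading the system tables, so the damaged indexes don't block us:
--
--   postgres --single -P -D /var/lib/postgresql/data mydb
--
-- At the single-user "backend>" prompt, rebuild the system catalogs:
REINDEX SYSTEM mydb;
-- Exit with Ctrl-D, then restart the cluster normally.
```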
The next piece of content is "Using R in Postgres for Logistic Regression Modeling". This is from blog.crunchydata.com. This is another post in the series on doing data science with California wildfire data, and they're at the point of actually doing the logistic regression. For that purpose, they're using an extension called PL/R, which makes R available as a procedural language within Postgres. So you can install this extension and it can run R. They built the model in R, and then they can create a function within Postgres to run this model and process the data that already exists in Postgres. If you're interested in learning more, you can check out this post.
The next piece of content is "Improved OpenStreetMap data structure in PostGIS". This is from rustprooflabs.com. They're talking about some recent enhancements to the OpenStreetMap data loading that allow a more flexible output, along with enhancements to the general data structure. For example, some of the opinionated changes are: every table has a primary key; more columns get NOT NULL; prefer proper columns over JSONB, and prefer JSONB over HSTORE (or, as he says, also known as goodbye HSTORE); units are cleaned up to match the OpenStreetMap defaults; and conversion to other units should happen in generated columns, views, or even materialized views. If you're interested in learning more about these improved data structures, go ahead and check out this post.
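As an illustration of that last point about unit conversions living in generated columns, here is a hedged sketch with a hypothetical table; it is not taken from the post:

```sql
-- Store lengths once, in meters (the OpenStreetMap default), and
-- derive other units in a generated column instead of duplicating
-- converted data in the base tables.
CREATE TABLE road (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL,
    length_m double precision NOT NULL,
    length_miles double precision
        GENERATED ALWAYS AS (length_m / 1609.344) STORED
);
```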
The last piece of content, the PostgreSQL person of the week is Amit Langote. So if you're interested in learning more about Amit and his contributions to Postgres, definitely check out this blog post.