
auto_explain Overhead, Postgres Data Lake, Citus Shard Rebalance, SSL Authentication | Scaling Postgres 157


In this episode of Scaling Postgres, we discuss auto_explain's overhead, setting up a Postgres data lake, rebalancing a Citus sharded database and implementing SSL authentication.


Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Can auto_explain (with timing) have low overhead?". This is from pgmustard.com, and they're talking about the auto_explain extension. What it does is, when it detects a slow query, it automatically generates an explain plan in the log so that you can check it out and see the explain plan for that slow query. But as he mentions, quote: "Many places warn about its overhead, but I found concrete information difficult to come by". So he's heard anecdotal numbers in the range from around 3% to about a factor of 2. So a pretty wide range. But he wanted to look at this more in-depth. One thing to keep in mind when doing an EXPLAIN ANALYZE is that reading the system clock can have an overhead, depending upon the system that you're using.

He has a link to an article from Ongres that we actually covered in a previous episode of Scaling Postgres that talks about this. And there's actually an application included with Postgres, pg_test_timing. So you can check out how slow your system clock is for the system you're using it on. Now, for the particular one he used here, it looks like he's doing it on a Mac with about 4 gigs of RAM, and two cores. So a relatively small system. But running the pg_test_timing, it looked like 90% of the calls were less than a microsecond. So that's pretty good. So the first thing he wanted to do was just take a baseline. So he used pgbench to get a baseline of different latencies, and he came up with 1.5 milliseconds in terms of the average latency. Now then he configured auto_explain, adding it to the shared preload libraries, and set it to take a full sample. 
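As a rough sketch of the clock check and the pgbench baseline he ran before enabling anything (the scale factor, client counts, and database name here are illustrative, not the ones from the post):

    # Measure the overhead of reading the system clock (runs for 3 seconds by default)
    pg_test_timing

    # Create a small pgbench dataset, then take a latency baseline
    pgbench -i -s 10 bench_db
    pgbench -c 2 -j 2 -T 60 bench_db    # reports "latency average" when it finishes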

So that means sampling everything for all statements: basically every statement will get an EXPLAIN ANALYZE. Now, when he ran pgbench again, he saw a 26% increase in latency, so that's not insignificant. So of course, the thing you really want to do is only log the slowest queries. So here he's still taking a full sample, but he's only looking at queries that run longer than 10 milliseconds. When he runs pgbench again, he only sees a 0.8% difference in terms of being slower. So that's very negligible. But it hasn't really started doing the analyze yet. So in this next step of doing the analyze without the timing of each step, he turned log_analyze to true. Running pgbench again, it's only 0.3% slower than the baseline. So again, the analyze is not taking up a lot of time as long as you're only doing it for these slow queries. Now he added the timing in to be able to track the time of each step.

So he set log_timing to true, ran pgbench again, and this time it's 2.2% slower. So that's definitely great information to have. If you're using auto_explain, you want to focus on only auto-explaining the slowest queries, so that even if you have timing on, it's not going to be a significant hit to your database performance. Now, he actually decided to add a lot more parameters to the logging in terms of triggers, settings, buffers, et cetera, as well as the JSON format. He ran it again and it was actually a little bit faster than the 2.2% of the previous run that just turned on the timing. He can't really explain this, but it may have to do with variation, depending on how many times he ran these tests and what the standard deviation is. But overall, it looks like as long as you're not trying to track every statement, the hit to the database is around the 2% mark, at least for the load he was putting on the database in this example. Your experience may be different based on your load and your system, but definitely interesting information. If you want to find out more, I encourage you to check out this blog post.
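For reference, here is a minimal sketch of the kind of postgresql.conf settings being toggled through these tests (the values mirror the steps described above and are illustrative, not the author's exact configuration):

    # Requires a restart to load the library
    shared_preload_libraries = 'auto_explain'

    # Explain every statement, or only those slower than a threshold
    auto_explain.sample_rate = 1.0
    auto_explain.log_min_duration = 10    # milliseconds; 0 logs every statement

    # Options the post enables step by step
    auto_explain.log_analyze = on         # include actual rows and times in the plan
    auto_explain.log_timing = on          # per-node timing; the post tests this off, then on
    auto_explain.log_buffers = on
    auto_explain.log_triggers = on
    auto_explain.log_settings = on        # Postgres 12 and later
    auto_explain.log_format = 'json'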

The next piece of content- "Build your own 'data lake' for reporting purposes in a multi-services environment". This is from fretlink.com. So they're talking about a data lake that they set up essentially on Postgres. Now, they start off by talking about some of the reporting issues that they were encountering in terms of the demands for different types of data and trying to satisfy those needs from different customers, internal users, etc. As well as issues with data quality, in terms of producing spreadsheets and passing them around. So it's a good read to kind of get an overview of what the environment was like. But the way that they approached resolving these situations starts off essentially in this graph here. So basically they broke up their monolithic application into three separate services: a career service, a pricing service, and an expedition service. Now the main tables of the application, so tables are in green. You can see some of the table names here, here and here. 

They're all a part of the public schema. But what they did is they created a separate schema called reporting and then they created views within those reporting schemas. So basically, the developers who are managing the service define an interface, essentially, in views on what data can be pulled out for reporting purposes and give the definition of the columns that are defined within that view. So they did it for each one of their services. The next step is to use streaming replication to stream that data into a separate reporting server. So I assume that this is one cluster that just has multiple databases that reflect what is coming from each of the application services. So essentially it's a one-to-one copy. Then with that data, in this separate instance, they created a separate reporting database and that reporting database had a schema for each service. 
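As a rough sketch of what that per-service reporting interface could look like (the table, column, and role names here are made up for illustration, not taken from the post):

    -- Inside the carrier service's own database
    CREATE SCHEMA reporting;

    -- Expose only the columns the reporting side is allowed to see
    CREATE VIEW reporting.carriers AS
    SELECT id, name, country, created_at
    FROM public.carriers;

    -- A read-only role for the reporting consumers
    GRANT USAGE ON SCHEMA reporting TO reporting_reader;
    GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO reporting_reader;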

So the reporting database has a "carriers" schema that uses a Foreign Data Wrapper to access the carrier service database, and it essentially accesses the reporting views there. Then the pricing schema in the reporting database accesses the pricing service database and its reporting views to get the information needed, and the same thing for the other services. So this essentially allows them to query this one reporting database and do queries across different databases by using these schemas. Now, one advantage of this is that pretty much all of these queries will be against, essentially, real-time data. Now, they didn't mention using materialized views or any kind of summary tables or things like that. Those would cause the data to no longer be real-time, but you could then optimize access to that data with indexes.
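A minimal sketch of how one of those schemas could be wired up with postgres_fdw (the server names, connection details, and role names are assumptions for illustration):

    -- Inside the central reporting database
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER carrier_service
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'carrier-replica.internal', dbname 'carriers', port '5432');

    CREATE USER MAPPING FOR reporting_user
        SERVER carrier_service
        OPTIONS (user 'reporting_reader', password 'secret');

    -- Pull in the views exposed by the service's reporting schema
    CREATE SCHEMA carriers;
    IMPORT FOREIGN SCHEMA reporting
        FROM SERVER carrier_service INTO carriers;

    -- Cross-service reporting queries can now join across schemas, e.g.
    -- SELECT ... FROM carriers.carriers c JOIN pricing.prices p ON ...;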

So this structure may work at a certain data level, but once you're hitting billions, tens of billions of rows, you may need to implement something slightly different, something less real-time, to handle performance for reporting purposes. Now, they go ahead and include the different scripts that they use to set this up so you can check that out. Then they also took another step to integrate other external data into the system as well. So in this example, they're showing how they have information located in Google Sheets or some other external service. Basically, they set up a way to synchronize it with a separate database within the reporting server. 

Then they do the same process where the reporting database has a dedicated schema for that data and reads it through a Foreign Data Wrapper, and the same thing for any other external data that they access. They then describe how they've automated all of this to run and process data for them, and finish up with using Metabase to give access to this data in a visual and graphical format. So, this is a very interesting post about a process they went through to essentially put data in the hands of their end users and allow them to query it. So, if you are interested in that, you can check out this post.

The next piece of content is "Scaling out Postgres with the Citus open source shard rebalancer". This is from citusdata.com and it talks about again the open-source release of Citus 10 and how it now supports being able to rebalance shards that you have across the scale out of Citus. So they give an example here of why this is important. So maybe you start your Citus cluster with two nodes and you have three shards here and three shards there. Well, if you add a node C, you need some way to move those shards to node C because essentially, node C is not being used. Now that's where the rebalancer comes in. It can take shards from other nodes and repopulate them to C and essentially rebalance across the cluster no matter how many nodes you have. Now they also mentioned there are a couple of different rebalancing strategies. So number one is by_shard_count, so that's essentially what this is. 

Node A had three, node B had three. So you essentially move one from A and one from B to node C. So now they each have an equal number of shards. The other way to do it is by_disk_size. Now this can become important if you're, say, sharding by customer and you have large customers and small customers. Well, you could rebalance by the data size so that you're essentially balancing the data size between your nodes, like here. So you move the large customer three to node A and the small customer two to node B. But they also have other examples of rebalancing strategies that you can use, or you can develop your own. So they have some information about that here.
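As a sketch of what adding a node and rebalancing look like in Citus 10 (the node names are placeholders; check the Citus documentation for the exact function signatures in your version):

    -- Add the new worker node C to the cluster
    SELECT citus_add_node('node-c.internal', 5432);

    -- Rebalance using the default by_shard_count strategy
    SELECT rebalance_table_shards();

    -- Or rebalance by data size instead
    SELECT rebalance_table_shards(rebalance_strategy := 'by_disk_size');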

They also talk about a way to actually shrink your cluster. So in addition to scaling out, you can also scale down if you don't need all of those nodes that you've scaled up to. There's a Citus drain node command to do that. So essentially it rebalances the shards, collapsing them onto fewer nodes. Now, one downside they did mention is that during a rebalance, the data being moved is readable but not writable. So you need to take into account that if you're going to be rebalancing a shard, you can't actually write to that data at that time. Now they did say that on the Azure version of Citus there is a way around that, but not for the open-source version at this time. But if you want to learn more about this, definitely check out this post.
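Shrinking the cluster is similar; a minimal sketch, assuming the Citus 10 function names:

    -- Move all shards off node C, then remove it from the cluster
    SELECT citus_drain_node('node-c.internal', 5432);
    SELECT citus_remove_node('node-c.internal', 5432);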

The next piece of content- "SETTING UP SSL AUTHENTICATION FOR POSTGRESQL". This is from cybertec-postgresql.com. They're talking about just setting up SSL on your Postgres server. So in the postgresql.conf file, there are a number of settings that you can set. Basically, the main one is setting SSL=on as long as you have a cert_file and a key_file. Of course, there are a number of other parameters that you can adjust for SSL. Now you can do a reload of the configuration at this point, but it won't actually work yet. You also need to do a restart to get it up and working. Now the next thing you need to change on the server side is the pg_hba.conf file because you need to use a type of host SSL for it to enforce those SSL connections. 

So you need to move from just a host type to a hostssl type. Now, the next part of the post actually talks about getting that certificate file and certificate key. You can, of course, purchase those, but this runs through the process of creating your own self-signed certificate and learning how to set that up. Now, the next thing they mention is that for clients connecting to a Postgres server with SSL enabled, there are numerous different ways you can connect, and the client can define one of the following SSL modes, all the way from disable to verify-full. But of course, the server also has to support SSL for these to work.
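Before getting to those client-side options, here is roughly what the self-signed certificate step looks like, along the lines of the openssl one-liner in the Postgres documentation (the common name is a placeholder):

    # Generate a self-signed server certificate valid for one year
    openssl req -new -x509 -days 365 -nodes -text \
        -out server.crt -keyout server.key \
        -subj "/CN=db.example.com"

    # The key must not be group- or world-readable, or the server will refuse to use it
    chmod 600 server.key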

So at the point of connection, you can say sslmode=disable to not use any SSL at all. Probably one of the more common options is require, which basically means the client is going to require SSL connections when it connects. Or you might want prefer if your server still supports non-SSL connections. But if you want full protection, having a certificate on the client and doing a verify-full does give you, as they say here, the strongest protection possible, because you are validating the actual certificate that the server is using and the client trusts the signer of that certificate. So if you want to learn more about setting up SSL on Postgres, definitely check out this blog post.
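A sketch of what those sslmode options look like in a libpq connection string (host names and file paths are placeholders):

    # No SSL at all
    psql "host=db.example.com dbname=app sslmode=disable"

    # Require SSL, but don't verify the server's certificate
    psql "host=db.example.com dbname=app sslmode=require"

    # Strongest protection: verify the certificate chain and the host name
    psql "host=db.example.com dbname=app sslmode=verify-full sslrootcert=root.crt"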

The next piece of content- "Enhancing PostgreSQL 13 Security with the CIS Benchmark". This is from blog.crunchydata.com and this is a new version of the CIS PostgreSQL Benchmark that's been published. CIS stands for the Center for Internet Security, which is a nonprofit organization. So Crunchy Data has helped to develop a Security Technical Implementation guide for Postgres. It covers things like installation and patches, directory file permissions, logging, monitoring, auditing, user access and authorization, connection and login, PostgreSQL settings, replication, and special configuration considerations. Now, this is one standard.

Another standard comes from a post also out this week on enterprisedb.com: "How the EDB Postgres STIGs Can be Used to Secure Your PostgreSQL Database". A STIG is a Security Technical Implementation Guide from the Defense Information Systems Agency, so this is the DoD's, the government organization's, set of standards. They have developed a STIG for Postgres, and this post talks about all of that and how to use it to secure your Postgres systems. So either of these two posts is a great one to look through to help you follow a standardized process to secure your Postgres installations.

The next piece of content is "Monitoring PostgreSQL with Nagios and Checkmk". This is from highgo.ca, and it's talking about setting up Nagios for monitoring along with Checkmk. But they go through the process of all the commands you would need to run to set this up on Redhead Enterprise Linux. Install Postgres and then set up Nagios with the command files to start monitoring Postgres as well as adding Checkmk to be able to give a more, they say, user-friendly view of monitoring your Postgres installation as well. Apparently, this uses check_postgres under the covers which basically does queries using psql against the database to check the number of different parameters. So it gives you more than just node information but also relevant database statistics as well. So if you're interested in that, check out this blog post.

The next piece of content is "Musings of a PostgreSQL Data Pontiff Episode 1". This is from blog.crunchydata.com and this gives a little bit of the background where they're going to be talking about a data science blog series. So this mostly gives some background on the author.

The second post is "Musings of a PostgreSQL Data Pontiff Episode 2 - Hot Magick; Analytics and Graphical Output with PL/R". Basically, they're using functions within Postgres to do the data analysis and then using the procedural language R (PL/R) to generate different plots such as this. So if you are interested in learning more about data science with Postgres, definitely check out these two posts.

The next piece of content- "TRAVELING SALESMAN PROBLEM WITH POSTGIS AND PGROUTING". This is from cybertec-postgresql.com. They're talking about essentially the traveling salesman problem where a salesman has to hit a number of points in their sales route and what is the most efficient route to do that. This post discusses using Postgres and the pgrouting extension to be able to accomplish that and define the shortest route between multiple sets of points. So if you're interested in that type of geo analysis, definitely check out this post.

Next piece of content- "Introducing pg_query 2.0: The easiest way to parse Postgres queries". This is from pganalyze.com and they've released their new tool PG Query in order to parse Postgres queries. It actually uses the Postgres parser to do it because it's developed as a C library that works with Postgres to be able to parse the queries and they actually use it in their tool pgAnalyze. So they've open-sourced the query parsing part. They have a number of libraries in other languages that can use it, such as Ruby and Go and others have written others that actually use the CD library to do this query parsing. So if you're interested in that, definitely check out this blog post.

The next piece of content- "Cloud Native PostgreSQL for Application Developers". This is from enterprisedb.com and by Cloud native Postgres, this is a name they've given to their Kubernetes operator at EDB. This post actually talks about using this Kubernetes operator for application developers to be able to manage your Postgres instances if you're doing testing, I assume with multiple different ones. So if you're interested in that you can check out this blog post.

Next piece of content- "Announcing the Crunchy Postgres Operator 4.6.0 with rolling updates, pod tolerations, node affinity and more". This is from blog.crunchydata.com. If you want to learn more about the new version of their Postgres operator, definitely check out this blog post.

The last piece of content, the PostgreSQL person of the week is Anastasia Lubennikova. So if you're interested in learning more about Anastasia and her contributions to Postgres, definitely check out this blog post.
