100K Tenants, Extensions, Query Planning, Crosstabs

In this episode of Scaling Postgres, we review articles covering 100,000 tenants, Postgres Extensions, query planning and how to create crosstabs.

Content Discussed

YouTube Video

Podcast Audio

Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about 100,000 tenants extensions,query planning, and cross tabs. I'm creston. Jameson and this is scaling postgres episode 19one.

[00:00:21] All right, our first article is called Options for Scaling from One to 100,000 Tenants.And this is from the Citusdata.com blog. Now, this kind of blog post is interesting because it kindof goes over a story or scenario, presumably that Citusdata has heard from customers wheresomeone is scaling their software as a service application because accounts are typically calledtenants. So this talks about the scenario where you start off with one customer and you getanother and another and another, and your database needs to scale as well.

[00:01:01] And some of the questions they propose that you may be thinking about should wemove off of the cloud and back to an on premise solution, potentially a proprietary databaseappliance? Because you're looking at 480gb of a memory instance on AWS, or you're looking atother cloud providers to see what can give you either a better deal or perhaps they offer a largermemory size. And then they talk. About? Do you want to consider manually sharding at theapplication layer to be able to send queries to different database instances and all the issuesthat can potentially come from that and how difficult it can be to do?

[00:01:49] Or do you want to migrate from a relational database to a NoSQL database becausepresumably those scale better. But of course they have a lot of downsides and of course theirultimate solution is adopt a solution that takes care of sharding for you at the database layer.And that's pretty much what Citus does. It does sharding of your PostgreSQL instances. So it isdefinitely a blog post about the Citus data solution, but there are some other considerations thatshould be taken into account or other potential ways of doing it. So for example, if your limit isthe 488gb of memory, well, your target is you want to try to keep a cache hit rate of around 99%.

[00:02:37] Can you do things to reduce the in memory working set? So, for example, can you dothings like partitioning your table data? Can you create summary tables that allow users who areusing your application to be but only pull back summarized data quite efficiently? And then thatcan be a much smaller size of data versus having all your history that has to be aggregatedwhen that type of data is requested. So there's some other steps you can use using stockpostgres to be able to extend the point at which you will have to do sharding. Now, I think thereis a stage at which you'll have to do it. But one thing this blog post does not go into which itmakes sense because they're wanting people to use their platform. But there's other steps youcan take to be able to extend your runway essentially before you have to do sharding. But thiswas an interesting blog post and I myself was thinking through, okay, what are some of thethings I would do before choosing to take the step to do sharding? Or what other options arethere available other than Cytus? And we're going to look at some future blog posts that kind oftalk about some of the different solution using farden data wrappers and some things of thatnature in some of the future stories that you could potentially put together a semi shardingsolution. Another way to potentially scale that wasn't mentioned in here, that people typically dois set up multiple read replicas because generally your master or your primary database takes allthe writes, but most applications are read heavy. So if you can have multiple read replicas, thatcan help again extend the runway before the point at which you have to do sharding.

[00:04:17] The next post is actually a YouTube video. Again, this is from Citize Data and it's fromOsgun Erdogan and the title is SQL Scaling and what's unique about PostgreSQL. Now, thiswas an interesting title and the content of it I didn't really feel reflected what's in the title.Essentially the content from my perspective is that he was talking about primarily postgres'ssupport for extensions and how that really empowers Postgres to fill a lot of use cases.

[00:04:55] And they have firsthand experience with this because I believe initially Citus, theirSharding solution was a fork of postgres, but then they re engineered it to be a extension ofpostgres. So essentially they tie into the postgres ecosystem using these extensions --and you can add more extensions to your system to be able to expand its capabilities. So it'sa very interesting presentation and I definitely suggest you learning about extensions because ifyou like postgres and you have a particular need that you need it to fill but it's not quite there yet,perhaps you could develop an extension to fill that need yourself. Now, some of the other areasthat it talked about in additions to extensions is they talked about the concept of relationaldatabases can't do this and they talk about process semi structured data. And of course theanswer to that is their support for XML data types, JSON data types, and particularly JSON Bdata types. The binary storage format for JSON, the one number two run geospatial workloadswell. There's PostGIS, which itself is an extension and it adds all these functions and capabilitiesto be able to do geospatial analysis. Then they're talking about non relational data storage and alot of this can be handled by extensions or foreign data wrappers. For example, they mentioned,I believe, a foreign data wrapper they worked on that does support for column storage. But youcan also, I believe he said that there are over 100 foreign data wrappers. So you could doaccess to S three, you can interact with Oracle, you can interact with MongoDB DynamoDB.Basically you can interact with all these other databases using these foreign data wrappers,bringing data in or potentially pushing data down into those storage systems to be able to workwith them. And there's a future blog post that talks a little about the foreign data wrappers andsome considerations with regard to them coming up in a post or two. And then Ford talked aboutscale out for large data sets. So of course they're emphasizing the Citizens Data solution forscaling out. But again, whenever I see this, you also have to think that it is on the horizon forcore PostgreSQL to do sharding or scaling out. Again, utilizing the concept of foreign datawrappers and pushing out the computations down to foreign PostgreSQL instances. But overall,this wasn't too long a YouTube video, so I definitely suggest you check it out.

[00:07:36] The next article is the Software Engineering Radio episode number 328 featuringBruce Mom Jin on the Postgres query Planner. So this is about an hour long, over an hour longepisode that talks about the query planner. I would say about the first 18 minutes were prettybasic. So if you are more experienced with or at an intermediate or too advanced level, maybeyou want to consider starting listening around the 18 minutes mark. Or if you are more of anewbie, you could start from the beginning. But it talks about considerations of the query plannerand how it works in the internals. So basically talking about SQL is essentially a declarativelanguage rather than an imperative language and therefore it's the planner's job to interpret whatyou want to achieve and then putting it into specific instructions and he goes into and talks abouthow it does that and some considerations. So it's definitely an interesting episode to listen to,particularly if you want to kind of get a little bit more insight into how the, I guess magic works.

[00:08:46] The next post is called Planning Queries involving Foreign PostgreSQL Tables. Andthis is from the Walking with the Elephant blog and this is a good complement to the previousRadio episode because he talks a little, it goes into more depth about cost based optimization interms of what the query planner is doing. So it's a great article to read after that to kind ofcement your knowledge of what was discussed. And then he goes into talking about the thingsthat make foreign tables different. So you don't want to necessarily constantly collect statisticsfrom these foreign data sources because that would incur a pretty large cost in terms of networktraffic and having to do it periodically. So he goes over considerations you need to take intoaccount with regard to that. And then he also goes over a few settings. So there's use foreignestimate, use remote estimate, and just some considerations, how you set these differentparameters and the effect on postgres's behavior in terms of at this stage of development. Onescenario gives you a very small planning time but the execution time is a little longer. But whenuser remote estimate is enabled you actually get a much longer planning time where yourexecution time is shorter.

[00:10: --10] So there's not an ultimate solution to getting optimized. It's mostly a blog post talking aboutsome considerations and things to think about. And potentially this resolution will be coming in afuture postgres version, but it's definitely a blog post talking about some of the existing issues orpotentially gotchas you have to be aware of if you're going to be starting to use foreign datawrappers. But definitely an interesting blog post to check out.

[00:10:37] The next blog post is static and dynamic pivots. So it's basically talking about a pivottable or what I've frequently heard them called is a cross tab table, basically a table. So forexample, maybe you have a table like this that has by year and rain days and you want to pivotit so that your column headers are the years and then it shows the rain days down here.Historically in other databases I've used, I've used essentially case statements to achieve thisand that's what he's talking about, a static pivot because you can use case statements toachieve this in PostgreSQL, although they also offer the filter function as of PostgreSQL 9.4. Soyou can use some type of aggregate like a sum or account and then use this filter command todevelop a static pivot or a cross tab.

[00:11:31] And he also mentions there is a table function extension that provides a cross tabfunction as well. So that is a potential solution you could use. But of course this article also talksabout dynamic pivots and he actually goes into detail about how to potentially do dynamic pivotswith your data. So if you are interested in cross tab reports and having it generated on the SQLside as opposed to trying to do something on your client, definitely a blog post to check out.

[00:12:03] The next post is gracefully scaling to 10,000 PostgreSQL connections for $35 amonth. Part Two So last week we talked about part one. This is part two of the article. Again, thearticle is essentially about Pgbouncer and using it again doesn't really mention what the $35 amonth is, but this second version actually goes more into the configuration of PG bouncer andthe main areas that you need to be concerned with. Generally you want to set up yourdatabases section so that you can funnel the connections to the correct PostgreSQL instance,suggesting enabling the transaction pool mode and then configuring max client connections,MaxDB connections and some additional parameters that you want to adjust. So it's a bit shorterthan previous article but they are saying there will be a part three article coming up. But if youare in the process of configuring PG bouncer, definitely an article to check out. The next articleis architecture and tuning of memory in PostgreSQL databases. This goes over a little bit ofmemory architecture. First talking about how you essentially have your database cluster and thathas one common area of shared memory and your shared buffer pool is located there, your wallbuffers there and your commit log is located there in terms of things that are memory resident.And then each process, so you could imagine there's multiple processes, has a temp buffers,work memory and maintenance work memory. So after describing each of these areas and theirpurpose, it then goes into okay, how would you configure these and gives you recommendationsfor shared buffers which is generally 25% of the system memory, how to configure work memoryappropriately, the maintenance work memory, effective cache size and the temp buffers. So notonly does it give you a little bit of insight to how memory is structured and works in PostgreSQL,but also gives you suggestions on how to configure those.

[00:14:11] The next post is upgrade your partitioning from Inheritance to declarative. And this isfrom the Walking with the Elephant blog. Now they're talking here. Prior to PostgreSQL Ten, theonly way to do partition tables was through inheritance. So you had a parent table and youmanually created the child tables and triggers to be able to direct data to the appropriatepartition. However, in version ten, they introduced declarative partitioning. So like I've set up thepartitioning through inheritance and this blog post goes about how you would want to transitionthat to declarative partitioning because it just makes the management easier and there's lessthat you would need to do. So this blog post talks about moving through it from taking a backup,starting the transactions, doing the things th --at you need to do to efficiently alter your partitioning scheme from an inheritance basis to adeclarative basis. So it's definitely a great blog post to check out if you have partition tables andyou want to make the transition or the switch.

[00:15:18] The next post is PostgreSQL extended statistics.

[00:15:22] So this is a new feature in postgres Ten. So most of the statistics by default arecollected at the basis of individual columns. But you may have cases where there isdependencies between columns. For example, in the table that mentioned here you have a daycolumn, a quarter column and a year column where there are dependencies between each ofthese. So this basically talks about the new way to collect statistics across columns withextended statistics. So basically you can use create statistics to create dependencies betweenparticular columns and you can use the special view PG underscore statistic underscore ext tolook at the data about this statistic. And this should enable the planner to give you better queryplans for these particular columns. So if you're wanting to use this new feature, definitely a blogpost to check out.

[00:16:17] The last article is PostgreSQL concurrency. Data modification language. Now it talksabout a concurrency, but a lot of the post is basically about DML or Data Modification Language.So this is a pretty basic Post, so it goes into developing a Data Model, inserting Data usingInsert Into or Insert Select, but it goes into a little bit of the background in terms of howPostgreSQL works under the COVID So they're talking about the Update statement and then itsimpact on concurrency because PostgreSQL uses NVCC and how it does Row locking in.Essentially every update creates a new Row that it then has to vacuum the old one and thenaddressing the delete command and how that works. So this is definitely a more introductorytype Post, but if you're wanting to learn more about the Data Modification Language and how itworks with PostgreSQL, definitely a blog Post to check out that does It. For this episode ofScaling Postgres, you can get links to all the content presented in the Show notes be sure tohead over to Scalingpostgres.com, where you can Sign up to receive weekly notifications ofeach episode, or you could subscribe via YouTube or itunes. Thanks. --

100K Tenants, Extensions, Query Planning, Crosstabs | Scaling Postgres 19

Join Over 1,000 Engineers & Get New Episodes Weekly!

Content Discussed

YouTube Video

Podcast Audio

Transcript