Global Indexes, Caching Aggregates, Vacuum Processing, Effective Cache Size | Scaling Postgres 91
In this episode of Scaling Postgres, we discuss global indexes, ways to cache aggregates, how vacuum processing works and the purpose of effective cache size.
Content Discussed
- Proposal for Global Indexes in PostgreSQL
- Pulling the trigger: How to update counter caches in your Rails app without Active Record callbacks
- An Overview of VACUUM Processing in PostgreSQL
- effective_cache_size: What it means in PostgreSQL
- [YouTube] PostgresConf South Africa Channel
- Which size fits me?
- 12 Common Mistakes and Missed Optimization Opportunities in SQL
- Similarity in Postgres and Rails using Trigrams
- Enhancing PostgreSQL 12 Security with the CIS Benchmark
- PostgreSQL Deployment & Configuration with Puppet
- OGR FDW Spatial Filtering
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Proposal for Global Indexes in PostgreSQL". This is from the percona.com blog. Basically, PostgreSQL does not have global indexes for partitioned tables; the indexes actually exist on each individual partition. For explanation purposes, they're calling these local indexes in this post. But according to this post, they've sent a proposal to the community to add global indexes. So this is, as they describe it, a one-to-many index in which one index maps to all the different partitions. Now, one area that I think is incredibly important for data integrity is being able to have a unique constraint that crosses all partitions.
I've run into a situation where I wanted a unique constraint on a column that is not part of the partition key, and you can't really do it: you can set a unique constraint on each local partition, but you can't have a global one. If there were a global index, presumably that would be supported; there's a minimal sketch of the limitation below. The other area they're considering is performance: when queries cross partitions, you could potentially get better results with a global index. They also talk a little bit about the proposed syntax and some changes that would be needed. Now, one disadvantage they mention is that you're probably going to take a hit on writes once this is implemented. But this is an interesting feature that seems to be in discussion, so we'll see what comes of it.
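As a minimal sketch of that unique-constraint limitation (the table and column names here are hypothetical, not from the post), consider a range-partitioned table:

```sql
-- A range-partitioned table with hypothetical names:
CREATE TABLE measurements (
    id          bigserial,
    device_id   int NOT NULL,
    recorded_at timestamptz NOT NULL
) PARTITION BY RANGE (recorded_at);

-- This fails, because the constraint does not include the partition key:
-- ALTER TABLE measurements ADD CONSTRAINT measurements_id_key UNIQUE (id);
-- ERROR:  unique constraint on partitioned table must include all
--         partitioning columns

-- This works, but it only guarantees uniqueness of (id, recorded_at),
-- enforced locally within each partition:
ALTER TABLE measurements
    ADD CONSTRAINT measurements_id_ts_key UNIQUE (id, recorded_at);
```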
The next post is "Pulling the trigger: How do you update countercaches in your Rails app without Active Record callbacks". This is from evilmartians.com. So because it's talking about Active Record and Rails, this is about Ruby on Rails, but a lot of it pertains to PostgreSQL database and triggers, locking, and things of that nature. So they have a scenario where they have an orders table and users and they want to collect aggregates on them. So how many total orders per user for example. They filled up a database with 10,000 users and 100,000 orders. Basically, the performance is not what they would hope for when you're just querying it. So their idea was to essentially cache or create a user_stats table that, by each user, stores the total orders amount and the total orders count. Now with Ruby on Rails, one method you could use is Active Record callbacks or counter_caches to do it.
But as they say here, that really won't work for the amount portion, so they wanted to use a trigger. They're using a Ruby gem called hair_trigger to actually implement it, which just makes things a little easier to do, but the raw SQL code is shown in the post. The trigger inserts into the user_stats table, summing the orders amount and counting the orders, and if there is a conflict on the user ID, meaning that record is already there, it does an update and sets the new orders_amount and orders_count. But they ran into a problem: when they created a job that ran the updates on four separate threads, they hit race conditions. Now, they mention they could change the isolation level, but what they actually wanted to do was use a transaction advisory lock.
These are application-level locks, and all they had to do was take a lock keyed on the user ID; with that in place, they were able to do these INSERTs or UPDATEs and avoid the race condition they were encountering before. They also have another approach, which they call the lock-free alternative, using deltas. Essentially, they do an insert, and on conflict the update just adds the new order's amount and count to the existing totals. They say that as long as you don't use any subqueries, race conditions are not possible. So these are two possible implementations for caching aggregates in your PostgreSQL database.
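Here is a rough sketch of the two approaches, not the post's exact code, assuming a user_stats table with user_id, orders_count, and orders_amount columns as described; the literal values are placeholders:

```sql
-- 1) Advisory-lock approach: serialize updates per user with a
--    transaction-scoped advisory lock, then recompute the aggregates.
BEGIN;
SELECT pg_advisory_xact_lock(42);  -- 42 stands in for the user's id
INSERT INTO user_stats (user_id, orders_count, orders_amount)
SELECT user_id, count(*), sum(amount)
FROM orders
WHERE user_id = 42
GROUP BY user_id
ON CONFLICT (user_id) DO UPDATE
SET orders_count  = EXCLUDED.orders_count,
    orders_amount = EXCLUDED.orders_amount;
COMMIT;  -- the advisory lock is released automatically at commit

-- 2) Lock-free delta approach: no subqueries, just add the new
--    order's values onto the running totals.
INSERT INTO user_stats (user_id, orders_count, orders_amount)
VALUES (42, 1, 99.00)
ON CONFLICT (user_id) DO UPDATE
SET orders_count  = user_stats.orders_count + 1,
    orders_amount = user_stats.orders_amount + EXCLUDED.orders_amount;
```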
The next post is "An Overview of VACUUM Processing in PostgreSQL". This is from severalnines.com blog and it just gives a general overview of vacuum and kind of why it exists. So the first thing that they mention here is that PostgreSQL does not do in-place updates. So when there's a DELETE, it actually marks the row as dead and it must be vacuumed later. On an UPDATE, the current row is marked as dead, but a new row is inserted in its place essentially like a DELETE and then a new INSERT. Because of this, these rows must be vacuumed up once they're no longer visible to any other sessions connected to the database. Along with it, they describe the visibility map that defines what tuples on pages are still visible to existing transactions.
They talk a little bit about the free space map and the importance of freezing old tuples so they remain visible as transaction IDs wrap around. Then they go through vacuum's responsibilities: scanning the pages of tables to find the dead tuples, freezing old tuples, removing the index entries that point to dead tuples, removing the dead tuples from each page and defragmenting the remaining live tuples, updating the free space map and the visibility map, truncating the last page when possible, and updating all corresponding system tables. So it's a fair amount of work that vacuum has to do. They also mention that this does not, of course, actually free up space on disk. It makes space available for new INSERTs within the database itself, but you're not going to save disk space.
However, a VACUUM FULL, which rewrites the table, does free up disk space. The disadvantage, of course, is that it takes an exclusive lock on the relation, which prevents any SELECTs, UPDATEs, INSERTs, and DELETEs from happening. So generally you don't want to do this on a running production database; there are certain specialized tools or extensions you can use to do the equivalent of a VACUUM FULL without that lock. They then discuss the purpose of autovacuum and go into a little more detail on VACUUM and VACUUM FULL. So if you're interested in learning a bit more about vacuum and its processing, definitely a blog post to check out.
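For reference, a quick sketch of the commands discussed ("mytable" is a placeholder name):

```sql
-- Plain VACUUM marks dead tuples reusable but does not shrink the file
-- on disk; VERBOSE and ANALYZE report progress and refresh statistics:
VACUUM (VERBOSE, ANALYZE) mytable;

-- VACUUM FULL rewrites the table and does return space to the OS, but
-- it takes an ACCESS EXCLUSIVE lock, blocking reads and writes:
VACUUM FULL mytable;
```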
The next post is "EFFECTIVE_CACHE_SIZE: WHAT IT MEANS IN POSTGRESQL". This is from cybertec-postgresql.com. This is a setting in your postgresql.conf file that helps give the database insight into how much memory is on the system that's potentially available for file system caching. Because the greater the amount, the more likely an index will be in memory and the more likely PostgreSQL will use an index scan. So they had a scenario they imagined here that you have a database system with 100 gigs of RAM and said okay, maybe 2 GB is the operating system, 3 GB for PostgreSQL.
Then you set shared_buffers, in their example, to 24 GB, and pretty much whatever's left, at least on a Linux system, would typically be used for the file system cache. They then show the actual comments from the C code explaining how, as database memory sizes grow, a larger effective_cache_size increases the probability that an index scan will be used, because more of the file system cache can hold index data; the setting is simply a way to communicate to PostgreSQL the rough size of that cache. So if you're interested in learning more about this setting, definitely a blog post to check out.
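As a hedged sketch of what that might look like for their 100 GB scenario (the 70GB figure is an illustrative estimate, not a recommendation from the post):

```sql
ALTER SYSTEM SET effective_cache_size = '70GB';  -- roughly shared_buffers
                                                 -- plus the OS file cache
SELECT pg_reload_conf();  -- this setting only needs a reload, not a restart
SHOW effective_cache_size;
```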
The next piece of content is actually a YouTube channel: they have just released about 20 or so videos from PostgresConf South Africa, which happened in October 2019. This is on the PostgresConf South Africa YouTube channel. So definitely some additional content to check out with regard to PostgreSQL.
The next post is "It's all in database". This is from rafiasabih.blogspot.com and it talks about sizing and how to determine the size of your different tables or relations in PostgreSQL. He talks about the \dt+ table name command to give you an indication of its size. You can also use the system table's pg_table_size and pg_relation_size. They do note that table size won't give you the size of the indexes and for that, you have to do pg_total_relation_size. And then he also mentions an extension called pgstattuple to give you several additional metrics if that's of interest to you, particularly if you're trying to find bloat or things of that nature. So if you're interested in different methods of determining the size of your database, definitely a blog post to check out.
The next post is "12 Common Mistakes and Missed Optimization Opportunities in SQL". This is from hakibenita.com, and this post describes exactly that. The 12 are to be careful when dividing integers. Guard against division by zero errors, which both seem general programmatic things to be aware of in any language. Know the difference between UNION and UNION ALL. Basically, UNION should be excluding duplicates. Be careful when counting NULLable columns. Be aware of time zones. Avoid transformations on indexed fields. Basically, if you're doing a function like lower or upper to it, that type of function may not be used on the index unless it's an expression index, for example. Use BETWEEN only for inclusive ranges to avoid this issue. I generally only use greater than or less than I usually don't use the BETWEEN syntax.
Add faux predicates; basically, look for indexed columns you can add to the WHERE clause to get better performance. Inline CTEs, whose default behavior recently changed in PostgreSQL 12. Fetch only what you need, which is general good advice. Reference the column position in GROUP BY and ORDER BY; you can use integers to reference the column position instead of repeating the exact name. And format your query; this is a personal preference, but some guidelines are mentioned. A lot of these items are on the simple or basic side; a few are sketched below, and if you want to become more familiar with them, definitely a blog post to check out.
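A few of those pitfalls sketched out (the tables and columns here are hypothetical):

```sql
-- Integer division truncates; use a fractional operand to avoid it:
SELECT 1 / 2;    -- 0
SELECT 1 / 2.0;  -- 0.5

-- Guard against division by zero with NULLIF (a zero divisor yields NULL):
SELECT amount / NULLIF(quantity, 0) FROM order_items;

-- A function on an indexed column defeats a plain index unless you
-- create a matching expression index:
CREATE INDEX users_lower_email_idx ON users (lower(email));
SELECT * FROM users WHERE lower(email) = 'a@example.com';
```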
The next post is "Similarity in Postgres and Rails using Trigrams". This is from pganalyze.com. Now, a lot of this is actually based upon Rails, but they discuss Trigrams and the pg_trgm extension and they give a rough Ruby implementation, but then they go into doing Trigrams in Rails. They enable the pg_trgm extension and then use essentially the select count to get the similarity to what word is entered against cities like Torono versus cities. So if you want to learn more about the pg_trgm extension, definitely a blog post to check out.
The next post is "Enhancing PostgreSQL 12 Security with the CIS Benchmark". This is from crunchydata.com and this is a center for Internet Security benchmark that they've been publishing and it has now been released for version 12. So if you want to find out more information with regard to that, definitely check out this.
The next post is "PostgreSQL Development & Configuration with Puppet". This is from severalnines.com. So if you're looking for a configuration management tool to be able to deploy your PostgreSQL instances, definitely a blog post to check out. Personally, I use Ansible, but Puppet is also a viable alternative.
The last post is "OGR FDW Spatial Filtering". This is from the cleverelephant.ca blog. The headline is "The OGR FDW now pushes spatial filters down to remote data sources!". So this is definitely related to Postgres, and I'm an area I'm not too knowledgeable about, but I know they're talking to foreign data wrappers and OGR is apparently a data model. So there is a way to push down queries to the data source that you're connecting to that presumably would result in better performance. So if you're interested in that definitely a blog post to check out.