Faster Partitions, Count Speed, Extensions, Education | Scaling Postgres 82
In this episode of Scaling Postgres, we discuss faster partition performance, count speed, extensions and Postgres education.
Content Discussed
- PostgreSQL 12: Partitioning is now faster
- What’s Faster? COUNT(*) or COUNT(1)?
- PostgreSQL Extensions - A Deeper Look
- PostgreSQL Top Learning & Training Resources
- [YouTube] Postgres Open
- Postgres Cheat Sheet
- Postgres Execution Plans — Field Glossary
- The Art of PostgreSQL: The Transcript, part I
- SQL ... and now GQL
- Manage Linux control groups in PostgreSQL with pg_cgroups
- PostgreSQL 12 beta 4 up and running in less than six minutes
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 12: Partitioning is now faster". This is from the 2ndquadrant.com blog. The first thing they talk about is COPY performance: bulk loading of data into a partitioned table using COPY can now make use of bulk inserts, whereas previously only one row was inserted at a time. They show a scale going from 2 to 8,192 partitions, comparing versions 11 and 12 on the number of seconds to COPY 100 million rows. As you can see, 12 was noticeably faster. There are diminishing returns as the partition count goes higher, and they attribute that slowdown to how the COPY code sets up as many as 1,000 tuple slots per partition. So there may be some use cases where you won't see the full effect, but it is definitely a big improvement for partitions in general, not just with larger numbers of partitions.
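As a minimal sketch of the kind of setup being benchmarked (the table, partition, and file names here are made up for illustration):

```sql
-- Hypothetical range-partitioned table with monthly partitions.
CREATE TABLE measurements (
    created_at  timestamptz NOT NULL,
    reading     numeric
) PARTITION BY RANGE (created_at);

CREATE TABLE measurements_2019_09 PARTITION OF measurements
    FOR VALUES FROM ('2019-09-01') TO ('2019-10-01');
CREATE TABLE measurements_2019_10 PARTITION OF measurements
    FOR VALUES FROM ('2019-10-01') TO ('2019-11-01');

-- Bulk load through the parent; on PostgreSQL 12, rows destined for the
-- same partition can be inserted in batches rather than one at a time.
COPY measurements FROM '/tmp/measurements.csv' WITH (FORMAT csv);
```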
The next thing they talk about is INSERT performance. In version 11, every partition was locked when you did an INSERT, whether or not it received a new record. But with version 12, quote "...we now lock a partition just before the first time it receives a row". So if you're inserting just one row, only one partition is locked. As the number of partitions goes from 2 to over 8,000, the transactions per second hold up well; the two versions are comparable up to about 32 partitions, after which the difference appears, and it becomes huge as you add more and more partitions. In version 12, you can see there's essentially no penalty as you increase the number of partitions on a parent table. So that's a huge improvement if you have a lot of partitions in your partitioned tables.
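To illustrate the locking change, here's a hypothetical single-row INSERT against a parent table `measurements` range-partitioned by `created_at` (the names and values are made up):

```sql
-- Routed through the parent table: on PostgreSQL 11 this would lock every
-- partition; on 12, only the partition that actually receives the row is
-- locked, just before it gets its first tuple.
INSERT INTO measurements (created_at, reading)
VALUES ('2019-09-15 12:00:00+00', 42.0);
```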
Then they talk about SELECT performance. They mention, of course, that PostgreSQL 11 added partition pruning, and version 12 makes it a whole lot better. You can see what "a whole lot better" means: the transactions per second stay fairly consistent, although there's a bit of a drop-off at the 4,000 to 8,000 partition level, but it knocks the socks off version 11. So version 12 is much improved in this area. Then they go over some other partitioning performance improvements: ordered partition scans, getting rid of single-subplan Append and MergeAppend nodes, and various improvements to run-time partition pruning.
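As a sketch of what partition pruning does (assuming the hypothetical `measurements` table is range-partitioned by month on `created_at`):

```sql
-- Filtering on the partition key lets the planner prune partitions that
-- cannot contain matching rows; the EXPLAIN output should show only the
-- relevant partition(s) being scanned.
EXPLAIN SELECT *
FROM measurements
WHERE created_at >= '2019-09-01' AND created_at < '2019-09-08';
```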
There are some particular use cases that 12 has improved as well, but as you can see, with larger partition counts there can still be some issues, which of course they'll keep working on. Even with this drop-off, they note that if they change the WHERE clause slightly, swapping out the STABLE function call for a constant, the planner is able to take care of the pruning. So there are still some use cases where your mileage may vary, but overall there are a lot of great improvements to partitioning. It seems like Postgres just keeps getting better and better in this regard.
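The WHERE-clause swap they describe looks something like this (illustrative; the `measurements` table and its partitioning are assumed):

```sql
-- now() is a STABLE function, so its value isn't known when the plan is
-- built; depending on the plan, pruning may only happen at run time:
SELECT * FROM measurements
WHERE created_at >= now() - interval '1 week';

-- With a constant instead, the planner can prune unneeded partitions
-- at plan time:
SELECT * FROM measurements
WHERE created_at >= '2019-09-12 00:00:00+00';
```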
The next post is "What's Faster? COUNT(*) or COUNT(1)?". This is from blog.jooq.org. The post says there shouldn't be much of a difference between the two, and you don't see much of a difference with MySQL, Oracle, and some other databases, but with PostgreSQL you actually do see a small difference. So what is the difference? COUNT(*) counts all tuples in a group. COUNT of an expression counts all tuples in a group for which the expression evaluates to something that is NOT NULL. That comes in useful when doing counts over outer joins, and also when counting a subset within a group using a CASE expression, or a FILTER clause in PostgreSQL.
Now, he also benchmarks various database systems. For MySQL, Oracle, and SQL Server it doesn't really matter, but it does matter for PostgreSQL: COUNT(*) was consistently faster, by around 10% on 1 million rows, which was more than he had anticipated. Someone in the comments mentioned that they saw something similar, too, so it's definitely something to be aware of. But he does note, quote, "Do note that the benchmark only tried a very simple query". So as your counting queries get more sophisticated, perhaps using HAVING or ORDER BY, you may not see this behavior; different things can affect it. But it's something interesting to observe and remember, and worth checking whether you see this kind of difference in your own queries.
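To illustrate the NULL-counting behavior the post describes (the authors/books schema here is hypothetical):

```sql
-- COUNT(*) counts rows; COUNT(expr) skips rows where expr IS NULL.
SELECT a.name,
       COUNT(*)    AS joined_rows,  -- an author with no books still yields 1
       COUNT(b.id) AS book_count    -- NULL b.id from the outer join is skipped
FROM authors a
LEFT JOIN books b ON b.author_id = a.id
GROUP BY a.name;

-- Counting a subset with PostgreSQL's FILTER clause:
SELECT COUNT(*) FILTER (WHERE published_year >= 2010) AS recent_books
FROM books;
```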
The next post is "PostgreSQL Extensions - A Deeper Look". This is from jkshah.blogspot.com, and it is based on a presentation embedded there from LinkedIn SlideShare called "PostgreSQL Extensions - A Deeper Dive". Now, this is specifically about Amazon RDS, but I still found the content relevant for anyone using PostgreSQL. It covers what extensions are and some of the common extensions people use: pg_stat_statements is a huge one, PostGIS for spatial support, the Postgres foreign data wrapper, procedural-language JavaScript, and UUID support. He shows the commands to look at the supported PostgreSQL extensions (again, this is on RDS), how to create them, how to list the ones in use, and how to upgrade or remove them. He also goes into different plugins and how to develop extensions. So even though this is an AWS presentation about RDS, if you're interested in extensions and in getting a bit more information about using and developing them, there is still relevant content here, and I suggest you check it out.
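The basic extension-management commands look like this (standard PostgreSQL; on RDS, the available set is restricted to a supported list):

```sql
SELECT name, default_version
FROM pg_available_extensions;               -- what can be installed

-- (collecting statistics with this module also requires it to be in
-- shared_preload_libraries)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

\dx                                         -- psql: list installed extensions

ALTER EXTENSION pg_stat_statements UPDATE;  -- upgrade to the default version
DROP EXTENSION pg_stat_statements;          -- remove it
```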
The next post is "PostgreSQL Top Learning & Training Resources". This is from severalnines.com. This is basically a list of places to get more information for learning PostgreSQL, and of course the number one resource is the PostgreSQL manual, the online manual available on postgresql.org. They also mention the mailing lists, which you can join to follow announcements and communicate with others in the community. The post covers partners and software tools, official PostgreSQL webinars, a list of the different PostgreSQL events taking place, local and regional user groups, and Planet PostgreSQL, which aggregates different blog posts. So it's a bunch of different resources to learn more about PostgreSQL.
Continuing with the theme of PostgreSQL education, Postgres Open just posted 30 to 40 different presentations from this year's conference. So if you're looking for more PostgreSQL education material, definitely check out the Postgres Open YouTube channel for more content.
Still going with the theme of education, there is a website that has come up called postgrescheatsheet.com, a cheat sheet for accomplishing various tasks in PostgreSQL. For example, under tables it tells you how to show a table (\d followed by the table name), how to show more detail, and how to get the list of tables in the current schema. Under users it tells you how to set or change a user's password. So there are all sorts of quick cheat-sheet commands you can use for PostgreSQL. This was built by the team at TimescaleDB, so it's definitely an interesting resource to check out and keep on hand in case you need to look something up quickly.
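A few of the kinds of commands the cheat sheet collects (psql meta-commands plus SQL; the user name here is illustrative):

```sql
\dt                     -- list tables in the current schema
\d table_name           -- show a table's columns, types, and indexes
\d+ table_name          -- the same, with extra detail
\du                     -- list roles/users

ALTER USER someuser WITH PASSWORD 'new_password';  -- set or change a password
```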
The next post is another educational resource called "Postgres Execution Plans - Field Glossary". This is from the pgMustard blog on medium.com. This defines each of the different things you see when looking at execution plans, and it is truly a glossary. It covers the node type, parent relationships, what a filter is, parallel aware, relation name, and alias; the estimate fields such as total cost, startup cost, plan rows, and plan width; and the actual-value fields such as actual loops and actual total time. So this is a good resource if you're running EXPLAIN ANALYZE on a query and want a bit more information about what you're seeing in the data returned. It would be great if it had some small examples to highlight these, but it was intended to be a glossary, and that's exactly what it is. So if you're interested in this, definitely a blog post to check out.
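As a quick orientation for those fields, here's the general shape of a line of EXPLAIN ANALYZE output (the numbers below are placeholders, not real measurements):

```sql
EXPLAIN ANALYZE SELECT * FROM some_table WHERE amount > 100;

-- A node in the output generally looks like:
--   Seq Scan on some_table  (cost=0.00..35.50 rows=850 width=16)
--                           (actual time=0.011..0.220 rows=812 loops=1)
--     Filter: (amount > 100)
-- cost=startup..total, rows, and width are the planner's estimates;
-- actual time, rows, and loops are measured during execution.
```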
Still on the education track is "The Art of PostgreSQL: The Transcript, part I". This is from theartofpostgresql.com. It is a text version of the presentation that Dimitri Fontaine gave at Postgres Open, one of the videos uploaded to the Postgres Open YouTube channel. This presentation actually covers more of the "why PostgreSQL" question; in other words, why would you use PostgreSQL? It talks about PostgreSQL's relational database roots and how it fits into that history. So if you're interested in this content, definitely a blog post to check out.
The next post is "SQL ... and now GQL". This is from the Pulse blog on linkedin.com. It talks about making GQL an official standard alongside SQL; the subtitle calls it a standard query language for property graphs. This is a little outside PostgreSQL's bailiwick, but staying on topic, it's pretty interesting, and of course it brings up questions in my mind: what kinds of support might come to PostgreSQL with this potentially new language, GQL? So if you want more information about this graph query language, this is a great post to check out.
The next post is "Manage Linux control groups in PostgreSQL with pg_cgroups". This is from cybertec-postgresql.com, and it talks about Linux control groups (cgroups); they have a PostgreSQL plugin, pg_cgroups, to make managing them easier. Now, the first thing they discuss is what cgroups are good for. Basically, cgroups put a constraint around processes running on a Linux system. So if you have just one PostgreSQL cluster running on a machine and you want it to have the full resources of that machine, you wouldn't want to use cgroups.
But if you have multiple clusters running and you want to restrict their access to certain resources on the machine, then cgroups may be something you want to use, and pg_cgroups, as they discuss here, makes that easier. They've set up a project page for it, and they go through the steps of how to set it up. Once it's set up, it supports, as they say here, limits on memory consumption, CPU time, and I/O bandwidth. They say one particularly nice feature is that this is the only way to limit the amount of file system cache used for PostgreSQL. So if you have a use case that would benefit from constraining the resources PostgreSQL uses on a machine, definitely a blog post to check out.
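Setup follows the usual pattern for a loadable module; sketching it from the post's description (the limit parameter shown is illustrative; check the project page for the actual setting names):

```
# postgresql.conf sketch
shared_preload_libraries = 'pg_cgroups'  # load the module at server start
pg_cgroups.memory_limit = 1024           # e.g. cap memory use; value and name illustrative
```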
The last post is "PostgreSQL 12 beta 4 up and running in less than six minutes". This is from fluca1978.github.io. He's using pgenv, which is an environment builder for Postgres, much like pyenv for Python or rbenv for Ruby. It builds Postgres environments, and you can use this utility to build version 12 beta 4 quite quickly, set it as the version in use, and then start up the database. So if you have a need to run different versions, this is a quick and easy demonstration of how you could use pgenv to accomplish that.
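The pgenv workflow he demonstrates is roughly the following (command names per the pgenv README; exact versions and behavior may vary):

```shell
pgenv available          # list versions that can be built
pgenv build 12beta4      # download and compile PostgreSQL 12 beta 4
pgenv use 12beta4        # make it the active version
pgenv start              # start the server
```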