Indexing Documents, GIN Indexes, Data Science, Generated Columns | Scaling Postgres 71
In this episode of Scaling Postgres, we discuss indexing documents, how GIN indexes work, using Postgres for Data Science and generated columns.
Content Discussed
- Indexing Documents for Full Text Search
- PostgreSQL's indexes - GIN
- PostgreSQL's indexes - GIN algorithms
- Deep Dive Into PostgreSQL Indexes Webinar: Q & A
- GIN and GiST Index Types
- PostgreSQL meets “Data Science and AI”
- How the CIS Benchmark for PostgreSQL 11 Works
- Generated columns in PostgreSQL 12
- Postgresql Interval, Date, Timestamp and Time Data Types
- PgBouncer changelog
- [Vimeo] Database as API with PostgreSQL and Massive.js - Dian Fay
- [YouTube] PGConf.Russia
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content this week is "PostgreSQL Deep Dive: PostgreSQL Defaults and Impact on Security - Part 1". This is from crunchydata.com. There's also a second post, part two, that we'll be covering as well. So this is indeed a deep dive. It is a very long post talking about the defaults and their impact on security. So basically talking about users, groups, and roles, what they have by default, what's the impact of the public schema, and also a discussion about the situation identified by a CVE that was posted in 2018 and how you can basically protect yourself from it. So he talks about how there are commands to create users, groups, and roles, but really, they're all the same thing. A user or a group is basically an alias for a role.
So when you create a role, you're creating a user or a group. Generally, users have the connect privilege (they mention here to really consider that an attribute), whereas groups generally do not, but essentially they're all considered roles. Now, what's great about this post is the level of education about roles, the public schema, and the search path in terms of what commands to execute. So these are two very important posts and they are very long. So if you're not able to read them now, I would definitely bookmark them and come back to them because they're really, really good from an educational perspective. So he's talking about how roles have basically four types of relevant properties. They have Attributes, which are a capability for a role, for example login, superuser, et cetera. Membership, which is being a member of another role.
Privileges. So can it do a SELECT, INSERT, UPDATE, DELETE, et cetera? Then Settings, a custom value for a configuration variable bound to a role. Now, as a part of this post, they also go over an extension that was created called crunchy_check_access. With roles acting as users and groups, and again the public schema and privileges or default privileges, it can become really confusing what rights someone has. Basically, they've developed this extension to be able to give you insight into what rights a given user, role, or group has. So in this example post, they created a basic database called Deep Dive and then they created a group. Again, these are still just roles, but they created a group using NOINHERIT and NOSUPERUSER, created some different roles, and granted the app_users group to a role, or made it a part of that role.
Then they used the \du command to get the list of roles, what their attributes are, and what they're members of. So again, this really goes through and gives you an education on how you can create these different things and then what the security ramifications are in terms of what privileges certain people have. Then they use the check_access extension to be able to see what they have. Some of the output appears down here. Now, I'm not going to go over everything in this post because it is super long and it would take an immense amount of time to review, but I strongly suggest you check it out and look at it for edification purposes.
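Just to make those four role properties concrete, here's a rough SQL sketch of the idea. The role and table names here are mine for illustration, not the ones used in the post:

```sql
-- A "group" is just a role without LOGIN; a "user" is a role with LOGIN.
CREATE ROLE app_users NOLOGIN;                 -- acts as a group
CREATE ROLE alice LOGIN PASSWORD 'secret';     -- acts as a user (attribute: LOGIN)

-- Membership: make alice a member of app_users.
GRANT app_users TO alice;

-- Privileges: what the group can do on objects (accounts is a hypothetical table).
GRANT SELECT, INSERT ON TABLE accounts TO app_users;

-- Settings: a configuration value bound to a specific role.
ALTER ROLE alice SET work_mem = '64MB';

-- In psql, \du then lists roles with their attributes and memberships.
```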
Now the second part, once you have that basic knowledge down, goes into CVE-2018-1058. First, they mention some concepts about schemas, where he says there's basically a schema called pg_catalog which includes the built-in objects provided by PostgreSQL, and then new databases have a schema called public. There's the concept of the search path where, quote, "PostgreSQL searches the system catalog schema, the pg_catalog first" and then it looks in the user schema (essentially the session user name) and then the public schema. With regard to the security issue, you can create a function that has the same name as an existing one but accepts slightly different inputs.
So the built-in one can take text, but, for example, the one you create here can take varchar. So you could actually create an object in the public schema with the same name as an existing built-in function in pg_catalog, but then you can override that functionality to escalate your privileges. So they show a basic example here where someone created a function called lower, which gives you lowercase, but they used varchar as the input and they did a harmless SELECT of 'so and so was here'. Now when another user runs this, the search path will eventually look in the public schema and potentially run that version, and then they go over the full example of what's possible in terms of escalating privileges with this.
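Here's a minimal sketch of that shadowing idea, assuming the attacker can create objects in the public schema; the exact example in the post may differ slightly:

```sql
-- pg_catalog.lower(text) is built in; this shadow version takes varchar instead.
CREATE FUNCTION public.lower(varchar) RETURNS text AS $$
    SELECT 'so and so was here';   -- harmless here, but it could do anything
$$ LANGUAGE sql;

-- Another user with the default search_path ("$user", public) may now
-- unknowingly hit public.lower() instead of pg_catalog.lower(), because the
-- varchar argument is an exact match for the shadow function:
SHOW search_path;
SELECT lower('ABC'::varchar);
```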
Now then they lead on into the fix, and basically one of the main things is to REVOKE CREATE ON SCHEMA public FROM PUBLIC. So prevent people from creating objects in the public schema, essentially. But they say you may also want to REVOKE EXECUTE ON ALL ROUTINES IN SCHEMA public FROM PUBLIC, as well as altering the default privileges for that as well. So again, this is a super good set of posts to help you get an education on roles and permissions and groups and things of that nature. So I highly suggest you check out these two posts.
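For reference, here's my reading of those mitigations as plain SQL; treat it as a sketch rather than the post's exact script:

```sql
-- Stop ordinary users from creating objects in the public schema.
REVOKE CREATE ON SCHEMA public FROM PUBLIC;

-- Stop PUBLIC from executing existing routines in the public schema.
REVOKE EXECUTE ON ALL ROUTINES IN SCHEMA public FROM PUBLIC;

-- And adjust default privileges so future routines created by this role
-- aren't executable by PUBLIC either.
ALTER DEFAULT PRIVILEGES IN SCHEMA public REVOKE EXECUTE ON ROUTINES FROM PUBLIC;
```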
The next post is "The Challenges of Setting max_connections and Why You Should Use a Connection Pooler". This is from richyen.com. So here they're talking about max_connections and of course the default in PostgreSQL is set to 100. So in terms of determining how large this can be for a database or what's the best practice, he says, quote "Talk to any PostgreSQL expert out there and they'll give you a range, 'a few hundred' or some will flat out say 'not more than 500', and 'definitely no more than 1,000'". Then he talks about where these numbers come from. So he actually wanted to do a test. So he used a g3.8xlarge EC2 instance and set up pgbench with a number of clients to run against it. He set the number of concurrent connections to run from 100 up to 5,000 and he set max_connections to 12,000. So, quite a bit on this database.
Here, you can see the graph of transactions per second. So essentially how many transactions it can handle concurrently, and then how the latency increases. But it seems like most of the action happens in the beginning part, so he zoomed in on it here. As he says, the optimal performance was when there were between 300 and 500 concurrent connections, and after 700 connections, the performance dropped off precipitously. So again, very close to that tribal knowledge quote "...'a few hundred', 'no more than 500', 'definitely no more than 1,000'". So that's pretty much what this is saying here too. Then he goes over a section called What if we need more connections? There comes connection pooling, so using PgBouncer or Pgpool.
He did some tests similar to Alvaro Hernandez's concurrent-connection test, but he also used the connect flag because he wanted to simulate the cost of building up and tearing down connections. The chart shows what this looked like without a connection pooler and then when he added one in; with a connection pooler, you definitely see increased performance. So basically for max_connections, it looks like a few hundred, probably not more than 500, definitely not more than 1,000 is the way to go. If you need more connections than that, then start looking at a connection pooler such as Pgpool or PgBouncer. But if you're interested in this type of content, definitely a blog post to check out.
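If you want a quick sense of where you stand relative to your own max_connections setting, here's a rough check you can run yourself; this isn't from the post, just a simple way to see the numbers:

```sql
-- Current configured ceiling.
SHOW max_connections;

-- How many connections are in use right now versus that ceiling.
SELECT count(*) AS current_connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;
```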
The next post is "Postgres twelve highlight - More progress reporting". This is from paquier.xyz and he's talking about how we have progress reporting for vacuum, but now they've added more progress reporting in Postgres 12 for CLUSTER activities as well as VACUUM FULL activities. So rewriting essentially the whole table. And that one's called pg_stat_progress_cluster. Then also for REINDEX and CREATE INDEX operations. For that, there's a view called pg_stat_progress_create_index. So definitely some great new features are being added to 12 in order to monitor these two potentially long processes. If you want some more details, definitely a blog post to check out.
The next post is actually a presentation from Bruce Momjian from EDB Postgres talking about major features of Postgres 12. So this goes over partitioning improvements, B-tree improvements, most common value statistics where you define statistics across columns, inlining many CTE queries as opposed to materializing them, and prepared plan control, which we talked about in a previous episode of Scaling Postgres. Just-in-time compilation being on by default, the new Checksum control feature, and then REINDEX CONCURRENTLY, which again, in my opinion, is a huge feature. Great to have. So if you want some more detail with regard to these, definitely check out his presentation.
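Just to show the syntax for a couple of the features he covers, here's a small Postgres 12 sketch; the table and index names are hypothetical:

```sql
-- CTEs are inlined by default in Postgres 12; the old materialized behavior
-- can still be forced per query if the plan depends on it.
WITH recent AS MATERIALIZED (
    SELECT * FROM events WHERE created_at > now() - interval '1 day'
)
SELECT count(*) FROM recent;

-- Rebuild an index without blocking concurrent writes.
REINDEX INDEX CONCURRENTLY events_created_at_idx;
```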
There's another presentation that Bruce Momjian also posted, which is "The Democratization of Databases". So basically he made some analogies between government and their structure and the software governance structure of Postgres and why it is advantageous. So it's an interesting read if you want to check it out.
The next post is "PostgreSQL Administrator Account WITH NOLOGIN (recover your postgres role)". This is from fluca1978.github.io. And he says, what happens if you lose the ability to log in with your Postgres account? What are you going to do? So when you try to do it, it says Postgres is not permitted to log in. So you can actually stop the service and then start it in single-user mode. So always remember that you have this mode to be able to start Postgres. With that, you can execute operations such as ALTER ROLE postgres WITH LOGIN, and then ctrl+D to stop out of it. Then you can start the Postgres service normally and be able to connect as Postgres. So, very simple post, but just something to keep in mind if you ever run into such a problem as this.
The next post is "Managing Multiple PostgreSQL Instances on Ubuntu/Debian". This is from percona.com. So this talks about some of the different cluster commands that are available on Ubuntu for being able to manage multiple clusters. So for example, there are the commands pg_lsclusters to list out the clusters that are available. There's the cluster control using pg_ctlcluster. You can also create and drop clusters. So this goes over how you can manage multiple clusters of PostgreSQL on Ubuntu or Debian platforms. So if you're interested in that, definitely a blog post to check out.
The next post is "Webinar: Migration to PostgreSQL [Follow Up]". This is from 2ndquadrant.com. So this is a webinar talking about migrating to PostgreSQL. If you go to the registration page, they talk about why you would want to do this, how to plan your migration, how to migrate your business logic, convert data types, migrate your SQL in terms of function procedures and triggers, typical challenges encountered, as well as post-migration tests. So if you're interested in migrating to PostgreSQL, definitely a blog post to check out.
The last post is "VERSION 1.6 OF PGWATCH2 POSTGRESQL MONITORING TOOL RELEASED". This is from cybertec-postgresql.com that allows you to do monitoring and it now has Prometheus support. So you can send this monitoring to Prometheus, as well as Patroni support for being able to determine from my interpretation what is the primary database to be able to follow. So if you're interested in using pgwatch, they have a new version that you can check out.