Postgres Releases, Data Warehouses, Exclusion Operators, Read-Only Users | Scaling Postgres 165
Join Over 1,000 Engineers & Get New Episodes Weekly!
In this episode of Scaling Postgres, we discuss the newest Postgres releases, implementing a data warehouse, using exclusion operators and setting up read-only users.
Content Discussed
- PostgreSQL 13.3, 12.7, 11.12, 10.17, and 9.6.22 Released!
- Using PostgreSQL as a Data Warehouse
- Exclusion operators - EXPLAINING PostgreSQL | CYBERTEC
- Creating a Read-Only Postgres User
- Postgres 14 highlight - CREATE TABLE COMPRESSION
- Clustering a Table
- Setting up PostgreSQL streaming replication
- TLS for Postgres on Kubernetes: OpenSSL CVE-2021-3450 Edition
- Regression Analysis in PostgreSQL with Tensorflow: Part 3 - Data Analysis
- Laurenz Albe
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL 13.3, 12.7, 11.12, 10.17, and 9.6.22 Released!". This is from the postgresql.org website and this release covers three security issues. One, a buffer overrun from integer overflow and array subscripting calculations. The second is a memory disclosure in INSERT ON CONFLICT DO UPDATE. This depends upon a purpose-crafted table to exploit. The third is a memory disclosure in the partition table UPDATE RETURNING. That, again, relies on a purpose-crafted partition table. So there are three security issues that are addressed and then a number of bug improvements for these versions. So definitely when you get an opportunity, you should go ahead and upgrade to the latest version.
The next piece of content- "Using PostgreSQL as a Data Warehouse". This is from narrator.ai, and they're talking about using Postgres as a data warehouse. They have a number of recommendations that they cover in this post. The first thing they cover is the differences between data warehouses and relational databases. Although, I would call it the difference between data warehouses and online transaction processing databases. As mentioned here, most queries are looking for one record or a few records, whereas data warehouses do a lot of analysis. So online analytical processing is another term that's frequently given for it. For these analytical queries, the traits that you can process many rows at a time, queries go from several seconds or less than a second to several minutes, and it may only be looking at a small number of columns. For that they're advocating a column row store may be better, but you can still use Postgres's row store for doing it.
Now they mentioned some third-party data warehouse products, but there's also the C-store extension for Postgres and the Citus extension which allows you to do column storage within Postgres. So you could check those out. Then they go into if you're going to use Postgres, how would you configure it for data warehousing operations? The first thing they mention is to avoid common table expressions, at least with versions prior to 12, because the versions after allow you to materialize that CTE or not. The next recommendation they have is to use indexes sparingly. Now this depends on how much data. If you're going to be analyzing a whole table, well then of course you don't need indexes or even a significant portion of that table. But if you're going to be analyzing, say, by a date and a time, then indexing could prove beneficial. Or if your data is so huge, you could actually partition by month or by year, not use indexes.
That way you can only target certain partitions when doing the query. So you don't have to cover the whole data set and they seem to be mostly focusing their discussions on B-tree indexes. But you could also potentially try using BRIN, the block range index. That could give some very good performance. Potentially, if you're looking at a lot of records that you're querying. Then they cover the partitioning that I mentioned before and how it can be beneficial for breaking up a large data set into smaller tables. They talk about the importance of minimizing disk IO and the importance of using SSDs or other types of fast disk access.
Then they talk about the importance of vacuuming because it's crucial after you do a bulk insert to do a vacuum to make sure that everything is optimized. They talk about making sure that your parallel queries are tuned and that you want to typically make modifications to your max_parallel_workers and max_parallel_workers_per_gather to optimize those for the types of queries you're running. Then lastly, they also recommend increasing your statistical sampling size, which you could do on a per-table basis or for the entire table to give you the best query performance. So if you want to use PostgreSQL as a data warehouse, perhaps you would like to check out this article.
The next piece of content is actually a YouTube video and it's "Exclusion Operators Explaining PostgreSQL". This is from the Cybertec Youtube channel. They're talking about the problem of preventing overlapped bookings. So if you have a room, you don't want to book that room more than once. So you want to avoid this situation where someone is going to double book a room in, say, a hotel. Now, to do this they're using range types, so they're using a start date and an end date within a single data type. And you can store this in Postgres. Then you can do things to determine, hey, does this date exist within this range of dates? It returns true or false. Now, the trick that you can do with Postgres is to use a special type of index and an exclude USING clause when you create the table to create a constraint to avoid double booking.
So the first thing you need to do is create this extension btree_gist, which combines a B-tree and a GIST index. Then you create your table that has a room with an integer as an Identifier, a range. So this is the range of dates that that room wants to be booked. Then you specify an exclusion constraint where you're excluded using GIST and room with an equals in a myrange with a double ampersand. Basically, this operator avoids overlaps. So with adding this extension and this minimal set of code, you essentially avoid double booking rooms in this implementation and that's all the code you need to worry about. When you do an INSERT and you attempt to do a double booking, it'll return an error so you can capture that in your application and then handle it gracefully. So this is a great video and I encourage you to watch the whole video and see how to implement this and even implement multiple exclusion constraints.
The next piece of content- "Creating a Read-Only Postgres User". This is from blog.crunchydata.com and they're talking about you wanting to create a user that has read access to all tables in a database or in this case, in a schema. So you can use grant and revoke to grant different objects to different users or to revoke them from different users. And the ALTER DEFAULT PRIVILEGES basically says for any new objects created by a particular user. So in their example here, they created a separate schema to store their data. They created an app schema. They revoked all permissions on the schema from the public, which is a general security recommendation. They created a table and an index and then inserted stock data into it and created a function to be able to retrieve some data from it. Now they created the user they're calling a scraper because they want to scrape the information.
They created a role as a login, that's how you would create a user. Now, when they try to query it, you're going to get a permission denial because it doesn't have access to this app.stock_data table. But these are the grants that would give you access. So you grant usage on the schema app to the scraper. You grant SELECT on all tables in the schema app to the scraper. You grant usage on all the sequences in the schema app to the scraper and then grant EXECUTE on all functions in the schema app to the scraper that will allow you to query the table or to use the functions. If you try to insert data or create a new object, it will fail and give you an error. But with just those grants you actually can't view new tables. So if you go in as a different user and create a new table, you then go in as the scraper and try to access it.
You're going to get a permission denied. So that's where you need to run these two alter default privileges in the schema app to grant select on tables and grant execute on functions that will then let you select and execute functions for any new objects. Now, there is a caveat that if a different user has access to the app schema and creates an object, if they have not set altered vault privileges, the scraper won't have access to it. So you'll have to structure your users and permissions to be able to handle that situation. Although they mentioned here that in Postgres 14, there is a pg_read_all_data default role that you can grant to users, and that will allow you to select from any table object in the database. So if you want to learn more about setting up read-only users, definitely check out this blog post.
The next piece of content- "Postgres 14 highlight- CREATE TABLE COMPRESSION". This is from paquier.xyz, and they're talking about a feature where you can now customize the compression algorithm used for toast compression. Now, this doesn't let you arbitrarily compress columns or compress tables in Postgres, but when you have a table row that exceeds the space allowed for a given row, it gets stored in the TOAST table. Well, when you store something in TOAST, it gets compressed. By default, that compression is a pglz compression, which from what they were saying here, is circa 1999. But now with this feature, you can opt to use lz4 compression, which should be much faster. The only caveat with that is that PostgreSQL must be built using --with-lz4. But with that, you can then customize the compression and give you a little bit of a speed boost. So if you're interested in learning more about this patch coming into PostgreSQL 14, definitely check out this blog post.
The next piece of content- "Clustering a Table". This is from momjian.us, and he's talking about the command cluster, ordinarily, the table rows in the heap where the actual data is stored are not ordered by any method. It may have an ordering at first insertion, but as UPDATES and DELETES happen, the rows can be placed anywhere. It's your indexes that are kept in an ordered fashion. The table is never really ordered. However, the CLUSTER allows you to actually order a table and you CLUSTER it based on an existing index. So that's a way that you can get the table order to mirror the ordering of an index.
Now, this could be beneficial if you pull out large numbers of contiguous rows at a time using an index. So, for example, what he mentions here is that if you're just pulling out one row, it consults the index and knows exactly where to go to pull out the row from the heap to present it to you. But if you're pulling up a lot of different rows from an index, they may be scattered across the heap and require accessing multiple locations in memory, if not disk, to retrieve that data. But if you know you are going to be frequently pulling a lot of rows, it could be beneficial to order the table in that fashion. But there are two big downsides to CLUSTER.
Number one, it does not maintain the ordering. So you'd have to keep clustering the data if updates and deletes keep happening. Secondly, it locks the table so you can't really access the table while a CLUSTER operation is running. So this post talks a little bit about CLUSTER and some of its use cases, but because of those restrictions, it's really not used frequently. Perhaps in data warehousing use cases, it might prove beneficial where you do a load of the data and then CLUSTER it from a particular index. But if you want to learn more, definitely check out this blog post.
The next piece of content- "SETTING UP POSTGRESQL STREAMING REPLICATION". This is from cybertec-postgresql.com and it's a recent post on how you can set up a streaming replication from a primary DB to a replica or a Standby DB. They're using version 13 and give you all the different commands to run and how you can configure both systems to set up streaming replication from one system to another. So if you're interested in learning more about that, definitely check out this blog post.
The next piece of content- "TLS for Postgres on Kubernetes: OpenSSL CVE-2021-3450 Edition". This is from blog.crunchydata.com. So they previously had a post on setting up TLS for Postgres and Kubernetes. This is an updated post where they had to update a few techniques to handle a new CVE that was introduced. So it's basically a revised post using the operator that Crunchy Data offers to be able to set up TLS for your Kubernetes Postgres installations. So check out this blog post to learn more about that.
The next piece of content- "Regression Analysis in PostgreSQL with Tensorflow: Part 3- Data Analysis". This is from enterprisedb.com. So if you're interested in doing data analysis with Tensorflow and Postgres, definitely check out this post.
The last piece of content, the PostgreSQL person of the week is Laurenz Albe. So if you're interested in learning more about Laurenz and his contributions to Postgres, definitely check out this blog post.