JSONB vs. Window Functions, Security Checklist, Full-Text Search, NFL Time-Series | Scaling Postgres 176
In this episode of Scaling Postgres, we discuss JSONB vs. window functions, a Postgres security checklist, implementing full-text search and using NFL time-series data.
Content Discussed
- pgbackrest async behavior
- Modern data analysis with PostgreSQL – JSONB throws Window functions out the…
- PostgreSQL Security: Things to avoid in real life
- Postgres Full-Text Search: A Search Engine in a Database
- Hacking NFL data with PostgreSQL, TimescaleDB, and SQL
- PostgreSQL Logical Replicas and Snapshots: Proceed Carefully
- Increasing max parallel workers per gather in Postgres
- Deploying Pgpool-II Exporter with Docker
- Improve PostgreSQL Query Performance Insights with pg_stat_monitor
- Roman Druzyagin
- Rubber Duck Dev Show Episode 6 | Handling Exceptions
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Modern data analysis with PostgreSQL – JSONB throws Window functions out the...". This is from swarm64.com. They're talking about a scenario where you're collecting orders and want to track the delivery status of each order. So you have an orders table and an order shipment status table. When you insert a row into the shipment status table, you include the order ID, which references the key of the orders table, a timestamp, and whatever the status is, 'loaded into van' in this example. So you have many statuses in this table per order; it's a one-to-many relationship. Now the question is: how do you find the orders where the package was lost and its previous status was 'loaded into van'? The author shows the query used to retrieve that information, and it's pretty complex and can't use indexes all that well. With a lot of data in these tables, that query is just going to be slow.
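For reference, here's a minimal sketch of the two-table layout being described; the table and column names are placeholders, not necessarily the ones used in the post.

```sql
-- Hypothetical schema for the one-to-many status layout:
CREATE TABLE orders (
    order_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE order_shipment_status (
    order_id   bigint NOT NULL REFERENCES orders (order_id),
    status_at  timestamptz NOT NULL DEFAULT now(),
    status     text NOT NULL
);

-- Each order accumulates many status rows over time:
INSERT INTO orders DEFAULT VALUES;  -- creates order_id 1
INSERT INTO order_shipment_status (order_id, status)
VALUES (1, 'loaded into van');
```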
But there's another way to do it using JSONB, and that is to store the statuses as a JSONB array on the orders table, where each object in the array contains the timestamp and the status. When there's a new status, you simply append it to the existing array: you update that individual order, appending the status, as opposed to inserting rows into the order shipment status table. Now, he did say that to optimize this you probably want to set a fillfactor of less than 100 on this table, so that updates can more often stay on the same page instead of constantly creating new ones. But with this setup, the query to answer that question (show me the records that became lost when the previous status was 'loaded into van') becomes much simpler. They're using the -1 subscript to grab the last item in that JSONB array and read its status, and similarly -2 to get the second-to-last element.
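Here's a minimal sketch of that JSONB approach, continuing with the placeholder names from above (the fillfactor value is just an example):

```sql
-- Keep the status history as a JSONB array on the orders table:
ALTER TABLE orders
    ADD COLUMN shipment_statuses jsonb NOT NULL DEFAULT '[]';

-- Leave free space on each page so updates can more often stay on the same page:
ALTER TABLE orders SET (fillfactor = 90);

-- Appending a new status is an UPDATE that concatenates onto the array:
UPDATE orders
SET    shipment_statuses = shipment_statuses ||
       jsonb_build_object('ts', now(), 'status', 'loaded into van')
WHERE  order_id = 1;
```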
So basically you look for records where the second-to-last element is 'loaded into van' and the last element is 'lost'. And what's great about this is you can add another AND condition using the containment operator, checking that the shipment status array contains both 'loaded into van' and 'lost', and then put a GIN index on that shipment status column to get a very fast index search. So essentially the query uses an index and runs very quickly. Now, this does run counter to the typical relational advice of having, say, a separate statuses table listing each status, but for quick queries this is a great technique in my opinion. You have gone from doing inserts, which are typically non-locking, to doing updates, which do incur more locking, but hopefully each order's status isn't updated that frequently. Overall, the benefits might outweigh the disadvantages. So if you're interested in learning more about this, definitely check out this blog post.
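Here's a rough sketch of what that indexed query could look like, again with the placeholder names (the exact JSON keys depend on how you built the array):

```sql
-- Index the JSONB array so containment (@>) searches are fast:
CREATE INDEX idx_orders_shipment_statuses
    ON orders USING gin (shipment_statuses);

-- Orders whose last status is 'lost' and whose previous status was 'loaded into van':
SELECT order_id
FROM   orders
WHERE  shipment_statuses @> '[{"status": "loaded into van"}, {"status": "lost"}]'
AND    shipment_statuses -> -1 ->> 'status' = 'lost'
AND    shipment_statuses -> -2 ->> 'status' = 'loaded into van';
```

The @> condition is what the GIN index can satisfy; the -1/-2 subscript checks then confirm the ordering of those two statuses.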
The next piece of content- "POSTGRESQL SECURITY: THINGS TO AVOID IN REAL LIFE". This is from cybertec-postgresql.com. Now, overall when I think about security, I think of a defense in-depth scenario. So basically try to prevent putting your Postgres server on the Internet. Ideally, don't have a publicly accessible Internet address for it. Also, implement network firewalls and all sorts of additional security. But this checklist focuses on what you can do in PostgreSQL specifically. But a defense in-depth scenario is the best way to do security. So their first piece of advice is to avoid relaxed listening_addresses settings. So basically set your listen addresses or keep it at a local host if you don't need to connect from outside the Postgres instance. But if you do, narrow it down to whatever network interface you're using. The second recommendation is not to use "trust" in pg_hba.conf. So basically never use "trust" because that allows anyone to get in without a password. But apart from that, they recommend using scram-sha-256 for your passwords as opposed to MD5 and of course, enforce SSL encryption as well.
They also mention avoiding pg_hba.conf entries that allow connections from anywhere, because ideally you want to narrow access down to only the hosts that need to contact the database, and to narrow down which databases and users are allowed access and what they can access. The third recommendation, mentioned above, is to use scram-sha-256 for passwords, not MD5. The fourth is to handle public permissions on schemas and databases: basically, revoke everything on your databases and schemas from PUBLIC. That prevents an arbitrary user from creating rogue objects in that database and potentially filling it up. The next recommendation is to avoid using ALTER USER ... SET PASSWORD, because the statement, including the password, can end up in your logs if your logging configuration captures that type of statement. They recommend using other Postgres clients to do it instead.
So, for example, psql has a \password command you can use to change a password, and pgAdmin has a change-password dialog, so they recommend those rather than sending the statement to the database yourself. I would expect Postgres may offer filtering for this type of command at some point, but I'm not aware of anything like that in the works. The next recommendation is to make use of ALTER DEFAULT PRIVILEGES, so that when you're adjusting privileges on objects, newly created objects also pick up the permissions you intend. Number seven is, of course, to use SSL so your connections are encrypted.
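A rough sketch of the public-permissions and default-privileges recommendations, with placeholder database and role names:

```sql
-- Take the default permissions away from PUBLIC:
REVOKE ALL ON DATABASE mydb FROM PUBLIC;
REVOKE ALL ON SCHEMA public FROM PUBLIC;

-- Grant only what a specific role actually needs:
GRANT CONNECT ON DATABASE mydb TO app_user;
GRANT USAGE   ON SCHEMA public TO app_user;

-- Make sure tables the owning role creates later get the intended grants too:
ALTER DEFAULT PRIVILEGES FOR ROLE app_owner IN SCHEMA public
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_user;
```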
Number eight is to write SECURITY DEFINER functions safely. Normally a function runs with the privileges of the user calling it, but a SECURITY DEFINER function runs with the privileges of the user who defined it, which may be someone with additional privileges, so you want to be very cautious with those. The next recommendation is to avoid SQL injection in database functions. Of course you need to guard against SQL injection in your application, but you also need to do it in your functions. For example, if a function just concatenates a table name the user supplies into a query, someone could easily fill up or overload your database by passing a generate_series call in place of the table name, or use a UNION to read data from tables they don't have access to. So it is definitely important to avoid that.
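This isn't the exact function from the post, just a sketch of the general shape of the problem and the format()-based fix:

```sql
-- Vulnerable pattern: the argument is pasted straight into the query text, so a
-- caller can pass 'generate_series(1, 1000000000)' or 'orders UNION SELECT ...'
-- instead of a plain table name.
CREATE FUNCTION count_rows_unsafe(tab text) RETURNS bigint AS $$
DECLARE n bigint;
BEGIN
    EXECUTE 'SELECT count(*) FROM ' || tab INTO n;
    RETURN n;
END;
$$ LANGUAGE plpgsql;

-- Safer: format() with %I quotes the argument as a single identifier.
CREATE FUNCTION count_rows(tab text) RETURNS bigint AS $$
DECLARE n bigint;
BEGIN
    EXECUTE format('SELECT count(*) FROM %I', tab) INTO n;
    RETURN n;
END;
$$ LANGUAGE plpgsql;
```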
The format function is the safe way to construct dynamic SQL strings when you really need them. The next recommendation is to restrict superuser access as much as possible; again, that's standard for all computer security. Next is to update your PostgreSQL installation on a regular basis and keep things patched. The last recommendation is to encrypt your entire server. Here they're talking about a free, open-source tool they developed, PostgreSQL TDE, but you could also do encryption at rest at the operating system level to make sure your disks are encrypted. It looks like that's essentially what TDE does: it encrypts data between memory and what gets placed on disk, so it's basically encryption at rest. So if you're interested in the details of this post, I definitely encourage you to check it out.
The next piece of content- "Postgres Full-Text Search: A Search Engine in a Database". This is from blog.crunchydata.com and this is a great post to just get started working with full-text search in what I would say is the recommended way to do it to give you the most speed. So you can do statements where you're just doing LIKE in some text string, but those inefficient indexes can't really be used with this type of search. So full-text search allows you to get around some of that. Now of course their recommendation is when you want to search in a particular field ideally you should create a tsvector column in that table using the new feature in Postgres 12, GENERATED ALWAYS, so it's kept up to date. What you store in that tsvector, you do a function to tsvector, and give it whatever dictionary you want to use. It could be English, it could be another language, or maybe in the case of names, just a simple dictionary as well as the column you want to store there.
If you're using a version prior to 12, you can use triggers to do something similar. Then you create a GIN index on that new tsvector column, and using a query with the @@ match operator and to_tsquery, making sure you use the same dictionary and giving it the term you want to search for, you get a very, very fast search, because it's able to use the GIN index to quickly find the relevant information. In addition to searching for a single word, you can do things such as using the ampersand (&) to require this term and that term. You can also use the followed-by operator so that it only matches the phrase 'rain of debris' in this example, and there's a phraseto_tsquery function that does that for you just by swapping it in for to_tsquery.
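Continuing the hypothetical example, the index and a couple of query shapes could look like this (event_id is a made-up key column, just to have something to select):

```sql
-- Index the tsvector column, then search it with the @@ match operator:
CREATE INDEX idx_storm_events_search
    ON storm_events USING gin (search_vector);

-- Require both terms:
SELECT event_id
FROM   storm_events
WHERE  search_vector @@ to_tsquery('english', 'rain & debris');

-- Match the phrase instead (phraseto_tsquery builds the followed-by operators for you):
SELECT event_id
FROM   storm_events
WHERE  search_vector @@ phraseto_tsquery('english', 'rain of debris');
```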
There's also websearch_to_tsquery to simulate what a web search looks like. Lastly, the post talks about weighting and ranking results. When you build that column, you can assign a weight to each source column that goes into it. So you can append multiple columns into that one tsvector field and assign weights: here the event narrative gets weight A and the episode narrative gets weight B, so certain results will rank higher than others based on the weighting you assign. Then, to get that ordering in your results, you do an ORDER BY using the ts_rank function with your tsvector column and your search query, and you get ranked results. So this is a great post that introduces full-text search and how to very quickly set up really fast text-based searches of your information.
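Here's a rough sketch of that weighting and ranking, again with the hypothetical column names (episode_narrative is assumed alongside event_narrative):

```sql
-- Combine two columns with different weights (A ranks above B):
ALTER TABLE storm_events
    ADD COLUMN weighted_search tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(event_narrative,   '')), 'A') ||
        setweight(to_tsvector('english', coalesce(episode_narrative, '')), 'B')
    ) STORED;

-- Order matches by rank:
SELECT event_id,
       ts_rank(weighted_search, websearch_to_tsquery('english', 'rain of debris')) AS score
FROM   storm_events
WHERE  weighted_search @@ websearch_to_tsquery('english', 'rain of debris')
ORDER  BY score DESC;
```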
The next piece of content- "Hacking NFL data with PostgreSQL, TimescaleDB, and SQL". This is from timescale.com and they're talking about how the NFL has had some public data they put out that tracks a ton of information for some games. Because apparently NFL players have been equipped with RFID chips that are apparently reporting data at 100-millisecond intervals for each play, describing speed, distance, the players, et cetera. Essentially, a big bucket of time series data. This post goes through actually getting access to this data and how to use it and explore it as well as some queries to analyze it. So I thought this post was pretty interesting. Now you don't need TimescaleDB to do this analysis, they talk about where they have some features that could be advantageous for giving you more performance in this but you can just use normal PostgreSQL to do it and they talk about all the different queries that they did to analyze this result set. So if you're interested in that you can check out this blog post.
The next piece of content- "PostgreSQL Logical Replicas and Snapshots: Proceed Carefully". This is from ardentperf.com. They're talking about a post that was put up, talking about getting a logical replica set up with Postgres by first doing a snapshot in RDS to get the initial data load and then turning on logical replication to start streaming the data. A lot of recommendations you see in this post is if you're going to go this route, be very careful. Now, I've heard of different use cases for snapshots to try and copy data quickly with a running Postgres instance and I've always been very hesitant about it and proceeding carefully is basically what I would advise.
If you're doing any kind of snapshotting, basically anything outside the tools that Postgres provides for backups, I would spend a lot of time validating that you can restore that data reliably and won't run into issues, because you don't want to end up in a situation where you can't bring your database back up. This case is similar in that you're copying a database from one place to another via logical replication, and you don't want issues to crop up or transactions to be missed around the snapshot point, as they discuss. If there are transactions in flight, what's going to be their status? Are they in the WAL or not? So there are a lot of questions here, but if you're interested in learning more about this discussion, definitely check out this blog post.
The next piece of content- "INCREASING MAX PARALLEL WORKERS PER GATHER IN POSTGRES". This is from pgmustard.com and they're talking about over the years with each Postgres version they've added more and more parallel functions to Postgres. So it allows you to utilize all the cores of a machine when you're doing one query. Now, if you have a lot of connections to your database in a transaction processing application, you're going to be using all of those cores because the processes are going to be using different cores. But this addresses the case where you want one really long query to be able to use multiple cores of the machine to answer questions. By default, they say that the max_parallel_workers_per_gather that's essentially per query is limited to two.
So if you have more cores in your machine, you may want to raise that to answer questions faster, particularly if you're doing any kind of online analytical processing. They show the parameter as it exists and how you can increase it per session if you want to. There are a few other settings you'll probably want to adjust as well: work_mem, because each worker can use that amount separately; the total max_parallel_workers for the system; and max_worker_processes. They also mention some cost settings in the Postgres configuration that influence whether the planner chooses a parallel query at all. So if you want to learn more about that, definitely check out this post.
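As a quick sketch of the knobs being discussed (the values here are only examples, not recommendations):

```sql
-- Try a higher per-query limit for the current session only:
SET max_parallel_workers_per_gather = 4;

-- Related system-wide settings to review when raising it:
ALTER SYSTEM SET max_worker_processes = 16;      -- overall background-worker cap (restart required)
ALTER SYSTEM SET max_parallel_workers = 12;      -- cap on parallel workers across the whole system
ALTER SYSTEM SET work_mem = '64MB';              -- each worker can use this much per sort/hash
SELECT pg_reload_conf();

-- parallel_setup_cost, parallel_tuple_cost and min_parallel_table_scan_size
-- also influence whether the planner chooses a parallel plan.
```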
The next piece of content- "pgbackrest async behavior". This is from fluca1978.github.io. He's talking about pgBackRest from an async perspective. Now in terms of async, they're talking about the management of the WAL. So when you're running the database in archive log mode, essentially you are going to need to save all those WALS somewhere and pgBackRest allows you to save those WAL in an async manner. So if you're trying to save them to say S3 or a storage system, that may have some latency to it to transfer files. You're going to want to use an async method if you're going to transfer them in that fashion or just do bulk loads on a periodic basis to S3. So this async becomes very important in that use case.
The other case where it matters is during a restore, because you need to read those WAL files back. If they're stored on S3, doing a round trip for every single WAL file takes a long time, and a replica may not even be able to catch up to the primary that way. The async mode helps there too, because it can pull down a batch of WAL files ahead of when they need to be applied during recovery. So if you want to see how pgBackRest handles this, this post walks through two different examples of the async behavior in action.
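For context, this is roughly how Postgres hands WAL to pgBackRest and gets it back; the stanza name is a placeholder, and the async options mentioned in the comments live in pgbackrest.conf rather than in Postgres itself:

```sql
-- Ship WAL through pgBackRest (archive_mode needs a restart to change):
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'pgbackrest --stanza=main archive-push %p';
SELECT pg_reload_conf();

-- On a restoring server or replica, WAL comes back the same way, e.g.:
--   restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
-- The async behavior itself is enabled on the pgBackRest side, as I understand
-- it, with archive-async=y plus a spool path and a process-max setting.
```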
The next piece of content- "Deploying Pgpool-II Exporter with Docker". This is from b-peng.blogspot.com. This is about using the Pgpool II exporter which provides metrics to Prometheus. So if you're interested in setting this up for Pgpool II, you can definitely check out this blog post.
The next piece of content- "Improve PostgreSQL Query Performance Insights with pg_stat_monitor". This is from percona.com, and they're talking about a new extension that they developed called pg_stat_monitor. It basically takes some of the data that are output by existing extensions such as pg_stat_statements, which gives you a record of the statements that have been applied against the database for all time. The pg_stat_activity shows you in real-time what statements are running against the database, as well as auto-explain. If there is a statement that takes too long, it will automatically do an explanation on that statement and place it in the log for you. It basically combines these functions into one extension. So if you're interested in checking out this new extension, you can definitely check out this blog post.
The next piece of content, the PostgreSQL Person of the week is Roman Druzyagin. So if you're interested in learning more about Roman and his contributions to Postgres, definitely check out this blog post.
The last piece of content is that we did another episode of The Rubber Duck Dev Show this week. This episode was on handling exceptions in your programs. So if you're interested in that, you can check out that episode.