Data Directory, Vacuum Features, Binary Performance, Network Latency | Scaling Postgres 114

In this episode of Scaling Postgres, we discuss the Postgres data directory, new vacuum features, binary storage performance and the impact of network latency.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is that new PostgreSQL releases 12.3, 11.8, 10.13, 9.6.18, and 9.5.22 have been released. This is from postgresql.org. The only security issue relates to the Windows installer, which could run executables from uncontrolled directories. So if you run on Windows, you may want to upgrade faster. Otherwise, there are over 75 bug fixes and improvements, so I would look through this list to see if there's any reason you want to upgrade relatively quickly.

There's also a post on Devrim Gunduz's PostgreSQL blog that says "yum users: some -devel RPMs require a new repository". He says if you're using Red Hat Enterprise Linux or CentOS 7, the development sub-packages for versions 11 and 12 require an extra repository to be enabled for llvm and clang, and he shows the command to enable it on CentOS and the corresponding one for Red Hat Enterprise Linux. So just something to keep in mind if you use those distributions.

The next piece of content is "Don't Manually Modify the PostgreSQL Data Directory!". This is from the Robert Haas blog on rhaas.blogspot.com. This is a big no-no. Unless you have actually done PostgreSQL development, I would probably never modify anything in the data directory. What's interesting, a lot of people, when he gives this recommendation, say, really? What if I do X, Y, and Z? Don't do it, just don't do it. He says in 100% of cases, my answer is that it's not safe. But going back to what if I do this, he kind of goes through this post on what happens if you do it. So for example, if you remove the postmaster PID, there's a chance, of course, you're going to be running multiple versions of that same PostgreSQL cluster that could lead to who knows what kind of effects. And he says if you suspect it's present without Postgres running, assume that you're wrong. 

Because he says in his 20 or so years, he's never really seen it happen that way. So definitely something not to do. He says removing WAL files is another big no-no. In older versions of PostgreSQL the WAL files were stored in the pg_xlog directory, so people assumed, hey, these are logs, I can delete them. No, they're the write-ahead log itself, and the directory has actually been renamed to pg_wal in recent versions precisely because you don't want to delete these files: they're needed for crash recovery and for maintaining the consistency of your database. He goes through all sorts of ill effects that can happen if you remove them. But he says instead of removing files manually from pg_wal, consider why the system is keeping so many of them around. Maybe you have a failing archive_command.

Maybe you're just not keeping up with the rate of WAL generation; I've had slow disks that can cause this type of thing. Do you have an existing or orphaned replication slot that is preventing WAL segments from being removed? Maybe you have long-running transactions, or maybe your max_wal_size or wal_keep_segments settings are really high, preventing segments from being removed. So again, these are things you can fix without having to change anything in the data directory. Next, he covers removing files from the pg_xact directory, which tracks transaction commit status, then the pg_multixact directory, and he also talks about attempting single-page restores, which carry a large degree of risk, walking through all the types of things that can go wrong.
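
As a practical aside, the usual WAL-retention culprits he lists can generally be checked from SQL without going anywhere near the data directory. This is a rough sketch using the standard catalog and statistics views; it's read-only and nothing here touches files on disk:

```sql
-- Inactive (orphaned) replication slots pin WAL at their restart_lsn.
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots;

-- A failing archive_command shows up as a climbing failed_count here.
SELECT archived_count, failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;

-- Settings that legitimately keep extra WAL around (wal_keep_segments
-- applies to version 12 and earlier; it became wal_keep_size in 13).
SELECT name, setting
FROM pg_settings
WHERE name IN ('max_wal_size', 'wal_keep_segments', 'archive_mode', 'archive_command');
```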

So if you're interested, I definitely suggest checking that out. Then there's a section on "Is manual modification of the data directory ever justified?". If you're in a corrupt state, there are certain cases, very few, where it is perhaps warranted, but I would leave that to an expert: someone who has done PostgreSQL development and knows exactly what they're doing. He compares it to surgery; essentially you want to find a surgeon to do this. You don't want to just willy-nilly go in and try changing things, because you'll probably make things worse. So this was a great post. If you've ever considered modifying the data directory, basically just don't do it. Definitely a post I suggest you check out.

The next post is "Improved (auto)vacuum in PostgreSQL 13". This is from amitkapila16.blogspot.com. Improvement number one is vacuum will be allowed to process indexes in parallel. This is huge and where it does it in parallel is the indexes. The indexes in my experience take the longest part of vacuum and you can do various configuration settings by looking at the max_parallel_maintenance_workers and min_parallel_index_scan_size to be able to configure it so that you are indexing in parallel. It looks like this will be the default. So that's good for doing autovacuum as well. But you can specify the number of workers when you're going to do a manual vacuum. So this is a great, great improvement coming in 13. Improvement number two is to allow inserts to trigger autovacuum activity. 

So by default, when a table only gets INSERTs, there's nothing really to vacuum up in terms of deleted or updated rows. The first vacuum such a table sees is an anti-wraparound vacuum, which will probably take a really long time, potentially use more resources, and will not auto-cancel. So there are some disadvantages to that. But with the settings they've introduced, a vacuum is triggered once enough INSERTs have gone into a table, so you get some general cleanup such as freezing rows and marking heap pages all-visible, which also lets index-only scans skip heap fetches. So being able to do this vacuum is important for those reasons. Improvement number three is to allow autovacuum to display additional information about the heap and index in case of an error.

So it's good that it can actually tell you where particular errors are if the vacuum runs into a problem. Now, knock on wood, I haven't experienced this, but this reporting would be good to have in case it happens. Improvement number four: autovacuum will now log WAL usage statistics along with other information, which is great for understanding things like how many full-page images were written. Improvement five makes vacuum buffer counts 64 bits wide to avoid overflow of the buffer usage stats, so the stats stay accurate instead of becoming invalid, which is a good benefit. And then improvement six adds a wait event, VacuumDelay, to report on cost-based vacuum delay; he says this will help us monitor autovacuum throttling. So a lot of great vacuum features coming in 13, especially the parallel index processing. If you want to learn more about it, definitely check out this blog post.
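
To make those first two improvements concrete, here is a minimal sketch of what they look like in PostgreSQL 13. The table name is made up, and the two autovacuum values shown are just the shipped defaults, which you would tune down for insert-heavy workloads:

```sql
-- Manual vacuum with an explicit number of parallel workers for the index
-- phase; the effective number is capped by max_parallel_maintenance_workers,
-- and indexes smaller than min_parallel_index_scan_size don't get their own worker.
VACUUM (PARALLEL 4, VERBOSE) my_large_table;

-- New in 13: inserts alone can trigger autovacuum. Lowering these gets
-- insert-only tables vacuumed (rows frozen, pages marked all-visible)
-- sooner than an eventual anti-wraparound vacuum would.
ALTER SYSTEM SET autovacuum_vacuum_insert_threshold = 1000;
ALTER SYSTEM SET autovacuum_vacuum_insert_scale_factor = 0.2;
SELECT pg_reload_conf();
```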

The next piece of content is "BINARY DATA PERFORMANCE IN POSTGRESQL". This is from cybertec-postgresql.com. So this post covers performance using binary data types in PostgreSQL. Now, they cover three ways that you can do this. One is storing the data in files and then storing a reference to that file. These have advantages and disadvantages. The disadvantage is that consistency is not guaranteed, like when you write the file, and when you put the record in the database, those are going to be misaligned. The benefits are the database is small and the performance can usually be very fast because reading directly from the file system is better. The next option is storing the data in large objects. This is an older way of storing binary data in PostgreSQL. 

Now, it's the slowest option, but it does have the advantage that you can store arbitrarily large data with large objects, and the API supports streaming, so you can read and write the data in chunks. The third option is storing the data in a bytea column, which is the newer way of storing binary data. The main disadvantages are that there's an absolute limit of 1GB, all the data has to be handled in memory, and there's no streaming support. With bytea, PostgreSQL basically uses TOAST to store the data out of line from the table. Along with this, because PostgreSQL normally compresses large values being stored, you can avoid that by setting the storage for your bytea column to EXTERNAL, which skips the compression, as in the sketch below.
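
Here is a minimal sketch of that setting on a hypothetical table; the table and column names are made up, but ALTER TABLE ... SET STORAGE EXTERNAL is the relevant command:

```sql
-- Hypothetical table holding binary payloads in a bytea column.
CREATE TABLE documents (
    id   bigserial PRIMARY KEY,
    body bytea
);

-- Large bytea values are TOASTed out of line anyway; EXTERNAL keeps the
-- out-of-line storage but skips TOAST compression, which is what you want
-- when the payload (images, zip files, and so on) is already compressed.
ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;
```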

So if the data is already compressed, use EXTERNAL to avoid trying to compress it again. Next, they talk about a Java implementation they wrote to test the speed of these different approaches. They come up with a table at the bottom that shows, for two different data sizes, that using the file system directly is the fastest, but again the disadvantage is the consistency risk. Large objects are pretty darn slow, but they support streaming. And the bytea data type is faster than large objects, but nowhere near the speed of the file system. Then in the summary they cover the advantages and disadvantages of each approach. So if you want to store binary data in PostgreSQL, definitely check out this post so you can make the best choice for your use case.

The next post, also from cybertec-postgresql.com, is "POSTGRESQL: NETWORK LATENCY DOES MAKE A BIG DIFFERENCE". Here they wanted to test performance with increased latency, so they're using a tool called tcconfig to add latency in Linux and then using pgbench to run some transaction tests. Running at full speed with no latency introduced, they get 8,813 transactions per second. Introducing ten milliseconds of delay and running the same benchmark drops it to 461 transactions per second, a huge drop-off. Essentially, each client has to wait a full network round trip for every statement before it can send the next one, so added latency caps per-connection throughput no matter how fast the server is. Bump the delay up to 50 milliseconds and the TPS is at 88, roughly a 100x drop from the no-latency baseline. So network latency is hugely important. They mention it specifically for online transaction processing: because you have a lot of small, short transactions, any latency between the client and the server significantly decreases the transactions you can execute per second. So definitely an interesting blog post showing how latency can impact your queries.

The next post is "Postgres Distinct On". This is from johnnunemaker.com. This is an interesting post because he's a developer and he walks through a problem he was having with performance and how he wanted to fix an N+1 query to get better performance. So this talks a lot about Ruby as well as the SQL he used to solve his problem. Basically, the solution he chose as the most efficient one was using DISTINCT ON to pull up the first record that he needed. And then he goes through his Ruby implementation of how he introduced this query into his code to get the best performance. So if you're interested in a developer's perspective of seeing a problem and coming up with a solution using DISTINCT ON, definitely check out this blog post.

The next post is "Overcoming Imposter Syndrome: Working with PostgreSQL's SSL Modes". This is from richyen.com. He's talking about all the different SSL modes that are available in PostgreSQL. Now, what determines what uses SSL or not from the server's perspective is the pg_hba.conf file. So, however, in that first column you have it set will determine whether SSL is used. Now local uses a Unix-domain socket, so it's not going to use any SSL at all. Using host, it could use SSL or it could not. Using hostssl, it only uses SSL. And then using hostnossl means don't use SSL with us. Now, a convenient way to reject all non SSL traffic is to put this in the top line of your pg_hba.conf file. Basically, for hostnossl for all databases, the user's IP addresses reject it. 

So basically, reject all non-SSL connection attempts; then you put in a hostssl line for everyone else wishing to connect. That's how you enforce it on the server side. There's also an sslmode setting on the client that dictates how the client connects: it can prefer SSL, require it, or disable it, and there are also certificate-verifying modes. So he goes through the different ones, with the server set up as described to reject all non-SSL traffic and only accept SSL connections with a password. The first attempt, with the client's sslmode set to disable, is rejected: the client says it doesn't want SSL, the server says it will only accept SSL, so the connection doesn't work. Connections do get through when the client's sslmode is allow.

It also works if the client prefers SSL or requires it. However, you'll notice that verify-ca fails, as does verify-full. These are the modes where you're asking the client to verify the certificate used by the server, so they only work if you have set up and installed certificates on the server and the client trusts the certificate authority that signed them. verify-ca means verifying the CA that signed the server's certificate. verify-full means you're also verifying that the common name or subject alternative name in the certificate matches the database server you're trying to connect to. So this is the strongest way of verifying that you're not connecting to a server other than the one you expect. If you want to learn more about these settings, definitely check out this post.
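
On the client side, those verifying modes are selected through the connection string; a hedged example, where the host, database, user, and certificate path are all placeholders:

```
# verify-full only succeeds if the server's certificate chains to the CA
# in sslrootcert AND the certificate's CN/SAN matches the host name.
psql "host=db.example.com dbname=app user=alice sslmode=verify-full sslrootcert=/path/to/root.crt"
```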

The next piece of content is "Benefits of External Key Management System Over the Internal and How they Could Help Securing PostgreSQL". This is from highgo.ca. So this is talking about the benefits of an external key management system. Now, why are we talking about this? This is because this organization at High Go is working on a transparent data encryption feature. This is where data gets encrypted at rest by the PostgreSQL server. Now, in order to do that, they mentioned that encryption is relatively simple, but what is not is key management. So they're working on this transparent data encryption feature, potentially for version 14, and they're talking about having some internal key management features for it. But they're going through and talking about this post from the Benefits of an external key management feature. So if you want to learn more about the benefits of this, you may want to check out this post.

The next post is" Keep your libraries updated and nobody gets hurt". This is from 2ndquadrant.com. They're talking about the importance of keeping all your libraries up to date even when running PostgreSQL because PostgreSQL relies on a number of libraries and if those aren't kept up to date, you could run into problems and issues. They say that not upgrading may give you a false sense of stability because things are fine and you may have a fear of upgrading, but eventually, you're probably going to run into issues. It's usually related to an old software version that hasn't been updated. So they go into some actual scenarios where, say, someone was using a pg_dump, but then they discovered that it had a PGAudit extension that hadn't been updated in more than a year and it was resulting in an error when trying to restore the database. So that's a big issue. The next scenario they're talking about where they had a huge performance difference when trying to do a PostGIS query and they tracked it down to an issue with an old glibc version. So once that was updated, the performance went back to normal. So it's basically communicating the importance of doing updates to your system, not just your PostgreSQL versions, but also the extra libraries that potentially PostgreSQL is using.

The next post is "Multicolumn Partitioning". This is from mbeena-learnings.blogspot.com. This is a way of partitioning tables using multiple columns. And I actually didn't know this was possible where you can create a table and have different columns and look in different ranges. So if you look at this here, it's like a range from 1 to 20 and from 110 to 200 and from 50 to 200. This goes through all the possibilities and things to watch out for and how queries work when using multiple columns for your partitioning. Now, I haven't encountered a use case for this, but this was definitely an interesting blog post you may want to check out in case you have a use case for it.

The next post is "PG Badger: X-Ray Vision for Your Queries". This is from enterprisedb.com and they're basically talking about using PgBadger to analyze your logs and they go through some of the things you may want to additionally configure in postgresql.conf in order to make sure you're collecting the most information. And it shows you how to walk through and how can use PgBadger to analyze your logs. So if you're interested in doing that, check out this blog post.

The next post is "Key Parameters and Configuration for Streaming Replication in Postgres 12". This is from enterprisedb.com. So in this example, they're looking at setting up a primary server with two replicas and you can choose to have them synchronous or not. And they're using replication slots and they describe the settings that need to be made on the primary and the replicas. So if you're interested in setting this up for your system, you can check out this blog post.

The next post is "Fuzzy Searching with PostgreSQL". This is from dev.to and it's talking about Trigrams basically, so use the pg_trgm extension to do fuzzy searches or similar searches in your database. Now this goes over just a very basic implementation of it and how you could use it. It doesn't cover things such as adding indexes and things of that nature, but if you want to get a good basic description of using the pg_trgm extension, definitely check out this blog post.

The next piece of content is "Advanced SQL and database books and resources". This is from neilwithdata.com, and we had previously talked about other books that were presented as educational resources, and here he goes into several different online courses and books for you to boost your knowledge of SQL. Again, this is not necessarily PostgreSQL specific, but a number of his resources of course talk about PostgreSQL.

The next post is "Build PostgreSQL and Extension on Windows". It tells you how you can build Postgres along with extensions in Windows. This is from highgo.ca. So if you have a need to do that, check out this post.

The last post, the PostgreSQL person of the week is Fabrízio de Royes Mello. So if you want to learn more about Fabrízio and his contributions to PostgreSQL, definitely check out this blog post.
