Reducing WAL, What is Null?, Understanding SQL, TPC-H Benchmarking | Scaling Postgres 104

In this episode of Scaling Postgres, we discuss how to reduce the size of your WAL files, what NULL is, how to better understand SQL, and how to explore TPC-H benchmarks.

Transcript

All right, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "REDUCE WAL BY INCREASING CHECKPOINT DISTANCE". This is from cybertec-postgresql.com. To start off, he creates a basic Postgres 12 cluster, sets max_wal_size relatively low, and turns off synchronous_commit so the tests run quickly. Next, he creates a pgbench test database with 10 million rows and then runs a pgbench benchmark of 10 x 1 million transactions. It processed about 13,700 transactions per second, and measuring the WAL showed it generated about 77 GB of WAL files. Basically, the lower max_wal_size causes it to checkpoint more often: because it can't keep that much WAL around, it needs to checkpoint to persist the changes that are occurring to the data files.

So with frequent checkpoints, the benchmark generated a lot of write-ahead log (WAL): 77 GB. Next, he increased max_wal_size to 150 GB, which is over 1,000 times as much, changed the checkpoint timeout period to one day, and still left synchronous_commit off. Running the same benchmark, it only generated about 5.3 GB of WAL, compared to 77 GB. So why would the same set of transactions generate over ten times as much WAL? He explains it here. Basically, the first time a page or block is altered after a checkpoint, Postgres needs to write the full contents of that page to the WAL to ensure consistency. Once that whole block has been committed to the WAL, subsequent changes to it won't require a full-page image. But Postgres needs that consistent starting point so that any pages or blocks altered after a checkpoint can be correctly reconstructed from the WAL files.
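To make that concrete, here is a minimal sketch of the two configurations being compared, written as SQL. The low value in the first run is an assumption, since the post only says max_wal_size was set relatively low; the 150 GB size and one-day checkpoint timeout are the values discussed above.

    -- Run 1: a small max_wal_size forces frequent checkpoints
    ALTER SYSTEM SET synchronous_commit = off;
    ALTER SYSTEM SET max_wal_size = '128MB';  -- assumed "relatively low" value
    SELECT pg_reload_conf();

    -- Run 2: a large max_wal_size and a long timeout spread checkpoints out
    ALTER SYSTEM SET max_wal_size = '150GB';
    ALTER SYSTEM SET checkpoint_timeout = '1d';
    SELECT pg_reload_conf();

    -- To measure WAL volume, read the current WAL position before and
    -- after each benchmark run and take the difference:
    SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'));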

Now, an older post talking about this from 2016 is actually called "On the impact of full-page writes". This is from 2ndquadrant.com. Here they're talking about full-page writes, or the storing of full-page images in the WAL files. As an example, they look at size differentials across different workloads, comparing a serial or bigserial key to a UUID key. Because the serial is essentially sequential, whereas the UUID is randomized, the UUID workload causes more full-page images to be stored in the WAL whenever it makes alterations. Therefore, using UUIDs normally causes a great deal more WAL to be generated. This is a great post that goes into more depth on why these full-page writes are needed. So it's something to keep in mind if you want to research how to reduce your WAL volume as your database increases in size, along with some potential solutions to save space and make it more efficient.
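As a rough illustration of that effect (the table definitions here are hypothetical, not taken from the post): sequential keys concentrate inserts on the rightmost index page, while random UUID keys scatter inserts across the whole index, so many more pages need a full-page image after each checkpoint.

    -- Needed for gen_random_uuid() on Postgres 12 and earlier
    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    -- Sequential keys: new rows always land on the rightmost index page
    CREATE TABLE t_serial (id bigserial PRIMARY KEY, payload text);

    -- Random keys: new rows touch index pages all over the place
    CREATE TABLE t_uuid (
        id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        payload text
    );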

Next post is "Handling NULL Values in PostgreSQL?". This is from percona.com, and they're talking about how the concept of NULL has different definitions in different languages. For example, in C/C++, NULL is defined as 0 in a standard definition header file. In Java, NULL means a variable reference does not point to any value, and it can be tested against NULL with equality operators. But in PostgreSQL, NULL basically means an unknown value. They have a quick comparison here: 1 = 1 is true. Does 'foo' equal 'foo' for text? Yes, it's true. Does NULL equal NULL? No. The reason is that you can't use the equality operator here: whether one unknown value equals another unknown value is itself unknown. Similarly, you can't say one unknown value is not equal to another unknown value; that too is unknown.

The way you need to work with NULLs in PostgreSQL is to use IS NULL and IS NOT NULL. So NULL IS NULL gives you true, and NULL IS NOT NULL gives you false. Similarly, you can use the COALESCE function, which returns the first non-NULL argument, to test an expression, though normally you're going to use IS NULL. They also mention NULLIF, which I personally haven't used, but it returns NULL if the two values passed in are equal and otherwise returns the first value. The post goes through a couple of quick examples using some sample data where NULLs are present. You can't search for a blank value, because it's not blank, it's actually NULL; you need to say WHERE surname IS NULL. So this is a relatively brief post that talks about NULL as it relates to PostgreSQL and some things to be aware of if you're relatively new to Postgres.
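Here's a quick sketch of the behavior described above; the persons table in the last query is hypothetical.

    SELECT 1 = 1;             -- true
    SELECT 'foo' = 'foo';     -- true
    SELECT NULL = NULL;       -- null (unknown), not true
    SELECT NULL <> NULL;      -- null (unknown), not false
    SELECT NULL IS NULL;      -- true
    SELECT NULL IS NOT NULL;  -- false

    SELECT COALESCE(NULL, NULL, 'fallback');  -- 'fallback', the first non-NULL argument
    SELECT NULLIF('a', 'a');                  -- null, since the two arguments are equal
    SELECT NULLIF('a', 'b');                  -- 'a', the first argument

    -- Finding the rows where a value is missing:
    SELECT * FROM persons WHERE surname IS NULL;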

The next piece of content is "SQL DISTINCT is not a function". This is from blog.jooq.org. Basically, he's saying he's seen instances of someone writing some SQL where they use DISTINCT and parentheses around a column. But really this DISTINCT applies to this whole select statement. It's not a function that you would use to execute a column or have a column as a parameter. So these params are unneeded, they don't do anything because the SELECT applies to the whole thing. He goes through and explains a little bit why this is the case and how to work through it. Now, one exception he says, is the DISTINCT ON keyword, which is something that Postgres has an extension to the SQL standard. So DISTINCT ON  does require parentheses around a parameter that's passed in to describe how distinct on is required.

Now, related to this, there is a further post he did the next day called "5 Ways to Better Understand SQL by Adding Optional Parentheses". In the first part of this, he links to all sorts of different posts that offer additional lessons on SQL. So if you want a refresher or more education in that regard, definitely check out some of those posts. But the rest of the post talks about the placement of parentheses and how it can help you understand what's going on in SQL. He first covers row value expressions: you can write your SQL plainly, but you can also put parentheses around the values, which doesn't change anything, and you can even use the ROW constructor keyword with parentheses and still get the same result.

You can even put parentheses around two different columns, say the first name followed by the last name, and compare that as a single row value to get the same set of data. And you can use the IN keyword or a subquery to look for rows where the first name and last name pair is in a set, returning only the records that match (see the sketch below). Next, he moves on to joins, explains how they work and how you can use parentheses with them, and goes over some of the precedence rules involved. But personally, I just like writing joins without any parentheses to understand what's going on.
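A minimal sketch of those row value comparisons, assuming a hypothetical people table (the jOOQ post uses its own sample data):

    -- These two are equivalent; the ROW keyword is optional for two or more columns
    SELECT * FROM people WHERE (first_name, last_name) = ('SUSAN', 'DAVIS');
    SELECT * FROM people WHERE ROW(first_name, last_name) = ROW('SUSAN', 'DAVIS');

    -- Matching a pair of columns against a set of row values with IN
    SELECT * FROM people
    WHERE (first_name, last_name) IN (('SUSAN', 'DAVIS'), ('NICK', 'WAHLBERG'));

    -- Or against a two-column subquery (vip_guests is also hypothetical)
    SELECT * FROM people
    WHERE (first_name, last_name) IN (SELECT first_name, last_name FROM vip_guests);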

He agrees this is nice and clean SQL, but he's using these parentheses as a way to help understand what's going on under the covers in SQL. Then, like the previous post, he talks about DISTINCT and how it applies to the whole select list as opposed to a single column. He goes over UNION, INTERSECT, and EXCEPT and how using parentheses can help you better understand what's going on with them. Finally, he follows up with subqueries along with some conclusions. So if you want a refresher on SQL or you want to increase your knowledge of it, it's definitely a blog post to check out.

The next piece of content is "Learnings from benchmarking PostgreSQL on TPC-H". This is from the It's All In Database blog on rafiasabih.blogspot.com. Here, they are benchmarking PostgreSQL using the TCP-H benchmark, which is a more analytical or an OLAP-based benchmark because it does a lot of testing with parallel queries. He covers the lessons learned in terms of parallel scans needing to use indexes tuning related parameters such as your random_page_costs, generally dropping that down to one if you're using SSDs, and properly setting the effective_cache_size to get better performance. He talks about parallel joins and the importance of work_mem. Now, I was shocked to see where he was saying this work_mem is 20% of the RAM size, which that's huge. So clearly this is for an analytical database where you have fewer larger queries running because I've never seen this for an online transaction processing database and he talks about a parallel merge join and then shared parallel hash joins as well as some other conclusions. So if you're interested in optimizing for parallel processing of an analytical database, maybe that's a blog post you'll want to check out.

The next post is "What to Look for if Your PostgreSQL Replication is Lagging". This is from severalnines.com. Now, this talks about streaming replication from one primary database to a replica, or from a replica to a replica. Now, first they go over all the different types of replication, but this is primarily talking about streaming replication and it's talking about essentially monitoring your replication. The number one way to do that is the pg_stat_replication view and they cover what all the different parameters are, what you can use on the standby node, what you can use on the primary node, and then how to identify the lag and the source of it. So if you're interested in monitoring your replication lag or you have a problem you'd like to explore, definitely check out this blog post.

The next post is "Creating a PostgreSQL procedural language - Part 4 - Handling Input Parameters". This is from 2ndquadrant.com. So this is part four where they're creating a procedural-level language using Julia and also passes off and executes code in Julia. So if you want to continue on with this series, the next post is available.

The next post is "Developing PostgreSQL for Windows, Part 2". This is from 2ndquadrant.com. So this is the second part where they're discussing actually developing PostgreSQL on, say, a Linux environment, but you want to ensure that it works on a Windows environment as well. Well, this series of posts walks through that.

The next piece of content, the PostgreSQL person of the week is Joe Conway. So if you're interested in checking out his contributions and work, you can check out the PostgreSQL Life website.

The next series of posts is primarily about tools and updates that are available. First is postgres.ai, which has a product called "Database Lab" that, as it says here, clones a multi-terabyte database in seconds. Essentially, it keeps a copy of the database and tracks diffs as you work with it, giving you your own separate slice of the database for running different experiments. Along with that, there's a SQL performance advisor called Joe, which lets you run those experiments through a Slack chat interface to check the performance of your queries using Database Lab. So it's an interesting set of tools for running EXPLAIN ANALYZE against production data and finding the best queries. If you've never heard of them before, maybe you'll want to check out the tools they have available.

The next post is "Crunchy Spatial: It's PostGIS for the Web". This is from crunchydata.com. So basically, this is a tool that Crunchy Data is working on. They're calling it Crunchy Spatial. And basically, they're building a web interface leveraging PostGIS to the data within PostgreSQL. So they're developing a series of tools or microservices that are web interfaces into PostGIS. The PostgreSQL database and the number of services they want to set up are a Vector Tile server, a Vector feature server, a data import service, a routing service, a geocoding service, and a base map service. To start off, they have two tools available for this feature set: a pg_tileserv and a pg_featureserv.

Now, related to this, they have another post, "Crunchy Spatial: Tile Serving", where they go more in-depth on pg_tileserv and how it works. It's a microservice that communicates with the data layer and basically provides a web interface to the spatial data being produced. They also include a relatively short set of steps to spin up your own example of it. So if you're interested in using PostGIS for these purposes, definitely check out these blog posts.

The next piece of content is "pgBackRest auto-select backup". This is from pgstef.github.io. So this is a post about pgBackRest and how a new version was released that has the ability to do an auto selection of the proper backup set when doing a restore based upon a target time. So if that's of interest to you, maybe you won't want to check out this blog post.

The last post is "Citus 9.2 speeds up large scale HTAP workloads on Postgres". This is from citusdata.com. Now, HTAP actually is a Hybrid Transactional Analytical Processing database and thus covers all of the new features that have been introduced in Citus 9.2, as well as a number of different performance improvements. So if you have any interest in Citus, definitely a blog post to check out.
