
ARM Tests, Tips & Tricks, Hierarchical Structures, Benchmarking Framework | Scaling Postgres 149

In this episode of Scaling Postgres, we discuss tests of Postgres on ARM processors, 2021 tips & tricks, working with hierarchical structures and creating a benchmarking framework.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PostgreSQL on ARM-based AWS EC2 instances. Is It Any Good?". This is from percona.com, and they're talking about the new Graviton2-based AWS instances that were added in May 2020. They compared them against an x86 platform. Now, they're calling it older, so maybe there'll be a new one at some point, but they compared an older m5d.8xlarge against the new ARM-based m6gd.8xlarge. Both have the same number of virtual CPU cores and the same amount of memory. The storage is a little bit different: they used the local NVMe storage, not the EBS storage that Amazon offers. But of course, the difference you'll notice here is that it's cheaper, I think they said 25% cheaper.

So if the performance is equivalent, you already get a win here by paying 25% less. They installed Ubuntu on these systems, did the Postgres configuration as defined here, and ran some pgbench tests. With a read-write workload, they found that the ARM processor gave a 19% performance gain. So even though it costs 25% less, it gave nearly 20% more performance. That's pretty good. With read-only, it was a little better: a 30% gain in transactions per second for the ARM processor versus the x86 processor. So it looks like by switching to these, you could immediately get better performance for a lower price. Next, they tested sysbench-tpcc. Now, what they did mention is that while they didn't run into any issues with PostgreSQL on ARM, they did have issues with sysbench on it.

So that's just something to keep in mind. They looked at an in-memory load with 16 threads, and you can see the ARM instance in blue gives about 15% better performance. Now, this drop-off, they said, is due to checkpointing and the full page writes that happen; it just happened earlier here since it was processing more than the x86 platform. They did in-memory with 32 threads, and the difference was somewhat reduced. Now, remember, these are using 32 virtual CPU cores. Then they went to 64 threads, and the difference between them got narrower and narrower as the threads increased. In some IO tests, they also saw better performance from the x86 platform. So it's not always the case that ARM wins every time. But in a fair number of the tests they've shown here, it gave what they're calling about a 15% to 20% performance gain for 25% less money. So definitely an interesting finding. Perhaps you want to check out using PostgreSQL on ARM for your workloads.

The next piece of content is "TIPS AND TRICKS TO KICK-START THE POSTGRES YEAR 2021". This is from cybertec-postgresql.com and the first one they cover is partial indexes. This is just something that is frequently on his tips and tricks list, he says because it's so important for performance. Because if you can only index certain parts of the data, that index will be smaller and much faster to look through. One scenario where a partial index is crucial is when dealing with a status column that has a limited number of unique statuses. By creating a partial index for each frequently queried status, the index size can be significantly reduced as it only includes those specific statuses or items with low cardinality. Just create partial indexes for each of those values in those particular fields. He uses an example of when you have, say, a lot of NULLS in a column but a few values, a partial index would be good in that case. 

The next tip he mentions is estimating the average row or record size in bytes and he says you can do this pretty easily using the function pg_column_size. Now this says column size, but you can actually apply it to all the columns of the table by using a query such as this where you're referencing the entire table so that it is possible to give you an average row size. The next tip he mentioned is avoiding the risk of upgrading older Postgres instances to new major versions and basically saying to use logical replication. So if there's some reason you can't have downtime, logical replication is a means to do that upgrade, ideally without any downtime at all. 
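
A rough sketch of that row-size trick (mytable is just a placeholder name):

-- Passing the whole row alias to pg_column_size() returns the stored size
-- of each row in bytes; averaging it gives a rough per-row estimate.
SELECT avg(pg_column_size(t)) AS avg_row_bytes,
       count(*)               AS rows_sampled
FROM   mytable AS t;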

The next tip is detecting which index types are available for a given column type. He gives a query where you specify the column type or data type you're using, and it shows the available index types for it. He mentions you can also add extensions such as btree_gin or btree_gist to get a few more options as well. So basically, be aware of the different index types that can potentially give you better performance for what you're trying to query. The last tip is inspecting metadata at the speed of light with psql's \gdesc. This gives you a description of all the data types from a query by ending it with \gdesc instead of running it. So if you're interested in some of these tips, go ahead and check out this post.
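
This is not the exact query from the post, but as a sketch you can get the same kind of answer by joining pg_opclass to pg_am for a given data type, and \gdesc is used like this in psql:

-- List index access methods that have an operator class for an int4 column:
SELECT DISTINCT am.amname
FROM   pg_opclass opc
JOIN   pg_am am ON am.oid = opc.opcmethod
WHERE  opc.opcintype = 'int4'::regtype
ORDER  BY am.amname;

-- In psql, end a query with \gdesc to see its output columns and types
-- without actually executing it:
SELECT relname, reltuples FROM pg_class \gdesc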

The next piece of content is "Hierarchical Structures in PostgreSQL". This is from hovervear.org. So they're talking about a hierarchical structure where you have departments and what departments are located and others. So they have a name. So the Engineering department has no parent, whereas, say, the Product department is part of Engineering and Finance and Human Resources are part of Administration, so it's a hierarchical organization. Then they wanted to create a path that defined that hierarchy that you could just look through. They built this using two different implementations. The first implementation used materialized views with recursive CTEs. The materialized view was just to be able to refresh it conveniently, but the recursive CTE is how it goes through and builds that path structure that we saw. 

This is a relatively simple query to do that, and the materialized view just lets you refresh it on a regular basis. They show an example of how you can query it, what the generated path looks like, and how you can pull out individual values as well as look for, say, where Product sits in the hierarchy. The second implementation uses ltree, and we've discussed this in a previous episode of Scaling Postgres, but ltree is an extension you can add that actually builds this for you. So you can see, when they created this teams table, they used the ltree data type for the path. Then they load the data and it gives you the same type of path. Now it doesn't look identical, it's a little bit different, but it shows you how you can query this path to get the same kinds of answers. So if you're interested in learning how to store hierarchical data, then maybe you want to check out this blog post.
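
And a minimal ltree sketch under the same illustrative assumptions:

CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE teams (
    id   int   PRIMARY KEY,
    path ltree NOT NULL
);

INSERT INTO teams (id, path) VALUES
    (1, 'Engineering'),
    (2, 'Engineering.Product'),
    (3, 'Administration.Finance');

-- Find everything at or below Engineering in the hierarchy:
SELECT * FROM teams WHERE path <@ 'Engineering';

-- A GiST index supports these ancestor/descendant operators:
CREATE INDEX teams_path_gist_idx ON teams USING gist (path);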

The next piece of content is "Demo of another benchmarking framework". This is from markwkm.blogspot.com. He's talking about developing a more convenient benchmarking framework that you can load in your own queries to do benchmarks against different data sets or different databases. So he used a data warehousing example and used a dbgen and qgen from the TPC-H framework. His user-defined benchmark is called pgnottpch. So it's basically him developing this benchmark that could potentially be used to put your own information into to test against databases. So if you're interested in that, you can check out this blog post.

The next piece of content is "Waiting for PostgreSQL 14 - Add pg_stat_database counters for sessions and session time". This is from depesz.com. So basically to this view, the pg_stat_database, it's adding session_time, active_time, idle_in_transaction_time, the number of sessions, and the number of sessions_abandoned,_fatal, or_killed. So it gives you an overview of how much time different sessions are spent in different activities, as well as where they potentially may fall down. So you can gather statistics on that.

Also from depesz.com, "Are there limits to partition counts?". Now, this is interesting. He actually created up to 70,000 partitions for a table because he saw someone referencing that there's some limit on it. He went all the way up to 70,000 without any issues doing that. However, he did notice an increase in the planning time required for queries running against these partitioned tables. So it does work, but you probably don't necessarily want to do that. But definitely an interesting post about what's possible. He actually has a quote here: "I would probably discourage people from making 5000+ partitions on a single table, but Pg can definitely handle it". So if you want to learn more about this, you can check out this blog post.
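
As a heavily scaled-down sketch of that kind of experiment (just ten partitions here, and an illustrative table definition rather than the one from the post), partitions can be generated programmatically:

CREATE TABLE events (
    id      bigint,
    created date NOT NULL
) PARTITION BY RANGE (created);

-- Create ten daily partitions; the post pushes this idea into the tens of
-- thousands and then measures how query planning time grows with the count.
DO $$
BEGIN
    FOR i IN 0..9 LOOP
        EXECUTE format(
            'CREATE TABLE events_p%s PARTITION OF events
                 FOR VALUES FROM (%L) TO (%L)',
            i,
            DATE '2021-01-01' + i,
            DATE '2021-01-01' + i + 1);
    END LOOP;
END
$$;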

The next piece of content is "'How To' with Dave Page Episode 3: Install and Run pgAdmin Using the Python Package". This is from Enterprisedb.com. So there's a convenient YouTube video here that shows you exactly how to do that.

The next piece of content is "GOLDEN PROPORTIONS IN POSTGRESQL". This is from cybertec-postgresql.com, and they're talking about golden proportions where they're basically certain mathematical values that have interesting concepts. So one particular value he mentioned is 1.618, and he uses this in reference to building a new office that they were discussing. Now, the interest of this post is basically on all the different mathematical and conversions you can use within PostgreSQL, because this whole post talks about doing different calculations and computations using SQL. So if you want to learn more about the different functions and mathematical computations that are capable of postgres, maybe you want to check out this blog post.

The next piece of content is "Production PostGIS Vector Tiles: Caching". This is from blog.crunchydata.com. They're talking about an issue where you have, say, a public website that needs to look at images, and you're using pg_tileserv to dynamically pull that data to present to users of your application. That creates a huge load on the pg_tileserv as well as on the database to pull that data from it. Now, one way you could do it is to set up multiple databases in a cluster like this. 

But he advocates that probably the easiest thing to do is set up caching. He used Varnish in this case. Basically, you cache the results of pg_tileserv to serve your application. So you could probably get by with a one-minute, five-minute, or even 30-minute cache of the data. You don't necessarily need it all to be real-time, but this type of caching would still give you near real-time results from your application. So if you're interested in learning more about that, you can check out this blog post.

The last piece of content, the PostgreSQL person of the week is Marc Linster. So if you're interested in learning more about Marc and his contributions to PostgreSQL, definitely check out this blog post.
