Generate Test Data, Faster Archiving, Date Statistics, Useless Indexes | Scaling Postgres 247
In this episode of Scaling Postgres, we discuss how to generate test data, how WAL archiving is faster in PG15, how to set date statistics on a timestamp and how to avoid useless indexes.
Content Discussed
- Generating lots of test data with Postgres, fast and faster
- Speed Up of the WAL Archiving in PostgreSQL 15
- Improving GROUP BY with CREATE STATISTICS
- Don't do this: creating useless indexes
- JSON logs in PostgreSQL 15
- PostgreSQL hidden gems
- Advent of Code 2022: Days 6-10 With PostgreSQL
- How to Solve Advent of Code 2022 Using Postgres - Day 6
- How to Solve Advent of Code 2022 Using Postgres - Day 7
- How to Solve Advent of Code 2022 Using Postgres - Day 8
- How to Solve Advent of Code 2022 Using Postgres - Day 9
- How to Solve Advent of Code 2022 Using Postgres - Day 10
- PGSQL Phriday #004: PostgreSQL and Software Development
- PGSQL Phriday #004: PostgreSQL and Software Development
- Database DevOps and Code management with PostgreSQL
- PGSQL Phriday #004: Sequence Survey
- Postgres and Software Development: PGSQL Phriday #004
- UNION ALL, data types and performance
- Speed up Postgres queries with UNIONs and subquery pull-up
- Transparent Data Encryption (TDE)
- pgsodium
- Diffing PostgreSQL Schema Changes
- PostgreSQL Vacuuming to Optimize Database Performance and Reclaim Space
- Oracle vs SQL Server vs PostgreSQL – DATE data type
- PostgreSQL 16: part 3 or CommitFest 2022-11
- Row estimates
- Postgres year in review 2022
- Transaction ID wraparound
- Lukas Eder
YouTube Video
Podcast Audio
Transcript
All right, I hope you, your friends, family and coworkers continue to do well. Welcome to 2023. This is the first episode of the year after taking a few weeks off, and a lot of content was generated over the holiday break, so I'm probably going to go through it pretty fast. Our first piece of content is "Generating lots of test data with Postgres, fast and faster". This is from kmoppel.github.io, and this blog post covers all sorts of different ways to generate test data. Starting from the basics using generate_series, he has a quick way to generate 30 million rows, roughly a 1 GB table, in a few seconds. There is a lot more you can do with generate_series to incorporate randomization and things of that nature, but plain generate_series is what he calls Level 0. Then he goes to Level 1, where he basically uses pgbench. The remainder of the post talks about different configurations you can add to pgbench to generate quick test data. Now, the benefit of generate_series is that you can generate your own test data, whereas pgbench generates a particular schema of test data for you. But this post is a great resource for finding the different configuration flags you can pass to pgbench to get the database size you're looking for.
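Just to make the generate_series approach concrete, here's a minimal sketch of the kind of statement he's describing. The table name, columns, and expressions are my own made-up example, not the exact SQL from the post.

```sql
-- Hypothetical example (not the exact SQL from the post): bulk-create roughly
-- 30 million rows of test data with generate_series, mixing in some randomness.
CREATE TABLE test_data AS
SELECT
    id,
    md5(random()::text)                    AS payload,     -- random-looking text
    (random() * 1000)::int                 AS category_id, -- random integer
    now() - random() * interval '365 days' AS created_at   -- random timestamp within the last year
FROM generate_series(1, 30000000) AS id;
```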
For pgbench, he shows the parameters he tends to use when generating a database; at a scale of 5,000, it produces about 73 GB. He says you can also cut down on logging by creating the tables unlogged, which bypasses the WAL and finishes faster, and he shows this gives about a 40% boost. You could also generate data only, skipping the indexes, which gives about a 3x boost, he says. To hit the I/O even harder and avoid CPU bottlenecks, you can lower the fill factor: instead of a fill factor of 100, where pages are completely full, he used a fill factor of 10, which really pushes the I/O. Or you can use parallel streams. Everything so far has been single-threaded, but as he describes, you can run jobs in parallel to use more CPUs and push the I/O even harder. Then he shows the results of the different changes and the speed-up improvements, and gives some final bonus advice for when you want to push the amount of test data even further. So if you're looking for a way to generate test data, this is definitely a resource to check out.
Next piece of content- "Speed Up of the WAL Archiving in PostgreSQL 15". This is from percona.com. So they're talking about archiving and how in 15, they offered the Archive Module or the Library feature to be able to bypass running a command to do the archiving. But this indicates that even the process of archiving the WAL has been improved when you're just using the archive command. Basically, it used to do a lot of directory reads, looking at the archive status to check and see which file needed to be archived next. But they actually made it more efficient by just scanning the directory once, holding the results in an array in memory, and that drastically reduced the number of directory scans. So much so that some people are reporting a 20x performance improvement with regard to the speed at which WAL can be archived. So if you run into this issue where your database is having trouble keeping up archiving WAL, maybe Postgres 15 offers some speed improvements in that area. And definitely check out this blog post if you want to learn more.
Next piece of content- "IMPROVING GROUP BY WITH CREATE STATISTICS". This is from cybertec-postgresql.com. He's talking about a situation where he has a time series table. So there's a timestamp and different values in a numeric field. When you're trying to group by them if you just group by a date, even though there is a correlation with many of those timestamp values because they have the same date, it's just the time of day that has changed. But Postgres isn't aware of that correlation. When you try to do a GROUP BY where you're truncating to the day of the timestamp, it actually has to look through every row to kind of find the information you're looking for when doing an explain plan. But he says you can use the CREATE STATISTICS capabilities in Postgres and create statistics on that day truncation and then it actually allows the query planner to give you an accurate estimate of how many rows have that particular day. Now, in the example he presented in this post, just selecting from the whole table, you don't see a difference in the performance when looking at EXPLAIN or ANALYZE. But with this additional knowledge, when you're looking for, say, a specific date range, it should allow the query planner to give you a better plan for pulling out the data that you actually need to pull out. But if you want to learn more about how to do this, definitely check out this blog post.
Next piece of content- "Don't do this: creating useless indexes". This is from mydbanotebook.org. And this is actually part of the PgSQL Phriday blog post. But I thought this warranted particular merit because they're talking about how some ORMs, even mentioned Django, had the ability to create indexes on all columns of the table, which to me sounds a little insane. And they've even seen a situation where there's a database that had twelve times more indexes than the actual data stored on the disk, which is pretty crazy. So definitely don't create indexes on every column you have, or even close to that. Usually, the recommendation is to, of course, have your primary keys with indexes, and then, of course, your foreign keys, and then apart from that, there may be additional columns that you need to index for better performance. But keep in mind that every index you add takes up more disk space and reduces the amount of more valuable indexes or data that can be in the cache, potentially. Not to mention the write throughput because every index on the table has to be kept up to date when new data is inserted or potentially updated. So you definitely need indexes, but definitely don't have too many. But if you want to learn more about that, check out this blog post.
Next piece of content- "JSON LOGS IN POSTGRESQL 15". This is from cybertec-postgresql.com, and they talk about how to set up the new JSON logging feature in Postgres 15. Now, if you do this, be sure you have a toolset that can actually consume the JSON logs. In particular, they're making a point that you need to be careful because not every set of data is represented in the JSON, meaning that in this example you actually have fewer amounts of data with a log entry vs. an error entry. You can see that there are more pieces of information contained within that JSON. Now that allows for some efficiency, but just keep in mind that you need a tool that can properly parse the JSON output if you want to move to that. But check out this blog post if you want to learn more.
Next piece of content- "PostgreSQL hidden gems". This is from peter.eisentraut.org, and he's talking about some potentially overlooked features in previous versions of Postgres, or at least what he thinks are today the most overlooked features. In terms of 9.4, he's talking about replication slots. In terms of 9.5, it's pg_rewind. He talks about non-exclusive backups in 9.6, which has definitely affected some of my clients because the exclusive backups were entirely removed in Postgres 15. Postgres 10 is the CREATE STATISTICS option. Postgres 11 is the system catalog file format, which is more appropriate for developers Postgres. Postgres 12 is setting a ssl_min_protocol_version. Definitely important to no longer use the older versions. Postgres 13 is logical replication of partition tables, definitely a benefit, and Postgres 14 is ALTER TABLE, DETACH PARTITION, and CONCURRENTLY, so being able to get rid of a partition you're no longer using without blocking the whole table. So definitely some hidden gems. Check out this blog post if you want to learn more.
Next piece of content- "Advent of Code 2022: Days 6-10 With PostgreSQL". This is from softwareandbooze.com, and for days six through ten of Advent of Code, he has released his solutions in the form of YouTube videos. Now, related to this, Greg at CrunchyData has also released his solutions, but as an individual blog post. So he has Days 6, 7, 8, 9, and 10. So definitely check out these pieces of content if you want to learn how to do the advent of code using just PostgreSQL.
Next piece of content- "PGSQL Phriday #004: PostgreSQL and Software Development". This is from hdombrovskaya.wordpress.com. And this is January's PGSQL Phriday. They're talking about software development with Postgres. Now, they're not necessarily talking about application development, but this is the database-related scripts or jobs that are associated with working with Postgres. He asked questions such as "Do you have your own scripts and where do you store those? Do you store your SQL code in GitHub or elsewhere? Do you use pgTAP and things of that nature?". So in terms of responses, andreas.scherbaum.la gave a response where he talked about his PostgreSQL Person of the Week interviews and gave some responses according to how different people he's interviewed have answered it.
Softwareandbooze.com gave a response, talking about how he tends to manage the SQL, particularly from a DevOps perspective.
Gorthx.wordpress.com gave a response and talked about identifying sequences that were about to run out and a script on how to do it.
And RustproofLabs gave a response, talking about the different types of code that they have in terms of being mission-critical, nontrivial or trivial, and how they go ahead and work with that code in their solutions. So if you want to learn more about this, you can definitely check out these blog posts.
Next piece of content- "UNION ALL, DATA TYPES AND PERFORMANCE". This is from cybertec-postgresql.com. They're talking about how they were doing a UNION to try to merge two types of tables together, and they got an unexpected result in terms of poor performance. So they had a bird table, a bat table, and a cat table, and then they created a view for a mammal and included body temperature and did a UNION ALL and they gave a pretty fast response. But then they created a view called flying_animal, including wingspan using UNION ALL, and they got a very slow response.
The reason is that, if you look at the tables, the wingspan column on the bird table is of type real, while on the bat table it's numeric, and it's this data type difference that caused the poor performance. The blog post goes into how they discovered it by comparing the EXPLAIN plans. Basically, because the data types differ, Postgres has to add an implicit cast, which gets in the way of the planner, but if you add an explicit cast so the branches have the same type, the problem goes away. Now, if you want to learn more about this, this is the article Lukas covered this week in "Five minutes of Postgres" on pganalyze.com, where he also covers previous discussions Cybertec has had about using ORs and UNIONs. So definitely check out his episode to learn more.
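Based on the tables as described, here's a rough sketch of the problem and the fix. The exact table definitions in the post may differ; the point is simply that both branches of the UNION ALL should expose the same data type.

```sql
-- Hypothetical simplified tables; the post's actual definitions may differ.
CREATE TABLE bird (id bigint, name text, wingspan real);
CREATE TABLE bat  (id bigint, name text, wingspan numeric);

-- Problematic view: the branches return different types (real vs numeric),
-- so PostgreSQL has to insert a cast that gets in the planner's way.
CREATE VIEW flying_animal_slow AS
SELECT id, name, wingspan FROM bird
UNION ALL
SELECT id, name, wingspan FROM bat;

-- Fix: cast explicitly so both branches have the same data type.
CREATE VIEW flying_animal AS
SELECT id, name, wingspan::numeric AS wingspan FROM bird
UNION ALL
SELECT id, name, wingspan FROM bat;
```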
Next piece of content- "Transparent Data Encryption (TDE)". This is from percona.com, and this is a blog post that covers all sorts of different levels at which you can do transparent data encryption. Now, Postgres hardly does any of these really only column-based, so it does not offer anything out of the box for disk-level encryption. You can rely on your operating system for that. It doesn't offer cluster-level encryption, nor database level, nor table level. You can see other solutions such as Oracle, MySQL, and Microsoft SQL Server do have solutions for it, but really only column levels are available for PostgreSQL. Now they mentioned pgcrypto, so you can use pgcrypto to do column level. Definitely check out this blog post if you want to learn more about that. But there is another solution other than pgcrypto, which is pgsodium. So this is another way to do transparent column encryption for Postgres. So if you're looking into pgcrypto, maybe check out pgsodium as well.
Next piece of content- "Diffing PostgreSQL Schema Changes". This is from percona.com, and how they're doing this is they basically take logical dump manifests and then diff the difference between them so you can see the schema different changes. So if you're interested in that, you can check out this blog post.
Next piece of content- "PostgreSQL Vacuuming to Optimize Database Performance and Reclaim Space". This is from percona.com, and they're talking about vacuum and the different settings you can use to optimize your database performance and hopefully reclaim some space. You can check out this blog post to learn more.
Next piece of content- "Oracle vs SQL Server vs PostgreSQL - DATE data type". This is from migops.com, and for those of you who want to migrate to Postgres, this talks about the DATE type differences between these two database systems compared to Postgres. So check out this blog post if you want to learn more.
Next piece of content- "PostgreSQL 16: part 3 or CommitFest 2020-11". This is from postgrespro.com, and they cover a number of features that were worked on during the third CommitFest, so you can check out this blog post if we're going to learn more.
And over the break, there were actually three episodes of Postgres FM. The one on December 23 was on "Row estimates", how inaccurate estimates can cause poor performance, how to spot problems, and options to fix them. The episode on December 30 was "Postgres year in review 2022", and the episode on January 6 was about "Transaction ID wraparound": what it is, and some ideas to minimize the risk of it happening. So definitely check out these episodes if you're interested in this content.