Parallel Indexing, SQL vs. ORM, Logical Replication Upgrades | Scaling Postgres 29
In this episode of Scaling Postgres, we review articles covering parallel indexing, SQL vs. ORM, logical replication upgrades and development DBs.
Content Discussed
- PostgreSQL: Parallel CREATE INDEX for better performance
- I don't want to learn your garbage query language
- Upgrading to PostgreSQL 11 with Logical Replication
- 12 Factor: Dev/prod parity for your database
- Why do we install as root?
- CHAR: What is it good for?
- Tuning PostgreSQL Database Parameters to Optimize Performance
Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about parallel indexing, SQL versus ORM, logical replication upgrades and development databases. I'm Creston Jameson, and this is Scaling Postgres episode 29.
[00:00:22] All right, our first article this week is PostgreSQL: Parallel CREATE INDEX for better performance. This is from the Cybertec-postgresql.com blog, and basically they're covering a new feature offered in PostgreSQL 11 that allows you to create your indexes in parallel using multiple cores of your machine. I really like the methodology this post lays out. First he creates some large tables in PostgreSQL, and he actually uses the new procedures feature to be able to insert a million rows at a time and then commit after each million rows. Then he just calls this procedure 500 times to basically create 500 million rows, and he does it with a single numeric column. He said he wanted to use numerics because they are more complex and it's going to take more time to build the indexes, so he felt it was a better test than doing it on some integer field. So basically he has 500 million rows of random numeric values, which constitutes roughly 21 GB of data. Now, in PostgreSQL 11 parallel index creation is on by default, and I believe the default value he mentions here for max parallel maintenance workers is two. But for a baseline he said, all right, let's go ahead and set that to zero, so it's essentially just going to use one core, and creating that index on the system he's using took about 17 minutes. Next he says, all right, let's try two cores. So he bumped max parallel maintenance workers up to two and it took eleven minutes. So a pretty good drop from 17 minutes to eleven minutes. Then he said, all right, let's try four, because I believe he says he's using a four-core machine. So he ups it to four workers, and it went from eleven minutes with two down to about nine minutes. He says at this point he's running into hardware limitations, he thinks of the disk, or some memory constraints, because he hasn't tweaked those yet. So he moves into adjusting memory, and then he moves up to using tablespaces for where the index goes and tablespaces for the sorting space. In the graph he presents at the end, you can see the improvements that happen at each step: going from a single core to two cores to four cores, then configuring memory drops it down a bit more, and putting things on separate hardware drops it down a bit more. And just for reference, the final data point is where he was using integers. So index creation is really dependent upon the data that you're trying to index. But I thought this was a great post to kind of give you a sense of what kind of improvements you can potentially see using these new parallel index creation features that are coming in PostgreSQL 11. So definitely a blog post I suggest you check out.
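For reference, here's a minimal sketch of the kind of setup the post describes; the table, procedure, index names and values are illustrative, not the article's exact code:

```sql
-- A single-column table of numerics, as in the post's test.
CREATE TABLE t_demo (data numeric);

-- PostgreSQL 11 procedures can commit inside the body, so each batch
-- of rows is committed as it goes:
CREATE PROCEDURE insert_batch(batch_size int)
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO t_demo
        SELECT random()::numeric FROM generate_series(1, batch_size);
    COMMIT;
END;
$$;

CALL insert_batch(1000000);  -- the post calls this repeatedly

-- Baseline: force a single-process index build.
SET max_parallel_maintenance_workers = 0;
CREATE INDEX idx_baseline ON t_demo (data);

-- Then allow parallel workers (the PostgreSQL 11 default is 2)
-- and give the build more sort memory:
SET max_parallel_maintenance_workers = 4;
SET maintenance_work_mem = '4GB';
CREATE INDEX idx_parallel ON t_demo (data);
```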
[00:03:12] The next post is I don't want to learn your garbage query language. This is from Erik Bernhardsson. Now, this doesn't necessarily mention PostgreSQL, but in this pretty short post, which he classifies as a bit of a rant, he's kind of tired of all the different ORMs or domain-specific languages that people create, and asks why can't we just stick with SQL? He talks about how even specific SaaS products have their own query languages: Splunk has SPL, Mixpanel has JQL, Rollbar has RQL, New Relic has NRQL, AdWords has AWQL. So he's like, why can't we just use SQL? I thought this was an interesting post and I definitely feel this at times. I know in my development I do tend to use ORMs, but sometimes I just drop down to pure SQL when things start getting a little bit hairy and I want to be more specific in what I'm asking of the database. And definitely whenever I'm having any performance problems, I look at the raw SQL. So this is just something to keep in mind, because maybe you're having performance problems because the ORM is doing something you don't expect. If you run into performance problems, definitely look at what raw queries the ORM is sending to the database, because that may give you insights into how you can change what you're asking the ORM to do, or perhaps drop down to using raw SQL if necessary. Although the times I've felt I've had to use an ORM or DSL is when you have dynamic queries. Like maybe you're wanting to do some reporting, and giving someone raw SQL is just too much, but you want to give them a form to be able to kind of build their own queries. Basically, I've found I've had to resort to some DSL, either building it yourself or finding one that can provide that type of feature. But definitely an interesting viewpoint I thought I'd share.
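One way to see the raw queries an ORM is actually sending is to have Postgres log slow statements and then inspect their plans directly. A small sketch; the threshold, table and query here are hypothetical:

```sql
-- Log any statement slower than 250 ms (threshold is illustrative):
ALTER SYSTEM SET log_min_duration_statement = '250ms';
SELECT pg_reload_conf();

-- Then take a query captured from the log and look at its actual plan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;  -- hypothetical table and filter
```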
[00:05:14] The next post is Upgrading to PostgreSQL 11 with Logical Replication. So again, logical replication is a feature that came online with version 10, and they're talking about potentially using it to upgrade to PostgreSQL 11, which is going to be coming out soon. First, they cover the three main ways to upgrade a PostgreSQL installation. First is just do a pg_dump and then restore it. That's the most straightforward; if your database is small, it's definitely the way to go. Second is pg_upgrade: when your database system gets a little bit larger, this is the route you would generally want to go, because doing a logical dump and restore can take a very long time, whereas you can do this with minimal downtime using the hard link option. And with version 10, logical replication is another possibility, so that's what this post goes into. Now I should add, he mentions here, quote, logical replication is the newest of the bunch here, so it will probably take some time to work out the kinks. So this is definitely a new way to try to do it, and there are certain things you need to be aware of and handle, so it's going to require a lot of testing. But basically he goes over a twelve-step process that you can follow if you wanted to try to do your upgrade using logical replication. The other option, if you're looking for kind of a zero-downtime upgrade, is potentially using pglogical, which I believe 2ndQuadrant maintains, and they've had previous posts that give some insight into how they do that. So with PostgreSQL 11 right around the corner, this was an interesting post to check out to see if you potentially want to do your upgrades using this method.
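The core of the logical replication approach looks roughly like this; the names and connection string are placeholders, not the post's exact steps:

```sql
-- On the old server (the publisher, e.g. PostgreSQL 10):
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new PostgreSQL 11 server (the subscriber), after restoring the
-- schema, for example with pg_dump --schema-only:
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-server dbname=mydb user=replicator'
    PUBLICATION upgrade_pub;
```

Once the subscriber catches up you can switch the application over, though details like sequence values are not replicated and need to be handled separately, which is part of why the testing he recommends matters.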
[00:06:56] The next post is 12 Factor: Dev/prod parity for your database. This is from the Citusdata.com blog, and basically they're talking about how you generally want to have parity, or have your development database be as close as possible to your production database. They say minor version differences may be okay, but you definitely want to try to keep it as similar as possible, and if you can, keep the data as similar as possible. Now of course this is not easy if you have a terabyte or multi-terabyte database to keep in sync. Plus, your production data generally includes data that you need to keep secure, and you don't necessarily want it all over developers' machines.
[00:07:44] So there's a couple of different ways you could do this. You could potentially scrub all the personally identifiable information from the database, but again, it's not an easy process to do; you may miss something. One thing I've seen people do is set up staging databases, or even production replicas, and they deploy application changes to those staging databases, test that everything is still working, and then deploy to production. And for particular queries, you could test those against a copy of the live production database. Now, it won't be exactly identical, but if you have a recent copy of production you can help identify cases where queries may choose one path in production but a different one in development. So that's something you can do to minimize performance differences between development and production. In the post they also talk about things to keep in mind, like the safe process to follow when you're adding a NOT NULL column that needs a default, as well as always creating your indexes concurrently; there's a sketch of both patterns below. But the post continues on and mentions essentially having a replica of production, like I mentioned, by forking your database, and they talk about some tools Citus Data has for their solution. But you can do a restore to a separate PostgreSQL database to replicate something similar, or just maintain a separate server with a backup and apply the WAL files to it. So definitely an interesting blog post with some ideas to keep your development environment similar to your production database, or if that's not really possible, how you could minimize potential issues that are due to database differences between development and production.
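As a sketch of those two patterns, with illustrative table and column names (on PostgreSQL 11 the non-blocking default behavior is built in, so the step-by-step version below matters mostly on earlier versions):

```sql
-- Adding a NOT NULL column with a default in non-blocking steps
-- (before PostgreSQL 11, adding a column with a default rewrote the table):
ALTER TABLE users ADD COLUMN status text;
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';
UPDATE users SET status = 'active' WHERE status IS NULL;  -- batch this on big tables
ALTER TABLE users ALTER COLUMN status SET NOT NULL;

-- And build indexes without blocking writes
-- (note: CONCURRENTLY cannot run inside a transaction block):
CREATE INDEX CONCURRENTLY idx_users_status ON users (status);
```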
[00:09:24] The next post is Why do we install as root? This is from Dave Page's Postgres blog at pgsnake.blogspot.com, and he basically says a couple of common questions he hears from customers are: why do we install our software as root, and why do we run the services as postgres? And he says the simple answer is: for security. Basically, Postgres gets installed as root so that many of the files are owned by root; if the postgres user for whatever reason gets compromised, it can't modify the actual code or expand the attack to other areas of the server as much. And why run the services as postgres? Because it owns the data files, and again it minimizes the extent to which a compromise of that service will impact the server as a whole. And he states a basic principle when securing a software installation, quote: install with maximum privilege requirements and run with minimal. So if you've ever asked those questions, this is definitely a relatively short blog post to check out.
[00:10:31] The next post is CHAR: What is it good for? This is from thebuild.com, and basically he talks about how PostgreSQL has the char type and it's little used. Most people use varchar or text, and kind of the reason is there's no benefit to using the char data type. In other database systems there's more of a reason to use it, because varchar actually saves space versus a char, and a char is definitely a fixed length, but Postgres doesn't really work like that. He talks about some surprising differences with char and basically suggests you definitely only use varchar or text. If you want to get more details about the differences, this is definitely a blog post to check out.
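A quick demo of the kind of surprises he's referring to; these are standard Postgres semantics for char, though the specific examples here are mine, not the post's:

```sql
-- char(n) is blank-padded to n characters, but the padding is
-- semantically invisible in most operations:
SELECT 'a'::char(5) = 'a  '::char(5);  -- true: trailing blanks are ignored
SELECT length('a'::char(5));           -- 1, not 5
SELECT 'a'::char(5) || 'b';            -- 'ab': padding dropped on conversion to text
```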
[00:11:15] The next post is Tuning PostgreSQL Database Parameters to Optimize Performance, and this is from the Percona Database Performance Blog. Basically, they go over a number of parameters to tweak when you're configuring your PostgreSQL instance. They talk about shared buffers, which is generally 25% of your RAM, and they give a couple of suggestions on how you may want to tweak that. They talk about WAL buffers and how it defaults to 16 megabytes, but if you have a lot of concurrent connections you may want to increase that. They talk about effective cache size, which they suggest should generally be 50% of your memory. They talk about work mem and how this is essentially a per-connection setting, although it can be used more than once depending upon the queries you're doing. For example, quote, if you have many users trying to execute sort operations, then the system will allocate work memory times the total sort operations for all users. So generally you want to keep a ceiling on this if you have a lot of connections; however, you could ramp it up for a specific session, or if you don't have a lot of connections to your database. Then it talks about maintenance work memory, which is used for maintenance tasks such as vacuum, restore, create index, add foreign key and alter table, and how this can boost your index creation speed, which is one of the settings that was changed in the first post we looked at. Then they talk about synchronous commit and how you could potentially turn that off for more speed at the cost of reliability, although I don't know if I would necessarily do that. And then they follow up with checkpoint timeout and checkpoint completion target and some general advice with regard to those. So if you're looking to learn more about tuning your PostgreSQL installation, definitely a blog post to check out.
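Pulling those settings together, here's a hedged example of what starting points might look like for a dedicated 16 GB machine; these values are purely illustrative, and the right numbers depend entirely on your workload:

```sql
ALTER SYSTEM SET shared_buffers = '4GB';            -- ~25% of RAM (restart required)
ALTER SYSTEM SET effective_cache_size = '8GB';      -- ~50% of RAM
ALTER SYSTEM SET wal_buffers = '16MB';
ALTER SYSTEM SET work_mem = '32MB';                 -- per sort/hash operation, per query
ALTER SYSTEM SET maintenance_work_mem = '512MB';    -- vacuum, CREATE INDEX, etc.
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();  -- shared_buffers still needs a full server restart
```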
[00:13:04] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or iTunes. Thanks.