Value of SQL, Window Functions, DB Migrations, Data Storage | Scaling Postgres 51
In this episode of Scaling Postgres, we review articles covering the value of SQL, window functions, scaling database migrations and efficient data storage.
Content Discussed
- [YouTube] Breaking PostgreSQL at Scale
- SQL: One of the Most Valuable Skills
- How to Calculate a Cumulative Percentage in SQL
- What’s new in PostgreSQL 11
- Move fast and migrate things: how we automated migrations in Postgres
- Performance best practices for using Azure Database for PostgreSQL – Connection Pooling
- PostgreSQL Deep Dive: How Your Data Model Affects Storage
- FOSDEM PostgreSQL Track
- FOSDEM Databases Track
- [YouTube] What's new in PostgreSQL 11
- [YouTube] Deploying PostgreSQL on Kubernetes
- [YouTube] Hacking PostgreSQL
- [YouTube] Around the world with Postgres extensions
- [YouTube] Data Modeling, Normalization and Denormalization
- What’s up with SET TRANSACTION SNAPSHOT?
- An Overview of the Index Changes in PostgreSQL 11
YouTube Video
Podcast Audio
Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about the value of SQL, window functions, migrations and data storage. I'm Creston Jameson, and this is Scaling Postgres episode 51.
[00:00:20] Alright, before we get started this week, I wanted to mention that on February 28th, which is a Thursday, I'm planning a live show or webinar of sorts about PostgreSQL configuration. If that's something of interest to you, please go ahead and respond in the YouTube comments or reply to the email.
[00:00:42] Let me know if that would be something potentially of interest to you, and I'll have more information about that on the next episode of Scaling Postgres. But to get started for this week, we first have a piece of content that I had the slide presentation for last week, Breaking PostgreSQL at Scale. This is from Christophe Pettus, and they did post the YouTube video for it. So I will have this link in the show notes, and I definitely suggest watching this piece of content to match up with the slide presentation that was mentioned in last week's episode of Scaling Postgres.

The next piece of content is SQL: One of the Most Valuable Skills, and this is from Craig Kerstiens. He's talking about how well SQL, or knowing it, has served him in his different responsibilities, from being a product manager to being an engineer. He focuses on, number one, it's valuable across different roles and responsibilities. So whether you're more on the business side or more on the engineering side, it's useful because data exists in both of those domains. Number two, learning it once doesn't really require relearning. He talks about how SQL hasn't changed that much over many years. Now, they add new features to it, such as CTEs (common table expressions) or window functions, but the fundamentals of getting to the data haven't really changed that much, whereas if you look at a programming language like JavaScript, there seems to be a new framework out every two weeks or something like that. And then, number three, you seem like a superhero: quote, you seem extra powerful when you know it because of the amount of people that aren't fluent. So basically, he says a lot of developers ask him for advice because he knows a fair amount of SQL. So if you want to make yourself a better developer, particularly when working with data, this is definitely a skill to learn. It's a good post and I definitely suggest you read it to get more of what he's talking about.

Now, related to that, the next post is How to Calculate a Cumulative Percentage in SQL. This is from the jOOQ.org blog, and he has a chart where he's calculating a cumulative percentage in SQL from some raw data, and he goes through how to do it. So this is a very simple way to start learning window functions, because that's how he's doing it. First he groups the data by date, then does the percentage calculation using window functions, taking a sum of the amount over an order by payment date and dividing by the sum over all the amounts. And he has the resulting SQL query here to show the percentages cumulatively rising. He also shows that you can even nest aggregate functions in window functions, so there's another representation of doing that. So if you're wanting to learn more about window functions, definitely a blog post to check out.
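To make that concrete, here is a rough sketch of the technique (the payment table and column names are assumptions for illustration, not necessarily the article's exact query):

```sql
-- Hypothetical table: payment(payment_date date, amount numeric)
-- Cumulative percentage of daily totals, ordered by payment date.
SELECT
  payment_date,
  100 * SUM(SUM(amount)) OVER (ORDER BY payment_date)  -- running total (aggregate nested inside a window function)
      / SUM(SUM(amount)) OVER ()                       -- grand total across all rows
      AS cumulative_percentage
FROM payment
GROUP BY payment_date
ORDER BY payment_date;
```

The nested SUM(SUM(amount)) works because the inner aggregate is computed per group first, and the window function then runs over those grouped rows.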
[00:03:45] The next post is What's new in PostgreSQL 11?, and this is from modern-sql.com. Now, this is not like a lot of the other PostgreSQL 11 posts that are out there, because the focus is on the SQL. He talks about the SQL:2011 OVER clause, where PostgreSQL 11 has added additional features to support the GROUPS frame unit and frame exclusion. So if you wanted to learn even more about window functions and SQL in general, this is definitely a blog post to check out, because he goes into depth on what each of those mean. In addition, he goes over the implementation of procedures (in addition to functions) and the CALL statement to be able to call those procedures, and he does comparisons in his blog post about what PostgreSQL supports versus other database vendors. Then he talks about parameters in FETCH FIRST/NEXT, which is essentially the limit clause: LIMIT is PostgreSQL-specific, but FETCH FIRST n ROWS ONLY, for example, is the SQL standard way of doing a limit. And then he talks about relative XPath expressions as well. So it's a different take on the PostgreSQL 11 feature set, mostly as it relates to SQL. If that's of interest to you, definitely a blog post to check out.

The next post is Move fast and migrate things: how we automated migrations in Postgres, and this is from the Benchling engineering blog. They say here they use an ORM, SQLAlchemy, and its companion migration tool, Alembic, and they describe how they use these tools to make changes to their database over time. A problem they ran into was that they had some downtime. They were using a reference to determine what operations were safe to run on a large and active database, and they were adding a nullable institution column to the users table, which should be safe, but they ran into a lock queuing issue. For example, there was a long-running transaction that had grabbed a share lock on users. Their migration tried to run, but it was waiting for that to finish. Now, because the migration wants an exclusive lock, another request that comes in for the users table gets queued behind it. So now all these additional user request transactions are waiting for access to the users table, because of the ACCESS SHARE lock on the table that is blocking the migration's ACCESS EXCLUSIVE lock. The way you handle this is with some sort of lock timeout, and they started using a lock timeout and a statement timeout so that the migration can gracefully fail. But of course, when they started doing that, it required manual intervention. So they usually had to use one of these three strategies to get around it. One, check if the migration was safe to rerun, and if so, retry the migration manually to see if they just got unlucky. Two, investigate what locks were blocking it and possibly shut down a cron system for some period, because the issue they had before was a cron running a long-running job. And three, wait until a better time, when the database is less active, to run the migration. Now, they tried some different things to make it easier, but they still had to manually rerun migrations a fair amount. So they actually built infrastructure, they say, to automatically retry safe migrations. With this system, they only retry migrations that have no intermediate commits, so basically the migration is just doing one thing, like adding one column; they wait two minutes between each retry to give systems time to recover, or engineers time to respond if there's a problem; and they try at most ten times.
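The lock timeout approach they mention might look something like this (a minimal sketch; the timeout values and the column type are assumptions, while the table and column names come from their example):

```sql
-- Fail fast instead of queuing behind a long-running transaction: if the
-- ALTER TABLE can't get its ACCESS EXCLUSIVE lock quickly, abort so it can be retried.
SET lock_timeout = '2s';        -- assumed value: give up waiting on locks after 2 seconds
SET statement_timeout = '10s';  -- assumed value: abort the statement if it runs too long

-- Adding a nullable column is normally a fast, metadata-only change once the lock is acquired.
ALTER TABLE users ADD COLUMN institution text;
```

If the timeout fires, the migration fails cleanly and can be retried later, rather than blocking every other query against the users table behind its lock request.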
And they say, quote, in three months of running this in production, we have seen all of our migrations go through successfully without any manual work. Now, they also talk about the concept of pre- and post-deploy migrations, which I believe is a feature of the ORM tooling they're using. They had some problems with the post-deploy migrations, so basically what they went with was just pre-deploy migrations: make the change to the database and then deploy the new code to the application. They basically moved to only that, and they said, quote, it removed a lot of complexity and room for error. But then there was another issue that they tend to encounter, which was backward incompatibility. For example, they're trying to remove a column from a table, and their basic process is to remove all usages of the column in the code and then remove the column with a migration. However, this can cause problems because, as they say here, quote, every query to the users table failed until the new code was deployed a few minutes later. So even with that pre-deploy migration, between the time the migration happened and the time the new code was deployed and running, all queries failed. And that's because ORMs, SQLAlchemy among them (and I know this happens with Rails as well), keep a reference to the columns that exist for certain selects and inserts. So if the column is suddenly missing from the table, selects and inserts or other database activity can fail. They say here SQLAlchemy has two configuration options to truly remove usage of a column, a deferred setting and an evaluates-none setting, but of course this requires remembering to do that. And Rails has, I believe, an ignored_columns capability you can add to a model, and you need to do that before you actually remove the column from the database so that the Rails software stops referencing that column. But they actually built in some compatibility checks to make this easier. So they built a custom tool to handle it and make their migrations easier. Overall, I felt this was a good blog post that goes into the issues they're running into and some best practices that you should consider adopting, and things to watch out for in your application as you're changing things fast. So, definitely a blog post I suggest checking out.
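The backward-compatible ordering they land on can be summarized roughly like this (the column name comes from their earlier example; the sequencing notes are a summary of the idea, not their tooling):

```sql
-- Step 1 (deploy first): ship application code that no longer references the column,
--   e.g. via Rails ignored_columns or the SQLAlchemy settings mentioned above, so the
--   ORM's cached column lists stop including it.
-- Step 2 (migrate afterwards): actually drop the column, again failing fast on locks.
SET lock_timeout = '2s';   -- assumed value, same graceful-fail idea as before
ALTER TABLE users DROP COLUMN institution;
```

Done in that order, queries issued after the migration never reference a column that no longer exists.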
[00:10:05] The next post is Performance best practices for using Azure Database for PostgreSQL – Connection Pooling. This is from the azure.microsoft.com blog, and they do have references to some other older posts here. But why I picked this: I thought it was interesting because anything about PgBouncer is, I believe, important to learn more about, because you're going to have to use it at some point as you scale. So they go over PgBouncer a little bit. But what I found interesting was their pgbench test. Because of the amount of work required to create and break down connections in PostgreSQL (it has to fork another process for each one), just by adding PgBouncer to the test scenario they had here, they said, quote, with PgBouncer, the throughput improved four times as shown below, while connection latency was reduced by 40%.
[00:10:55] So in addition to reducing memory usage, by being able to drop the number of connections in PostgreSQL when using PgBouncer, you definitely can see some performance boosts with using it. So PgBouncer, or Pgpool as another option, is definitely a tool to do connection pooling for your database to improve its performance.
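If you do put PgBouncer in front of your database, its admin console is a handy way to watch the pooling in action (the SHOW commands below are standard PgBouncer admin commands; the host, port and admin user are assumptions for illustration):

```sql
-- Connect to PgBouncer's virtual "pgbouncer" admin database first, for example:
--   psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer

SHOW POOLS;   -- client connections active/waiting and server connections per pool
SHOW STATS;   -- per-database totals such as query counts and average query time
```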
[00:11:17] The next post is PostgreSQL Deep Dive: How Your Data Model Affects Storage, and this is from Crunchydata.com. Now, this is a pretty long post, but he goes over three or four scenarios of different ways to store data in your database. For example, if you have a parent table and a detail table, maybe an orders table and an order detail table, he loads them up with data, I believe it was 100,000 orders and 10 million rows in the detail table, and he checks their sizes and even some queries. But then he looks at it from a different perspective: what if you have the parent and detail tables merged together using, say, text arrays? How does the size change, and how does the performance change? He then goes through three or four different iterations to look at the different considerations in terms of size, and a little bit of the performance trade-offs, for each one. And one thing he said here, quote, we have shown here that in certain, but I think reasonably common circumstances, you may be able to reduce 100 terabytes of data down to eight terabytes. So depending on how you decide to store the data, you could get a lot of space savings. So if you have a huge data set and you're potentially thinking about ways to restructure it, definitely check out this blog post.
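As a rough sketch of the kind of comparison he walks through (the table definitions here are hypothetical, not the article's actual schema):

```sql
-- Normalized model: one parent row per order, many detail rows.
CREATE TABLE orders        (order_id bigint PRIMARY KEY, ordered_at timestamptz);
CREATE TABLE order_details (order_id bigint REFERENCES orders, line_no int, item text);

-- Denormalized alternative: detail lines folded into the parent row as an array.
CREATE TABLE orders_merged (order_id bigint PRIMARY KEY, ordered_at timestamptz, items text[]);

-- After loading both models with the same data, compare the on-disk footprint
-- (table plus indexes and TOAST) of each approach.
SELECT pg_size_pretty(pg_total_relation_size('orders'))        AS orders,
       pg_size_pretty(pg_total_relation_size('order_details')) AS order_details,
       pg_size_pretty(pg_total_relation_size('orders_merged')) AS orders_merged;
```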
[00:12:41] The next post is actually from FOSDEM. This is the conference from the first article, where I was talking about Breaking PostgreSQL at Scale. These are the PostgreSQL track and the databases track, and there are relevant talks that I've discussed in last week's episode of Scaling Postgres as well as this episode. So these are all the different talks by the different speakers. What I also did is look through YouTube to see what videos have been posted. I'm not showing them here on the show, but I'll include links to each of the YouTube videos in the show notes, although for each talk you can download the video recording as well. So definitely a lot of new content that you can feel free to check out.
[00:13:31] The next post is What's up with SET TRANSACTION SNAPSHOT? So basically this goes over what this feature is. It's a way, if you're using repeatable read and you have multiple sessions that you want to look at a consistent snapshot of the database, to use SET TRANSACTION SNAPSHOT so that each client or session connecting to the database gets a set, consistent view of it. Now, the scenario he is talking about here is maybe you want to do a pg_dump and run it in parallel with multiple sessions, but you could think of other scenarios where you're trying to read data, like maybe you're doing some process in parallel analyzing the data and you want a consistent snapshot. This is potentially a command and technique you can use to do that. So, definitely an interesting blog post to check out, and this is from thebuild.com.

The last post is An Overview of the Index Changes in PostgreSQL 11. This is from the Severalnines.com blog, and they go over each of the different features as they relate to indexes. They talk about the introduction of parallel B-tree index builds, so you can build your indexes faster. You just need to set your max_parallel_workers and max_parallel_maintenance_workers, and you may also want to increase your maintenance_work_mem to be able to do parallel index builds more quickly. They added predicate locking for hash, GiST and GIN indexes, which, as they say, will make serializable transaction isolation more efficient when using those indexes. They allow entire hash index pages to be scanned, and you can specify a statistics value for a function index. With the new partitioning features, you can now do local partitioned indexes. And of course the big one for me is covering indexes, where you can use an INCLUDE clause to add additional columns onto the index as kind of a payload, so you can do more index-only scans. So if you're wanting to look into the index features of PostgreSQL 11, definitely a blog post to check out.
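As a quick sketch of the covering index feature, plus the parallel build settings mentioned above (the invoices table and its columns are hypothetical, and the setting values are assumptions you might tune for a large build):

```sql
-- PostgreSQL 11 can build B-tree indexes in parallel; these session settings are
-- assumed values to give the build more memory and workers.
SET maintenance_work_mem = '1GB';
SET max_parallel_maintenance_workers = 4;

-- Covering index: carry extra "payload" columns in the index so more queries can be
-- answered with index-only scans, without adding those columns to the search key.
CREATE INDEX invoices_customer_idx
    ON invoices (customer_id)
    INCLUDE (invoiced_at, total_amount);
```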
[00:15:34] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.