50TB Upgrade, Vacuum, DB Security | Scaling Postgres 4
In this episode of Scaling Postgres, we review articles covering developers loving PostgreSQL, a 50TB database upgrade, information on vacuums, parallel PostGIS patches, database security, DB normalization and primary keys.
Content Discussed
- 2018 StackOverflow Insights Survey
- Migrating our production deployment from MariaDB Galera to PostgreSQL and Patroni
- Three reasons why VACUUM won’t remove dead rows from a table
- What PostgreSQL Full-Text-Search has to do with VACUUM
- Patching Plain PostgreSQL for Parallel PostGIS Plans
- PostgreSQL 9.0 is here in OSS Greenplum
- A Deep Dive into Database Attacks [Part III]: Why Scarlett Johansson’s Picture Got My PostgreSQL Database to Start Mining Monero
- Database Normalization and Primary Keys
- Fun with SQL: generate_series in Postgres
- PostgreSQL Streaming Replication
Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about developers loving Postgres, a 50 terabyte database upgrade, vacuum FYIs, and protecting your database from Scarlett Johansson. I'm Creston Jameson, and this is Scaling Postgres, episode 4.
[00:00:25] Alright, our first article comes from insights.stackoverflow.com for their 2018 survey. And with this, for the most loved database, PostgreSQL comes in a close second to Redis. So PostgreSQL is definitely the most loved relational database. So not too bad. Interestingly, looking at the most dreaded databases, they are IBM's DB2 and Oracle, so some non-open-source alternatives. But it's definitely interesting that PostgreSQL is getting this much love from developers. So if looking at surveys like this is of interest to you, go ahead and check out the 2018 survey on Stack Overflow.
[00:01:15] Related to that, the next article is Migrating our production deployment from MariaDB Galera to PostgreSQL and Patroni. And this is from the behind.pretix.eu blog. So this is a very good article in terms of the length and the depth they go into about their reasons for switching from MariaDB to PostgreSQL. And while they don't specifically address really big scaling concerns, they go into some of the advantages of their existing deployment on MariaDB and then some of the disadvantages, as well as the rationale for why they made the switch and some of the advantages they found from a scaling perspective. What's very interesting is this one quote here: "While it's been basically impossible to search all orders in the system for a name in less than 30 seconds, it now takes 800 milliseconds. We call that a success." So in their estimation, the switch to PostgreSQL has definitely helped performance. They also go in depth into the migration and the steps they went through to do it. So, definitely a really great blog post to read to get a behind-the-scenes look at how someone did this. It's a really great post and I encourage you to check it out.

The next post is called Updating a 50 terabyte PostgreSQL database, and this is on Medium, as part of Adyen Tech. They are a payment processor, and they talk about how they have about 5,000 PostgreSQL transactions per second coming in across multiple clusters. And apparently they're doing this on their own servers that they run. So it's not hosted on a database platform; they have their own servers handling this traffic in this 50 terabyte database. It goes into their approach when they did an upgrade at the ten terabyte stage and discusses some of the downsides of the approach they used for that. And then they go into the new approach they used for upgrading to PostgreSQL 9.6.
[00:03:40] So once you get up to those higher database sizes, this would definitely be an interesting blog post to take a look at when you're considering doing your upgrades. So I definitely encourage you to check it out.
[00:03:53] The next blog post is from the Cybertec-postgresql.com blog, and it's Three reasons why VACUUM won't remove dead rows from a table. So it starts with a basic introduction about why vacuum is useful, and then the problem of how some dead rows can be left behind. And they highlight three primary reasons. One is long-running transactions. I've mentioned this in previous episodes, but if you're primarily using PostgreSQL as a transactional database, you want to keep your transactions as short as possible. Because if you do have long-running transactions, that can cause problems with concurrency or locking, as well as causing problems with vacuuming. So you definitely want to try to keep your transactions as short as possible for an OLTP database. The other reason they mention is abandoned replication slots. So if you're using replication and you're using PostgreSQL's built-in replication slots, you want to be sure to get rid of those after a replica is no longer in use.
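To make the first point concrete, here is a sketch of how you might spot long-running transactions from `pg_stat_activity` (the five-minute threshold is an arbitrary example, and this assumes a stock PostgreSQL install):

```sql
-- Find sessions whose transactions have been open for more than five minutes.
-- xact_start is NULL for sessions with no open transaction.
SELECT pid,
       usename,
       state,
       now() - xact_start AS xact_age,
       query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_age DESC;
```

Sessions like these hold back vacuum's visibility horizon, so dead rows created since they began cannot be reclaimed until they commit or roll back.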
[00:05:04] Because what happens is those replication slots cause the master server to retain that write-ahead log. And if you have a replica that goes down for an extended period of time or loses its connection, you're basically going to have unbounded write-ahead log growth. So it's definitely something to keep in mind if you're using replication slots: clean them up when they're no longer in use.
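As a sketch, abandoned slots can be found and removed like this (the slot name here is hypothetical):

```sql
-- List replication slots with no connected consumer; an inactive slot
-- pins write-ahead log on the master and lets it grow without bound.
SELECT slot_name, slot_type, active, restart_lsn
FROM pg_replication_slots
WHERE NOT active;

-- Drop a slot once you are sure its replica is permanently gone.
SELECT pg_drop_replication_slot('old_replica_slot');
```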
[00:05:29] And lastly, he talks about orphaned prepared transactions. These are transactions that were prepared with PREPARE TRANSACTION but never committed or rolled back. Now, what's interesting is that when you look in the PostgreSQL documentation, they indicate PREPARE TRANSACTION is not intended for use in applications or interactive sessions. Its purpose is to allow an external transaction manager to perform atomic global transactions across multiple databases or transactional resources. So basically, this should be a rarely used feature of PostgreSQL, particularly if you're just developing applications. So hopefully you won't have any of these in your application.
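If you suspect you do have some, a quick sketch for finding and resolving them (the transaction identifier shown is hypothetical):

```sql
-- List prepared transactions; each one holds back vacuum until resolved.
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts
ORDER BY prepared;

-- Resolve one by hand using its gid:
ROLLBACK PREPARED 'failed_distributed_txn_42';
```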
[00:06:16] The next blog post is What PostgreSQL Full-Text-Search has to do with VACUUM. Again, this is also from the Cybertec-postgresql.com blog. And what's interesting about this post is it talks about the importance of vacuum with GIN indexes. So GIN indexes are a special type of index that tend to be used when you want to do a full text search of a particular field. And what's interesting, as they explain how GIN indexes work, is that when a row is added, it's actually added to a GIN pending list, which is basically a to-do list that's processed by vacuum. So basically, if vacuum does not get an opportunity to go over this index, it starts behaving less efficiently. And he goes into an actual practical example showing you the ramifications of not having that particular index vacuumed, and the performance ramifications. So if you're using full text search in PostgreSQL with GIN indexes, I would definitely take a look at this post to make sure that your vacuum is operating optimally.

The next article is Patching Plain PostgreSQL for Parallel PostGIS Plans. And this is on the Carto.com blog.
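Circling back to the GIN pending-list behavior described above, here is a sketch of how you could observe it yourself (the `docs` table and index name are hypothetical; `pgstatginindex` comes with the pgstattuple extension, and `gin_clean_pending_list` requires 9.6 or later):

```sql
-- A GIN index on a full-text expression; with fastupdate (the default),
-- new entries first land in the pending list, which vacuum later merges
-- into the main index structure.
CREATE INDEX docs_fts_idx ON docs USING gin (to_tsvector('english', body));

-- Inspect the pending list with the pgstattuple extension:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstatginindex('docs_fts_idx');

-- On 9.6+, the pending list can also be flushed without a full vacuum:
SELECT gin_clean_pending_list('docs_fts_idx');
```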
[00:07:44] And in it they explain how they had a need for greater parallel processing for their PostGIS 2.4 queries.
[00:07:54] And they talk about some patches that they discovered they could implement that actually sped up their queries to give them these benefits. So if you're using the PostGIS features of PostgreSQL, I definitely suggest taking a look at this blog post to see if these patches might be something that you want to implement.
[00:08:13] The next article is from Greenplum.org. Now, Greenplum is actually based upon PostgreSQL, and it is a massively parallel processing analytical database. And they've been doing recent work to catch up to the current version of PostgreSQL. In this post they've announced that they've just completed merging PostgreSQL version 9.0 into the Greenplum DB master. So Greenplum is an open source database for analytical processing. So if this is something of interest, you may want to keep track of the work that they're doing here.

Okay, so this next blog post is a security-related one. It's from the Imperva.com blog, and it's A Deep Dive into Database Attacks [Part III]: Why Scarlett Johansson's Picture Got My PostgreSQL Database to Start Mining Monero. And that is a cryptocurrency.
[00:09:14] So basically this goes through an attack on a PostgreSQL server and how it was able to essentially take control of the database and determine what GPU you have, if any. And part of what they used for the malware payload was an image of Scarlett Johansson. So from a security perspective, this is a very interesting blog post. But one thing that really caught my eye is near the end, under the section "How can an attacker find PostgreSQL databases?", where they say they were basically able to find about 710,000 PostgreSQL databases open on the Internet that are potentially vulnerable. So definitely, for security reasons, do not open up PostgreSQL to the Internet. Make sure it's behind a firewall or restricted to an internal network, so this type of attack is simply impossible. And they go into some items to limit this attack. So if you're responsible for administering your PostgreSQL installation, it's definitely a good idea to look over this post.
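As a minimal sketch of locking this down, assuming a hypothetical internal address and subnet, the relevant settings live in postgresql.conf and pg_hba.conf:

```
# postgresql.conf -- listen only on internal interfaces, never 0.0.0.0:
listen_addresses = 'localhost, 10.0.0.5'   # 10.0.0.5 is a hypothetical internal IP

# pg_hba.conf -- require passwords and restrict to the internal subnet:
# TYPE  DATABASE  USER  ADDRESS        METHOD
host    all       all   10.0.0.0/24    md5
```

A network firewall blocking port 5432 from the outside remains the first line of defense; these settings are defense in depth.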
[00:10:27] The next article is Database Normalization and Primary Keys, and this is from the Tapoueh.org blog. This follows on with more developer education about best practices for data modeling. He goes over the various normal forms and has a discussion on primary keys, and presents his perspective on a proper data model to adhere to first normal form. Basically, if you have a surrogate primary key, be sure to also use a unique constraint or index to assure that you definitely have unique rows from a business perspective. So if you're a developer, I think this would definitely be an article of interest for you.

The next article is from the Citusdata.com blog, and it's called Fun with SQL: generate_series in Postgres. Now this is a very short post, but it talks about generate_series, which lets you generate a series of data. They even go into how, when you're doing reporting, if you have an absence of data for a particular time period, you can use generate_series to help fill in data gaps. So, just a quick little blog post to take a look at.
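The gap-filling trick can be sketched like this (the `orders` table, its columns, and the date range are hypothetical): generate one row per day, then left-join the real data so empty days still appear with a count of zero.

```sql
-- Daily order counts for a week, with gap days included as zero.
SELECT d::date       AS day,
       count(o.id)   AS orders
FROM generate_series('2018-04-01'::date,
                     '2018-04-07'::date,
                     interval '1 day') AS d
LEFT JOIN orders o ON o.created_at::date = d::date
GROUP BY day
ORDER BY day;
```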
[00:11:48] And the last article is actually an in-depth tutorial I put together on PostgreSQL streaming replication. So if you're looking to start doing that, I welcome you to go ahead and check it out.
[00:12:02] That does it for episode 4 of Scaling Postgres. You can get the links to all the content presented in the show notes. Be sure to head over to Scalingpostgres.com to receive weekly notifications of these episodes. You can also subscribe via YouTube or iTunes. Thanks.