PgMiner Botnet, Collation Index Corruption, postgresql.conf, Custom Data Types | Scaling Postgres 145
In this episode of Scaling Postgres, we discuss the PGMiner botnet attack, how collation changes can cause index corruption, managing your postgresql.conf and implementing custom data types.
Content Discussed
- PgMiner botnet attacks weakly secured PostgreSQL databases
- Don’t let collation versions corrupt your PostgreSQL indexes
- Repository, Tuning Guide and API for your postgresql.conf
- Creating Custom Postgres Data Types in Django
- Free 1GB Postgres Database on AWS CloudShell
- PostGIS Raster and Crunchy Bridge
- PostgreSQL exclusive cron jobs using pg_timetable scheduler
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "PgMiner botnet attacks weakly secured PostgreSQL databases". This is from zdnet.com and it's talking about a botnet that has been set up to attack Postgres servers. It says it only impacts Postgres servers running on Linux thus far. It also follows a typical pattern: scanning a particular network range and probing the default port. So, of course, number one, you don't want to have your PostgreSQL database exposed on the internet. Number two, you don't want to use the default port; choose a different port for it. The botnet then tries a brute-force attack to discover the password of the postgres user.
So a third thing is don't allow logins as the postgres superuser from an IP address. Set up a separate account for that purpose and leave the postgres user for, say, local-only connections. Once it's in, the botnet uses COPY FROM PROGRAM to try to escalate its privileges. So something definitely to be aware of. Be sure to secure your Postgres instance as I mentioned here so you'll be protected. Given that it's the holiday season, there may be more attacks going on because people are probably paying less attention to security during this time period. So definitely check out this article or others about PgMiner to learn more about it.
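As a minimal sketch of that hardening (the port number is illustrative, and these rules assume your applications connect as their own roles), you could set a non-default port in postgresql.conf and restrict the postgres superuser to local socket connections in pg_hba.conf:

```
# postgresql.conf: listen on a non-default port (5433 is just an example)
port = 5433

# pg_hba.conf: the postgres superuser may only connect over a local socket;
# any attempt from a network address is rejected outright
local   all   postgres                 peer
host    all   postgres   0.0.0.0/0     reject
host    all   postgres   ::/0          reject
```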
The next piece of content is "Don't let collation versions corrupt your PostgreSQL indexes. This is from citusdata.com and they're talking about an issue where you have a collation set up for Postgres. Usually, the default collation is related to the libc library that you have that defines how text is sorted within the database. They give an example of using different collations here where they have the same words: Aarhus, Banana, and Cat. When you collate them using English for a New Zealand dialect, it sorts it essentially ABC. But AA actually has a different meaning in a Danish dictionary that means this character here. When you do the order by collating by Danish or da_DK, it actually sorts the A word at the bottom. So collations determine how data is sorted and this is particularly important for indexes.
They show an example here where you can list your databases and see what their collations are, and note that the default collations are typically set up by initdb. Now, under the heading "So what's the catch?", they quote the Unicode Technical Standard #10: "Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes;...". You can think of it like time zones: time zones constantly change based on what governments decree, and sort orders based on culture and standards apparently change too. The quote continues: "...and finally, new characters added to the Unicode standard will interleave with the previously-defined ones.
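For reference, each database's default collation is recorded in pg_database, which is the same information psql's \l command displays:

```sql
-- Show the default collation and character classification for each database:
SELECT datname, datcollate, datctype FROM pg_database;
```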
This means that collations must be carefully versioned". Now, how this impacts Postgres is covered in the paragraph I'll quote here: "The most obvious problem this causes for databases like PostgreSQL is that B-tree indexes can become corrupted if they are ordered by a collation that changes. Typical symptoms reported in the wild include rows that cannot be seen by one query but can be seen by another query that is using a different plan. Or sometimes a unique index fails to prevent a duplicate value from being added to a table". I've seen exactly these problems happen after a Postgres upgrade. It may actually have been the libc version that changed: the upgrade not only moved to a new database version but also onto a newer version of Ubuntu, so the libc version likely changed and had this impact.
Now, the resolution to these issues is just to do a reindex. Once you reindex, it resolves these problems. The second thing to keep in mind is that this only applies to text indexes. So if you have an index on integers or timestamp fields, that's not going to be impacted by collations; it's only your text fields that will be affected. That's exactly what happened in the instance I mentioned. Now, I've been mentioning libc because that is the operating-system-provided facility for collations. There's also ICU, the International Components for Unicode, which you can think of as a cross-operating-system way to define collations. You could choose to use it and hopefully avoid some of these problems, but you're still going to have versioning in effect when you move to the next release of ICU.
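The fix itself is straightforward; here's a sketch where the index and database names are hypothetical (REINDEX ... CONCURRENTLY requires Postgres 12 or later):

```sql
-- Rebuild one suspect text index without blocking writes:
REINDEX INDEX CONCURRENTLY my_text_idx;

-- Or rebuild every index in the database you are currently connected to:
REINDEX DATABASE mydb;
```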
But it does have a number of advantages that they list out here. Still, most people stick with the default collations, which typically rely on the libc library of the operating system Postgres runs on. Now, they do note that in Postgres releases 10 through 13, if you're using ICU collations, there's a way to capture their versions and present a warning if things change, but not really for libc, which is the predominant collation provider for Postgres databases. What they've worked on here for release in PostgreSQL 14 is a way to essentially record which collation versions were used to build these indexes. Then if the collation changes, for, say, the operating system, it can report back and say, hey, you probably need to reindex these text indexes, and even specify which ones.
So this shows how it handles the different versioning schemes of each operating system in use. Now, the warning, he says, will look something like this: the index depends on collation "default" version 34, but the current version is 36. So this warns you that you've changed collation versions. The detail says the index may be corrupted due to changes in sort order, and the hint is to reindex to avoid the risk of corruption. So basically, when you change these versions, it looks like the best practice is to reindex your text indexes. This is a patch that should make its way into PostgreSQL 14. Now, they also mention some other places where text is in use that could be impacted, where you could see some of this corruption.
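As a sketch of how this works under the hood in PostgreSQL 14 (the exact catalog details may differ from what the patch finally shipped): the collation version an index was built with is recorded in pg_depend.refobjversion, so you can compare it against the provider's current version to find indexes at risk:

```sql
-- Find indexes whose recorded collation version no longer matches the
-- current version reported by the collation provider:
SELECT d.objid::regclass AS index_name,
       c.collname,
       d.refobjversion                      AS version_at_build,
       pg_collation_actual_version(c.oid)   AS current_version
FROM pg_depend d
JOIN pg_collation c ON c.oid = d.refobjid
WHERE d.classid = 'pg_class'::regclass
  AND d.refclassid = 'pg_collation'::regclass
  AND d.refobjversion IS NOT NULL
  AND d.refobjversion IS DISTINCT FROM pg_collation_actual_version(c.oid);
```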
One, he says, is range partitions keyed on text: if a collation changes, rows might start routing to different partitions. This is particularly scary to me, so you might not want to use a text field for range partitioning for that reason. The other is CHECK constraints that perform string comparisons: rows that once passed may now fail. So that's another thing to keep in mind. They don't have anything to address either of these two issues; they're just making you aware. Then they go on to discuss some other things about indexes. But this was a great post explaining the issue and the hope for at least a warning to be produced in the log if this has happened to your particular database. So I highly encourage you to check out this post.
The next piece of content is "REPOSITORY TUNING GUIDE AND API FOR YOUR POSTGRESQL.CONF". This is from ongres.com. So this is referring to a site called postgresqlco.nf or postgresql.conf. Now of course, this is the main configuration file for Postgres, and up to this point, this was just a source of documentation that highlights specific areas of the postgresql.conf file and gives you recommendations and ranges for each parameter that you can change. They list some of the details with regard to it here. But what they've added today and they're saying in beta, is a way for you to manage your postgresql.conf files via a service. So it provides a repository to store your postgresql.com files along with a tuning guide and an API for working with it. So this is an interesting tool to help with the management of Postgres. I myself use a configuration management tool called Ansible, so that's what I'm going to be sticking with. But if you don't have a similar tool, you may want to investigate this post and see what they're working on. So if you're interested you can check out this post.
The next piece of content is "Creating Custom Postgres Data Types in Django". This is from pganalyze.com. Whereas they are talking about Django, which is a web framework for Python, the first part of the post just addresses custom data types. The first one they talk about is Custom Domains. So essentially this is taking an existing data type and then applying a check constraint to it so that you can easily use it in your table creation. So for example, they created a domain here called string_no_spaces. So essentially it's a custom type, string_no_spaces. It's created as a VARCHAR, it's NOT NULL with a check constraint that no spaces should be included. Now, with all of that built into it, you can then specify when you create a table. For example, I want a column called username and the type is string_no_spaces.
Then they show that when you try to insert a string with spaces in it, you get a constraint violation error. They also show that you can build on this, essentially nesting domains within domains. So here he created another domain called email_with_check that uses the string_no_spaces domain as its base; you essentially get a NOT NULL string_no_spaces. He then adds a check constraint that the value must include an at sign (@). He gives examples of the different errors you can get: if you try to insert a value with a space, you'll get a violation of the string_no_spaces check, and if you enter one without an @ sign, you'll get a violation of the email_with_check constraint.
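Here's a sketch of those nested domains; the names follow the post, but the exact CHECK expressions are my assumptions:

```sql
-- A reusable "no spaces allowed" string type:
CREATE DOMAIN string_no_spaces AS varchar
  NOT NULL
  CHECK (VALUE NOT LIKE '% %');

-- Domains can nest: this builds on string_no_spaces and adds its own check.
CREATE DOMAIN email_with_check AS string_no_spaces
  CHECK (VALUE LIKE '%@%');

CREATE TABLE accounts (
  username string_no_spaces,
  email    email_with_check
);

INSERT INTO accounts VALUES ('has space', 'a@example.com');  -- fails string_no_spaces
INSERT INTO accounts VALUES ('nospace', 'not-an-email');     -- fails email_with_check
```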
The next kind of custom data type they cover is composite types. This joins two or more types together to represent essentially one type. They use the example of RGB values: these can be represented as three integers, but you can create a separate type, which they've called rgb_color_value, that is a composite of three integers. They show examples of its use, including how to pull an individual value out of the composite type. These types make working with your database easier. The rest of the post covers how to use these types within the Django web framework, so if you're interested in that, you can check out this post.
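A minimal sketch of that composite type (the field and table names here are assumptions):

```sql
CREATE TYPE rgb_color_value AS (red integer, green integer, blue integer);

CREATE TABLE swatches (name text, color rgb_color_value);
INSERT INTO swatches VALUES ('orange', ROW(255, 165, 0));

-- Parentheses are required when selecting one field of a composite column:
SELECT name, (color).red FROM swatches;
```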
The next piece of content is "FREE 1GB POSTGRES DATABASE ON AWS CLOUDSHELL". This is from ongres.com. Now, this is quite interesting because CloudShell is just a web interface to give you a shell prompt to be able to interact with AWS services. But apparently, it is its own environment and it has about 1GB of storage. So this post covers how you could actually install a Postgres database within this environment. Now, it has a lot of interesting issues. Like number one, the instance is not permanently running only when you enter the CloudShell from your web browser. You can't connect to it from the outside and there are no replicas and no backups. But I guess I'd call it an interesting hack of running Postgres in this minimal environment. So if you're interested in that, you could check out this post.
The next piece of content is "PostGIS Raster and Crunchy Bridge". So this is from blog.crunchydata.com, and they're talking about using the Raster data type in PostGIS. Basically, it records Raster-type data, which are images, elevation grids, model outputs, and more for working with geographical information systems. They're talking about how typically when working with this type of data, you either store it in the database or you store it outside of the database. So have a reference to some data somewhere, like usually the file system, so that PostGIS can access and work with that data.
Now, initially in-DB storage was common, but as data continues to grow and grow, practice has moved toward storing the files outside the DB. What's interesting here is that this raster support has a way to access tiled GeoTIFFs. So you can actually store tiled images within an object storage system like S3, Google Cloud Storage, or Azure Storage, and then pull individual tiles for analysis. You don't have to store them in the database; you're just storing a reference to them there. They give a demo of how to do this using Crunchy Bridge, but you could use the same technique with any version of PostgreSQL with PostGIS. So if you're interested in checking it out, you can check out this post.
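As a sketch of this out-of-database pattern (the bucket, file, and table names are hypothetical, and it assumes the raster is stored in SRID 4326): raster2pgsql's -R flag registers only a reference to the file, and GDAL's /vsis3/ handler lets PostGIS read the tiles from S3 on demand:

```sql
-- Load references to 256x256 tiles of a GeoTIFF living in object storage:
--   raster2pgsql -R -t 256x256 /vsis3/my-bucket/dem.tif public.dem | psql mydb

-- Then pull just the tile you need, e.g. sampling the value at a point:
SELECT ST_Value(rast, ST_SetSRID(ST_MakePoint(-122.3, 47.6), 4326))
FROM dem
WHERE ST_Intersects(rast, ST_SetSRID(ST_MakePoint(-122.3, 47.6), 4326));
```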
The last piece of content is "POSTGRESQL EXCLUSIVE CRON JOBS USING PG_TIMETABLE SCHEDULER". This is from cybertec-postgresql.com. So if you're interested in using the scheduler called pg_timetable, you can check out this post that explains some new features that have been added to it.