Microsoft SQL Server Compatible, Time Series Performance, Feature Casualties, BLOB Cleanup | Scaling Postgres 143
In this episode of Scaling Postgres, we discuss Microsoft SQL server compatibility, the performance of time series DBs, DB feature casualties and how to clean up BLOBs.
Content Discussed
- AWS announces Babelfish: open source Postgres with SQL Server compatibility
- Want more PostgreSQL? You just might like Babelfish
- TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper
- Feature Casualties of Large Databases
- BLOB cleanup in PostgreSQL
- Infographic: 5 Things to Include on Your PostgreSQL Security Checklist
- Running PgBouncer on AWS Elastic Container Service
- How pgBackRest is Addressing Slow PostgreSQL WAL Archiving Using Asynchronous Feature
- Replacing Lines of Code with 2 Little Regexs in Postgres
- Deploy Pgpool-II on Kubernetes to Achieve Query Load Balancing and Monitoring
- Evolution of tree data structures for indexing: more exciting than it sounds
- PostGIS Trajectory: Space plus Time
- 2020 PG Asia Conference Ended Successfully at an Unprecedented Scale!
- Stéphane Schildknecht
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "AWS ANNOUNCES BABELFISH: OPEN SOURCE POSTGRES WITH SQL SERVER COMPATIBILITY". This is a post from ongres.com, and they're talking about an announcement that happened at AWS re:Invent, where they announced Babelfish for Aurora PostgreSQL. Basically, what it does is provide wire protocol compatibility with Microsoft SQL Server. So it interprets the different SQL features that have been added to Microsoft SQL Server as well as T-SQL, which is what its stored procedures are written in.
They have a little diagram they posted here that's from the Babelfish GitHub.io page, where an application with a PostgreSQL client driver speaks SQL and PL/pgSQL directly to PostgreSQL. But you can also do the same thing with a Microsoft SQL Server client driver that speaks T-SQL over TDS, the wire protocol for Microsoft SQL Server, to Babelfish, which then talks to Postgres. So if this works as advertised, you could essentially use PostgreSQL as a drop-in replacement for Microsoft SQL Server without potentially having to change anything in your application.
Then over time, you could transition to Postgres. Now again, I believe this was announced for Aurora PostgreSQL, which is PostgreSQL compatible, but it says it's going to be released as open source under the Apache 2.0 license on GitHub in 2021, and that it will be compatible with community Postgres. So this is a great benefit for users who want to start using Postgres but are currently on Microsoft SQL Server.
Now, this was also published on the AWS blog as "Want more PostgreSQL? You just might like Babelfish". It goes into much the same detail about why they're doing it and choosing to open-source it. So if you're interested in this, definitely check out these two blog posts.
The next piece of content is "TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper". So that is quite a disparity between TimescaleDB and Amazon Timestream. Now, this is from blog.timescale.com, which of course produced TimescaleDB, so you have to take their results with a grain of salt and examine them for yourselves. But if it's anything close to what they're purporting, for example, look at these 6,000x higher inserts. If you're considering a time series database, you would definitely want to check out these types of performance differences yourself before you choose a particular product to use. Then they show the different query speeds that they were getting out of it. So not as dramatic as the inserts, of course, except for this one high CPU query.
I believe that was where they broke the queries out across multiple devices. They then follow up with the cost differential between the two, which again is really significant. If you're looking for a time series database, I would closely examine this and run your own trials of each to see which performs better, what it costs, et cetera. It is a very long blog post covering the differences, the different offerings, and attempting to rationalize why some of these differences exist. So if you're interested in time series databases, and particularly looking for a performant one, you may want to check out this post.
The next piece of content is "Feature Casualties of Large Databases". This is from brandur.org. So this is a post that is not necessarily PostgreSQL specific, but it's talking about how databases get created for applications, they're worked on for a while, and then inevitably they start getting larger and larger and then having issues. What this is talking about is what areas tend to get ignored usually early on that then kind of bite you later. The number one that they're talking about is transactions. So a lot of times people don't consider transactions, especially when they're smaller, because there's a lot less activity and you don't need a lot of locking to avoid race conditions or have inconsistent data. Whereas once you get larger, those transactions become more important. Talking about referential integrity, so identifying your foreign keys to make sure that you don't have orphaned records in particular tables.
Again, when you're smaller that's not as important, but it becomes very important once you get larger. Particularly if you have a lot of individuals interacting with the database, you want that data integrity enforced within the database rather than just in the application. They quote "Nullable, as far as the eye can see". Basically, every column that gets created is usually nullable; as he says, DDL will by default create any column of a new table as supporting NULLs. A lot of times you don't want that, because it makes your application code a bit more complex: you have to handle NULL cases and then maybe blank cases where you have a text column. Next is suboptimal indexing. There's an interesting trend where you don't need that many indexes when you're small.
But once you start getting larger, indexing becomes more important. Depending on the database you use, there may be a hesitancy to add more indexes because of the potential locking that can occur and the space those indexes take. I haven't found this to be as much of a problem with Postgres, because you can create indexes concurrently and generally they get created without issue. If there is a problem, the index is just marked invalid and you can drop it concurrently. But with other database systems, which this blog post also covers, you may have more issues to deal with.
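As a quick aside, here is a minimal sketch illustrating the integrity and indexing points above. The table and column names are made up for the example and are not from the post:

```sql
-- Hypothetical schema: enforce integrity in the database rather than only in the application.
CREATE TABLE customers (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL                       -- avoid "nullable as far as the eye can see"
);

CREATE TABLE orders (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL REFERENCES customers (id),  -- no orphaned orders
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- Build the index without taking a long write-blocking lock on the table.
CREATE INDEX CONCURRENTLY orders_customer_id_idx ON orders (customer_id);

-- If the concurrent build fails, the index is left INVALID and can be dropped the same way:
-- DROP INDEX CONCURRENTLY orders_customer_id_idx;
```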
The next area is dangerous queries and restricted APIs. This individual has indicated that at some of the organizations they've worked for, the DBAs or the people in charge of the data purposely restricted interactions with the database to very simple ones: only single-record INSERTs or single-record UPDATEs, and never deleting many records at once. Now, there's some benefit to that, but you're going to hamstring the developers and cause more work for them if you don't allow them to update or delete more than one record at a time. There are other controls you can put in place with Postgres, say limiting statement timeouts or things of that nature, to prevent one query from bogging down the system (see the sketch below). He then follows up the post with some ideas for scalability and how you can make working with large databases easier. So I'll let you examine this and check out this post.
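On the statement timeout point, a minimal sketch of how that control can be applied in Postgres (the role name app_user is hypothetical):

```sql
-- Cap how long any single statement from the application role may run.
ALTER ROLE app_user SET statement_timeout = '30s';

-- Or set it for the current session only, e.g. for an ad-hoc reporting connection.
SET statement_timeout = '5min';
```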
The next piece of content is "BLOB CLEANUP IN POSTGRESQL". This is from cybertec-postgresql.com and they're talking about BLOBs, which are essentially binary large objects. Now, first, he talks about generally dealing with binary data as long as it's not too large. You probably want to just handle binary data within a field of a database table because that's much easier to work with. That is the byte array or the bytea data type. So here you can store an image as a byte array and usually, by default, it's stored as hex. So you can go ahead and store that data in Postgres and then be able to retrieve it. But you have to keep in mind the maximum size of that is 1GB per field. So if you have files larger than that, you actually want to store them in the database and not just a reference to the file. So for example, you could store a file name that you could reference in your application code to know where that file exists. But if you're literally wanting to store it in the database and it exceeds this 1GB per field, then you want to look into the lead BLOB interfaces. So this is an example of importing a large file.
You run lo_import, which means large object import, with the name of the file you want to import, and what it returns is the object ID. Now, if you just run that, the OID is displayed and then essentially gone: it's not stored in any table, it's not saved anywhere, it just returns that OID and you need to remember it somehow. What he says most people do is create a table, say t_file here, that has an incremental ID, a name for the file, and the object ID. So when you INSERT into that table, for the OID column you actually call lo_import with the file name, so that it stores the resulting object ID. Now you have references to all the BLOBs that exist within the database. That's okay, but if you do something like this and delete that record, you've lost what that object ID was.
You have no way to know it anymore. The large object still exists, so you can query the pg_largeobject catalog to find its object ID, but you have no other reference to it within the database and it's essentially just stuck there. He says there are a couple of ways you can handle it. You can do an lo_unlink on it, and that removes it from the database. He also mentions that there is a utility called vacuumlo that can clean up dead large objects, that is, large objects with no reference in any table. So you can use this to clean up dead objects that exist in your database. They also mention that, in addition to lo_import and lo_unlink, there is a variety of other functions you can use to work with large objects (a small sketch of the pattern follows below). So if you're interested in working with binary large objects in Postgres, definitely check out this post.
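Before moving on, a minimal sketch of the t_file pattern he describes (the file name and path are placeholders):

```sql
-- Track each large object's OID so it can be cleaned up later.
CREATE TABLE t_file (
    id   serial PRIMARY KEY,
    name text NOT NULL,
    oid  oid  NOT NULL
);

-- Import the file and store the returned object ID in the same statement.
INSERT INTO t_file (name, oid)
VALUES ('invoice.pdf', lo_import('/tmp/invoice.pdf'));

-- Before deleting the row, unlink the large object so it doesn't become orphaned.
SELECT lo_unlink(oid) FROM t_file WHERE name = 'invoice.pdf';
DELETE FROM t_file WHERE name = 'invoice.pdf';

-- Any orphans that slip through can be removed from the shell with:  vacuumlo mydb
```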
Next piece of content is "Infographic. 5 Things to Include on Your PostgreSQL Security Checklist". This is from enterprisedb.com, and this was a great list that breaks out things to check for security across securing access to it, say physical access, Firewall encryption access, authentication access, how to secure authentication, and the different areas you need to check within Postgres. Securing roles within the database, defining access controls to different data in the database as well as securing encryption. So if you want to check out this security checklist for Postgres, check out this blog post.
The next piece of content is "Running PgBouncer on AWS Elastic Container Service". This is from revenuecat.com, and basically, they're talking about setting up PgBouncer, and there's not a lot of documentation on PgBouncer. So I always like showcasing those posts that talk about setting up and using it. They don't get a lot into the specifics. They did run this on a container stack. So they were using a docker container that had PgBouncer included. They also, for analysis, were using Prometheus. So they were using the PgBouncer exporter that exports data to Prometheus for monitoring of that system. It goes through the process of setting it up and making changes to their file limits, which frequently you have to do.
It goes into some specifics about working with AWS, in terms of whether to set up a network load balancer or use service discovery. Then they talk about monitoring with Prometheus, and finally they cover the different PgBouncer settings. The most important one is selecting a pooling mode: do you want session or transaction pooling? There's also defining your connection limits: max_client_conn is how many clients can connect, and the pool size defines how many server connections you want to maintain. There's a default pool size that you set, as well as a minimum size it shouldn't drop below, so it always has some connections available (a small config sketch follows below). So if you're interested in setting up PgBouncer and some of the specific considerations for AWS and its Elastic Container Service, definitely check out this blog post.
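For reference, a minimal pgbouncer.ini sketch showing those settings; the database alias, host, and numbers are placeholders, not the values from the post:

```ini
[databases]
; the alias, host, and port here are examples only
appdb = host=db.example.com port=5432 dbname=appdb

[pgbouncer]
; session or transaction pooling; transaction pooling packs server connections most tightly
pool_mode = transaction
; maximum number of client connections PgBouncer will accept
max_client_conn = 1000
; server connections maintained per database/user pair
default_pool_size = 20
; keep at least this many server connections open
min_pool_size = 5
```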
The next piece of content is "How pgBackRest is Addressing Slow PostgreSQL WAL Archiving Using Asynchronous Feature". This is from percona.com. They're talking about pgBackRest. They're referencing a previous article that talked about slow WAL archiving and that you want to be very cautious of the archive command you use in Postgres. If that is slow for any reason, you could get WAL archiving being backed up and essentially your data volume will continue to grow because it can't archive WAL fast enough. Now, if it's a simple COPY command, that'll rarely happen unless you have problems with your disk systems and it's a slow disk causing it. But if you make it more complicated than just a simple COPY command, for example, if you copy it to an external server.
Or, in one of the worst-case scenarios they mention here, you're copying to S3: it just takes a long time to transfer those files, and you can get a backlog. Using this query while running a pgbench workload, they saw a backlog of 278 WAL segments waiting to be archived. One way to potentially speed that up is compression; with compression they got it down to 166, but the backlog still existed. Another feature that pgBackRest offers is asynchronous WAL archiving. That means it doesn't wait until the file has completed its transfer to S3; it sends an immediate acknowledgment to Postgres saying okay, it's done, and then spawns multiple processes and hands the file transfers off to them.
So there is a risk of something breaking down in that process, since it acknowledges to Postgres while the transfer may still be falling behind. You do have to make sure your configuration is correct and that you're not seeing these kinds of delays, but it does let you respond to Postgres faster and then archive those files to the ultimate destination asynchronously. They cover the configuration you want to do, including the full pgBackRest configuration they used, and they show a little of how it works, where one process does the immediate acknowledgment to Postgres while multiple spawned processes actually do the transfer to S3 (a rough sketch of the moving parts follows below). So if you're interested in setting up pgBackRest with asynchronous WAL archiving, definitely check out this blog post.
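For orientation, here is a hedged sketch of the pieces involved. This is not the exact query or configuration from the article; the stanza name and numbers are placeholders:

```sql
-- Roughly how many WAL segments are waiting to be archived right now.
SELECT count(*) AS wal_segments_waiting
FROM pg_ls_dir('pg_wal/archive_status') AS d(name)
WHERE name LIKE '%.ready';

-- Point archive_command at pgBackRest (stanza name "main" is hypothetical).
ALTER SYSTEM SET archive_command = 'pgbackrest --stanza=main archive-push %p';
SELECT pg_reload_conf();

-- Asynchronous archiving is then enabled on the pgBackRest side, in pgbackrest.conf:
--   archive-async=y
--   process-max=4                      (parallel archive-push processes)
--   spool-path=/var/spool/pgbackrest
```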
The next piece of content is "Replacing Lines of Code with 2 Little Regexs in Postgres". This is from blog.crunchydata.com and they're talking about how they have a set of data and essentially they need to do some data cleanup. So the format that they want the data to be in is cam and then three numerical digits. But these were human-entered, so they're going to have some errors. Some people, if it was supposed to be 008, they just put 08. Or instead of being cam059, they just put cam59. This stands for cameras. So it's the number of cameras or they also used uppercase. So basically you can use a regex to correct this data rather than writing some code as they said here in Python or some other language to fix it.
Here's the exact query he used to correct the data. He uses a regexp_replace for each column he wants to update, essentially looking for the last two digits and rewriting them into the expected format, and he explains exactly how the regular expression works. He also handles the case of capital letters. So yes, you can do this with a regex, and there should also be a way to correct the case in the same UPDATE statement, maybe using a nested call, though I would probably just use the lower function, for example lowering the input before doing the regex replacement (a hedged sketch of that idea follows below). But this is a very brief post about how to use regular expressions to correct data you have received.
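As an illustration only, here is one way to do this kind of cleanup in a single UPDATE. This is not the author's exact pair of regexes, and the table and column names are made up; it combines lower, regexp_replace, and lpad instead of two nested regex replacements:

```sql
-- Normalize values like '08', 'CAM59', 'cam59' into the 'cam' + three-digit form, e.g. 'cam059'.
UPDATE readings
SET camera = 'cam' || lpad(regexp_replace(lower(camera), '\D', '', 'g'), 3, '0')
WHERE camera !~ '^cam\d{3}$';   -- leave already-correct values untouched
```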
The next piece of content is "Deploy Pgpool-II on Kubernetes to Achieve Query Load Balancing and Monitoring". This is from b-peng.blogspot.com. So the focus of this article is on Pgpool II and setting it up to do query and load balance monitoring. For monitoring, they are using Prometheus in the Pgpool exporter to do that. So they're talking about setting up the process for working with Kubernetes and getting Pgpool to work with it. So if you're interested, you can check out this blog post.
The next piece of content is "Evolution of Tree Data Structures for Indexing: more exciting than it sounds". This is from erthalion.info. So this is a very in-depth examination of B-tree indexes, the different types and different considerations, how they work, and some more, so modern research into them. So if you're interested in that type of content, you can check out this blog post.
The next piece of content is "POSTGIS TRAJECTORY: SPACE PLUS TIME". This is from rustprooflabs.com. They're talking about acquiring data and then not only doing a geographical examination but also across the dimension of time. So you can see how this time has changed. On the bottom, there are tracking routes of different vehicles here and here. They even did a visualization of it. So if you're interested in this type of content, definitely check out this blog post.
Next piece of content is "2020 Pg Asia Conference Ended Successfully at an Unprecedented Scale!". This is from highgo.ca. So if you want to learn more about the PostgresConf in China and PGConf Asia, definitely check out this blog post.
The last piece of content, the PostgreSQL person of the week is Stéphane Schildknecht. So if you're interested in learning more about Stéphane's contributions to Postgres, definitely check out this blog post.