Scaling Connections, TimescaleDB, Time-series Gaps, Red & Golden Signals | Scaling Postgres 138
In this episode of Scaling Postgres, we discuss scaling connections, the release of TimescaleDB 2.0, how to find time-series gaps, and how to monitor RED & Golden Signals.
Content Discussed
- Improving Postgres Connection Scalability: Snapshots
- TimescaleDB 2.0: A multi-node, petabyte-scale, completely free relational database for time-series
- Detecting gaps in time-series data in PostgreSQL
- PostgreSQL, RED, Golden Signals: getting started
- PostgreSQL clustering: vip-manager
- Migrating interactive analytics apps from Redshift to Postgres, ft. Hyperscale (Citus)
- Joins using LIKE or why PostgreSQL FTS is a powerful alternative
- PostgreSQL: Foreign keys and insertion order in SQL
- [YouTube] Postgres TV
- Getting Started With Postgres 13 on Ubuntu 20.04
- Using PostgreSQL and SQL to Randomly Sample Data
- Learn PostgreSQL - a new book
- Stefan Fercot
Transcript
All right, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Improving Postgres Connection Scalability: Snapshots". This is from citusdata.com, and it's talking about scaling Postgres connections. Normally, once you start getting above, say, 200 to 500 connections, you have to start using PgBouncer. But this work looks to scale beyond that without having to rely on a separate pooler like PgBouncer or Pgpool, because direct connections give you a number of benefits, such as being able to maintain session state and use prepared statements, which you typically can't do with transaction pooling in something like PgBouncer. We covered this in a previous episode of Scaling Postgres, but this post actually describes the changes that were made. These improvements have been contributed to Postgres 14, which is due to be released in the third quarter of 2021.
Now, he kicks off by presenting the performance improvements first, but the area he decided to work on was snapshots. Because of MVCC, where the database maintains multiple versions of rows for different users to view depending on the transactions they're running, Postgres has to track snapshots of what each transaction can see. Basically, he went in and optimized that. I'm mostly just focused on the graphs here. He goes from 1 to 10, 100, 1,000, and 10,000 connections to the database. Of course, he tested more points than that, but that's what the graph indicates here. With the enhancement he made, you don't really see a difference until maybe the 90-connection mark; up until that point, they are identical. The blue indicates pre-patch, and the orange or yellow indicates post-patch. This is TPS, transactions per second, for a pgbench workload.
As I said, up to 90 connections there was no real difference. But from that point you start seeing a little disparity until around 500-600 connections, and then the difference starts growing significantly, so much so that the patched version is between two and three times more efficient once you get out to 10,000 or more connections. So this is great with high numbers of connections. You can also see that performance doesn't drop off as steeply as it does pre-patch. Then he looked at the impact of unused connections. Along the X-axis, he's tracking the number of idle connections up to around 10,000. You can see that with 1 active connection and 10,000 inactive connections, pre-patch your TPS throughput essentially gets cut in half, but after this patch it remains consistent up to that 10,000-connection mark.
Then with 48 active connections, the post-patch performance fares even better: the number of inactive connections makes essentially no difference. So it's another huge performance improvement if you want to use a lot of connections, and great news for Postgres 14. The rest of the post covers the changes that were made to the snapshotting process; he mentions all the different commits, the bottlenecks he encountered, and the areas he addressed. I'm not going to cover it all here, but if you're really interested in the details, I highly encourage you to check out this blog post. He even has the raw data of everything that was measured at the bottom. So I definitely suggest checking out this post.
The next piece of content is "TimescaleDB 2.0: A multi-node petabyte-scale, completely free relational database for time-series". This is from blog.timescale.com. Now normally I don't cover product announcements, but this was a pretty significant announcement I wanted to draw your attention to. So with this release, the fact that they're releasing it for free and they have cloud versions available, it's a very interesting usage model and just all the different features that allow, as they state here, being able to scale out so you can scale out their distributed hyper tables to multiple nodes and the fact it's a time series, it's a very interesting product. At this point, I feel. Now this post goes into a lot of detail about what TimescaleDB is. It's basically an extension of Postgres.
They debunk certain myths, so it does have a lot of marketing information, but overall what the product offers I find very interesting. For example, going from a single data node to two, four, or eight data nodes, you get almost linear scalability. They also get huge performance benefits from what is essentially column-oriented storage combined with compression; they say up to 94% compression for certain data they work with. So it's a very interesting blog post. If you're looking for a scale-out solution for Postgres, to my knowledge there is Citus, but now also TimescaleDB. So if you're interested in time-series databases or scale-out solutions, apart from what you can do with the community version of Postgres, maybe you want to check this out.
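To make the scale-out idea a bit more concrete, here is a rough sketch of what setting up a distributed hypertable looks like in TimescaleDB 2.0, based on their documentation. The node names, table, and columns are made up, and exact function signatures may differ between versions.

```sql
-- assumes the TimescaleDB 2.0 extension is installed on every node
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- register data nodes with the access node (hypothetical host names)
SELECT add_data_node('dn1', host => 'data-node-1');
SELECT add_data_node('dn2', host => 'data-node-2');

-- an ordinary table holding time-series readings
CREATE TABLE conditions (
    time        timestamptz NOT NULL,
    device_id   text        NOT NULL,
    temperature double precision
);

-- turn it into a distributed hypertable partitioned by time and device
SELECT create_distributed_hypertable('conditions', 'time', 'device_id');
```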
The next piece of content is "Detecting gaps in time-series data in PostgreSQL". This is from endpoint.com. They're talking about having a data collection scenario, but they had some gaps in that data collection and they wanted to detect where those gaps were. So this is basically how they did it with SQL. The main function that they leaned on was a generate_series. So basically, to find missing rows that were date-based, they used a generate_series command to do one-day intervals to produce data as you see here. Then they did a left join and checked for NULLS that exist in the table of interest. They show this query right here, and basically, where that's NULL, they pull up this day and the particular batch that is missing.
So then they can go look for the missing data. But they also had a timestamp field they needed to check, where they expected a new row about every ten minutes. For that purpose they did a generate_series again, but at a ten-minute interval. They did the same kind of join, but this time as a lateral join, so for each generated row they ran a specific query against the table of interest; that's what LATERAL gives you. Using that, they were able to find the gaps. So this is a pretty interesting blog post on a technique for finding gaps in time-series data. If you're interested, you can check out this post.
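And here is a sketch of the ten-minute version using a lateral join, again with hypothetical names: for each expected ten-minute slot, look for at least one reading inside that slot, and report the slots where none exists.

```sql
-- hypothetical table "readings" with a recorded_at timestamp;
-- report ten-minute slots on a given day with no reading at all
SELECT s.slot AS missing_slot
FROM generate_series('2020-11-01 00:00'::timestamptz,
                     '2020-11-01 23:50'::timestamptz,
                     interval '10 minutes') AS s(slot)
LEFT JOIN LATERAL (
    SELECT 1 AS found
    FROM readings r
    WHERE r.recorded_at >= s.slot
      AND r.recorded_at <  s.slot + interval '10 minutes'
    LIMIT 1
) hit ON true
WHERE hit.found IS NULL
ORDER BY missing_slot;
```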
The next piece of content is "PostgreSQL, RED, Golden Signals: getting started". This is from dataegret.com. So they're talking about a monitoring solution using what's known as RED or gold signals. They have links to each of these methodologies, RED being an acronym for rate errors and duration. So for example, when doing monitoring, you want to monitor the rate. So say the number of requests per second your services are serving. You want to track the errors. The number of failed requests per second and you want to track the duration, the time spent on servicing different requests. So they looked at this through PostgreSQL and how you can get this information from the system views. So the first is requests and what they advocate using is the pg_stat_statements view, and just doing a sum of all the calls from pg_stat_statements.
Now, keep in mind that pg_stat_statements is a cumulative record of all the queries being run, so you need to take the difference between the previous reading and the current one to get a sense of how many calls happened within a particular time period. Next, they look at errors. There's not a great way to measure errors, but essentially they chose rollbacks, looking at pg_stat_database for transaction rollbacks. I can imagine scenarios where rollbacks aren't actually errors, for example when you roll back on purpose while processing something, but this is what's available within the system views.
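A sketch of that errors signal; like the calls counter, it is cumulative and gets diffed between samples:

```sql
-- cumulative rolled-back transactions across all databases
SELECT sum(xact_rollback) AS total_rollbacks
FROM pg_stat_database;
```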
Then they looked at duration, which is again pg_stat_statements, taking a sum of the total time it took to process queries. With the Golden Signals they also cover saturation, which is basically when a request has to be queued, and you can imagine that meaning something is stuck behind a lock. So you could just look at how many locks are occurring, but what he likes looking at here is pg_stat_activity, tracking how many queries are idle, idle in transaction, active, or waiting. So if you're looking for a different way to monitor your Postgres installation using the RED or Golden Signals, maybe check out this post.
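To round out the picture, minimal sketches of the duration and saturation queries. Note that the total-time column in pg_stat_statements is total_exec_time on Postgres 13 and later and total_time on older versions; the sketch assumes Postgres 13.

```sql
-- duration: cumulative milliseconds spent executing statements;
-- diff between monitoring samples to get time spent per interval
SELECT sum(total_exec_time) AS total_exec_ms
FROM pg_stat_statements;

-- saturation: how many backends are active, idle, idle in transaction,
-- and how many of them are currently waiting on something
SELECT state,
       wait_event_type IS NOT NULL AS waiting,
       count(*)
FROM pg_stat_activity
GROUP BY 1, 2;
```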
The next piece of content is "POSTGRESQL CLUSTERING: VIP-MANAGER". This is from cybertec-postgresql.com. VIP Manager is a virtual IP manager and they're talking about using it with, say, the Patroni stack. In that stack, you have an application that speaks to whatever device is going to be routing the connection to the primary database. Now, in this case, this could be the VIP Manager and it's pointing to the virtual IP that's been assigned and then it connects to whichever is the primary node that's responsible. The VIP Manager, at least when working with Patroni, is that it consults the Consensus Store, which could be say, etcd to determine which the primary is, and it routes that virtual IP address to the correct physical address of the current primary. So that's pretty much all the virtual IP manager does. Then it goes into how you would do the configuration. It basically needs to know where this Consensus Store is to be able to look at it, read it, find what the entry for the primary database is, and what virtual IP to use to be able to do the redirection. So it's a very brief post about how you can set up the virtual IP manager specifically when using Patroni.
The next piece of content is "Migrating interactive analytics apps from Redshift to Postgres, ft. Hyperscale (Citus)". This is from citusdata.com. Now, a lot of this post includes some kind of marketing information and highlights Citus as an alternative solution to Redshift, which of course it is. But the certain thing that I looked at that I found interesting was that with Redshift, because it's column-based, you can't really add indexes as easily with Postgres Standard Row store, and by adding indexes for this specific use case they were able to get a 2x performance game versus Redshift because you could add these indexes.
Next, because Citus is based on a much more recent version of Postgres than Redshift, they were able to use JSONB to store semi-structured data. Previously, the customer had to store it in large text fields in Redshift, whereas with JSONB they could store the data in a much more compressed format and use GIN indexes to query it efficiently. They mention that the JSONB data type in Postgres gave 6-7x compression compared to how they had to store the data in Redshift. So this was a pretty interesting post on some of the advantages of using Postgres. They first tried doing it on a single node; that didn't quite suit them, so they went to a two-node setup, which of course you can do when you're using Citus. So if you're interested in that, you can check out this post.
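On the JSONB point, as a rough illustration (not the customer's actual schema): storing semi-structured payloads as JSONB and indexing them with GIN looks something like this.

```sql
-- hypothetical events table with a semi-structured payload column
CREATE TABLE events (
    id       bigserial   PRIMARY KEY,
    happened timestamptz NOT NULL,
    payload  jsonb       NOT NULL
);

-- GIN index so containment queries on the payload can use an index
CREATE INDEX events_payload_idx ON events USING gin (payload);

-- example containment query: events whose payload mentions a given customer
SELECT id, happened
FROM events
WHERE payload @> '{"customer_id": 42}';
```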
The next piece of content is "Joins using LIKE or why PostgreSQL FTS is a powerful alternative". This is from postgrespro.co.il. By FTS, they mean full-text search. So they had a customer they were working with and they were having slow performance when using the LIKE operator with Joins. Now, they tried at first to use the B-tree gin index, which enables you to use indexings with B-tree and gin combined, but they still didn't get great performance. As a matter of fact, they were getting a query returned in 536 seconds, so pretty slow. However, they found that they were able to use full-text search instead of using the LIKE queries. So with a full-text search, they added a GENERATED AS column using tsvector to record what they needed to do the full-text search on. Added a gin index, and then once they rewrote the data to be able to query it efficiently, they actually got down to 1.5 seconds to run the query. So here's the case where using a simple LIKE or ILIKE to be able to search ended up being not as efficient as using a full-text search with a gin index. So if you're interested in learning more, you can check out this post.
The next piece of content is "POSTGRESQL: FOREIGN KEYS AND INSERTION ORDER IN SQL". This is from cybertec-postgresql.com. Now they're talking about the case where you have foreign keys set up. So in this example, they have a currency, a location, product product description, and a product stock when you're going to be inserting values. When you have foreign keys set up, between each of them there are essentially dependencies. So for example, a currency must exist before you insert a new product because it references that. If you have a large hierarchy of dependencies set up, it can be difficult to determine what you need to insert first if you're new to a particular database.
Well, here they're presenting a recursive query that reads through the system tables and determines what needs to be inserted first. Running it against their sample schema, it shows the level-one data that needs to be inserted first: you first need currencies and locations, then you can insert your products, and then your product descriptions and stock. So it's a pretty interesting query for finding this dependency information for your foreign keys. If you're interested in learning more about that, you can check it out.
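The article has its own version of the query; here is a simplified sketch of the idea using pg_constraint and a recursive CTE. It assumes the foreign key graph has no cycles (a self-referencing table would make it recurse forever) and only starts from tables in the public schema.

```sql
WITH RECURSIVE fk_levels AS (
    -- level 1: tables with no outgoing foreign keys can be loaded first
    SELECT c.oid, c.relname, 1 AS level
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
      AND n.nspname = 'public'
      AND NOT EXISTS (SELECT 1
                      FROM pg_constraint fk
                      WHERE fk.conrelid = c.oid
                        AND fk.contype = 'f')
    UNION ALL
    -- a referencing table comes one level after the table it points at
    SELECT c.oid, c.relname, l.level + 1
    FROM pg_constraint fk
    JOIN pg_class c  ON c.oid = fk.conrelid
    JOIN fk_levels l ON l.oid = fk.confrelid
    WHERE fk.contype = 'f'
)
SELECT relname, max(level) AS insert_level
FROM fk_levels
GROUP BY relname
ORDER BY insert_level, relname;
```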
The next piece of content is actually a YouTube channel that I found in my feed called Postgres TV. They have some older content posted, but more recently, for the last four weeks or so, they've been publishing episodes roughly weekly. So if you're interested in more Postgres content, you can check out the live shows posted to Postgres TV.
The next piece of content is "Getting Started With Postgres 13 on Ubuntu 20.04". This is from pgdash.io. So this goes through the entire process of setting up Postgres, configuring it for external access, and how to do some basic backups. So it goes over the installation process, getting the Postgres repository is set up to be able to install the most recent version, which is 13. Talks about setting up the database cluster. How would you go about starting/stopping it, where you would find the log files, and how you do the configuration for accessing it across the network. Creating users, setting up a database, as well as doing some basic backup and restore commands. So if you're interested in learning more about that, you can check this post out.
The next piece of content is "Using PostgreSQL and SQL to Randomly Sample Data". This is from blog.crunchydata.com, and it's the next post in the series talking about doing scientific analysis with data from the California Wildfires. In this post, they're talking about needing to take a sample of non-fire data and they wanted it to be the same amount of information as the fire data to be able to start to develop their models because they wanted to do a logistic regression for it. This goes through the process of taking a sampling of the data using the table sample keyword to get the data together so that they can set up a set of data for both training their model and also verification of it. So if you're interested in that, you can check out this post.
The next piece of content is a post about a new book that's been released called "Learn PostgreSQL". This is from fluca1978.github.io. So if you're interested in more learning material such as a book, they have the outline here and a link to the book.
The last piece of content, the PostgreSQL person of the week is Stefan Fercot. So if you're interested in learning more about Stefan and his contributions to PostgreSQL, definitely check out this blog post.