Timeseries Queries, Arrays, Ranges, C Triggers | Scaling Postgres 77
Join Over 1,000 Engineers & Get New Episodes Weekly!
In this episode of Scaling Postgres, we discuss timeseries queries, using arrays with queries, range use cases and developing triggers in C.
Content Discussed
- PostgreSQL: Trivial timeseries examples
- Timeseries: EXCLUDE TIES, CURRENT ROW and GROUP
- How to Get the First or Last Value in a Group Using Group By in SQL
- A Faster, Lightweight Trigger Function in C for PostgreSQL
- Range Types in PostgreSQL and GiST Indexes
- Waiting for PostGIS 3: ST_TileEnvelope(z,x,y)
YouTube Video
Podcast Audio
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "POSTGRESQL: TRIVIAL TIMESERIES EXAMPLES". This is from cybertec-postgresql.com and they're talking about doing time series queries in PostgreSQL. Now the first thing they talk about is loading time series data the easy way and they're just showing an example where you can actually use COPY from a program and just pull data from a website and load it into PostgreSQL. The first very simple example they mentioned here is lag. So this is a window function. You can pull what the last value is for a given column and you're choosing the column that preceded it and ordered by the year. So for example, 1966 looks at the previous column by year and places it as the function specified here. So it gives you the previous year's set of data.
And then with that, you can actually use a function where you just take the primary column production minus this lag function and it basically gives you the difference from the previous year to the current year and you could make a percentage if you want to, et cetera. So it's just basically a very simple way to deal with time series data and you can even do computations such as correlations because there is a core function. So you can take two different values and see how they correlate. So how does production correlate to consumption for example, and this is by country, so you can see how well they correlate production versus consumption. So a very brief post, but a couple of very simple window functions. What's interesting about this week is there's a lot of content for developers.
The next post also by cybertec-postgresql.com is "TIMESERIES: EXCLUDE TIES, CURRENT ROW AND GROUP". So the first thing they did here is they prepared a set of data using generate_series and they have an ordinality column that's basically an incrementing integer and then a day column that increases by day and then defined by week in integer. Now the first example they looked at is let's look at sliding windows which move over time series in SQL. Now they use an array_agg function in order to show you what makes up the window. So you do a window function over ORDER BY the day and the rows between one preceding and the one following. So as a good example, looking at the second row here, it's going to show the current row, the one preceding, and the one following.
So again, the array_agg column just shows you where that window is and you can see as it moves to three, it has three in the preceding and the following row, et cetera. You can include more if you want, but he just did one preceding, and one following. This is not useful in and of itself. But now you do calculations, you can do averages, sums mins/maxes over that window series. The next area talks about excluding the current row. So it does the same window range, but it excludes the current row. So here on the second row, you can see it includes the previous and the next row, but not the current one, et cetera and it continues. And you'll notice at the beginning and the end, it only has one value. Similarly, up here, it only has two values because there is no previous row to row one and there is no row after row 14.
So the window will only have two values in it. Then there's excluding ties. It says here, quote "EXCLUDE TIES excludes any peers of the current row from the frame, but not the current row itself". The next area is going into excluding entire groups from a windowing function. So we'll see here, that this is a window that covers essentially five rows, two preceding, two following, and the current one, and excludes whatever this week value is in the group. So of course, since there are only ones in the window for the first two rows, nothing shows because that group is excluded. However, it does show the twos because they start showing up once the window reaches the third column here. So essentially this will be included. And then once you get to this row, these two twos will be included, but that will be excluded.
Then when it switches to a two, it will start showing these two 1s that are still in the window, and the last area they cover is DISTINCT. Now, what they say here is that in PostgreSQL, the current version, there's no way to use DISTINCT as part of a windowing function and PostgreSQL will error out. He shows the error here: DISTINCT is not implemented for window functions. He says what you have to do is filter the duplicates on a higher level so you can use a subselect, enroll the array, remove the duplicates, and assemble the array again. So he gives an example of this implementation if you want to do something like a window function using DISTINCT. So these are a lot of great queries for getting a better understanding of window functions for working with time series data. So if you do that, I definitely encourage you to check these two pieces of content out.
The next piece of content is "How to Get the First or Last Value in a Group Using Group By in SQL". This is from hakibenita.com. So again, this is another piece of content that I feel is a great example for developers. Now, they have a table here where they have data stored in essentially a JSONB field and it looks like a time log of when certain events happen, so certain amounts of money are deposited, withdrawn, and an available credit is set at a periodic time. Now, they had to do a calculation to get what the total balance was for the account, as well as what the last credit row was. Now, I find this layout a bit unorthodox, but this was the data that he had. And to get the balance, you just pull out from that JSONB field, the delta balances, and do a sum and by account, you can get what the current balance is. But the problem was finding this last credit entry. Now, he said in Oracle that you can use a LAST function to be able to do it in the manner described here.
But when he tried to do this in PostgreSQL, he couldn't really do it with a GROUP BY. So you could remove the GROUP BY and do it with a window function this way to get it, but it's still not exactly what you need. And he says, quote PostgreSQL doesn't have the built-in function to obtain the first or last value in a group using GROUP BY. Now he did some examples using SQL, but then he came up with this, what he's calling an array trick. So using this function he developed an array, found the max value of the array, filtered it only where the type is credit set, and then grabbed the second element of the array, which is the credit. That got the answer that he was looking for. And it was a very efficient query plan. So this was a pretty interesting trick and perhaps you could use it in some of your SQL queries or similar to the previous post where they were manipulating arrays to be able to do distinct counts. Perhaps you could use rays in different ways to do queries that you need to do.
The next post is "Range Types in PostgreSQL and GiST indexes". This is from alibabacloud.com and they're talking about someone who was using MySQL to be able to search for whether a particular value exists in a range and they decided, this individual decided to give it a try using PostgreSQL and using their different range types. So the post goes into describing the different range types that can be set up. So here are all the different types of range types and different functions that can be used against them. It goes over a lot of different examples of building certain range types and eventually discusses using a GiST index.
In this case, he's using a btree_gist to be able to get the query performance that he was looking for. In the process, he created an EXCLUDE constraint, inserted a series of test data, performed his query, and then looked at the execution plan, which looks like a pretty simple plan. I assume he's happy with the performance. It didn't give too much of a comment with reference to it or how the performance compared to the MySQL version. But if you have a use case that could use range types, this would be a good post potentially to look at and review how you might be able to use them in your implementation.
The next post is "A Faster, Lightweight Trigger Function in C for PostgreSQL". This is from percona.com and they had done some previous examples of using C for triggers to give more performance. This gives an example of doing a simple audit with a C trigger and seeing what kind of performance that could give. From what they state here, a lot of people use triggers for auditing purposes like to insert into a table when other parts of a table have changed, or another example is just updating as they say, an INSERT timestamp or an UPDATE timestamp using triggers. So they're wondering how fast they could make it. So basically they developed a C function that is shown here. Now, one thing that they did mention is that they used attnum to specify which column is going to be updated.
So not the actual name of the column, but this was more performant. Choosing a particular column of the table is more performant than using this SPI_fnumber function to get the column's name. So that's one caveat they mentioned here. Then once you have this C function developed, go ahead and make it into an extension that you install in your database and then create the function that the trigger will use using the C language. Create your trigger before INSERT or UPDATE for each row, and execute the function you defined. Then for benchmarking purposes, they created a PL/pgSQL function. So this is what you would normally do in this case. So they wanted to see what the performance difference was and over multiple runs, you could see that without the trigger.
Here's the baseline: adding the C trigger only resulted in about a 12% performance hit, whereas the Pl/pgSQL trigger resulted in an 80% performance hit. So as you're looking to potentially scale your database, maybe using more C functions as extensions could help increase the performance of your database. Now again, the caveat mentioned here is they did hard code the attribute number or essentially the column number. When they dropped down to using the actual name and used this SPI_fnumber function, the performance wasn't as good. It was about between the performance of these just so, just something to keep in mind. But if you're looking to eke out every bit of performance, perhaps this is a path you may want to explore.
The last post is "Waiting for PostGIS 3: ST_TileEvelope(z,x,y)". This is from crunchydata.com and again with the upcoming PostGIS 3, they are mentioning all the different features and this is yet another one. So if you are interested in PostGIS 3, definitely a blog post to check out.