Hash Indexes, Int Float Numeric Types, Postgres 14 Features, Data Science | Scaling Postgres 148
In this episode of Scaling Postgres, we discuss hash indexes, the int, float and numeric data types, features coming in Postgres 14 and data science in Postgres.
Content Discussed
- Re-Introducing Hash Indexes in PostgreSQL
- PostgreSQL: int4 vs. float4 vs. numeric
- Waiting for PostgreSQL 14 – Report progress of COPY commands
- Waiting for PostgreSQL 14 – Add idle_session_timeout.
- R Predictive Analytics in Data Science Work using PostgreSQL
- The mapping of oid and relfilenode in PG
- How PostgreSQL Executes Sequential Scans with the Help of Table Access Methods APIs
- Sysbench: in-memory Postgres, Postgres is boring
- How to run some tasks without user intervention, at specific times?
- Gunnar 'Nick' Bluth
Transcript
All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Re-Introducing Hash Indexes in PostgreSQL". This is from hakibenita.com, and they're talking about hash indexes. Now, it says re-introducing because Postgres has had hash indexes for a long time, but as of version 10 they became genuinely usable, since that's when they became crash-safe. The first thing this post describes is what a hash index is. You take a hashing function and apply it to a value, and that function places the value into one of a discrete number of buckets. Their example hash function is a simple modulo: take the remainder after dividing a value by three. So with values of, say, 1-10, the value 1 goes into bucket 1, but so does 4, because 4 divided by 3 has a remainder of 1.
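As a rough illustration, here is what that modulo bucketing looks like in SQL, along with Postgres's built-in hashtext() function, which is similar in spirit to what a hash index does internally (the index's actual hash support functions are not the same call):

```sql
-- Toy bucketing: values 1 through 10 assigned to one of three buckets
-- by taking the remainder after dividing by three.
SELECT n AS value, n % 3 AS bucket
FROM generate_series(1, 10) AS n;

-- Postgres exposes hash functions such as hashtext(), which maps any
-- text value to a 32-bit integer (about 4 billion possible hash codes).
SELECT hashtext('https://example.com/some/long/url');
```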
Also, the modulus determines how many buckets you have; with a modulus of three, there are only three buckets. The hash function within Postgres maps values to a 32-bit integer, so there are about 4 billion possible hash codes, and those are essentially the buckets. Now, with only three buckets you get a lot of collisions, meaning different values land in the same bucket. Presumably that happens far less often with 32-bit hash codes, but collisions still occur, and the post discusses how they're handled. Postgres uses what it calls hash index overflow pages, and it still has to check the actual value: it's not sufficient to just go to the bucket, you need to verify the exact value. The post also mentions bucket splitting, where a bucket needs to be split into two buckets. Then they get into actually testing these out.
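If you want to peek at these structures yourself, the pageinspect extension ships functions for examining hash index pages; a minimal sketch, assuming some existing hash index, here called urls_key_hash (a hypothetical name):

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Page 0 of a hash index is its metapage; among its fields are counts
-- of buckets and overflow pages.
SELECT *
FROM hash_metapage_info(get_raw_page('urls_key_hash', 0));
```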
They test this with a URL shortener schema: an integer ID, a key that is essentially the short URL, and the full URL. They place B-tree indexes on both the key and the URL, as well as hash indexes, and the first thing they look at is index size at different points while inserting data. The first thing you'll notice is that the hash index is usually about half the size of the B-tree index, so it has space efficiency on its side. Because the index stores the hash code rather than the value itself, if the values you're indexing are large, a hash index probably makes more sense than a B-tree index. Another thing you can notice is that the hash index doesn't grow linearly the way the B-tree index does; it grows in steps, and that's the bucket splitting at work.
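A minimal sketch of that kind of setup; the table and index names here are illustrative, not the exact ones from the post:

```sql
CREATE TABLE shortened_url (
    id  serial PRIMARY KEY,
    key text NOT NULL,   -- the short URL token
    url text NOT NULL    -- the full destination URL
);

-- Same column indexed both ways, for comparison.
CREATE INDEX shortened_url_key_btree ON shortened_url USING btree (key);
CREATE INDEX shortened_url_key_hash  ON shortened_url USING hash  (key);

-- Compare on-disk sizes of the two indexes.
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE relname = 'shortened_url';
```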
They also mention that the hash index size is not affected by the size of the indexed value. That's what I mentioned before: when a column holds long values, a hash index can be more efficient. And lastly, it's not affected by the selectivity of the indexed value. One thing they note here is that the size differences between the B-tree indexes were actually due to the deduplication that version 13 introduced. If you look at version 12, you'll see an identical B-tree index size across the row values tested, and the hash index is still about half the size of the B-tree index. They also did some reindexing to see if it would make a difference; it does narrow the size gap between the two indexes, but the hash index is still smaller. Then they talk a little bit about the hash index fillfactor, which you can adjust, although they say it defaults to 75 and between values of 50 and 100 there doesn't seem to be much variance.
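For reference, fillfactor is set as a storage parameter at index creation time; a quick sketch, again with illustrative names:

```sql
-- Hash indexes accept a fillfactor storage parameter (default 75).
CREATE INDEX shortened_url_key_hash_ff
    ON shortened_url USING hash (key)
    WITH (fillfactor = 75);

-- REINDEX rebuilds the index, which can shrink it after heavy churn.
REINDEX INDEX shortened_url_key_hash_ff;
```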
I would probably just stick with the default. They looked at insert performance to see how fast inserts happen, and while the two were similar, the hash index had a slight edge, roughly 9.7 versus 10.8 in their timings, so maybe a 10% difference. Then they looked at SELECT performance, selecting out different rows, and here the gap seemed a little larger, although they did call it a very slight difference. Looking at the timings, it's about a 30-35% difference: the hash index performed these SELECTs in 0.59 versus 0.92 for the B-tree index. So the hash index looks a little smaller and a little more performant for SELECT queries, but it does, of course, have limitations.
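If you want to see which index a lookup actually uses, EXPLAIN ANALYZE on an equality query is the simplest check; a sketch against the illustrative table above:

```sql
-- Equality lookups are the one access pattern hash indexes support;
-- the plan shows which index the planner chose and how long the scan took.
EXPLAIN ANALYZE
SELECT url
FROM shortened_url
WHERE key = 'abc123';
```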
You can't create a unique index as a hash index, you can't index across multiple columns the way you can with a B-tree index, and you can't create sorted indexes. So it really works best for looking up individual values: you can't use it for range queries and you can't use it in ORDER BY clauses, again because of that sorting limitation. Their conclusions are that the hash index is usually smaller than a corresponding B-tree index, that its SELECT and INSERT performance can be better than a B-tree index's, that PostgreSQL 10 removed many of its restrictions so it's now safe to use, and that it still has many restrictions that limit it to very specific use cases. So if you're interested in learning more about hash indexes, and potentially using them in your projects, check out this blog post.
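Before moving on, those limitations are easy to see for yourself; a sketch of what fails, again using the illustrative table above:

```sql
-- Fails: hash indexes don't enforce uniqueness.
-- ERROR:  access method "hash" does not support unique indexes
CREATE UNIQUE INDEX ON shortened_url USING hash (key);

-- Fails: hash indexes are single-column only.
-- ERROR:  access method "hash" does not support multicolumn indexes
CREATE INDEX ON shortened_url USING hash (key, url);
```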
The next piece of content is "PostgreSQL: int4 vs. float4 vs. numeric". This is from cybertec-postgresql.com, and it looks at these different numeric data types. They create a table with each of these types and then do some comparisons. A lot of this post is about working with money. You would never want to use a float for money, because floats are approximate: they're computed with the CPU's floating-point unit and can't represent many decimal values exactly, so they won't give you an accurate count. You'll always want to use integers or numerics for any kind of money you're working with.
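You can see the float approximation problem directly; a quick sketch:

```sql
-- Floating point is approximate: the classic 0.1 + 0.2 example.
SELECT 0.1::float8 + 0.2::float8  AS float_result,    -- 0.30000000000000004
       0.1::numeric + 0.2::numeric AS numeric_result; -- 0.3

-- With equality checks, this bites you:
SELECT 0.1::float8 + 0.2::float8 = 0.3::float8;  -- false
```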
Now, one consequence of using numerics instead of floats is a performance impact. Averaging the integer column in their example took about 255 milliseconds; averaging the float, which presumably uses the CPU's floating-point unit, took 267 milliseconds; whereas averaging the numeric took about 368 milliseconds, so noticeably more than the other two. They also give you a little chart showing the differences. So if you're calculating any type of currency, you would generally want to stick with an int, or use a numeric if you need to maintain decimals. If you want to learn more about these data types and their performance, you can check out this post.
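A minimal sketch of that kind of comparison; the table and row count are illustrative, and you'd time each query with psql's \timing:

```sql
-- One column per type, same values, to compare aggregate speed.
CREATE TABLE t_numbers AS
SELECT n::int4    AS i,
       n::float4  AS f,
       n::numeric AS num
FROM generate_series(1, 5000000) AS n;

SELECT avg(i)   FROM t_numbers;  -- int4: fastest in the post's test
SELECT avg(f)   FROM t_numbers;  -- float4: close behind
SELECT avg(num) FROM t_numbers;  -- numeric: noticeably slower
```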
The next piece of content is "Waiting for PostgreSQL 14 - Report progress of COPY commands". This is from depesz.com, and they're talking about a new view for monitoring COPY progress called pg_stat_progress_copy. If you start a COPY command, you can watch its progress: the view reports the number of bytes and the number of lines processed so far. This looks to be a great addition to Postgres 14.
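A sketch of watching a long COPY from a second session; note that in the released PostgreSQL 14 the row counter is named tuples_processed (it was lines_processed in the development build this post covers), and the file path here is illustrative:

```sql
-- Session 1: a long-running load.
COPY shortened_url FROM '/tmp/urls.csv' WITH (FORMAT csv);

-- Session 2: watch its progress (re-run, or use psql's \watch).
SELECT pid, relid::regclass, bytes_processed, tuples_processed
FROM pg_stat_progress_copy;
```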
The next post, also from the same site, is "Waiting for PostgreSQL 14 - Add idle_session_timeout". So this adds an idle session timeout. There already exists idle_in_transaction_session_timeout: if a session sits idle inside an open transaction, Postgres will terminate it once it hits that timeout. This new setting covers plain idle connections: if a connection isn't actively being used, you can use this timeout to kill those sessions. It's a great way to release resources back to the database, and it looks like it will be added in Postgres 14.
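A sketch of setting both parameters; each takes a duration, and 0 (the default) disables it. The specific values here are illustrative:

```sql
-- Terminate sessions idle inside an open transaction after 5 minutes.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';

-- New in PG14: terminate sessions idle outside a transaction after 10 minutes.
ALTER SYSTEM SET idle_session_timeout = '10min';

-- Apply without a restart.
SELECT pg_reload_conf();
```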
The next piece of content is "R Predictive Analytics in Data Science Work using PostgreSQL". This is from blog.crunchydata.com. It looks like this is the final post in the data science series Crunchy Data has been doing, and in this one they use the model they've built to make predictions, using R to predict the probability of fire. So if you want to see data science analysis running entirely within Postgres, definitely check out this post and the previous posts in the series.
The next piece of content is "The mapping of oid and relfilenode in PG". This is from highgo.ca, and they're talking about how the OID of an object in the pg_class catalog relates to its relfilenode, which identifies the underlying file on disk, and how that mapping can change. So if you're interested in learning how Postgres works under the covers, as it were, you can check out this post.
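You can inspect the mapping yourself; a quick sketch against the illustrative table from earlier:

```sql
-- oid identifies the table in the catalogs; relfilenode names its data file.
SELECT oid, relfilenode
FROM pg_class
WHERE relname = 'shortened_url';

-- The file's actual path under the data directory.
SELECT pg_relation_filepath('shortened_url');

-- Operations like TRUNCATE or REINDEX assign a new relfilenode,
-- so the two values can diverge over a table's lifetime.
```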
Another post, also from highgo.ca, is "How PostgreSQL Executes Sequential Scans with the Help of Table Access Method APIs". Postgres 12 introduced the table access method API so that others can implement their own storage, and this post analyzes the functions it uses to perform sequential scans. So if you're interested in learning more about that, you can check out this post.
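For orientation, the access methods registered in a cluster are visible in the pg_am catalog; a quick sketch:

```sql
-- amtype is 't' for table access methods, 'i' for index access methods.
-- A stock cluster has exactly one table access method: heap.
SELECT amname, amtype
FROM pg_am
ORDER BY amtype, amname;
```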
The next piece of content is "Sysbench: in-memory Postgres, Postgres is boring". This is from smalldatum.blogspot.com. It's an analysis of Postgres 11 through 13 that basically reports no regressions across the workloads tested, which I believe is why he calls it boring. But if you want more detailed performance analysis of some lower-level workloads, you can check out the results of his tests here.
The next piece of content is "How to run some tasks without user intervention, at specific times?". This is from depesz.com, and the first option he mentions is just using cron. With cron, you can start a psql session and feed it whatever you want: query the database, create a table, run a backup, and so on. Cron is the main tool he reaches for, and he talks about how he uses it with the Postgres scripts he wants to run. Now, some people don't have access to the underlying operating system to use cron and are looking for a scheduling tool.
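A minimal sketch of the cron approach; the database name, paths, and schedule here are illustrative, not from the post:

```
# Illustrative crontab entry: run a nightly maintenance script at 03:00.
# -X skips .psqlrc so the job runs with a clean psql configuration.
0 3 * * * psql -X -d mydb -f /home/postgres/nightly_maintenance.sql >> /var/log/pg_nightly.log 2>&1
```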
He also mentions the newer scheduling tool pg_timetable, discusses using it, and covers some use cases for it. Another one I'm also familiar with is pg_cron, which I believe is available as a built-in extension on, say, Amazon RDS instances, so it would be readily accessible there for scheduling tasks. If you want to learn more about scheduling tasks in Postgres, definitely check out this blog post.
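For comparison, pg_cron schedules jobs entirely through SQL; a minimal sketch, assuming the extension is installed and loaded via shared_preload_libraries:

```sql
CREATE EXTENSION pg_cron;

-- Run VACUUM every night at 03:00 (standard cron syntax).
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM');

-- List scheduled jobs, then remove one by its job id.
SELECT jobid, jobname, schedule, command FROM cron.job;
SELECT cron.unschedule(jobid) FROM cron.job WHERE jobname = 'nightly-vacuum';
```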
The last piece of content: the PostgreSQL person of the week is Gunnar 'Nick' Bluth. So if you want to learn more about Gunnar and his work, go ahead and check out this blog post.