Postgres Goodies, Materialized vs Rollup, Column Oriented | Scaling Postgres 37
In this episode of Scaling Postgres, we review articles covering Postgres goodies, materialized vs. rollup, column oriented data and parallel scans.
Content Discussed
- Webinar: PostgreSQL is NOT your traditional SQL database [Follow Up]
- Slides: PostgreSQL is NOT your traditional SQL database
- Materialized views vs. Rollup tables in Postgres
- A Poor Man's Column Oriented Database in PostgreSQL
- Using parallel sequential scan in PostgreSQL
- It's just an expression
- Peeking At Query Performance of the Upcoming Version 11 of PostgreSQL
- The Biggest Mistake Postgres Ever Made
- Introducing PostgreSQL Anonymizer
- TimescaleDB 1.0 is Production Ready
Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about Postgres goodies, materialized versus rollup, column oriented and parallel scan. I'm Kristen Jameson, and this is Scaling Postgres episode 37.
[00:00:22] Alright, our first piece of content is the webinar "PostgreSQL is NOT your traditional SQL database." This is from the 2ndQuadrant.com blog. So this is a webinar that they put on, and you can get access to the recording here; you just need to do a registration, but then you get immediate access. Now, this was a really great presentation, and the slides are actually right here as well; I'll include the link. The presentation is by Gülçin Yıldırım Jelínek, and my apologies for any mispronunciation. So this is the agenda for the presentation. It went over the design choices of PostgreSQL, and it's basically talking about the object-oriented features and some special features that have been added to Postgres that make it a little bit different. And this was a good review for me, because it brought some things to my attention that perhaps I should be using more, features like enums, and also some particular configurations for GIN indexes that could give you better performance for some query types and make the index smaller. So basically she goes over arrays and how they can be used. Enums are typically used for status fields: when you have a changing status, an enum is a more efficient way to store the data. She talks about the JSON and JSONB types: JSON essentially just stores a replica of the JSON you've received, whereas JSONB is good for actually storing and ordering the JSON, and you can query it very fast using a GIN index. The presentation also goes into full text search, the different ways you can use it and rank it, how to do similar searches as opposed to exact searches, as well as accent and language support. So again, this is a really great presentation, and even I learned a few things. I highly encourage you to check this out, access the slides and watch the webinar recording.
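As a rough illustration of two of those features, here is a minimal sketch, with hypothetical table and column names, of an enum used for a status field and a JSONB column indexed with GIN, using the smaller jsonb_path_ops operator class:

```sql
-- Hypothetical example: an enum status field plus a JSONB column with a GIN index.
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id      bigserial PRIMARY KEY,
    status  order_status NOT NULL DEFAULT 'pending',
    details jsonb
);

-- jsonb_path_ops builds a smaller GIN index that still supports the @> operator:
CREATE INDEX orders_details_idx ON orders USING gin (details jsonb_path_ops);

-- A containment query that can use the index:
SELECT id FROM orders WHERE details @> '{"country": "US"}';
```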
[00:02:24] The next article is "Materialized views vs. Rollup tables in Postgres," and this is from the Citusdata.com blog. Basically this is for when you want to do some type of near-real-time analytics, and it looks at different ways you can maintain rolled-up aggregates. First they go over what a view is. A view is essentially a virtual table: it doesn't really exist as a table, but it's a virtual structure that lets you pull data from one or more tables, although it has to access all of those underlying tables every time you query it. When you materialize a view, it goes from being a virtual table to a physical one that you can reference. There's a command to refresh it, REFRESH MATERIALIZED VIEW, which you can see right here, and you can refresh it concurrently so it can be done without locking the table. And they say this use case for materialized views is good if you're going to roll up some totals in the evening when the system is not busy, but it may not be great for something that needs to be more real time.
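To make the mechanics concrete, here is a minimal sketch, with hypothetical names rather than the exact tables from the Citus post, of creating and concurrently refreshing a materialized view; note that CONCURRENTLY requires a unique index on the view:

```sql
-- Hypothetical example: a materialized view of daily page view counts.
CREATE MATERIALIZED VIEW daily_page_views AS
SELECT view_time::date AS day, page, count(*) AS views
FROM page_views
GROUP BY 1, 2;

-- REFRESH ... CONCURRENTLY needs a unique index on the materialized view:
CREATE UNIQUE INDEX ON daily_page_views (day, page);

-- Rebuilds the view without blocking readers:
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_page_views;
```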
[00:03:40] And here they talk about incremental rollups for when you need to be more scalable. This is essentially a summary table: they roll up analytics counts by inserting into the rollup table from a select on their page views table (their analytics counts table), grouping by day and page. Now, because they put a unique constraint on this rollup table, you're not going to be able to insert the same day twice; instead they use ON CONFLICT DO UPDATE. In the process they use some functions, and they reference another post that we've actually covered on a previous episode of Scaling Postgres, which I highly suggest you check out. Basically, with the functions they present there, you can insert a day or just update the counts, and presumably you can set this up on a schedule. Maybe you want to run it every hour or every 15 minutes, and it will update those totals for you. So it will be more real time: you're updating just the values that have changed, versus refreshing an entire materialized view. If you're looking to refresh what are essentially summary tables, here are two techniques you can potentially use. So, definitely a blog post to check out.
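Here is a minimal sketch of the upsert shape of that rollup pattern, again with hypothetical names; the Citus post's incremental versions add helper functions so only recent data gets re-aggregated:

```sql
-- Hypothetical example: a rollup table keyed by (day, page).
CREATE TABLE page_view_rollup (
    day   date   NOT NULL,
    page  text   NOT NULL,
    views bigint NOT NULL,
    UNIQUE (day, page)
);

-- Re-aggregate and upsert; the unique constraint drives ON CONFLICT.
INSERT INTO page_view_rollup (day, page, views)
SELECT view_time::date, page, count(*)
FROM page_views
GROUP BY 1, 2
ON CONFLICT (day, page)
DO UPDATE SET views = EXCLUDED.views;
```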
[00:05:04] The next post is "A Poor Man's Column Oriented Database in PostgreSQL," and this is from brianlikespostgres.com. Now, I'm going to quote the first three sentences of the blog post: "I mean, let's get this out of the way. If you need a real column oriented database, use one. Don't even think about using the insanity below in production." So this is really just an interesting thought experiment, but I thought it was worthwhile because it helps you understand the rationale for why OLAP, or online analytical processing, environments use column-oriented databases. He says two properties of a column-oriented database make it beneficial in OLAP environments. First, queries often involve only a subset of a table's columns, so if you're aggregating some value in a column, it only needs to read that column off of disk, which saves I/O. Second, storing each column separately means the data can be compressed better, since it's all of the same type. What he does here is create a one-column table with just a unique identifier, and then for each column he creates another table that contains the unique identifier plus that one column's data. So essentially there's one table dedicated to each column's information, and to use it you have to join multiple tables together to pull data back out. He goes over his whole process of setting this up, including the different functions and trigger functions he used. Then he has the results for a particular query where he's summing one value from a traditional table, a table with all the columns in it, not column oriented, and it completes in 55 seconds. In his column-oriented version, the same operation, summing that one column from his column store, finishes in 17 seconds. So it actually worked. And he has a quote in here: "PostgreSQL is smart enough to query only the underlying tables it needs," and it completes in less than a third of the time of the traditional version. So it has an advantage when it's only looking at one column and aggregating all of the values there. But what happens when you want to retrieve just one row with all its columns, like you would typically do in an online transaction processing application? Selecting one row from the traditional table took 0.16 milliseconds, far less than a millisecond, whereas selecting one row from his poor man's column store view took 69 milliseconds, so significantly longer to ask for all the columns of a particular row. He then looks at insert, update and delete performance as well, and his conclusion, to quote: "the traditional table beats the pants off of the view on single row operations." This makes sense: Postgres stores data by rows, so fetching, inserting or deleting one row is very easy, whereas the column-oriented layout has to touch multiple column tables to change data or pull a whole row back. I thought this post was very interesting because it shows you the rationale for column-oriented stores and why they tend to be used for analytical processing. It's slower to put the data in, and it's slower to ask for just a row of data.
So it's terrible for transactional processing workloads, but it works really well when you're analyzing a few columns, doing a summary, an aggregate of some sort, or maybe an average, some function across a single column, and it enables more space efficiency. So definitely an interesting perspective and experiment that he puts in his post here.
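To see the shape of the trick, here is a minimal sketch, with hypothetical names rather than his exact schema, of one narrow table per column stitched back together with a view; with LEFT JOINs on unique keys, the planner can drop the joins a query doesn't reference:

```sql
-- Hypothetical example: each column lives in its own narrow table.
CREATE TABLE t_id    (id bigint PRIMARY KEY);
CREATE TABLE t_col_a (id bigint PRIMARY KEY REFERENCES t_id, a integer);
CREATE TABLE t_col_b (id bigint PRIMARY KEY REFERENCES t_id, b text);

-- A view reassembles the "row" form for convenience.
CREATE VIEW t AS
SELECT t_id.id, t_col_a.a, t_col_b.b
FROM t_id
LEFT JOIN t_col_a USING (id)
LEFT JOIN t_col_b USING (id);

-- Aggregating one column lets the planner skip the other columns' tables:
SELECT sum(a) FROM t;
```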
[00:09:14] The next post is "Using parallel sequential scan in PostgreSQL," and this is from Rafia Sabih's blog on blogspot.com. This was interesting in terms of explaining why a parallel sequential scan can potentially cost more than a regular scan, and in what cases. It comes down to the coordination of the different workers involved in a parallel sequential scan. For example, she says, let's say there's a table with 100 tuples, 100 rows, and we have two workers and one leader that needs to divide up the work and aggregate the workers' results. Say an arbitrary cost of scanning one tuple is 10, the cost of communicating one tuple from a worker to the leader is 20, and the cost of dividing the tuples among the workers is 30, and assume the leader gives 50 tuples to each worker. The cost of a plain, non-parallel sequential scan of the table is the cost of scanning one tuple times the number of rows, so a cost of 1,000. But for that same table, if you scan and return the whole thing in parallel, you have to add all these different costs up, and it comes out to 2,500, so more than twice the cost. All the coordination of the different workers to get the same result makes it more than twice as expensive in terms of cost values. But she says, let's change it up a little bit here.
[00:10:52] Quote: let's say we want to list only the tuples where column A, one of the columns in her table, is greater than 80, so basically pull 20 tuples out of the table of 100. The sequential scan cost stays the same: you still have to scan through all 100 tuples at a cost of 10 each, so it's still 1,000. In parallel, however, it works out differently: the fewer rows each worker actually has to pull back and send to the coordinator, or leader, the more efficient it gets. Here the cost drops down to 730, so the cost of the parallel scan is actually less than a standard sequential scan. And she says, quote, a parallel sequential scan "is likely to improve the performance of queries that require scanning a large amount of data, but only a few of them satisfy the selection criteria." So the more selective your query over a large table, the more efficient a parallel sequential scan will be. She also goes into the different configuration variables you can set for your PostgreSQL instance. So if you're interested in parallel sequential scan, how it works, and the rationale behind it, this is definitely a great blog post to check out.
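If you want to experiment with this yourself, here is a minimal sketch, with a hypothetical table name, of the relevant cost knobs and how to check whether the planner chose a parallel plan:

```sql
-- These settings shape the planner's parallel cost model:
SET max_parallel_workers_per_gather = 2; -- workers per Gather node
SET parallel_setup_cost = 1000;          -- one-time cost of launching workers
SET parallel_tuple_cost = 0.1;           -- cost of passing one tuple to the leader

-- A selective filter over a big table favors a parallel plan;
-- look for a Gather node over a Parallel Seq Scan in the output.
EXPLAIN SELECT * FROM big_table WHERE a > 80;
```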
[00:12:15] The next post is "It's just an expression," from the Haxoclock blogspot.com blog. This post is essentially about text search, specifically case-insensitive text search. One way to do it is to search on a column using ILIKE, but the performance of that gets pretty bad pretty quickly. You could also use a case-insensitive regex, but again the performance of that can be poor. He says there is one great extension you can use called citext, a case-insensitive text type; it lets you use indexes while still getting case-insensitive matching. But let's say you don't have that extension installed, or you didn't create the table with that field type in the first place, and you still need to search. Email addresses are the classic example: you have a lowercase email address and you're trying to find a match for it in the table, but the data in the table may not be lowercase. You can say WHERE lower(email) equals the value, but the problem is it's not going to use an index; you can see it's doing a sequential scan on the email addresses even if you have an index on the column. What you can do instead is create an expression index, an index on lower(email) essentially, and of course now the query will use that index. So those are a couple of different solutions for doing case-insensitive searches in your database.
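Here is a minimal sketch, with hypothetical table and column names, of the expression index fix described above:

```sql
-- Hypothetical example: case-insensitive email lookup.
CREATE TABLE users (id bigserial PRIMARY KEY, email text);

-- A plain index on email won't match WHERE lower(email) = ...;
-- an index on the expression itself will:
CREATE INDEX users_lower_email_idx ON users (lower(email));

-- This query can now use the expression index:
SELECT id FROM users WHERE lower(email) = 'person@example.com';
```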
[00:13:44] The next post is "Peeking At Query Performance of the Upcoming Version 11 of PostgreSQL." Now, this was released last week, but Postgres 11 has already come out, so the post is a little bit dated; they're testing against release candidate 1, so just something to keep in mind. They use some test queries to measure how performance has changed in 11, they describe the hardware and server config they used for the tests, and they show an example of a sum query, a join query, a cube query and an index query. A few of them are only a little bit better with the version 11 release candidate, but some show dramatic improvements, like the cube and the index queries. So if you're looking for performance differences in version 11, this is potentially a blog post to check out.
[00:14:39] The next post is "The Biggest Mistake Postgres Ever Made." This is from Craig Kerstiens' blog, and it goes into a little bit of the history of Postgres and where it came from. Basically, it gets its name from Ingres: it came after the Ingres database, and that's how Postgres came into being. But as he says, quote, in the early days of Postgres there was no SQL. Not NoSQL, but no SQL: it simply didn't support it. In 1995 they added SQL support, and with it they decided to change the name to PostgreSQL. That single change is essentially the biggest mistake Postgres ever made, because now the official way to say it, and even I've been corrected on this, is "PostgreSQL." In retrospect, it seems like it should have just been named Postgres. But anyway, this is an interesting post about a little bit of the history and the biggest mistake Postgres ever made.
[00:15:37] The next post is "Introducing PostgreSQL Anonymizer." This is a very early extension that someone has put together, but it's something that may be interesting. I'm not sure about the implementation, but here's what it does: you can put a mask on certain roles to obfuscate data. In other words, you keep the same data in the database, but on retrieval, a function replaces what particular users will see. So, for example, if someone with DBA rights queries the database, they get the exact data; in this case they would see the name and the phone number. But for other users that have a masked role, you can define functions that scramble the returned data so they can't see the actual values. Again, this is super early, but if you're potentially interested in this for achieving GDPR compliance, or just need to anonymize data, perhaps this is a project you want to check out.
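To illustrate the general idea, here is a hand-rolled sketch, explicitly not the PostgreSQL Anonymizer extension's actual API: masking boils down to routing a role through functions that scramble values.

```sql
-- Hand-rolled illustration of masking; not the extension's API.
CREATE FUNCTION mask_phone(phone text) RETURNS text AS $$
    SELECT overlay(phone PLACING 'XXX-XXX' FROM 1 FOR 7);
$$ LANGUAGE sql;

-- DBAs query "people" directly; masked roles only see the view.
CREATE VIEW people_masked AS
SELECT name, mask_phone(phone) AS phone
FROM people;

REVOKE ALL ON people FROM masked_role;
GRANT SELECT ON people_masked TO masked_role;
```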
[00:16:47] The last post is "TimescaleDB 1.0 is Production Ready." So essentially TimescaleDB 1.0 has been released, and this is the general announcement discussing it and the changes added since release candidate 1. If you're interested in a time series database that is essentially an extension of PostgreSQL, this is definitely a blog post to check out.
[00:17:11] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.