The Future, Slow Counts, Avoid Or, Replication | Scaling Postgres 12
In this episode of Scaling Postgres, we review articles covering the future of databases, why counting is slow, the best ways to avoid ORs, and all about replication.
Content Discussed
- It's the future (for databases)
- Postgres, MVCC, and you or, Why COUNT(*) is slow (David Wolever)
- Avoiding “OR” for better query performance
- PostgreSQL Data Types: Point
- An Overview of the Serial Pseudo-Datatype for PostgreSQL
- PostgreSQL Replication
- PG Phriday: BDR Around the Globe
YouTube Video
Podcast Audio
Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about the future, slow counts, avoiding OR, and replication. I'm Creston Jameson, and this is Scaling Postgres episode twelve.
[00:00:20] Alright, the first article we have is "It's the future (for databases)", and this is from the Citusdata.com blog. This is presumably a fictional story of a data architect in San Francisco who wants to talk with someone who is presumably up to date with the latest technologies. Basically, the data architect just wants to use Postgres, but the other person keeps trying to convince them to use other, newer data architectures. It's a very humorous post to go through, so trust me, you'll want to read it. Go ahead and check it out.
[00:01:10] The next post is actually a YouTube video that was posted from PyCon Canada: "Postgres, MVCC, and you or, Why COUNT(*) is slow" by David Wolever. Now, this was presented in November of 2017, but it was just posted recently, and it goes over why counting is slow, because this is one of the many questions that gets asked when things start to slow down: why is counting all the rows in a table slow? It's basically because of MVCC, essentially row visibility. Now, the presentation takes a while to get up to speed, so if you already have some knowledge of Postgres, I would definitely suggest jumping to around the twelve-minute-and-thirty-second mark, because that's where he starts covering slow counting. He does a little bit of live coding where he shows row visibility: if you're performing a transaction, within the transaction you see one version of the data, whereas everyone else sees the old version, and once you commit, everyone can then see the changed version. He talks about some of the ramifications of using MVCC and why you can get slow counts. Then he talks about why you have autovacuum and why it's important, and why you have transaction IDs and why they can potentially run out. So it starts off explaining why count is slow, but it also gets into some of the deeper architectural details of PostgreSQL. Now, he is a developer, so this is from a developer's perspective, but it gives a really good overview of why Postgres is designed the way it is. If you're already familiar with Postgres and its internals, this might be a little too basic for you, but if you're a developer wanting to get a sense of why Postgres is designed the way it is, this is a great overview. So I definitely suggest you check it out.
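As a concrete illustration of the row-visibility behavior he demonstrates, here is a minimal sketch; the accounts table and its columns are hypothetical stand-ins, not taken from the talk:

```sql
-- Session 1: update a row inside an open transaction.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
SELECT balance FROM accounts WHERE id = 1;   -- sees the new, uncommitted value

-- Session 2 (a separate connection, run at the same time):
SELECT balance FROM accounts WHERE id = 1;   -- still sees the old value
SELECT count(*) FROM accounts;               -- has to check the visibility of every
                                             -- row version, which is why COUNT(*)
                                             -- scans the table and can be slow

-- Session 1:
COMMIT;   -- after this, session 2 sees the new value too
```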
[00:03:21] The next article is "Avoiding "OR" for better query performance", and this is from the Cybertec-postgresql.com blog. It basically shows you the places in your queries where using OR can cause poor performance.
[00:03:38] So he goes over an example, setting up two tables with some primary keys, foreign keys, and even some indexes. He basically says the good ORs are the ones that are not in the WHERE clause; essentially, they're in a CASE expression or in the select list. Then he shows an example of some bad ORs. Now, sometimes you just can't help it, and in his example there's no good alternative to selecting this way. But sometimes you can just select based on the ID: given the text field value 'value_42', you already know the ID is 42, so you can select by the ID instead. Basically, minimizing your use of OR in the WHERE clause will speed up query performance. Probably the best advice is that IN is better than OR, and he shows it explicitly: when you're doing this ID OR that ID, it does a bitmap index scan for each of the conditions and then BitmapOrs them, whereas you can do a single index scan when you're using IN. Then he goes into another example where, if you're using the LIKE operator, something like this cannot use an index; for example, he's getting a sequential scan when using the LIKE operator with two different text patterns. However, you can use the pg_trgm module and a GIN index to be able to use an index for these particular types of queries. And then lastly, he follows up with the ugly OR, where you're ORing between two different tables, and he shows that one alternative is using UNION, which generates a more efficient plan for pulling your data. So it's definitely a blog post to review to help make your queries faster.
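To make those patterns concrete, here is a rough sketch of the cases he covers; the items and details tables and their columns are hypothetical, not the ones used in the article:

```sql
-- OR between two values of the same column: typically two bitmap index
-- scans combined with a BitmapOr node.
SELECT * FROM items WHERE id = 42 OR id = 4711;

-- The same query with IN: usually a single index scan.
SELECT * FROM items WHERE id IN (42, 4711);

-- LIKE with a leading wildcard can't use a plain b-tree index, but a
-- trigram GIN index from the pg_trgm extension can serve it.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX items_value_trgm ON items USING gin (value gin_trgm_ops);
SELECT * FROM items WHERE value LIKE '%4711%' OR value LIKE '%1492%';

-- The "ugly" OR across two joined tables can often be rewritten as a
-- UNION, letting each branch use its own index.
SELECT i.id, i.value
FROM items i JOIN details d ON d.item_id = i.id
WHERE i.value = 'value_42'
UNION
SELECT i.id, i.value
FROM items i JOIN details d ON d.item_id = i.id
WHERE d.note = 'note_42';
```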
[00:05:28] The next blog post is "PostgreSQL Data Types: Point". This is a continuation of the PostgreSQL data types series this blog has been producing; this is the Tapoueh.org blog. Now, point, of course, is used for geospatial data, so it's essentially a latitude and longitude that you can store in a single data type called point. This is quite a long blog post and there's a lot that goes into it. I would say it's definitely a niche data type used in specific cases, and I haven't really seen it used much; mostly I've seen people using other data types to store latitude and longitude rather than storing them in one field. But if you are working with geospatial-related data, it's definitely a data type to check out to see if it could be beneficial for you.
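If you want a feel for what working with point looks like, here is a minimal sketch; the places table is a hypothetical example, not from the post:

```sql
-- Storing a coordinate pair in a single point column.
CREATE TABLE places (
    id   bigserial PRIMARY KEY,
    name text NOT NULL,
    loc  point NOT NULL          -- (x, y), e.g. (longitude, latitude)
);

INSERT INTO places (name, loc) VALUES ('Paris', point(2.3522, 48.8566));

-- Distance between two points with the <-> operator (Euclidean, not geodesic).
SELECT name, loc <-> point(2.2945, 48.8584) AS distance
FROM places
ORDER BY distance
LIMIT 5;
```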
[00:06:26] The next blog post is "An Overview of the Serial Pseudo-Datatype for PostgreSQL", and this is from the Severalnines.com blog. So serial is considered a pseudo data type because, when you create a table and specify serial, it actually creates an integer column, typically used as a primary key, that will auto-increment using a sequence. This post goes over some of the concepts of unique keys and declaring a serial data type. Here they do it in the CREATE TABLE statement and show what it's actually creating: there is no serial data type in the resulting table, it becomes an integer column, but it does create a related sequence, defines ownership, and ties that sequence to the integer column that's been created. It also shows a little bit about how it works when you're inserting data, talks about missing serial values, and even covers the sequence manipulation functions. So if you're curious about how serial works when setting up your tables, it's definitely a blog post to check out.
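Here's a quick sketch of what a serial column expands to behind the scenes; the orders tables below are hypothetical examples rather than the ones used in the post:

```sql
-- Declaring a serial column:
CREATE TABLE orders (
    id   serial PRIMARY KEY,
    item text
);

-- ...is roughly equivalent to doing this by hand:
CREATE SEQUENCE orders2_id_seq;
CREATE TABLE orders2 (
    id   integer PRIMARY KEY DEFAULT nextval('orders2_id_seq'),
    item text
);
ALTER SEQUENCE orders2_id_seq OWNED BY orders2.id;
```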
[00:07:34] The next blog post is actually a slide deck from a presentation, "PostgreSQL Replication", by Christophe Pettus of PostgreSQL Experts. Now, this is a very comprehensive presentation, and it goes over all the different types of PostgreSQL replication, from log shipping of WAL files, to streaming replication, to logical replication. I was familiar with a great deal of it, but there was definitely a great refresher for me around slide 38, where he talks about the logical replication options and goes on to cover some things you need to be aware of. For example, sequence values are not replicated; he's discussing the core logical replication in PostgreSQL 10, but he also discusses pglogical, a separate tool that's available for versions earlier than 10. He also talks about how TRUNCATE does not get replicated in core, whereas pglogical replicates TRUNCATE but not TRUNCATE CASCADE. And be aware that you can only replicate a real table to a real table: no materialized views, views, foreign tables, or partition root tables. It will replicate the data in the partition tables, but not the root table itself, and it cannot replicate temporary or unlogged tables. Also, COPY operations are broken into individual inserts, and statements are unrolled, so a single UPDATE changing 10,000 rows will be applied as 10,000 updates. He covers other issues as well, such as: if a primary with logical subscribers fails over to a secondary, the current logical replication state is not passed over to the secondary, which can cause synchronization problems, though he did note that PostgreSQL 11 should address this. So it's a great overview of all the different replication technologies available in PostgreSQL, both physical streaming replication and logical replication, and definitely a presentation I would suggest you review.
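For context, setting up core logical replication in PostgreSQL 10 and later looks roughly like this; the publication, subscription, table names, and connection details below are hypothetical examples, not taken from the slides:

```sql
-- On the publisher (requires wal_level = logical):
CREATE PUBLICATION app_pub FOR TABLE orders, places;

-- On the subscriber, a matching regular table must already exist:
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=publisher.example.com dbname=app user=replicator'
    PUBLICATION app_pub;

-- Sequence values are not replicated, so after failing over to the
-- subscriber you may need to advance sequences manually, e.g.:
SELECT setval('orders_id_seq', (SELECT max(id) FROM orders));
```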
[00:09:44] The last blog post is "PG Phriday: BDR Around the Globe". So this is an interesting post about BDR, which stands for bi-directional replication, a tool from 2ndQuadrant; this is from the 2ndQuadrant.com blog. It was interesting because it discusses the use cases for why you would particularly want to use this tool: it does logical, multi-master replication with basic conflict resolution (last update wins), and it handles distributed locking, global sequences, and high-latency replay. They go over scenarios where you have, say, four different data centers around the world and how this tool can help with that. So from the perspective of scaling, if you are looking to scale geographically in this fashion and essentially have one common data set, this is definitely a tool to examine to see if it would be beneficial for you.
[00:10:48] That does it for this episode of Scaling Postgres. You can get links to all the content presented in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.