Chaos Order, Pub Sub, Petabyte Scale, PgBouncer Connections | Scaling Postgres 90

In this episode of Scaling Postgres, we discuss chaos order, pub sub, petabyte scale analytics and PgBouncer connections.

Transcript

All right. I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is that new PostgreSQL versions have been released. So 12.1, 11.6, 10.11, 9.6.16, 9.5.20, and 9.4.25 are all out, and these are predominantly bug fixes and improvements, not really any security issues. They also mention that in February 2020, PostgreSQL 9.4 will reach end of life, so when you get an opportunity, go ahead and upgrade your version.

The next post is "Chaos Order. Randomizing Queries to Uncover Order-Dependency". This is from simplethread.com and basically, it's talking about the concept that unless you specify an order, there are no guarantees as to what order in which you're going to get returned data from an SQL query. So for example, in this first query we are asking for one user from the user's table, there's no guarantee that you're going to get ID 1 or ID 5. You could get ID 3, you have no idea. Now in a development environment maybe you're going to get it to see some consistency with regard to this, but there's no guarantee what you're going to be returning. Therefore you should always use an ORDER BY. For example, if you have an incremental ID such as this, maybe you want to order by the ID or if there's a date and timestamp, maybe you want to order by the date or timestamp that's available. 

But this blog post goes into a way to actually insert randomness into your tables to sleuth out potential errors caused by not using ORDER BY in your queries. So imagine a very large code base and you want to check whether it's resilient to changes in row order. He has a PostgreSQL function that adds a random column, and then you can cluster the table on that random column to change the order of the table, because clustering changes the physical order of the rows on disk. So once the table is clustered by this random value, a SELECT * FROM users shows the IDs coming back essentially randomized. Now, if you do new INSERTs and things of that nature, you're going to have to recluster it again. But still, this is a tool to test and make sure that your applications, and really all your queries, are using ORDER BY when necessary to ensure the proper order is being returned. So if you're interested in learning more about that, definitely a blog post to check out. A rough sketch of the technique is below.
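
This isn't his exact function, just a hedged illustration of the idea: add a random column, index it, and CLUSTER the table on it (the table and column names are hypothetical).

```sql
-- Add a column and fill it with random values (hypothetical table name).
ALTER TABLE users ADD COLUMN random_sort double precision;
UPDATE users SET random_sort = random();

-- CLUSTER needs an index on the column it orders by.
CREATE INDEX users_random_sort_idx ON users (random_sort);

-- Physically rewrite the table in that random order on disk.
CLUSTER users USING users_random_sort_idx;
ANALYZE users;

-- Unordered queries now tend to return rows in the shuffled order.
SELECT id FROM users LIMIT 5;
```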

The next post is "System design hack: Postgres is a great pub/sub & job server". So this is from layerci.com and it's talking about how they've chosen to use PostgreSQL as their pub/sub server and even potentially a job server. And it says you have a lot of options. So they mentioned Kafka, RabbitMQ, Redis pub/sub, and Cloud solutions But you can use pub/sub with Postgres. How they've specifically used it is that they have CI workflows and continuous integration workflows they run through and they have a certain status. So an API creates a CI run that has a status of new, and then a worker grabs that new piece of work, changes the status to init, and then follows up with the status of running. So they show here where they have created their own type with specified states as an enum and all the different statuses that it can be. 

Those statuses are new, initializing, initialized, running, success, and error. They also have a ci_jobs table; on an API call, a new row is inserted into it. Then a worker claims a job by doing an update, setting the status to initializing. It selects the ID using FOR UPDATE SKIP LOCKED, so rows that are already locked by another worker get skipped. Then they have a trigger function on that table that publishes to the ci_job_status channel whenever a status is inserted or updated. A rough sketch of that setup is below.
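
This isn't their exact code, just a hedged sketch of the pieces described; the ci_job_status channel name comes from the post, while the column list and trigger names are illustrative.

```sql
-- Status enum and jobs table (sketch; columns are illustrative).
CREATE TYPE ci_job_status AS ENUM
  ('new', 'initializing', 'initialized', 'running', 'success', 'error');

CREATE TABLE ci_jobs (
  id         serial PRIMARY KEY,
  repo       text NOT NULL,
  status     ci_job_status NOT NULL DEFAULT 'new',
  created_at timestamptz NOT NULL DEFAULT now()
);

-- Trigger function that publishes every status change to a channel.
CREATE OR REPLACE FUNCTION ci_jobs_status_notify() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('ci_job_status', NEW.id::text || ':' || NEW.status::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ci_jobs_status_trigger
  AFTER INSERT OR UPDATE OF status ON ci_jobs
  FOR EACH ROW EXECUTE FUNCTION ci_jobs_status_notify();
```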

And then they show the client code that consumes or monitors this channel and picks up the job to do the work. They have a diagram showing that an API call does an insert into Postgres, which fires the trigger, which notifies the Postgres channel they've created. Then the init worker, which is listening on that channel, is notified and claims the row with an update, setting the status to initializing. Work continues from there. Then they go into some of the other benefits. So if you're interested in using Postgres as a pub/sub platform, definitely a blog post to check out. A sketch of the worker side is below.
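
On the worker side, the listen-and-claim step they describe could look roughly like this; again, this is a sketch rather than their exact code.

```sql
-- The worker subscribes to the channel the trigger publishes on.
LISTEN ci_job_status;

-- When notified, it claims one unclaimed job, skipping rows that
-- other workers have already locked.
UPDATE ci_jobs
SET status = 'initializing'
WHERE id = (
  SELECT id FROM ci_jobs
  WHERE status = 'new'
  ORDER BY id
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id, repo;
```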

The next post is "Architecting petabyte-scale analytics by scaling out Postgres on Azure with the Citus extension". This is from the techcommunity.microsoft.com. Now there seems to be more so of a marketing piece describing the Citus extension, but this kind of goes into the story of the Windows Update platform where they release all of their Windows patches for their operating systems and the analytics that they use to track it. This was before Citus became a part of Microsoft, and it describes how they were looking for a solution for their analytics and how they identified the Citus extension and PostgreSQL as potential solutions for the analytical problems that they were dealing with. 

So it definitely talks a lot about the business case and how they chose this solution. And of course, Microsoft eventually acquired Citus, so clearly it was working out. But it also goes into some of the architecture they use for doing the Windows Update analytics with the Citus cluster, and they talk about building it on Microsoft Azure, since it's now supported there. So definitely an interesting piece of content to check out and read over.

The next post is actually a YouTube video called "Discover What's New in Postgres & Advanced Server 12". This is from the EnterpriseDB YouTube channel. It's a webinar that discusses some of the new features in Postgres 12. Predominantly, these eight are covered: partitioning improvements, B-tree improvements, multicolumn most-common-value statistics, inlining of CTE queries, prepared plan control, just-in-time compilation, checksum control, and REINDEX CONCURRENTLY. I believe these are mostly Bruce Momjian's favorite features. So if you want to find out more and hear a discussion in video format, it's definitely a presentation to check out; a couple of the features are illustrated below.
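
As a quick illustration of two of the features mentioned (the table and index names here are hypothetical): Postgres 12 inlines CTEs by default but lets you force the old materialized behavior, and indexes can now be rebuilt concurrently.

```sql
-- Postgres 12 inlines CTEs by default; MATERIALIZED restores the old behavior.
WITH recent_orders AS MATERIALIZED (
  SELECT * FROM orders WHERE created_at > now() - interval '1 day'
)
SELECT count(*) FROM recent_orders;

-- Rebuild an index without taking a write-blocking lock.
REINDEX INDEX CONCURRENTLY orders_created_at_idx;
```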

The next post is "PgBouncer Connection Pooling: What to do when persistent connectivity is lost". This is from enterprisedb.com. It's the second post in the PgBouncer series where they did an install of it. So this is saying what happens when you do a query again through PgBouncer and you bring down the connection for a minute. Basically what happens is it hangs. But eventually, once the network interface is back up, it continues and does the query. But what happens if you actually restart the database, not just bring down the network? Well, then you're going to get a fatal error that mentions the server connection crashed, and the server closed the connection unexpectedly. 

Now, they say one way to get around this is to change PgBouncer's pool mode to transaction pooling, since the error only appears when you're using session pooling. They also mention a server reset query you can adjust if you want to stick with session pooling. But in most use cases I've seen, people want to use transaction pooling anyway, and with transaction pooling enabled you won't get that error message anymore: they did a query, restarted the server, and the next query produced no error. A sketch of the relevant settings is below.
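
For reference, this is roughly where those settings live in pgbouncer.ini; the database entry and port are just placeholders, not the post's exact configuration.

```ini
; pgbouncer.ini (sketch)
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
; transaction pooling avoids the session-level error described above
pool_mode = transaction
; by default this reset query only runs in session pooling mode
server_reset_query = DISCARD ALL
```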

The next piece of content is also related to PgBouncer: "Webinar: Using SSL with PostgreSQL and pgbouncer [Follow Up]". This is a webinar given by 2ndquadrant.com and it is pretty comprehensive, about an hour in duration, covering using SSL with PostgreSQL. Roughly the first 20 minutes are general SSL background, explaining the purpose of SSL, certificates, keys, and things of that nature. Around the 20-minute mark they start getting into the PostgreSQL configuration, and around the 45-minute mark they talk about PgBouncer specifically and configuring it to use SSL as well. So if you're interested in that type of content, it's definitely a webinar to check out, and you can register using the link in the post.

The next post is "Waiting for PostgreSQL 13 - Introduce the 'force' option for the Drop Database command". This is from depesz.com. So before, if you try to drop a database and you have a connection active, it won't let you. Personally, I kind of like that as a protection. But now they offer the option to use force to do it. So you can do a Drop Database database name with force and it will go ahead and drop the connections and then drop the database.  So another feature coming to PostgreSQL version 13.

Another feature is "Waiting for PostgreSQL 13 - Allow sampling of statements depending on duration". This is from depesz.com. So this enables two new configuration options log_min_duration_sample and log_statement_sample_rate. So the setting log_min_duration_statement has a higher priority when compared to the log_min_duration_sample. And whether that allows you to do as it mentions here is that you can use these new settings to, for example, log, as he says here, log 10% of queries over 100 milliseconds and then all queries above 500 milliseconds. So it gives you more granularity with regard to how you want to sample and track slow queries. So definitely a great feature addition for minimizing log files.

The next post is "OPEN STREET MAP TO POSTGIS - THE BASICS". This is from cybertec-postgresql.com and this blog post describes how to import OpenStreetMap data into PostGIS for the purposes of visualization and analytics. So if you're interested in doing that, definitely a blog post to check out.

The last post is "ST_Subdivide all the Things". This is from the Paul Ramsey blog at cleverelephant.ca and he's talking about the scale of geospatial data and how tracking all the points of Canada is represented with over a 68,000 point multi-polygon. In terms of analysis, subdividing using the st_subdivide function enables speeding things up. So if you're interested in doing that, definitely a blog post to check out.
