Watch, Fast Text Search, Column Defaults, Import CSVs | Scaling Postgres 28
In this episode of Scaling Postgres, we review articles covering watching Star Wars, fast text search, column defaults and importing large CSVs.
Content Discussed
- !!Con 2018: Using Postgres to \watch Star Wars! by Will Leinweber
- psql Tips and Tricks
- Fast Full-Text Search in PostgreSQL
- A Missing Link in Postgres 11: Fast Column Creation with Defaults
- Using COPY in Postgres for Importing Large CSVs
- Fast PostgreSQL Data Loading Using Ruby
- Postgres data types you should consider using
- Don’t LOCK tables. Just don’t.
- [Video] Data Integration with PostgreSQL
- PostgreSQL Indexing: How, why, and when.
- What hot_standby_feedback in PostgreSQL really does
- Tune Linux Kernel Parameters For PostgreSQL Optimization
- Real world SSD wearout
- Beware of your next glibc upgrade
YouTube Video
Podcast Audio
Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about watching Star Wars, fast text search, columns with defaults, and importing large CSVs. I'm Creston Jamison, and this is Scaling Postgres, episode 28.
[00:00:22] Hi. We have another week with a lot of great content, so let's get started. Now, the first item isn't necessarily related to Scaling Postgres, but it's super short and interesting.
[00:00:35] So this is actually a YouTube video. It's only eight minutes long, and it's called !!Con 2018: Using Postgres to \watch Star Wars! by Will Leinweber. So basically there is an ASCII animation of Star Wars that someone has developed over the years. It kind of looks like this: it shows text from the film along with a little ASCII animation of what's going on. And he, using Ruby I believe, takes the raw data from the website that hosts this ASCII animation, loads it up into Postgres, and then uses the psql \watch command, which enables you to re-run a query multiple times with a delay in between, to watch Star Wars as slow or fast as you want based upon the parameters. And you can even update the database to determine at what point you start the movie, or fast forward or rewind to a particular point. So again, not related to Scaling Postgres, but I thought this was one of the more interesting pieces of content this week, and I just wanted to show it.
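For the curious, here is a hedged sketch of how a \watch-driven player could work. The table and column names are made up, and this is not necessarily how the talk implements it, just the general shape of the trick:

```sql
-- Hypothetical sketch: each frame records how many seconds into the movie it
-- appears, and a playback row records when playback started.
CREATE TABLE frames   (starts_at numeric PRIMARY KEY, body text);
CREATE TABLE playback (started timestamptz);
INSERT INTO playback VALUES (now());

-- Pick the latest frame whose offset has already elapsed, then let psql
-- re-run it with:  \watch 0.1
-- Updating playback.started is how you would jump forward or back.
SELECT body
FROM frames, playback
WHERE starts_at <= extract(epoch FROM now() - started)
ORDER BY starts_at DESC
LIMIT 1;
```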
[00:01:49] So kind of related to the previous YouTube video, this is a post called psql Tips and Tricks. This is from the PG IO blog, and they have little GIFs that go through some of the top tips for using psql. The first one they talk about is \timing, which tracks the timing of the queries that you run.
[00:02:12] If you don't want NULL to be invisible, you can choose to have it display as a particular value. Again, going back to \watch, where you watch a repeatedly executed query, you can also alter the style of psql's table output. They have an example where you save the results of a query to a local CSV file with the \copy command, which is similar to the COPY SQL command. And they show \h for a built-in syntax reference, as well as using an editor for commands. So definitely some tips and tricks to check out.
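For reference, here's roughly what those psql meta-commands look like in a session (the query and file name are just examples):

```sql
\timing on                     -- report how long each query takes
\pset null '(null)'            -- make NULLs visible instead of blank
SELECT 1 AS a, NULL AS b;

\watch 2                       -- re-run the last query every 2 seconds

-- Save query results to a local CSV file (client-side, unlike server-side COPY):
\copy (SELECT * FROM pg_stat_activity) TO 'activity.csv' WITH CSV HEADER

\h CREATE INDEX                -- built-in syntax reference
\e                             -- edit the query buffer in an external editor
```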
[00:02:50] The next post is Fast Full-Text Search in PostgreSQL, and this is from the Austinjwalters.com blog. So here he's talking about fast text search: he wanted to search through a large set of data, and he tested a few different ways. Of course the slow way is using full text search with tsvectors, or to_tsvector, without an index; that's clearly going to be slow. Something a little faster is just doing a LIKE, so searching for, say, 'google' with a percent wildcard on the other side, though an index won't really give a great improvement on this. However, combining tsvectors with a GIN index generally gives you the fastest text search results. Now, a little twist he put on it: he didn't just put the GIN index on the column he was interested in, he actually created a new column so that he could set weights for how much to prioritize different pieces of information. So he set the weight of the story title to A and the weight of the comment text to B, so it's going to preferentially weight the story title above what's in the comment text. Now, in doing that you're going to have to create a function and then a trigger to keep that column up to date. So there is a cost to doing this, but it gives you the ability to have those weights in the database. But again, most of the speed is definitely due to the GIN index being used. If you want to learn more about full text search, it's definitely a blog post to check out (a sketch of that weighted column setup follows the next article below). The next article is A Missing Link in Postgres 11: Fast Column Creation with Defaults, and this is from the brandur.org blog. So basically he's talking about a feature in Postgres 11 whereby you can quickly add a column with a default to a table. Now, he goes over what has been the problem in previous versions of Postgres. If you want to add a column to a table, for example you alter table users to add a column credits as a bigint, this happens very fast.
[00:05:08] The problem comes in if you want to make something NOT NULL. Well, then you need to set a default, and then how does the database handle it? It needs to do a full table rewrite, and it basically holds an ACCESS EXCLUSIVE lock while it's doing this.
[00:05:24] So basically the process you have to go through is: add the column as was done up here, then backfill it, and at that point you can set the NOT NULL and default and things of that nature. But it's a bit of a process to do. With Postgres 11, they've implemented features that enable adding a column with a default that won't take this ACCESS EXCLUSIVE lock for a rewrite, and it doesn't have to touch every single row in the table. In the under-the-hood section he describes, quote, the change adds two new fields to pg_attribute, a system table that tracks information on every column in the database: atthasmissing is set to true when there are missing default values, and attmissingval contains the missing value. So basically this works great for being able to add a default with new columns you're adding. And as this post goes through, it's really great for helping maintain the consistency of your data, because he indicates what people tend to do is just add the column and not worry about setting a NOT NULL or a default, because of the effort required for large databases to essentially rewrite that table.
[00:06:34] But with this feature, more developers are going to be inclined to add that column and, when needed, add a NOT NULL and potentially a default. So definitely a blog post to check out if you're interested in learning more.
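To make the Postgres 11 change concrete, here's a small sketch using the users/credits example from above; the catalog query simply surfaces the two pg_attribute fields the post mentions:

```sql
-- On Postgres 11, this no longer rewrites the table, so the ACCESS EXCLUSIVE
-- lock it takes is brief; existing rows get the default lazily.
-- (Before 11, the usual workaround was: add the plain column, backfill it in
-- batches, then set the DEFAULT and NOT NULL afterwards.)
ALTER TABLE users ADD COLUMN credits bigint NOT NULL DEFAULT 0;

-- The pg_attribute fields the post describes (Postgres 11+):
SELECT attname, atthasmissing, attmissingval
FROM pg_attribute
WHERE attrelid = 'users'::regclass AND attname = 'credits';
```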
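And going back to the full-text search post, here's a minimal sketch of the weighted search column, GIN index, and trigger approach; the table and column names are illustrative rather than taken from the article:

```sql
-- A weighted tsvector column kept current by a trigger, with a GIN index.
ALTER TABLE stories ADD COLUMN search_tsv tsvector;

UPDATE stories SET search_tsv =
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(comment_text, '')), 'B');

CREATE INDEX stories_search_idx ON stories USING gin (search_tsv);

CREATE FUNCTION stories_search_trigger() RETURNS trigger AS $$
BEGIN
  NEW.search_tsv :=
      setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A') ||
      setweight(to_tsvector('english', coalesce(NEW.comment_text, '')), 'B');
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER stories_search_update
BEFORE INSERT OR UPDATE ON stories
FOR EACH ROW EXECUTE PROCEDURE stories_search_trigger();

-- Query with ranking, so title matches (weight A) sort above comment matches (weight B):
SELECT title
FROM stories
WHERE search_tsv @@ plainto_tsquery('english', 'postgres scaling')
ORDER BY ts_rank(search_tsv, plainto_tsquery('english', 'postgres scaling')) DESC;
```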
[00:06:47] The next post is Using COPY in Postgres for Importing Large CSVs. This is from the Trineo.com blog, and basically he had to import a million-line, 750 megabyte CSV file into Postgres for a Rails app. He started with a Rake task that parsed the CSV and imported each row via Active Record, but at that rate it would take 16 hours to complete. So he went through that process and came up with a better approach, which is definitely the case, using Postgres COPY. He goes through the steps: creating an imported records table, then copying from the file into that table, which at this point took 39 seconds as opposed to hours. He also had a need to merge tables, so from this temporary table he inserted into the destination table, but he used an ON CONFLICT DO UPDATE so that if a row did not exist it gets inserted, otherwise it gets updated with the new information. And lastly, which is optional, doing a VACUUM ANALYZE or VACUUM FULL ANALYZE if you want. Now, if you're interested in learning more about this, I actually did a tutorial called Fast PostgreSQL Data Loading Using Ruby, where I went through different techniques you could use and showed the timing in the video. I looked at loading one row per INSERT statement, but what was interesting is that you can really crank up the number of rows per INSERT statement. Now, I had to do a lot more work to get that working, but it started approaching the speed of COPY. Basically, COPY is probably going to blow the doors off anything; however, you can get super close loading multiple rows per INSERT statement versus just one row at a time. So if you're interested, you can definitely check out this piece of content as well.
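As a rough sketch of that staging-table approach (the file path, table, and column names are examples, not taken from the post):

```sql
-- Illustrative tables; the destination needs a unique key for ON CONFLICT.
CREATE TABLE records          (id bigint PRIMARY KEY, name text, amount numeric);
CREATE TABLE imported_records (id bigint, name text, amount numeric);

-- Bulk load the CSV into the staging table (client-side with \copy; use
-- server-side COPY if the file lives on the database server):
\copy imported_records FROM 'data.csv' WITH CSV HEADER

-- Merge into the destination: insert new rows, update existing ones.
INSERT INTO records (id, name, amount)
SELECT id, name, amount FROM imported_records
ON CONFLICT (id) DO UPDATE
  SET name   = EXCLUDED.name,
      amount = EXCLUDED.amount;

-- Optional cleanup afterwards:
VACUUM ANALYZE records;
```

And as mentioned, batching many rows into each INSERT statement can get you close to COPY's speed, but COPY is usually the simplest fast path.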
[00:08:51] The next post is Postgres Data Types You Should Consider Using, and this is from the Citusdata.com blog. Of course, their number one here, quote, JSONB tops the list of Postgres data types, apart from the standard ones, that you should be using. Basically this is great for unstructured data that you're using in your application or interfacing with. An important thing to know about it is that GIN indexes really speed up access and use of this data, so be sure to use GIN indexes with it.
[00:09:24] The next quote is, range types are a calendar app's best friend. So basically these are great to ensure that you don't have double booking of something. Another type they mention is, quote, defining your own accepted values with enums. So basically, at the database level, you define what values are accepted. It's a way to enforce a constraint on, say, a state field. Then they go into cases like when you have an IP address, a timestamp, or a UUID; there are all sorts of built-in types you can use in Postgres. So this was a relatively short post that you should definitely check out.
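A few of those types in one illustrative sketch (the table names are made up):

```sql
-- JSONB for unstructured data, with a GIN index to speed up containment queries:
CREATE TABLE events (id bigserial PRIMARY KEY, payload jsonb);
CREATE INDEX events_payload_idx ON events USING gin (payload);
SELECT * FROM events WHERE payload @> '{"type": "signup"}';

-- Range types plus an exclusion constraint to prevent double booking:
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE bookings (
  room_id int,
  during  tstzrange,
  EXCLUDE USING gist (room_id WITH =, during WITH &&)
);

-- Enums to constrain a column to a fixed set of values:
CREATE TYPE order_state AS ENUM ('pending', 'paid', 'shipped');
CREATE TABLE orders (
  id    bigserial PRIMARY KEY,
  state order_state NOT NULL DEFAULT 'pending'
);
```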
[00:10:00] The next post is again a super short one, but it has really great information, and it's Don't LOCK Tables. Just Don't. So basically, when you need to serialize access to one or more resources, you may be inclined to lock a table, and they're saying don't do that. Number one, quote, it blocks autovacuum, which can cause bloat and even transaction ID wraparound in extreme cases. These types of locks can also block queries and potentially cause deadlock situations. And lastly, it's easy to cause deadlocks with bad locking order. Then they say, if the goal is to serialize access, consider using advisory locks instead. They have all the benefits of a lock on a table while not actually blocking access to autovacuum or access on secondaries. So definitely some advice to follow. The next post is actually a video from the 2ndQuadrant.com blog, and it's called Data Integration with PostgreSQL. It's about a 40 or 50 minute video, and basically, using foreign data wrappers and logical decoding, it shows how to interface PostgreSQL with MySQL to read, insert, or manipulate data, as well as how to do it from a flat file, from a Python script, using a REST interface, or with a different Postgres node. They even demonstrate some things with Kafka. So if you're looking at different ways to interface Postgres with other types of data, streams of data, or database systems, this is definitely a YouTube video to check out, and it was actually from a webinar called Data Integration with PostgreSQL by Shaun Thomas.
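Going back to the advisory-lock suggestion for a moment, here is a minimal sketch of what that looks like; the lock key is arbitrary:

```sql
-- Session-level advisory lock: blocks until acquired, must be released explicitly.
SELECT pg_advisory_lock(42);
-- ... do the work that must be serialized ...
SELECT pg_advisory_unlock(42);

-- Transaction-scoped variant, released automatically at commit or rollback:
BEGIN;
SELECT pg_advisory_xact_lock(42);
-- ... serialized work ...
COMMIT;

-- Non-blocking attempt: returns true/false instead of waiting.
SELECT pg_try_advisory_lock(42);
```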
[00:11:47] The next post is actually another YouTube video called PostgreSQL Indexing: How, Why, and When. This was presented at PyCon Australia by Curtis Maloney, and it basically goes over the basics of indexing, with some good advice mentioned throughout. It's about 31 minutes long. There were some good refreshers at the end for me, but a lot of it I had experience with before. If you're a developer that's relatively new to Postgres, though, getting indexing under your belt is essential. The next post is What hot_standby_feedback in PostgreSQL Really Does. So basically this is talking about the scenario where you have a primary database doing streaming replication to a replica database; essentially it's a hot standby. There are instances where, as the primary goes through updating and deleting data, it can actually remove something that impacts an active query on the replica database. What tends to happen is that that query gets canceled, and you'll see something to the effect of: error, canceling statement due to conflict with recovery. Now, this blog post goes over why that happens and how hot_standby_feedback can help prevent these replication conflicts. But like anything there are downsides to it, and they mention that vacuum is going to be delaying its cleanup operations, so it won't be able to clean up those dead rows as efficiently, which could lead to bloat. And quote, if the slave never terminates a query, it can lead to table bloat on the master, which can be dangerous in the long run. So this is one parameter that you can tweak, but just be aware of the downsides before you explore doing it. The next post is Tune Linux Kernel Parameters for PostgreSQL Optimization.
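For reference, here's a small sketch of the settings around hot_standby_feedback; weigh the bloat trade-off described above before enabling it:

```sql
-- On the replica: hot_standby_feedback only needs a configuration reload,
-- not a restart (you can also set it in postgresql.conf and reload).
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();
SHOW hot_standby_feedback;

-- On the primary, the standby's reported xmin (what the feedback holds back)
-- is visible in pg_stat_replication:
SELECT application_name, backend_xmin FROM pg_stat_replication;
```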
[00:13:41] So that post goes over a number of Linux parameters you can tweak for PostgreSQL. Now, a lot of these depend on the version of the OS you're using, which distro, and even the version of PostgreSQL. For example, SHMMAX and SHMALL are only relevant for versions prior to 9.3, and a lot of these settings only improve performance once your database gets to a certain size. So if you're not yet at, like, the terabyte stage, you probably won't get much of an improvement tweaking some of these parameters. But if you're going to, definitely do it in a replica of your production system first, testing it out and seeing what each parameter does, because some of these parameters are also very hardware specific. So definitely test before you roll things out into production. The next post is Real World SSD Wearout.
[00:14:39] This is from the okmeter.io blog, and they're talking about how SSDs can tend to wear out and how they use a monitoring agent that collects disk drive SMART attributes and can track the media wearout indicator; so, at what point is the SSD media going to wear out? This is just something to keep in mind if you actually have a physical server, rather than a hosted database solution where they maintain the hardware for you. If you literally have a physical system, this is something you need to be aware of that eventually happens. And with reference to PostgreSQL, which they talk about here, it can definitely happen, and with regard to writes, it can happen more than you expect. For example, if you have a lot of SQL queries, and those queries are generating temp files, or they're having to do sort operations on disk as opposed to in memory, that can lead to a lot of disk writes and usage that maybe you weren't expecting. So it's definitely something to be aware of if you have physical servers. The last post is Beware of Your Next glibc Upgrade, and they're talking about a big update to the collation information in GNU libc 2.28. Now, this probably isn't going to impact many people until some point in the future, but it's just something to be aware of, because it can have an impact on how sorting works based upon how you have set up your database. They mention at the bottom here, quote, as of this writing, only bleeding edge distros like Arch Linux have already shipped glibc 2.28, and they cover when it's scheduled for Fedora, Debian, and even Ubuntu. So just something to keep in mind, and definitely a blog post to check out to see if you might be impacted by this in the future.
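Related to the point about temp files and on-disk sorts causing unexpected writes, Postgres itself can report how much temporary-file I/O each database generates; here is a small illustrative check:

```sql
-- Temp-file I/O per database since its statistics were last reset; large
-- numbers often mean sorts and hashes are spilling to disk.
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_written
FROM pg_stat_database
ORDER BY temp_bytes DESC;

-- Optionally log any temp file larger than a threshold (here 10MB) to find
-- the queries responsible:
ALTER SYSTEM SET log_temp_files = '10MB';
SELECT pg_reload_conf();
```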
[00:16:41] That does it for this episode of Scaling Postgres. You can get links to all the content presented in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.