
Postgres Goes Parquet | Scaling Postgres 339


In this episode of Scaling Postgres, we discuss pg_parquet allowing Postgres to read and write parquet files, other useful extensions, open source bounties, and Postgres gotchas.


Transcript

When I first heard of parquet files, I think it was with regard to geographic information systems, where it was a file format they used. But then I started hearing about parquet files being used with things like DuckDB, which could read and analyze them. So I guess I never really understood what they were. But the first post this week actually talks about parquet files and an extension that allows Postgres to use them natively. But I hope you, your friends, family and co-workers continue to do well.


Our first piece of content is “pg_parquet: An Extension to Connect Postgres and Parquet”. This is from CrunchyData.com, and this is a new extension, pg_parquet, that allows you to export tables or queries from Postgres to parquet files, ingest data from parquet files into Postgres, and also inspect the schema and metadata of existing parquet files. So this is super interesting. I think the other extensions that have had access to parquet files actually use things like DuckDB to do it, like pg_analytics by ParadeDB, which eventually became pg_lakehouse. I think they used the DuckDB database to be able to read parquet files, and you actually use the DuckDB engine to do it. Whereas this appears to allow the native Postgres engine to access these files. It also gives a better insight as to what they are, because it says “parquet is a great columnar file format that provides efficient compression of data”. So I actually didn't know it was a columnar file format, but it's an actual open source standard. He says Apache Parquet is an open source, standard, column-oriented file format that grew out of the Hadoop era of big data. In terms of using pg_parquet, you can actually use COPY commands and just specify the format as parquet once you have the extension installed. And you can not only write to local files, you can also write to S3 buckets, you can copy from parquet files, and it also allows you to inspect them. My understanding is that you can't really read the parquet files and run queries against them: you can copy them into Postgres tables or write Postgres tables out to parquet files, but you can't use Postgres as a query engine over them. So I guess that's a benefit of some of the other solutions I mentioned earlier, pg_lakehouse and whatnot, because they actually use the DuckDB engine to query information in the parquet files. But this could definitely be used for archival purposes. We'll have to see where this goes in the future, so be sure to check out this piece of content if you're interested.
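Based on the examples in the blog post, usage looks roughly like the following. This is a minimal sketch: the table name, file paths and bucket URI are made up, and the schema inspection function name is my assumption of how the extension exposes it.

    CREATE EXTENSION pg_parquet;

    -- a made-up table for illustration
    CREATE TABLE orders (id int, total numeric);

    -- export a table (or any query) to a local parquet file
    COPY orders TO '/tmp/orders.parquet' WITH (format 'parquet');

    -- export directly to an S3 bucket (credentials come from the usual AWS configuration)
    COPY orders TO 's3://my-bucket/orders.parquet' WITH (format 'parquet');

    -- load a parquet file back into a Postgres table
    COPY orders FROM '/tmp/orders.parquet' WITH (format 'parquet');

    -- inspect the schema of an existing parquet file (function name is an assumption)
    SELECT * FROM parquet.schema('/tmp/orders.parquet');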


Next post is “The future of Postgres?”. This is from craigkerstiens.com, and he says that because Postgres needs to be very stable and reliable at its core, he thinks the future of Postgres is actually in extensions. He mentions some of his favorite ones here: pg_stat_statements, auto_explain, and pg_prewarm, which can be used for warming up a cache. He talks about other types of external extensions not part of the contrib modules, such as Citus, which allows you to do scale-out Postgres, pg_search, which embeds basically an Elasticsearch-quality search engine within Postgres itself, and pg_cron for doing scheduled jobs. And of course he mentions the recent pg_parquet extension as well, and there's a quote that kind of emphasizes its focus: “we wanted to create a lightweight implementation of Parquet that does not pull a multi threaded library into every postgres process”. So it's basically lean and simple. But check this out if you want to learn more.
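As a quick illustration of the extension workflow he's describing, here's a hedged sketch of installing and using pg_cron for a scheduled job. The cleanup query and schedule are made up, and pg_cron also has to be added to shared_preload_libraries before CREATE EXTENSION will work.

    -- pg_cron must be listed in shared_preload_libraries (postgresql.conf), then:
    CREATE EXTENSION pg_cron;

    -- run a made-up nightly cleanup job at 3:30 AM, using standard cron syntax
    SELECT cron.schedule('30 3 * * *',
        $$DELETE FROM events WHERE created_at < now() - interval '30 days'$$);

    -- list the scheduled jobs
    SELECT * FROM cron.job;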


Next piece of content: “Work on Tembo Open Source for Fun and Profit”. This is from tembo.io, and they have apparently partnered with Algora, which is a platform around code bounties. There are two open source projects they're sponsoring: one is PGMQ, which is a lightweight message queue for Postgres, and the other is pg_vectorize, which helps automate the transformation and orchestration of text to embeddings, for semantic search presumably. Each of these projects has particular issues they're looking for people to work on for a code bounty. So apparently you will be paid for developing a feature for each of these open source projects. If you have the skills and interest to do that, maybe you want to check out this blog post.


Next piece of content: there was another episode of Postgres FM last week. This one was on “Postgres Gotchas”. In this one, Nikolai and Michael talked about things that new users, but even veteran users, can run into at times when using Postgres. The first thing they mentioned is that any discussion of nulls was off the table, because there are probably too many gotchas with nulls and that would have taken the whole episode; they actually discussed nulls in a separate episode. The next thing they talked about is that when you create a table, any unquoted identifier is folded to lowercase. I knew that, but if you use double quotes you can use any kind of casing you want and it will be retained. So you can have one table called foobar and then make a camel-case "FooBar" table or an all-caps "FOOBAR" table, but then you need to use double quotes when referencing those tables, which I generally consider more of a pain. That can definitely catch people. The other thing they mentioned is the importance of analyzing: after you do an upgrade, or even a reload of tables into the database system, it's very important to run ANALYZE to refresh the statistics, and it's not something pg_restore or pg_upgrade does for you; you have to do it yourself. The recommendation by Nikolai is basically to use a tool like vacuumdb if you can, set the number of jobs to as many cores as you have on your machine, and be sure to use analyze-only, because you don't necessarily want vacuum to run, you just want those statistics updated as fast as you can. They also mentioned another gotcha: DDL or table changes impacting queries in the system because the DDL can't acquire a lock. Even though the modification itself may be super fast, if it can't obtain the lock, you can cause a lock queue with queries falling in behind it because it can't execute. My number one recommendation with that is whenever you're doing DDL changes, be sure to use a lock_timeout, so the DDL change times out and protects your system from causing a bunch of locking. And related to that, when you're creating indexes, be sure to always create them concurrently, because it's super easy to forget to do that and then you're locking things up unintentionally. The next thing they mentioned is actually an SQL standard thing, but it's the fact that deletes and updates can run without a WHERE clause. I'm sure plenty of people have been caught out by deleting all the rows in a table or updating all the rows in a table unintentionally because they didn't include a WHERE clause, but that's not necessarily Postgres specific. They did mention a few other things, so if you want to learn more, you can listen to the episode or watch the YouTube video down here.
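To make a couple of these gotchas concrete, here is a minimal sketch with made-up table, column and index names, showing the quoted-identifier casing issue plus the lock_timeout and CREATE INDEX CONCURRENTLY advice; the timeout value is just illustrative.

    -- unquoted identifiers are folded to lowercase, quoted ones keep their case
    CREATE TABLE foobar (id int);
    CREATE TABLE "FooBar" (id int);   -- a distinct table; must always be referenced as "FooBar"
    SELECT * FROM FooBar;             -- actually queries foobar, not "FooBar"

    -- protect DDL with a lock timeout so a blocked ALTER doesn't queue up everything behind it
    SET lock_timeout = '5s';
    ALTER TABLE foobar ADD COLUMN created_at timestamptz;

    -- build indexes concurrently to avoid blocking writes
    CREATE INDEX CONCURRENTLY idx_foobar_created_at ON foobar (created_at);

    -- after pg_upgrade or pg_restore, refresh statistics
    -- (or from the shell: vacuumdb --analyze-only --jobs=<number of cores>)
    ANALYZE foobar;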


Next piece of content: “Chicago PUG October 15 recording”. This is from the Postgres User Group of Chicago, and the recording looks to be a discussion all about temporal tables. This is the ability to track multiple rows in a table with the same unique identifier, as long as the time range of each of those rows is distinct. If this is of particular interest to you, you may want to find out more in this presentation, because presumably many of these features are coming in Postgres 18.
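To illustrate the idea, here's a minimal sketch of an application-time temporal table using an exclusion constraint, which works on current Postgres versions. The table and column names are made up, and the Postgres 18 WITHOUT OVERLAPS syntax mentioned in the final comment is based on an in-progress feature and may change.

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE room_rates (
        room_id      int NOT NULL,
        rate         numeric NOT NULL,
        valid_during daterange NOT NULL,
        EXCLUDE USING gist (room_id WITH =, valid_during WITH &&)
    );

    -- the same room_id can appear in multiple rows as long as the time ranges don't overlap
    INSERT INTO room_rates VALUES (1, 100, '[2024-01-01,2024-07-01)');
    INSERT INTO room_rates VALUES (1, 120, '[2024-07-01,2025-01-01)');

    -- Postgres 18 is expected to let you declare this directly, roughly:
    --   PRIMARY KEY (room_id, valid_during WITHOUT OVERLAPS)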


Next piece of content: “Hacking Postgres: Francesco Tisiot”. This is from tembo.io, and this is actually a podcast or show. It looks like Season 3, Episode 1 of Hacking Postgres, where they interview Francesco Tisiot of Aiven. So give this a listen if you have interest in that.


Next piece of content: “Postgres 17 Presentation”. This is from momjian.us, and this is a presentation that Bruce Momjian created about what he thinks are the most important areas of improvement in Postgres 17. So you can check this out if you're interested.


Next piece of content: “Why Does Everyone Run Ancient Postgres Versions?”. This is from neon.tech, and I think a lot of this post discusses why it's important to do upgrades and all the work required to do them. I know the reason that sometimes, for some of my applications (not necessarily my clients', but the ones I actively run), I don't keep up on the latest version is because so many of the other servers that make up an application are ephemeral, but your database always needs to stay up. So I know even I am at times guilty of not staying on the latest and greatest version of Postgres. But feel free to check out this content.


Next piece of content: “Improved not null/null planning in PostgreSQL 17”. This is from dbi-services.com, and I would say this enhancement turns redundant null checks into a no-op. So for example, you create a table, insert a million rows into it, analyze it, and the particular column you created is declared NOT NULL, so it can never be null. If you do a query looking for rows where that column IS NOT NULL, in Postgres 16 you'll get a sequential scan, but it's also doing a filtering operation on the IS NOT NULL condition. In Postgres 17 it says, well, we don't need to check that; all we need to do is a sequential scan. There's no reason to add a filter, because nothing can be null. Similarly, when you look for rows in that table where the value IS NULL, in Postgres 16 Postgres will run the query, you can see a parallel sequential scan here, and it's doing a filter looking for where a is null, but it can never be null. Postgres 17 just immediately returns with a one-time filter saying false. So it's basically a no-op. This is a super great enhancement where Postgres doesn't need to do any work, so that's great.
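Here's a hedged sketch of the kind of comparison the post walks through, with made-up table and column names; the plan text differs between versions as described above.

    CREATE TABLE t (a int NOT NULL);
    INSERT INTO t SELECT generate_series(1, 1000000);
    ANALYZE t;

    -- Postgres 16 shows a seq scan with "Filter: (a IS NOT NULL)";
    -- Postgres 17 drops the redundant filter entirely
    EXPLAIN SELECT count(*) FROM t WHERE a IS NOT NULL;

    -- Postgres 16 scans and filters; Postgres 17 short-circuits to "One-Time Filter: false"
    EXPLAIN SELECT count(*) FROM t WHERE a IS NULL;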


And the last piece of content: “PostgreSQL 17: JSON_TABLE(), MERGE with RETURNING, and Updatable Views”. This is from andyatkinson.com, and he covers a lot of the major areas of Postgres 17, including SQL/JSON and the JSON_TABLE command, the MERGE enhancements with RETURNING, enhancements to database views, materialized views, as well as trigger-updatable views, which is a feature I haven't looked at yet. And he talks about the IN clause enhancements as well as lower memory consumption for vacuum. So if you want to learn more about that, definitely check out this blog post.
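As a quick taste of two of those features, here's a hedged sketch: the JSON_TABLE() call uses inline data, and the MERGE ... RETURNING example uses a hypothetical items(name, qty) table created just for illustration.

    -- JSON_TABLE() turns a JSON document into rows and columns (new in Postgres 17)
    SELECT *
    FROM JSON_TABLE(
      '[{"name": "apple", "qty": 3}, {"name": "pear", "qty": 5}]'::jsonb,
      '$[*]'
      COLUMNS (name text PATH '$.name', qty int PATH '$.qty')
    );

    -- a made-up table for the MERGE example
    CREATE TABLE items (name text PRIMARY KEY, qty int);

    -- MERGE can now report what it did per row via RETURNING and merge_action()
    MERGE INTO items AS t
    USING (VALUES ('apple', 10)) AS s(name, qty)
    ON t.name = s.name
    WHEN MATCHED THEN UPDATE SET qty = s.qty
    WHEN NOT MATCHED THEN INSERT (name, qty) VALUES (s.name, s.qty)
    RETURNING merge_action(), t.*;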
 
