The time between Christmas and New Year is always a nice quiet period to concentrate and tick off some mouldy TODO list items. For me, the top item this time was a script (which somehow turned into a small framework) to download and import some openly available “real world” Postgres datasets, so that I could run some small tests on them.

The background for this “need”: quite some time ago I was asked whether the newer, generally good-looking compression methods (lz4, zstd) added in recent Postgres versions are a no-brainer for a typical database, or whether they also have some hidden downsides and should not yet serve as the new defaults. As I sadly didn’t have a really good answer back then, I had to go with the typical - it depends on the dataset…. But the topic piqued my interest and made me experiment a bit…but hell, how to get hands on that mythical “average” dataset to benchmark against? Well, I guess the only way to get something of the sort is to scrape together some real-life datasets, run some tests on them, and see if something stands out consistently!

TL;DR - zstd should indeed be the new default pg_dump compression method for most datasets.

FYI - note though that pg_dump is not a great primary backup method against operational threats!

The test framework

The by-catch framework that I accidentally developed for this compression test can be found on Github.

Basically, it downloads / extracts / transforms / loads some openly available Postgres datasets and then runs a test script on each resulting database, looping over pg_dump with different compression settings. The tested settings, defined here, were:

METHOD_LVLS="gzip:1 gzip:3 gzip:5 gzip:7 gzip:9 lz4:1 lz4:3 lz4:5 lz4:7 lz4:9 lz4:11 zstd:1 zstd:5 zstd:9 zstd:13 zstd:17 zstd:21"
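Under the hood the test loop boils down to something like the below - a simplified bash sketch rather than the exact code from the repo; $DBNAME is assumed to hold the restored dataset’s name, and PG 16+ client tools are assumed, since those understand the --compress=method:level syntax:

for method_lvl in $METHOD_LVLS ; do
  method=${method_lvl%%:*}   # e.g. "zstd"
  level=${method_lvl##*:}    # e.g. "5"
  start=$(date +%s)
  # dump to stdout and just count the bytes, so that disk writes don't interfere
  size_bytes=$(pg_dump --format=custom --compress="${method}:${level}" "$DBNAME" | wc -c)
  echo "$method:$level took $(( $(date +%s) - start ))s, dump size $size_bytes bytes"
done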

PS If you plan to run the script yourself, note that with default settings it will take about half a day, download ~7GB over the internet and require ~100GB of disk space. If low on storage, set DROP_DB_AFTER_TESTING=1 to drop all restored DBs right after running the compression test.

pg_dump compression test results for gzip, lz4 and zstd

Main evaluation criteria of “goodness”

This required a bit of head-scratching actually…what does the best compression look like - fast, or with the smallest file sizes?
As I’m not exactly a statistician, I decided to ignore the more complex statistical models and go for the simplest thing that made sense to me - prioritizing speed and output size equally. In short - I normalized all test runs against both the shortest time and the smallest output byte size (within each dataset) and simply summed the normalized values together to get a “score” or a “grade” (where lower is better) for a particular method and level, per dataset - meaning a perfect score would be 2, i.e. having both the smallest output and the fastest execution.

I do hope this calculation flies, but just in case, here is the SQL for you to check - ping me if it’s way off.
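To give a rough idea of the calculation (the linked SQL is the authoritative version), it boils down to a query along these lines - an illustrative sketch only, run via psql and assuming a hypothetical results table test_results(dataset, method, lvl, time_spent_s, dump_size_b):

psql -X <<'SQL'
SELECT dataset, method, lvl,
       round((time_spent_s::numeric / min(time_spent_s) OVER (PARTITION BY dataset)
            + dump_size_b::numeric  / min(dump_size_b)  OVER (PARTITION BY dataset)
             )::numeric, 2) AS score  -- a perfect score would be 2.00
  FROM test_results
 ORDER BY dataset, score;
SQL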

Average rank per dataset, by combined speed + size normalized score

Out of 17 different method + level combinations. Note that the scores have been turned into ranks here, as the raw scores varied quite a bit and would be harder to visualize mentally I guess.

method level avg_per_dataset_rank
zstd 1 1.0
zstd 5 2.2
gzip 1 4.5
lz4 1 5.3
gzip 3 5.7
lz4 3 6.2
zstd 9 7.7
lz4 5 8.0
gzip 5 8.0
lz4 7 9.7
gzip 7 11.0
zstd 13 11.7
lz4 9 12.2
gzip 9 12.8
zstd 17 15.0
lz4 11 15.0
zstd 21 17.0

Fastest per dataset

dataset_name method level time_spent_s avg_time_spent_s dump_size avg_dump_size
imdb lz4 1 19.2 254.1 2004 MB 1315 MB
mouse_genome lz4 1 209.6 1807.9 6644 MB 4211 MB
osm_australia lz4 1 32.4 695.7 5434 MB 3655 MB
pgbench gzip 3 25.3 417.8 277 MB 278 MB
postgrespro_demodb_big zstd 1 5.1 55.7 239 MB 256 MB
stackexchange_askubuntu lz4 1 10.5 259.5 1988 MB 1317 MB

Smallest output per dataset

dataset_name method level time_spent_s avg_time_spent_s dump_size avg_dump_size
imdb zstd 13 225.2 254.1 1035 MB 1315 MB
mouse_genome zstd 21 18540.3 1807.9 2261 MB 4211 MB
osm_australia zstd 17 1829.0 695.7 2356 MB 3655 MB
pgbench gzip 9 59.8 417.8 263 MB 278 MB
postgrespro_demodb_big zstd 21 409.1 55.7 169 MB 256 MB
stackexchange_askubuntu zstd 9 95.7 259.5 1008 MB 1317 MB

Test hardware

The test machine this time was my workstation with an AMD Ryzen 5950X inside, running Ubuntu Linux with the CPU governor in “performance” mode and 32GB of RAM - meaning all datasets except mouse_genome should have been fully cached and the results not significantly affected by disk speed.

Summary

After looking at the test results (full SQL dump of my results table here), I think one can conclude the following:

  • Lower compression levels (1-5) for zstd offer the best bang for the buck across all 6 tested datasets
  • zstd also achieves the highest compression ratios when using higher compression levels
  • lz4 at lower levels is the fastest option, at the cost of larger dumps of course (50-100% bigger on average)
  • Higher compression levels really don’t offer much besides warming the planet
  • My personal current Postgres pg_dump default compression of gzip level 3 is still an OK performer both time- and size-wise, fitting into the top bracket for all tested datasets
  • The only artificial dataset in the bunch (pgbench) stood out - it was the only one where gzip came out on top
  • Note that this was not a multi-core test, and things could look different for pg_dump --format=directory with a few jobs (see the sketch right after this list)
    • The annoying thing (from a benchmarking point of view) with directory mode is that we can’t pipe to /dev/null (in my tests I actually piped to wc -c to get the dump size), so filesystem writes would possibly start to play a role. Something like nullfsvfs looks promising for bypassing that though - need to test it
  • Note that I didn’t test the full range of compression levels but stepped through them with a small stride, to keep the runtime down a bit
  • Note that restore times were not tested, as restoring is actually a much rarer operation and thus less relevant - it also depends a lot more on the hardware and a few Postgres config settings, and you typically won’t be bottlenecked on the decompression side there. But if you do plan to use pg_dump with compression for operational purposes, double-check that for sure.
    • Remember though (as also noted in the intro) - dumps should not be the default backup strategy!
  • The test data showed that with zstd higher compression levels can actually increase the output size 🤯 - which, to my surprise, seems to be “normal”
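For reference, such a multi-core run would be invoked roughly like the below - a hedged sketch only, nothing that was benchmarked here; the output directory, job count and zstd:3 level are arbitrary examples, and $DBNAME is reused from the earlier sketch:

# directory format allows parallel dump workers; output goes to a directory, not stdout
pg_dump --format=directory --jobs=4 --compress=zstd:3 --file=/tmp/pgdump_dir "$DBNAME"
du -sh /tmp/pgdump_dir  # size must be measured on disk here - no piping to wc -c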

Request for Postgres open datasets suggestions

Do you know of any cool real-life “ready-to-load-into-Postgres” datasets available on the interwebs? Please let me know via a Github Issue, or even better, open a more complete PR in the project repo https://github.com/kmoppel/pg-open-datasets! Thank you!

A notice on availability for consulting

A small FYI to end this one - as of now I’m also up for short-term consulting on all matters around Postgres. Please take a look here for some topics I have experience with, if you feel like you could use some help :)