As an engineer who primarily works with data and databases I spend a lot of time moving data around, hashing it, compressing it, decompressing it and generally trying to shovel it between VMs and blob stores over TLS. I am constantly surprised by how many systems only support slow, inefficient, and expensive ways of doing these operations.
In my experience, these poor algorithm choices are orders of magnitude slower than modern alternatives. Indeed, using a fast algorithm can often be the difference between actually doing compression/hashing/encryption and “Eh, I’ll skip it”. In the interest of using more of our CPU time to do useful work instead of melting ice-caps and giving programmers excessively long coffee breaks, we will explore some faster alternatives in this post.
TLDR
Application | Common Bad Performance Choices | Better Performance Choices | Expected Performance Gain |
---|---|---|---|
Trusted data hashing | `md5`, `sha2`, `crc32` | `xxhash` | ~10x |
Untrusted data hashing | `md5`, `sha2`, `sha1` | `blake3` | ~10x |
Fast compression | `snappy`, `gzip` (zlib) | `lz4` | 10x over `gzip`, ~2x over `snappy` |
Good compression | `gzip` (zlib) | `zstd` | ~2-10x |
Best compression | `xz` (lzma) | `zstd -10+` | ~2-10x |
Java crypto (`md5`, `aes-gcm`, etc.) | Built-in JVM crypto | Amazon Corretto Crypto Provider (ACCP) | ~4-10x |
Disclaimer: There are lies, damn lies, and benchmarks from some random person on the internet.
If you are considering taking some of the advice in this post, please remember to test your specific workloads, which might have different bottlenecks. Also, the implementation quality in your particular software stack for your particular hardware matters a lot.
For this post I’ll be playing with a ~6.5 GiB real-world JSON dataset on my laptop’s Intel Core i7-8565U pinned to 4 GHz. Since I want to benchmark the algorithms instead of my disks, I’ll be pre-touching the file into RAM with vmtouch. Remember that on most modern cloud servers with fast NVMe storage (multiple GiBps) and good page-caching algorithms, your disks are likely not your bottleneck.
```
$ du -shc yelp_academic_dataset_review.json
6.5G    yelp_academic_dataset_review.json

$ vmtouch -t yelp_academic_dataset_review.json
# ...

$ vmtouch yelp_academic_dataset_review.json
           Files: 1
     Directories: 0
  Resident Pages: 1693525/1693525  6G/6G  100%
         Elapsed: 0.1256 seconds
```
Hashing
I would like to check that this blob of data over here is the same as that data over there.
Trusted Data Hashes
These hash or checksum functions are used to ensure data integrity and usually are defending against bugs/bitrot/cosmic rays instead of malicious attackers. I typically see the following poor choices:
- `md5`: This hash is both weak and slow. It does have the advantage of being one of the fastest slow choices in standard libraries, and therefore it is somewhat common. Two of my favorite examples are slow Cassandra Quorum Reads and slow S3 upload/download (seriously, just try disabling `md5` on parts and see how much faster S3 is).
- `crc32` or `adler32`: I often hear “well crc32 is fast”, but to be honest I haven’t in practice come across particularly fast implementations in real-world deployments. Sure, theoretically there are hardware implementations, but at least in most Java or Python programs I profile running on most servers I encounter, these checksums are not particularly fast, and they only generate 32 bits of output, where I absolutely have to care about collisions.
- `sha2`: An oddly common choice for trusted data hashes. Odd because it is slow and you don’t need a cryptographic hash for syncing a file between two hosts within an internal network with an authorized transfer protocol (e.g. `rsync` or via backups to S3 with proper AWS IAM policies). If attackers have access to your backups or artifact store, you have a bigger problem than them forging a hash.
Faster Choice
Try xxHash. It is blazing fast, high quality, and the `XXH64` variant is usually sufficient for most data integrity checks. It even performs well on small data inputs, where `XXH3` is particularly impressive. If you find yourself needing 128 bits of entropy (e.g. if you’re hashing data for a DHT) you can either use the 128 bit variant or `XXH3`. If you only have an old version available that doesn’t support the new 128 bit variant, two `XXH64` runs with different seeds is usually still faster than any other choice.
Untrusted Data Hashes
While most of the time the threat model for data transfer is “bugs / cosmic rays”, some of the time people want to defend against bad actors. That’s where cryptographic hashes come in, most commonly:
- `md5`: It’s slow and not resistant to collisions … everything I love in a hash function.
- `sha1`: This hash is at least somewhat performant and isn’t as completely broken as `md5`. We should probably still consider it on shaky ground though.
- `sha2` (specifically `sha256`): A ubiquitous hash, and unlike `md5` and `sha1` it is still considered secure, so that is nice. Unfortunately this hash is noticeably slow when used in performance critical applications.
- `sha3`: The latest hash that NIST standardized (because they had concerns about `SHA-2` that appear at this time somewhat unfounded). I was not able to find it packaged outside of `openssl` and it doesn’t appear to have great `cli` or language support at this time. It’s also still pretty slow.
Faster Choice
Try BLAKE3. Yes, it is a new (2020) algorithm and there are concerns about security margin, but I’m just so tired of waiting for `sha256`. In practice, this is probably a much better choice than the known-to-be-completely-broken `md5`, so if you’re reaching for `md5` over `xxHash` because you need a cryptographic alternative, consider `blake3` instead. Also, `blake3` uses hash trees (merkle trees), which are wonderful when implemented correctly and I wish more systems used them.
The one major downside of `blake3` in my opinion is that at this time (2021) I only know of really good `cli`, `Rust`, `Go`, `C`, and `Python` implementations, and I can’t personally vouch for the `Java` JNI bindings. I’ve only needed to use it so far from streaming pipeline verifications or artifact verification, so the `cli` and `python` implementations are good enough for me.
A quick reminder that there are lots of security-sensitive hashing situations where you don’t want a fast hash. For example, one situation where you want an intentionally slow algorithm is when dealing with passwords. In such cases you want a very slow hash like `argon2`, `bcrypt`, `PBKDF2`, or even just a high number of rounds of `SHA-512`.
Show me the Data
Expect `xxHash` to net about a ~10x improvement on `MD5` and a ~5-10x improvement on `CRC32`, depending on your `CRC32` implementation (e.g. Java’s is truly terrible). Expect `BLAKE3` to be slightly slower than `xxHash` with a single thread, so only use it if you actually care about cryptographic hashes. A simple performance test on hashing a 6616 MiB file confirms that indeed we have 10x performance on the table (note I’m reporting user CPU time since the system time isn’t attributable to the hash):
Algo | Hash Time (seconds) | Hash Speed (MiBps) | Cryptographic? | Bits of Entropy |
---|---|---|---|---|
MD5 | 9.1s | 727 | Not really | 128 |
CRC32 | 4.8s | 1378 | No | 32 |
XXH64 | 0.5s | 13232 | No | 64 |
SHA256 | 27.5s | 240 | Yes | 256 |
SHA3-256 | 16.8s | 393 | Yes | 256 |
SHA1 | 10.7s | 618 | Yes* | 160 |
BLAKE3 | 1.8s | 3675 | Yes | 256 |
Yes, that’s right: ~0.5 seconds of user CPU time for `xxh64` versus ~27s for `sha256` and ~9s for `md5`. If all you need to do is verify a file transfer, you could be doing that 10x faster with xxHash.
The implementation language often makes a big difference, e.g. JNI versions that link to fast native code in Java will often significantly outperform pure Java versions. “But Joey”, you say, “I have to use XYZ algorithm from the early ’00s because of the specification!”. That is unfortunate, but at least make sure you’re using fast implementations; for example, ACCP will speed up `MD5` on most Java VMs by a factor of ~4, as well as `AES-GCM` by ~10x while it is at it. ACCP achieves this by … linking in fast native implementations of crypto.
Compression
I like my data transfers fast and don’t like paying for lots of disk or network I don’t need. I heard data compression is a thing.
Most data compresses, especially text (e.g. JSON). The three cases where data probably does not compress are if your data is random, the data was already compressed, or the data was encrypted. Often in databases the metadata around the data (e.g. write times, schemas, etc.) probably compresses even if the data doesn’t. There are three primary measures of a compression algorithm:
- Compression ratio: How much smaller is the result than the input?
- Compression speed: How quickly can I compress data?
- Decompression speed: How quickly can I decompress data?
Depending on the use case, developers usually make some tradeoff between these three metrics. For example, databases doing page compression care most about decompression speed, file transfers care most about compression speed, archival storage cares most about ratio, etc.
Fast compression that gives great ratio can significantly improve most workloads, but slow compression with bad ratio is painful and makes me sad.
Common Poor Choices
- `gzip` (`zlib`): A trusty workhorse but also a very slow algorithm. In many cases where your network is fast (1 Gbps+), compressing with gzip can actually hurt your system’s performance (I have seen this numerous times).
- `xz` (`lzma`): An algorithm that has pretty good ratios, but is so slow to compress that in practice the only potential use cases are single-write many-read use cases.
- `snappy`: One of the first “I’ll trade off ratio for speed” algorithms, snappy was good for its time. These days it is almost always outperformed by other choices.
Better Choice - I care about ratio
Try zstd. To spend more compression CPU time for better compression ratio, increase the compression level or increase the block size. I find that in most database workloads the default level (`3`) or even level `1` is a good choice for write-heavy datasets (getting closer to `lz4`), and level `10` is good for read-heavy datasets (surpassing `gzip` in every dimension). Note that `zstd` strictly dominates `gzip` as it is faster and gets better ratio.
Even better: `zstd` supports training dictionaries, which can really come in handy if you have lots of individually small but collectively large JSON data (looking at you, tracing systems).
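A minimal sketch of dictionary training with the `zstd` CLI (paths are illustrative):

```
# Train a shared dictionary from a corpus of small JSON samples
$ zstd --train traces/*.json -o traces.dict

# Compress and decompress individual small files with that dictionary
$ zstd -D traces.dict single_trace.json -o single_trace.json.zst
$ zstd -D traces.dict -d single_trace.json.zst -o single_trace.json
```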
Better Choice - I only care about speed
Try `lz4`. With near-memory speeds and decent ratio, this algorithm is almost always a safe choice over not compressing at all. It has excellent language support and is exceptionally good for real-time compression/decompression as it is so cheap.
Better Choice - I want to shovel data from here to there
Try `zstd --adapt`. This feature automatically adapts the compression level as the IO conditions change to make the current optimal tradeoff between CPU and “keeping the pipe fed”.
For example, if you have very little free CPU on your system but a fast network (looking at you, i3en instances), `zstd --adapt` will automatically compress at a lower level to minimize total transfer time. If you have a slow network and extra CPU, it will automatically compress at a higher level.
Show me the Data
Compression is a bit trickier to measure because the read-to-write ratio matters a lot, and if you can get a better ratio it might be worth paying for a more expensive compression step in exchange for cheaper decompression.
Historically we had to make tradeoffs between ratio, compression speed and decompression speed, but as we see with this quick benchmark, we no longer need to make tradeoffs. These days (2021), I just reach for `zstd` with an appropriate level, or `lz4` if I really need to minimize CPU cost.
First let’s look at the results for the 6.5 GiB review dataset.
Algo | Ratio | Compression Speed (MiBps) | Decompression Speed (MiBps) | Transfer Time |
---|---|---|---|---|
gzip | 0.41 | 21 | 168 | 295s |
lz4 | 0.65 | 361 | 1760 | 36s |
zstd | 0.38 | 134 | 730 | 47s |
xz | ?? | ?? | ??? | ?? |
As `xz` was estimating 1.5 hours to compress and I didn’t have that kind of time, I ran it on a smaller 380 MiB dataset:
Algo | Ratio | Compression Speed (MiBps) | Decompression Speed (MiBps) | Transfer Time |
---|---|---|---|---|
xz | 0.18 | 1.34 | 92 | 299s |
zstd -10 | 0.26 | 15.4 | 713 | 24s |
zstd -19 | 0.21 | 1.18 | 404 | 24s |
As expected, `lz4` is the fastest choice by a lot while still cutting the dataset in half, followed by `zstd`. One of the really useful things about `zstd` is that I am no longer reaching for specialty compressors depending on the job; I just change the level/block sizes and I can get the trade-off I want.
Pipeline
Now that we have fast algorithms, it matters how we wire them together. One of the most common performance mistakes I see is doing a single step of a data movement at a time, for example decrypting a file to disk, then decompressing it, and then checksumming it. As the intermediate products must hit disk and are produced sequentially, this necessarily slows down your data transfer.
When data transfer is slow, there are usually either slow disks or slow Java heap allocations in the way. I’ve found that if I can structure my pipelines as unix pipelines of fast `C` programs (or Python with native extensions), with the output from one stage always flowing to the input of the next, I can much more efficiently upload and download data by doing everything (IO, decrypt, decompress, checksum) in parallel.
For example, something like the following will outperform almost any Java S3 upload at putting 100 files in S3 along with whole-object checksums:
```
# Assumption: You have xargs, xxh64sum, zstd and awscli installed

# Simple version
$ DIR=.
$ find $DIR -maxdepth 1 -type f | xargs -IX -P 8 xxh64sum X | sort -k 2 | aws s3 cp - s3://bucket/prefix/checksums
$ find $DIR -maxdepth 1 -type f | xargs -IX -P 8 bash -c 'zstd --adapt X --stdout | aws s3 cp - s3://bucket/prefix/data/X.zst'

# Or the full pipeline option (full object hash and compress at the same time)
$ find $DIR -maxdepth 1 -type f \
  | xargs -IX -P 8 \
    bash -c 'export URL=s3://bucket/prefix; cat X | tee >(xxh64sum - | aws s3 cp - ${URL}/X.zst.xxh) | zstd --adapt - --stdout | aws s3 cp - ${URL}/X.zst'
```
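The download direction composes the same way; a minimal sketch (bucket, prefix, and file names are illustrative, and comparing the printed digest against the stored one is left as a manual step):

```
# Download, decompress, write, and hash in one streaming pipeline; compare
# the printed digest against the contents of X.zst.xxh afterwards
$ aws s3 cp s3://bucket/prefix/data/X.zst - | zstd -d --stdout | tee X | xxh64sum -
```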
Why do fast data algorithms matter?
It matters because most systems choose the slow option and make routine development activities take longer in modern cloud infrastructure (fast networks). For example:
- Backing up data. Backing up 1 TiB of data to a fast blob store (can sink multiple GiBps) using `gzip` and `sha256` would take around 15 hours. Doing it with `zstd` and `xxhash` takes 2.2 hours. Timely backups are one of the most important properties of a data system.
- Software packages. By default, Debian packages are either not compressed or compressed with `xz`. Installing 500 MiB of uncompressed Debian packages over a one gig network takes ~5s; with `xz` compression it actually slows down and takes ~1s + ~6s = ~7s; and with `zstd -19` compression it takes ~1s + ~1s = ~2s. If the `sha256` is checked, that would add another ~20s for a total of ~30s, versus if we check with `blake3`, which adds ~0.1s for a total of ~3s. I’d take the 3s over 30s any day. This matters every time you build container images, bake cloud VM images or bootstrap servers with configuration management (puppet/chef).
- Container images. By default `docker` uses `gzip` compression with `sha256` checksums, which means to decompress and checksum 4 GiB of containers I’d need about 60s of CPU time instead of ~6s with `zstd` and `xxhash` (or 6.5s with `blake3`). This matters when your docker pull is in the startup path of your application.
A note about implementation availability
A major disadvantage of using good algorithms is that they may not always show up in your language or OS by default. I’ve had good experiences with the following implementations: