As an engineer who primarily works with data and databases I spend a lot of time moving data around, hashing it, compressing it, decompressing it and generally trying to shovel it between VMs and blob stores over TLS. I am constantly surprised by how many systems only support slow, inefficient, and expensive ways of doing these operations.
In my experience, these poor algorithm choices are orders of magnitude slower than modern alternatives. Indeed, using a fast algorithm can often be the difference between actually doing compression/hashing/encryption and “Eh, I’ll skip it”. In the interest of using more of our CPU time to do useful work instead of melting ice-caps and giving programmers excessively long coffee breaks, we will explore some faster alternatives in this post.
TLDR
Application | Common Bad Performance Choices | Better Performance Choices | Expected Performance Gain |
---|---|---|---|
Trusted data hashing | `md5`, `sha2`, `crc32` | `xxhash` | ~10x |
Untrusted data hashing | `md5`, `sha2`, `sha1` | `blake3` | ~10x |
Fast compression | `snappy`, `gzip` (zlib) | `lz4` | 10x over `gzip`, ~2x over `snappy` |
Good compression | `gzip` (zlib) | `zstd` | ~2-10x |
Best compression | `xz` (lzma) | `zstd -10+` | ~2-10x |
Java crypto (`md5`, `aes-gcm`, etc.) | Built-in JVM crypto | Amazon Corretto Crypto Provider (ACCP) | ~4-10x |
Disclaimer: There are lies, damn lies, and benchmarks from some random person on the internet.
If you are considering taking some of the advice in this post, please remember to test your specific workloads, which might have different bottlenecks. Also, the implementation quality in your particular software stack for your particular hardware matters a lot.
For this post I’ll be playing with a ~6.5 GiB real-world JSON dataset on my laptop’s Intel Core i7-8565U pinned to 4 GHz. Since I want to benchmark the algorithms instead of my disks, I’ll be pre-touching the file into RAM with vmtouch. Remember that on most modern cloud servers with fast NVMe storage (multiple GiBps) and good page-caching algorithms, your disks are likely not your bottleneck.
```
$ du -shc yelp_academic_dataset_review.json
6.5G    yelp_academic_dataset_review.json

$ vmtouch -t yelp_academic_dataset_review.json
# ...

$ vmtouch yelp_academic_dataset_review.json
           Files: 1
     Directories: 0
  Resident Pages: 1693525/1693525  6G/6G  100%
         Elapsed: 0.1256 seconds
```
Hashing
I would like to check that this blob of data over here is the same as that data over there.
Trusted Data Hashes
These hash or checksum functions are used to ensure data integrity and usually are defending against bugs/bitrot/cosmic rays instead of malicious attackers. I typically see the following poor choices:
- `md5`: This hash is both weak and slow. It does have the advantage of being one of the fastest slow choices in standard libraries, and therefore it is somewhat common. Two of my favorite examples are slow Cassandra Quorum Reads and slow S3 upload/download (seriously, just try disabling `md5` on parts and see how much faster S3 is).
- `crc32` or `adler32`: I often hear “well crc32 is fast”, but to be honest I haven’t in practice come across particularly fast implementations in real-world deployments. Sure, theoretically there are hardware implementations, but at least in most Java or Python programs I profile running on most servers I encounter, these checksums are not particularly fast, and they only generate 32 bits of output, where I absolutely have to care about collisions.
- `sha2`: An oddly common choice for trusted data hashes. Odd because it is slow and you don’t need a cryptographic hash for syncing a file between two hosts within an internal network with an authorized transfer protocol (e.g. `rsync` or via backups to S3 with proper AWS IAM policies). If attackers have access to your backups or artifact store, you have a bigger problem than them forging a hash.
Faster Choice
Try xxHash. It is blazing fast, high quality, and the `XXH64` variant is usually sufficient for most data integrity checks. It even performs well on small data inputs, where `XXH3` is particularly impressive. If you find yourself needing 128 bits of entropy (e.g. if you’re hashing data for a DHT) you can either use the 128 bit variant or `XXH3`. If you only have an old version available that doesn’t support the new 128 bit variant, two `XXH64` runs with different seeds is usually still faster than any other choice.
Untrusted Data Hashes
While most of the time the threat model for data transfer is “bugs / cosmic rays”, some of the time people want to defend against bad actors. That’s where cryptographic hashes come in, most commonly:
- `md5`: It’s slow and not resistant to collisions … everything I love in a hash function.
- `sha1`: This hash is at least somewhat performant and isn’t as completely broken as `md5`. We should probably still consider it on shaky ground though.
- `sha2` (specifically `sha256`): A ubiquitous hash, and unlike `md5` and `sha1` it is still considered secure, so that is nice. Unfortunately this hash is noticeably slow when used in performance critical applications.
- `sha3`: The latest hash that NIST standardized (because they had concerns about `SHA-2` that appear at this time somewhat unfounded). I was not able to find it packaged outside of `openssl` and it doesn’t appear to have great `cli` or language support at this time. It’s also still pretty slow.
Faster Choice
Try BLAKE3. Yes, it is a new (2020) algorithm and there are concerns about security margin, but I’m just so tired of waiting for `sha256`. In practice, this is probably a much better choice than the known-to-be-completely-broken `md5`, so if you’re reaching for `md5` over `xxHash` because you need a cryptographic alternative, consider `blake3` instead. Also, `blake3` uses hash trees (merkle trees), which are wonderful when implemented correctly and I wish more systems used them.
The one major downside of `blake3` in my opinion is that at this time (2021) I only know of really good `cli`, `Rust`, `Go`, `C`, and `Python` implementations, and I can’t personally vouch for the `Java` JNI bindings. I’ve only needed to use it so far from streaming pipeline verifications or artifact verification, so the `cli` and `python` implementations are good enough for me.
A quick reminder that there are lots of security-sensitive hashing situations where you don’t want a fast hash. For example, one situation where you want an intentionally slow algorithm is when dealing with passwords. In such cases you want a very slow hash like `argon2`, `bcrypt`, `PBKDF2`, or even just a high number of rounds of `SHA-512`.
Show me the Data
Expect `xxHash` to net about a ~10x improvement on `MD5` and a ~5-10x improvement on `CRC32`, depending on your `CRC32` implementation (e.g. Java’s is truly terrible). Expect `BLAKE3` to be slightly slower than `xxHash` with a single thread, so only use it if you actually care about cryptographic hashes. A simple performance test on hashing a 6616 MiB file confirms that indeed we have 10x performance on the table (note I’m reporting user CPU time since the system time isn’t attributable to the hash):
Algo | Hash Time (seconds) | Hash Speed (MiBps) | Cryptographic? | Bits of Entropy |
---|---|---|---|---|
MD5 | 9.1s | 727 | Not really | 128 |
CRC32 | 4.8s | 1378 | No | 32 |
XXH64 | 0.5s | 13232 | No | 64 |
SHA256 | 27.5s | 240 | Yes | 256 |
SHA3-256 | 16.8s | 393 | Yes | 256 |
SHA1 | 10.7s | 618 | Yes* | 160 |
BLAKE3 | 1.8s | 3675 | Yes | 256 |
Yes, that’s right: ~0.5 seconds of user CPU time for `xxh64` versus ~27s for `sha256` and ~9s for `md5`. If all you need to do is verify a file transfer, you could be doing that 10x faster with xxHash.
The implementation language often makes a big difference, e.g. JNI versions that link to fast native code in Java will often significantly outperform pure Java versions. “But Joey”, you say, “I have to use XYZ algorithm from the early ’00s because of the specification!”. That is unfortunate, but at least make sure you’re using fast implementations; for example, ACCP will speed up `MD5` on most Java VMs by a factor of ~4, as well as `AES-GCM` by ~10x while it is at it. ACCP achieves this by … linking in fast native implementations of crypto.
Compression
I like my data transfers fast and don’t like paying for lots of disk or network I don’t need. I heard data compression is a thing.
Most data compresses, especially text (e.g. JSON). The three cases where data probably does not compress are if your data is random, the data was already compressed, or the data was encrypted. Often in databases the metadata around the data (e.g. write times, schemas, etc.) probably compresses even if the data doesn’t. There are three primary measures of a compression algorithm:
- Compression ratio: How much smaller is the result than the input?
- Compression speed: How quickly can I compress data?
- Decompression speed: How quickly can I decompress data?
Depending on the use case, developers usually make some tradeoff between these three metrics. For example, databases doing page compression care most about decompression speed, file transfers care most about compression speed, archival storage cares most about ratio, etc.
Fast compression that gives great ratio can significantly improve most workloads, but slow compression with bad ratio is painful and makes me sad.
Common Poor Choices
- `gzip` (`zlib`): A trusty workhorse but also a very slow algorithm. In many cases where your network is fast (1 Gbps+), compressing with gzip can actually hurt your system’s performance (I have seen this numerous times).
- `xz` (`lzma`): An algorithm that has pretty good ratios, but is so slow to compress that in practice the only potential use cases are single-write many-read use cases.
- `snappy`: One of the first “I’ll trade off ratio for speed” algorithms, snappy was good for its time. These days it is almost always outperformed by other choices.
Better Choice - I care about ratio
Try zstd. To spend more compression CPU time for better compression ratio, increase the compression level or increase the block size. I find that in most database workloads the default level (`3`) or even level `1` is a good choice for write-heavy datasets (getting closer to `lz4`), and level `10` is good for read-heavy datasets (surpassing `gzip` in every dimension). Note that `zstd` strictly dominates `gzip` as it is faster and gets better ratio.
Even better: `zstd` supports training dictionaries, which can really come in handy if you have lots of individually small but collectively large JSON data (looking at you, tracing systems).
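A minimal sketch of dictionary training with the `zstd` CLI (paths are illustrative):

```
# Train a shared dictionary from a corpus of small JSON samples
$ zstd --train traces/*.json -o traces.dict

# Compress and decompress individual small files with that dictionary
$ zstd -D traces.dict single_trace.json -o single_trace.json.zst
$ zstd -D traces.dict -d single_trace.json.zst -o single_trace.json
```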
Better Choice - I only care about speed
Try `lz4`. With near-memory speeds and decent ratio, this algorithm is almost always a safe choice over not compressing at all. It has excellent language support and is exceptionally good for real-time compression/decompression as it is so cheap.
Better Choice - I want to shovel data from here to there
Try `zstd --adapt`. This feature automatically adapts the compression level as the IO conditions change to make the current optimal tradeoff between CPU and “keeping the pipe fed”.
For example, if you have very little free CPU on your system but a fast network (looking at you, i3en instances), `zstd --adapt` will automatically compress at a lower level to minimize total transfer time. If you have a slow network and extra CPU, it will automatically compress at a higher level.
Show me the Data
Compression is a bit trickier to measure because the read-to-write ratio matters a lot, and if you can get a better ratio it might be worth paying for a more expensive compression step in exchange for cheaper decompression.
Historically we had to make tradeoffs between ratio, compression speed and decompression speed, but as we see with this quick benchmark, we no longer need to make tradeoffs. These days (2021), I just reach for `zstd` with an appropriate level, or `lz4` if I really need to minimize CPU cost.
First let’s look at the results for the 6.5 GiB review dataset.
Algo | Ratio | Compression Speed (MiBps) | Decompression Speed (MiBps) | Transfer Time |
---|---|---|---|---|
gzip | 0.41 | 21 | 168 | 295s |
lz4 | 0.65 | 361 | 1760 | 36s |
zstd | 0.38 | 134 | 730 | 47s |
xz | ?? | ?? | ??? | ?? |
As `xz` was estimating 1.5 hours to compress and I didn’t have that kind of time, I ran it on a smaller 380 MiB dataset:
Algo | Ratio | Compression Speed (MiBps) | Decompression Speed (MiBps) | Transfer Time |
---|---|---|---|---|
xz | 0.18 | 1.34 | 92 | 299s |
zstd -10 | 0.26 | 15.4 | 713 | 24s |
zstd -19 | 0.21 | 1.18 | 404 | 24s |
As expected, `lz4` is the fastest choice by a lot while still cutting the dataset in half, followed by `zstd`. One of the really useful things about `zstd` is that I am no longer reaching for specialty compressors depending on the job; I just change the level/block sizes and I can get the trade-off I want.
Pipeline
Now that we have fast algorithms, it matters how we wire them together. One of the most common performance mistakes I see is doing a single step of a data movement at a time, for example decrypting a file to disk, then decompressing it, and then checksumming it. As the intermediate products must hit disk and are produced sequentially, this necessarily slows down your data transfer.
When data transfer is slow, there are usually either slow disks or slow Java heap allocations in the way. I’ve found that if I can structure my pipelines as unix pipelines of fast `C` programs (or Python with native extensions), with the output from one stage always flowing to the input of the next, I can much more efficiently upload and download data by doing everything (IO, decrypt, decompress, checksum) in parallel.
For example, something like the following will outperform almost any Java S3 upload at putting 100 files in S3 along with whole-object checksums:
```
# Assumption: You have xargs, xxh64sum, zstd and awscli installed

# Simple version
$ DIR=.
$ find $DIR -maxdepth 1 -type f | xargs -IX -P 8 xxh64sum X | sort -k 2 | aws s3 cp - s3://bucket/prefix/checksums
$ find $DIR -maxdepth 1 -type f | xargs -IX -P 8 bash -c 'zstd --adapt X --stdout | aws s3 cp - s3://bucket/prefix/data/X.zst'

# Or the full pipeline option (full object hash and compress at the same time)
$ find $DIR -maxdepth 1 -type f \
  | xargs -IX -P 8 \
    bash -c 'export URL=s3://bucket/prefix; cat X | tee >(xxh64sum - | aws s3 cp - ${URL}/X.zst.xxh) | zstd --adapt - --stdout | aws s3 cp - ${URL}/X.zst'
```
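The download direction composes the same way; a minimal sketch (bucket, prefix, and file names are illustrative, and comparing the printed digest against the stored one is left as a manual step):

```
# Download, decompress, write, and hash in one streaming pipeline; compare
# the printed digest against the contents of X.zst.xxh afterwards
$ aws s3 cp s3://bucket/prefix/data/X.zst - | zstd -d --stdout | tee X | xxh64sum -
```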
Why do fast data algorithms matter?
It matters because most systems choose the slow option and make routine development activities take longer in modern cloud infrastructure (fast networks). For example:
- Backing up data. Backing up 1 TiB of data to a fast blob store (can sink multiple GiBps) using `gzip` and `sha256` would take around 15 hours. Doing it with `zstd` and `xxhash` takes 2.2 hours. Timely backups are one of the most important properties of a data system.
- Software packages. By default, Debian packages are either not compressed or compressed with `xz`. Installing 500 MiB of uncompressed Debian packages over a one gig network takes ~5s; with `xz` compression it actually slows down and takes ~1s + ~6s = ~7s; and with `zstd -19` compression it takes ~1s + ~1s = ~2s. If the `sha256` is checked, that would add another ~20s for a total of ~30s, versus if we check with `blake3`, which adds ~0.1s for a total of ~3s. I’d take the 3s over 30s any day. This matters every time you build container images, bake cloud VM images or bootstrap servers with configuration management (puppet/chef).
- Container images. By default `docker` uses `gzip` compression with `sha256` checksums, which means to decompress and checksum 4 GiB of containers I’d need about 60s of CPU time instead of ~6s with `zstd` and `xxhash` (or 6.5s with `blake3`). This matters when your docker pull is in the startup path of your application.
A note about implementation availability
A major disadvantage of using good algorithms is that they may not always show up in your language or OS by default. I’ve had good experiences with the following implementations: