
I'm making backups of compressed data (using ZPAQ); some of the compressed archives are large (50 GiB).

I was just going to use sha3sum -a 512, for example:

sha3sum -a 512 "Total_DOS_Collection-RELEASE#17-[1981-1995].zpaq"

I would then store the value and later periodically check whether the hash of the backup is still the same.
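
In other words, something like this minimal Python sketch using hashlib (the ".sha3-512" sidecar file name and the 1 MiB chunk size are just placeholders I picked, not anything sha3sum itself mandates):

    import hashlib

    def sha3_512_of_file(path, chunk_size=1 << 20):
        """SHA3-512 of a (possibly very large) file, streamed in 1 MiB chunks."""
        h = hashlib.sha3_512()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    archive = "Total_DOS_Collection-RELEASE#17-[1981-1995].zpaq"

    # First run: record the reference digest in a sidecar file.
    # with open(archive + ".sha3-512", "w") as f:
    #     f.write(sha3_512_of_file(archive) + "\n")

    # Later runs: recompute and compare against the stored value.
    with open(archive + ".sha3-512") as f:
        stored = f.read().strip()
    print("OK" if sha3_512_of_file(archive) == stored else "CORRUPTED")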

Then I started wondering if there would be a better/safer/more appropriate/effective method...

  • Could you define what a "better, safer, more appropriate, effective" method means for you? Where do you store these files? SHA3 does not recover from errors. You may need RAID or multiple backups, but that is outside the scope of this site. – kelalaka Oct 24 '19 at 07:35
  • SHA-3 will be slow as hell; use BLAKE2 in a tree hash mode. – Richie Frame Oct 24 '19 at 08:37
  • I know SHA-3 does not recover from errors; the idea is for me to be aware when the file has somehow become corrupted, by comparing the file's actual SHA3-512 to the SHA3-512 of the original file. – Charles D. Ward Oct 24 '19 at 09:00
  • The best method, in my opinion, would be one that allows me to know whether the file has become corrupt or not, just that. I think knowing the original SHA3-512 sum of a file would be enough for this; I'm looking for confirmation here. If there were any kind of modification of the compressed archive (partial corruption, for example), the SHA3-512 sum would be different, right? And if no modification had been made to the compressed archive, the SHA3-512 would always be the same, right? – Charles D. Ward Oct 24 '19 at 09:01
  • Tree hashing. Either functions with native support like BLAKE2, or a generalized method like Merkle tree hashes. The point of tree hashing is to enable parallelism (faster hashing) plus the ability to detect exactly which sections have been modified; see the sketch after these comments. – Natanael Oct 24 '19 at 11:21
  • Just use ZFS. Fun fact, it uses SHA-256 and tree hashing. It stores the hashes apart from the data, which makes it less prone to random changes within sectors of data. – Maarten Bodewes Oct 24 '19 at 16:34
  • What is your threat model? Why a collision-resistant hash instead of a CRC, which has better error-detection capacity against independent random bit flips? What will you do if you detect an error? – Squeamish Ossifrage Oct 24 '19 at 16:45
  • I have no threat model; I just want to store backups of important data and be able to notice if there is eventually corruption of any kind.

    I thought checking whether the SHA3-512 sum of a given file is still the same would be a good solution to my problem. Was I wrong, and should I use crc32 instead?

    If I detect corruption on, for example, a backup on a Blu-ray Disc, I would grab another copy of that backup which is still intact and burn a new Blu-ray Disc. My backups consist of ZPAQ compressed archives in the cloud, on local HDDs (ext4), and on Blu-ray Discs.

    – Charles D. Ward Oct 24 '19 at 19:57
  • @Natanael What tool would you recommend in Linux for tree hashing? – Charles D. Ward Oct 24 '19 at 21:54
  • It would appear that bt2sum is a ready-made tool that computes the aforementioned BLAKE2 tree hash using good defaults (namely, using the number of CPU cores for parallel processing). If you feel standard b2sum is too slow, give this a try (and only turn to b2sum if the sha3 utility is too slow for you). – SEJPM Oct 25 '19 at 11:32
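
To illustrate the tree-hashing idea from the comments above, here is a minimal Python sketch of a generic Merkle construction over fixed-size chunks, using BLAKE2b from hashlib as the node hash. The 1 MiB chunk size and the file name are arbitrary choices for this sketch, and this is neither BLAKE2's native tree mode nor what b2sum/bt2sum compute; it only shows the principle: each chunk gets a leaf hash, leaves are combined pairwise up to a single root, and comparing stored leaf hashes later pinpoints which region of the file changed.

    import hashlib

    CHUNK = 1 << 20  # 1 MiB leaves; an arbitrary choice for this sketch

    def leaf_hashes(path):
        """BLAKE2b hash of every fixed-size chunk of the file (parallelizable)."""
        leaves = []
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(CHUNK), b""):
                leaves.append(hashlib.blake2b(chunk).digest())
        return leaves

    def merkle_root(leaves):
        """Combine hashes pairwise, level by level, until one root remains."""
        if not leaves:
            return hashlib.blake2b(b"").digest()
        level = leaves
        while len(level) > 1:
            nxt = []
            for i in range(0, len(level), 2):
                pair = level[i] + (level[i + 1] if i + 1 < len(level) else b"")
                nxt.append(hashlib.blake2b(pair).digest())
            level = nxt
        return level[0]

    leaves = leaf_hashes("archive.zpaq")  # store these alongside the root
    root = merkle_root(leaves).hex()      # compare only the root on routine checks
    # If the root differs later, comparing the stored leaf hashes against freshly
    # computed ones tells you which 1 MiB region(s) of the file were modified.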

1 Answer


Tree hashing would be a natural way to do it.

However, the data still needs to be read. And in practice the hash computation time is negligible compared to I/O, even with a fast SSD.

So, just use a standard hash function. SHA-3 is slow in software, so you'd better opt for BLAKE2.
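
As a rough sketch of that swap (Python's hashlib here, with a placeholder file name; the b2sum tool mentioned in the comments does the same job on the command line), BLAKE2b with a 512-bit digest is a near drop-in replacement for the SHA3-512 computation above, with the store-and-compare workflow unchanged:

    import hashlib

    def blake2b_of_file(path, chunk_size=1 << 20):
        """BLAKE2b-512 of a file, streamed in 1 MiB chunks."""
        h = hashlib.blake2b(digest_size=64)  # 512-bit digest, same length as SHA3-512
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    print(blake2b_of_file("archive.zpaq"))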

Frank Denis
  • Dunno, SSDs have become pretty fast of late. I've got an Intel 660p which runs at 1.8 GB/s at full speed. Sure, hashes can be fast, but at those speeds you'll have to make sure that you've got the right implementation, at the very least. – Maarten Bodewes Oct 25 '19 at 10:58