73

I'm getting a strange result: SHA-512 is around 50% faster than SHA-256. I'm using .NET's SHA512Managed and SHA256Managed classes. The code is similar to the one posted here, but I'm referring to tests that take caching into account (from the second read of the file onwards it seems to be cached completely). I've tested it several times with the same results.

My question is: is this logical or must there be something wrong with my test?

Maarten Bodewes
ispiro

6 Answers

76

This isn't necessarily unexpected. Whether you run on a 32-bit or a 64-bit platform can make a significant difference, as can the amount of data you're hashing.

$ uname -m
x86_64

$ openssl speed sha256 sha512
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
sha256           29685.74k    79537.52k   148376.58k   186700.77k   196588.36k
sha512           23606.96k    96415.90k   173050.74k   253669.59k   291315.50k

As you can see, on my 64-bit machine, SHA-512 beats SHA-256 for hashing anything more than 16 bytes of data at a time. And generally, the more data being hashed at once, the bigger the performance improvement.
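To reproduce this kind of comparison without OpenSSL's command-line tool, here is a minimal sketch using Python's `hashlib`. The helper name `throughput` and the `min_seconds` parameter are made up for illustration. Absolute numbers will differ from the `openssl speed` output above, since interpreter overhead dominates for small buffers.

```python
import hashlib
import time


def throughput(name, size, min_seconds=0.05):
    """Return approximate bytes/second when hashing `size`-byte buffers.

    `name` is a hashlib constructor name such as "sha256" or "sha512".
    """
    data = bytes(size)               # buffer of zero bytes
    hasher = getattr(hashlib, name)  # e.g. hashlib.sha256
    count = 0
    start = time.perf_counter()
    while True:
        hasher(data).digest()
        count += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            break
    return count * size / elapsed


if __name__ == "__main__":
    # Same buffer sizes as the openssl speed table.
    for size in (16, 64, 256, 1024, 8192):
        r256 = throughput("sha256", size)
        r512 = throughput("sha512", size)
        print(f"{size:5d} bytes  sha256 {r256 / 1e6:8.1f} MB/s"
              f"  sha512 {r512 / 1e6:8.1f} MB/s")
```

Which hash wins depends on the CPU and on how the underlying library was built, so treat the output as a measurement of your own machine only.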

Edit: As @MaartenBodewes points out in the comments, there's also SHA-512/256, which does the same computation as normal SHA-512 (with a different initial value) but truncates the output to 256 bits. Because of the different IV, this is a better option than truncating the output of SHA-512 to 256 bits yourself when you need the higher throughput but are limited to 256-bit outputs. Alternatively, if you really need higher throughput, BLAKE2b is an excellent cryptographic hash that is extremely fast and natively supports arbitrarily sized outputs (between 1 and 64 bytes).
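As a rough illustration of these options, the sketch below uses Python's `hashlib`. It assumes that SHA-512/256 is exposed under the name `sha512_256`, which depends on the OpenSSL build behind `hashlib`, so that part is guarded. The variable names are only illustrative.

```python
import hashlib

data = b"example message"

# Manually truncated SHA-512: same IV as full SHA-512, first 32 bytes kept.
truncated_sha512 = hashlib.sha512(data).digest()[:32]

# BLAKE2b with a native 32-byte output (digest_size may be 1..64).
blake2b_256 = hashlib.blake2b(data, digest_size=32).digest()

# The output length is one of BLAKE2b's parameters, so the native 32-byte
# digest is not simply a prefix of the 64-byte digest.
blake2b_512_prefix = hashlib.blake2b(data, digest_size=64).digest()[:32]

# SHA-512/256 is only present if the underlying OpenSSL build offers it.
sha512_256 = None
if "sha512_256" in hashlib.algorithms_available:
    try:
        sha512_256 = hashlib.new("sha512_256", data).digest()
    except ValueError:
        sha512_256 = None  # listed but not usable in this build

print(truncated_sha512.hex())
print(blake2b_256.hex())
print(sha512_256.hex() if sha512_256 else "sha512_256 not available")
```

When SHA-512/256 is available, its output differs from the manually truncated SHA-512 digest, which is the domain separation the different IV provides.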

Stephen Touset
  • Thanks. But isn't the amount of calculations larger for SHA512? (Which is why I was surprised.) – ispiro Jun 16 '15 at 20:28
  • 15
    SHA-256 performs 64 rounds of its compression function over 512 bits (its block size) at a time. SHA-512 on the other hand performs 80 rounds of the compression function, but over 1024 bits at a time. So yes, SHA-512 performs more calculations in a single invocation, but it does so over a larger quantity of data at a time. – Stephen Touset Jun 16 '15 at 20:38
  • 4
    @StephenTouset A hint about the existence of SHA-512/256 would make a nice addition. – Maarten Bodewes Jun 16 '15 at 20:44
  • 3
    The crossover point should be at 56 bytes, since SHA256 jumps to 2 blocks of 64 bytes each, at that point. SHA256 might get ahead again at 120 bytes, since then it's 3 blocks for SHA256 and 2 blocks (128 bytes each) for SHA512. – CodesInChaos Jun 16 '15 at 20:49
  • @CodesInChaos You're probably right, although the fact that SHA-512 uses 128 bit length encoding could make a slight difference, e.g. for 113 bytes SHA-256 can do with 64 + 56 - 1= 119 bytes input for two blocks while SHA-512 now also needs two blocks. – Maarten Bodewes Jun 16 '15 at 20:59
  • @MaartenBodewes By the time I remembered that, the 5 min edit limit had expired. – CodesInChaos Jun 16 '15 at 21:03
  • 2
    Well, there are three blocks of SHA-256 against two blocks of SHA-512 as well, which I forgot about. Let's just conclude that for small inputs you may want to test which one is faster if you want to go purely for speed. – Maarten Bodewes Jun 16 '15 at 21:07
  • It may also be useful to note that SHA-384 uses the same idea as SHA-512/256 (but with truncation to 384 bits instead of 256), and is a lot more widely supported. – lily wilson Jun 17 '15 at 17:26
  • 2
    This doesn't actually answer the question. The speed difference has more to do with the algorithm specification than 32/64-bit execution. Richie Frame's answer actually explains what's going on internally, and concurs with my experience in implementing both hash functions. – Nayuki Feb 02 '17 at 05:36
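The crossover points discussed in the comments above follow from the padding rules: both hashes append one 0x80 byte, zero padding, and a length field (8 bytes for SHA-256, 16 bytes for SHA-512). A small sketch of the block count, with illustrative function names:

```python
def sha2_blocks(message_len, block_size, length_field):
    """Number of compression-function calls for a message of message_len bytes.

    Padding adds one 0x80 byte, zero bytes, and a length field, and the
    total is rounded up to a whole number of blocks.
    """
    padded_minimum = message_len + 1 + length_field
    return (padded_minimum + block_size - 1) // block_size


def sha256_blocks(message_len):
    # 64-byte blocks, 8-byte (64-bit) length field.
    return sha2_blocks(message_len, 64, 8)


def sha512_blocks(message_len):
    # 128-byte blocks, 16-byte (128-bit) length field.
    return sha2_blocks(message_len, 128, 16)


for n in (55, 56, 111, 112, 119, 120):
    print(n, sha256_blocks(n), sha512_blocks(n))
```

SHA-256 moves to two blocks at 56 bytes and to three at 120 bytes, while SHA-512 moves to two blocks at 112 bytes because of its longer length field.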
50

SHA-512 has 25% more rounds than SHA-256 (80 versus 64). On a 64-bit processor each round takes the same number of operations, yet processes double the data per round, because the instructions operate on 64-bit words instead of 32-bit words. Therefore, 2 / 1.25 = 1.6, which is how much faster SHA-512 can be under optimal conditions.

Of course there is memory overhead, instruction latency, and other factors involved; on an Intel Ivy Bridge processor long message SHA-512 is 1.54 x faster, and on an AMD Piledriver it is 1.48 x faster.

For small messages (less than 448 bits) SHA-512 will be approximately 1.25x slower, because only a single hash iteration is performed. There are also various crossover points where one hash needs to process an extra iteration and the other does not. These numbers are averages; the actual performance graph is stepped at each iteration increment.
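The arithmetic above can be written out as rounds per byte. This is only a sketch of the idealized model (equal cost per round on a 64-bit CPU); the constant names are illustrative.

```python
# Rounds and block sizes (in bytes) of the two compression functions.
SHA256_ROUNDS, SHA256_BLOCK = 64, 64
SHA512_ROUNDS, SHA512_BLOCK = 80, 128

# Work per input byte for long messages, assuming every round costs the same.
rounds_per_byte_256 = SHA256_ROUNDS / SHA256_BLOCK  # 1.0
rounds_per_byte_512 = SHA512_ROUNDS / SHA512_BLOCK  # 0.625

# Ideal long-message speedup of SHA-512 over SHA-256.
long_message_speedup = rounds_per_byte_256 / rounds_per_byte_512

# For a message that fits in a single block of either hash, only the
# round count matters, so SHA-512 does 80/64 of the work.
short_message_slowdown = SHA512_ROUNDS / SHA256_ROUNDS

print(f"long messages:  SHA-512 up to {long_message_speedup:.2f}x faster")
print(f"short messages: SHA-512 about {short_message_slowdown:.2f}x slower")
```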

Richie Frame
13

Benchmarks

I'd also like to see some real-life measurements here, so I hope you'll like these ;)


Intel Core i7-7700HQ (7th gen = Kaby Lake); RAM (DDR4)

HW / OS configuration:

  • System: Linux Mint 20.2 "Uma" Cinnamon (64-bit); intel-microcode package, as well as the latest UEFI/BIOS patch 1.13.0, were installed.

  • Processor: Intel® Core™ i7-7700HQ (Ark Intel), PassMark, 2.80GHz - 3.80GHz, 4 cores, 8 threads, laptop

  • Memory: 32GiB DDR4 2400MHz (dual-channel)

  • CPU flags (grep flags /proc/cpuinfo):

    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
    

Methodology:

  1. The laptop was rebooted before testing.

  2. Throughput was measured with the pv utility (man page):

    pv --average-rate BigFile24GiBinRAM | sha512sum --binary  # [ 332MiB/s]
    pv --average-rate BigFile24GiBinRAM | sha256sum --binary  # [ 200MiB/s]
    
  3. BigFile24GiBinRAM was a file generated from /dev/urandom.

  4. All unnecessary services and programs (anti-virus solutions, browsers, etc.) were stopped during the test. Only the desktop environment was still running, and I have an excuse for that: on my 15.6-inch UltraHD display, a plain terminal (VT) is not readable, sorry about that.

  5. The file was located in the RAM (tmpfs).

  6. I ran each test 3 times with the results being the same +/- 1.

Test results:

  1. SHA-512 reached 332MiB/s, about 66% faster than SHA-256!

  2. SHA-256 reached 200MiB/s.
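For readers without pv, a similar streaming measurement can be sketched in Python. This is not the method used for the numbers above; `hash_stream` is a hypothetical helper, and the demo hashes an in-memory buffer of zero bytes instead of a file on tmpfs.

```python
import hashlib
import io
import time


def hash_stream(stream, algorithm, chunk_size=1 << 20):
    """Hash a binary stream in chunks; return (hex digest, MiB/s)."""
    hasher = hashlib.new(algorithm)
    total = 0
    start = time.perf_counter()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)
        total += len(chunk)
    # Guard against a zero interval on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return hasher.hexdigest(), total / elapsed / (1 << 20)


payload = bytes(8 << 20)  # 8 MiB of zero bytes held in memory
for name in ("sha512", "sha256"):
    digest, rate = hash_stream(io.BytesIO(payload), name)
    print(f"{name}: {rate:.0f} MiB/s  {digest[:16]}...")
```

To measure a real file, pass `open(path, "rb")` instead of the `io.BytesIO` object. A much larger input than 8 MiB gives steadier numbers.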


Intel Core i7-4700HQ (4th gen = Haswell); SSD (SATA)

HW / OS configuration:

  • System: Linux Mint 18.2 Cinnamon 64-bit; intel-microcode package, as well as the latest UEFI/BIOS patch, were installed.

  • Processor: Intel® Core™ i7-4700HQ (Ark Intel), PassMark, 2.40GHz - 3.40GHz, 4 cores, 8 threads, laptop

  • Memory: 16GiB DDR3 1600MHz (dual-channel)

  • CPU flags (grep flags /proc/cpuinfo):

    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
    

Methodology:

  1. The laptop was rebooted before testing.

  2. Throughput was measured with the pv utility (man page):

    pv --average-rate BigFile103GiB | sha512sum --binary
    pv --average-rate BigFile103GiB | sha256sum --binary
    
  3. The BigFile103GiB was a virtual disk containing real data (VirtualBox).

  4. Of course, the virtual machine wasn't running at test time, and all unnecessary services and programs were stopped at test time.

  5. The file was located on a 2.5" SATAIII SSD drive.

  6. I ran each test 3 times with the results being the same +/- 1.

Test results:

  1. SHA-512 reached 275MiB/s, about 50% faster than SHA-256!

  2. SHA-256 reached 183MiB/s.


Intel Xeon E3-1225 v3 (4th gen = Haswell); RAM (DDR3)

HW / OS configuration:

  • System: GNU/Linux Debian 9 64-bit; intel-microcode package, as well as the latest UEFI/BIOS patch, were installed.

  • Processor: Intel® Xeon® E3-1225 v3 (Ark Intel), PassMark, 3.20GHz - 3.60GHz, 4 cores, 4 threads, server

  • Memory: 32GiB DDR3 1600MHz (dual-channel) UDIMM ECC

  • CPU flags (grep flags /proc/cpuinfo):

    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
    

Methodology:

  1. The server was rebooted before testing.

  2. Throughput was measured with the pv utility (man page):

    pv --average-rate BigFile24GiBinRAM | sha512sum --binary
    pv --average-rate BigFile24GiBinRAM | sha256sum --binary
    
  3. BigFile24GiBinRAM was a file generated from /dev/urandom.

  4. Of course, all unnecessary services and programs were stopped at test time.

  5. The file was located in the RAM (tmpfs).

  6. I ran each test 3 times with the results being the same +/- 1.

Test results:

  1. SHA-512 reached 315MiB/s, about 53% faster than SHA-256!

  2. SHA-256 reached 206MiB/s.
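The percentages quoted in the three test results can be checked from the measured rates. The dictionary labels below are shorthand for the three machines described above.

```python
# Measured rates in MiB/s, taken from the test results above:
# (SHA-512, SHA-256) per machine.
results = {
    "i7-7700HQ, tmpfs": (332, 200),
    "i7-4700HQ, SATA SSD": (275, 183),
    "Xeon E3-1225 v3, tmpfs": (315, 206),
}


def percent_faster(fast, slow):
    """How much faster `fast` is than `slow`, in percent."""
    return (fast / slow - 1) * 100


for machine, (sha512_rate, sha256_rate) in results.items():
    gain = percent_faster(sha512_rate, sha256_rate)
    print(f"{machine}: SHA-512 is {gain:.0f}% faster")
```

All three machines land a little below the idealized 60% gain (a factor of 1.6) or, in the first case, slightly above it.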

4

SHA-512 (and SHA-384) are usually faster on 64-bit platforms, and SHA-256 is usually faster on 32-bit platforms.

lily wilson
  • Thanks. But isn't the amount of calculations larger for SHA512? (Which is why I was surprised.) – ispiro Jun 16 '15 at 20:29
  • @ispiro SHA512 does do more calculations. That's why you only see 50% improvement. Had the amount of calculations been the same you would have seen close to 100% improvement. – kasperd Jun 17 '15 at 10:46
  • 10
    This answer could do with some explanation and references. – curiousdannii Jun 18 '15 at 04:49
  • 6
    I have to agree with what @curiousdannii noted: currently, this looks more like a comment than an answer. Adding some information about *“why”* and *“under what conditions”* platform bit sizes are able to influence SHA2 speeds would be a good start (if you need an example, just take a look at the answer by @richie-frame) … see, actually explaining things is what generally distinguishes answers from comments. – e-sushi Jun 19 '15 at 02:42
1

It seems that on some systems, such as my AMD Athlon 3000G running 64-bit openSUSE Linux, SHA-256 is roughly 3x faster:

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256          189198.58k   512668.07k  1043590.40k  1429210.79k  1641379.16k  1680277.50k
sha512           48800.73k   191544.23k   343078.23k   515876.82k   611980.63k   620609.54k
Bernhard M.
  • 1
    These systems use SHA instructions within the CPU cores as mentioned under another answer here. Those instructions accelerate SHA-1 and SHA-256 but not SHA-512. OpenSSL will generally use hardware acceleration where available. UPDATE: just checked to be sure, SHA-extensions are present in this model. – Maarten Bodewes Jan 09 '21 at 23:10
  • Yup, https://en.wikichip.org/wiki/amd/athlon/3000g is dual-core, Zen 1 microarchitecture, with SHA extensions. – Peter Cordes Apr 12 '21 at 00:29
  • @PeterCordes Do #cores matter as I get the exact opposite result from Bernhard? – Paul Uszak Jan 11 '22 at 13:41
  • @PaulUszak: IDK if number of cores matter. The microarchitecture matters tremendously (whether it has x86 SHA new instructions, hardware acceleration for SHA-1 and SHA256, but not SHA512). And of course if you're using a VM, the VM has to let the guest see those extensions in CPUID. – Peter Cordes Jan 11 '22 at 13:47
  • The code set in the cpu is also important for this example. – Z0OM May 01 '22 at 09:48
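To check whether a Linux machine advertises the x86 SHA extensions mentioned in these comments, one can look for the `sha_ni` flag in /proc/cpuinfo. The sketch below uses hypothetical helper names and assumes the Linux flag naming; other operating systems expose this information differently.

```python
def has_sha_extensions(flags_line):
    """Return True if a space-separated CPU flags string contains sha_ni."""
    return "sha_ni" in flags_line.split()


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the flags of the first CPU listed in /proc/cpuinfo, or ""."""
    try:
        with open(path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    # Line format: "flags\t\t: fpu vme de ..."
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass  # not Linux, or the file is not readable
    return ""


if __name__ == "__main__":
    flags = read_cpu_flags()
    if not flags:
        print("could not read CPU flags")
    elif has_sha_extensions(flags):
        print("SHA extensions present: SHA-1 and SHA-256 may be accelerated")
    else:
        print("no SHA extensions reported")
```

The flag lists of the three Intel machines in the benchmark answer above contain no `sha_ni` entry, which is consistent with SHA-512 winning there.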
0

This is a comparison I did on CentOS 7.7 over a year ago (2.9GHz Intel i9, 2400 MHz DDR4 PCIe 3.1, SSD: 500 GB):

    # file size: 1.1GB, OS: CentOS 7.7.
    # Secure Hash Algorithm (SHA)
    # ===========================
    #
    # 1. Supported (use the _Digest column)
    # _Digest Program   Status
    # ------- -------   ------
    # sha512  sha512sum CySec Vault(tm) default digest
    #
    # 2. Technology review table
    # _Digest Program     Speed Status
    # ------- -------     ----- ------
    # ck      cksum       1.709
    # md5     md5sum      1.019
    # sha1    sha1sum     0.705
    # sha224  sha224sum   1.404
    # sha256  sha256sum   1.599 Supported
    # sha384  sha384sum   1.005
    # sha512  sha512sum   1.000 Supported
    #
    # 2.9GHz Intel i9, 2400 MHz DDR4 PCIe 3.1, SSD: 500 GB
    # file size: 1.1GB, OS: CentOS 7.7. Timing:
    # $ time sha512sum --binary <file>
    # real    0m1.342s
    # user    0m1.361s
    # sys     0m0.128s

HarriL