
I had heard that although it's easy to implement message digest functions like MD5, SHA-1, SHA-256 etc. in CUDA (or any other GPU platform), it is impossible to implement bcrypt there.

bcrypt is different from these hash functions in that it is built on Blowfish, a block cipher, and a "feedback" loop is used to produce a one-way hash.

I am not too familiar with the GPU platforms. Does anyone know if bcrypt can be ported to a GPU and if not, why?

Rook
    For a moment I misread "Cuda" as "Cuba" and thought this was an encryption import problem. – Joe Z. Jan 27 '13 at 21:32

2 Answers


It is not impossible, only harder to implement efficiently. This is because of RAM. In a GPU, you have a number of cores which can do 32-bit operations. They will run at one operation per cycle and per core, as long as they operate on their respective registers. RAM access, however, is more troublesome. Each group of cores has access to a small amount of shared RAM, and all cores can read and write the GPU main RAM, but there are access restrictions: not all cores can read from or write to RAM simultaneously (constraints are stricter for main RAM).

Now bcrypt is a variant of the Blowfish key scheduling, which is defined over a table (a few kilobytes) that is constantly accessed and modified throughout the algorithm. Due to the size of the table, each core will have to store it in the GPU main RAM, and the cores will compete for usage of the memory bus. So bcrypt will run -- but not with full parallelism. At any time, most cores will be stalled, waiting for the memory bus to become free. This comes from the type of elementary operations bcrypt consists of, not from the fact that bcrypt is derived from the key schedule of a block cipher.
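To illustrate the access pattern, here is a simplified C sketch of the Blowfish F-function, the operation at the heart of bcrypt. The S-box contents below are placeholder values (real Blowfish initializes them from the hexadecimal digits of pi); only the shape of the memory accesses matters here.

```c
#include <stdint.h>

/* Four 256-entry 32-bit S-boxes: about 4 KB of state that bcrypt
 * reads and rewrites constantly. On a GPU this table is too large
 * for per-thread registers, so every lookup hits memory. */
static uint32_t S[4][256];

/* The Blowfish F-function: four table lookups per call, each indexed
 * by one byte of the input word. The indices are data-dependent, so
 * the accesses cannot be coalesced or predicted ahead of time. */
uint32_t blowfish_f(uint32_t x)
{
    uint32_t a = (x >> 24) & 0xff;
    uint32_t b = (x >> 16) & 0xff;
    uint32_t c = (x >> 8)  & 0xff;
    uint32_t d = x & 0xff;
    return ((S[0][a] + S[1][b]) ^ S[2][c]) + S[3][d];
}
```

This function is applied sixteen times per block encryption, and bcrypt's expensive key setup performs a large, configurable number of such encryptions while rewriting the S-boxes, which is why the table cannot simply be treated as read-only shared data.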

For SHA-1 or SHA-256, the computation consists entirely of 32-bit operations on a handful of registers, so a password cracker will run without doing any memory access at all, and full parallelism is easily achieved (I did it on my GeForce 9800 GTX+, and I got about 98% of the theoretical maximum speed with a straightforward unrolled SHA-1 implementation).
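For contrast with bcrypt, here is a single-block SHA-1 sketch in C (restricted to messages shorter than 56 bytes, so padding fits in one block). Note that the entire working state is five 32-bit words plus the message schedule, all of which a GPU thread can keep in registers -- there are no data-dependent table lookups anywhere.

```c
#include <stdint.h>
#include <string.h>

static uint32_t rotl(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* Single-block SHA-1, for illustration: pure 32-bit arithmetic on a
 * handful of values, ideal for register-resident GPU threads. */
void sha1_short(const uint8_t *msg, size_t len, uint8_t out[20])
{
    uint8_t block[64] = {0};
    memcpy(block, msg, len);
    block[len] = 0x80;                      /* mandatory padding bit */
    uint64_t bits = (uint64_t)len * 8;      /* message length in bits */
    for (int i = 0; i < 8; i++)
        block[63 - i] = (uint8_t)(bits >> (8 * i));

    uint32_t w[80];                         /* message schedule */
    for (int t = 0; t < 16; t++)
        w[t] = (uint32_t)block[4*t] << 24 | (uint32_t)block[4*t+1] << 16 |
               (uint32_t)block[4*t+2] << 8 | block[4*t+3];
    for (int t = 16; t < 80; t++)
        w[t] = rotl(w[t-3] ^ w[t-8] ^ w[t-14] ^ w[t-16], 1);

    uint32_t a = 0x67452301, b = 0xEFCDAB89, c = 0x98BADCFE,
             d = 0x10325476, e = 0xC3D2E1F0;
    for (int t = 0; t < 80; t++) {
        uint32_t f, k;
        if      (t < 20) { f = (b & c) | (~b & d);          k = 0x5A827999; }
        else if (t < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1; }
        else if (t < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDC; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6; }
        uint32_t tmp = rotl(a, 5) + f + e + k + w[t];
        e = d; d = c; c = rotl(b, 30); b = a; a = tmp;
    }

    uint32_t h[5] = { 0x67452301 + a, 0xEFCDAB89 + b, 0x98BADCFE + c,
                      0x10325476 + d, 0xC3D2E1F0 + e };
    for (int i = 0; i < 5; i++)
        for (int j = 0; j < 4; j++)
            out[4*i + j] = (uint8_t)(h[i] >> (24 - 8*j));
}
```

Every variable above lives comfortably in registers, which is why a straightforward unrolled version can approach the GPU's theoretical arithmetic throughput.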

For details on the programming model in CUDA, have a look at the CUDA C Programming Guide. Also, the author of bcrypt now proposes scrypt (edit: actually that's not the same person; the author of scrypt is Colin Percival, while bcrypt has been designed by Niels Provos and David Mazières), which is even heavier on the memory accesses, exactly so that implementation is hard on GPU and FPGA.
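The idea behind scrypt's memory hardness can be sketched very roughly (real scrypt uses Salsa20/8 inside its ROMix construction; the mixing function below is a hypothetical stand-in): fill a large array sequentially, then perform many data-dependent reads into it.

```c
#include <stdint.h>

/* Toy mixing function standing in for scrypt's BlockMix/Salsa20/8. */
static uint32_t mix(uint32_t x)
{
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

/* Sketch of the ROMix idea: derive n words sequentially, then do n
 * data-dependent lookups into them. An attacker must either keep all
 * n words in fast memory or recompute them on demand, trading time
 * for memory -- exactly what makes GPU/FPGA implementations costly. */
uint32_t romix_toy(uint32_t seed, uint32_t *v, uint32_t n)
{
    uint32_t x = seed;
    for (uint32_t i = 0; i < n; i++) {   /* phase 1: fill memory */
        v[i] = x;
        x = mix(x);
    }
    for (uint32_t i = 0; i < n; i++)     /* phase 2: random reads */
        x = mix(x ^ v[x % n]);
    return x;
}
```

In real scrypt the array holds N blocks of kilobyte size rather than single words, so the total memory requirement is megabytes per password guess, far beyond what a GPU core group's shared RAM can hold.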

Thomas Pornin
  • scrypt is really cool! What do you think of using memory-hard algorithms for key derivation and password storage? If I am not mistaken, PBKDF2 is compute-hard and is therefore optimal for cracking with a GPU or FPGA. – Rook Aug 13 '11 at 07:37
  • So a special purpose hardware implementation with its own 4 KB of (fast) memory per core would be good for bcrypt, wouldn't it? (Not for scrypt, of course.) – Paŭlo Ebermann Aug 13 '11 at 15:57
  • @Rook: what we want is a password processing algorithm which is as slow as possible for the attacker, and as fast as possible for the "honest user". So it makes sense to optimize it for what the honest user will use, and that's a PC (or similar) with a general purpose CPU which is quite good at accessing lots of RAM. So yes, memory-hard algorithms are a good idea -- until one of your target systems is a memory-constrained embedded system, of course. – Thomas Pornin Aug 13 '11 at 16:06
  • @Paulo: yes, special purpose hardware with fast embedded RAM blocks should be efficient for bcrypt. Newer FPGA (e.g. the Virtex 5 from Xilinx) have such RAM blocks; it is a reason why Percival thought that bcrypt was not memory-hard enough, and thus designed scrypt. – Thomas Pornin Aug 13 '11 at 16:08
  • When you say " Each group of cores has access to a small amount of shared RAM" how much RAM are we actually talking about? Amounts less than the typical processor's L2 cache? – David Perry Oct 11 '11 at 17:58
  • @David: for my GPU (Nvidia 9800 GTX+), that's 16 kB of shared RAM for a group of 8 cores -- but many more than 8 threads run on such a group (instructions have a high latency, so you'd need about 200 threads to achieve optimal throughput). Also, not all cores in a group may access shared RAM simultaneously without constraints. See this document for all the gory details. – Thomas Pornin Oct 11 '11 at 18:24
  • Oh my, that is pretty limited. I assume bcrypt has similar memory use patterns to scrypt, and I know scrypt needs at least 128k, which is well within reason to ask of a CPU's L2 cache and dramatically more than what GPU cores apparently have available. Thanks! – David Perry Oct 11 '11 at 18:34
  • If I understand it correctly, using a salt of a few kilobytes (say 8KiB) with enough iterations would have the same effect? – Luc May 15 '16 at 12:26
  • @Luc: memory access pattern matters a lot in that. – Thomas Pornin May 15 '16 at 16:14
  • @ThomasPornin Sorry, I don't understand your response. The access pattern would be to go back to the core-group RAM or GPU main RAM (depending on how big the salt is) since it's too big to keep in registers. This would make iterating hash(password + big salt) roughly equivalent to what bcrypt does, right? – Luc May 15 '16 at 16:56
  • Stupid question: do recent GPUs have more shared RAM, and could bcrypt be made for those? – My1 Jul 27 '17 at 10:51
  • Thanks for your very forward-looking answer, Thomas. See my answer for references to efficient FPGA implementations. – nealmcb Jun 13 '20 at 20:42

bcrypt has now been implemented for both GPUs and for FPGAs. See Bcrypt password cracking extremely slow? Not if you are using hundreds of FPGAs!.

The GPU implementation described there is just barely faster than the CPU implementation. But the FPGA implementation is much more cost-effective and uses over an order of magnitude less power. However, so far it only seems to run on discontinued FPGA boards.

In particular, they first compare two systems, each costing on the order of a thousand dollars: a CPU (AMD EPYC 7401P, 24 cores, 3.0 GHz) and a high-end GPU (Nvidia RTX 2080 Ti). Both are quite slow for bcrypt at work factor 12 (2^12 iterations of the key setup): about 197 vs. 219 hashes/sec.

But the FPGA implementation (using open source code from John the Ripper) can do about a thousand work-factor-12 hashes/sec on a single ZTEX 1.15y board, using just 3-5% of the power.

nealmcb