Best way to hash two values into one?

Question

I'm trying to hash two unique strings together to create a hash. The most obvious way would be simply to concatenate the two and run a hash function on it:

hash = sha256(strA + strB)

But I was wondering if this is how most people do it, and if there is a "standard" way of doing something like this in the most secure way.

I don't really know if it's widely used, but concatenating shouldn't be a problem if, for example, each string has a fixed, known length. Otherwise you might encounter "collisions", since for example (strA, strB+strC) would have the same hash as (strA+strB, strC) — Daniel, Jan 30 '18 at 16:36
That's a good point about collisions. I guess I'm looking for a "universal" algorithm that can avoid all these problems altogether without having to worry about edge cases, even when the strings are not of fixed length. — Vlad, Jan 30 '18 at 17:08
Prepending the length of each part should be sufficient. As explained in this answer what you want is a unique encoding. — Maeher, Jan 30 '18 at 19:01

score 15 · Accepted Answer · answered Jan 30 '18 at 22:29

The most important thing you're missing is that you should combine the two strings with an injective function: a function such that distinct combinations of inputs always produce distinct outputs.

Your idea to just concatenate the two strings violates this condition. One example is that "abc" + "def" is the same string as "ab" + "cdef" or "abcd" + "ef". Whether an adversary can exploit this to their advantage depends on the application, but there's no reason to expose oneself to this risk to start with.
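As a small illustration of the ambiguity described above, here is a sketch in Python. The helper name naive_hash is made up for this example, and UTF-8 is assumed as the string encoding:

```python
import hashlib

def naive_hash(str_a: str, str_b: str) -> str:
    """Hash the plain concatenation of two strings (ambiguous on purpose)."""
    return hashlib.sha256((str_a + str_b).encode("utf-8")).hexdigest()

# All three pairs concatenate to "abcdef", so they hash to the same value.
same = naive_hash("abc", "def") == naive_hash("ab", "cdef") == naive_hash("abcd", "ef")
print(same)  # True
```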

Some techniques for combining strings in an injective fashion:

Put an unambiguous delimiter between the strings
Prepend each string with its length

Another tip: whichever method you pick, write a function that can "split" any combined string back into its components, and a bunch of test cases to show that the property split(combine(strA, strB)) == (strA, strB) holds for all values of strA and strB.
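One possible sketch of the length-prefix technique together with the suggested split function, in Python. The names combine, split and hash_pair, the fixed 8-byte big-endian length field, and the UTF-8 encoding are all assumptions made for this example, not something the answer prescribes:

```python
import hashlib
import struct
from typing import Tuple

def combine(str_a: str, str_b: str) -> bytes:
    """Prefix each string with its byte length as an 8-byte big-endian integer."""
    a = str_a.encode("utf-8")
    b = str_b.encode("utf-8")
    return struct.pack(">Q", len(a)) + a + struct.pack(">Q", len(b)) + b

def split(blob: bytes) -> Tuple[str, str]:
    """Invert combine(): read each length field, then that many bytes."""
    (len_a,) = struct.unpack_from(">Q", blob, 0)
    a = blob[8:8 + len_a]
    (len_b,) = struct.unpack_from(">Q", blob, 8 + len_a)
    b = blob[16 + len_a:16 + len_a + len_b]
    return a.decode("utf-8"), b.decode("utf-8")

def hash_pair(str_a: str, str_b: str) -> str:
    return hashlib.sha256(combine(str_a, str_b)).hexdigest()

# Round-trip checks, including the pairs that collide under plain concatenation.
for pair in [("abc", "def"), ("ab", "cdef"), ("", "abcdef"), ("", "")]:
    assert split(combine(*pair)) == pair
```

A handful of test cases cannot prove the property for all inputs, but a failing round trip is a quick sign that the encoding is ambiguous.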

answered Jan 30 '18 at 22:29

Luis Casillas


Also any (standard) encoding which allows you to recover the individual strings from the result (such as cantor pairing or ASN.1 encoding) will also do the job – SEJPM Jan 30 '18 at 22:31
@SEJPM: The caveat there is that encodings are often not functions of their input because they don't determine a unique string for a given input. There is for example more than one way to JSON-encode the same data, e.g., different whitespace, or different orders of the same set of object fields. – Luis Casillas Jan 30 '18 at 22:42
+1 for that last tip, that is a good rule of thumb to follow. If you can't get back your original strings from the encoded ones, then that's a sign that your encoding was ambiguous. – Ella Rose Jan 31 '18 at 21:55

score 8 · Answer 2 · answered Jan 31 '18 at 21:23

Concatenation is not always convenient, and it's ambiguous: it results in hash2("ab", "c") = hash2("a", "bc"). This sort of collision can be a way to attack a system. For example, suppose that a system validates “harmless” pairs of strings, where all pairs where the second string contains only digits are considered harmless. Get the pair ("; system('bin/sh'); #", "1") signed as harmless, then present ("", "; system('bin/sh'); #1") which has the same hash and therefore the same signature.

One way to unambiguously denote the concatenation of strings is to encode strings and add delimiters (quotes), e.g. replace all \ and " by \\ and \" and surround each string by "…". This approach is the one that most text-based encodings take: XML, JSON, etc. The downside is that the escaping can get complicated, depending on the quoting rules (JSON is simple, XML and SQL aren't).
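A minimal sketch of the escape-and-quote approach in Python. The function names quote and hash2_quoted are hypothetical, and UTF-8 is assumed; the escaping rule is the one described above (backslash and double quote only):

```python
import hashlib

def quote(s: str) -> str:
    """Escape backslashes first, then double quotes, then wrap in quotes."""
    return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'

def hash2_quoted(str_a: str, str_b: str) -> str:
    # Each quoted string ends at its first unescaped quote, so the
    # boundary between the two strings is unambiguous.
    encoded = quote(str_a) + quote(str_b)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()

print(quote('say "hi"'))  # "say \"hi\""
```

The order of the two replace calls matters: escaping quotes before backslashes would double the backslashes that the quote escaping just introduced.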

Another way, which is what most binary formats do, is to prefix each string with its length. ASN.1 defines a very complicated way to do this. It's very complicated because it caters to a lot of cases; it's way overkill for just concatenating two strings.

If all you need is to hash a list of strings, then a very simple solution is:

Hash each string.
Concatenate the hashes and hash the result.

For example:

hash2(strA, strB) = hash(hash(strA) || hash(strB))

where || denotes concatenation and hash is any cryptographic hash function.

This requires very little processing and has little risk of errors. It scales easily to any number of strings, even a variable number of strings. It even generalizes to structures that are more complicated than lists, with hash trees. It works because hashes have a fixed size, so there's no room for ambiguity.
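The hash-of-hashes construction could be sketched in Python as follows. SHA-256 and UTF-8 are assumptions for this example (the answer allows any cryptographic hash), and hash_list is a hypothetical name for the list generalization:

```python
import hashlib
from typing import Iterable

def h(data: bytes) -> bytes:
    """The underlying hash; SHA-256 is an arbitrary choice here."""
    return hashlib.sha256(data).digest()

def hash2(str_a: str, str_b: str) -> bytes:
    # hash(hash(strA) || hash(strB)): both inner digests are 32 bytes,
    # so the boundary between them is always at a known offset.
    return h(h(str_a.encode("utf-8")) + h(str_b.encode("utf-8")))

def hash_list(strings: Iterable[str]) -> bytes:
    """Same idea for any number of strings."""
    return h(b"".join(h(s.encode("utf-8")) for s in strings))
```

With these definitions, hash2("ab", "c") and hash2("a", "bc") differ, unlike the plain concatenation.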

score 3 · Answer 3 · answered Jan 30 '18 at 22:47

If you want this to always produce the same result for any combination of identical strings supplied in any order, there is a very simple way to do it.

Hash each string individually:

H1 = hash(str1)
H2 = hash(str2)
H3 = hash(str3)

Sort those hashes by order of smallest to largest (treat as integers), concatenate, and hash them together

Result = hash(H2+H1+H3)

Because the hashes are sorted, they will always be in the same order for the final hash. You also do not need to worry about string-order collisions when hashing concatenated strings, or about choosing some delimiter that may or may not also be in one of the strings. This is more computationally expensive because of the two rounds of hashing and the sorting required, but it provides flexibility for string order and content.
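A sketch of this order-independent variant in Python, assuming SHA-256 and UTF-8; the name unordered_hash is made up for the example. Sorting equal-length digests as byte strings gives the same order as sorting them as big-endian integers:

```python
import hashlib
from typing import Iterable

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def unordered_hash(strings: Iterable[str]) -> bytes:
    """Hash each string, sort the digests, then hash their concatenation."""
    digests = sorted(h(s.encode("utf-8")) for s in strings)
    return h(b"".join(digests))

# The input order no longer affects the result.
assert unordered_hash(["a", "b", "c"]) == unordered_hash(["c", "a", "b"])
```

Note that this deliberately makes (strA, strB) and (strB, strA) hash to the same value, so it only fits cases where the order of the inputs should not matter.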

Hash each result individually: yes. Sort the hashes: no! That causes f(strA, strB) to be equal to f(strB, strA) which is not what was requested. — Gilles 'SO- stop being evil', Jan 31 '18 at 20:36
@Gilles Hence the "supplied in any order" in the first paragraph. The question did not specify whether the order matters; simply not sorting them preserves the order. — Richie Frame, Jan 31 '18 at 22:22

score 0 · Answer 4 · answered Jul 19 '20 at 14:24

To hash a sequence of strings unambiguously, so that any two different sequences yield a different hash, you could prepend every string with its length, e.g. by using decimal representation followed by a separating character (e.g. hyphen). For the strings "abc" and "de" and "" (empty string) and "f" this would look like this:

hash("3-abc2-de0-1-f")

This scheme also covers the empty sequence (consisting of 0 strings):

hash("")
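The decimal length-prefix scheme above could be sketched in Python like this. The function names are hypothetical, and the example assumes that lengths are counted in UTF-8 bytes and that SHA-256 is the hash:

```python
import hashlib
from typing import Sequence

def encode_sequence(strings: Sequence[str]) -> bytes:
    """Prefix each string with its decimal byte length and a hyphen."""
    parts = []
    for s in strings:
        data = s.encode("utf-8")
        parts.append(str(len(data)).encode("ascii") + b"-" + data)
    return b"".join(parts)

def hash_sequence(strings: Sequence[str]) -> str:
    return hashlib.sha256(encode_sequence(strings)).hexdigest()

print(encode_sequence(["abc", "de", "", "f"]))  # b'3-abc2-de0-1-f'
print(encode_sequence([]))                      # b''
```

A hyphen inside a string is harmless, because a decoder reads the length first and then skips exactly that many bytes.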

To save space, you might represent the lengths in a different base, say 64, using digits and letters and two more printable ASCII-chars.

You may even use all byte values as digits, i.e. represent the lengths in base 256. Then a separation char cannot be used any more, but you can prepend every length by its number of digits. Let $x_i$ be string number $i$ to hash:

$$ hash(f(x_0) + f(x_1) + \dots + f(x_{n-1})) $$

$$ f(x) = length(length(x)) + length(x) + x $$

$+$ means concatenation and $length(x)$ returns a byte string representing the length of string $x$ in base 256, without leading zeroes. It must return a single zero byte when $x$ is the empty string, or the scheme will not work.

This works because the length of a string is reasonably limited: $length(x)$ can be represented using fewer than 256 bytes, so $length(length(x))$ always fits in 1 byte.
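A sketch of the base-256 variant in Python, operating on byte strings. The names length, f and hash_sequence mirror the formulas but are otherwise assumptions of this example, as are big-endian digit order and SHA-256:

```python
import hashlib
from typing import Iterable

def length(data: bytes) -> bytes:
    """Length of data in base 256, no leading zeroes; one zero byte if empty."""
    n = len(data)
    size = max(1, (n.bit_length() + 7) // 8)
    return n.to_bytes(size, "big")

def f(x: bytes) -> bytes:
    """f(x) = length(length(x)) + length(x) + x, where + is concatenation."""
    len_x = length(x)
    return length(len_x) + len_x + x

def hash_sequence(items: Iterable[bytes]) -> bytes:
    return hashlib.sha256(b"".join(f(x) for x in items)).digest()

print(f(b"abc"))  # b'\x01\x03abc'
print(f(b""))     # b'\x01\x00'
```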

To save a bit more space, define $len(x)$ to return the empty string if $x$ is empty and otherwise behave like $length()$. Then: $$ f(x) = length(len(x)) + len(x) + x $$
