Concatenation is not always convenient, and it's ambiguous: it results in hash2("ab", "c") = hash2("a", "bc"). This sort of collision can be a way to attack a system. For example, suppose that a system validates “harmless” pairs of strings, where all pairs where the second string contains only digits are considered harmless. Get the pair ("; system('bin/sh'); #", "1") signed as harmless, then present ("", "; system('bin/sh'); #1"), which has the same hash and therefore the same signature.
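The collision is easy to reproduce. The following is a minimal sketch, assuming SHA-256 as the underlying hash and UTF-8 as the string encoding; naive_hash2 is a hypothetical name for the ambiguous construction, not something taken from a real system.

```python
import hashlib

def naive_hash2(a: str, b: str) -> str:
    # Ambiguous: the boundary between a and b is lost before hashing.
    return hashlib.sha256((a + b).encode("utf-8")).hexdigest()

# The pair that gets signed as harmless (second string is all digits).
signed = ("; system('bin/sh'); #", "1")
# A different pair whose concatenation is byte-for-byte identical.
forged = ("", "; system('bin/sh'); #1")

collides = naive_hash2(*signed) == naive_hash2(*forged)
print(collides)  # True
```

Both pairs concatenate to the same string, so any hash function gives the same result here, however strong it is.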
One way to unambiguously denote the concatenation of strings is to encode strings and add delimiters (quotes), e.g. replace all \ and " by \\ and \" and surround each string by "…". This approach is the one that most text-based encodings take: XML, JSON, etc. The downside is that the escaping can get complicated, depending on the quoting rules (JSON is simple, XML and SQL aren't).
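As an illustration of the escape-and-quote rule described above, here is a minimal sketch. The names quote and encode_pair are hypothetical, and this is only the backslash/double-quote rule from the text, not a full JSON or XML encoder.

```python
def quote(s: str) -> str:
    # Escape backslashes first, so the backslashes added for quotes
    # in the second step are not escaped again.
    escaped = s.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def encode_pair(a: str, b: str) -> str:
    # Each string carries its own delimiters, so the boundary survives.
    return quote(a) + quote(b)

print(encode_pair("ab", "c"))  # "ab""c"
print(encode_pair("a", "bc"))  # "a""bc"
```

The two pairs now produce different encodings, so hashing the encoded form would no longer collide.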
Another way, which is what most binary formats do, is to prefix each string with its length. ASN.1 defines a very complicated way to do this. It's complicated because it caters to a lot of cases, and it's way overkill for just concatenating two strings.
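A much simpler length prefix than ASN.1 is enough for this purpose. The sketch below assumes a fixed 8-byte big-endian length field, UTF-8 encoding, and SHA-256; these choices and the names length_prefixed and hash2_prefixed are illustrative assumptions, not a standard format.

```python
import hashlib
import struct

def length_prefixed(s: str) -> bytes:
    data = s.encode("utf-8")
    # Fixed-width length (unsigned 64-bit, big-endian), then the bytes.
    return struct.pack(">Q", len(data)) + data

def hash2_prefixed(a: str, b: str) -> str:
    # The length fields make the boundary between a and b recoverable.
    encoded = length_prefixed(a) + length_prefixed(b)
    return hashlib.sha256(encoded).hexdigest()

print(hash2_prefixed("ab", "c") == hash2_prefixed("a", "bc"))  # False
```

Because the length field has a fixed width, a decoder always knows where each string ends, which is what removes the ambiguity.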
If all you need is to hash a list of strings, then a very simple solution is:
- Hash each string.
- Concatenate the hashes and hash the result.
For example:
hash2(strA, strB) = hash(hash(strA) || hash(strB))
where || denotes concatenation and hash is any cryptographic hash function.
This requires very little processing and has little risk of errors. It scales easily to any number of strings, even a variable number of strings. It even generalizes to structures that are more complicated than lists, with hash trees. It works because hashes have a fixed size, so there's no room for ambiguity.
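The hash-of-hashes construction can be sketched in a few lines. This assumes SHA-256 for hash and UTF-8 for encoding the strings; h and hash_list are hypothetical helper names.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def hash_list(strings) -> str:
    # Every inner digest is exactly 32 bytes, so the concatenation
    # splits back into its parts in only one way.
    inner = b"".join(h(s.encode("utf-8")) for s in strings)
    return h(inner).hex()

print(hash_list(["ab", "c"]) == hash_list(["a", "bc"]))  # False
```

For two strings, hash_list([strA, strB]) computes hash(hash(strA) || hash(strB)) as in the formula above, and the same function accepts a list of any length.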