
I have a data set that uses a very simple modulo 10 checksum algorithm which ignores alphabetic characters entirely. That wasn't a big deal, as the few alphabetic characters present weren't critical.

Changes to the data format are upcoming, including characters which will now be critical to the interpretation of the data. So I'd like to propose a new checksum algorithm that can detect errors in those characters. Something more powerful than modulo 10 wouldn't hurt either.

I've been looking at the Luhn, Verhoeff, Damm and a few other check digit algorithms, but they all handle only digits, not characters. Is there anything out there that handles both digits and the characters [A-Z.+-]? I'd rather implement something existing and proven than roll my own. I think the check 'digit' could reasonably be any character from the alphabet in use [A-Z.+-], not just [0-9].

Thanks!

CoAstroGeek
  • Can you not use the ASCII values of all of the characters? (Including the ASCII values of the digits) – Ben I. Apr 27 '17 at 20:53
  • Do you mean using the numeric ASCII value for that position as the value put into the algorithm? So modulo 10 for the string "3AB" would be (3 + 65 + 66) % 10. I suppose you could, but I'm not sure the proven features of these algorithms (detect all single-digit errors, detect all adjacent transposition errors, etc.) would still hold. I suppose you could replace every character in the string by its equivalent ASCII code and run the chosen algorithm on that: "3AB" -> 516566 – CoAstroGeek Apr 27 '17 at 21:09
  • Just do checksum in base 39 (which is the size of your alphabet). Details left to you. – Yuval Filmus Apr 27 '17 at 21:13
  • @CoAstroGeek, Yes, that is what I had in mind. It just seemed simple that way :) – Ben I. Apr 27 '17 at 21:30

3 Answers


I realize that this is a very old question, but I just came across it as part of my own research into the topic. Hopefully, this will be useful to somebody else who finds it.

There has been a newer proposal by Chen et al. ("A general check digit system based on finite groups", there's an author's copy here) for a system that can also be instantiated for base 36 (see Section 5.3 of that paper) and in that case will not only catch all single errors and transpositions (like the Verhoeff and Damm algorithms), but also all instances of errors such as jump transpositions.

It also seems like this system is being used in some real-world applications.

The Damm algorithm can also be extended to arbitrary bases (as per Damm's PhD thesis - generated operation tables are also given here for a number of orders including 36). It's easier to implement than the aforementioned one, but as I currently understand it, gives only the guarantee of detecting all single errors and adjacent transpositions by construction (though some of the many possible groups for a given order might provide stronger detection capabilities, see for example the table on page 107 of the thesis).
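To give an idea of how little code the table-driven approach needs, here is a minimal sketch. It assumes the operation table is supplied as a list of rows with a zero diagonal. The 10×10 table below is the widely published order-10 example and serves only as a stand-in: for a larger alphabet you would substitute a table of the matching order and extend `ALPHABET` accordingly. All names here are illustrative.

```python
# Order-10 Damm quasigroup (widely published example table); a stand-in only.
# For a larger alphabet, swap in a table of the matching order.
DAMM_TABLE = [
    [0, 3, 1, 7, 5, 9, 8, 6, 4, 2],
    [7, 0, 9, 2, 1, 5, 4, 8, 6, 3],
    [4, 2, 0, 6, 8, 7, 1, 3, 5, 9],
    [1, 7, 5, 0, 9, 8, 3, 4, 2, 6],
    [6, 1, 2, 3, 0, 4, 5, 9, 7, 8],
    [3, 6, 7, 4, 2, 0, 9, 5, 8, 1],
    [5, 8, 6, 9, 7, 2, 0, 1, 3, 4],
    [8, 9, 4, 5, 3, 6, 2, 0, 1, 7],
    [9, 4, 3, 8, 6, 1, 7, 2, 0, 5],
    [2, 5, 8, 1, 4, 3, 6, 7, 9, 0],
]
ALPHABET = "0123456789"  # needs exactly one symbol per table row


def damm_interim(s, table=DAMM_TABLE, alphabet=ALPHABET):
    """Fold the string through the table, starting from 0."""
    interim = 0
    for ch in s:
        interim = table[interim][alphabet.index(ch)]
    return interim


def damm_check_char(payload, table=DAMM_TABLE, alphabet=ALPHABET):
    # Because the table's diagonal is all zeros, appending the symbol for
    # the final interim value drives the interim value to 0.
    return alphabet[damm_interim(payload, table, alphabet)]


def damm_is_valid(full, table=DAMM_TABLE, alphabet=ALPHABET):
    return damm_interim(full, table, alphabet) == 0


print(damm_check_char("572"))   # 4
print(damm_is_valid("5724"))    # True
print(damm_is_valid("5742"))    # False (adjacent transposition)
```

The only thing that changes between bases is the table and the alphabet string, which is why this variant is the easier one to implement.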

Note that all these systems are optimized for the kinds of errors that happen during transcription by humans. From your question, it's not quite clear if that is what you're going for. If not, a traditional checksum might be more appropriate.

fnl

Check digit

If the check digit needs to be a digit (0-9), here is one solution:

You could take the SHA256 hash of the character string, modulo 10, and use that as your check digit. The advantage is that this has good properties on average, and takes into account all characters.

Arguably, a possible disadvantage is that it is not guaranteed to detect all single-digit errors and all transpositions; those are only detected with probability 9/10. (Some other checksum algorithms for digits achieve the latter guarantee, but they don't work with alphabet characters. Moreover, there is no hope or possibility of achieving both of those guarantees when you have to deal with alphabetic characters: it's not possible to guarantee detection of either kind of error when you have alphabetic characters and a check digit that is restricted to 0-9.) Therefore, I'm not sure it is possible to do much better than this.
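The hash-based check digit described above can be sketched as follows. This is only an illustration: the function name is made up, and UTF-8 encoding and big-endian interpretation of the digest are assumptions (any fixed convention works, as long as writer and reader agree).

```python
import hashlib


def sha256_check_digit(s: str) -> str:
    """Return a single decimal check digit derived from SHA256(s)."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    # Interpret the 32-byte digest as one big integer and reduce mod 10.
    return str(int.from_bytes(digest, "big") % 10)


payload = "3AB"
print(payload + sha256_check_digit(payload))
```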

Check character

If the check digit can be anything from the character set, here is an alternative solution:

Suppose there are $n$ characters, so each character can be thought of as a number in the range $0,1,2,\dots,n-1$. Let the characters in the string be $x_1,x_2,\dots,x_k$. Then you can use the following as the check sum:

$$(x_1 + 2x_2 + 3x_3 + \dots + kx_k) \bmod n.$$

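If $n$ is prime, this has the following nice guarantee: any one-character error will be detected (e.g., cooking -> cooling) unless it occurs at a position that is a multiple of $n$ (where the weight is $0 \bmod n$), and any transposition of two characters will be detected (e.g., weird -> wierd) unless the transposed characters are a multiple of $n$ positions apart. In particular, both guarantees hold without exception for strings shorter than $n$ characters.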

If $n$ is not prime, you get weaker properties, but it's still a reasonable checksum.
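Here is a minimal sketch of the weighted sum, assuming the 39-symbol alphabet implied by the question (digits, A-Z and `.+-`) in an arbitrary but fixed order. `ALPHABET` and `weighted_check_char` are illustrative names. Since 39 = 3 × 13 is not prime, the example also shows one error that slips through.

```python
# Assumed alphabet: 10 digits + 26 letters + 3 punctuation symbols = 39.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ.+-"


def weighted_check_char(s, alphabet=ALPHABET):
    """Check character for (1*x_1 + 2*x_2 + ... + k*x_k) mod n."""
    n = len(alphabet)
    total = sum(i * alphabet.index(ch) for i, ch in enumerate(s, start=1))
    return alphabet[total % n]


print(weighted_check_char("AB0"))  # W
print(weighted_check_char("BA0"))  # V  (transposition detected)
print(weighted_check_char("ABD"))  # W  (single error at position 3 missed)
```

The last case is missed because the third character changes by 13 and carries weight 3, so the sum changes by 39, which is 0 mod 39. With a prime-sized alphabet no such factorisation exists for positions below $n$.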

Alternatively, you can use the SHA256 hash of the character string, modulo $n$, as the check sum. That has no guarantees but is good on average.

D.W.
  • Thanks for the input. I don't see any reason why the check digit would be restricted to 0-9 though. The Luhn algorithm I just linked maps it back to the original alphabet. – CoAstroGeek Apr 27 '17 at 21:25

OK, on reading the links I posted above more carefully, I found one answer: the Luhn mod N algorithm, which I guess is similar to the suggestion Yuval gave above. Suggestions are still welcome, but I think this is a good starting point.
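For reference, a rough sketch of Luhn mod N as I understand it. The 36-symbol alphabet and the function names are assumptions for illustration, not part of any standard. One caveat I noticed: the "double and fold" step maps the code points one-to-one only when N is even, so a 39-symbol alphabet would need care (for instance, padding it to 40 symbols).

```python
# Assumed alphabet for illustration: digits plus A-Z, so N = 36 (even).
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def _luhn_sum(s, alphabet, factor):
    """Sum code points right to left, alternating the factor between 1 and 2."""
    n = len(alphabet)
    total = 0
    for ch in reversed(s):
        addend = factor * alphabet.index(ch)
        factor = 1 if factor == 2 else 2
        # Fold the base-n "digits" of the addend together.
        total += addend // n + addend % n
    return total % n


def luhn_mod_n_check_char(payload, alphabet=ALPHABET):
    n = len(alphabet)
    # The rightmost payload character is doubled when generating.
    remainder = _luhn_sum(payload, alphabet, factor=2)
    return alphabet[(n - remainder) % n]


def luhn_mod_n_is_valid(full, alphabet=ALPHABET):
    # The check character itself is not doubled when validating.
    return _luhn_sum(full, alphabet, factor=1) == 0


code = "3AB" + luhn_mod_n_check_char("3AB")
print(code, luhn_mod_n_is_valid(code))
```

With `alphabet="0123456789"` this reduces to the ordinary Luhn algorithm, which makes for an easy sanity check against known card-number examples.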

CoAstroGeek