
FNV-1, Murmur2, and DJB2 are examples of non-cryptographic hashing functions used in actual applications (see Which hashing algorithm is best for uniqueness and speed?). These are all similar in that they have an inner loop that computes the result by using simple operations such as XOR or bit shifting.

Perhaps these algorithms are truly the best available; I don't know. But Pearson hashing, in particular, seems to be rarely considered. Yet it would appear to be the best choice when nothing is known about the key domain, because it was designed to spread out (randomize) its output well for any input domain.

This means (for character-string keys) that whether the keys are single letters (a small domain) or strings of length 128 (a much larger domain), the hashed values are guaranteed to be well spread out (distributed randomly over the range of the function). Such a good random distribution is desirable because it should reduce the number of collisions for random selections of keys (again, assuming no special characteristics of the key distribution).

Pearson hashing accomplishes this by using a 256-entry array that contains a random permutation having a single cycle. To clarify what I mean, here is a four-entry array specifying such a permutation of the value list [0, 1, 2, 3] (reading each line as index: value, the single cycle is 0 → 2 → 1 → 3 → 0):

0:  2
1:  3
2:  1
3:  0
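A single-cycle permutation like the one above can be generated with Sattolo's algorithm (my choice of construction here; the function names are mine, and Pearson's paper simply presents a fixed table):

```python
import random

def single_cycle_permutation(n, seed=None):
    """Sattolo's algorithm: a uniformly random permutation of
    0..n-1 consisting of exactly one cycle of length n."""
    rng = random.Random(seed)
    p = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)        # note: j < i, unlike Fisher-Yates
        p[i], p[j] = p[j], p[i]
    return p

def cycle_length(p, start=0):
    """Follow the permutation from `start` until it returns there."""
    x, steps = p[start], 1
    while x != start:
        x, steps = p[x], steps + 1
    return steps
```

For a 256-entry table, `cycle_length` always returns 256: every value is reachable from every other by repeated lookups, which is what "single cycle" buys you.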

Pearson hashing scans the key string. For each byte (it's fine for a character to span multiple bytes), it XORs that byte into a running hash, then looks the result up in its 256-entry permutation table; the table entry becomes the new running hash. Since the table contains 256 entries, it handles any byte value. To scale up to a larger range, such as 16 or 32 bits, the inner loop is performed more than once. Since the algorithm uses one XOR and one simple array lookup per key byte, it is probably about as fast as a hash function can be, particularly if implemented in assembly language. And the code for a Pearson hashing function should be small, not much more than the 256 bytes used by its table. Except in very tiny embedded applications, I can't imagine 256 bytes of memory being an objection to the use of Pearson hashing.
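In code, the inner loop really is just XOR-then-lookup. A minimal sketch (the seeded shuffle stands in for a real table, which published implementations ship as a fixed constant; the 16-bit widening shown is one common variant, not the only one):

```python
import random

# Any permutation of 0..255 will do; a seeded shuffle is used here
# purely for illustration.
_rng = random.Random(0)
TABLE = list(range(256))
_rng.shuffle(TABLE)

def pearson8(data: bytes, table=TABLE) -> int:
    """8-bit Pearson hash: one XOR and one table lookup per key byte."""
    h = 0
    for b in data:
        h = table[h ^ b]
    return h

def pearson16(data: bytes, table=TABLE) -> int:
    """Widen the range by running the loop once per output byte,
    varying only the initial value, and concatenating the results."""
    result = 0
    for i in range(2):
        h = i                     # a different start for each pass
        for b in data:
            h = table[h ^ b]
        result = (result << 8) | h
    return result
```

Because the table is a permutation, the lookup is injective: two keys differing in their last byte are guaranteed to hash differently, which is part of why the distribution is so well behaved.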

Some examples of Pearson function implementations are given at https://gist.github.com/imdario/4758192 .

I would be very interested to see an analysis of Pearson hashing as compared to other more commonly-used algorithms, such as the three listed above, with input sets such as an English dictionary. I would expect it to do very well.
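As a starting point, a collision-count harness along these lines would do. (FNV-1 is implemented from its published 32-bit constants and truncated to 16 bits so both functions share a range; the key set is synthetic random "words" since a dictionary file isn't assumed.)

```python
import random

# --- Pearson, 16-bit variant (two passes over the key) ---
_rng = random.Random(0)
TABLE = list(range(256))
_rng.shuffle(TABLE)          # stand-in for a fixed published table

def pearson16(data: bytes) -> int:
    result = 0
    for i in range(2):
        h = i
        for b in data:
            h = TABLE[h ^ b]
        result = (result << 8) | h
    return result

# --- FNV-1, 32-bit constants, truncated to 16 bits ---
def fnv1_16(data: bytes) -> int:
    h = 0x811C9DC5
    for b in data:
        h = ((h * 0x01000193) & 0xFFFFFFFF) ^ b
    return h & 0xFFFF

def count_collisions(hash_fn, keys):
    """Number of keys landing on an already-occupied hash value."""
    seen, collisions = set(), 0
    for k in keys:
        h = hash_fn(k)
        if h in seen:
            collisions += 1
        seen.add(h)
    return collisions

# synthetic stand-in for a dictionary: distinct random lowercase words
rng = random.Random(1)
keys = {bytes(rng.randrange(97, 123) for _ in range(rng.randrange(3, 12)))
        for _ in range(20000)}

for name, fn in [("pearson16", pearson16), ("fnv1/16", fnv1_16)]:
    print(name, count_collisions(fn, keys))
```

Swapping in an actual English word list for `keys` would answer the question posed above directly.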

  • What's the question? Sum it up. – RubberDuck Jun 11 '16 at 15:51
  • The question was stated clearly: "I would be very interested to see an analysis of Pearson hashing as compared to other more commonly-used algorithms..." It is not a yes or no question. It requires the kind of programming that was done in Question 49550. I post this as a question in case someone can answer it more easily than I can. – David Spector Jun 11 '16 at 16:06
  • That's not a question. It's a request for a scholarly article to be written here. – RubberDuck Jun 11 '16 at 16:07
  • Was Question 49550 "scholarly"? Good answers were already written here. What is wrong with "scholarly"? Education is the basis of all practical work in software engineering, which is the topic of this Forum. – David Spector Jun 11 '16 at 16:11
  • This isn't a forum @DavidSpector. It's a question and answer site. Your post had the potential to be an excellent question, but you need to narrow it down and ask an answerable question. Good questions have answers. – RubberDuck Jun 11 '16 at 17:14
  • Well, first I wanted to ask the OP to add Pearson Hashing in a comment to his great answer. But I didn't have enough points or whatever to post a comment. Next, I didn't have enough points to give an answer to that question. So I did the next possible thing, which was to ask a question. Why all the objections? – David Spector Jun 11 '16 at 20:27
  • Not a duplicate of other postings, because they do not include Pearson Hashing. I posted the question because Pearson Hashing is fundamentally different from other commonly-used functions. Its unique use of a permutation table reduces collisions no matter what the set of keys. I was hoping that the person who performed the other tests at that "possible duplicate" posting would want to add Pearson Hashing to the list of functions tested. – David Spector Jun 23 '16 at 13:09
  • This is not a discussion forum and I was not calling for a discussion but actual testing. I have never seen Pearson Hashing given a fair trial, only for the reason that the common algorithms are more familiar. Actual comparison testing would reveal whether it is actually superior to the other functions, which actually form one family with one underlying but parameterized algorithm. – David Spector Jun 23 '16 at 13:12
  • Even if Pearson Hashing is slower than other techniques, for very large hash tables, its better collision avoidance might be desired in practical use. How can we know without actual statistics? Pearson Hashing will never have a chance to catch on until it is compared with other functions. – David Spector Jun 23 '16 at 13:14