
I have no idea how I managed to get this far in life without ever really grasping this, but as it happens I'm still very confused about the concept of a hash function. I did some googling/wikipedia-ing, and here's what I get:

  • hash tables are nice because the index can be stored in a smaller array
  • hash functions are kind of like uniform random number generators that always have the same seed but not really (and I don't get why)

But here are the parts that are still very ambiguous to me:

  • Why use a hash function and not an RNG with a set seed and array size?
  • Why is it ok to have "collisions", which from what I understand is two keys pointing to the same index? I mean don't we rely on that stuff to be reliable?
  • Why do we mod with array size at the end? Can't we just restrict the range of the hash function?
  • What is the difference between a hash table and a pointer?
  • If we know the number of unique keys, why not just use direct addressing? What is so great about this uniform distribution of keys?
  • In the case that the number of unique keys is not known, exactly how is that calculated in the end? It seems like something really clever is supposed to happen here, but short of just counting the number of entries in the table (and somehow accounting for collisions) I don't quite see it. Also, shouldn't we count the number of unique keys before even writing the hash function, since we have to determine the array size?

Thanks for any insight!

Y. S.
  • The answer below seems correct, but far too advanced for anybody with these questions. A random function should give a different value every time you execute it, i.e.: h(5)=20, h(5)=12, h(5)=1032, ... If you guarantee that your 'random' function at least gives the same value for the same input then it's a hash, like: h(5)=20, h(5)=20, h(6)=7354, h(6)=7354, h(7)=89, ... – Rob Apr 29 '15 at 17:34
  • haha yeah I think you addressed one of my big questions (if the main goal of a hash function is to construct a repeatable uniform distribution then why not use an RNG with a set seed) but understanding the security concerns also helps explain why an actual implementation is more complicated. Thanks for the explanation! – Y. S. Apr 29 '15 at 17:40
  • The security concerns are about the ability to find collisions. You don't want a hash of "Rob Fielding is allowed to use the bazooka" to hash to the same value as "Mohammad Atta is allowed to use the bazooka". – Rob Apr 29 '15 at 17:42
  • hmm why not? (as opposed to two more random strings?) – Y. S. Apr 29 '15 at 17:44
  • As an example of how exactly that scenario could happen... say that your function works on the string from left to right, and the end hash is effectively the sum of the chars "llowed to use the bazooka". Everybody is now allowed to use the bazooka. If I try to change the input message, it should resolve to completely random garbage. It should be too hard to come up with a valid message that hashes with the original. – Rob Apr 29 '15 at 17:48
  • Java's string implementation used to hash the first 20 chars for instance. So if the authorization had the name of the person allowed to use the bazooka on the end, then everybody can use the bazooka. (Using the real Java string hash method that existed at the time -- not a cryptographically secure hash). – Rob Apr 29 '15 at 17:49
  • hmm ok so let me try to clear up my mental picture now. If I imagine a hash table as implementing a dictionary in python, in this scenario, "is allowed to use a bazooka" is the key and "Rob Fielding" and "Mohammad Atta" are values? In which case they are confusable because they have the same keys? That kind of makes sense. But this seems like a user design choice and not a choice of a good collision-avoiding hash function? – Y. S. Apr 29 '15 at 17:52
  • Yes. You have it. – Rob Apr 29 '15 at 17:56
  • So my original understanding of collisions was super muddled because it seemed like collisions were not being handled at all but this new shiny CS book suggests that collisions are dealt with (though not always efficiently), possibly through chaining, which is a little more satisfying. So I can see why a bad collision-happy hash function is computationally bad but I don't see why it's security bad? Knowing the hash function (or knowing the RNG seed) sounds bad for security, but it seems somewhat unrelated to collisions to me.

    edit ok I just read your response :) thanks for all your comments!

    – Y. S. Apr 29 '15 at 17:56
  • Knowing the hash function is not a problem for security as long as the space is so large that you cannot find collisions at will. – Rob Apr 29 '15 at 17:57
  • I see I see, hence the modding with array size to make the actual index storable. Sweet, thanks for your help! – Y. S. Apr 29 '15 at 18:00
  • Note that 1 key maps to whatever value. "Rob Fielding is allowed..." and "Mohammad Atta is allowed..." are 2 separate keys; and they each map to some value - which are almost certainly different from each other. It is common to use schemes in creating keys to handle keys that have parts in them, like this keying scheme: "$userName $permission $item". – Rob Apr 29 '15 at 18:13
  • right, because the values are stored in separate data slots with different indices (hence different from pointers)? – Y. S. Apr 29 '15 at 18:22

1 Answer


A hash function is a pseudorandom function with a constant range. Ideally, one would like two central properties:

  1. The hash function should be easy (fast) to compute.
  2. The probability that two inputs $x,y$ hash to the same value is roughly $2^{-n}$, where $n$ is the output length.

The second property isn't stated rigorously; there are several ways to make it precise:

  1. We can consider an ensemble (collection) of hash functions, and then the probability is over the choice of the hash function.
  2. We can consider some distribution on pairs $x,y$.

The main difference between hash functions and pseudorandom number generators is that a hash function always returns the same value for the same input. This is important for applications such as hash tables and message verification:

  1. In hash tables, a hash function is used to choose the location at which an input is put. When you search for the same input, it is important that the hash function puts you in the same location. The pseudorandom property ensures that none of the cells gets to keep too many elements.

  2. In message verification, you are sent separately a message and its hash, and you verify the integrity of the message by comparing its hash to the one given to you. Here the pseudorandom property ensures that it is hard to cheat.

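The hash-table use case above can be sketched as follows. This is a minimal chained table, not a canonical implementation; the table size and the helper name `_bucket` are illustrative choices:

```python
# Minimal hash table with chaining: determinism of the hash is what
# guarantees that a lookup finds what an insert stored.
class ChainedHashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        # hash() is deterministic within one run, so the same key
        # always lands in the same bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # overwrite an existing key
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # collisions chain in the list

    def lookup(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.insert("rob", 1)
t.insert("atta", 2)
assert t.lookup("rob") == 1   # same key, same bucket, same value
```

Note that the pseudorandom property is what keeps the chains short: if keys spread uniformly over the buckets, each chain stays near the average load.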

Now to address your specific questions:

Why use a hash function and not an RNG with a set seed and array size?

A hash function has the property that each input is hashed to the same value, which is important in some applications (see above).

One way to guarantee the pseudorandom property is to have an ensemble of pseudorandom functions, and choose one at random; this is like choosing a seed at random. In practice, the seed is chosen at random once and for all, since this simplifies things.

Choosing the random seed ahead of time is OK for many applications. The main drawback is that the adversary gets to see the seed ahead of time, and can use this knowledge to try to cheat. When the hash function is easy to "break", this can actually cause problems. For example, it can allow denial-of-service attacks: an adversary sends a server data that all hashes to the same index, slowing things down.
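A toy version of that attack, using a deliberately weak fixed hash (the function and table size here are illustrative, not from any real server):

```python
# If the adversary knows the fixed hash -- here, key % table_size --
# they can craft keys that all collide, degrading the table to a list.
table_size = 8
buckets = [[] for _ in range(table_size)]

adversarial_keys = [i * table_size for i in range(100)]  # all ≡ 0 (mod 8)
for key in adversarial_keys:
    buckets[key % table_size].append(key)

# Every operation on these keys now scans one long chain:
# O(n) per lookup instead of the expected O(1).
assert len(buckets[0]) == 100
assert all(len(b) == 0 for b in buckets[1:])
```

With a seed the adversary cannot see, they cannot predict which keys collide, which is exactly why languages randomize their string hashes per process.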

In settings where there is no adversary (say, a hash table used in your own program), a hash function with a fixed seed doesn't suffer from this drawback.

Why is it ok to have "collisions", which from what I understand is two keys pointing to the same index? I mean don't we rely on that stuff to be reliable?

Collisions are unavoidable, unless you know the input ahead of time. Indeed, if you have more potential inputs than possible outputs (which is usually the case for hash functions), then collisions are bound to occur.

Consider the idealized hash function in which the hash of any given input is chosen completely randomly (but is fixed); such a function is not efficiently computable, but otherwise is an excellent hash function. Even such a function will have collisions. The goal in constructing a hash function is to imitate this behavior while keeping the function easily computable.

Why do we mod with array size at the end? Can't we just restrict the range of the hash function?

You don't want to construct a specialized hash function for each array size. Rather, you obtain such a hash function by taking a canonical hash function and then modding with the array size.

Another way to look at it is that you are restricting the range of the hash function: you construct a hash function with restricted range by taking a canonical hash function and reducing its output modulo the array size.
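Concretely, one canonical hash serves every table size. SHA-256 is used below only as an example of a fixed, wide-range hash; any canonical hash would do:

```python
import hashlib

def canonical_hash(key: str) -> int:
    # One fixed hash with a huge range (here 0 .. 2**256 - 1).
    return int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")

def index_for(key: str, array_size: int) -> int:
    # Restrict the range by modding with the array size.
    return canonical_hash(key) % array_size

# The same canonical hash works for any array size:
i8 = index_for("hello", 8)
i1000 = index_for("hello", 1000)
assert 0 <= i8 < 8 and 0 <= i1000 < 1000
```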

What is the difference between a hash table and a pointer?

A hash table is a data structure for storing entries (whatever they are), supporting operations such as insertion, deletion, and lookup. This is rather different from a pointer.

If we know the number of unique keys, why not just use direct addressing? What is so great about this uniform distribution of keys?

Even if you know the number of unique keys in advance, you don't necessarily know what they are, and without this information you can't use direct addressing.

Even if you did have a list of unique keys in advance, it is not always easy to construct an efficient direct addressing scheme. Imagine that the input is the set of all binary trees on at most 20 vertices. How do you map such a binary tree to an integer? While such schemes are definitely possible, often it is faster to use a hash function; the overhead due to collisions is negligible compared to the cost of computing a direct addressing scheme.
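To make the binary-tree example concrete: in Python a tree encoded as nested tuples is hashable as-is, so no bijection from trees to integers is ever designed. The `(value, left, right)` encoding below is just one possible choice:

```python
# Structured keys hash without a direct-addressing scheme:
# equal trees produce equal hashes, so they index the same slot.
t1 = (1, (2, None, None), (3, None, None))   # (value, left, right)
t2 = (1, (2, None, None), (3, None, None))

table_size = 64
assert hash(t1) == hash(t2)
assert hash(t1) % table_size == hash(t2) % table_size
```

Designing an injective map from all such trees to a dense range of integers would take real work; the hash costs one traversal of the key.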

In the case that the number of unique keys are not known, exactly how is that calculated in the end? It seems like something really clever is supposed to happen here but short of just counting the number of entries in the table (and somehow accounting for collisions) I don't quite see it. Also, shouldn't we count the number of unique keys before even writing the hash function, since we have to determine the array size?

If you don't know ahead of time how many unique keys you have, then you use a dynamic hash table instead of a static one. A dynamic hash table can grow in size, though this is an expensive operation. The hash table is initialized at some fixed size, and once it gets too full, it grows in size; if it gets too empty, it shrinks. For an example, look up the Python implementation of dictionaries (dictionaries are Python's name for hash tables).
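A sketch of the growing half of that scheme follows. The 2/3 load-factor threshold mirrors CPython's dict; everything else is a simplified illustration, not the real implementation:

```python
# Dynamic hash table: doubles its array when the load factor
# would exceed 2/3, rehashing every stored entry.
class DynamicHashTable:
    def __init__(self):
        self.size = 8
        self.count = 0
        self.buckets = [[] for _ in range(self.size)]

    def insert(self, key, value):
        if (self.count + 1) / self.size > 2 / 3:
            self._grow()
        bucket = self.buckets[hash(key) % self.size]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1

    def _grow(self):
        # Every entry must be rehashed into the larger array --
        # this is why growing is an expensive operation.
        old = self.buckets
        self.size *= 2
        self.buckets = [[] for _ in range(self.size)]
        for bucket in old:
            for key, value in bucket:
                self.buckets[hash(key) % self.size].append((key, value))

    def lookup(self, key):
        for k, v in self.buckets[hash(key) % self.size]:
            if k == key:
                return v
        raise KeyError(key)

t = DynamicHashTable()
for i in range(100):
    t.insert(i, i * i)
assert t.lookup(7) == 49
assert t.size > 8   # the table grew as items were added
```

Because each doubling rehashes everything, growth costs O(n), but it happens rarely enough that inserts remain O(1) amortized.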

Yuval Filmus
  • Awesome, this explains most of my confusion! I think my mental picture was totally wrong; I was thinking of a hash function as essentially a lookup table for keys but there is in fact no "table", but rather just a serialization function. This explains why repeatability is so necessary.

    I'm also reading Cormen/Leiserson/Rivest/Stein and they mentioned chaining as a way of handling collisions which makes that part more clear (though I guess it's not really a preferred strategy since it has bad worst case complexity?)

    Anyway thanks so much for your very detailed answer!

    – Y. S. Apr 29 '15 at 17:41