How are hash table's values stored physically in memory?

Question

How are hash table's values stored in memory such that space is efficiently used and values don't have to be relocated often?

My current understanding (could be wrong):

Let's say I have 3 objects stored in a hash table. Their hash functions generate these values:

0
10
20

I would presume that the pointers of these objects would not be stored at the following memory addresses because there would be huge gaps between them:

startOfHashTable + 0
startOfHashTable + 10
startOfHashTable + 20

The Wikipedia article on hash tables says that the "index" is computed as such:

hash = hashfunc(key)
index = hash % array_size

So in my example, the indices would be:

0 % 3 = 0
10 % 3 = 1
20 % 3 = 2

This gets rid of the huge gaps that I mentioned before. Even with this modulo scheme, there are problems when you add more objects to the hash table. If I add a fourth object to the hash table, I would need to apply % 4 to get the index. Wouldn't that invalidate all the % 3's that I did in the past? Would all the objects placed with % 3 need to be relocated to their % 4 locations?

score 16 · Answer 1 · edited Oct 05 '15 at 06:00

The entries of a hash table are stored in an array. However, you have misunderstood the application of the modulo operator to the hash values. If the hash table is stored in an array of size $n$, then the hash function is computed modulo $n$, regardless of how many items are currently stored in the table. So, in your example, if you were storing the items in an array of size 6, the three items with hash values 0, 10 and 20 would be stored at locations 0, 4 and 2, respectively. If you added a fourth element with hash value, say, 31, that would be stored at location 1, without needing to move any of the first three items. If your hash table was becoming full and you wanted to move it into a bigger array, then you would need to recalculate the locations of all the items in the table and move them appropriately.
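As a minimal sketch of the point above, the index depends on the array's capacity, not on how many items it currently holds. The helper names `slot` and `build_table` are made up for illustration, and collision handling is deliberately left out.

```python
def slot(hash_value, array_size):
    """Map a hash value to an index; array_size is the capacity, not the item count."""
    return hash_value % array_size


def build_table(hash_values, array_size):
    """Place each hash value in its slot (this sketch assumes no collisions occur)."""
    table = [None] * array_size
    for h in hash_values:
        i = slot(h, array_size)
        assert table[i] is None, "collision handling is out of scope here"
        table[i] = h
    return table


# Three items in an array of size 6 land at indices 0, 4 and 2.
table = build_table([0, 10, 20], 6)

# A fourth item with hash 31 goes to index 1; nothing else moves.
table[slot(31, 6)] = 31
print(table)  # [0, 31, 20, None, 10, None]

# Only when the array itself is replaced are all indices recomputed.
bigger = build_table([0, 10, 20, 31], 12)
print([slot(h, 12) for h in (0, 10, 20, 31)])  # [0, 10, 8, 7]
```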

answered Feb 27 '15 at 00:47 by David Richerby · edited Oct 05 '15 at 06:00 by Tom van der Zanden

So you're saying hash tables are created with an estimated potential size and the items are only relocated when you need to increase the size... So it doesn't matter if a hash function has uniform distribution. For example, hash values of 0, 5, and 10 are uniformly distributed, but when inserted into a hash table of potential size 5, they all collide in bucket 0. It would be better to say the hash % table size should be uniformly distributed, not the hash itself. – Pwner Feb 27 '15 at 01:12
@Pwner All of that is correct, yes. – David Richerby Feb 27 '15 at 01:37
How is it possible to create a uniformly distributed hash % tableSize when tableSize can change? The hash values of 0, 5, and 10 create many collisions when the table size is 5, but have no collisions when the table size is 20. – Pwner Feb 27 '15 at 02:44
@Pwner Keep in mind that hashtables only have expected constant-time operations, if that. But only if the hash function is (approximately) uniform. – Raphael Feb 27 '15 at 07:41
@Pwner Dynamic size may not be studied as intensely because part of the premise is to use arrays which have static size. Afaik, there is no trivial or efficient way to move from a smaller to a larger table so most implementations amortise the cost away by growing rarely, i.e. by doubling size in each step. – Raphael Feb 27 '15 at 07:42
@Pwner The distribution isn't literally uniform -- but you would aim for close to uniform. – David Richerby Feb 27 '15 at 08:52
@Raphael: with certain growing/shrinking schemes (i.e. geometric/exponential resizing), the cost of operations in a hash table isn't just an expected cost, but actually an amortised cost. The proof is too complicated to include in a comment, but this means the average number of times the elements have to be reinserted due to resizing is a constant multiple of the size of the hashtable over all operations. The maximum and minimum amount of wasted space at any point in time can also be proven to be a constant factor of the size of the hashtable. – Lie Ryan Feb 28 '15 at 03:56

score 9 · Answer 2 · edited Apr 13 '17 at 12:48

Hash-tables usually do waste space. Many algorithms do, since time-space trade-offs are common, but they usually hide it better :). Like other algorithms, hash-tables do it to get better time performance.

The first point is that you try to avoid collisions in your hash-table, because that keeps the access time cost constant (but collisions are usually allowed and can be dealt with, thus allowing several items to be in the same entry, at time cost). The second point is that you try to avoid large unused gaps because that costs memory. The third point is that you avoid changing your hashing function (hence also the table size) because it requires reorganizing the whole table, which has a large time cost.

Unfortunately, the fewer gaps you have, the more likely a new hash entry will cause a collision. A good hash function, for a given data set, will limit the likelihood of collision even with better use of available index space.
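To make the trade-off concrete, here is a small sketch that counts collisions and unused slots for the hash values 0, 5 and 10 (the ones used in the comments on the first answer) under different array sizes. The function name `occupancy` is invented for this illustration, and the hash is assumed to be a plain modulo.

```python
from collections import Counter


def occupancy(hash_values, array_size):
    """Return (collisions, empty_slots) when items are placed at hash % array_size."""
    counts = Counter(h % array_size for h in hash_values)
    collisions = sum(c - 1 for c in counts.values())  # extra items sharing a slot
    empty_slots = array_size - len(counts)            # slots that hold nothing
    return collisions, empty_slots


for size in (3, 5, 20):
    print(size, occupancy([0, 5, 10], size))
# 3 (0, 0)    no collisions and no gaps: this size happens to suit this data
# 5 (2, 4)    everything lands in slot 0
# 20 (0, 17)  no collisions, but most of the array is unused
```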

Actually, you should consider that there are two kinds of hash tables: static ones and dynamic ones.

For static ones, the data to be hashed does not change, so you can try to find a hash function with no collision at all for that data set. That is called a perfect hash. But the best is a minimal perfect hash, which achieves the result without gaps.
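As a rough illustration only (real perfect-hash constructions are considerably more sophisticated), one can brute-force a collision-free function from the simple family `key % m` when the key set is fixed and its keys are distinct. The function name `find_perfect_modulus` and the `max_size` limit are assumptions made for this sketch.

```python
def find_perfect_modulus(keys, max_size=1000):
    """Return the smallest array size m >= len(keys) such that key % m has no
    collisions on this fixed set of distinct keys, or None if none is found."""
    for m in range(len(keys), max_size + 1):
        if len({k % m for k in keys}) == len(keys):
            return m
    return None


print(find_perfect_modulus([0, 10, 20]))    # 3: perfect and minimal (no gaps)
print(find_perfect_modulus([1, 3, 10000]))  # 4: perfect, but one slot stays empty
```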

But that is not feasible when the data to be hashed changes dynamically, within a large set of possibilities. Then you cannot avoid collisions, but you try to limit them by having enough gaps.

There are a variety of techniques to manage that differently, adapting the table size to the number of values being hashed, growing the table when there are many collisions, or reducing it when there are too large gaps. But this has to be handled very carefully, using exponential table variations, so as to limit the impact of table reorganization on the overall cost of using the hash-table.
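The effect of exponential growth can be sketched by counting how many item moves resizing causes. This assumes a simplified policy (the capacity doubles only when the array is completely full, and the table never shrinks); real implementations typically resize at some load-factor threshold instead. The name `moves_with_doubling` is hypothetical.

```python
def moves_with_doubling(n_inserts, initial_capacity=1):
    """Count item moves caused by resizing when capacity doubles on a full table."""
    capacity, size, moves = initial_capacity, 0, 0
    for _ in range(n_inserts):
        if size == capacity:
            moves += size  # every existing item is rehashed into the new array
            capacity *= 2
        size += 1
    return moves, capacity


moves, capacity = moves_with_doubling(1000)
print(moves, capacity)  # 1023 1024
```

In this run the total reorganization work (1 + 2 + 4 + ... + 512 = 1023 moves) stays below twice the number of insertions, which is the sense in which the resizing cost is spread thinly over all operations.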

This is intended as an intuitive introduction. For more technical details, and references, you may look at answers to this question: (When) is hash table lookup O(1)?. Hash-tables and hashing are an important topic, with many variations.

score 3 · Answer 3 · answered Feb 27 '15 at 08:51

A good way to look at hash tables is like a lookup table with infinite index range (well, not really infinite, you're still constrained by the value limit of the key you're using).

Let's say you're trying to store some specific values of sqrt(x) in a lookup table, where x is an integer. It would go something like this:

[1] = 1
[3] = 1.732
[10000] = 100

This makes for very cheap square rooting since, instead of doing the expensive calculation, you can simply fetch the value from the array. It is, however, a very inefficient use of memory because [2] and [4 - 9999] are empty.

The hash function comes to the rescue. Its purpose in this context is to transform the index into something that actually fits in a reasonably sized array. For example, it could do this:

(1) = [5] = 1
(3) = [2] = 1.732
(10000) = [3] = 100

Now all 3 values fit in an array of size 6.

How does the hash function achieve this? The most basic hash function is (Index % ArraySize): the modulo operator divides the index you chose by the size of the array and gives you the remainder, which is always smaller than the array size.

But what if multiple indexes hash to the same result? This is called a hash collision, and there are different ways of dealing with it. The simplest is to store each value along with its original index in the array; if the slot given by the hash function is already taken, go forward by 1 until an empty slot is found. When retrieving a value, go to the location given by the hash function and step through the elements until the one with the matching original index is found.

This is why a good hash function is also great at dispersing the data: whether the indexes coming in are sequential or random, the hash results should be as widely dispersed as possible to keep the cost of accessing data relatively constant.
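The scheme just described is commonly called linear probing. Below is a minimal sketch of it, assuming integer keys, the plain `key % capacity` hash, and wrap-around at the end of the array (the wrap-around is an addition not spelled out above). The class name `LinearProbingTable` is made up, and deletion and resizing are left out.

```python
class LinearProbingTable:
    """Minimal open-addressing table; no deletion, no resizing."""

    def __init__(self, capacity):
        self.slots = [None] * capacity  # each slot holds (key, value) or None

    def _probe(self, key):
        """Yield slot indices starting at key % capacity, wrapping around once."""
        n = len(self.slots)
        start = key % n
        for step in range(n):
            yield (start + step) % n

    def put(self, key, value):
        for i in self._probe(key):
            # Take the first empty slot, or overwrite an existing entry for this key.
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                break  # an empty slot means the key was never inserted
            if self.slots[i][0] == key:
                return self.slots[i][1]
        raise KeyError(key)


table = LinearProbingTable(6)
table.put(1, 1.0)        # 1 % 6 == 1, stored in slot 1
table.put(7, 2.646)      # 7 % 6 == 1 is taken, so it moves forward to slot 2
table.put(10000, 100.0)  # 10000 % 6 == 4, stored in slot 4
print(table.get(7))      # 2.646
```

Storing the original key next to the value is what lets `get` tell the entries for 1 and 7 apart even though both hash to slot 1.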

Of course, the bigger the underlying array, the fewer collisions you're going to get, so it's a tradeoff between speed and size efficiency. Modern hash tables usually fill up to ~70% while having fewer than 10 collisions per access. Along with the hash function, this means each data fetch costs ~20 cycles, which is (for some purposes) a good compromise between speed (lookup table) and efficiency (list).
