(This is part 3 of a series on the bag/multiset data structure. Check out
parts 1 and 2.)
Improving the hash function
So our bag works per our tests. Let’s take a look at the naïve implementation
of hashing, at least for strings:
// This is the equivalent of `impl`ing a trait in Rust.
type String = Hashable;

// https://stackoverflow.com/a/8076436/3396324
String.prototype.hashCode = function (): number {
  return [...this].reduce((hash, character) => {
    hash = (hash << 5) - hash + character.charCodeAt(0);
    // The bitwise AND coerces the result to a 32-bit signed integer.
    return hash & hash;
  }, 0);
};
String.prototype.minimumHashValue = -0x80000000;
String.prototype.maximumHashValue = 0x7fffffff;

export default String;
One bit of good news: this hash function is not baked into the bag’s
implementation. We already discussed how it was desirable, from an Open–Closed
Principle point of view, to make hashing the responsibility of each type that
would like to be stored in a bag.
I don’t know enough about hashing algorithms to eyeball it in order to
evaluate whether this one is collision-resistant in common applications for
strings. But we can test it out empirically using a large SCOWL
dictionary.
With our hashCode() ported from Java per Stack Overflow, let’s bucket our
657,769 dictionary entries into 100 buckets and count how many values land in
each bucket. The mean is, of course, about 6,578. But the median is 5,830 and
the standard deviation is about 3,090. The smallest bucket has 4,677 entries
while the largest has 26,835 entries.
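Here’s roughly what that experiment looks like. This is just a sketch: it
assumes the SCOWL entries have already been loaded into a words array, and
that hashCode(), minimumHashValue, and maximumHashValue are defined as above.
It buckets by percentile, the way my naïve implementation does.
declare const words: string[]; // assumed: the loaded SCOWL dictionary

const NUM_BUCKETS = 100;
const counts: number[] = new Array(NUM_BUCKETS).fill(0);
const min = String.prototype.minimumHashValue;
const range = String.prototype.maximumHashValue - min + 1;

for (const word of words) {
  // Percentile-style bucketing, as in the naïve implementation.
  counts[Math.floor(((word.hashCode() - min) / range) * NUM_BUCKETS)] += 1;
}

const sorted = [...counts].sort((a, b) => a - b);
const mean = counts.reduce((a, b) => a + b, 0) / NUM_BUCKETS;
const stdDev = Math.sqrt(
  counts.reduce((sum, c) => sum + (c - mean) ** 2, 0) / NUM_BUCKETS
);
console.log({
  mean,
  median: (sorted[49] + sorted[50]) / 2,
  stdDev,
  min: sorted[0],
  max: sorted[NUM_BUCKETS - 1],
});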
Not great. Let’s try this with the MurmurHash3 algorithm. I will
simply steal the minified JavaScript from
here.
Now we have a median of about 6,571, a standard deviation of about 72, and a
minimum and maximum of about 6,424 and 6,781, respectively. That seems
better.
And this reinforces our decision to have the data we want to put in our bag
extend Hashable: We could test this particular hashing algorithm against a
dictionary of English words without worrying about the behavior for other
kinds of data. For other data types, some other means of hashing might be
better.
Improving load factor
If you have 100 buckets and 75 entries, the load factor is the ratio of
those two, 0.75. Different language implementations use slightly different
thresholds, but maximum load factors from 0.75 up to about 0.90 are typical
before performance suffers.
One of those surprising statistical factoids is that if you’re hosting a
party and want to ensure a 50% chance that two guests share the same
birthday, you need only invite 23 people; with 70 people, there’s a 99.9%
chance of a shared birthday (in each case assuming birthdays are uniformly
distributed).1 And so it is with hash tables, where collisions
become a problem sooner than you’d think.
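If you want to check those numbers, the chance that n guests all have
distinct birthdays is (365/365) × (364/365) × … × ((365 − n + 1)/365), and a
shared birthday is the complement of that. A quick sketch:
// Probability that at least two of `n` people share a birthday,
// assuming 365 equally likely birthdays.
function sharedBirthdayProbability(n: number): number {
  let allDistinct = 1;
  for (let k = 0; k < n; k++) {
    allDistinct *= (365 - k) / 365;
  }
  return 1 - allDistinct;
}

console.log(sharedBirthdayProbability(23)); // ≈ 0.507
console.log(sharedBirthdayProbability(70)); // ≈ 0.999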
To make sure load factor is in good shape, we need to pick a good starting
value, then resize our hashtable when the load factor gets too high.
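Sketched with chaining (each bucket an array) and the modulus bucketing we’ll
get to next, the resize check might look something like this. The names are
hypothetical, not from our actual implementation:
// Hypothetical sketch: grow the table once the load factor gets too high.
const MAX_LOAD_FACTOR = 0.75;

function maybeGrow(buckets: string[][], count: number): string[][] {
  if (count / buckets.length <= MAX_LOAD_FACTOR) {
    return buckets;
  }
  // A value's bucket depends on the bucket count, so every value must be
  // re-bucketed into the larger table.
  const grown = Array.from({ length: buckets.length * 2 }, (): string[] => []);
  for (const bucket of buckets) {
    for (const value of bucket) {
      const hash = value.hashCode(); // from our Hashable String above
      grown[((hash % grown.length) + grown.length) % grown.length].push(value);
    }
  }
  return grown;
}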
Buckets based on modulus
My naïve implementation added unneeded complexity: it chose a bucket by
calculating the hash’s percentile within the range of possible hash values
and multiplying that by the number of buckets. That’s more operations than
are needed. Using hashResult % numBuckets should distribute
the result evenly across the hash buckets. This also lets us simplify our
Hashable implementation.
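One caveat worth a sketch: JavaScript’s % operator returns a negative result
when the left operand is negative, and our hashCode() can go as low as
-0x80000000, so the result needs normalizing before it can index a bucket:
// Map a hash to a bucket index. A plain `hash % numBuckets` can be
// negative in JavaScript when `hash` is negative, so normalize it.
function bucketFor(hash: number, numBuckets: number): number {
  return ((hash % numBuckets) + numBuckets) % numBuckets;
}

console.log(bucketFor(-7, 100)); // 93, not -7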
Linked lists
My naïve implementation uses an array of buckets, with each bucket itself an
array.2 The usual implementation is to use a linked list. It is a
bit unclear from my reading whether that’s desirable per se, or if it is
simply because Donald Knuth said to do it that way in an example.3 I
think it’s because linked lists don’t do spooky things like reallocate
unexpectedly, and by the time a bucket’s contents would need to reallocate,
we should be re-hashing.
Open addressing
Why it’s good
Another way of building a hashtable is to use open addressing.
Remember how our buckets hold linked lists or arrays rather than the values
themselves? That’s called “chaining.” Doing that means you have to go to some
other place in the computer’s memory, and so can’t take advantage of
“locality of reference.” That term refers to the fact that CPUs cache memory
in chunks: accessing something close in memory to something you recently
accessed is often cheaper, because it may already have been pulled into the
cache along with the first thing.
So in practice, it is often faster to store the data directly in the array of
buckets. That is, you can have the buckets that make up the array actually
hold the values rather than holding pointers to linked lists or other arrays.
Dealing with collisions
But this raises its own problem. Remember that with an array or linked list,
collisions are easy to deal with: we just add a new value to the collection
stored in each bucket. But when there’s just that one array, we have to find
a different place to stick the value that had a collision, and we have to
think about how we’ll delete that value.
Insertion
Before:
We want to insert Z, which hashes to 20.
Current Value | A  | Q  | M  | empty |
---------------------------------------
Bucket        | 20 | 21 | 22 | 23    |
Standard insertion
The usual approach is to put the value in the next empty slot. Then when it’s
time to go looking for a value, you hash it, go to the appropriate bucket,
and if it’s not in that bucket, keep going along the array till you find the
value or an empty bucket. In other words, go to the bucket where you’d
expect the value to be if there had been no collision, then check the buckets
following it, where you might have put the value if there had been one.
Reaching an empty bucket means the value isn’t in the table: a value inserted
after a collision always goes into the first available slot, so there would
never be a gap between its expected bucket and wherever it actually landed.
After:
Current Value | A  | Q  | M  | Z  |
------------------------------------
Bucket        | 20 | 21 | 22 | 23 |
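Here’s a sketch of that probe loop (hypothetical names; undefined marks an
empty bucket):
// Linear-probe lookup sketch. Returns the bucket index or null.
function probeFor(
  table: (string | undefined)[],
  hash: number,
  value: string
): number | null {
  const start = ((hash % table.length) + table.length) % table.length;
  for (let i = 0; i < table.length; i++) {
    const bucket = (start + i) % table.length;
    if (table[bucket] === undefined) {
      return null; // a gap: the value can't be further along
    }
    if (table[bucket] === value) {
      return bucket; // found it
    }
  }
  return null; // the table is full and the value isn't in it
}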
Robin Hood hashing
You can also do what’s called Robin Hood hashing. In short, it means that if
you have a collision, as you look for an empty spot for your value, you keep
checking which value is more out of place—the one you’re inserting or the
one that’s already there. We rob from the rich to give to the poor, which is
where the name comes from.
Here’s the same example, but with some extra information:
Before:
We want to insert Z, which hashes to 20.
Current Value           | A  | Q  | M  | empty |
Correct bucket          | 19 | 19 | 21 | N/A   |
How out of place?       | 1  | 2  | 1  | N/A   |
-------------------------------------------------
How far off would Z be? | 0  | 1  | 2  | 3     |
-------------------------------------------------
Bucket                  | 20 | 21 | 22 | 23    |
We are trying to insert the value Z, which hashes to 20. Using Robin Hood
hashing would have us try to insert the value in Bucket 20. But that bucket
is occupied. And its current value A really should be in Bucket 19, so it’s
already one bucket out of place. Meanwhile, Z is zero buckets out of
place here at Bucket 20, so we aren’t going to exchange Z and A. A is
“poorer” because it is more out of place than Z is. So we don’t rob from
the poor to pay the rich here.
Now we come to Bucket 21, which contains Q, which is two buckets out of
place. Z is now one bucket out of place. No exchange there: again, it
would be robbing from the poor to pay the rich.
We come next to Bucket 22, which contains M, which is one bucket out of
place. But now Z is two buckets out of place. Let’s put Z in Bucket 22
and carry M forward. We rob from the richer M, which is only one bucket
out of place, to pay Z, which is two buckets out of place.
Finally, we get to Bucket 23, which is empty, so M goes there. The final
result is this:
After:
Current Value     | A  | Q  | Z  | M  |
Correct bucket    | 19 | 19 | 20 | 21 |
How out of place? | 1  | 2  | 2  | 2  |
----------------------------------------
Bucket            | 20 | 21 | 22 | 23 |
By following this algorithm, values will not wind up wildly out of place,
thereby reducing the search cost following a collision, at little extra cost
(either way you must probe to the first empty spot; but by switching the
values, you make searches faster on average).
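Here’s a sketch of that insertion logic. It’s hypothetical and standalone:
each entry carries its hash so displacement can be recomputed, and it assumes
the table still has at least one empty slot:
interface Entry {
  value: string;
  hash: number;
}

// Robin Hood insertion sketch. Assumes at least one empty slot remains.
function robinHoodInsert(table: (Entry | undefined)[], entry: Entry): void {
  const home = (hash: number) =>
    ((hash % table.length) + table.length) % table.length;
  let bucket = home(entry.hash);
  let displacement = 0; // how far `entry` is from its home bucket

  while (true) {
    const occupant = table[bucket];
    if (occupant === undefined) {
      table[bucket] = entry; // empty slot: done
      return;
    }
    const occupantDisplacement =
      (bucket - home(occupant.hash) + table.length) % table.length;
    if (occupantDisplacement < displacement) {
      // The occupant is "richer" (less out of place): take its slot
      // and carry it forward instead.
      table[bucket] = entry;
      entry = occupant;
      displacement = occupantDisplacement;
    }
    bucket = (bucket + 1) % table.length;
    displacement += 1;
  }
}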
Deletion
There’s still one hitch (and bear with me till the example in the next
paragraph): what if you insert a value, then have a collision and insert
another value in the next available bucket, but then delete the first value
(the one in its proper bucket)? Or, for that matter, delete any value that
displaced another value because of a collision?
Imagine B hashes to Bucket 22. Then H also hashes to Bucket 22, so it
goes in Bucket 23.
Before:
We want to delete B, which hashes to 22.
Current Value | B  | H  |
--------------------------
Bucket        | 22 | 23 |
Now you delete B. Later, you check to see whether H is in the hashtable.
It hashes to Bucket 22, but that bucket is empty. So you incorrectly conclude
that H isn’t in the hashtable, and report that back to the user.
↓ Problem!
Current Value | empty | H  |
-----------------------------
Bucket        | 22    | 23 |
There are two approaches often used in this situation.
Tombstones
First, you might replace the deleted entry with a tombstone, a special flag
that means “I deleted something here.” So you treat that bucket exactly as
you would a bucket that was filled with a value that didn’t match the one we
are searching for: it’s not empty, and so you look at the next one. This is
easy, but it means load factor doesn’t go down when you delete values: you
might have a hashtable with lots of tombstones just wasting space.
After:
Current Value | TOMBSTONE | H  |
---------------------------------
Bucket        | 22        | 23 |
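In a sketch, using a unique sentinel value as the flag (a hypothetical shape,
not our actual implementation):
// A unique sentinel meaning "something was deleted here."
const TOMBSTONE = Symbol("tombstone");
type Slot = string | typeof TOMBSTONE | undefined;

function deleteAt(table: Slot[], bucket: number): void {
  table[bucket] = TOMBSTONE; // not `undefined`: probes must continue past it
}

function contains(table: Slot[], home: number, value: string): boolean {
  for (let i = 0; i < table.length; i++) {
    const slot = table[(home + i) % table.length];
    if (slot === undefined) {
      return false; // a true gap ends the search
    }
    if (slot === value) {
      return true; // tombstones never match, so they just fall through
    }
  }
  return false;
}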
Left shift
Alternatively, you might shift values backward to fill in the gap. This
avoids using tombstones, but adds to the time complexity of deletion.
After:
Current Value | H  | empty |
-----------------------------
Bucket        | 22 | 23    |
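Here’s a sketch of that backward shift, again with hypothetical names. One
subtlety: you stop shifting when you hit an empty bucket or a value that’s
already in its home bucket, because moving that value back would put it
before the bucket where a lookup starts probing for it:
interface Entry {
  value: string;
  hash: number;
}

// Backward-shift deletion sketch.
function backwardShiftDelete(
  table: (Entry | undefined)[],
  bucket: number
): void {
  const home = (hash: number) =>
    ((hash % table.length) + table.length) % table.length;
  table[bucket] = undefined;
  let gap = bucket;
  let next = (bucket + 1) % table.length;
  // Keep shifting while the next entry exists and is out of place.
  while (table[next] !== undefined && home(table[next]!.hash) !== next) {
    table[gap] = table[next];
    table[next] = undefined;
    gap = next;
    next = (next + 1) % table.length;
  }
}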
Conclusion
So we have discussed some fixes here that will improve our implementation:
Improve the hash function.
Rehash above a specified load factor.
Use modulus in bucketing rather than a more complex algorithm.
Use open addressing with Robin Hood hashing and left shift on deletion.
I will implement those changes and describe updating the code in my next
post.
When I say the array is “in” the bucket, I actually mean that the bucket holds a pointer to the array, which is on the heap. So it’s really a pointer that is “in” the bucket. ↩︎