In my previous post, I talked about a bag
data structure, which is like a set that allows repeated occurrences of the
same value.
In this post I will discuss my naïve implementation that has the
goal of O(1) best case insertion, search, and deletion. You can follow
along by viewing the code at https://www.github.com/ethan826/bag.
In the next post I’ll discuss improvements to consider, mostly around
improving the hash function and considering load factor. For now we are using
techniques you wouldn’t want to use in production: among other things, we set
a numBuckets value to 100 and never adjust that; and our hashing method may
not be robust against collisions. I will also discuss testing in a future
post.
One other preliminary note: the easiest way to do this would be to wrap a
hashmap/object.
classBag<T> {
privatebag: Object<T,number>;
constructor() {
this.bag= {};
}
insert(value: T) {
if (this.bag.hasOwnProperty(value)) {
this.bag[value] +=1;
} else {
this.bag[value] =1;
}
}
// And so on...
}
I wanted to learn a bit about how hashtable-based data structures worked, so
I’ve rolled my own using arrays.
Generic bag
We want the bag to accept generic types, but we want to make sure that any
type we use can be hashed. For the time being, it makes the most sense to
require types to satisfy this requirement themselves: I don’t care what kind
of type you are so long as you implement the behavior we ask for. So
throughout this project, the type signature is T extends Hashable, where
Hashable is this:
// src/Hashable/index.ts
/**
* Represents a type that can be hashed to an integer value.
*/exportdefaultinterfaceHashable {
hashCode: () =>number;
maximumHashValue: number;
minimumHashValue: number;
}
So a type that extends Hashable is one that (1) can hash itself, and (2)
can tell you where that hashed value fits in with the range of possible
outputs from the hashing function. This is an example of asking “who needs to
know this?” Should Bag and its collaborators know how to hash every
possible type? Should a person creating a new type have to write code in
Bag or one of its collaborators? No. This is also an example of the O in
SOLID.
That allows the HashTable class to work out what bucket the value belongs
in:
// src/Bag/HashTable.ts
/**
* Determine what bucket—meaning the index of `hashTable`—the `value` belongs
* in. This is the result of taking the hashcode for `value` and determining
* proprotionately how far between the minimum and maximum `hashCode` values
* that `value` falls, then finding what integer that falls into from zero to
* `numBuckets`.
*
* @param value - The value to determine the bucket for.
* @returns The bucket number, meaning the index for `hashTable`.
*/privatecomputeBucket(value: T):number {
constsizeOfRange=value.maximumHashValue-value.minimumHashValue;
constdistanceFromFloor= (value.hashCode() -value.minimumHashValue) /sizeOfRange;
return Math.floor(distanceFromFloor*this.numBuckets);
}
This is also an example of “describing what kinds of arguments are
appropriate in precisely the terms that make sense” that I referred to in
an earlier post.
Design
Recall that hash tables work a bit like libraries. You look up a book by call
number, which lets you end up standing in front of the right bookshelf. Then
you look at the books that are present to find the one you want. For our
implementation,
Bag is the library: it contains the bookshelves and the card catalog, and
keeps track of the overall status of things (for example, it might track
how many total books are present).
HashTable is the card catalog: it helps you go from knowing what book
title you want to standing in front of the correct shelf.
Bucket is the shelf: it’s what actually holds a small subset of the
library’s books, and the thing from which books are actually inserted and
removed.
Bag
Bag exposes methods like insert(value: T), delete(value: T),
deleteAll(value: T), and count(value: T). Bag—after doing its own
bookkeeping (for example increasing the value returned by
length()—delegates all these methods to the same methods on Bucket. Bag
also directly delegates toArray() to HashTable.
HashTable
HashTable maintains the array of buckets and knows how to fetch a Bucket
based on a value. It simply organizes buckets and passes them up to Bag.
It calls hashCode() on values, then figures out proportionately where that
fits given the number of buckets (if the value hashes to 9,999 in a range of
10,000 possible hash codes, and there are ten buckets, HashTable figures
out that we want the tenth bucket).
Bucket
In an ideal world a Bucket would contain only a few values. If there were
only one bucket or if the hashing algorithm always returned the same value,
all entries in the HashTable (and therefore the Bag) would be in a single
bucket. That’s why the worst-case performance for common operations is
O(n).
Bucket exposes a similar public API as Bag does, and that’s because Bag
uses HashTable to find the right Bucket, then asks Bucket to do the
work of inserting, deleting, counting, etc.
Bucket also hides the detail mentioned in the previous
post, namely avoiding the presumption that something is intended by the difference between a value that’s missing from the data structure versus a value that reports zero occurrences.1
Injectability
Bag allows us to inject the HashTable implementation. A good refactor
would be to define HashTable as an interface and make the parameter generic
over HashTable interfaces. Similarly, HashTable allows us to inject
Bucket’s constructor. The same refactor applies here.
The goal with this is to satisfy the D of SOLID, which brings along
benefits when testing, as I shall take up in a future post.
I arrived at this overall architecture by applying the extract class
refactor two times. What drove me to that was that I realized it was
difficult but important to test aspects that would have been part of the
private API of Bag if it had been one giant class. The fact that something
important to test is inside a private API is a sign that you should extract
that logic into collaborator and probably inject that dependency.
Equality
One other thing to notice here is that we are not preserving object identity.
We are treating objects that are equal and that hash to the same bucket as
being indistinguishable. In other words, as long as two objects have
structural equality, we will allow all copies to be garbage collected. This
is the same thing that happens in a set, but not the same thing that happens
in an array. Using DDD terminology, a bag of this kind would be an
inappropriate way to store entities but would be suitable for value objects.
See my earlier post about avoiding differences that don’t mean anything. ↩︎