Implementing a bag

In my previous post, I talked about a bag data structure, which is like a set that allows repeated occurrences of the same value.

In this post I will discuss my naïve implementation that has the goal of O(1) best case insertion, search, and deletion. You can follow along by viewing the code at https://www.github.com/ethan826/bag.

In the next post I’ll discuss improvements to consider, mostly around improving the hash function and considering load factor. For now we are using techniques you wouldn’t want to use in production: among other things, we set a numBuckets value to 100 and never adjust that; and our hashing method may not be robust against collisions. I will also discuss testing in a future post.

One other preliminary note: the easiest way to do this would be to wrap a hashmap/object.

class Bag<T> {
  private bag: Object<T, number>;

  constructor() {
    this.bag = {};
  }

  insert(value: T) {
    if (this.bag.hasOwnProperty(value)) {
      this.bag[value] += 1;
    } else {
      this.bag[value] = 1;
    }
  }

  // And so on...
}

I wanted to learn a bit about how hashtable-based data structures worked, so I’ve rolled my own using arrays.

Generic bag

We want the bag to accept generic types, but we want to make sure that any type we use can be hashed. For the time being, it makes the most sense to require types to satisfy this requirement themselves: I don’t care what kind of type you are so long as you implement the behavior we ask for. So throughout this project, the type signature is T extends Hashable, where Hashable is this:

// src/Hashable/index.ts

/**
 * Represents a type that can be hashed to an integer value.
 */
export default interface Hashable {
  hashCode: () => number;
  maximumHashValue: number;
  minimumHashValue: number;
}

So a type that extends Hashable is one that (1) can hash itself, and (2) can tell you where that hashed value fits in with the range of possible outputs from the hashing function. This is an example of asking “who needs to know this?” Should Bag and its collaborators know how to hash every possible type? Should a person creating a new type have to write code in Bag or one of its collaborators? No. This is also an example of the O in SOLID.

That allows the HashTable class to work out what bucket the value belongs in:

// src/Bag/HashTable.ts

/**
  * Determine what bucket—meaning the index of `hashTable`—the `value` belongs
  * in. This is the result of taking the hashcode for `value` and determining
  * proprotionately how far between the minimum and maximum `hashCode` values
  * that `value` falls, then finding what integer that falls into from zero to
  * `numBuckets`.
  *
  * @param value - The value to determine the bucket for.
  * @returns The bucket number, meaning the index for `hashTable`.
  */
private computeBucket(value: T): number {
  const sizeOfRange = value.maximumHashValue - value.minimumHashValue;
  const distanceFromFloor =
    (value.hashCode() - value.minimumHashValue) / sizeOfRange;
  return Math.floor(distanceFromFloor * this.numBuckets);
}

This is also an example of “describing what kinds of arguments are appropriate in precisely the terms that make sense” that I referred to in an earlier post.

Design

Recall that hash tables work a bit like libraries. You look up a book by call number, which lets you end up standing in front of the right bookshelf. Then you look at the books that are present to find the one you want. For our implementation,

  • Bag is the library: it contains the bookshelves and the card catalog, and keeps track of the overall status of things (for example, it might track how many total books are present).
  • HashTable is the card catalog: it helps you go from knowing what book title you want to standing in front of the correct shelf.
  • Bucket is the shelf: it’s what actually holds a small subset of the library’s books, and the thing from which books are actually inserted and removed.

Bag

Bag exposes methods like insert(value: T), delete(value: T), deleteAll(value: T), and count(value: T). Bag—after doing its own bookkeeping (for example increasing the value returned by length()—delegates all these methods to the same methods on Bucket. Bag also directly delegates toArray() to HashTable.

HashTable

HashTable maintains the array of buckets and knows how to fetch a Bucket based on a value. It simply organizes buckets and passes them up to Bag. It calls hashCode() on values, then figures out proportionately where that fits given the number of buckets (if the value hashes to 9,999 in a range of 10,000 possible hash codes, and there are ten buckets, HashTable figures out that we want the tenth bucket).

Bucket

In an ideal world a Bucket would contain only a few values. If there were only one bucket or if the hashing algorithm always returned the same value, all entries in the HashTable (and therefore the Bag) would be in a single bucket. That’s why the worst-case performance for common operations is O(n).

Bucket exposes a similar public API as Bag does, and that’s because Bag uses HashTable to find the right Bucket, then asks Bucket to do the work of inserting, deleting, counting, etc.

Bucket also hides the detail mentioned in the previous post, namely avoiding the presumption that something is intended by the difference between a value that’s missing from the data structure versus a value that reports zero occurrences.1

Injectability

Bag allows us to inject the HashTable implementation. A good refactor would be to define HashTable as an interface and make the parameter generic over HashTable interfaces. Similarly, HashTable allows us to inject Bucket’s constructor. The same refactor applies here.

The goal with this is to satisfy the D of SOLID, which brings along benefits when testing, as I shall take up in a future post.

I arrived at this overall architecture by applying the extract class refactor two times. What drove me to that was that I realized it was difficult but important to test aspects that would have been part of the private API of Bag if it had been one giant class. The fact that something important to test is inside a private API is a sign that you should extract that logic into collaborator and probably inject that dependency.

Equality

One other thing to notice here is that we are not preserving object identity. We are treating objects that are equal and that hash to the same bucket as being indistinguishable. In other words, as long as two objects have structural equality, we will allow all copies to be garbage collected. This is the same thing that happens in a set, but not the same thing that happens in an array. Using DDD terminology, a bag of this kind would be an inappropriate way to store entities but would be suitable for value objects.


  1. See my earlier post about avoiding differences that don’t mean anything. ↩︎