# Efficiency analysis of single-dimensional hashing

- Due Jul 20 by 11:59pm
- Points 100
- Available May 30 at 12am - Jul 20 at 11:59pm about 2 months

### Overview

In this offering of CMSC420, we do not have projects on hashing. There is a multitude of mostly academic reasons for which the instructor has decided against them, most of which stem from the fact that the Java programming language (as well as others) has pretty much solved the question of hashing through built-in constructs:

- The Java Object class that lies at the base of its hierarchy contains a method called hashCode() which, when Object is subclassed, is overridden appropriately to meet the specifications of different classes; these implementations have been vetted by subject matter experts for decades now. So the language solves the first question of hashing, that of generating an approximately uniform pseudo-random hash function.
- The Java programming language also contains an implementation of the Map ADT known as HashMap. Coupled with TreeMap, a Java programmer will never have to delve deeper than that level in order to achieve amortized constant time for lookups, insertions, and deletions (logarithmic time in the case of a TreeMap). Similarly, a Python programmer (particularly when coupled with Cython) will be very happy with dicts. This means that the collision resolution part of hashing has also been solved, with tools that students are very likely to be using as black boxes in their jobs.
- The implementation of a unit-testable project for students' submissions would imply abstracting away hash tables with raw arrays, and using hash functions for custom or ready-made classes that are very unlikely to have properties as desirable as those of Object.hashCode(). This makes the project uninteresting in terms of data structures when compared to the other projects we can fit into a summer session.
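To illustrate the first point above: a custom class typically overrides hashCode() by combining its fields' hash codes. A minimal sketch (the Point class and its fields are our own example, not part of the project):

```java
import java.util.Objects;

// A hypothetical value class combining its fields' hash codes via Objects.hash,
// the standard recipe in the JDK documentation.
class Point {
    final int x;
    final int y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        // Objects.hash computes a 31-based polynomial over the fields'
        // own hashCode() values, keeping equals/hashCode consistent.
        return Objects.hash(x, y);
    }
}
```

The key contract is that equal objects must produce equal hash codes; the reverse need not hold.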

### Problem Specification

We are therefore interested in the implementation of a benchmark for various different methodologies of hashing. We want to generate timing graphs for inputs of various sizes, and compare load factors. The baseline would be linear probing; the runtime of different numbers of insertions, deletions, and lookups will then be compared against double / quadratic / exponential hashing and ordered hashing methodologies. The results would need to be graphed, and multiple different measurements should be made:

- Increase in operation efficiency as a function of the initial load factor (e.g., 0.5 vs. 0.6 vs. 0.7 vs. ...) with a single load factor adjustment policy in place (e.g., doubling).
- Differences in operation efficiency across different load factor adjustment policies (static increase vs. doubling vs. tripling vs. squaring vs. ...).
- Potential different overridings of hashCode(), etc.
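A timing harness for such measurements might look roughly like the following sketch, which exercises Java's built-in HashMap purely to illustrate the methodology; the names (InsertBenchmark, timeInsertions) are ours, not part of the specification, and a real benchmark would add warm-up runs and repeated trials:

```java
import java.util.HashMap;
import java.util.Random;

// Sketch of a micro-benchmark: time n insertions into a table whose
// capacity is pre-sized so the keys land near a target initial load factor.
class InsertBenchmark {
    static long timeInsertions(int n, float loadFactor, long seed) {
        Random rng = new Random(seed);  // fixed seed for reproducibility
        // Capacity chosen so that n keys reach roughly the requested load factor.
        HashMap<Integer, Integer> table =
                new HashMap<>((int) Math.ceil(n / loadFactor), loadFactor);
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            table.put(rng.nextInt(), i);
        }
        return System.nanoTime() - start;  // elapsed nanoseconds
    }
}
```

For the project itself, HashMap would be replaced by your own array-backed tables so the collision resolution strategy is under your control.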

### Suggested workflow

- First, implement linear probing with whatever hash code your programming language gives you by default. Make sure that insertions of new keys move along the array as required by the linear probing collision resolution strategy (you might need debugger output for this). Make sure that the load factor increase happens and does not cause problems (i.e., unit test, again and again and again...).
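A stripped-down linear-probing insertion could look like the following sketch (no resizing or deletion shown; the class and method names are ours, not mandated by the project):

```java
// Minimal linear-probing table over int keys, backed by a raw array.
class LinearProbingTable {
    private final Integer[] keys;  // null marks an empty slot
    private int size = 0;

    LinearProbingTable(int capacity) { keys = new Integer[capacity]; }

    int size() { return size; }

    // Insert a key, walking forward (wrapping around) past occupied slots.
    boolean insert(int key) {
        if (size == keys.length) return false;  // table full; real code would resize
        int i = Math.floorMod(Integer.hashCode(key), keys.length);
        while (keys[i] != null) {
            if (keys[i] == key) return false;   // key already present
            i = (i + 1) % keys.length;          // the linear probe step
        }
        keys[i] = key;
        size++;
        return true;
    }

    // Lookup follows the same probe sequence until an empty slot is hit.
    // (Assumes the table is never completely full, or this loops forever.)
    boolean contains(int key) {
        int i = Math.floorMod(Integer.hashCode(key), keys.length);
        while (keys[i] != null) {
            if (keys[i] == key) return true;
            i = (i + 1) % keys.length;
        }
        return false;
    }
}
```

Unit tests should deliberately force collisions (e.g., keys congruent modulo the capacity) to confirm the probe walk behaves as expected.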
- Then, implement double hashing. Use slight adjustments of your programming language's built-in hash function. Unit-test, again and again.
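A double-hashing probe sequence can be sketched as follows, deriving the step size from a slight tweak of the built-in hash as suggested above. The assumption that the capacity is prime (so every step size is coprime with it and the whole table is reachable) is a common choice, not a project requirement:

```java
// Sketch of double-hashing probe computation; names are illustrative.
class DoubleHashProbe {
    // Primary hash: the home slot.
    static int h1(int key, int capacity) {
        return Math.floorMod(Integer.hashCode(key), capacity);
    }

    // Secondary hash: the step size. Mapped into [1, capacity - 1] so it is
    // never zero; with a prime capacity every slot is eventually probed.
    static int h2(int key, int capacity) {
        return 1 + Math.floorMod(Integer.hashCode(key), capacity - 1);
    }

    // The i-th slot probed for this key.
    static int probe(int key, int i, int capacity) {
        return Math.floorMod(h1(key, capacity) + i * h2(key, capacity), capacity);
    }
}
```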
- Start your comparisons between those two for different data types. Build a custom type, whose hash function will somehow combine the contained types' hash functions.
- Expand to quadratic and exponential hashing. Perform a bunch of experiments across those four different collision-resolution techniques. Generate graphs.
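For quadratic probing, the i-th probe is typically offset by i squared from the home slot; a minimal sketch (note that with a power-of-two capacity a pure i² sequence may not visit every slot, which is worth catching in your unit tests):

```java
// Sketch of the quadratic probe computation; names are illustrative.
class QuadraticProbe {
    // The i-th slot probed for this key: home slot plus i^2, wrapped around.
    static int probe(int key, int i, int capacity) {
        int home = Math.floorMod(Integer.hashCode(key), capacity);
        return Math.floorMod(home + i * i, capacity);
    }
}
```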
- For appropriate queries, compare the output of your currently best collision resolution strategy with ordered hashing or Brent's method. Graph the results.

### Possible Extensions

- Thorough comparison (time **and** space) of Brent's method against the Gonnet-Munro algorithm for collision chain reorganization.
- Investigation of Object.hashCode() and similar built-in hash functions: given a collision-resolution strategy, why have the developers of the languages chosen those particular functions? What are their implementations? (Python and the OpenJDK are open source, as is libstdc++ on Unix/Linux.) Given a custom data type with various different contained built-in types, how should we override hashCode() to take advantage of the individual hashCode()s?

### Grading

Roughly as follows (percentages are cumulative):

- Thorough and tested implementation of linear probing and double hashing: 10%
- Thorough comparison of the two with various graphs of operation efficiency as a function of multiple factors: 40%
- Expansion of project to quadratic / exponential hashing: 80%
- Comparison of at least one operation with ordered hashing collision resolution methodologies: 100%