Hash Tables
djb2 hash → bucket index → average O(1)
Hash table · djb2 hash mod cap
init
Try it: Insert keys; the orange flash is the probe sequence. Toggle chaining ↔ linear probing to see two very different collision strategies. Then resize to watch every key get rehashed.
Why it matters
Hash tables achieve the holy grail: O(1) average-case lookup, insert, and delete. They're behind every dictionary, set, cache, and database index. Understanding hashing is essential for efficient algorithms.
The idea
Imagine a huge library where books are shelved by the first letter of the author's name. Finding "Knuth" means going straight to the 'K' section - no searching through everything.
A hash table works the same way, but smarter. A hash function converts any key into an array index:
hash("Knuth") → 42 → store at index 42
The magic: this conversion is O(1). You compute the index directly instead of searching.
The catch: collisions. Two different keys might hash to the same index. Solutions:
- Chaining: Each bucket is a linked list of colliding items
- Open addressing: Find the next empty slot
With a good hash function and low load factor, operations stay O(1) on average.
Key takeaways
- Average O(1) insert, lookup, delete
- Requires a good hash function for even distribution
- Collisions are inevitable - must be handled
- Load factor affects performance - resize when too full
Invariants
- Hash-table invariants
- Function purity: the hash function is deterministic — the same key always maps to the same slot. (If it weren't, you couldn't find anything.)
- Bucket integrity: every stored key
kresides at a slot reachable fromhash(k) mod capacityvia the chosen collision policy (chaining: same bucket; probing: starting bucket, then the deterministic probe sequence). - Load factor bound:
α = n / capacitystays below the resize threshold (typically0.7for probing,1.0for chaining). When it crosses, the table rehashes — same amortized story as array doubling.
Going deeper
Hash function families — pick by use case
Cryptographic hashes (SHA-256, BLAKE3) are overkill: they're designed to be unforgeable, not fast. Hash tables care about speed + uniform distribution, not collision-resistance against an attacker.
| Family | Speed | Notes |
|---|---|---|
| FNV-1a | very fast | tiny code, great for short keys; the demo's djb2 is in the same family |
| MurmurHash3 | fast, well-distributed | the default in many language runtimes (Java 8+, older Go) |
| xxHash3 | extremely fast | best raw throughput; used by zstd, LZ4, dataframe libs |
| CityHash / FarmHash | fast, Google-origin | similar tier to MurmurHash |
| SipHash | medium-fast, keyed | DoS-resistant default in Rust's HashMap, Python 3.4+, Perl |
| SHA-256 | slow (crypto-grade) | cryptographic; for HashMap, never; for content-addressed storage, yes |
Why the load factor matters — the birthday paradox
With m buckets and n keys distributed uniformly, the probability of any collision crosses 50% around n ≈ √(π·m/2) — the same math as the birthday paradox (collisions are surprisingly likely with far fewer items than buckets).
By the time α = n/m = 0.7, even a perfect hash function has many chains of length 2 or 3. Probing degrades faster than chaining — at α = 0.9, linear probing's expected probe count is ~5.5, and at α = 1.0 it diverges. The 0.7 cutoff is a convention chosen so probe count stays around 2 on average.
Hash-collision DoS attacks
If an attacker can choose your inputs and predict your hash function, they can craft keys that all land in the same bucket — collapsing your O(1) to O(n). Java 7 had a famous incident: HashMap<String, ...> exploded when CGI parameters were adversarial. The fix in modern runtimes is keyed hashing: at process start, a random secret is generated, and the hash function (SipHash) mixes it into every computation. The attacker can't predict where keys land without the secret.
If you're writing a library that takes untrusted input as hash keys (HTTP parsers, JSON deserializers, database query layers), make sure your hashmap uses keyed hashing. Rust's default HashMap does this; C++ std::unordered_map does not.
Advanced collision strategies
Beyond chaining and linear probing, two patterns trade complexity for better worst-case guarantees:
- Robin Hood hashing — during probing, if you find a slot whose current occupant probed a shorter distance than you have, you swap with it. Distributes probe distances evenly and bounds the worst case.
- Cuckoo hashing — two hash functions; if both slots are full, evict the existing occupant and reinsert it at its other hash position. Lookups are guaranteed
O(1)worst case (check two slots, done). - Hopscotch hashing — keeps each key within a bounded neighborhood of its home bucket, so lookup is cache-friendly even under collisions.
These show up in production: hopscotch in some Java libraries, Robin Hood in Rust's pre-2018 default, cuckoo in network-router lookup tables.
Math details
- Time Complexity (average case)
Insert: O(1)Lookup: O(1)Delete: O(1)- Worst case (many collisions)
- O(n)
- Load Factor
α = n / capacityResize when $\alpha > 0.7$ typically
Implementation
HashMap in Rust
use std::collections::HashMap;
let mut map = HashMap::new();
map.insert("key", 42); // O(1) avg
let val = map.get("key"); // O(1) avg
map.remove("key"); // O(1) avg
// Iteration is O(n)
for (k, v) in &map {
println!("{}: {}", k, v);
}
When to Use
- Dictionary/map (key-value store)
- Set membership testing
- Caching and memoization
- Counting frequencies
Practice
Three problems that show why hash tables are the Swiss army knife of algorithms:
- Group Anagrams — medium · "canonicalize then bucket" — the template for clustering by equivalence
- LRU Cache — medium · hash + doubly-linked list — the canonical two-structures-together design
- Longest Substring Without Repeating Characters — medium · sliding window + "last-seen index" hash — the O(n) string-window pattern