Lessons · #8 of 11

Hash Tables

djb2 hash → bucket index → average O(1)

Hash table · djb2 hash mod cap

size0

cap8

load0.00

init

key value

Try it: Insert keys; the orange flash is the probe sequence. Toggle chaining ↔ linear probing to see two very different collision strategies. Then resize to watch every key get rehashed.

Why it matters

Hash tables achieve the holy grail: O(1) average-case lookup, insert, and delete. They're behind every dictionary, set, cache, and database index. Understanding hashing is essential for efficient algorithms.

The idea

Imagine a huge library where books are shelved by the first letter of the author's name. Finding "Knuth" means going straight to the 'K' section - no searching through everything.

A hash table works the same way, but smarter. A hash function converts any key into an array index:

hash("Knuth") → 42 → store at index 42

The magic: this conversion is O(1). You compute the index directly instead of searching.

The catch: collisions. Two different keys might hash to the same index. Solutions:

Chaining: Each bucket is a linked list of colliding items
Open addressing: Find the next empty slot

With a good hash function and low load factor, operations stay O(1) on average.

Key takeaways

Average O(1) insert, lookup, delete
Requires a good hash function for even distribution
Collisions are inevitable - must be handled
Load factor affects performance - resize when too full

Invariants

Hash-table invariants: Function purity: the hash function is deterministic — the same key always maps to the same slot. (If it weren't, you couldn't find anything.); Bucket integrity: every stored key k resides at a slot reachable from hash(k) mod capacity via the chosen collision policy (chaining: same bucket; probing: starting bucket, then the deterministic probe sequence).; Load factor bound: α = n / capacity stays below the resize threshold (typically 0.7 for probing, 1.0 for chaining). When it crosses, the table rehashes — same amortized story as array doubling.

Going deeper

Hash function families — pick by use case

Cryptographic hashes (SHA-256, BLAKE3) are overkill: they're designed to be unforgeable, not fast. Hash tables care about speed + uniform distribution, not collision-resistance against an attacker.

| Family | Speed | Notes | |---|---|---| | FNV-1a | very fast | tiny code, great for short keys; the demo's djb2 is in the same family | | MurmurHash3 | fast, well-distributed | the default in many language runtimes (Java 8+, older Go) | | xxHash3 | extremely fast | best raw throughput; used by zstd, LZ4, dataframe libs | | CityHash / FarmHash | fast, Google-origin | similar tier to MurmurHash | | SipHash | medium-fast, keyed | DoS-resistant default in Rust's HashMap, Python 3.4+, Perl | | SHA-256 | slow (crypto-grade) | cryptographic; for HashMap, never; for content-addressed storage, yes |

Why the load factor matters — the birthday paradox

With m buckets and n keys distributed uniformly, the probability of any collision crosses 50% around n ≈ √(π·m/2) — the same math as the birthday paradox (collisions are surprisingly likely with far fewer items than buckets).

By the time α = n/m = 0.7, even a perfect hash function has many chains of length 2 or 3. Probing degrades faster than chaining — at α = 0.9, linear probing's expected probe count is ~5.5, and at α = 1.0 it diverges. The 0.7 cutoff is a convention chosen so probe count stays around 2 on average.

Hash-collision DoS attacks

If an attacker can choose your inputs and predict your hash function, they can craft keys that all land in the same bucket — collapsing your O(1) to O(n). Java 7 had a famous incident: HashMap<String, ...> exploded when CGI parameters were adversarial. The fix in modern runtimes is keyed hashing: at process start, a random secret is generated, and the hash function (SipHash) mixes it into every computation. The attacker can't predict where keys land without the secret.

If you're writing a library that takes untrusted input as hash keys (HTTP parsers, JSON deserializers, database query layers), make sure your hashmap uses keyed hashing. Rust's default HashMap does this; C++ std::unordered_map does not.

Advanced collision strategies

Beyond chaining and linear probing, two patterns trade complexity for better worst-case guarantees:

Robin Hood hashing — during probing, if you find a slot whose current occupant probed a shorter distance than you have, you swap with it. Distributes probe distances evenly and bounds the worst case.
Cuckoo hashing — two hash functions; if both slots are full, evict the existing occupant and reinsert it at its other hash position. Lookups are guaranteed O(1) worst case (check two slots, done).
Hopscotch hashing — keeps each key within a bounded neighborhood of its home bucket, so lookup is cache-friendly even under collisions.

These show up in production: hopscotch in some Java libraries, Robin Hood in Rust's pre-2018 default, cuckoo in network-router lookup tables.

Math details

Time Complexity (average case): Insert: O(1); Lookup: O(1); Delete: O(1)
Worst case (many collisions): O(n)
Load Factor: α = n / capacity; Resize when $\alpha > 0.7$ typically

Implementation

HashMap in Rust

use std::collections::HashMap;

let mut map = HashMap::new();
map.insert("key", 42);       // O(1) avg
let val = map.get("key");    // O(1) avg
map.remove("key");           // O(1) avg

// Iteration is O(n)
for (k, v) in &map {
    println!("{}: {}", k, v);
}

When to Use

Dictionary/map (key-value store)
Set membership testing
Caching and memoization
Counting frequencies

Practice

Three problems that show why hash tables are the Swiss army knife of algorithms:

Group Anagrams — medium · "canonicalize then bucket" — the template for clustering by equivalence
LRU Cache — medium · hash + doubly-linked list — the canonical two-structures-together design
Longest Substring Without Repeating Characters — medium · sliding window + "last-seen index" hash — the O(n) string-window pattern

→ Browse all practice problems

🪣 Hash Tables