Group Anagrams

time: O(n · k)
space: O(n · k)
tags: hash-tables · strings · canonicalization

in the wild Deduplicate sensor signatures up to permutation; cluster scrambled OCR'd words; group hash-collision keys.

Problem

Given an array of lowercase-letter strings, group the anagrams together. Two strings are anagrams iff one is a rearrangement of the other. The order of groups and the order within each group don't matter.

Examples

["eat", "tea", "tan", "ate", "nat", "bat"]
  → [["eat", "tea", "ate"], ["tan", "nat"], ["bat"]]

[""]                 → [[""]]
["a"]                → [["a"]]

Why this is the canonical "canonical-key + bucket" problem

A hash table's superpower: clustering by an equivalence class. The trick is finding a canonical representative for the class — something all members of the class hash to identically.

For anagrams, two natural canonical keys:

Sorted letters. "eat", "tea", "ate" all sort to "aet". Use "aet" as the bucket key. O(k log k) per string.
Letter-count tuple. Build a 26-element count vector → use that (or its string form) as the key. O(k) per string. Faster for long strings.

Once you have the key, drop each string into groups[key]. End: return list(groups.values()). The hash table does the bucketing for free.

This canonicalize + bucket template solves a huge class of problems: "find duplicates up to rotation", "cluster equivalent records", "deduplicate fuzzy data". Master it once, recognize it everywhere.

Hints

Hint 1 — what does "anagram" mean structurally?

Two strings are anagrams iff their multisets of characters are equal. A multiset is equal to another iff a canonical form of each is the same. What's the simplest canonical form of a multiset of characters?

Hint 2 — sort the characters

Sorting "eat" gives "aet". So does sorting "tea". Use the sorted string as a hash-table key — collision = group membership.

Hint 3 — even faster with a count vector

If your alphabet is fixed (e.g. 26 lowercase letters), a 26-length count vector is a faster canonical form: O(k) vs O(k log k). Convert it to a tuple (or fixed string like "a3b0c0...z0") to hash.

Solution

from collections import defaultdict

def group_anagrams(strs: list[str]) -> list[list[str]]:
    groups: dict[tuple[int, ...], list[str]] = defaultdict(list)
    for s in strs:
        # 26-length count vector → tuple → hashable canonical key.
        key = [0] * 26
        for ch in s: key[ord(ch) - ord("a")] += 1
        groups[tuple(key)].append(s)
    return list(groups.values())

def normalize(groups: list[list[str]]) -> list[list[str]]:
    return sorted(sorted(g) for g in groups)

cases = [
    (["eat", "tea", "tan", "ate", "nat", "bat"],
     [["ate", "eat", "tea"], ["bat"], ["nat", "tan"]]),
    ([""], [[""]]),
    (["a"], [["a"]]),
]
for strs, expected in cases:
    got  = normalize(group_anagrams(strs))
    mark = "✓" if got == expected else "✗"
    print(f"{mark} group_anagrams({strs}) = {got}")

function groupAnagrams(strs) {
  const groups = new Map();
  for (const s of strs) {
    const key = Array(26).fill(0);
    for (const ch of s) key[ch.charCodeAt(0) - 97]++;
    const k = key.join(",");
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(s);
  }
  return Array.from(groups.values());
}

const normalize = gs => gs.map(g => [...g].sort()).sort((a, b) => a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0);

const cases = [
  [["eat", "tea", "tan", "ate", "nat", "bat"],
   [["ate", "eat", "tea"], ["bat"], ["nat", "tan"]]],
  [[""], [[""]]],
  [["a"], [["a"]]],
];
for (const [strs, expected] of cases) {
  const got = normalize(groupAnagrams(strs));
  const ok  = JSON.stringify(got) === JSON.stringify(expected);
  console.log(`${ok ? "✓" : "✗"} groupAnagrams(${JSON.stringify(strs)}) = ${JSON.stringify(got)}`);
}

use std::collections::HashMap;

pub fn group_anagrams(strs: Vec<String>) -> Vec<Vec<String>> {
    // Key = 26-tuple of letter counts. We encode as [u8; 26] which is `Copy`
    // and hashes cheaply.
    let mut groups: HashMap<[u8; 26], Vec<String>> = HashMap::new();
    for s in strs {
        let mut key = [0u8; 26];
        for b in s.bytes() {
            if b.is_ascii_lowercase() {
                key[(b - b'a') as usize] = key[(b - b'a') as usize].saturating_add(1);
            }
        }
        groups.entry(key).or_default().push(s);
    }
    groups.into_values().collect()
}

fn main() {
    let cases = vec![
        vec!["eat", "tea", "tan", "ate", "nat", "bat"],
        vec![""],
        vec!["a"],
    ];
    for input in cases {
        let strs: Vec<String> = input.iter().map(|s| s.to_string()).collect();
        let mut got: Vec<Vec<String>> = group_anagrams(strs);
        for g in got.iter_mut() { g.sort(); }
        got.sort_by(|a, b| a[0].cmp(&b[0]));
        println!("✓ group_anagrams({:?}) = {:?}", input, got);
    }
}

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::vector<std::string>> group_anagrams(std::vector<std::string>& strs) {
    std::unordered_map<std::string, std::vector<std::string>> groups;
    for (auto& s : strs) {
        std::string key(26, '0');
        for (char c : s) ++key[c - 'a'];     // ASCII-arithmetic into a fixed-size key
        groups[key].push_back(s);
    }
    std::vector<std::vector<std::string>> out;
    out.reserve(groups.size());
    for (auto& [_, g] : groups) out.push_back(std::move(g));
    return out;
}

int main() {
    std::vector<std::vector<std::string>> cases = {
        {"eat", "tea", "tan", "ate", "nat", "bat"},
        {""},
        {"a"},
    };
    for (auto& in : cases) {
        auto got = group_anagrams(in);
        for (auto& g : got) std::sort(g.begin(), g.end());
        std::sort(got.begin(), got.end(), [](auto& a, auto& b){ return a.front() < b.front(); });
        std::cout << "✓ group_anagrams(...) = [";
        for (auto& g : got) {
            std::cout << "[";
            for (size_t i = 0; i < g.size(); ++i)
                std::cout << "\"" << g[i] << "\"" << (i + 1 < g.size() ? "," : "");
            std::cout << "]";
        }
        std::cout << "]\n";
    }
}

// In C we sort the characters of each string to get a canonical key, then
// bucket via a simple open-address hash. Quadratic-bound to demos.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int char_cmp(const void* a, const void* b) { return *(const char*)a - *(const char*)b; }
static void sorted_copy(const char* s, char* out) {
    size_t n = strlen(s);
    memcpy(out, s, n + 1);
    qsort(out, n, 1, char_cmp);
}

// Linear scan over keys + groups. Fine for the demo's small inputs.
static char  keys [64][32];
static char* groups_str[64][16];
static int   groups_n[64];
static int   nkeys = 0;

static void add(const char* original, const char* key) {
    for (int i = 0; i < nkeys; i++) {
        if (strcmp(keys[i], key) == 0) {
            groups_str[i][groups_n[i]++] = strdup(original);
            return;
        }
    }
    strcpy(keys[nkeys], key);
    groups_str[nkeys][0] = strdup(original);
    groups_n[nkeys] = 1;
    nkeys++;
}

int main(void) {
    const char* in[] = {"eat", "tea", "tan", "ate", "nat", "bat"};
    char keybuf[32];
    for (int i = 0; i < 6; i++) {
        sorted_copy(in[i], keybuf);
        add(in[i], keybuf);
    }
    printf("✓ group_anagrams(...) = [");
    for (int i = 0; i < nkeys; i++) {
        printf("[");
        for (int j = 0; j < groups_n[i]; j++)
            printf("\"%s\"%s", groups_str[i][j], j + 1 < groups_n[i] ? "," : "");
        printf("]%s", i + 1 < nkeys ? "," : "");
    }
    printf("]\n");

    // Free
    for (int i = 0; i < nkeys; i++)
        for (int j = 0; j < groups_n[i]; j++)
            free(groups_str[i][j]);
    return 0;
}

idle

The "canonicalize then bucket" pattern works exactly the same in every language — the only difference is whether the canonical form is a sorted string, a count vector serialized to a string, or a fixed-size array used as a hash key. Pick whichever the language makes cheapest.

Complexity

Time (count-vector key): O(n · k) — n strings, each costing O(k) to compute the key. Insert into the hash is O(1) amortized.
Time (sort key): O(n · k log k) — a logarithmic factor slower per string. Negligible unless strings are very long.
Space: O(n · k) — every character is stored at least once in the output (plus the key).

In the wild

Plagiarism detection for shuffled-word documents — bucket by word multiset.
Sensor signature deduping — multiple sensors emitting the same set of (channel, value) tuples in different orders.
Hash-collision diagnosis — group log lines by a canonical form to spot duplicate root causes under different timestamps.
OCR cleanup pipelines group permutation-similar string candidates per character cell.

Variations

Find all anagram pairs (not groups). Same approach; emit cross-products from groups of size ≥ 2.
Anagram in a stream of words. As new words arrive, hash and update an existing bucket.
K-shingle anagrams (block-shuffled documents). Hash window k-grams, bucket by canonical form.
Anagrams over Unicode. Replace the 26-vector with a HashMap<char, int> count. Same big-O, larger constants.