Group Anagrams
in the wild Deduplicate sensor signatures up to permutation; cluster scrambled OCR'd words; group hash-collision keys.
Problem
Given an array of lowercase-letter strings, group the anagrams together. Two strings are anagrams iff one is a rearrangement of the other. The order of groups and the order within each group don't matter.
Examples
["eat", "tea", "tan", "ate", "nat", "bat"]
→ [["eat", "tea", "ate"], ["tan", "nat"], ["bat"]]
[""] → [[""]]
["a"] → [["a"]]
Why this is the canonical "canonical-key + bucket" problem
A hash table's superpower: clustering by an equivalence class. The trick is finding a canonical representative for the class — something all members of the class hash to identically.
For anagrams, two natural canonical keys:
- Sorted letters.
"eat","tea","ate"all sort to"aet". Use"aet"as the bucket key.O(k log k)per string. - Letter-count tuple. Build a 26-element count vector → use that (or its string form) as the key.
O(k)per string. Faster for long strings.
Once you have the key, drop each string into groups[key]. End: return list(groups.values()). The hash table does the bucketing for free.
This canonicalize + bucket template solves a huge class of problems: "find duplicates up to rotation", "cluster equivalent records", "deduplicate fuzzy data". Master it once, recognize it everywhere.
Hints
Hint 1 — what does "anagram" mean structurally?
Two strings are anagrams iff their multisets of characters are equal. A multiset is equal to another iff a canonical form of each is the same. What's the simplest canonical form of a multiset of characters?
Hint 2 — sort the characters
Sorting "eat" gives "aet". So does sorting "tea". Use the sorted string as a hash-table key — collision = group membership.
Hint 3 — even faster with a count vector
If your alphabet is fixed (e.g. 26 lowercase letters), a 26-length count vector is a faster canonical form: O(k) vs O(k log k). Convert it to a tuple (or fixed string like "a3b0c0...z0") to hash.
Solution
from collections import defaultdict
def group_anagrams(strs: list[str]) -> list[list[str]]:
groups: dict[tuple[int, ...], list[str]] = defaultdict(list)
for s in strs:
# 26-length count vector → tuple → hashable canonical key.
key = [0] * 26
for ch in s: key[ord(ch) - ord("a")] += 1
groups[tuple(key)].append(s)
return list(groups.values())
def normalize(groups: list[list[str]]) -> list[list[str]]:
return sorted(sorted(g) for g in groups)
cases = [
(["eat", "tea", "tan", "ate", "nat", "bat"],
[["ate", "eat", "tea"], ["bat"], ["nat", "tan"]]),
([""], [[""]]),
(["a"], [["a"]]),
]
for strs, expected in cases:
got = normalize(group_anagrams(strs))
mark = "✓" if got == expected else "✗"
print(f"{mark} group_anagrams({strs}) = {got}")function groupAnagrams(strs) {
const groups = new Map();
for (const s of strs) {
const key = Array(26).fill(0);
for (const ch of s) key[ch.charCodeAt(0) - 97]++;
const k = key.join(",");
if (!groups.has(k)) groups.set(k, []);
groups.get(k).push(s);
}
return Array.from(groups.values());
}
const normalize = gs => gs.map(g => [...g].sort()).sort((a, b) => a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0);
const cases = [
[["eat", "tea", "tan", "ate", "nat", "bat"],
[["ate", "eat", "tea"], ["bat"], ["nat", "tan"]]],
[[""], [[""]]],
[["a"], [["a"]]],
];
for (const [strs, expected] of cases) {
const got = normalize(groupAnagrams(strs));
const ok = JSON.stringify(got) === JSON.stringify(expected);
console.log(`${ok ? "✓" : "✗"} groupAnagrams(${JSON.stringify(strs)}) = ${JSON.stringify(got)}`);
}use std::collections::HashMap;
pub fn group_anagrams(strs: Vec<String>) -> Vec<Vec<String>> {
// Key = 26-tuple of letter counts. We encode as [u8; 26] which is `Copy`
// and hashes cheaply.
let mut groups: HashMap<[u8; 26], Vec<String>> = HashMap::new();
for s in strs {
let mut key = [0u8; 26];
for b in s.bytes() {
if b.is_ascii_lowercase() {
key[(b - b'a') as usize] = key[(b - b'a') as usize].saturating_add(1);
}
}
groups.entry(key).or_default().push(s);
}
groups.into_values().collect()
}
fn main() {
let cases = vec![
vec!["eat", "tea", "tan", "ate", "nat", "bat"],
vec![""],
vec!["a"],
];
for input in cases {
let strs: Vec<String> = input.iter().map(|s| s.to_string()).collect();
let mut got: Vec<Vec<String>> = group_anagrams(strs);
for g in got.iter_mut() { g.sort(); }
got.sort_by(|a, b| a[0].cmp(&b[0]));
println!("✓ group_anagrams({:?}) = {:?}", input, got);
}
}#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
std::vector<std::vector<std::string>> group_anagrams(std::vector<std::string>& strs) {
std::unordered_map<std::string, std::vector<std::string>> groups;
for (auto& s : strs) {
std::string key(26, '0');
for (char c : s) ++key[c - 'a']; // ASCII-arithmetic into a fixed-size key
groups[key].push_back(s);
}
std::vector<std::vector<std::string>> out;
out.reserve(groups.size());
for (auto& [_, g] : groups) out.push_back(std::move(g));
return out;
}
int main() {
std::vector<std::vector<std::string>> cases = {
{"eat", "tea", "tan", "ate", "nat", "bat"},
{""},
{"a"},
};
for (auto& in : cases) {
auto got = group_anagrams(in);
for (auto& g : got) std::sort(g.begin(), g.end());
std::sort(got.begin(), got.end(), [](auto& a, auto& b){ return a.front() < b.front(); });
std::cout << "✓ group_anagrams(...) = [";
for (auto& g : got) {
std::cout << "[";
for (size_t i = 0; i < g.size(); ++i)
std::cout << "\"" << g[i] << "\"" << (i + 1 < g.size() ? "," : "");
std::cout << "]";
}
std::cout << "]\n";
}
}// In C we sort the characters of each string to get a canonical key, then
// bucket via a simple open-address hash. Quadratic-bound to demos.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
static int char_cmp(const void* a, const void* b) { return *(const char*)a - *(const char*)b; }
static void sorted_copy(const char* s, char* out) {
size_t n = strlen(s);
memcpy(out, s, n + 1);
qsort(out, n, 1, char_cmp);
}
// Linear scan over keys + groups. Fine for the demo's small inputs.
static char keys [64][32];
static char* groups_str[64][16];
static int groups_n[64];
static int nkeys = 0;
static void add(const char* original, const char* key) {
for (int i = 0; i < nkeys; i++) {
if (strcmp(keys[i], key) == 0) {
groups_str[i][groups_n[i]++] = strdup(original);
return;
}
}
strcpy(keys[nkeys], key);
groups_str[nkeys][0] = strdup(original);
groups_n[nkeys] = 1;
nkeys++;
}
int main(void) {
const char* in[] = {"eat", "tea", "tan", "ate", "nat", "bat"};
char keybuf[32];
for (int i = 0; i < 6; i++) {
sorted_copy(in[i], keybuf);
add(in[i], keybuf);
}
printf("✓ group_anagrams(...) = [");
for (int i = 0; i < nkeys; i++) {
printf("[");
for (int j = 0; j < groups_n[i]; j++)
printf("\"%s\"%s", groups_str[i][j], j + 1 < groups_n[i] ? "," : "");
printf("]%s", i + 1 < nkeys ? "," : "");
}
printf("]\n");
// Free
for (int i = 0; i < nkeys; i++)
for (int j = 0; j < groups_n[i]; j++)
free(groups_str[i][j]);
return 0;
}The "canonicalize then bucket" pattern works exactly the same in every language — the only difference is whether the canonical form is a sorted string, a count vector serialized to a string, or a fixed-size array used as a hash key. Pick whichever the language makes cheapest.
Complexity
- Time (count-vector key)
- O(n · k) — n strings, each costing O(k) to compute the key. Insert into the hash is O(1) amortized.
- Time (sort key)
- O(n · k log k) — a logarithmic factor slower per string. Negligible unless strings are very long.
- Space
- O(n · k) — every character is stored at least once in the output (plus the key).
In the wild
- Plagiarism detection for shuffled-word documents — bucket by word multiset.
- Sensor signature deduping — multiple sensors emitting the same set of (channel, value) tuples in different orders.
- Hash-collision diagnosis — group log lines by a canonical form to spot duplicate root causes under different timestamps.
- OCR cleanup pipelines group permutation-similar string candidates per character cell.
Variations
- Find all anagram pairs (not groups). Same approach; emit cross-products from groups of size ≥ 2.
- Anagram in a stream of words. As new words arrive, hash and update an existing bucket.
- K-shingle anagrams (block-shuffled documents). Hash window k-grams, bucket by canonical form.
- Anagrams over Unicode. Replace the 26-vector with a
HashMap<char, int>count. Same big-O, larger constants.