Arrays & Hashing

The bread and butter — arrays, strings, hashmaps, frequency counting, prefix sums.

arrays-hashing.mdreadme

01 — Arrays & Hashing

Neetcode Week 1. Half of all interview problems are dressed-up array/hash problems. Internalizing this category is the single biggest leverage point for the first month.

Table of Contents

  1. Arrays — the underlying memory
  2. Strings — what they are and aren't
  3. Hash tables — the killer DS
  4. Hash sets vs hash maps
  5. Frequency counting pattern
  6. Prefix sums
  7. Difference arrays (range updates)
  8. Walked-through problems
  9. Pitfalls
  10. Practice list

1. Arrays — the underlying memory

An array is a contiguous block of memory of fixed-size elements indexed from 0.

Index:    0    1    2    3    4    5
        ┌────┬────┬────┬────┬────┬────┐
Memory: │ 7  │ 3  │ 9  │ 2  │ 4  │ 8  │
        └────┴────┴────┴────┴────┴────┘
        ↑
        Base address `p`. Element i is at `p + i * sizeof(T)`.

Because elements are equal-sized and contiguous:

  • arr[i] is O(1) — just arithmetic on the pointer.
  • Iteration is cache-friendly (CPU prefetches the next chunk).
  • Insert/delete in the middle is O(n) because everything must shift.

Static vs dynamic

Static arrayDynamic array (vector / list)
C int arr[10]C++ vector, Python list, Java ArrayList
Fixed size at compile timeGrows at runtime by reallocation
No append costappend is O(1)* amortized (Section 4 of foundations)

In interviews you almost always use the dynamic version. Just remember append/push_back is amortized constant.

Python list isn't a "real" array

Python list is a dynamic array of pointers to Python objects, not a packed buffer. So:

  • arr[i] is still O(1), but slightly slower than C arrays.
  • list[int] doesn't pack ints contiguously. For real packed arrays use array.array or numpy.

This rarely matters in interviews — just know it for the "why is Python slow?" question.


2. Strings

A string is conceptually an array of characters, but with critical differences:

Python strings: immutable

s = "hello" s[0] = 'H' # TypeError! s = "H" + s[1:] # creates new string each time — O(n)

This is why building strings in a loop with += is O(n²) — each += copies the whole string. Always:

parts = [] for c in stuff: parts.append(c) result = ''.join(parts) # O(n) total

C++ strings: mutable

string s = "hello"; s[0] = 'H'; // OK: now "Hello" s += 'X'; // amortized O(1), like vector

Common ops

# Python s.upper(), s.lower() s.split(' '), s.split() # split() handles any whitespace '-'.join(['a', 'b']) # 'a-b' s.find('ll') # index, -1 if not found s.count('l') ord('a') - ord('a') # 0 → useful for char-to-bucket-index
// C++ s.length(); s.substr(start, len); s.find("ll"); // string::npos if not found to_string(42); stoi("42"); char c = 'a' + 3; // 'd'

Char-to-index trick (lowercase letters)

freq = [0] * 26 for c in s: freq[ord(c) - ord('a')] += 1

Cleaner than a hashmap for "only lowercase a-z" problems and slightly faster.


3. Hash tables

A hash table stores key-value pairs by:

  1. Computing hash(key) % capacity to pick a bucket
  2. Storing the entry in that bucket (chained list or open addressing)
            keys                        buckets (mod capacity = 8)
          ┌───────┐
   "cat" ─┤hash() ├──→ 5 ─→ ┌───┐    ┌─5─┐
          │       │         │ 0 │    │"cat":3 → "tom":1│
   "dog" ─┤hash() ├──→ 2 ─→ │ 1 │    │   │
          │       │         │ 2 │    │"dog":7│
   "tom" ─┤hash() ├──→ 5 ─→ │ 3 │    │   │
          └───────┘         │ 4 │    │   │
                            │ 5 │    │   │
                            │ 6 │    │   │
                            │ 7 │    │   │
                            └───┘

Why O(1) average?

If hash function spreads well and load factor (n / capacity) is below ~0.7, each bucket has a tiny constant chain. Lookup, insert, delete are O(1) on average.

Worst case is O(n) when all keys collide into one bucket. In interviews, just say "average O(1), worst case O(n) for hostile inputs / poor hash."

Python dict / set

d = {} d['key'] = 42 d['key'] # 42 'key' in d # True del d['key'] d.get('missing', -1) # -1 (no KeyError) # Iterate for k in d: ... for k, v in d.items(): ... for v in d.values(): ... # defaultdict — auto-default for missing key from collections import defaultdict graph = defaultdict(list) graph[u].append(v) # graph[u] auto-becomes [] if absent

C++ unordered_map / map

unordered_map<string, int> m; m["key"] = 42; m.count("key"); // 0 or 1 m.find("key"); // iterator, m.end() if absent map<int, int> tree; // ordered, O(log n), useful for range queries tree.lower_bound(x); // first >= x tree.upper_bound(x); // first > x

Use unordered_map (O(1) avg) by default. Only use map (O(log n) tree-backed) when you need ordering or lower_bound.


4. Hash sets vs hash maps

  • Set — only keys, "is this in the collection?"
  • Map — key + value, "given the key, what's stored?"

Both are O(1) avg. Choose based on whether you need a value or just membership.

seen = set() seen.add(x) x in seen # O(1) count = {} count[x] = count.get(x, 0) + 1

5. Frequency counting pattern

Trigger: "anagram", "duplicate", "majority element", "unique character", "k most frequent"

The pattern is: build a hashmap of element → count, then scan it.

from collections import Counter # Are two strings anagrams? def is_anagram(s, t): return Counter(s) == Counter(t) # Most common element def majority(nums): c = Counter(nums) return c.most_common(1)[0][0] # Top K frequent def top_k(nums, k): c = Counter(nums) return [item for item, _ in c.most_common(k)] # O(n log k) with heap internally

Bucket sort variant (when k is small relative to n)

def top_k_frequent(nums, k): count = Counter(nums) buckets = [[] for _ in range(len(nums) + 1)] # bucket[i] = elems with freq i for num, cnt in count.items(): buckets[cnt].append(num) result = [] for i in range(len(buckets) - 1, -1, -1): # highest freq first for num in buckets[i]: result.append(num) if len(result) == k: return result

This is O(n) instead of O(n log k) — a clean optimization when frequencies are bounded by n.


6. Prefix sums

Trigger: "sum of subarray", "range sum query", "find subarray with sum = k"

The prefix-sum array P lets you compute any range sum in O(1) after O(n) preprocessing:

arr:    [3,  1,  4,  1,  5,  9]
P:      [0,  3,  4,  8,  9, 14, 23]
        ↑
        P[0] = 0 (empty prefix)
        P[i] = arr[0] + arr[1] + ... + arr[i-1]

sum(arr[l..r]) = P[r+1] - P[l]
                       ↑
   sum of arr[2..4] = 4+1+5 = 10
                    = P[5] - P[2] = 14 - 4 = 10 ✓

Build it

def prefix_sum(arr): P = [0] * (len(arr) + 1) for i, x in enumerate(arr): P[i+1] = P[i] + x return P

Subarray sum = k (Hashmap + prefix sum)

Classic problem: find number of subarrays whose sum is exactly k.

def subarray_sum_equals_k(nums, k): count = 0 prefix = 0 seen = {0: 1} # prefix value 0 has been seen once (empty prefix) for x in nums: prefix += x # we want to know: is there an earlier prefix p' such that prefix - p' = k? # i.e. p' = prefix - k. Look it up. if prefix - k in seen: count += seen[prefix - k] seen[prefix] = seen.get(prefix, 0) + 1 return count

Line by line:

  • seen = {0: 1} — there's one empty prefix with sum 0; lets us count subarrays starting at index 0.
  • prefix += x — running prefix sum.
  • if prefix - k in seen — we want subarray sum = k. Sum from any earlier index j+1..i equals prefix[i] - prefix[j], so look for prefix - k among earlier prefixes.
  • count += seen[prefix - k] — add how many such earlier prefixes there were.
  • Update seen for the next iteration.

Time: O(n). Space: O(n).


7. Difference arrays

Trigger: "apply k range updates [l, r, +x], then query final array"

Naive: each update is O(r-l+1) → O(n·k) total. Difference array: each update is O(1) → O(n+k) total.

arr:  [0, 0, 0, 0, 0]
update [1, 3, +5]:
diff[1] += 5
diff[4] -= 5     ← the index AFTER r

diff: [0, 5, 0, 0, -5]
prefix-sum diff:  0, 5, 5, 5, 0
                  → arr = [0, 5, 5, 5, 0]  ✓
def apply_updates(n, updates): diff = [0] * (n + 1) for l, r, x in updates: diff[l] += x diff[r + 1] -= x # prefix sum to recover the array arr = [0] * n cur = 0 for i in range(n): cur += diff[i] arr[i] = cur return arr

This is also the foundation of segment-tree-like range-update problems.


8. Walked-through problems

Problem 1: Two Sum (Easy)

Given nums and target, return indices of two numbers that sum to target.

Brute: O(n²) — try all pairs. Hash: O(n) — for each element, check if target - x is already seen.

def two_sum(nums, target): seen = {} # value → index for i, x in enumerate(nums): complement = target - x if complement in seen: # found earlier return [seen[complement], i] seen[x] = i # remember this index for future

Why this works: by the time we look for complement, we've already inserted every previous element. So if a valid pair exists, when we hit the second element of the pair, the first is in seen.

Pitfall: insert after the lookup, otherwise nums = [3, 3], target = 6 would match 3 to itself.

Problem 2: Group Anagrams (Medium)

Group strings that are anagrams of each other.

Key insight: anagrams share the same sorted characters or the same frequency tuple.

def group_anagrams(strs): groups = defaultdict(list) for s in strs: key = tuple(sorted(s)) # ("a","b","t") for "bat", "tab" groups[key].append(s) return list(groups.values())

Optimization: use a length-26 tuple of counts instead of sorting. O(n·m) instead of O(n·m log m).

def group_anagrams_fast(strs): groups = defaultdict(list) for s in strs: cnt = [0] * 26 for c in s: cnt[ord(c) - ord('a')] += 1 groups[tuple(cnt)].append(s) return list(groups.values())

Problem 3: Longest Consecutive Sequence (Medium)

Given unsorted nums, find the length of the longest run of consecutive integers. O(n) required.

[100, 4, 200, 1, 3, 2]   →   4   (sequence 1,2,3,4)

Naive: sort → O(n log n). Required: O(n).

Hash trick: put everything in a set. For each number, check if it's the start of a run (i.e., n-1 is not in set). If so, count up.

def longest_consecutive(nums): s = set(nums) longest = 0 for n in s: if n - 1 in s: continue # not the start of a run # n is the start; count up cur = n length = 1 while cur + 1 in s: cur += 1 length += 1 longest = max(longest, length) return longest

Why O(n)? Each number is the inner of while only when it's part of a run we entered from its start. Each element is visited at most twice across the whole algorithm. So total work is O(n), not O(n²).

This "only start runs from boundaries" trick generalizes to many problems.

Problem 4: Product of Array Except Self (Medium)

Return out[i] = product of all nums[j] where j != i. O(n) time, no division.

nums = [1, 2, 3, 4]
out  = [24, 12, 8, 6]

Two-pass technique: for each index, the answer is prefix_product * suffix_product.

def product_except_self(nums): n = len(nums) out = [1] * n # Pass 1: out[i] = product of all nums[0..i-1] prefix = 1 for i in range(n): out[i] = prefix prefix *= nums[i] # Pass 2: multiply by product of all nums[i+1..n-1] suffix = 1 for i in range(n - 1, -1, -1): out[i] *= suffix suffix *= nums[i] return out

Trace nums = [1,2,3,4]:

After pass 1: out = [1, 1, 2, 6]    (prefix products)
              prefix sequence: 1 → 1·1=1 → 1·2=2 → 2·3=6
After pass 2: from right
  i=3: out[3] *= 1  → 6;   suffix = 1·4 = 4
  i=2: out[2] *= 4  → 8;   suffix = 4·3 = 12
  i=1: out[1] *= 12 → 12;  suffix = 12·2 = 24
  i=0: out[0] *= 24 → 24
Final: [24, 12, 8, 6]  ✓

9. Pitfalls

Hashing custom objects

In Python, only hashable objects can be dict keys. Lists are unhashable; tuples are. So:

key = tuple(sorted(s)) # OK key = sorted(s) # TypeError: unhashable type: 'list'

For C++, custom types in unordered_map need a custom hash<T> specialization — easier to avoid by using pair<int,int>-of-primitives or string.

Iterating and modifying

Don't modify a dict while iterating:

for k in d: if d[k] == 0: del d[k] # RuntimeError: dictionary changed size during iteration # Fix: for k in list(d): # iterate over a snapshot of keys if d[k] == 0: del d[k]

Counter subtraction surprise

Counter("aabb") - Counter("ab") # Counter({'a':1, 'b':1}) — not negative!

Counter's - only keeps positive counts. Use c1.subtract(c2) for pure subtraction.

Pre-sorting cost

If your "hash" approach involves sorting tuples per element (e.g., group anagrams), state the true cost: O(n · m log m) where m is string length. Often you can replace with bucket counts (O(n·m)).


10. Practice list

In Neetcode order — solve in this sequence.

#ProblemConceptDifficulty
1Two SumHash lookupEasy
217Contains DuplicateHash setEasy
242Valid AnagramFrequency countEasy
49Group AnagramsHashmap with tuple keyMedium
347Top K Frequent ElementsBucket sort / heapMedium
271Encode and Decode StringsString designMedium
238Product of Array Except SelfPrefix/suffix arraysMedium
36Valid SudokuHash set of "row=N val=v"Medium
128Longest Consecutive SequenceHash set + boundary scanMedium

Bonus advanced practice


TL;DR ✅

  • Arrays are O(1) random access, O(n) insert-mid. Dynamic arrays' append is O(1)*.
  • Python strings are immutable — never += in a loop. Build a list, then ''.join.
  • Hash tables = O(1) avg, O(n) worst. The single most-used DS.
  • Use Counter / defaultdict liberally; they cut 5-10 lines off most problems.
  • Frequency counting solves anagram-class problems.
  • Prefix sums solve range-sum-class problems in O(1) per query after O(n) preprocessing.
  • Hash + prefix sum solves "subarray with sum = k" in O(n).
  • For Top K, bucket sort gives O(n) when frequencies are bounded.

Next: 02-two-pointers.md.

External resources

main
UTF-8·typescript