The Truth About Hash Tables (C from Scratch)

A hash table is often described as a “fast key–value data structure,” but that description hides the most important truth: a hash table is nothing more than an array in memory with a strict set of rules layered on top.

There is no special data structure provided by the language, no hidden runtime magic, and no built-in intelligence. It is just a contiguous block of RAM and a math function.

The Problem with Arrays

If you have an array Person data[100], you can access data[5] instantly ($O(1)$). But you don't want to look up people by *number*—you want to look them up by *name* (e.g., "Alice").

The entire goal of a Hash Table is to turn a String ("Alice") into a Number (5) so you can use it as an array index.

1. The Hash Function (djb2)

We need a function that scatters strings randomly across our array indices. If we just mapped "A" to 1 and "B" to 2, we’d get clusters.

We use djb2 by Dan Bernstein, a legendary hashing algorithm known for its simplicity and good distribution.

unsigned long hash(const char *str) {
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash % TABLE_SIZE;
}

* Magic Number 5381: Chosen empirically because it results in fewer collisions for common ASCII strings.

* Bitwise Shift (`<< 5`): This effectively multiplies by 32. Adding hash makes it $\times 33$. Bitwise operations are significantly faster than multiplication on older hardware (and still very fast today).

2. The Internal Memory Structure

In high-level languages like Python or JavaScript, a dictionary grows endlessly. In C, we allocate fixed memory.

We explicitly track which slots are occupied. Why? Because in C, uninitialized memory is just garbage values. We can't tell if data[5] holds a real person or just random noise from RAM.

typedef struct {
    char name[MAX_NAME];
    int age;
} Person;

typedef struct {
    Person data[TABLE_SIZE];
    bool occupied[TABLE_SIZE]; // The "meta" layer
} HashTable;

3. Insertion & Collisions (Linear Probing)

What happens if hash("John") is 42 and hash("Jane") is also 42? This is a Collision.

We use Linear Probing: "If the seat is taken, find the next empty chair."

bool ht_insert(HashTable *ht, const char *name, int age) {
    unsigned int index = hash(name);

    for (int i = 0; i < TABLE_SIZE; i++) {
        // Linear Probe: index, index+1, index+2... 
        unsigned int probe = (index + i) % TABLE_SIZE; 

        // 1. Found empty slot? Insert.
        if (!ht->occupied[probe]) {
            strcpy(ht->data[probe].name, name);
            ht->data[probe].age = age;
            ht->occupied[probe] = true;
            return true;
        }

        // 2. Key already exists? Update it.
        // (Don't allow duplicates of the same key)
        if (strcmp(ht->data[probe].name, name) == 0) {
            ht->data[probe].age = age;
            return true;
        }
    }
    return false; // Table is 100% full
}

4. Lookup Logic (The Part Most People Forget)

Lookup mirrors insertion perfectly. You can't just check data[hash("John")]. "John" might have been bumped to the next slot because of a collision!

You must replay the probe sequence:

Jump to hash("John").

2. Is it John? Yes -> Return.

3. Is it Empty? Yes -> John doesn't exist (Stop!).

4. Is it someone else? -> Move to next slot.

int ht_find(HashTable *ht, const char *name) {
    unsigned int index = hash(name);

    for (int i = 0; i < TABLE_SIZE; i++) {
        unsigned int probe = (index + i) % TABLE_SIZE;

        // STOP condition 1: Found him!
        if (ht->occupied[probe] && strcmp(ht->data[probe].name, name) == 0) {
            return ht->data[probe].age;
        }

        // STOP condition 2: Found an empty wall. 
        // If he were here, he would have been in this slot or before it.
        if (!ht->occupied[probe]) {
            return -1; // Not found
        }
    }
    return -1;
}

5. The "Deletion" Problem

You cannot simply set occupied[i] = false to delete an item.

Why? Imagine a chain: A -> B -> C. They all hashed to index 10 and collided.

* Entry A is at 10.

* Entry B is at 11.

* Entry C is at 12.

If you delete B (mark 11 as empty), and then try to look up C, the search starts at 10 (A, nope), moves to 11 (Empty!), and stops. The algorithm assumes C doesn't exist because it hit an empty slot.

Solution: You need a third state called DELETED (or a "Tombstone"). The lookup algorithm treats "DELETED" as "Keep looking", but the insertion algorithm treats "DELETED" as "Overwrite me".

Summary

Building a hash table in C forces you to confront the reality of memory.

* O(1) isn't magic; it's just math.

* Collisions aren't edge cases; they are the norm.

* Strings are just bytes to be crunched.

View Source Code on GitHub