The first time I cared about memory, I was in college doing competitive programming.
Every problem came with two budgets, a time limit and a memory limit. The judge would happily reject a correct answer that took 1.1 seconds when the limit was 1.0, or used 65 megabytes when the cap was 64. Half the fun of competitive programming, and most of the pain, lived in that gap. You’d write a solution that worked on the samples, submit it, and watch the verdict come back: TLE. MLE. Wrong answer on test 47. Runtime error. Suddenly an algorithmic problem became a memory problem too.
That’s where I first read about malloc, about the heap, about how an int[] of a million elements is four megabytes and where those four megabytes actually live. Almost every memory bug I’ve debugged since has rhymed with that early lesson: the memory you allocate has to live somewhere, somebody has to give it back, and somebody (your runtime, the operating system, your future self) has to know what’s still in use.
Kent Beck has a line for this.
Make it work. Make it right. Make it fast. — Kent Beck
Most of the engineering time I see these days goes into the first step. Some into the second. The third, making it fast, making it small, making it actually fit, is rare. Abundant hardware and forgiving runtimes hide a lot of sins. But when “make it fast” does come up, it almost always means understanding the rooms underneath the runtime. The first two steps are about logic. This one is about machinery.
Most of what I knew about memory in college was bookish. The rest I picked up later, slowly, from working on real systems and watching tools disagree with each other about what was happening. This post is the part both versions agree on: what an operating system gives a program, what an allocator does, what a garbage collector does, and how the layers under your runtime fit together.
What does the OS give a program?
When you run ./my-program, the operating system creates a process and hands it something called an address space.
An address space is just a long list of byte slots, numbered from 0 up to a very large number. On a 64-bit system that number is 2^48 (current hardware translates only 48 of the 64 address bits), which works out to 256 terabytes of addressable bytes. Your program does not have 256 terabytes of memory. It has the illusion of 256 terabytes of memory. The rest of this post is mostly about the gap between that illusion and reality.
The illusion is called virtual memory. Each process gets its own private address space. When the program reads or writes to address 0x7fff_a823_4000, the CPU translates that address — through hardware called the MMU — into a real physical address in your RAM. The translation table is called a page table, and the unit of translation is a page, usually 4 kilobytes.
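Concretely, with 4 KB pages the low 12 bits of an address pass through untranslated and the upper bits select the page. A quick sketch of the arithmetic:

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uintptr_t addr   = 0x7fffa8234000;  // the address from above
  uintptr_t page   = addr >> 12;      // virtual page number: what the page table maps
  uintptr_t offset = addr & 0xfff;    // byte within the 4 KB page: passes through as-is
  printf("page %#lx, offset %#lx\n", (unsigned long)page, (unsigned long)offset);
}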
A few things follow from this design that matter for everything below.
First, a virtual address is not real memory until it’s used. The kernel can hand you a billion bytes of address space and not back a single one of them with physical RAM. Pages get backed on demand, the first time you read or write to them. This is why a program can ask for a huge mmap and the OS shrugs and says yes. (There’s a short demonstration after the third point below.)
Second, two processes with the same virtual address point to different physical bytes. The translation is per-process. This is why processes can’t trample each other’s memory.
Third, the kernel can move pages around. It can swap them to disk if memory is tight, share them between processes if they’re identical (think glibc, loaded once, mapped into every process), or unmap them entirely. From the program’s point of view, the address stays the same.
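The first point is easy to see for yourself. A minimal sketch, assuming Linux: the kernel grants a gigabyte of address space immediately, but physical pages appear only on first touch.

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
  size_t size = 1UL << 30;  // ask for 1 GB of address space
  char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return 1;
  // At this point the kernel has handed out addresses, not RAM.
  p[0] = 'x';               // first touch: now one 4 KB page gets backed
  getchar();                // pause here and compare `ps -o rss,vsz`
  munmap(p, size);
}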
That’s the foundation. Now let’s look at what’s actually inside the address space.
The five regions of a process
Every Unix-like process is laid out roughly the same way. The address space is divided into a handful of regions, each with its own purpose, growth direction, and rules.
text
The text region holds your program’s compiled instructions. When you run a binary, the kernel maps the code section of the file into this region and marks it read-only and executable. It’s fixed size, doesn’t grow, and you generally don’t think about it. If you ever see Segmentation fault from writing through a pointer that points into your code, this read-only mapping is why.
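You can trip that protection on purpose. A minimal sketch (undefined behavior, purely illustrative):

int main(void) {
  char *code = (char *)main;  // an address inside the text region
  *code = 0;                  // SIGSEGV: the page is mapped read-only
}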
data and bss
The data segment holds globals and statics that have an explicit initial value; its neighbor, .bss, holds the ones that don’t:
int counter = 42; // initialized — goes in .data
int buffer[1024]; // uninitialized — goes in .bss
static char *name; // uninitialized — also .bss
.bss (Block Started by Symbol — the name is historical and pointless) holds uninitialized globals. The kernel zero-fills them at startup so you never observe garbage in them. Both regions are fixed size.
heap
The heap is where your program asks for memory at runtime. It’s the interesting region — most of what we’ll discuss in this post happens here. We’ll come back to it in the next two sections.
mmap region
The mmap region is where the kernel maps in things that are too big or too special to live on the heap. Three main occupants:
- Shared libraries. Every .so your program links against (libc.so, libssl.so, the JVM, etc.) is mmaped into this region. They live here because they can be shared across processes — the kernel maps the same physical pages into many address spaces.
- Large heap allocations. When you ask for, say, a 10 MB buffer, allocators usually skip the heap and ask the kernel for a fresh mmap directly. We’ll see why in a moment.
- Memory-mapped files. When you mmap a file, the kernel makes the file’s bytes appear as part of your address space (a sketch follows this list).
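A minimal sketch of that last one, assuming Linux and a file that exists (/etc/hostname here), with error handling mostly trimmed:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
  int fd = open("/etc/hostname", O_RDONLY);
  if (fd < 0) return 1;
  struct stat st;
  fstat(fd, &st);
  char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (data == MAP_FAILED) return 1;
  fwrite(data, 1, st.st_size, stdout);  // no read() call, just memory access
  munmap(data, st.st_size);
  close(fd);
}

The file’s bytes aren’t copied in up front; each page is pulled from disk the first time the program touches it.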
stack
The stack holds function call state. Every time a function is called, a new stack frame is pushed: arguments, local variables, the return address, saved registers. When the function returns, the frame is popped. This is automatic, fast, and completely safe — you never free a stack variable.
The stack has a fixed size limit (on Linux, 8 MB by default). If you blow past it, you get a stack overflow and the process dies. You blow past it by recursing too deeply or allocating something huge on the stack:
void boom() {
  char buffer[16 * 1024 * 1024]; // 16 MB on the stack
  // segfault before this line ever executes
}
The stack and the heap grow toward each other. The stack starts at high addresses and grows down. The heap starts low and grows up. The unmapped gap between them shrinks as your program does more work.
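You can watch this layout from inside a process. A quick sketch that prints one address per region; on a typical Linux box the stack value comes out far above the heap’s:

#include <stdio.h>
#include <stdlib.h>

int global = 1;                            // .data

int main(void) {
  int local = 2;                           // stack
  int *dyn = malloc(sizeof(int));          // heap
  printf("text  ~ %p\n", (void *)main);    // (function-to-void* casts are a common POSIX-ism)
  printf("data  ~ %p\n", (void *)&global);
  printf("heap  ~ %p\n", (void *)dyn);
  printf("stack ~ %p\n", (void *)&local);
  free(dyn);
}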
The stack in detail
Function calls don’t happen by magic. When you call a function, the caller and the callee follow a contract called the calling convention — a set of rules about which registers hold what, where return values go, and what gets pushed onto the stack.
A typical stack frame (sketched here for x86-64; the exact layout depends on the calling convention) looks roughly like this:
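          higher addresses
+-----------------------------------+
| caller's frame                    |
+-----------------------------------+
| arguments that spilled past       |
| the registers                     |
| return address                    |
| saved frame pointer               |
| local variables                   |   <- current frame
+-----------------------------------+   <- stack pointer
| free space (grows downward)       |
          lower addresses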
Two properties make the stack fast and limited.
It’s fast because allocation is one instruction: subtract from the stack pointer. Deallocation is one instruction: add to it. There’s no list of free chunks to search, no metadata to update.
It’s limited because the lifetime of a stack variable is the lifetime of the function call. The moment a function returns, every byte in its frame is reclaimed. If you need memory that outlives the function — to return to the caller, to store in a long-lived data structure, to pass to another thread — the stack is the wrong place. You need the heap.
int *bad() {
  int x = 42;
  return &x; // returning a pointer to a stack variable
} // x is destroyed here. The pointer is dangling.

int *good() {
  int *x = malloc(sizeof(int));
  *x = 42;
  return x; // heap-allocated, survives the return
} // the caller now owns x and must eventually free() it
That contrast is the whole reason the heap exists.
The heap in detail
The heap is a region of address space that your allocator manages. The allocator is a library — usually the C standard library’s malloc/free, or one of its drop-in replacements like jemalloc, tcmalloc, or mimalloc. It is not the kernel. It’s a userspace component that sits between your program and the kernel.
When your program calls malloc(size), here’s what happens.
The allocator keeps a cache of memory it has already gotten from the kernel. When you ask for 64 bytes, it looks in its cache, finds a free slot, hands you a pointer. No system call needed.
When the cache runs dry, the allocator asks the kernel for more memory. There are two ways to ask:
- brk/sbrk — moves the top of the heap up. This is the classic Unix way: the heap has a single contiguous range, and sbrk extends it. Cheap, but the heap is one piece.
- mmap — gets a fresh, separate region of pages somewhere in the address space. More flexible, but each region is its own thing.
Modern allocators use both. Small allocations come from a heap extended by sbrk. Large ones (typically over 128 KB on glibc) skip the heap and live in their own mmap regions, so they can be returned to the kernel independently when freed.
Inside the heap, the allocator keeps free lists — linked lists of chunks that are not currently in use, organized by size. When you malloc(64), the allocator picks a free chunk that fits, splits it if it’s bigger, returns the rest to a free list. When you free(ptr), the chunk goes back on a free list, possibly merged with neighbors.
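Here is that machinery in miniature: a toy first-fit allocator with no splitting, no coalescing, and no thread safety. Production allocators do all three and much more; this is only the skeleton.

#include <stddef.h>
#include <unistd.h>

typedef struct chunk {
  size_t size;          // payload size in bytes
  int free;             // is this chunk available?
  struct chunk *next;   // every chunk the allocator has ever created
} chunk;

static chunk *head = NULL;

void *toy_malloc(size_t size) {
  // First fit: reuse the first free chunk that is big enough.
  for (chunk *c = head; c != NULL; c = c->next) {
    if (c->free && c->size >= size) {
      c->free = 0;
      return c + 1;     // the payload starts right after the header
    }
  }
  // Nothing fits: extend the heap, classic sbrk style.
  chunk *c = sbrk(sizeof(chunk) + size);
  if (c == (void *)-1) return NULL;
  c->size = size;
  c->free = 0;
  c->next = head;
  head = c;
  return c + 1;
}

void toy_free(void *ptr) {
  if (ptr == NULL) return;
  chunk *c = (chunk *)ptr - 1;  // step back to the header
  c->free = 1;                  // a real allocator would coalesce neighbors here
}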
This is also where fragmentation happens. Your program allocates and frees many things over time. Eventually the free space is broken into small islands, none of them big enough for the next big allocation, even though the total free memory is plenty. Allocators spend a lot of effort fighting this — bin sizes, slab caches, coalescing — but they can never eliminate it completely.
If you ever wonder why a long-running program keeps getting bigger even though “the heap should be the same size by now,” fragmentation is one of the answers.
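A sketch of that in miniature, glibc-flavored (exact behavior varies by allocator and by the headers and rounding it adds):

#include <stdlib.h>

int main(void) {
  char *slots[1000];
  for (int i = 0; i < 1000; i++)
    slots[i] = malloc(1024);   // 1000 small chunks, side by side
  for (int i = 0; i < 1000; i += 2)
    free(slots[i]);            // free every other one: ~500 KB of holes
  // Half a megabyte is free, but no hole is bigger than 1 KB, and every
  // hole is fenced in by a live neighbor. This request can't use any of
  // them; the allocator has to grow the heap instead.
  char *big = malloc(4096);
  free(big);
  for (int i = 1; i < 1000; i += 2) free(slots[i]);
}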
Managed runtimes layer on top
Now the second floor of the building.
Languages with garbage collection — Java, Python, Go, JavaScript, C# — don’t expose malloc to you. You write new Foo() or [1, 2, 3] or {}, and an object appears. There’s no free. So where does the memory come from?
It comes from mmap. The runtime asks the kernel for a giant region of memory — usually tens or hundreds of megabytes at a time — and then sub-allocates inside it using its own bookkeeping. Each runtime has its own scheme.
- The JVM has generational heaps — Eden, Survivor, Old, Metaspace — each its own arena, each with its own collector strategy.
- V8 and JSC (the engines behind Node.js, Chrome, Bun) split the heap into a young generation (small, fast, frequent collection) and an old generation (larger, slower).
- CPython has a small-object allocator on top of malloc, with arenas and pools.
- The Go runtime has an mheap that’s structured into spans, each carved into objects of a fixed size class.
The point isn’t the specifics. It’s the shape: when your runtime says “the heap is 200 MB,” it means its arenas hold 200 MB of live objects. The kernel might see your process holding 800 MB of mmap’d pages — the rest is arena overhead, free space inside the arenas, fragmentation, and other runtime structures. The OS heap and the runtime heap are not the same thing, and they almost never report the same number.
This is the first place where memory tooling starts to lie to you. process.memoryUsage().heapUsed reports the runtime’s number. ps -o rss reports the kernel’s. They are different numbers, and disagreeing is normal.
What “leak” actually means
A memory leak is, loosely, memory your program no longer needs but has not given back. The reason that loose definition is unsatisfying is that the underlying mechanism differs by language. There are three flavors.
Flavor 1: forgotten free (C, C++)
This is the classic. You allocated, you didn’t free, you lost the pointer.
char *make_greeting() {
  char *msg = malloc(64);
  sprintf(msg, "hello");
  return msg; // caller is supposed to call free()
}

int main() {
  for (int i = 0; i < 1000000; i++) {
    make_greeting(); // pointer thrown away
  } // 64 MB lost; nothing can reach it
}
After the loop, 64 MB of heap chunks exist, every one of them allocated, none of them reachable from anything in your program. The allocator doesn’t know they’re unreachable — it just knows they’re not free. The OS doesn’t know either. They sit there, billed to your process, until the process exits.
This flavor is impossible in a managed language because there’s no malloc/free for you to forget. Which brings us to:
Flavor 2: reachable orphan (Java, Go, Python, JavaScript)
In a managed language, the garbage collector traces from a set of roots (the call stacks of all threads, global variables, registers) and finds every object that’s still reachable. Anything not reachable is recycled. So flavor 1 cannot happen.
What can happen instead is the opposite: memory that is reachable, by accident, that you no longer need.
const cache = new Map();

function handle(req) {
  cache.set(req.id, req.payload);
  return cache.get(req.id);
}
Every entry in cache is reachable from a module-level global. As far as the garbage collector can tell, every entry is in use. It can’t help you. The leak is not “I forgot to free”; it’s “I remembered too well.” A cache that doesn’t evict, an event listener that’s never unregistered, a closure that pins a large object alive — these are the modern leaks.
Flavor 3: cycle (Python, Swift, anything refcounted)
Some languages don’t trace from roots. They keep a reference count on every object. When the count hits zero, the object is freed immediately. Python, Swift, Objective-C, and COM all use refcounting (Python with a tracing collector layered on top).
Refcounting is fast and predictable. It’s also blind to cycles.
class Node:
    def __init__(self): self.peer = None

a = Node()
b = Node()
a.peer = b
b.peer = a
del a, b
# refcounts: a is held by b.peer (1), b is held by a.peer (1)
# nothing reaches 0; both objects stay alive
Python ships a separate cycle collector that periodically walks the object graph to find and break these cycles. Swift doesn’t, which is why Swift programmers learn the words weak and unowned.
So: same word “leak,” three different mechanisms, three different fixes. The mechanism you have determines the tool you reach for.
Garbage collection in detail
Since two of the three leak flavors live inside garbage-collected runtimes, it’s worth a little time on how they actually work.
A tracing collector does what we just described: starts at a set of roots (stack frames, globals, registers), walks every reference it can follow, marks each object it visits, and at the end of the walk reclaims everything that wasn’t marked. This is the mark-and-sweep algorithm, and almost every modern GC is a variant of it.
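Here is the core of it in miniature: a sketch over a toy object graph where each object holds at most one reference. Real collectors walk arbitrary graphs and find their roots by scanning stacks and registers; the roots array here stands in for that.

#include <stdlib.h>

typedef struct obj {
  int marked;
  struct obj *ref;    // toy objects hold at most one reference
  struct obj *next;   // every allocation, linked together for the sweep
} obj;

static obj *all_objects = NULL;

obj *new_obj(void) {
  obj *o = calloc(1, sizeof(obj));
  o->next = all_objects;
  all_objects = o;
  return o;
}

// Mark phase: follow references from a root, marking everything reached.
void mark(obj *o) {
  while (o != NULL && !o->marked) {  // the marked check also stops cycles
    o->marked = 1;
    o = o->ref;
  }
}

// Sweep phase: anything unmarked is unreachable, so free it.
void sweep(void) {
  obj **p = &all_objects;
  while (*p != NULL) {
    if ((*p)->marked) {
      (*p)->marked = 0;              // clear the mark for the next cycle
      p = &(*p)->next;
    } else {
      obj *dead = *p;
      *p = dead->next;
      free(dead);
    }
  }
}

void gc(obj **roots, int nroots) {
  for (int i = 0; i < nroots; i++) mark(roots[i]);
  sweep();
}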
The basic algorithm has problems in practice. It walks every reachable object, which can be millions on a real heap. While it walks, the program has to pause — a “stop-the-world” GC. For a long time, GC pauses were the main reason JVM apps got a bad reputation in latency-sensitive shops.
That reputation is mostly historical. ZGC, on heaps in the tens of gigabytes, hits sub-millisecond max pauses. The “stop the world” name has lasted longer than the pauses themselves.
Modern collectors fix this with two big ideas.
The first is the generational hypothesis: most objects die young. A typical web request creates a flurry of short-lived objects (request bodies, JSON parses, intermediate strings) that die before the next request arrives. A small number of objects (caches, connections, the HTTP server itself) live forever. So why scan everything every time? Modern collectors split the heap into a small young generation and a larger old generation. They scan the young generation often and quickly. They scan the old one rarely.
The second is concurrency. A concurrent collector does most of its work alongside the program, with only short pauses for the parts that absolutely need the program to be still. ZGC and Shenandoah on the JVM, Go’s collector, V8’s Orinoco — all concurrent.
A refcounting collector — Python, Swift — is a totally different shape. Every object has an integer counter. Every assignment that creates a reference bumps the counter up; every reassignment or scope exit bumps it down. When it hits zero, the object is freed immediately. No tracing, no pauses, very predictable. The cost is that it can’t reclaim cycles, and the per-assignment overhead is real.
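The bookkeeping fits in a few lines. A sketch of the skeleton underneath CPython’s Py_INCREF/Py_DECREF and Swift’s retain/release:

#include <stdlib.h>

typedef struct rc_obj {
  int refcount;
  /* ... payload ... */
} rc_obj;

rc_obj *rc_new(void) {
  rc_obj *o = malloc(sizeof(rc_obj));
  if (o != NULL) o->refcount = 1;  // the creator holds the first reference
  return o;
}

void rc_retain(rc_obj *o) { o->refcount++; }  // every new reference pays this

void rc_release(rc_obj *o) {
  if (--o->refcount == 0)  // last reference dropped:
    free(o);               // reclaimed immediately, no tracing, no pause
}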
Most modern runtimes are hybrids. Python is refcounting plus a cycle collector. Modern JVMs are tracing plus generations plus concurrency. The shape of your runtime’s GC determines the shape of the leaks you’ll have.
The hidden room: native memory
Here’s the part most engineers don’t internalize until they’ve been bitten.
Every managed runtime has a native underbelly. The runtime itself is native code: V8, JSC, and the JVM are C++, CPython is C, and even Go’s runtime, though mostly written in Go, manages raw memory beneath the collector. Many libraries you use are also written in C and called via the runtime’s foreign function interface. All of these can — and do — allocate memory using malloc or mmap directly, without going through the runtime’s heap.
That memory exists. The OS sees it (it’s part of your RSS). But the runtime’s heap counter doesn’t know about it, because the runtime didn’t allocate it.
Examples of native allocation, by ecosystem:
- Java: DirectByteBuffer allocates off-heap memory for I/O. JNI calls allocate whatever the native code asks for. Netty uses pooled direct buffers heavily.
- Node.js: Buffer.allocUnsafe(N) allocates N bytes of native memory outside V8’s heap. Every TCP socket, every file read, every stream chunk passes through these. Much of Node’s internal queueing for streams and sockets lives on the C++ side too.
- Python: every C extension (numpy, lxml, cryptography) allocates with malloc, completely invisible to Python’s allocator counters.
- Go: cgo calls into C. Anything allocated by the C side is outside Go’s heap.

Watch the split in action:
const buffers = [];
setInterval(() => {
  buffers.push(Buffer.allocUnsafe(10 * 1024 * 1024));
}, 100);

// process.memoryUsage().heapUsed: barely moves
// process.memoryUsage().rss: climbs forever
This split is why memory debugging in managed languages can be frustrating. Your runtime gives you a heap snapshot — it’s a beautiful tool, you can see every object, every reference, every retainer chain. But if the leak is on the native side, the heap snapshot will tell you nothing. The whole leak is invisible to it.
What different layers report
There are three layers that can answer the question “how much memory is my process using.” They almost never agree, and the disagreement is the signal you actually want.
| layer | what it sees | how you ask |
|---|---|---|
| OS | resident memory, virtual size | ps, /proc/<pid>/smaps_rollup |
| runtime | its own managed heap | process.memoryUsage, MemoryMXBean |
| allocator | native pages, fragmentation, arenas | jemalloc stats, mallinfo |
Asking the OS, on Linux:

ps -o pid,rss,vsz -p $$
cat /proc/$$/status | grep -E '^Vm'
cat /proc/$$/smaps_rollup
Asking the runtime, in Node:

console.log(process.memoryUsage());
// {
// rss: 18923520, // OS-level — total resident
// heapTotal: 6328320, // V8's heap reservation
// heapUsed: 4587392, // V8's live objects
// external: 781072, // C++ side allocations V8 tracks
// arrayBuffers: 17890 // ArrayBuffer payloads
// }
A few useful relationships fall out of these layers.
heapUsed is always less than heapTotal (live objects fit inside the runtime’s reservation). heapTotal is, roughly, part of rss; reserved pages only count toward rss once they’re touched. rss is heapTotal plus external plus arrayBuffers plus all the native allocations the runtime doesn’t track, plus the code, the stacks, the loaded .so files, and so on.
When rss and heapUsed grow together, your memory pressure is on the runtime side. When rss grows and heapUsed stays flat, the pressure is somewhere the runtime cannot see — the room your tool does not look in.
This is the part that surprises most people, and it’s where the gnarliest memory bugs live: a runtime that swears everything is fine, sitting inside an operating system that is in the process of killing the container.
Going further
Three pieces I’d reach for if any of this caught your interest:
- Ulrich Drepper, What Every Programmer Should Know About Memory — the canonical deep dive on caches, virtual memory, and what the CPU actually does. Long, but ages remarkably well. lwn.net/Articles/250967
- Julia Evans, Bite Size Linux and Memory Allocation zines — the most readable, most genuinely fun primers I know on this material. wizardzines.com
- Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective — the textbook version. The chapters on virtual memory and dynamic allocation are worth the whole book.
Every program runs on the same machinery. The languages on top hide different parts of it. Knowing the parts is the difference between being a guest in your runtime and being at home in it.