Thread Pool Starvation: From Blocking recv() to Reactor I/O

A C++ in-memory full-text search engine stopped responding in production.

1.7 million indexed documents. Real-time sync via MySQL binlog replication. Active/Standby HA with Keepalived. Three upstream web application servers connecting over TCP through a VIP.

The OpenResty reverse proxy was throwing 5xx errors continuously. netstat showed over 100 ESTABLISHED TCP connections piling up. The process was alive. Deadlock?

I checked every thread's wait state via /proc/<pid>/task/*/wchan. All workers were blocked normally — futex_wait_queue (condition variable wait) or inet_csk_accept (accept wait). Not a deadlock. New connections got responses, yet existing requests went nowhere.

Why Two Connections Killed the Server

Three factors combined.

Factor 1: One Connection = One Worker, Blocking I/O

The TCP server assigned one worker thread per connection, blocking on recv() for the connection's entire lifetime.

```cpp
// Old: HandleConnection — worker owns the connection for its full lifetime
void HandleConnection(int client_fd) {
    char buf[4096];
    while (!shutdown) {
        ssize_t n = recv(client_fd, buf, sizeof(buf), 0);  // ← blocks here
        if (n <= 0) break;  // disconnect, error, or SO_RCVTIMEO timeout
        std::string response = dispatcher->Dispatch(parse(buf));
        send(client_fd, response.data(), response.size(), 0);
    }
}
```
What is a persistent connection?

In HTTP/1.0, each request opened and closed a new TCP connection. HTTP/1.1 made it standard to reuse a single TCP connection for multiple requests — a persistent (keep-alive) connection. Less connection overhead, but the server must hold onto connections that may sit idle indefinitely.

Even when a persistent connection's client is idle, the assigned worker stays blocked on recv(). The worker is freed only when the client disconnects or SO_RCVTIMEO (default 60 seconds) fires.

Factor 2: Raw hardware_concurrency()

The worker count auto-sizing logic looked like this:

```cpp
ThreadPool::ThreadPool(size_t num_threads) {
    if (num_threads == 0) {
        num_threads = std::thread::hardware_concurrency();  // → 2 on 2-vCPU VM
    }
    // ... spawn num_threads workers ...
}
```

The production VM had 2 vCPUs. Workers = 2. Two persistent connections was all it took — every worker stuck on an idle recv. Subsequent requests queued up with nobody to process them.

Factor 3: Half-Open TCP Connections

What is a TCP half-open connection?

A state where one side of a TCP connection disappears without notifying the other. A client process crash, or a middlebox dropping the session, leaves the server thinking the connection is alive — stuck in recv() forever. SO_KEEPALIVE makes the OS send periodic probe packets, but by default it's either off or set to probe intervals too long to matter.

The upstream web application wasn't setting SO_KEEPALIVE. TCP connections were left half-open, and the server-side Linux kernel default keepalive (2 hours idle + 9 probes × 75 seconds ≈ 2 hours 11 minutes) couldn't detect them in practice. Stale ESTABLISHED sockets lingered for days, eating workers.

The Bug Was Always There

This blocking I/O model was baked into the design from the start. Previously, the engine ran on a large VM with 128GB RAM where hardware_concurrency() returned 8–16. Workers wouldn't run out until idle persistent connections hit that number. The web application mostly used per-request connections, so it rarely got there.

When memory usage grew and the engine was moved to a dedicated 2 vCPU VM, the threshold dropped to 2. On the same day, VictoriaMetrics + Grafana went live, making the OpenResty 5xx rate visible for the first time.

Migration was the trigger. Monitoring was the discovery. Both happened on the same day — not by coincidence, but as parts of a single VM separation plan.

Stopping the Bleeding

Once the cause was clear, the fix was simple. Stop using hardware_concurrency() raw and set a floor on the worker count. Also made worker_threads configurable via YAML, along with per-connection timeouts and keepalive settings.
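A minimal sketch of that floor logic, assuming the YAML value arrives as `configured` and a hypothetical `kMinWorkers` constant:

```cpp
#include <algorithm>
#include <thread>

// Never let the auto-sized pool drop below a minimum, regardless of vCPU count.
size_t ResolveWorkerCount(size_t configured) {
    if (configured != 0) return configured;            // explicit YAML value wins
    size_t hw = std::thread::hardware_concurrency();   // may be 2 — or even 0
    constexpr size_t kMinWorkers = 8;                  // illustrative floor
    return std::max(hw, kMinWorkers);
}
```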

This only raised the threshold for worker exhaustion. Exceed the new limit with persistent connections, and the same symptoms return.

Moving to Reactor I/O

The web server world solved this long ago. Apache hit the concurrent connection wall with its thread/process-per-connection model. nginx replaced it with an epoll-based event-driven architecture — the C10K problem, early 2000s.

What is the C10K problem?

A challenge posed by Dan Kegel in 1999: "Can a single server handle 10,000 concurrent connections?" Under the prevailing thread/process-per-connection model, 10,000 threads would saturate a server on context switches and memory alone. The arrival of I/O multiplexing APIs like epoll (Linux 2.6) and kqueue (FreeBSD 4.1) broke through that wall.

Redis was designed with a single-threaded event loop from day one. Thread-per-connection was never on the table.

This wasn't C10K. Two connections brought the server down. I had assumed "a handful of connections at most" when designing it, and that assumption broke. The fix was the same — replace the blocking recv() loop with an epoll/kqueue-based event-driven model.

Decoupling Connections from Workers

What is the reactor pattern?

An event-driven I/O design pattern where a single event loop monitors multiple I/O sources and dispatches only the ones that are ready. Formalized by Doug Schmidt in 1995. Node.js internals (libuv), nginx, and Redis are canonical implementations. The polar opposite of thread-per-connection — a small number of threads handles a large number of connections.

What is an fd (file descriptor)?

An integer that Unix-like operating systems use to identify I/O resources such as files and sockets. Each TCP connection gets one fd. In the reactor model, these fds are registered with epoll/kqueue for event monitoring. There's a per-process limit (default 1024, expandable to tens of thousands), which becomes the theoretical ceiling for concurrent connections.

This is the core of the reactor model. Idle persistent connections exist only as entries in the event loop's fd table — no worker thread consumed. A worker is claimed only when a complete command frame arrives and gets submitted as a task.

The constraint "connections ≤ workers" disappeared.

```
Connections : 10,000+ OK (fd table only)
Workers     : CPU×2 or so (for CPU-bound dispatch)
Event loop  : 1 thread
```
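The fd ceiling is easy to inspect from inside the process; a small POSIX sketch using getrlimit:

```cpp
#include <sys/resource.h>

// Query the per-process fd ceiling (RLIMIT_NOFILE). The soft limit defaults
// to 1024 on many distros; raising it lifts the reactor's theoretical
// concurrent-connection ceiling.
long FdSoftLimit() {
    rlimit rl{};
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return -1;
    return static_cast<long>(rl.rlim_cur);
}
```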

One Task per Connection

A single event-loop thread monitors all fds for readiness, reads data from readable fds via non-blocking recv(), and hands complete frames to worker threads. Workers handle dispatch + response only.
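A stripped-down sketch of one event-loop iteration, using a pipe in place of a TCP socket so it stays self-contained. Level-triggered epoll, Linux-only; the function and names are illustrative, not the engine's actual code:

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <fcntl.h>
#include <string>

// Wait for readiness on a pipe's read end and drain it — stands in for
// "monitor fds, non-blocking read, hand the complete frame to a worker".
std::string PollOnceAndRead() {
    int fds[2];
    if (pipe(fds) != 0) return "";
    fcntl(fds[0], F_SETFL, O_NONBLOCK);               // non-blocking read end

    int ep = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;                               // level-triggered (default)
    ev.data.fd = fds[0];
    epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev);

    write(fds[1], "ping", 4);                          // simulate an incoming frame

    epoll_event ready[8];
    int n = epoll_wait(ep, ready, 8, 1000);            // one event-loop iteration
    std::string frame;
    if (n == 1 && (ready[0].events & EPOLLIN)) {
        char buf[16] = {};
        ssize_t r = read(ready[0].data.fd, buf, sizeof(buf));
        if (r > 0) frame.assign(buf, static_cast<size_t>(r));  // → submit to worker
    }
    close(fds[0]); close(fds[1]); close(ep);
    return frame;
}
```

The real loop differs in the obvious ways (many fds, frame reassembly, error paths), but the readiness-then-drain shape is the same.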

What is CAS (Compare-And-Swap)?

An atomic operation: "If the current value matches the expected value, replace it with a new value." Enables lock-free coordination between threads. In C++, it's std::atomic::compare_exchange_strong().

Each connection holds an atomic flag. CAS ensures at most one task per connection in flight at any time. When a worker finishes its task, it clears the flag and rechecks for new frames that may have arrived in the meantime. This "clear-then-recheck" idiom prevents dropped frames.
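The claim / clear / recheck idiom can be sketched like this (illustrative names, not the engine's actual API):

```cpp
#include <atomic>

struct ConnState {
    std::atomic<bool> task_in_flight{false};  // at most one task per connection
    std::atomic<int>  pending_frames{0};      // frames parsed but not yet dispatched
};

// Event loop: claim the connection when a complete frame arrives.
bool TryClaim(ConnState& c) {
    bool expected = false;  // CAS succeeds only if no task is currently running
    return c.task_in_flight.compare_exchange_strong(expected, true);
}

// Worker: clear the flag, then recheck — a frame may have arrived between
// "done processing" and "flag cleared". Without the recheck it would be dropped.
bool FinishAndRecheck(ConnState& c) {
    c.task_in_flight.store(false);
    if (c.pending_frames.load() > 0) {
        return TryClaim(c);  // true ⇒ caller must resubmit a task
    }
    return false;
}
```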

Non-Blocking Writes and Backpressure

What is backpressure?

When a sender pushes data faster than the receiver can consume it, buffers overflow or memory runs out. Backpressure is the umbrella term for mechanisms that slow or stop the sender when the receiver falls behind. TCP has its own backpressure via window control, but applications often add their own limits on top.

What is EAGAIN?

An error code returned when calling send() or recv() on a non-blocking socket and there's no data to read or no buffer space to write. In blocking mode, the OS would wait. In non-blocking mode, it returns EAGAIN immediately and gives control back.

Writes went non-blocking too. EnqueueResponse() first tries an inline send(). On EAGAIN, it queues the data and enables EPOLLOUT monitoring. The event loop's OnWritable() sends queued responses as the socket becomes writable.

Each connection has a max_write_queue_bytes cap (default 16 MiB). Exceed it, and the connection is force-closed. This prevents a single slow reader from exhausting server memory.
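A sketch of that enqueue path under stated assumptions: `try_send` stands in for a non-blocking send() that returns the byte count written or -1 on EAGAIN, and the cap here is tiny purely for illustration:

```cpp
#include <deque>
#include <string>
#include <functional>

struct WriteQueue {
    std::deque<std::string> pending;
    size_t pending_bytes = 0;
    size_t max_bytes;               // per-connection cap (16 MiB in the article)
    bool want_epollout = false;     // ask the event loop to watch writability
    bool closed = false;            // backpressure limit hit → force-close

    explicit WriteQueue(size_t cap) : max_bytes(cap) {}

    void Enqueue(std::string data,
                 const std::function<long(const std::string&)>& try_send) {
        if (pending.empty()) {
            long n = try_send(data);                     // inline fast path
            if (n == static_cast<long>(data.size())) return;  // fully written
            if (n > 0) data.erase(0, static_cast<size_t>(n)); // keep unsent tail
        }
        pending_bytes += data.size();
        if (pending_bytes > max_bytes) { closed = true; return; }  // cap exceeded
        pending.push_back(std::move(data));
        want_epollout = true;        // OnWritable() will drain the queue later
    }
};
```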

shared_ptr Ownership Model

ReactorConnection is managed via std::enable_shared_from_this. The event loop's connection map and the worker's task co-own each connection. If the event loop unregisters a connection on EPOLLHUP first, the worker's task can still finish writing its response.

```cpp
auto self = shared_from_this();  // the task co-owns the connection
bool submitted = thread_pool_->Submit([self]() { self->DrainTask(); });
// `self` keeps the ReactorConnection alive until DrainTask completes
```

Lock Hierarchy

Deadlocks are the scariest part of concurrent programming. The reactor has two locks: a per-connection write lock and a reactor-wide lifecycle lock. The rule is simple: fix the acquisition order.

The event loop only touches the lifecycle lock, never the write lock. Workers may take the lifecycle lock (shared) while holding the write lock; the reverse order is forbidden. This constraint is documented in a thread-safety contract comment in the header file.
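The ordering rule can be sketched with standard mutexes (illustrative names; the real types differ):

```cpp
#include <mutex>
#include <shared_mutex>

// Allowed order: per-connection write lock first, then the reactor-wide
// lifecycle lock (shared). The event loop takes only the lifecycle lock,
// so no lock cycle can form.
struct Reactor {
    std::shared_mutex lifecycle_mutex;  // reactor-wide
};

struct Conn {
    std::mutex write_mutex;             // per-connection
};

// Worker path: write lock, then lifecycle (shared).
bool WorkerFlush(Reactor& r, Conn& c) {
    std::lock_guard<std::mutex> wl(c.write_mutex);
    std::shared_lock<std::shared_mutex> ll(r.lifecycle_mutex);
    return true;  // safe region: may write and consult reactor state
}

// Event-loop path: lifecycle lock only, never a write lock beneath it.
bool EventLoopSweep(Reactor& r) {
    std::unique_lock<std::shared_mutex> ll(r.lifecycle_mutex);
    return true;
}
```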

Platform Abstraction

Level-triggered vs. edge-triggered

epoll and kqueue offer two notification modes. Level-triggered means "notify me every time there's data." Edge-triggered means "notify me only when the state changes." Edge-triggered looks more efficient, but risks missing data that wasn't fully read in one pass. Level-triggered re-notifies on the next Poll() if data remains, making the implementation safer.

An EventMultiplexer abstract class provides the interface. EpollMultiplexer for Linux (level-triggered, EPOLLRDHUP always monitored) and KqueueMultiplexer for macOS/BSD. Tests inject a MockEventMultiplexer for deterministic verification of event-loop logic.
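The seam might look roughly like this (a hypothetical interface, simplified from whatever the real EventMultiplexer exposes):

```cpp
#include <vector>

struct IoEvent { int fd; bool readable; bool writable; bool hangup; };

// Production links an epoll or kqueue implementation behind this interface;
// tests inject a mock that replays scripted events with no real fds involved.
class EventMultiplexer {
public:
    virtual ~EventMultiplexer() = default;
    virtual int Poll(std::vector<IoEvent>& out, int timeout_ms) = 0;
};

class MockEventMultiplexer : public EventMultiplexer {
public:
    std::vector<IoEvent> script;  // events the next Poll() should report
    int Poll(std::vector<IoEvent>& out, int /*timeout_ms*/) override {
        out = script;
        return static_cast<int>(out.size());
    }
};
```

With the mock, event-loop logic (frame reassembly, task submission, hangup handling) becomes deterministic to test.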

Tests

| Layer | What it tests | Count |
|---|---|---|
| EventMultiplexer | Parameterized readability/writability/hangup/batch on mock + real kqueue/epoll | 26 |
| IoReactor unit | Start/Stop lifecycle, Register after Stop rejection, close callback | 6 |
| ReactorConnection unit | Frame parse, task ordering, write queue cap, partial send tracking | 12 |
| Integration | E2E, 100+ concurrent clients, disconnect mid-request, graceful shutdown, write backpressure, rate limit, UDS | 18+ |
| Starvation regression | 128 persistent idle + 1 late client → late client must respond < 500ms | 3 |
| Blocking-mode negative control | Same conditions confirm blocking path reproduces worker exhaustion | 2 |

The starvation regression test mattered most. Hold 128 persistent idle connections open, then have one late client send a query. It must get a response within 500ms. Under blocking mode, the same test reliably times out. That's the negative control.

70+ reactor-related tests. 2,164 total tests, all passing.

No Rush, Still Done in 8 Hours

The bleeding had stopped with the worker count floor. There was no time pressure on the reactor design and implementation. No rush at all. Discovery to release: 8 hours.

I would have walked away from this before. epoll/kqueue platform abstraction, non-blocking write queue, shared_ptr ownership model, lock hierarchy design, 70+ tests — not in a day. Not in three months. I handed the whole thing to Claude Code.

During integration testing, I found that the rate limiter was silently disabled on the reactor path, and the UDS acceptor failed to start due to a double bind. Wrote reproduction tests for each before fixing. The hours spent testing paid for themselves.

Results

| Metric | Before | After |
|---|---|---|
| 5xx rate | Continuous | 0 |
| Max concurrent connections (theoretical) | worker_threads (= 2) | fd limit (tens of thousands) |
| Worker thread role | Waiting on recv (occupied even when idle) | Dispatch only (µs per task) |
| Cost of an idle connection | 1 thread (several MB stack) | 1 fd + a few KB heap |
| Half-open detection | Linux default ≈ 2h 11m | 120 seconds (60s idle + 3×20s probes) |
| Production code | | +2,077 lines (reactor core) |
| Test code | | +2,949 lines |

Three Lessons

Don't use hardware_concurrency() directly as thread pool size. VM environments routinely have 1–2 vCPUs. In a one-connection-per-worker model, the auto-sized count becomes the ceiling for concurrent clients.

C++ has no standard reactor I/O library. The Networking TS was shelved in 2021. std::execution (P2300) is under discussion for C++26 but doesn't cover I/O. The practical options are standalone Asio, libuv, libevent, or rolling your own. I chose to implement it from scratch to avoid adding dependencies (about 2,000 lines for the epoll + kqueue platform abstraction). If your project already uses Boost.Asio, building on top of it is the lower-maintenance choice.

Never use RPM %systemd_postun_with_restart for an HA in-memory database. rpm -Uvh triggers systemctl try-restart automatically after package replacement. Convenient for stateless services, but an in-memory database loses its entire index on restart. If Keepalived's health check only looks at "does the process respond," the instance holds the VIP with an empty index and serves production traffic. Use %systemd_postun (no auto-restart) and let the operator control when to restart.

nginx solved this 20 years ago. Redis never had the problem. I hit it in 2026. A humbling reminder of how much the people who came before got right.