Scars in the Source: Lessons from Claude Code's Leaked Comments
On March 31, 2026, the full source code of Anthropic's Claude Code leaked via a source map file left in their npm registry. Around 1,900 TypeScript files. Over 512,000 lines. Runtime: Bun. Terminal UI: React + Ink.
I grepped it. Picked through comments, traced structures, read what caught my eye.
The agent loop was surprisingly simple — call the LLM, run tools, return results, repeat. Tool definitions directly determined agent quality. A Fail-Open philosophy that never let broken auxiliary features block the main flow. Enormous effort spent on prompt cache control. A terminal renderer that diffs at cell granularity. All worth studying.
But this was an unintended leak. Digging into specific implementation patterns or design decisions doesn't feel fair.
Instead, I want to talk about the comments.
Buried in those 512,000 lines are comments everywhere. Dated incident records. BigQuery analysis results. A security team's account of being bypassed five times. A frank admission that a design is "inherently fragile." The comments speak louder than the implementation itself. Scars carved into the code by engineers who bled in production.
If you're building something similar — an LLM agent tool — these scars are not someone else's problem.
What's an LLM agent tool?
Software that gives a large language model "tools" like file operations and command execution, letting it autonomously choose which tools to call while completing tasks. Claude Code, GitHub Copilot CLI, and Cursor are representative examples.
Retries Are Not Your Friend
If you're building on LLM APIs, you write retries like breathing. 429? Wait and resend. 5xx? Back off and retry. Standard practice.
Past a certain scale, retries themselves start amplifying the problem.
One comment records how an automatic context compression feature, when it failed, kept retrying infinitely.
1,279 sessions had 50+ consecutive failures (up to 3,272) in a single session, wasting ~250K API calls/day globally.
Discovered through BigQuery analysis. The fix was simple. Give up after three consecutive failures. A circuit breaker.
Circuit breaker
Same idea as an electrical circuit breaker. When failures pile up consecutively, stop sending requests entirely for a period. Prevents hammering a broken dependency and making things worse.
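The fix fits in a few lines. Here is a minimal sketch of the idea, not the leaked implementation; the class name and threshold are assumptions, matching only the "give up after three consecutive failures" rule described above.

```typescript
// Minimal circuit breaker: after `maxFailures` consecutive failures,
// stop calling the dependency entirely. Any success resets the count.
class CircuitBreaker {
  private consecutiveFailures = 0;

  constructor(private readonly maxFailures = 3) {}

  get isOpen(): boolean {
    return this.consecutiveFailures >= this.maxFailures;
  }

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen) {
      throw new Error("circuit open: dependency disabled after repeated failures");
    }
    try {
      const result = await fn();
      this.consecutiveFailures = 0; // success closes the circuit again
      return result;
    } catch (err) {
      this.consecutiveFailures++;
      throw err;
    }
  }
}
```

The important property: once the circuit opens, the broken dependency receives zero traffic, no matter how many sessions keep asking for it.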
Another comment. For retries during capacity pressure, foreground and background work were explicitly separated.
during a capacity cascade each retry is 3-10x gateway amplification
Background work — summary generation, title suggestions — keeps retrying invisibly, dragging the entire system down. So only foreground work (where the user is waiting) gets retried. Background work fails immediately.
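That policy can be sketched as a single wrapper. The names here are hypothetical, not Claude Code's; the point is the split: foreground work gets bounded retries with backoff, background work fails on the first error.

```typescript
// Retry policy split by who is waiting. Background work fails fast:
// retrying it only amplifies load during an outage, and nobody is
// watching the result anyway.
type Priority = "foreground" | "background";

async function withRetry<T>(
  fn: () => Promise<T>,
  priority: Priority,
  maxAttempts = 3,
): Promise<T> {
  const attempts = priority === "foreground" ? maxAttempts : 1;
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Exponential backoff between foreground attempts.
        await new Promise((r) => setTimeout(r, 2 ** i * 100));
      }
    }
  }
  throw lastError;
}
```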
Think of retries as binary — do or don't — and one day a data analysis tells you 250K calls have been evaporating daily. LLM API calls translate directly to money.
Connections Die Silently
LLM responses arrive via streaming. Tokens flow one by one, and users watch text generate in real time.
The streaming connections behind this experience have a nasty property. SDK and HTTP client timeouts usually only cover connection establishment. Once the connection is up, if it silently drops — the timeout never fires. The session hangs forever.
Claude Code addressed this with a watchdog that measures time since the last chunk arrived, force-killing the connection when a threshold is exceeded.
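One way to sketch that idea over an async iterable, assuming nothing about the actual implementation: race each chunk against an idle timer, and reset the timer every time a chunk arrives. Only silence trips it.

```typescript
// Idle watchdog: abort a stream if no chunk arrives within `idleMs`.
// The alarm is re-armed on every chunk, so a healthy-but-slow stream
// survives while a silently dead one gets killed.
async function* withIdleWatchdog<T>(
  stream: AsyncIterable<T>,
  idleMs: number,
): AsyncGenerator<T> {
  const iterator = stream[Symbol.asyncIterator]();
  while (true) {
    let timer!: ReturnType<typeof setTimeout>;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error(`stream idle for ${idleMs}ms`)),
        idleMs,
      );
    });
    let result!: IteratorResult<T>;
    try {
      // Whichever settles first wins: the next chunk, or the idle alarm.
      result = await Promise.race([iterator.next(), timeout]);
    } finally {
      clearTimeout(timer);
    }
    if (result.done) return;
    yield result.value;
  }
}
```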
WebSocket has the same problem.
~98% of ws_closed never recover via poll alone
Data-backed numbers.
If you're building on streaming APIs, the "connection succeeded but silently died midway" scenario is unavoidable. Tweaking SDK timeout settings won't help — they often don't cover the streaming phase at all. Whether to add your own idle watchdog depends on whether infinite hangs are acceptable.
Verify Your Runtime's Promises
Claude Code runs on Bun. The comments record multiple landmines hit in production.
File watching deadlocks between the main thread and the event delivery thread. Error events fire on successful responses. TCP sockets don't auto-buffer partial writes. Each documented with Bun-specific issue numbers.
Memory leaks outside the V8 heap show up too. TLS and socket buffers live outside V8's garbage collector. Without explicit release, process RSS keeps growing. AbortSignal.timeout() timers were leaking about 2.4KB of native memory per request. V8 heap snapshots show nothing abnormal. The leak is happening outside the heap.
A sync-to-async filesystem API migration caused unexpected performance degradation.
sequential for { await pathExists } loops add ~1-5ms of event-loop overhead per iteration
1–5 milliseconds per await. Multiply by the number of plugins times the number of component types, and it balloons to hundreds of milliseconds. The naive expectation that replacing sync APIs with async makes things faster falls apart inside loops.
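The anti-pattern and its usual fix look like this. The `pathExists` helper is reconstructed from the comment above; everything else is illustrative. Sequential awaits pay one event-loop round-trip per iteration, while a batched `Promise.all` pays it roughly once.

```typescript
import { access } from "node:fs/promises";

async function pathExists(p: string): Promise<boolean> {
  try { await access(p); return true; } catch { return false; }
}

// Slow: one event-loop round-trip per path, serialized.
async function existingPathsSequential(paths: string[]): Promise<string[]> {
  const found: string[] = [];
  for (const p of paths) {
    if (await pathExists(p)) found.push(p);
  }
  return found;
}

// Faster: all checks in flight at once.
async function existingPathsParallel(paths: string[]): Promise<string[]> {
  const results = await Promise.all(paths.map(pathExists));
  return paths.filter((_, i) => results[i]);
}
```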
This isn't Bun-specific. In any runtime, documented behavior and behavior under production load are different things. Runtime bugs are found not by unit tests but by long-running production sessions.
What Five Defeats Teach You

The most striking security comment was this.
ALLOWLIST, not blocklist. 5 rounds of bypass patches taught us that a value-dependent blocklist is structurally fragile
They patched a Bash special variable attack with a blocklist approach five times. Bypassed all five times. Shell scoping models are complex, and a custom parser's behavior inevitably diverges from the actual shell. A blocklist means "enumerate every attack pattern" — a battle you can't win by design.
On the sixth attempt, they switched entirely to an allowlist. Everything not explicitly permitted is rejected.
Blocklist vs. allowlist
A blocklist enumerates dangerous items and blocks them. Every new attack pattern requires an update. An allowlist is the reverse — only explicitly safe items are permitted. Anything not on the list is rejected, making it resilient to unknown attacks.
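In code, the difference is structural rather than incremental. A toy allowlist check for shell variable names, with illustrative contents; the real rules are far more involved, but the fail-closed shape is the point.

```typescript
// Allowlist: everything not explicitly permitted is rejected.
// There is no enumeration of attacks; unknown names simply fail closed.
const ALLOWED_VARS = new Set(["HOME", "PATH", "LANG", "TERM"]);

function isAllowedVar(name: string): boolean {
  return ALLOWED_VARS.has(name);
}
```

Compare the maintenance story: a new shell trick against this code requires no patch at all, because a name nobody anticipated was never allowed in the first place.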
Did it really take five defeats to change course? Probably the first bypass looked like a fluke. The second looked like bad luck. The third made them nervous. By the fifth, they admitted it was structural. Everyone knows the general principle that blocklists are fragile. Admitting it about your own code takes five rounds of evidence.
Elsewhere, a comment records randomly generated 5-character confirmation codes accidentally spelling inappropriate words.
5 random letters can spell things
A blocklist was added, along with a joke shared in an internal chat. These problems are invisible until they happen.
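The eventual mitigation is trivial to sketch, and here a blocklist is actually the right tool, since the "attack" is finite and harmless to over-block. The blocked words below are placeholders.

```typescript
// Regenerate random confirmation codes until one avoids the blocklist.
const LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const BLOCKED = new Set(["BELCH", "GROSS"]); // placeholder words

function confirmationCode(length = 5): string {
  while (true) {
    let code = "";
    for (let i = 0; i < length; i++) {
      code += LETTERS[Math.floor(Math.random() * LETTERS.length)];
    }
    if (!BLOCKED.has(code)) return code;
  }
}
```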
When the Kill Switch Breaks

Large-scale products need kill switches — ways to disable features when things go wrong. Claude Code has several. Pausing auto-updates, load-shedding background jobs, disabling auto mode.
Feature flags and kill switches
Feature flags let you toggle features on and off via server-side configuration without redeploying code. Kill switches are a specialized use — instantly disabling a specific feature during incidents. GrowthBook and LaunchDarkly are common platforms.
But kill switches themselves can break.
The feature flag cache was frozen at its process-startup snapshot. The ops team could flip a kill switch on the server, but already-running sessions never received the update.
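A sketch of the fix, with hypothetical names and no claim about the actual implementation: refresh the cache on an interval instead of freezing it at startup, and keep the last known values if a refresh fails.

```typescript
type Flags = Record<string, boolean>;

class FlagCache {
  private flags: Flags = {};
  private timer?: ReturnType<typeof setInterval>;

  constructor(
    private readonly fetchFlags: () => Promise<Flags>, // hypothetical server call
    private readonly refreshMs = 60_000,
  ) {}

  async start(): Promise<void> {
    this.flags = await this.fetchFlags();
    // Without this interval the cache is frozen at its startup snapshot,
    // and long-running sessions never see a kill switch flipped later.
    this.timer = setInterval(async () => {
      try {
        this.flags = await this.fetchFlags();
      } catch {
        // Keep the last known flags; a failed refresh is not fatal.
      }
    }, this.refreshMs);
  }

  stop(): void {
    clearInterval(this.timer);
  }

  isEnabled(flag: string): boolean {
    return this.flags[flag] ?? false;
  }
}
```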
Startup performance compounded the issue.
tengu_startup_perf p99 climbed to 76s
Startup p99 had reached 76 seconds, and over 40 plugin connectors were the cause. A kill switch that only takes effect after a restart is already shaky; when that restart takes 76 seconds, it barely works as a switch at all.
"Hit this switch and it stops" is an assumption you can't trust without testing. Scenarios where kill switches fail to take effect — long-running sessions, slow restarts, frozen caches — tend to surface only in production.
A Culture of Embedding Numbers
What stood out most in this codebase was the sheer density of concrete numbers embedded in comments.
p99 / p99.99 (percentiles)
p99 is "the 99th-slowest out of 100 cases." If p99 = 76s, 99% of requests complete within 76 seconds, but 1% are slower. p99.99 captures the top 0.01% extremes. Used to measure worst-case user experience that averages hide.
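For intuition, here is a naive nearest-rank percentile. Production monitoring uses streaming estimators over far more data, but the definition is just this.

```typescript
// Nearest-rank percentile: sort, then take the value at rank
// ceil(p/100 * n). percentile(values, 99) is "the 99th-slowest of 100".
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```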
Every magic number had its rationale beside it. One threshold had a p99.99 measurement. An output token reservation size had a p99 measurement with the note that the default over-reserved by 8–16x. Another threshold was explicitly marked as derived from a baseline of N=26.3M — 26.3 million samples.
Dated BigQuery analyses were cited in multiple places. Cache detection false positive rates, API call waste volumes, WebSocket recovery rates. Every decision backed by data.
changes the tool description on every query() call, invalidating the cache prefix and causing a 12x input token cost penalty.
Without the "12x" number, you can't grasp how severe this is. Cost optimization for LLM products is a war against exactly this kind of invisible cache invalidation.
Prompt caching
LLM APIs can cache "always-the-same" parts of a request — system prompts, tool definitions — on the server side, dramatically reducing input token costs. For the cache to hit, the beginning of the request (the prefix) must match the previous one. If something that "should be the same every time," like a tool description, changes subtly, the prefix no longer matches, and the entire cache is invalidated.
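A toy illustration of the trap, with hypothetical tool shapes: anything dynamic in a tool description changes the serialized prefix on every call, so the server-side cache never hits.

```typescript
interface ToolDef {
  name: string;
  description: string;
}

// BAD: the description embeds a timestamp, so the request prefix
// differs on every query and the prompt cache never hits.
function makeToolsDynamic(): ToolDef[] {
  return [{
    name: "read_file",
    description: `Reads a file. Current time: ${new Date().toISOString()}`,
  }];
}

// GOOD: byte-identical prefix every time, so the cache can hit.
const STATIC_TOOLS: ToolDef[] = [
  { name: "read_file", description: "Reads a file from the workspace." },
];
```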
Internal soak testing was an established practice.
Ant soak: 1,734 dedup hits in 2h, no Read error regression.
1,734 hits in a two-hour test, with error rates compared against a baseline. Not "it worked" but "confirmed no regression against baseline."
Embedding numbers in code is a gift to your future self and your colleagues. Why this value? Based on what data? Without that, six months later someone trying to change that number faces a binary choice: change it without justification, or be too afraid to touch it.
Honest Scars
What I liked most was the honesty scattered throughout the codebase.
inherently fragile — each pass can create conditions a prior pass was meant to handle
Acknowledging that a message normalization pipeline is "inherently fragile." A multi-pass transformation where each pass can break the preconditions of a prior pass.
the error message is fragile and will break if the API wording changes
A retry decision based on parsing API error message strings. The author acknowledging it will break if the API changes its wording.
IMPORTANT: Do not add any more blocks for caching or you will get a 400
A blunt warning to future developers. Not an explanation of why — just the fact that a landmine exists.
These aren't "embarrassing code." They're the mark of a culture that doesn't hide technical debt. "This is fragile." "This is a compromise." Stating it explicitly so the next person who touches this code doesn't fall into the same hole.
Striving for perfect code is right. But honestly recording imperfections helps the team more in the long run.
Three Habits Worth Adopting

After grepping through 512,000 lines of comments, here are three things I want to bring back to my own work.

Embed Numbers in Comments
Write the rationale next to every magic number. "p99 = 4,911 tokens." "N=26.3M." "BQ 2026-03-10." Dates, sample sizes, percentiles. Just having that makes it possible to judge, six months later, whether it's safe to change the value.
My own code is full of magic numbers too. Record the reasoning while you still remember why you chose a threshold. For values you picked by gut feel, write "no basis, needs measurement." Honesty is part of the number.
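As a sketch, the habit looks like this. These constants and their annotations are invented for illustration, echoing the annotation styles quoted above, and are not from the leak.

```typescript
// Every threshold carries its provenance: a measurement with a date and
// sample size, or an honest admission that it has none.
const OUTPUT_TOKEN_RESERVATION = 4_911; // p99 of observed output sizes (BQ 2026-03-10, N=26.3M)
const MAX_COMPACTION_RETRIES = 3;       // circuit breaker; see retry-storm incident notes
const SPINNER_DELAY_MS = 150;           // no basis, needs measurement
```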
Ask "Who's Waiting?" Before Every Retry

When writing retry logic, you think about how many times to retry and how long to wait. "Is the user actually waiting on this retry?" is the question that slips through if you're not deliberate about it.
If the user is sitting in front of the screen, the retry is worth it. Background summary generation or metadata updates? Nobody notices if they fail. Adding retries there just amplifies load during outages. Deciding retry limits matters. But deciding whether to retry at all matters just as much.
Make Fragility Explicit
"Inherently fragile." "Will break if the API wording changes." Being able to write that is strength.
Technical debt comments aren't embarrassing to write — not writing them is embarrassing. Whether the next person to touch this code — including yourself in three months — falls into the same hole depends on that one line. "IMPORTANT: Do not add any more blocks for caching or you will get a 400." One comment like that erases half a day of debugging.
What Comments Leave Behind
Comments in production code contain a kind of knowledge that never makes it into documentation or blog posts. The record of retries running wild. The night a kill switch didn't reach. The decision to change course after five bypasses. The rationale behind a threshold derived from tens of millions of samples.
Design documents don't preserve it. Code review comments disappear. Only comments carved into the source code preserve the judgment and context of that moment.
When building a similar LLM agent, the odds of stepping on the same landmines are high. Silent streaming deaths, retry avalanches, cache traps. The scars left by those who stepped on them first become a map.