Letting Claude Code Build the Servers

This is not a cloud story. I rent a rack in a datacenter, put physical servers in it, run VMs on top, and serve traffic from there — on-prem in the old sense. No AWS console. No aws_instance in Terraform. What shows up instead: physical hosts, hypervisors, VLANs, VRRP, firewalls, IPMI, and the reality that "if it really comes to it, someone has to drive over and stand in front of the rack." Honestly, standing in front of the rack is one of the parts of this job I least want to do.

I inherited that on-prem environment from a predecessor. There was a handover, but you can't extract years of accumulation in a few sessions. Documentation is fragmented, some of the topology diagrams are out of date, and the reasons behind certain choices live only in the previous person's head. A dozen-odd physical hosts, several dozen VMs on top, a handful of network devices. You don't get to stop production traffic.

Today, I have handed the building, the operation, and the incident investigation of this rack over to Claude Code. Discussing designs, deciding direction, writing roles, spinning up VMs, configuring the OS, opening firewall holes, wiring monitoring in, adding DNS records, shifting production traffic in stages, querying metrics across services in the middle of the night to triage a page — the agent does the actual hand-moving. I sit on the side that decides direction and approves diffs. I don't think this is the best answer. It's going to change shape many times. But I want to write down, once, the methodology that's actually running on a single inherited rack.

What happened

Result first.

In a little over a week, single-handedly, I replaced about fifteen production VMs. Roughly: 2 SLBs, 2 internal DNS, 2 VPN gateways, 1 bastion, 2 DB replicas, 3 Valkey nodes, 2 reverse proxies, 1 job server. All of them carry user-facing services. If one of them goes, something users can see goes with it. By the old reckoning, this is the kind of thing a multi-person ops team would budget months for.

SLB (Software Load Balancer)

A load balancer built from software (HAProxy, nginx, Envoy and so on) running on commodity servers, instead of a hardware appliance. Typically deployed as a pair (or more) using VRRP, where one node holds the VIP and the other takes over if it fails.

That set includes the load balancer replacement. SLB swaps used to be a project — new hardware, connectivity tests, protocol compatibility checks against everything upstream and downstream, HA failover drills, cutover rehearsals, scheduling a maintenance window. A multi-week event for several people. With the agent alongside, I went from plan to production in days. The agent enumerated the compatibility test cases, collected the result logs, and judged pass/fail. Adjusting the staged rollout weights, watching the traffic — the agent remembered the order so I didn't have to.

But you can't have the agent design the compatibility scenarios alone, or the traps you've actually stepped on disappear. This is where humans bring experience. For a reverse proxy swap, items like "the cache key composition differs per domain," "this one domain has a strange Vary header," "make sure the cache purge API doesn't put extra load on the upstream origin" — only people who got burned by them in the past think to write them down. Once you teach them to the agent, it lays them out as a verification checklist, mechanically, every time.

SLB failover drills are the same. Whether VRRP flips correctly when you down a node is the most basic check there is. But in practice you trip over the integration with the monitoring stack: the failover ran, but the alert never fired; the alert fired, but it never reached the Discord channel; the notification arrived, but the recovery notification never came. So the failover-drill checklist always includes "confirm that the notification actually arrived" and "confirm that the recovery notification arrived after recovery." Agents tend to think failover drills are about LB-side behavior. You have to spell it out.

So the division comes out to: the human brings the perspectives into the verification scenarios; the agent does the execution and the summarization of results. The perspective side grows incrementally, one incident at a time. That's part of the "fold it back into knowledge" loop I'll get to later.

It wasn't that I suddenly got more competent. I handed building, operations, and investigation to Claude Code, and concentrated on judgment and approval.

What "letting it build the servers" actually means

Less abstract, more concrete. When I want to add a single VM, here's what the agent does in order:

Calculate capacity (which physical host has headroom in CPU, RAM, storage)
Call the hypervisor CLI to spin up a VM from a template
Pick a hostname, IP, VLAN (cross-checked against the host inventory to avoid collisions)
Attach NICs
Run the OS bootstrap Ansible role (packages, users, SSH, time sync, log forwarding…)
Write and apply the firewall rules
Add the record to internal DNS, update the zone file, distribute it to both DNS servers
Register the host with monitoring (install the exporter, add it to VictoriaMetrics scrape config, add a Grafana panel)
Update the host inventory, the per-host page, the cross-service pages, the topology file, and the Ansible inventory in one shot
Bundle all of the above into one commit and push to Git

What I do is tell it "I want a VM that does X" at the start, and review the diff and approve at a couple of points along the way. My hands don't move. There's nothing for a typo to land in.
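The collision check in the "pick a hostname, IP, VLAN" step can be sketched roughly as follows. This is an illustration, not the actual skill: the inventory shape (a mapping of hostname to IP string), the function name `pick_address`, and the example addresses are all assumptions.

```python
import ipaddress


def pick_address(hostname, subnet, inventory, reserved=1):
    """Return the first free IPv4 address in `subnet` for a new host.

    `inventory` maps existing hostnames to IP strings (a hypothetical shape;
    the real host inventory format is not shown in the article).
    `reserved` skips that many addresses at the start of the usable range,
    for example the gateway.
    """
    if hostname in inventory:
        raise ValueError(f"hostname already in inventory: {hostname}")

    used = {ipaddress.ip_address(ip) for ip in inventory.values()}
    usable = list(ipaddress.ip_network(subnet).hosts())
    for candidate in usable[reserved:]:
        if candidate not in used:
            return str(candidate)
    raise RuntimeError(f"no free address left in {subnet}")


# Example data, made up for the sketch.
inventory = {"dns01": "10.0.10.2", "dns02": "10.0.10.3"}
print(pick_address("cache03", "10.0.10.0/29", inventory))  # prints 10.0.10.4
```

The point of the sketch is that the check reads from the one inventory rather than from whatever the agent remembers, so a hostname or address that already exists is refused before any VM is created.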

For service swaps it gets bigger. The agent spins up the new host with the steps above, shifts production traffic in stages (5% → 20% → 50% → 100%), checks metrics at each stage, advances if things look fine, and finally stops the old host and reclaims its resources. The agent remembers the order. Even if I step away in the middle, when I come back I can ask "what's next?" and get the right next step.
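The staged shift can be pictured as a small loop. The sketch below is a simplification under stated assumptions: `set_weight`, `is_healthy` and `approve` are hypothetical callables supplied by the caller, and rolling back to 0% on a bad signal is one possible policy, not necessarily the one used here.

```python
STAGES = (5, 20, 50, 100)  # percent of traffic on the new host, as in the text


def staged_cutover(set_weight, is_healthy, approve, stages=STAGES):
    """Walk the stages in order and return the percentage finally served.

    Hypothetical hooks supplied by the caller:
      set_weight(percent) -> applies the traffic split
      is_healthy(percent) -> True if metrics look fine at this stage
      approve(percent)    -> True if the human approves moving to this stage
    """
    current = 0
    for percent in stages:
        if not approve(percent):
            return current      # human said stop: hold at the last good stage
        set_weight(percent)
        if not is_healthy(percent):
            set_weight(0)       # metrics look wrong: send everything back
            return 0
        current = percent
    return current
```

Because the order lives in the loop rather than in anyone's memory, stopping halfway and resuming later only requires knowing the last value returned.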

That's how about fifteen boxes got swapped in just over a week.

The scaffolding that makes it work

Naturally, if you ask Claude Code to "build a server" with no preparation, things go wrong fast. The more capable the agent, the more the track has to be paved before you turn it loose.

The scaffolding splits into two simple halves. How do you record the work after it lands — IaC, single repo, skills, document structure — that's one. How do you design the path the agent takes to reach production — guardrails, VPN, an SSH wrapper, credential management — that's the other. Without the first you lose reproducibility. Without the second you have accidents. You need both before you can hand the construction over.

Six pieces, in turn.

1. Lift everything into code

Ansible's --check mode

A dry-run mode that doesn't actually apply the playbook; it only reports what would change (adding --diff shows the line-level differences). You can compare the host's current state against the goal state encoded in the role. If the run reports no changes, the role fully describes reality, at least for everything the role manages.

The inherited environment started in a "running but not in code form" state. The first job was to observe the current state, transcribe it into Ansible roles, and reconcile with --check until the diff was zero.

Skip this and the rest collapses. If even one host can't be described in code, automation starts routing around it, and conditional branches for the special case start piling up in your roles. Ask the agent to "apply X to all hosts" and it'll halt at the special case. So the exceptions get resolved first.

From there, exactly one rule: all changes go through a role or playbook. If the agent is the one constructing things, the agent's code becomes the record. The moment you leave room for hand-edits, the agent's "model of reality" starts to drift from reality.

2. Put everything in one Git repository

Documentation (Markdown), config files, Ansible roles and playbooks, inventory, skill definitions, rule definitions — all of it goes into a single Git repository.

Split them up and the generations drift. You end up with the classic three-layer arrangement where the docs reflect last year, Ansible reflects last month, and the actual hosts reflect today. Put it all in one repo and a single commit moves "the fact," "the explanation," and "the way to apply it" together. The commit that adds one VM contains the new line in the host inventory, the new file under the per-host directory, the participation note in the cross-service page, the Ansible inventory diff, and the relevant plan-document update. All in one place.

Generation management comes for free. You don't need a separate system to track who changed what and why. git log and git blame do all of it. Ask the agent "why is this host configured this way?" and it walks back through the relevant commits and reconstructs the reasoning.

There's one practical constraint. Infrastructure repos by their nature are awkward to put on external hosting. GitHub and friends are off the table. So the Git remote lives inside the perimeter. PR review and CI as conveniences are gone. In their place: Ansible's --check substitutes for CI, the agent reviews diffs, and a human makes the final call. It's inconvenient. But as long as Git's own generation tracking is in your hands, the ability to ask "why is it like this" is preserved.

3. Set up guardrails (rules) before anything else

Claude Code has a mechanism that auto-loads rule definitions when certain paths are touched. For the whole repo, I keep one sheet of more fundamental principles: information-gathering uses read-only commands only, configuration changes are forbidden until explicitly authorized, output is always size-capped, and so on.

The allowlist framing is what works. "Don't allow X" leaks. "Don't allow anything except what's allowed" doesn't. You can ban rm and find -delete is still there. Walling things off in advance is faster than chasing every hole.
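The difference between the two framings is easy to show in a few lines. The rules themselves are prose the agent reads, not code, so the sketch below only illustrates the logic; the command lists are made-up examples.

```python
import shlex

BLOCKLIST = {"rm"}                                    # "don't allow X"
ALLOWLIST = {"cat", "ls", "ss", "journalctl", "df"}   # "nothing except these"
SHELL_META = set(";|&$`<>()")                         # chaining, pipes, redirects


def blocklist_permits(command):
    """Permit anything whose first word is not explicitly banned."""
    argv = shlex.split(command)
    return bool(argv) and argv[0] not in BLOCKLIST


def allowlist_permits(command):
    """Permit only single, known read-only commands."""
    if SHELL_META & set(command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:          # unbalanced quotes and similar
        return False
    return bool(argv) and argv[0] in ALLOWLIST


leak = "find /var/log -name '*.gz' -delete"
print(blocklist_permits(leak), allowlist_permits(leak))  # prints True False
```

The blocklist lets `find -delete` through because nobody thought to ban it; the allowlist refuses it without ever having heard of it.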

Once rules slot in automatically, the agent's blast radius gets physically narrower. You don't have to write them into every prompt, and there's nothing to forget.

Rules define "the permitted world" in allowlist form. But rules ultimately depend on the agent reading them and choosing to comply, which makes them — in a sense — a request. So I add a layer with actual force: hooks. Claude Code lets you wedge a script into the front and back of any tool call. The hook side is blocklist-shaped: it matches strings against patterns of "this, never, no matter what" and physically stops the call. Whatever the agent decides, it can't get past the hook.

Rules (allowlist, self-discipline) and hooks (blocklist, physical block) form a two-layer defense. That's the basic shape.

I run three hooks in practice.

Block reads of environment files. Read or Bash calls that touch .env or anything under secrets/ get matched on path and stopped. Even if the agent decides "I want to read this," the call dies before the tool fires.
Block destructive Git operations. git reset --hard, git push --force, git clean -fd — anything that rewinds or deletes history. Even if the agent thinks it's recovering, that work happens by hand.
Block destructive remote operations. When something tries to push rm -rf, systemctl stop, or reboot to a target host over SSH, the hook parses the command string and refuses. Because everything goes through the SSH wrapper, the hook can see what's actually being run.

Rules are "self-restraint on the reading side." Hooks are "physical interruption on the execution side." Defending the same thing twice looks redundant, but agents occasionally forget rules — or interpret them creatively — and the hooks let me hand work over without dread. Whatever you absolutely don't want touched, stop it physically with a hook.
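A hook covering the three cases above might look like the sketch below. It rests on assumptions that should be checked against the current Claude Code hooks documentation: that a pre-tool hook receives the call as JSON on stdin with `tool_name` and `tool_input` fields, that Bash calls carry `command` and Read calls carry `file_path`, and that exit status 2 blocks the call. The wrapper name `hostrun` and the patterns are hypothetical.

```python
import json
import re
import sys

SECRET_PATH = re.compile(r"(?<![\w.-])\.env\b|(?<![\w-])secrets/")
DESTRUCTIVE_GIT = re.compile(
    r"\bgit\s+(?:reset\s+--hard|push\b.*--force|clean\s+-\w*f)"
)
DESTRUCTIVE_REMOTE = re.compile(
    r"\b(?:rm\s+-\w*[rf]\w*|systemctl\s+stop|reboot)\b"
)
REMOTE_LAUNCHERS = {"ssh", "hostrun"}  # "hostrun" is a made-up wrapper name


def decide(payload):
    """Return a reason string if the call must be blocked, else None."""
    tool = payload.get("tool_name", "")
    tool_input = payload.get("tool_input", {})
    if tool == "Read":
        if SECRET_PATH.search(tool_input.get("file_path", "")):
            return "reading environment or secret files is blocked"
    elif tool == "Bash":
        command = tool_input.get("command", "")
        if SECRET_PATH.search(command):
            return "commands touching environment or secret files are blocked"
        if DESTRUCTIVE_GIT.search(command):
            return "destructive Git operations are done by hand"
        words = command.split()
        if words and words[0] in REMOTE_LAUNCHERS and DESTRUCTIVE_REMOTE.search(command):
            return "destructive remote operations are blocked"
    return None


def main(stdin=sys.stdin, stderr=sys.stderr):
    """Entry point for a hook script; call as sys.exit(main())."""
    reason = decide(json.load(stdin))
    if reason:
        print(reason, file=stderr)
        return 2  # assumed convention: exit status 2 blocks the tool call
    return 0
```

String matching of this kind is deliberately crude. It is the backstop behind the rules, not a parser, so a false positive that sends a harmless command back to the human is an acceptable cost.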

4. Turn procedures into skills

Skills and subagents in Claude Code

Similar concepts, different jobs. Skills are closer to runbooks. For a specific operation — "create a VM," "deploy DNS" — you write the pre-checks, the steps, and the post-verification into one file, and invoke it as /skill-name. It's the mechanism for "always do the same thing in the same order." Subagents are closer to specialists. A read-only investigator, an incident triage agent, a capacity calculator — split by role, each with only the permissions and prompt they need. Skills fix how. Subagents fix perspective and authority.

Anything I do repeatedly got pinned down as a skill. Create a VM, deploy DNS, push monitoring config, deploy the bot, run an Ansible playbook. Each skill spells out the pre-checks, the steps, and the post-verification in a fixed order.

Inside every skill, I bake in a five-stage shape: dry-run → present diff → human approval → real run → verify. Agents don't confirm unless you've designed them to confirm. Building the confirmation into the process is on me, not on the agent.

There's always one human approval point. However confidently the agent talks, the run button is mine. Conversely, everything before and after — the boring setup, the boring teardown — can be handed off. Three months from now, the same skill produces the same result.
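The five-stage shape reduces to a short piece of control flow. Real skills are instruction files rather than Python, so this is only a sketch of the ordering; the four callables are hypothetical stand-ins for whatever a concrete skill does.

```python
def run_skill(dry_run, approve, apply, verify):
    """dry-run -> present diff -> human approval -> real run -> verify.

    Hypothetical callables:
      dry_run() -> str       the diff that would be applied ("" means no change)
      approve(diff) -> bool  shows the diff to a human and returns the decision
      apply() -> None        the real run
      verify() -> bool       post-verification
    """
    diff = dry_run()
    if not diff:
        return "no-change"
    if not approve(diff):      # the single human approval point
        return "rejected"
    apply()
    return "verified" if verify() else "verify-failed"
```

There is no code path that reaches `apply()` without passing through `approve()`, which is the property the skill design is meant to guarantee.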

5. Shape the document layout so the agent doesn't get lost

I redesigned the directory layout itself as "a map the agent won't get lost in."

inventory/   host list, network, storage, rack diagram (single source of truth)
servers/     per-host pages (one host per directory; parent/child topology in one file)
services/    cross-service info (HA, VIPs, routing, DB, cache)
monitoring/  monitoring design and setup procedures
ansible/     inventory / playbooks / roles
docs/plans/  in-progress and unstarted plans
docs/task/   in-progress tasks (with remaining-work checklists)
docs/done/   completed plans

The rules are simple. Facts about hosts live only in inventory. The per-host pages don't copy values from there — they reference. Plans get written into plans/, move to task/ once underway, and move to done/ when finished. The Markdown physically moves.
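The "Markdown physically moves" rule is simple enough to sketch. The directory names come from the layout above; the function name and the refusal to overwrite an existing file are assumptions, and inside a Git repository the move would more likely be a `git mv`.

```python
from pathlib import Path

LIFECYCLE = ("plans", "task", "done")  # directories under docs/


def advance(plan, docs_root):
    """Move a plan file to the next lifecycle directory and return its new path."""
    plan = Path(plan)
    stage = plan.parent.name
    if stage not in LIFECYCLE:
        raise ValueError(f"{plan} is not under plans/, task/ or done/")
    if stage == LIFECYCLE[-1]:
        raise ValueError(f"{plan.name} is already done")

    target_dir = Path(docs_root) / LIFECYCLE[LIFECYCLE.index(stage) + 1]
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / plan.name
    if target.exists():
        raise FileExistsError(target)
    return plan.rename(target)
```

Since a plan lives in exactly one directory at a time, listing `docs/task/` is enough to answer "what is in progress right now".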

The benefit shows up for the agent before it shows up for me. Ask "I want to add a VM," and the agent goes straight to inventory, adds the per-host page, writes the participation note in the cross-service file, and regenerates the Ansible inventory. The work is canonicalized and the references are unique. Because it's canonicalized, it can become a skill.

6. Abstract the access path one layer (VPN / SSH / credentials)

The path the agent takes to reach a production server has three stages. WireGuard for network reachability, an SSH wrapper for invoking the host, and credentials known only inside the wrapper. I refer to those three together as "the access path."

One design principle. Abstract the world the agent sees by one layer. No raw IPs, no raw keys, no raw passwords, no raw VPN config files. What the agent sees is "a hostname" and "an intent."

Concretely:

Wall the VPN connection off as a precondition. The wrapper checks reachability on startup and tells you up front if you're not connected. The agent never has to guess why it can't reach something.
Concentrate connection logic in one SSH wrapper. Per-host, it figures out whether you need a key, a password, or sudo. From the agent's perspective, every host is "called the same way."
Credential files live outside Git. The agent's rules also say "don't read this path." The Ansible side uses Ansible Vault, decrypted only at playbook runtime. What the agent gets handed is the variable name, not the value.

This abstraction matters because if the agent goes off the rails, the damage doesn't spread. The most the agent can do is "send a command to a host through the wrapper," and that command is read-only by default (constrained on the guardrail side). It has no way to drop the VPN, no path to wander into another network. The access path itself is fused with the guardrails.
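The "hostname and intent" interface can be sketched as a lookup plus an argv builder. Everything specific here is hypothetical: the registry contents, the key paths, the addresses and the function names. The `ssh` options (`-i`, `-o BatchMode=yes`) and `sudo -n` are real.

```python
import socket

# Hypothetical registry: hostname -> how to connect. The key paths are
# placeholders; real credentials stay outside Git and outside the agent's view.
HOSTS = {
    "dns01": {"address": "10.0.10.2", "user": "ops",
              "key": "/etc/hostrun/keys/dns01", "sudo": False},
    "db01": {"address": "10.0.20.5", "user": "ops",
             "key": "/etc/hostrun/keys/db01", "sudo": True},
}


def vpn_reachable(address, port=22, timeout=2.0):
    """Precondition check: can a TCP connection be opened through the tunnel?"""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_ssh_argv(hostname, command, hosts=HOSTS):
    """Translate (hostname, command) into the ssh argv the wrapper would run.

    The caller only ever supplies a hostname and a command; address, user,
    key and sudo handling are decided here.
    """
    if hostname not in hosts:
        raise KeyError(f"unknown host: {hostname}")
    profile = hosts[hostname]
    remote = f"sudo -n {command}" if profile["sudo"] else command
    return [
        "ssh", "-i", profile["key"], "-o", "BatchMode=yes",
        f"{profile['user']}@{profile['address']}", remote,
    ]
```

A wrapper built this way would call `vpn_reachable` first and fail with a clear "not connected" message, then execute the argv itself, so the agent never sees the address or the key path.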

Guardrails constrain "what you mustn't do." The access path constrains "where you can reach." Credential management constrains "what you can see." The three aren't independent pieces of scaffolding — they fit together into one design called "the path that gets through."

What the human is doing

If the agent is constructing things, what am I doing? Here's the actual list.

Decide direction. Big-shape decisions: "reorganize the cache layer from three nodes to two and split it by purpose."
Write the plan document (with the agent). Markdown into docs/plans/. Phases, rollback steps, blast radius, verification items. The agent drafts; I revise.
Read the diff and approve. Read the dry-run diff the skill presents. If it looks fine, approve. If it doesn't, kick it back.
Judge. "Is it safe to shift traffic right now?" "Is this alert real or noise?" The agent gathers the data; the human decides.
Record the why. Write the reason for the decision into the commit message or the plan document, so future me — or the agent — understands when reading it back.

The closest description is "the typing went away." The time spent opening vim, the time spent remembering command flags, the time spent flipping between terminal windows — all close to zero. What replaced it: time spent reading diffs that show up on screen, and time spent deciding.

The incident-response loop

3 a.m., paged awake. Open the laptop with sleepy eyes, start figuring out what's going on — even here, the agent helps.

The actual flow:

The alert lands in the notification channel
Start the triage agent, paste the alert body
The agent queries the metrics server directly and lays out the related metrics across services
If needed, it SSHs into the target host read-only and runs journalctl or ss
It returns "these are the abnormal values, and there are three candidate causes: A, B, C"
The human looks at the candidates and picks which one to dig
Once the cause is identified, remediation goes through a skill (with a human approval gate)

The triage agent talks to the metrics server through MCP directly. Faster than ssh-ing in and curling, and the data comes back structured, which lifts the agent's reasoning quality. The priority is explicit: structured APIs first, command execution over SSH only as a last resort.

IPMI and SEL

IPMI (Intelligent Platform Management Interface) is the protocol for talking to the BMC (Baseboard Management Controller) on a server's motherboard. It works even when the OS isn't responding — power control, sensor values, hardware events. SEL (System Event Log) is the BMC's record of hardware-side events: memory errors, power anomalies, thermal issues.

For when the OS stops responding, IPMI is wired up so the agent can use it the same way. Power state, SEL, sensor values, fan RPMs, temperatures — a separate channel of information from anything you can see through the OS. For a host that doesn't answer SSH, the agent first asks IPMI "is power on" and "are there hardware-origin events," and triages OS hang versus hardware failure. Once that's lined up automatically, the human can decide on the spot whether someone has to drive over.
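The "OS hang versus hardware failure" split can be sketched as a small classifier over two pieces of IPMI output. The `ipmitool` options and subcommands used (`-I lanplus`, `-H`, `-U`, `-E`, `chassis power status`, `sel list`) are real; the keyword list and the three labels are illustrative assumptions, and a real SEL needs more careful reading than substring matching.

```python
# Illustrative keywords only; real SEL entries vary by vendor.
HARDWARE_HINTS = ("memory", "ecc", "power supply", "temperature", "voltage")


def ipmi_argv(bmc_host, user, *subcommand):
    """argv for ipmitool over LAN; -E reads the password from IPMI_PASSWORD."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
            *subcommand]


def classify(power_status, sel_lines):
    """Triage a host that does not answer SSH.

    power_status: output of `chassis power status`, e.g. "Chassis Power is on"
    sel_lines:    lines from `sel list`
    """
    power_on = power_status.strip().lower().endswith("is on")
    hardware_events = [
        line for line in sel_lines
        if any(hint in line.lower() for hint in HARDWARE_HINTS)
    ]
    if not power_on:
        return "powered-off"
    if hardware_events:
        return "suspect-hardware"
    return "suspect-os-hang"
```

The two queries would be `ipmi_argv(bmc, user, "chassis", "power", "status")` and `ipmi_argv(bmc, user, "sel", "list")`. The label is a starting point for the human's decision about driving over, not the decision itself.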

The time spent assembling PromQL with a sleepy head, and the time spent logging into the IPMI console to check power state, both go away. You enter at "pick from the candidates."

ChatOps Wired into VictoriaMetrics

VictoriaMetrics

An open-source time-series database for metrics, Prometheus-compatible. Used pull-style here: it scrapes the same exporters Prometheus would, on the same kind of schedule. Its query language (MetricsQL) is backward-compatible with PromQL. It is generally more storage-efficient than Prometheus, especially for long retention. Alert evaluation is handled by vmalert.

The other thing that changed the on-call experience is wiring ChatOps directly into VictoriaMetrics. Concretely, there's a Bot living in Discord. It's not a fixed-command bot — the Bot itself is an AI agent, with VictoriaMetrics as a tool.

PromQL instant queries, range queries, and a target list with up/down state: those three are exposed as tools. When a user talks to the Bot in plain language on Discord, it composes the right PromQL itself, sends it, and summarizes the result. For questions like "what's the up state for the cache layer?" or "what's the CPU on a particular host over the last hour?", the Bot picks the tool and answers.
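Underneath tools like these sits the Prometheus-compatible HTTP API that VictoriaMetrics serves (`/api/v1/query` and `/api/v1/query_range`). The sketch below is not the Bot's implementation: the base URL is a made-up address, the function names are hypothetical, and the up/down view is derived here from the standard `up` metric rather than from a dedicated targets endpoint.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://victoriametrics.internal:8428"  # hypothetical address


def query_url(expr, base_url=BASE_URL):
    """URL for an instant query."""
    params = urllib.parse.urlencode({"query": expr})
    return f"{base_url}/api/v1/query?{params}"


def range_url(expr, start, end, step, base_url=BASE_URL):
    """URL for a range query; start and end are Unix timestamps."""
    params = urllib.parse.urlencode(
        {"query": expr, "start": start, "end": end, "step": step}
    )
    return f"{base_url}/api/v1/query_range?{params}"


def fetch(url, timeout=10):
    """GET a read-only API URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def up_down(response):
    """Split the result of an `up` instant query into up and down instances."""
    state = {"up": [], "down": []}
    for series in response["data"]["result"]:
        instance = series["metric"].get("instance", "?")
        _, value = series["value"]  # [timestamp, "1"] or [timestamp, "0"]
        state["up" if value == "1" else "down"].append(instance)
    return state
```

Every function here only reads, which matches the constraint that anything chat-driven stays read-only.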

Claude Code sees the same VictoriaMetrics over MCP. So the Discord Bot and Claude Code share the same data source, just with two entrances. At my desk I go through Claude Code; out somewhere with only a phone, I go through Discord. Same questions to the same VictoriaMetrics.

A side effect: the response history naturally lives in the channel log. Bot exchanges, Alertmanager notifications, human judgments — all in one thread. When I sit down to write a postmortem, what we looked at and what we decided is right there to walk through.

There's a caveat. The tools the Bot is given are read-only. No restart, no config change. Chat-driven operations are powerful, and I don't want to add a path where a stray message stops production. Anything that writes goes through a skill, with a human approval gate.

Holes I actually stepped in

If I only wrote down the things that worked, this would be a lie. I'll keep the configs vague, but here are a few of the shapes of holes I actually stepped in. If this spares you from stepping in a hole of the same shape, that's the most useful thing in this article.

The premise of which template to use wasn't shared. The base template for VM creation existed in more than one place, the agent picked the wrong one, and the VM didn't boot. There was room for the same accident to keep happening until I stated "the canonical template is this one" in both the documentation and the skill.
Missed the priority math on a redundancy setup at design time. For a new-vs-old HA cutover, I miscalculated the effective priority when the health check fails, and almost shipped a design where the new side couldn't take over. Caught it in pre-deployment review. Neither I nor the agent spotted it during initial design. Now any skill that touches a redundancy setup has "verify the priority math" as a mandatory pre-check.
Cut over with a gap in assumptions between the old and new components still in place. The old and new sides had different connection prerequisites, and the migration procedure was missing a step at the start that listed those differences. After cutover, the client side cascaded into failures. From then on, every migration procedure starts with "enumerate the deltas between old and new."
The "make the old side unreachable" step was missing. After cutover to the new cluster, the auto-failover machinery rediscovered the old side and pulled it back. There was room for the same shape of mistake to recur until every cutover procedure explicitly said "make the old side unreachable on the network."
An option meant to be safer blocked the legitimate behavior. An option I added to suppress risk also blocked a normal-path action as a side effect. I left a note: "don't add 'safe' options unconditionally — verify the necessity and the side effects first."
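The "verify the priority math" pre-check from the second item can be made concrete. The sketch assumes a keepalived-style rule, which the article does not confirm is the stack in use: a negative check weight is added to the base priority while the check fails, and a positive weight is added while it succeeds. The numbers in the note below are invented.

```python
def effective_priority(base, check_weight, check_ok):
    """Priority a node advertises, given its health-check result.

    Assumed rule (keepalived-style): a negative weight lowers the priority
    while the check fails; a positive weight raises it while the check passes.
    """
    if check_weight < 0:
        return base + (0 if check_ok else check_weight)
    return base + (check_weight if check_ok else 0)


def can_take_over(master, backup):
    """True if a healthy backup outranks the master once the master's check fails."""
    master_failed = effective_priority(master["base"], master["weight"], check_ok=False)
    backup_healthy = effective_priority(backup["base"], backup["weight"], check_ok=True)
    return backup_healthy > master_failed
```

With a master at base 150 and a weight of -20, a failed check leaves it at 130, still above a backup at base 100, so the VIP never moves. A weight of -60 drops the failed master to 90 and the backup wins. Running `can_take_over` over the planned values before deployment is the kind of check that catches this at design time.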

Lined up like this, none of these are "the agent went rogue" stories. The agent did exactly what it was told. The hole was in the instructions, or in the missing background. So the response isn't "constrain the agent harder" — it's "add more places to record the premises and the procedure."

Folding what happened back into knowledge

And this is where the operation compounds the most, I think.

When something goes wrong, the cause becomes clear. The usual move is to write a postmortem and stop there. But then whether the same hole gets stepped in again depends on whether the next on-call remembers the postmortem. They won't. People forget.

So whatever happened gets folded back into something. Four destinations.

| What happened | Folded into | Effect |
| --- | --- | --- |
| Procedural slip / omission | Skill | Inserted automatically next time |
| Reason / background of a design | Documentation (`docs/`) | "Why this is like this" stays |
| A triage angle | Subagent definition | Next incident, the agent notices on its own |
| Common ban / precondition for all agents | `CLAUDE.md` | Effective across the whole repo |

For example: an incident where the deploy skipped --check and an unintended diff went out. The next move is to add "presenting a --check diff is mandatory" to the top of that skill. From then on, everyone who invokes that skill — human or agent — has to look at the diff. The same accident doesn't happen twice.

Triage angles are the same. "When the cache layer is acting up, look at the upstream LB connection count first" — once that rule of thumb is learned, write it into the triage subagent. The next overnight page, the agent lays things out from that angle without being asked.

CLAUDE.md holds the preconditions that should apply across the whole repository. "Unreachable without VPN," "don't read secrets/," "no configuration changes without explicit instruction" — the things every agent should read on startup. Add a line, and from that moment on, the behavior of every agent in the repo changes.

Once I started doing this, the quality of operations stepped up, perceptibly. Comparing the first few days with the days after, it feels like a different agent. The agent didn't suddenly get smarter; the world it consults got richer.

What's effective is that the knowledge accumulates in "files an agent can read" rather than "a person's head." What gets lost in handovers is mostly the tacit knowledge in someone's head. If it's in files from the start, the handover is almost unnecessary.

Why I didn't pick a general-purpose autonomous agent

Anyone reading this far might have a question. General-purpose autonomous agents like OpenClaw — LLM as the brain, taking instructions in plain language through messaging apps, executing commands against the local machine and external tools on their own — that direction is spreading fast right now. People call it "the closest thing to JARVIS," and the GitHub stars exploded in days. If telling something in plain language is enough to get the work done, why bother designing skills and subagents in Claude Code?

I tried it. It's nice. For organizing email on your laptop, juggling your calendar, sorting files — it's genuinely powerful.

But for handling a fleet of production servers, I haven't adopted it. The reason is simple: autonomy and the radius of misfire are two sides of the same thing. A general-purpose autonomous agent is designed with a wide outline of "what it may do," and it acts on its own judgment when no one is watching. That property is the source of usefulness for personal local work. It's the source of risk for production infrastructure.

Security researchers describe a shape they call the "lethal trifecta": access to private data, exposure to untrusted content, and the ability to communicate externally. An agent with all three can be steered by prompt injection into leaking what it can see, and one that also holds authority to act can be steered into acting. Cisco's AI security team has reported actual cases of data exfiltration and prompt injection happening through third-party skills. An agent holding SSH keys to production starting to type rm because of a crafted string in a log it happened to read: that's not a metaphor. It's a real threat.

So I designed in the direction of intentionally lowering autonomy.

| | General-purpose autonomous agent | Approach in this article |
| --- | --- | --- |
| Scope of delegation | Whatever the user expresses | Inside pre-defined skills |
| Decision to execute | Agent decides on its own | Always passes through human approval |
| What it can touch | Whatever tools it's connected to | Only the wrapper-abstracted path |
| Credentials | Referenced as needed | Physically hidden from the agent |
| Blast radius | As wide as the design allows | Up to a single skill's dry-run |

This isn't "Claude Code is better and OpenClaw is worse." The two have fundamentally different design philosophies. A general-purpose autonomous agent is valuable because it does the things you didn't explicitly write down. Production infrastructure is valuable because it never does the things you didn't explicitly write down. The directions of value are opposite.

If I had to write the boundary in one line: for reversible work, you can raise autonomy. For irreversible work, you lower it. Sorting personal files: if you mess it up, you restore from backup. Changing a production DB setting: if you mess it up, the service stops. The same "convenience" carries a different order of magnitude in the price of getting it wrong.

That may change. Once general-purpose autonomous agents acquire prompt-injection resistance, ship undo as a standard property, and let you formally bound their blast radius, they'll work their way into production operations too. Right now they're not there yet. So for now, I keep autonomy low and grow what I can do, slowly.

One more I held off on: HolmesGPT

The other one I'd been watching is HolmesGPT. It's a CNCF Sandbox project, primarily built by Robusta.Dev: an open-source SRE agent specialized in investigating production incidents and finding root causes. It picks up alerts, starts investigating on its own, and writes the result back to Slack. Read-only by default, RBAC-aware, designed to be put into production. For someone who's hand-rolled a "triage agent" out of skills and subagents, it looks exactly like an off-the-shelf tool solving the same problem.

I haven't adopted it yet. Two reasons.

The assumed environment skews Kubernetes / cloud. The documentation and the built-in data sources are organized around Kubernetes, cloud providers, and SaaS. On-prem physical hosts and a custom virtualization stack aren't first-class citizens. Not impossible to run, but I couldn't tell from a quick try whether it would slot into my environment without friction.
How far it can cover hardware faults is unverified for me. The IPMI-based investigation I described earlier — "the OS isn't responding, look at SEL to confirm a power anomaly, decide whether someone needs to drive over" — how HolmesGPT would behave inside that loop wasn't something I could tell from reading the docs. In cloud you don't think about IPMI to begin with, so this is a verification item particular to on-prem.

In other words it's not "decided not to adopt" — it's "haven't yet verified it runs safely in my environment." The design philosophy is close, it's growing openly as a CNCF Sandbox project, and it's worth verifying. For now my own skills + subagents are enough, so I've deprioritized it. But as the scope of hardware-side responsibilities clears up, replacing my homegrown triage agent with HolmesGPT is a perfectly plausible future.

Lined up, the selection criterion becomes visible. OpenClaw I didn't adopt because the design philosophy (the direction of autonomy) is different. HolmesGPT I'm holding because the design philosophy is close, but the behavior in my environment (on-prem, the physical layer) is unverified. The first is a question of principle. The second is a question of evidence. The boundary line is in the same place: "before this goes into production, can I myself explain its blast radius?"

What's left is judgment

The inherited rack changed a lot in a short time. The old boxes got tidied, monitoring went up, internal name resolution got self-contained, the VPN became redundant, the load balancer got replaced. Most of that, Claude Code did with its hands.

What changed is that the time I spent on "reading," "writing," and "typing" shrank dramatically. What grew in its place is the time I spent on "is this actually worth doing," "what could break from this change," and "what should be prioritized right now."

The agent took the work, not the judgment. If anything, the weight of judgment went up. The agent assembles the information needed to decide so quickly that all that's left for me to do is decide.

I don't think this is the best way. A little ahead of now, I'll have rewritten about half of what's here into some other form. Even so, if the scaffolding running today helps anyone else, that's enough.
