Read the Metrics
Server metrics run deep.
One project. The API would slow down intermittently. Not random. Not correlated with load. A fixed cycle: latency would spike for a few seconds, then vanish.
Long hours of investigation. Application layer, clean. Network, clean. Where the trail finally led was MySQL's binlog: the latency cycle matched the binlog rotation timing exactly.
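The clinching check was mundane: line the spike timestamps up against the binlog files themselves. A minimal sketch of that kind of correlation, with hypothetical paths and timestamps standing in for the real production logs:

```python
# Sketch only: correlate latency-spike times with binlog rotation times.
# BINLOG_DIR and SPIKE_TIMES are illustrative, not from the incident.
import os
from datetime import datetime, timedelta

BINLOG_DIR = "/var/lib/mysql"              # hypothetical MySQL data dir
SPIKE_TIMES = [                            # spike onsets pulled from app logs
    datetime(2024, 3, 14, 2, 11, 30),
    datetime(2024, 3, 14, 3, 41, 12),
]

# A binlog file's mtime approximates the moment rotation moved past it.
rotations = sorted(
    datetime.fromtimestamp(os.path.getmtime(os.path.join(BINLOG_DIR, name)))
    for name in os.listdir(BINLOG_DIR)
    if name.startswith("mysql-bin.") and not name.endswith(".index")
)

# Flag any spike that begins within a minute of a rotation.
for spike in SPIKE_TIMES:
    for rotation in rotations:
        if abs(spike - rotation) <= timedelta(minutes=1):
            print(f"spike {spike} ~ rotation {rotation}")
```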
From there it was fast. The cause was a filesystem stall on ext3 during binlog deletion. When ext3 unlinks a large file, freeing its blocks runs inside a journal transaction, and other I/O can block behind that transaction. So every time MySQL expired an old binlog, the filesystem was locking up for seconds at a time while production traffic was being served.
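The fix isn't the point here, but one classic workaround for this class of stall is worth sketching: shrink the file with repeated small truncates before unlinking it, so no single journal transaction has to free all the blocks at once. A hedged sketch of that idea, not a record of what actually shipped; the path and chunk size are illustrative:

```python
# Sketch only: delete a large file on ext3 without one huge journal
# transaction, by freeing its blocks a chunk at a time.
import os

CHUNK = 256 * 1024 * 1024  # free 256 MiB of blocks per truncate

def slow_unlink(path: str) -> None:
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        while size > CHUNK:
            size -= CHUNK
            f.truncate(size)        # each truncate is a small transaction
            f.flush()
            os.fsync(f.fileno())    # let the journal commit before the next
    os.unlink(path)                 # final unlink frees at most one chunk

slow_unlink("/var/lib/mysql/mysql-bin.000042")  # hypothetical binlog file
```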
The fix landed. But a thought lingers. What if this had been ECS/Fargate with Aurora? Aurora's storage layer is its own distributed, log-structured service, so this particular problem might never surface. But what about a different problem, something hidden inside the managed service itself? Could you trace your way from CloudWatch metrics down to a filesystem freeze?
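For comparison, this is roughly where the managed side bottoms out. A sketch using boto3 against a hypothetical Aurora instance (the region and identifier are made up): CloudWatch will tell you that write latency spiked, at one-minute resolution, and nothing beneath that:

```python
# Sketch only: the deepest standard view CloudWatch offers of RDS/Aurora
# I/O latency. Instance ID and region are hypothetical.
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="ap-northeast-1")
now = datetime.now(timezone.utc)

resp = cw.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="WriteLatency",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-aurora-instance"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=60,                      # one-minute resolution is the floor
    Statistics=["Average", "Maximum"],
)

for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"], round(dp["Maximum"], 4))
```

You can see that something spiked. You cannot see the journal commit that caused it.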
Fully managed is convenient. Operational costs drop. Tedious work disappears. But the last step of root-cause analysis is dirty, low-level investigation, and managed services will not let you touch that level.
The more convenient things get, the farther away that last step sits when something breaks. Still, the comfort wins.