Observability for Agent Tooling

Agent systems fail in ways that are hard to reproduce without strong traces.

Minimum telemetry

At a minimum, keep traces for:

Group failures by tool, error type, and latency percentile.

Tooling regressions often appear as tail latency increases before hard failures.

Track whether retries fix issues or simply increase cost.

Treat observability as a first-class feature of agent architecture, not a post-release task.