TLDR Monitoring AI agents in production requires distributed tracing: a single user request fans out into 10 or more internal operations, and logs alone cannot show you which step is slow, failing, or burning your token budget. OpenTelemetry's gen_ai.* semantic conventions give you standardized span attributes for LLM calls, tool invocations, and agent steps. Some are stable today; others are still experimental.