AI Agent Observability in Node.js with OpenTelemetry
In short
You can get production-grade observability for a Node.js AI agent with OpenTelemetry alone: the GenAI client span conventions exited experimental status in early 2026, and they now cover LLM calls, agent orchestration, and MCP tool calls. That matters because, as of May 2026, nearly every company runs agents and most of them cannot see inside them. In this post I instrument a TypeScript agent end to end: spans for model calls, tools, and retrieval, token cost per trace, eval wiring, and the three alerts that catch real agent failures.

Why do AI agents fail silently in production?
Agents fail silently because their failures look like success. A broken tool call, an empty retrieval result, or a looping plan still ends in a fluent response with an HTTP 200 attached, so nothing throws, nothing pages, and the first person to notice is a customer. The supporting numbers are stark: as of May 28, 2026, 97% of companies have deployed AI agents but 79% report significant production challenges, and The Register reported on May 13, 2026 that roughly 74% of AI customer-service agent rollouts get rolled back.
I felt this in my own product before I read any survey. My reputation SaaS syncs Google reviews and drafts AI auto-replies, a few thousand replies a month. A profile-data tool once started returning empty objects after an upstream API change. The agent did not crash. It kept producing replies, just generic ones stripped of the business context that tool used to supply. No exception, no error log. A customer flagged the quality drop before anything in my logs did.
AI agent observability is the practice of recording every model call, tool invocation, and retrieval step an agent performs as structured, correlated telemetry so you can reconstruct exactly why the agent produced any given output.
The 2026 consensus is that agent failures are a runtime and observability problem, not a model problem. The models are good enough; the visibility into what they do with your tools is not. In my own AI agents and automation work, most of the agents I ship break at the tool boundary, not inside the model.
Agents do not crash, they comply. The worst failures return HTTP 200 with a confident, wrong answer, and the only way to catch them is to record what the agent actually did, not what it said.
What do the OpenTelemetry GenAI semantic conventions cover?
The OpenTelemetry GenAI semantic conventions define standard span names and gen_ai.* attributes for generative AI operations: model calls (chat), embeddings, agent orchestration (invoke_agent), and tool execution (execute_tool), including MCP tool calls. GenAI client spans exited experimental status in early 2026, which was my green light to standardize on them.
Stability matters more than it sounds. While the conventions were experimental, attribute names churned and dashboards built on them broke between minor versions. With client spans stable, you can build alerts on these attributes and expect them to survive upgrades.
Nearly all reference content for these conventions is Python-first, but the conventions are language-neutral. In Node.js you set the same attributes with @opentelemetry/api and manual spans. These are the attributes that earn their storage for agents:
| Attribute | What it captures | TypeScript example |
| --- | --- | --- |
| gen_ai.operation.name | Operation type | "chat", "execute_tool", "invoke_agent" |
| gen_ai.provider.name | Model provider | "anthropic" |
| gen_ai.request.model | Model you requested | "claude-opus-4-8" |
| gen_ai.response.model | Model that answered | res.model |
| gen_ai.usage.input_tokens | Prompt tokens billed | res.usage.input_tokens |
| gen_ai.usage.output_tokens | Completion tokens billed | res.usage.output_tokens |
| gen_ai.tool.name | Tool being executed | "fetch_business_profile" |
| gen_ai.tool.call.id | Correlates call to result | block.id |
| gen_ai.conversation.id | Groups traces per conversation | session.id |
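
A minimal sketch of how the attributes in the table might be assembled for a single model call. The `ChatRequest` and `ChatResponse` shapes and both helper names are hypothetical, not part of OpenTelemetry or any provider SDK; only the gen_ai.* keys come from the table above. The span name assumes the `{operation} {model}` pattern the conventions suggest for client spans.

```typescript
// Attribute values this sketch needs (OpenTelemetry accepts more types than these).
type Attrs = Record<string, string | number>;

// Hypothetical, trimmed-down views of a chat request and its response.
interface ChatRequest {
  provider: string;
  model: string;
  conversationId?: string;
}

interface ChatResponse {
  model: string;
  usage: { input_tokens: number; output_tokens: number };
}

// Span name for a model call, e.g. "chat claude-opus-4-8".
function chatSpanName(req: ChatRequest): string {
  return `chat ${req.model}`;
}

// Build the gen_ai.* attribute set for one chat span.
function chatSpanAttributes(req: ChatRequest, res: ChatResponse): Attrs {
  const attrs: Attrs = {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": req.provider,
    "gen_ai.request.model": req.model,
    "gen_ai.response.model": res.model,
    "gen_ai.usage.input_tokens": res.usage.input_tokens,
    "gen_ai.usage.output_tokens": res.usage.output_tokens,
  };
  // Only set the conversation id when the caller actually has one.
  if (req.conversationId !== undefined) {
    attrs["gen_ai.conversation.id"] = req.conversationId;
  }
  return attrs;
}
```

With @opentelemetry/api, the returned object would typically be passed to `span.setAttributes(...)` on a span started under `chatSpanName(req)`, just before `span.end()`. Keeping attribute construction in a pure function makes it easy to unit test without an exporter.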

| Signal | Threshold (starting point) | Failure mode it catches |
| --- | --- | --- |
| Tool-call error rate | Over 5% across 15 minutes | Broken integration, expired credentials, schema drift |
| Empty tool results | Over 10% across 1 hour | Upstream API returns 200 with nothing useful |
| Tokens per trace | Over 3x the 7-day median | Runaway loop, context stuffing, prompt regression |
| Loop depth (LLM calls per trace) | Over 8 iterations | Agent stuck cycling on a tool it cannot satisfy |
| p95 trace latency | Over 2x baseline | Slow tool or provider degradation backing up the queue |
Thresholds are starting points; tune them against a week of your own traffic. The empty-result signal is the one I insist on, because it is the only one that catches a dependency that fails politely.
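
The thresholds in the table can be sketched as a single check over pre-aggregated numbers. This is an illustration under assumptions, not how any particular backend evaluates alerts: the `AgentStats` fields and the signal names are hypothetical, and in practice the inputs would come from queries against your trace store.

```typescript
// Pre-aggregated numbers for one evaluation tick (hypothetical field names).
interface AgentStats {
  toolErrorRate15m: number;       // fraction of tool spans that errored, last 15 minutes
  emptyToolResultRate1h: number;  // fraction of tool spans with empty results, last hour
  tokensPerTrace: number;         // tokens used by the trace being checked
  medianTokensPerTrace7d: number; // 7-day median tokens per trace
  llmCallsPerTrace: number;       // loop depth: model calls within one trace
  p95LatencyMs: number;           // current p95 trace latency
  baselineP95LatencyMs: number;   // baseline p95 trace latency
}

// Returns the names of every signal currently over its starting threshold.
function firingSignals(s: AgentStats): string[] {
  const firing: string[] = [];
  if (s.toolErrorRate15m > 0.05) firing.push("tool_call_error_rate");
  if (s.emptyToolResultRate1h > 0.10) firing.push("empty_tool_results");
  if (s.tokensPerTrace > 3 * s.medianTokensPerTrace7d) firing.push("tokens_per_trace");
  if (s.llmCallsPerTrace > 8) firing.push("loop_depth");
  if (s.p95LatencyMs > 2 * s.baselineP95LatencyMs) firing.push("p95_trace_latency");
  return firing;
}
```

Every comparison is strict, matching the "over" wording in the table, so a value sitting exactly on a threshold does not fire.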
My verdict after running this in production: you do not need an observability SaaS on day one. OpenTelemetry plus any OTLP backend covers a solo team, but instrument tool calls from the start, because that is where agents actually break. Tool spans also double as an audit trail of every external action your agent took, which becomes a security asset the moment you harden the deployment; my MCP server hardening checklist leans on exactly that trail for incident review.
Key takeaways
- 97% of companies have deployed AI agents, 79% report significant production challenges, and ~74% of AI customer-service rollouts get rolled back. The gap is observability, not model quality.
- OTel GenAI client spans went stable in early 2026. The gen_ai.* attributes now cover chat, embeddings, agent orchestration, and MCP tool calls, and are safe to build alerts on.
- Instrument tool calls before anything else. Tool failures, especially empty 200 responses, are where agents silently break.
- Token attributes turn traces into cost receipts. A derived cost-per-trace attribute caught a 60% input-token regression in my SaaS the morning after it shipped.
- DeepEval v4.0.3 and Phoenix v16 (both May 21, 2026) consume OTel traces directly, so production spans double as eval datasets with no second instrumentation layer.
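
The cost-per-trace takeaway can be sketched as a small derivation from the token attributes. Everything here is an assumption for illustration: the `ChatSpanUsage` shape, the `costPerTraceUsd` helper, and especially the price table, whose model name and rates are placeholders rather than real provider pricing.

```typescript
// Price per million tokens in USD. The entry below is a placeholder, not real pricing.
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

const EXAMPLE_PRICES: Record<string, Price> = {
  "example-model": { inputPerMTok: 3, outputPerMTok: 15 },
};

// Hypothetical view of one chat span: the model that answered plus its token usage.
interface ChatSpanUsage {
  model: string;        // gen_ai.response.model
  inputTokens: number;  // gen_ai.usage.input_tokens
  outputTokens: number; // gen_ai.usage.output_tokens
}

// Sum the cost of every chat span in a trace.
function costPerTraceUsd(spans: ChatSpanUsage[], prices: Record<string, Price>): number {
  let total = 0;
  for (const span of spans) {
    const price = prices[span.model];
    // Fail loudly on an unpriced model instead of silently reporting zero cost.
    if (price === undefined) {
      throw new Error(`No price configured for model: ${span.model}`);
    }
    total += (span.inputTokens / 1_000_000) * price.inputPerMTok;
    total += (span.outputTokens / 1_000_000) * price.outputPerMTok;
  }
  return total;
}
```

Throwing on an unknown model is a deliberate choice in this sketch: a missing price that quietly contributes zero would be the same kind of polite failure the rest of the post warns about.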
FAQ
What are the OpenTelemetry GenAI semantic conventions?
They are the standard span names and `gen_ai.*` attributes OpenTelemetry defines for generative AI: `chat`, embeddings, `invoke_agent`, and `execute_tool` operations, plus attributes for provider, model, and token usage. GenAI client spans exited experimental status in early 2026, so the core attribute set is now stable enough to build dashboards and alerts on.
How do I trace MCP tool calls?
Wrap each tool execution in a span named `execute_tool {name}` with `gen_ai.operation.name` set to `execute_tool`, plus `gen_ai.tool.name` and `gen_ai.tool.call.id`. The conventions treat MCP tools like any other tool call, so the same span shape works whether the tool runs in process or on a remote MCP server.
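
A minimal sketch of the span shape described above, kept free of dependencies so it can be tested on its own. The helper names are hypothetical, and `app.tool.result_empty` is a custom attribute I am assuming for illustration; it is not part of the GenAI conventions.

```typescript
type ToolAttrs = Record<string, string | number | boolean>;

// Span name for a tool call, e.g. "execute_tool fetch_business_profile".
function toolSpanName(toolName: string): string {
  return `execute_tool ${toolName}`;
}

// True when a tool returned nothing useful: null, undefined, "", [], or {}.
function isEmptyResult(result: unknown): boolean {
  if (result === null || result === undefined) return true;
  if (typeof result === "string") return result.trim().length === 0;
  if (Array.isArray(result)) return result.length === 0;
  if (typeof result === "object") return Object.keys(result as object).length === 0;
  return false;
}

// Build the attribute set for one tool span, including the custom empty-result flag.
function toolSpanAttributes(toolName: string, callId: string, result: unknown): ToolAttrs {
  return {
    "gen_ai.operation.name": "execute_tool",
    "gen_ai.tool.name": toolName,
    "gen_ai.tool.call.id": callId,
    // Custom, non-standard attribute (assumption): flags results that are empty.
    "app.tool.result_empty": isEmptyResult(result),
  };
}
```

With @opentelemetry/api, these attributes would typically be passed to `span.setAttributes(...)` on a span started under `toolSpanName(name)` around the tool invocation. Recording emptiness as a boolean attribute makes the empty-result rate a simple query in most trace backends.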
Do I need an observability SaaS like LangSmith to monitor agents?
No. OpenTelemetry plus any OTLP backend, such as Jaeger, Grafana Tempo, or SigNoz, covers a solo team or a small product. Vendor platforms add LLM-specific UIs and managed evals that become worth paying for at higher volume, but everything in this post is portable to any of them.
How much overhead does OpenTelemetry add to a Node.js agent?
Effectively none relative to LLM latency. Span creation costs microseconds and exports batch in the background, while an agent turn spends seconds in model and tool time. In my production service the OTLP exporter stays well under 1% CPU; the real cost is backend storage, which trace sampling keeps in check.
Malik Hamza Shabbir · Full-Stack & AI Engineer
I build full-stack and AI products solo: a reputation SaaS in production, RAG pipelines, and React Native apps. I write from what I ship, not from documentation summaries.
