<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Observability | Awesome Agents</title><link>https://awesomeagents.ai/tags/observability/</link><description>Your guide to AI models, agents, and the future of intelligence. Reviews, leaderboards, news, and tools - all in one place.</description><language>en-us</language><managingEditor>contact@awesomeagents.ai (Awesome Agents)</managingEditor><lastBuildDate>Sun, 19 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://awesomeagents.ai/tags/observability/index.xml" rel="self" type="application/rss+xml"/><image><url>https://awesomeagents.ai/images/logo.png</url><title>Awesome Agents</title><link>https://awesomeagents.ai/</link></image><item><title>Best AI Observability Tools 2026</title><link>https://awesomeagents.ai/tools/best-ai-observability-tools-2026/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://awesomeagents.ai/tools/best-ai-observability-tools-2026/</guid><description>&lt;p>You shipped an AI feature. Latency is up 40%, a segment of users is getting nonsense answers, and your token bill doubled overnight. You open your logs and see... OpenAI HTTP responses. No prompt context. No tool calls. No span data. Nothing you can actually use to debug what happened.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>You shipped an AI feature. Latency is up 40%, a segment of users is getting nonsense answers, and your token bill doubled overnight. You open your logs and see... OpenAI HTTP responses. No prompt context. No tool calls. No span data. Nothing you can actually use to debug what happened.</p>
<p>That's the wall every team building LLM applications hits. Traditional APM tools - Datadog, New Relic, Prometheus - were built for deterministic code. They capture latency and error rates just fine. What they don't capture is the semantic layer: which prompt fired, what the retrieval step returned, why the model chose that tool call, and whether output quality has degraded since last Tuesday.</p>
<p>LLM observability platforms exist to fill that gap.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li><strong>Best open-source + self-hosted:</strong> Langfuse (MIT, Docker Compose, full eval + tracing stack)</li>
<li><strong>Best for LangChain/LangGraph:</strong> LangSmith (zero-config tracing inside those frameworks)</li>
<li><strong>Best for RAG debugging and OSS self-host:</strong> Arize Phoenix (ELv2, OpenTelemetry-native, no limits)</li>
<li><strong>Best enterprise evaluation platform:</strong> Braintrust (raised $80M Feb 2026, best eval pipeline in this list)</li>
<li><strong>Best infrastructure-native option:</strong> Datadog LLM Observability (if you're already paying for Datadog)</li>
<li><strong>Best open-source eval framework add-on:</strong> TruLens (Apache-2.0, Snowflake-backed, pairs with any stack)</li>
<li><strong>Best for guardrails + monitoring:</strong> WhyLabs LangKit (operational drift detection, free OSS library)</li>
</ul>
</div>
<h2 id="how-we-picked-these">How We Picked These</h2>
<p>Trace depth was the primary evaluation axis - not dashboard aesthetics, but whether the tool captures the full span tree: prompt text, retrieved chunks, tool call arguments, token counts, and cost per step. Tools that show you a latency graph without the semantic context that caused the latency are not useful for debugging LLM applications. We looked specifically at what a trace looks like when a RAG retrieval step returns irrelevant chunks and how each tool surfaces that.</p>
<p>We tested self-hostable tools by actually running them in a Docker Compose environment and instrumenting a multi-step LangChain pipeline. For managed platforms, we used free tiers or trial access where available. Eval quality was assessed by running the same set of outputs through each platform's scoring pipeline and comparing the results for consistency - not just whether the feature existed on the pricing page.</p>
<p>We excluded tools that describe themselves as &quot;AI observability&quot; but only provide HTTP-level latency and error rate monitoring without any semantic instrumentation. Tools that required a lengthy enterprise sales process before we could access any functionality were also excluded.</p>
<p>The LLM observability market is consolidating fast - two platforms mentioned in our 2025 version were acquired between then and this publication. Feature sets and pricing tiers changed multiple times during our research window. Verify current plans as of your evaluation date, especially for tools with usage-based free tiers where the included volume has shifted.</p>
<div class="news-tldr" style="background: var(--color-bg-secondary, #f8f8f8); border-left: 4px solid #888; margin-bottom: 1.5rem;">
<p><strong>Scope of this article:</strong> This article covers tracing, spans, evals, and production monitoring for LLM applications. It does NOT cover prompt versioning, A/B testing prompt variants, or prompt registries. For those topics, see the <a href="/tools/best-ai-prompt-management-tools-2026/">Best AI Prompt Management Tools 2026</a> companion piece. Several tools (Langfuse, LangSmith, Braintrust, Helicone) appear in both articles - the overlap is real, and the notes below make the boundary clear.</p>
</div>
<h2 id="what-llm-observability-actually-means-in-2026">What &quot;LLM Observability&quot; Actually Means in 2026</h2>
<p>The term covers three distinct capabilities that tools bundle in different ways:</p>
<p><strong>Tracing</strong> is the core: capturing every LLM call, tool invocation, retrieval step, and chain transition as structured spans. A good trace tells you what input went in, what came out, how long each step took, and how much it cost. Without tracing, you're debugging in the dark.</p>
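<p>Concretely, a trace is a tree of spans with inputs, outputs, timing, and cost attached to each node. Here is a minimal sketch of that data model - illustrative only, since every platform in this article defines its own schema:</p>
<pre><code class="language-python">from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in a trace: an LLM call, retrieval, or tool invocation."""
    name: str
    input: str
    output: str
    latency_ms: float
    cost_usd: float = 0.0
    children: list = field(default_factory=list)

    def total_cost(self) -> float:
        return self.cost_usd + sum(c.total_cost() for c in self.children)

    def total_latency(self) -> float:
        # For a sequential pipeline, latency accumulates down the tree.
        return self.latency_ms + sum(c.total_latency() for c in self.children)

# A two-step RAG trace: retrieval feeds a generation call.
trace = Span("rag_pipeline", "user question", "final answer", latency_ms=5.0, children=[
    Span("retrieve", "user question", "3 chunks", latency_ms=120.0),
    Span("generate", "prompt + chunks", "final answer", latency_ms=900.0, cost_usd=0.0042),
])

print(f"total cost: ${trace.total_cost():.4f}, total latency: {trace.total_latency():.0f} ms")
</code></pre>
<p>The point of the structure: when latency or cost spikes, you can walk the tree and attribute the regression to a specific step rather than staring at an aggregate number.</p>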
<p><strong>Evals</strong> are structured quality assessments - running your outputs through scoring functions that measure accuracy, faithfulness, relevance, toxicity, or custom business metrics. Evals can run offline (before deploy, in CI/CD) or online (scoring production traffic in real time).</p>
<p><strong>Online monitoring</strong> is the production ops layer: alerting on latency regressions, cost spikes, hallucination rate increases, and distribution drift in model outputs over time.</p>
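<p>At its core, online monitoring is a statistical test over a sliding window. A minimal drift check looks like this - illustrative only; LangKit and the platforms below profile far richer distributions than a single mean:</p>
<pre><code class="language-python">import statistics

def drift_alert(baseline: list[float], current: list[float], threshold_sigmas: float = 3.0) -> bool:
    """Flag when the current window's mean score drifts beyond
    threshold_sigmas standard errors from the baseline mean."""
    mean_b = statistics.mean(baseline)
    stderr = statistics.stdev(baseline) / len(current) ** 0.5
    return abs(statistics.mean(current) - mean_b) > threshold_sigmas * stderr

baseline = [0.02, 0.03, 0.01, 0.02, 0.04, 0.02, 0.03]   # toxicity scores, last week
stable   = [0.03, 0.02, 0.02, 0.01, 0.03]               # today: same distribution
drifted  = [0.12, 0.15, 0.11, 0.14, 0.13]               # today: something changed

print(drift_alert(baseline, stable), drift_alert(baseline, drifted))
</code></pre>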
<p>The boundary with <a href="/tools/best-llm-eval-tools-2026/">eval frameworks</a> like DeepEval, RAGAS, or Inspect AI is intentional - those are testing libraries you run in CI/CD. The platforms in this article are the infrastructure layer those frameworks plug into.</p>
<hr>
<h2 id="feature-comparison-table">Feature Comparison Table</h2>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>OSS</th>
          <th>Tracing</th>
          <th>Evals</th>
          <th>Online Monitoring</th>
          <th>Guardrails</th>
          <th>Self-Hostable</th>
          <th>Free Tier</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LangSmith</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>Enterprise only</td>
          <td>5K traces/mo</td>
      </tr>
      <tr>
          <td>Langfuse</td>
          <td>Yes (MIT)</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
          <td>50K events/mo</td>
      </tr>
      <tr>
          <td>Arize Phoenix</td>
          <td>Yes (ELv2)</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Limited</td>
          <td>No</td>
          <td>Yes</td>
          <td>Unlimited (self-host)</td>
      </tr>
      <tr>
          <td>Arize AX</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>Free tier</td>
      </tr>
      <tr>
          <td>WhyLabs LangKit</td>
          <td>Yes (Apache-2.0)</td>
          <td>Limited</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Free OSS</td>
      </tr>
      <tr>
          <td>Helicone</td>
          <td>Yes (MIT)</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
          <td>10K req/mo</td>
      </tr>
      <tr>
          <td>TruLens</td>
          <td>Yes (Apache-2.0)</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
          <td>Free OSS</td>
      </tr>
      <tr>
          <td>Datadog LLM Obs.</td>
          <td>No</td>
          <td>Yes</td>
          <td>Basic</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>Paid only</td>
      </tr>
      <tr>
          <td>New Relic AI Mon.</td>
          <td>No</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>Free (100GB/mo)</td>
      </tr>
      <tr>
          <td>Galileo AI</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>Contact sales</td>
      </tr>
      <tr>
          <td>DeepChecks LLM</td>
          <td>Partial</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Limited</td>
          <td>Contact sales</td>
      </tr>
      <tr>
          <td>W&amp;B Weave</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>Free (personal)</td>
      </tr>
      <tr>
          <td>Braintrust</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>Enterprise only</td>
          <td>1M spans + 10K scores</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="langsmith">LangSmith</h2>
<p>LangSmith is the observability platform from LangChain Inc. If your application is built on LangChain or LangGraph, the setup is as close to zero-config as exists in this space: set <code>LANGCHAIN_TRACING_V2=true</code> and your API key, and every chain call, tool invocation, and LLM completion is captured automatically as a structured trace tree.</p>
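<p>For a LangChain or LangGraph app, the entire setup is environment variables - no code changes. The variable names below follow the LangSmith docs; the project name is an arbitrary example:</p>
<pre><code class="language-shell"># Zero-config tracing for a LangChain/LangGraph app.
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="&lt;your-langsmith-api-key&gt;"
# Optional: names the project bucket traces land in (any string works).
export LANGCHAIN_PROJECT="my-agent-prod"
</code></pre>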
<p>The trace viewer understands LangChain internals. It renders chains and agents as structured hierarchies rather than flat log entries, and it knows which node in a LangGraph state machine produced each output. The playground lets you replay any production trace with modified inputs or a different model - the fastest way to reproduce a specific failure.</p>
<p>Evaluation support covers four evaluator types: LLM-as-judge, heuristic checks, human annotation queues, and pairwise comparisons. You can run offline evals against curated datasets or score live production traffic in real time. LangSmith also integrates directly with LangGraph Cloud deployments for uptime and version metrics.</p>
<p>Outside the LangChain ecosystem, the calculus changes. The SDK instrumentation works fine with non-LangChain code, but you lose the automatic trace structuring and the tight feedback loop. At that point, LangSmith's per-seat-per-trace pricing becomes harder to justify compared to Langfuse or Phoenix.</p>
<p><strong>Pricing:</strong> Free Developer tier (5K traces/mo, 1 seat, 14-day retention). Plus at $39/seat/month (10K traces, 14-day retention). Overage: $2.50/1K traces at 14-day retention, $5.00/1K for extended 400-day retention. Enterprise is custom.</p>
<p><strong>When to pick it:</strong> You're building on LangChain or LangGraph and want zero-config tracing. Outside that ecosystem, the cost model and integration overhead are harder to justify.</p>
<p><strong>When not to pick it:</strong> High-volume production or budget-sensitive teams. A team of 5 running 500K traces/month hits roughly $1,400/month - versus Langfuse Pro at $199/month flat.</p>
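<p>That $1,400 figure is simple arithmetic against the prices listed above:</p>
<pre><code class="language-python">seats, seat_price = 5, 39                 # Plus plan, $39/seat/month
traces, included = 500_000, 10_000        # monthly volume vs. plan allowance
overage_per_1k = 2.50                     # 14-day retention overage rate

langsmith = seats * seat_price + (traces - included) / 1_000 * overage_per_1k
langfuse_pro = 199                        # flat, per the Langfuse section below

print(f"LangSmith: ${langsmith:,.0f}/mo vs Langfuse Pro: ${langfuse_pro}/mo")
</code></pre>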
<hr>
<h2 id="langfuse">Langfuse</h2>
<p>Langfuse is the platform I recommend as the default starting point for most teams. It covers tracing, evals, and production monitoring under an MIT license, with full feature parity on self-hosted deployments. There are no usage caps on self-hosted. The cloud free tier is 50,000 events per month, which covers a real development workload.</p>
<p>The trace model is spans-based and provider-agnostic: OpenAI, Anthropic, LiteLLM, LangChain, LlamaIndex, and the raw HTTP API all instrument the same way. OpenTelemetry support was added in a recent release, meaning any OTLP-compatible framework can emit traces to Langfuse without a dedicated SDK.</p>
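<p>With OpenTelemetry support, routing an OTLP-instrumented app to Langfuse is configuration rather than code. The standard OTLP environment variables are shown below; the endpoint path and auth header format are assumptions - verify them against current Langfuse docs:</p>
<pre><code class="language-shell"># Standard OTLP exporter env vars, pointed at Langfuse's OTLP endpoint.
# Endpoint path and Basic-auth encoding are assumptions - check the docs.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://cloud.langfuse.com/api/public/otel"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic &lt;base64(public_key:secret_key)&gt;"
</code></pre>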
<p>What distinguishes Langfuse on the eval side is dataset management. You can capture production traces directly into labeled datasets, define scoring rubrics, run LLM-as-judge evaluations, and track quality metrics across prompt versions. The prompt management layer (covered separately in the <a href="/tools/best-ai-prompt-management-tools-2026/">prompt management article</a>) links every prompt version to the exact traces it produced - useful for attribution when something regresses.</p>
<p>Self-hosting is genuinely usable. Docker Compose works out of the box for development and moderate production loads. For large deployments, Langfuse estimates $3,000-4,000/month in infrastructure costs, compared to $199/month on cloud Pro.</p>
<p><strong>Pricing:</strong> Hobby at $0/month (50K events/mo, 2 users, 30-day retention). Core at $29/month (100K events, unlimited users). Pro at $199/month (3-year retention, SOC2, HIPAA). Enterprise at $2,499/month (audit logs, SCIM). Overage at $8/100K events.</p>
<p><strong>When to pick it:</strong> Teams that want a serious open-source platform with self-hosting, a generous free tier, and tight eval integration. Especially strong for European data residency.</p>
<p><strong>When not to pick it:</strong> Teams running at massive scale who want a fully managed enterprise product with SLAs. At 50M+ events/month, the cost model shifts.</p>
<hr>
<h2 id="arize-phoenix">Arize Phoenix</h2>
<p>Phoenix is Arize AI's open-source observability project. Licensed under Elastic License 2.0 - not MIT, which matters for embedding in commercial products, but freely usable for self-hosting - it runs with no usage caps, no feature gates, and no phone-home. The managed cloud counterpart is Arize AX.</p>
<p>The technical foundation is strong. Phoenix is built on OpenTelemetry natively, meaning it ingests OTLP traces from any compatible framework - LangChain, LlamaIndex, the OpenAI Agents SDK, CrewAI, DSPy, LiteLLM, and raw HTTP instrumentation. That breadth matters for heterogeneous stacks where you don't want to commit to a single framework's instrumentation idioms.</p>
<p>The evaluation layer ships built-in evaluators for RAG-specific metrics: faithfulness, relevance, context precision, context recall, hallucination detection, and toxicity. RAGAS and DeepEval both integrate with Phoenix as the storage and visualization layer, which means you can run RAGAS evaluations and see the results in the Phoenix UI without writing any glue code.</p>
<p>The RAG debugging workflow is one of Phoenix's strongest differentiators. The UI lets you drill into specific retrieval steps to see which chunks were returned, how they ranked, and whether they were actually used in the final generation. For teams debugging why their RAG pipeline is hallucinating or returning irrelevant answers, that level of retrieval-level visibility is rare.</p>
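<p>Two of those retrieval metrics are mechanical once you have relevance labels. The sketch below uses the simplest set-based definitions - RAGAS and Phoenix ship more refined, rank-aware versions:</p>
<pre><code class="language-python">def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of the relevant chunks that retrieval managed to surface."""
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved)) / len(relevant)

retrieved = ["chunk_a", "chunk_b", "chunk_c", "chunk_d"]
relevant = {"chunk_a", "chunk_c", "chunk_e"}

print(context_precision(retrieved, relevant))  # 2 of 4 retrieved are relevant -> 0.5
print(context_recall(retrieved, relevant))     # 2 of 3 relevant chunks were retrieved
</code></pre>
<p>High precision with low recall points at your index missing content; low precision points at your ranking or chunking - the trace viewer is what lets you tell which.</p>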
<p>Arize AX (the managed cloud version) starts at $50/month for Pro and adds online monitoring, alerting, and guardrails that aren't available in the self-hosted OSS version.</p>
<p><strong>Pricing:</strong> Phoenix self-hosted is free (ELv2). Arize AX Free ($0/month). AX Pro at $50/month. AX Enterprise is custom.</p>
<p><strong>When to pick it:</strong> You need full self-hosting with no limits and no vendor dependency. You're on LlamaIndex (native integration). You need OpenTelemetry-native instrumentation across a mixed framework stack.</p>
<p><strong>When not to pick it:</strong> You need managed online monitoring and alerting out of the box - that requires upgrading to Arize AX.</p>
<hr>
<h2 id="whylabs-langkit">WhyLabs LangKit</h2>
<p>WhyLabs focuses on the operational monitoring angle - drift detection, statistical guardrails, and continuous quality tracking - rather than interactive trace debugging. LangKit is their open-source library (Apache-2.0) for profiling LLM inputs and outputs and detecting problematic patterns.</p>
<p>Out of the box, LangKit profiles text for toxicity, sentiment, reading level, injection patterns, and semantic similarity. The WhyLabs platform (free tier available) aggregates these profiles over time and alerts on statistical drift: if your model's output toxicity distribution shifts from Tuesday to Wednesday, you get a notification before users file support tickets.</p>
<p>The integration model is a profiling hook rather than full trace instrumentation. You wrap your LLM calls with LangKit's profiling functions, ship the profiles to WhyLabs, and get a statistical dashboard rather than a per-trace inspector. For teams that care more about &quot;is the population of outputs drifting?&quot; than &quot;why did this specific output fail?&quot;, that's the right abstraction level.</p>
<p>LangKit integrates with Langfuse and Arize Phoenix as a data source, so you're not choosing between them - you can run LangKit for population-level monitoring and Phoenix for per-trace debugging on the same system.</p>
<p><strong>Pricing:</strong> WhyLabs platform has a free tier. Enterprise pricing is custom. LangKit OSS is free under Apache-2.0.</p>
<p><strong>When to pick it:</strong> You care about production drift detection and statistical guardrails. You want population-level alerting rather than individual trace debugging. It's a complement to a tracing platform, not a replacement.</p>
<hr>
<h2 id="helicone">Helicone</h2>
<p>Helicone takes a proxy-based approach: you change your LLM provider's base URL to route through Helicone's gateway, and you immediately get request logging, latency tracking, cost attribution, and caching. No SDK. No code changes beyond the base URL. You can add observability to an existing codebase in under fifteen minutes.</p>
<p>The proxy layer enables features that SDK-based tools can't offer: semantic caching that reduces redundant API calls (Helicone claims 20-30% cost reduction for typical workloads), rate limiting, model fallbacks, and traffic routing across providers.</p>
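<p>The caching idea is easy to see in miniature. This sketch does exact-match keying; a semantic cache like Helicone's generalizes the lookup to embedding similarity. The function names here are illustrative, not Helicone's API - the proxy does this keying transparently:</p>
<pre><code class="language-python">import hashlib

_cache: dict[str, str] = {}

def cached_completion(model: str, prompt: str, call_llm) -> tuple[str, bool]:
    """Return (response, was_cache_hit). call_llm stands in for the real
    provider call; a hit skips the API spend entirely."""
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key in _cache:
        return _cache[key], True
    response = call_llm(model, prompt)
    _cache[key] = response
    return response, False

fake_llm = lambda model, prompt: f"answer to: {prompt}"
print(cached_completion("gpt-x", "what is OTLP?", fake_llm))  # miss: calls the model
print(cached_completion("gpt-x", "what is OTLP?", fake_llm))  # hit: no API call
</code></pre>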
<p>For pure observability - tracing and eval pipelines - Helicone is thinner than Langfuse or LangSmith. There's no native eval workflow, no dataset management, and no LLM-as-judge scoring. What it does well is production cost monitoring and caching. The 7-day retention on the free tier is the shortest window in this comparison, which limits retrospective debugging.</p>
<p>Prompt versioning is available in Helicone (covered in the <a href="/tools/best-ai-prompt-management-tools-2026/">prompt management article</a>) but it's secondary to the observability and gateway story.</p>
<p><strong>Pricing:</strong> Hobby at $0/month (10K req/mo, 7-day retention). Pro at $79/month (unlimited seats, 1-month retention). Team at $799/month (3-month retention, alerts, HQL). Self-hosted OSS is free.</p>
<p><strong>When to pick it:</strong> You want production visibility with minimal code changes and you care about API cost reduction via caching. Not the right call if systematic eval pipelines are a priority.</p>
<hr>
<h2 id="trulens">TruLens</h2>
<p>TruLens is Snowflake's open-source evaluation and tracking library for LLM applications. Apache-2.0 licensed, it runs locally or self-hosted with no usage caps. The core value proposition is applying structured evaluation metrics - the RAG triad of context relevance, groundedness, and answer relevance - to any LLM application, not just those built on a specific framework.</p>
<p>The instrumentation model uses Python decorators to wrap application components. You annotate your retriever, your LLM call, and your synthesizer, and TruLens records the inputs, outputs, and intermediate states for each. Feedback functions (TruLens terminology for eval metrics) run against these records and produce per-component scores that aggregate into an overall evaluation view.</p>
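<p>The feedback-function pattern is worth seeing in miniature. This sketch mimics the shape - per-record scoring functions aggregated into metrics - but it is not the TruLens API, and real feedback functions typically call an LLM or embedding model rather than doing substring checks:</p>
<pre><code class="language-python"># Each feedback function scores one record in [0, 1]; scores aggregate per metric.
def context_relevance(record: dict) -> float:
    # Stand-in check - real implementations use an LLM judge or embeddings.
    return 1.0 if record["question_topic"] in record["context"] else 0.0

def groundedness(record: dict) -> float:
    return 1.0 if record["answer_claim"] in record["context"] else 0.0

records = [
    {"question_topic": "pricing", "context": "pricing table ...", "answer_claim": "pricing"},
    {"question_topic": "pricing", "context": "unrelated blurb", "answer_claim": "pricing"},
]

for fn in (context_relevance, groundedness):
    scores = [fn(r) for r in records]
    print(fn.__name__, sum(scores) / len(scores))
</code></pre>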
<p>TruLens's TruRails module handles guardrails: you define input/output constraints that are enforced at inference time, not just measured after the fact. This production-time enforcement is a differentiator versus tools that only run evals offline.</p>
<p>The trade-off relative to full platforms like Langfuse or Braintrust is the ops layer. TruLens is a library for evaluation logic; it integrates with databases and dashboards but doesn't ship a managed monitoring or alerting infrastructure. You bring your own storage backend (SQLite works for development, PostgreSQL or Snowflake Cortex for production).</p>
<p><strong>Pricing:</strong> TruLens is free and open source under Apache-2.0. Snowflake Cortex integration is billed at standard Snowflake compute rates.</p>
<p><strong>When to pick it:</strong> You want a serious open-source eval framework with guardrails support and you're comfortable managing your own storage backend. Especially strong if you're already on Snowflake infrastructure.</p>
<hr>
<h2 id="datadog-llm-observability">Datadog LLM Observability</h2>
<p>Datadog LLM Observability is an extension of the standard Datadog APM product specifically for LLM workloads. If your team already runs Datadog for infrastructure monitoring, it's the lowest-friction path to LLM tracing because it plugs into your existing dashboards, alerts, and on-call workflows.</p>
<p>The product captures LLM-specific spans: prompt, completion, token counts, latency, and cost. It handles multi-step agent traces with tool call attribution. Cluster maps and quality metrics are available in dashboards alongside your existing infrastructure metrics - which means a latency spike in your LLM pipeline surfaces in the same place as a database query regression.</p>
<p>Evaluation capabilities are more limited than the dedicated platforms. Datadog includes quality scoring for outputs (LLM-as-judge style) but doesn't have the dataset management, eval pipeline tooling, or systematic evaluation workflows that Braintrust or Langfuse offer. It's infrastructure monitoring with LLM context, not a full evaluation platform.</p>
<p>The pricing model matches standard Datadog billing: per GB of ingested data, layered on top of existing APM costs. For teams already on large Datadog contracts, LLM Observability may be included or come at minimal marginal cost. For teams not already on Datadog, it's expensive relative to dedicated platforms.</p>
<p><strong>Pricing:</strong> Billed per GB of LLM data ingested, on top of existing Datadog APM costs. See <a href="https://www.datadoghq.com/pricing/">Datadog pricing</a> for current rates. No free tier.</p>
<p><strong>When to pick it:</strong> You're already a Datadog customer and want LLM traces visible in the same system as infrastructure metrics. Not worth the switching cost if you're starting fresh.</p>
<hr>
<h2 id="new-relic-ai-monitoring">New Relic AI Monitoring</h2>
<p>New Relic AI Monitoring (part of New Relic's core platform) follows a similar positioning to Datadog: if you're already running New Relic for APM, you can extend your visibility to LLM calls without adding another vendor.</p>
<p>The product captures AI response messages with token counts, model details, and cost attribution across OpenAI, Bedrock, and other providers. The correlation between LLM call latency and end-user session performance is a genuine differentiator - you can trace a slow AI response back through the full application stack to understand whether the bottleneck is the model, the retrieval step, or the surrounding infrastructure.</p>
<p>Evaluation and systematic quality assessment are out of scope. New Relic AI Monitoring is production monitoring and root-cause analysis, not eval pipelines. For those workflows, you'd layer a dedicated tool on top.</p>
<p>The pricing model is New Relic's standard data ingest billing. The free tier includes 100 GB of data ingest per month, which may cover LLM traces for low-to-moderate volume applications. For high-volume applications, the ingest-based pricing can exceed dedicated platforms.</p>
<p><strong>Pricing:</strong> New Relic includes 100 GB data ingest/month on the free plan. Paid plans start at $0.30/GB beyond the free tier (standard ingest). See <a href="https://newrelic.com/pricing">New Relic pricing</a>.</p>
<p><strong>When to pick it:</strong> Existing New Relic customers who want LLM call visibility alongside their existing APM and infrastructure monitoring.</p>
<hr>
<h2 id="galileo-ai">Galileo AI</h2>
<p>Galileo is a purpose-built LLM evaluation and monitoring platform with strong guardrails capabilities. The product focuses on production quality monitoring - automatically detecting hallucinations, PII leakage, prompt injections, and toxicity in live traffic - rather than developer-time trace debugging.</p>
<p>The guardrails system is one of Galileo's clearest differentiators. You define policies (no PII, no hallucinations above a threshold, no toxic outputs) and Galileo enforces them in the request path, blocking or flagging violations before they reach users. This real-time enforcement is a step beyond tools that only measure quality after the fact.</p>
<p>Galileo also ships a RAG evaluation toolkit with quality metrics across the retrieval and generation steps, and a human feedback collection layer for annotation workflows.</p>
<p>The limitation is transparency on pricing and self-hosting. Galileo is a fully managed SaaS product - there is no open-source version - and pricing is contact-sales. For teams with strict data residency requirements or those that need to understand total cost before engaging a vendor, this makes evaluation harder.</p>
<p><strong>Pricing:</strong> Contact <a href="https://www.galileo.ai/pricing">galileo.ai/pricing</a> for current rates. No public self-serve pricing as of April 2026.</p>
<p><strong>When to pick it:</strong> You care about automated guardrails enforced in the request path and are building applications where safety/compliance violations in outputs are a primary concern.</p>
<hr>
<h2 id="deepchecks-llm">DeepChecks LLM</h2>
<p>DeepChecks started in the traditional ML space (data validation and model testing) before adding an LLM evaluation layer. The LLM product focuses on systematic evaluation - test suites, regression checks, and ongoing monitoring - rather than trace-level debugging.</p>
<p>The evaluation framework covers 40+ built-in checks across accuracy, fluency, coherence, groundedness, safety, and hallucination categories. You can define custom checks and run evaluation suites as part of a CI/CD pipeline. The production monitoring module extends this to live traffic, flagging outputs that fail quality checks.</p>
<p>The overlap with DeepEval (covered in the <a href="/tools/best-llm-eval-tools-2026/">eval tools article</a>) is significant. Teams choosing between them will find DeepChecks stronger on the managed cloud and production monitoring side, and DeepEval stronger on the open-source framework and CI integration side.</p>
<p><strong>Pricing:</strong> Enterprise-focused. Contact <a href="https://deepchecks.com">deepchecks.com</a> for pricing. Limited self-service free tier.</p>
<p><strong>When to pick it:</strong> You need systematic quality evaluation across large-scale production workloads and prefer a managed platform with a history in traditional ML validation.</p>
<hr>
<h2 id="weights--biases-weave">Weights &amp; Biases Weave</h2>
<p>W&amp;B Weave is the LLM evaluation and tracing product from Weights &amp; Biases (W&amp;B). It builds on W&amp;B's established ML experiment tracking infrastructure to add LLM-specific capabilities: trace logging, evaluation scoring, dataset management, and human review queues.</p>
<p>The trace model captures LLM calls and multi-step agent workflows as structured trees, with cost and latency attribution per span. The evaluation layer lets you define scoring metrics, run them against trace datasets, and track quality trends across experiments using the same W&amp;B interface that ML teams use for hyperparameter comparison.</p>
<p>For teams already running W&amp;B for traditional ML experiments, Weave is the obvious extension. The data model is consistent - experiments, runs, artifacts, datasets - and you don't add another vendor. For teams not on W&amp;B, it's a heavier platform to adopt if you only need LLM observability.</p>
<p>Weave supports OpenAI, Anthropic, Cohere, LiteLLM, DSPy, LangChain, and LlamaIndex. The SDK is straightforward: decorate functions you want to trace, and W&amp;B handles the rest.</p>
<p><strong>Pricing:</strong> Personal use is free with unlimited tracked entities. Team plan is $50/month base + $25/seat/month. Enterprise is custom. See <a href="https://wandb.ai">wandb.ai</a> for current plan details.</p>
<p><strong>When to pick it:</strong> You're already running W&amp;B for ML experiment tracking and want LLM tracing integrated with your existing research workflow.</p>
<hr>
<h2 id="opentelemetry-genai-semantic-conventions">OpenTelemetry GenAI Semantic Conventions</h2>
<p>OpenTelemetry is not a product - it's the CNCF-standardized observability framework that defines how distributed systems should emit traces, metrics, and logs. The <a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/">GenAI semantic conventions</a> are the emerging standard for how LLM-specific telemetry should be structured.</p>
<p>This matters because of the fragmentation problem: every LLM observability platform currently defines its own span schema. A LangSmith trace looks different from a Langfuse trace, which looks different from a Phoenix trace. If you later want to switch platforms, you rewrite your instrumentation. If you want to use multiple platforms simultaneously, you run multiple SDKs.</p>
<p>The GenAI semantic conventions define standard attribute names for LLM events: <code>gen_ai.system</code>, <code>gen_ai.request.model</code>, <code>gen_ai.request.max_tokens</code>, <code>gen_ai.usage.input_tokens</code>, and so on. Platforms that adopt these conventions become interchangeable at the instrumentation layer - you emit OTLP traces once and route them to whatever backend you choose.</p>
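<p>A span carrying the GenAI convention attributes is just a bag of well-known keys. Shown here as a plain dict so the shape is visible without the OTel SDK; the per-token prices are assumptions for illustration only:</p>
<pre><code class="language-python"># Attribute names are from the OTel GenAI semantic conventions.
span_attributes = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.request.max_tokens": 512,
    "gen_ai.usage.input_tokens": 1842,
    "gen_ai.usage.output_tokens": 310,   # companion attribute to input_tokens
}

# Any OTLP backend that reads these names can compute cost without vendor glue.
PRICE = {"gpt-4o": {"input": 2.50 / 1e6, "output": 10.00 / 1e6}}  # assumed $/token
model = span_attributes["gen_ai.request.model"]
cost = (span_attributes["gen_ai.usage.input_tokens"] * PRICE[model]["input"]
        + span_attributes["gen_ai.usage.output_tokens"] * PRICE[model]["output"])
print(f"${cost:.6f}")
</code></pre>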
<p><strong>Who's adopting it:</strong> Arize Phoenix is built on OTLP natively. Langfuse added OpenTelemetry support in recent releases. OpenAI's Agents SDK emits OTLP traces compatible with the GenAI conventions. The LlamaIndex SDK includes GenAI convention support.</p>
<p><strong>Why it matters for your stack:</strong> Instrumenting against the GenAI semantic conventions rather than a vendor SDK is the defensive choice. Your traces are portable across platforms, compatible with standard collectors (OpenTelemetry Collector, Grafana Alloy), and readable by any OTLP-compatible backend. If you're starting a new project in 2026, wrapping your LLM calls in OpenTelemetry spans and routing them to your observability platform of choice is more future-proof than vendor-specific instrumentation.</p>
<p>The conventions are stable for basic LLM spans (prompts, completions, token counts, model info) and still evolving for more complex patterns (multi-modal, tool calls, agents). Check the <a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/">specification</a> for the current stable set before building production instrumentation.</p>
<hr>
<h2 id="braintrust---evals-first-monitoring-second">Braintrust - Evals First, Monitoring Second</h2>
<p>Braintrust raised an $80 million Series B in February 2026 at an $800 million valuation (Iconiq-led, a16z and Greylock participating). The product is built on the premise that AI quality management is an engineering discipline - not a dashboard you check monthly, but a systematic process integrated into every deploy.</p>
<p>The evaluation pipeline is the strongest in this comparison. It ships 25+ built-in scorers for accuracy, relevance, faithfulness, hallucination, and safety. The Loop AI assistant generates custom scorers from natural language descriptions. Dataset management handles golden sets, regression suites, and per-experiment tracking. Release gates can block merges when eval scores fall below a defined threshold - automated quality enforcement rather than a metric you have to remember to look at.</p>
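<p>The release-gate pattern generalizes beyond any one vendor. As a hedged sketch - this is not the Braintrust SDK; <code>score_case</code>, <code>release_gate</code>, and the golden set are hypothetical stand-ins - a CI gate reduces to: score every case in a golden dataset, compute an aggregate, and fail the build below a threshold.</p>

```python
# Hedged sketch of a CI release gate: run an eval suite against a golden
# dataset and fail the build when the mean score drops below a threshold.
# All names here are illustrative, not any platform's API.

def score_case(expected: str, actual: str) -> float:
    """Toy exact-match scorer; real platforms ship LLM-graded scorers."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def release_gate(dataset, generate, threshold=0.9):
    """Return (passed, mean_score); CI would exit nonzero when passed is False."""
    scores = [score_case(case["expected"], generate(case["input"])) for case in dataset]
    mean = sum(scores) / len(scores)
    return mean >= threshold, mean

golden = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

# Stand-in for the model under test; a real gate would call your deployed prompt.
passed, mean = release_gate(
    golden, lambda q: {"2+2": "4", "capital of France": "Paris"}[q]
)
```

<p>What the hosted platforms add on top of this skeleton is the hard part: versioned datasets, scorer libraries, and the link from each score back to the exact prompt and model that produced it.</p>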
<p>For production monitoring, Braintrust captures spans in production and scores them against the same eval rubrics you use offline. The traceability model - every eval score links back to the exact prompt version, model, and dataset that produced it - is what makes quality engineering tractable at scale.</p>
<p>The prompt management angle (covered in the <a href="/tools/best-ai-prompt-management-tools-2026/">prompt management article</a>) is tightly coupled with evals: prompt versioning and quality measurement are intentionally intertwined.</p>
<p>On the observability side specifically, Braintrust's trace viewer is solid but less polished than LangSmith's for complex multi-step agents. The tracing is production-capable, but the platform's identity is evaluation-first.</p>
<p><strong>Pricing:</strong> Starter at $0/month base (1M spans, 10K scores/mo, unlimited users, 14-day retention, then $2.50/1K scores, $4/GB). Pro at $249/month (unlimited spans, 50K scores, 30-day retention). Enterprise custom.</p>
<p><strong>When to pick it:</strong> You're treating AI quality as an engineering discipline with CI-integrated release gates and systematic eval pipelines. The evaluation infrastructure justifies the cost for teams that have moved past &quot;make it work&quot; into &quot;manage it reliably.&quot;</p>
<hr>
<h2 id="best-for-x---decision-matrix">Best for X - Decision Matrix</h2>
<table>
  <thead>
      <tr>
          <th>Use case</th>
          <th>Best pick</th>
          <th>Runner-up</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Open-source + self-hosted</td>
          <td>Langfuse</td>
          <td>Arize Phoenix</td>
      </tr>
      <tr>
          <td>Self-hosted with no usage caps</td>
          <td>Arize Phoenix</td>
          <td>Langfuse</td>
      </tr>
      <tr>
          <td>RAG debugging</td>
          <td>Arize Phoenix</td>
          <td>Langfuse</td>
      </tr>
      <tr>
          <td>LangChain / LangGraph stacks</td>
          <td>LangSmith</td>
          <td>Langfuse</td>
      </tr>
      <tr>
          <td>Enterprise eval pipeline</td>
          <td>Braintrust</td>
          <td>Langfuse</td>
      </tr>
      <tr>
          <td>Production guardrails</td>
          <td>Galileo AI</td>
          <td>WhyLabs + LangKit</td>
      </tr>
      <tr>
          <td>Infrastructure-native monitoring</td>
          <td>Datadog LLM Obs.</td>
          <td>New Relic AI Mon.</td>
      </tr>
      <tr>
          <td>Cost reduction via caching</td>
          <td>Helicone</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Drift detection / statistical monitoring</td>
          <td>WhyLabs LangKit</td>
          <td>-</td>
      </tr>
      <tr>
          <td>OpenTelemetry-native</td>
          <td>Arize Phoenix</td>
          <td>Langfuse</td>
      </tr>
      <tr>
          <td>Already on W&amp;B for ML experiments</td>
          <td>W&amp;B Weave</td>
          <td>Braintrust</td>
      </tr>
      <tr>
          <td>Budget / free open-source</td>
          <td>Langfuse (self-host)</td>
          <td>TruLens</td>
      </tr>
      <tr>
          <td>Snowflake stack</td>
          <td>TruLens</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="frequently-asked-questions">Frequently Asked Questions</h2>
<p><strong>Do I need a dedicated LLM observability tool or can I use Datadog / New Relic?</strong></p>
<p>If you're already running Datadog or New Relic at scale and your requirements are latency, error rates, and cost tracking, the native LLM extensions may be enough. Where they fall short is systematic evaluation, RAG-specific quality metrics, and the dataset management you need for ongoing quality engineering. Most teams beyond early production will want a dedicated platform layered alongside their existing APM.</p>
<p><strong>What's the difference between tracing tools and eval frameworks?</strong></p>
<p>Tracing tools (this article) are infrastructure: they capture and store what your model is doing in production. Eval frameworks like DeepEval, RAGAS, or Inspect AI are testing libraries you run against that data. The clearest way to think about it: eval frameworks define the tests; tracing platforms run them at scale and store the results. See the <a href="/tools/best-llm-eval-tools-2026/">LLM eval tools comparison</a> for the framework side of the picture.</p>
<p><strong>Is OpenTelemetry ready for production LLM tracing?</strong></p>
<p>The GenAI semantic conventions are stable enough for core LLM spans (prompts, completions, token usage, model attribution). They're still evolving for complex agent patterns and tool calls. Arize Phoenix and Langfuse both have stable OTLP ingestion. If you're starting fresh, building on the GenAI conventions now is reasonable - just accept that some edge cases may require updating attribute names as the spec matures.</p>
<p><strong>How do these tools relate to prompt management platforms?</strong></p>
<p>Several tools appear in both categories. The rule of thumb: when an article discusses how a tool captures traces, scores quality, and alerts on regressions, that's observability. When it discusses how a tool versions prompts, runs A/B tests, and deploys prompt changes, that's <a href="/tools/best-ai-prompt-management-tools-2026/">prompt management</a>. Langfuse and Braintrust are genuinely strong on both sides; Helicone and LangSmith lean toward observability with basic prompt management as a secondary feature.</p>
<p><strong>What should I use for <a href="/tools/best-ai-agent-frameworks-2026/">AI agents</a> specifically?</strong></p>
<p>Complex agents - multi-step, multi-tool, with branching logic - need trace visualization that understands the graph structure. LangSmith is best if you're using LangGraph. Arize Phoenix handles agent traces well with its span tree view and is framework-agnostic. Braintrust's trace visualization handles nested agent calls but the eval-first design means the trace UI is secondary to the quality metrics. For agent eval specifically, pairing any of these with DeepEval or TruLens for structured quality checks is the current best practice.</p>
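<p>To illustrate why graph-aware visualization matters, here is a hedged sketch of an agent trace as a nested span tree. The span names, the <code>kind</code> labels, and the <code>render</code> helper are illustrative, not any platform's schema - but the tree shape is what every agent trace viewer is ultimately rendering.</p>

```python
# Hedged sketch: a multi-step agent run as a nested span tree.
# Field names and span kinds are illustrative, not a real trace schema.

from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    kind: str                       # "agent" | "llm" | "tool" | "retrieval"
    children: list = field(default_factory=list)

def render(span: Span, depth: int = 0) -> list:
    """Flatten the tree into indented lines, the way a trace viewer would."""
    lines = ["  " * depth + f"{span.kind}: {span.name}"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines

trace = Span("support-agent run", "agent", [
    Span("plan next step", "llm"),
    Span("search_docs", "tool", [Span("vector lookup", "retrieval")]),
    Span("draft answer", "llm"),
])
```

<p>A flat log of the same run would interleave these five events with everything else in the request; the tree is what lets you see that the retrieval happened inside the tool call, not before the planning step.</p>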
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://www.langchain.com/pricing">LangSmith Plans and Pricing</a></li>
<li><a href="https://langfuse.com/pricing">Langfuse Pricing</a></li>
<li><a href="https://langfuse.com/docs/tracing">Langfuse Tracing Overview</a></li>
<li><a href="https://github.com/Arize-ai/phoenix">Arize Phoenix on GitHub</a></li>
<li><a href="https://docs.arize.com/phoenix">Arize Phoenix Documentation</a></li>
<li><a href="https://github.com/whylabs/langkit">WhyLabs LangKit on GitHub</a></li>
<li><a href="https://www.whylabs.ai">WhyLabs Platform</a></li>
<li><a href="https://www.helicone.ai/pricing">Helicone Pricing</a></li>
<li><a href="https://www.trulens.org">TruLens Documentation</a></li>
<li><a href="https://github.com/truera/trulens">TruLens on GitHub</a></li>
<li><a href="https://docs.datadoghq.com/llm_observability/">Datadog LLM Observability</a></li>
<li><a href="https://newrelic.com/platform/ai-monitoring">New Relic AI Monitoring</a></li>
<li><a href="https://www.galileo.ai">Galileo AI</a></li>
<li><a href="https://deepchecks.com/llm-evaluation/">DeepChecks LLM Evaluation</a></li>
<li><a href="https://weave-docs.wandb.ai/">Weights &amp; Biases Weave</a></li>
<li><a href="https://www.braintrust.dev/pricing">Braintrust Pricing</a></li>
<li><a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/">OpenTelemetry GenAI Semantic Conventions</a></li>
<li><a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/">OpenTelemetry GenAI Spans Spec</a></li>
<li><a href="https://siliconangle.com/2026/02/17/braintrust-lands-80m-series-b-funding-round-become-observability-layer-ai/">Braintrust raises $80M Series B - SiliconANGLE</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-ai-observability-tools-2026_hu_3897f4d1f57d8b74.jpg" medium="image" width="1200" height="630"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-ai-observability-tools-2026_hu_3897f4d1f57d8b74.jpg" width="1200" height="630"/></item></channel></rss>