Instrument OpenAI Agents with Langfuse: Full Observability Tutorial

Balys Kriksciunas · Tue Jun 16 2026 · 6 min read

#ai #agents #tutorial #observability #langfuse #openai-agents-sdk #opentelemetry

Trace every tool call, guardrail check, and handoff in your OpenAI Agents SDK app with Langfuse. Working code, no fluff.

Your agent returned “Sure, let me help with that!” and then silently booked a flight to the wrong airport. You have no idea which tool call failed — or even that it failed at all.

That’s the observability gap that kills production agents. The OpenAI Agents SDK ships with basic trace forwarding to OpenAI’s dashboard, but that mostly surfaces the path that produced the final answer. You don’t see the guardrail that fired and was ignored, the handoff that bounced to the wrong sub-agent, or the tool call that returned an empty list instead of raising an error.

Langfuse fills that gap. It captures the full tree — every LLM call, tool invocation, guardrail check, and handoff — and gives you a searchable, filterable view of what your agent actually did. In this tutorial, you’ll build a multi-agent customer support system with the OpenAI Agents SDK and instrument it with Langfuse from scratch.

Why Langfuse for OpenAI Agents Observability

The OpenAI Agents SDK tracing goes to OpenAI’s platform by default. That’s fine for quick debug sessions, but it has three production blind spots:

No data ownership. Traces leave your infrastructure. For regulated workloads, that’s a blocker.
No custom scoring. You can’t attach eval metrics, user feedback scores, or business KPIs to traces.
No cross-framework view. If your stack mixes the Agents SDK, LangGraph, and direct API calls, you need one place to see everything.

Langfuse solves all three. It runs on OpenTelemetry under the hood, the same standard that Arize Phoenix and most major observability tools build on. You can self-host it with Docker Compose or use the cloud tier (free for 50k observations/month). For teams already running Langfuse for their LangChain or LangGraph workloads, adding the OpenAI Agents SDK is a single instrumentation call.

We compared Langfuse, LangSmith, and Arize Phoenix in depth earlier this year; the full breakdown is in our three-way comparison post. For this tutorial, Langfuse is the recommended stack because the OpenInference instrumentation for the OpenAI Agents SDK, which Langfuse ingests over OpenTelemetry, captures the complete agent lifecycle without monkey-patching.

What You’ll Build

A three-agent support system:

Triage Agent — routes incoming requests to the right specialist
Billing Agent — looks up invoices, payment status, and refunds (via simulated tools)
Technical Agent — handles API errors, rate limits, and integration issues

All three communicate through handoffs. A customer message hits the triage agent, which decides whether to hand off to billing or tech. Each specialist agent uses tools to fetch data and returns a resolution. The entire trace — including guardrail checks, tool inputs/outputs, and handoff transitions — appears in Langfuse.

Setup

Install the dependencies:

pip install openai-agents langfuse openinference-instrumentation-openai-agents nest-asyncio

Set your environment variables. You’ll need an OpenAI API key and Langfuse credentials. Sign up for Langfuse Cloud (free Hobby tier: 50k observations/month) or self-host with Docker Compose.

# .env
export OPENAI_API_KEY="sk-..."
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_BASE_URL="https://cloud.langfuse.com"  # EU region
# US: https://us.cloud.langfuse.com
# Japan: https://jp.cloud.langfuse.com
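
A missing key usually shows up later as a failed auth check or an empty dashboard, so it can help to fail fast at startup. The sketch below is one way to do that with the standard library only; the helper name `missing_env_vars` is hypothetical and not part of Langfuse or the Agents SDK.

```python
import os

# The variables this tutorial expects to find in the environment
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_BASE_URL",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_env_vars()` before the instrumentation code and exiting when the list is non-empty keeps configuration errors out of your traces.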

Now set up the Python environment:

import os
import asyncio
import nest_asyncio
from agents import Agent, Runner, function_tool, RunContextWrapper, GuardrailFunctionOutput
from agents.guardrail import input_guardrail

# Allow nested event loops (only needed where a loop is already running, e.g. Jupyter)
nest_asyncio.apply()

# --- Langfuse instrumentation ---
from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor
from langfuse import get_client

OpenAIAgentsInstrumentor().instrument()

langfuse = get_client()
if langfuse.auth_check():
    print("✅ Langfuse authenticated")
else:
    print("❌ Langfuse auth failed — check your keys")

The OpenAIAgentsInstrumentor().instrument() call is the entire integration. It registers a tracing processor with the OpenAI Agents SDK that turns the SDK’s own trace events into OpenTelemetry spans, which Langfuse ingests automatically. No decorators, no context managers, no code changes to your agent logic.
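
Because the transport is plain OpenTelemetry, the same project can also be reached by any OTLP exporter. As a rough sketch of what the SDK configures for you: Langfuse documents an OTLP endpoint under `/api/public/otel` that authenticates with HTTP Basic auth built from the public and secret keys. The helper name `langfuse_otlp_settings` is an assumption made for illustration.

```python
import base64


def langfuse_otlp_settings(base_url: str, public_key: str, secret_key: str) -> dict:
    """Build the OTLP endpoint and Basic auth header for a Langfuse project."""
    # Basic auth token is base64("public_key:secret_key")
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        "endpoint": base_url.rstrip("/") + "/api/public/otel",
        "headers": {"Authorization": f"Basic {token}"},
    }
```

You do not need this when using `get_client()`; it is mainly useful for pointing a non-Python service or an OpenTelemetry Collector at the same Langfuse project.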

Build the Agent System

Tool definitions

Each specialist agent gets tools that simulate real API calls:

@function_tool
def lookup_invoice(ctx: RunContextWrapper, invoice_id: str) -> str:
    """Look up an invoice by ID. Returns payment status, amount, and date."""
    # In production: call your billing API
    invoices = {
        "INV-001": "Invoice INV-001: $299.00, Paid, 2026-06-01",
        "INV-002": "Invoice INV-002: $499.00, Overdue, 2026-05-15",
        "INV-003": "Invoice INV-003: $99.00, Pending, 2026-06-10",
    }
    return invoices.get(invoice_id, f"Invoice {invoice_id} not found")

@function_tool
def issue_refund(ctx: RunContextWrapper, invoice_id: str, reason: str) -> str:
    """Issue a refund for a paid invoice."""
    return f"Refund initiated for {invoice_id}. Reason: {reason}. Confirmation: REF-{hash(invoice_id) % 10000:04d}"

@function_tool
def check_api_status(ctx: RunContextWrapper, endpoint: str) -> str:
    """Check the status of an API endpoint."""
    statuses = {
        "/v1/chat": "Operational, latency 120ms p95",
        "/v1/embeddings": "Operational, latency 45ms p95",
        "/v1/files": "Degraded, 503 errors ~3% of requests",
    }
    return statuses.get(endpoint, f"Unknown endpoint: {endpoint}")

@function_tool
def search_docs(ctx: RunContextWrapper, query: str) -> str:
    """Search the documentation for a given query."""
    docs = {
        "rate limit": "Rate limits: 500 RPM for GPT-5.4, 3000 RPM for GPT-5.4-mini. Upgrade to Tier 3 for higher limits.",
        "authentication": "Use Bearer token in Authorization header. Tokens expire after 90 days. Rotate via /v1/auth/rotate.",
    }
    return docs.get(query.lower(), f"No docs found for '{query}'. Try the support portal.")
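
One detail worth knowing when you compare traces across runs: Python salts `str` hashes per process, so a confirmation number derived from `hash(invoice_id)` changes every time the script restarts. A minimal sketch of a repeatable alternative, using a CRC32 checksum from the standard library (the function name `confirmation_code` is hypothetical):

```python
import zlib


def confirmation_code(invoice_id: str) -> str:
    """Derive a repeatable four-digit confirmation code from an invoice ID."""
    # crc32 depends only on the bytes, so the result is stable across processes
    checksum = zlib.crc32(invoice_id.encode("utf-8"))
    return f"REF-{checksum % 10000:04d}"
```

This is only suitable for a simulated tool. A real billing API would return its own reference number.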

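Guardrail: block empty or oversized messages

A simple input guardrail catches empty and oversized messages before the agent processes them. Note that the SDK runs input guardrails only for the first agent in a run, so in the handoff flow below it is the triage agent’s guardrail that fires: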

@input_guardrail
async def safety_check(ctx: RunContextWrapper, agent, input_data):
    if not input_data or not input_data.strip():
        return GuardrailFunctionOutput(
            output_info={"reason": "empty_input"},
            tripwire_triggered=True,
        )
    if len(input_data) > 2000:
        return GuardrailFunctionOutput(
            output_info={"reason": "input_too_long"},
            tripwire_triggered=True,
        )
    return GuardrailFunctionOutput(
        output_info={"reason": "passed"},
        tripwire_triggered=False,
    )
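
To unit test the checks without calling the SDK, one option is to move them into a plain function and have the guardrail wrap it. This is a sketch under the assumption that the input is a plain string; `rejection_reason` and `MAX_INPUT_CHARS` are hypothetical names.

```python
MAX_INPUT_CHARS = 2000  # same limit the guardrail uses


def rejection_reason(text: str, max_chars: int = MAX_INPUT_CHARS):
    """Return why the input should be blocked, or None if it passes."""
    if not text or not text.strip():
        return "empty_input"
    if len(text) > max_chars:
        return "input_too_long"
    return None
```

Inside the guardrail you would then set `tripwire_triggered` to `reason is not None` and put the reason (or "passed") into `output_info`.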

Agent definitions

billing_agent = Agent(
    name="Billing Agent",
    instructions=(
        "You are a billing support specialist. Help customers with invoices, "
        "payments, and refunds. Use lookup_invoice to check invoice status. "
        "Use issue_refund for refund requests on paid invoices. "
        "Always confirm the invoice ID before issuing a refund. "
        "Be concise and professional."
    ),
    tools=[lookup_invoice, issue_refund],
    input_guardrails=[safety_check],
)

tech_agent = Agent(
    name="Technical Agent",
    instructions=(
        "You are a technical support specialist. Help customers with API issues, "
        "rate limits, and integration problems. Use check_api_status to verify "
        "endpoint health. Use search_docs for documentation queries. "
        "If an endpoint is degraded, acknowledge it and suggest a workaround."
    ),
    tools=[check_api_status, search_docs],
    input_guardrails=[safety_check],
)

triage_agent = Agent(
    name="Triage Agent",
    instructions=(
        "You are a support triage specialist. Classify the customer's issue:\n"
        "- If it's about invoices, payments, charges, or refunds → hand off to Billing Agent\n"
        "- If it's about API errors, rate limits, integration, or docs → hand off to Technical Agent\n"
        "- If you're unsure, ask a clarifying question before handing off\n"
        "Be friendly and efficient."
    ),
    handoffs=[billing_agent, tech_agent],
    input_guardrails=[safety_check],
)

Run and Observe

Each interaction creates a full trace in Langfuse. Here’s a test run with four scenarios, including one that trips the guardrail:

from agents import InputGuardrailTripwireTriggered

async def run_support(query: str) -> None:
    """Run a support query and print the result."""
    print(f"Query: {query!r}")
    try:
        result = await Runner.run(triage_agent, query)
    except InputGuardrailTripwireTriggered as exc:
        # A tripped input guardrail raises instead of returning a result
        info = exc.guardrail_result.output.output_info
        print(f"Blocked by guardrail: {info}\n")
        return
    print(f"Final output: {result.final_output}")
    print("Trace shipped to Langfuse. Check your dashboard.\n")

async def main():
    # Scenario 1: Billing inquiry → triage → billing agent
    await run_support("I need to check the status of invoice INV-001")

    # Scenario 2: Technical issue → triage → tech agent
    await run_support("I'm getting 503 errors from the /v1/files endpoint")

    # Scenario 3: Refund request → triage → billing agent with tool chain
    await run_support("I was charged twice for INV-002, can I get a refund?")

    # Scenario 4: Guardrail trip (whitespace-only input)
    await run_support("   ")

    # Ship any spans still buffered before the script exits
    langfuse.flush()

asyncio.run(main())

After running, open your Langfuse dashboard. You’ll see each trace as a tree:

Trace root: The Runner.run() call with the triage agent
Span: triage agent — LLM call with input/output, token count, latency
Span: handoff — from triage to billing/tech agent
Span: specialist agent — LLM call with its own instructions
Spans: tool calls — lookup_invoice, check_api_status, issue_refund — each with input parameters and return values
Spans: guardrail checks — safety_check with pass/fail status

The tree view is the killer feature. When an agent handoff goes wrong — the triage agent sends a billing question to tech support — you see the exact branch where it happened, with the raw LLM output that caused the routing decision.
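
To make the tree shape concrete, here is a toy model of it in plain Python. The span tuples and labels are made up for illustration and are not Langfuse’s actual data format or export schema; the point is only that each span carries a parent reference and the view is a depth-first walk.

```python
from collections import defaultdict

# Hypothetical, simplified spans: (span_id, parent_id, label)
SPANS = [
    ("1", None, "Runner.run"),
    ("2", "1", "guardrail: safety_check"),
    ("3", "1", "agent: Triage Agent"),
    ("4", "3", "handoff: Triage Agent -> Billing Agent"),
    ("5", "1", "agent: Billing Agent"),
    ("6", "5", "tool: lookup_invoice"),
]


def render_tree(spans):
    """Return the spans as indented lines, parents before children."""
    children = defaultdict(list)
    labels = {}
    for span_id, parent_id, label in spans:
        children[parent_id].append(span_id)
        labels[span_id] = label

    lines = []

    def walk(span_id, depth):
        lines.append("  " * depth + labels[span_id])
        for child in children[span_id]:
            walk(child, depth + 1)

    for root in children[None]:
        walk(root, 0)
    return lines
```

Printing `"\n".join(render_tree(SPANS))` gives an indented outline in which a misrouted handoff would appear as a handoff line under the triage agent pointing at the wrong specialist.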

Add Custom Scoring

Traces alone tell you what happened. Scores tell you whether it was good. Langfuse supports numeric, categorical, and boolean scores that you attach to traces:

from langfuse import get_client

langfuse_client = get_client()

# Maps the expected category to the agent that should end up handling it
EXPECTED_AGENT = {"billing": "Billing Agent", "technical": "Technical Agent"}

async def run_with_scoring(query: str, expected_category: str) -> None:
    """Run a query and score whether triage routed it to the right agent."""
    # Wrapping the run in a Langfuse span puts the agent spans and the score
    # on the same trace (Langfuse Python SDK v3 API).
    with langfuse_client.start_as_current_span(name="support-request") as span:
        result = await Runner.run(triage_agent, query)
        actual_agent = result.last_agent.name
        correct = actual_agent == EXPECTED_AGENT[expected_category]

        span.update_trace(
            input=query,
            output=result.final_output,
            metadata={"expected_category": expected_category},
        )

        # Score the trace
        span.score_trace(
            name="handoff_correctness",
            value=1.0 if correct else 0.0,  # 1.0 = correct handoff, 0.0 = wrong
            comment=f"Expected {expected_category}, handled by {actual_agent}",
        )

    print(f"Query: {query}")
    print(f"Output: {result.final_output}\n")

# Replace the main() call above with:
async def main_with_scoring():
    await run_with_scoring("Check invoice INV-001", "billing")
    await run_with_scoring("API is returning 503s", "technical")
    langfuse_client.flush()

# asyncio.run(main_with_scoring())

For production, wire scores into your evaluation pipeline. We cover agent evaluation strategies in depth in our OpenAI Agents SDK deep dive.
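
As a first step toward such a pipeline, individual scores can be rolled up into a per-category accuracy number. The sketch below assumes you have collected (expected category, was the handoff correct) pairs yourself, for example from a test set; `handoff_accuracy` is a hypothetical helper, not a Langfuse API.

```python
from collections import defaultdict


def handoff_accuracy(records):
    """Compute routing accuracy per category.

    records: iterable of (expected_category, was_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, was_correct in records:
        totals[category] += 1
        correct[category] += int(bool(was_correct))
    return {category: correct[category] / totals[category] for category in totals}
```

For example, two billing queries with one misroute and one correctly routed technical query yield 0.5 for billing and 1.0 for technical, which is the kind of number worth tracking before and after a prompt change.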

Production Considerations

Self-hosting vs Cloud

Langfuse Cloud’s Hobby tier (free) covers 50k observations/month with 30-day retention — enough for development and low-traffic staging. The Core plan ($29/month) bumps you to 100k observations and 90-day retention. For production with data sovereignty requirements, self-host with Docker Compose or the Helm chart for Kubernetes. Self-hosted Langfuse has no observation limits — you pay for your own ClickHouse and Postgres instances.
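
To judge which tier fits, a quick back-of-the-envelope calculation helps. The quotas come from the paragraph above; the figure of 8 observations per run is purely an assumption for illustration, so check a few real traces in your dashboard for your own number.

```python
def runs_per_month(observation_quota: int, observations_per_run: int) -> int:
    """How many agent runs fit into a monthly observation quota."""
    if observations_per_run <= 0:
        raise ValueError("observations_per_run must be positive")
    return observation_quota // observations_per_run


# Assumption: one run produces about 8 observations
# (root span, guardrail check, LLM calls, tool calls, handoff).
OBSERVATIONS_PER_RUN = 8

hobby_runs = runs_per_month(50_000, OBSERVATIONS_PER_RUN)   # 6250 runs
core_runs = runs_per_month(100_000, OBSERVATIONS_PER_RUN)   # 12500 runs
```

Under that assumption the Hobby tier covers roughly 200 runs per day, which is why it suits development and staging more than production traffic.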

Performance impact

Langfuse ships traces in a background thread via OpenTelemetry’s batch processor. In our testing on a typical agent run (1 LLM call, 2 tool calls, 1 handoff), the instrumentation adds 5–15ms of wall-clock overhead — negligible compared to the LLM latency you’re already paying. Traces are batched and flushed every 5 seconds by default. If your process exits immediately after a run, call langfuse.flush() to ensure pending spans are shipped.

# At the end of a script or Lambda handler:
from langfuse import get_client
get_client().flush()
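
Overhead numbers depend on your environment, so it is worth measuring them yourself. Here is a minimal timing harness using only the standard library; `median_seconds` and `overhead_ms` are hypothetical helper names, and the two callables you pass in (one running the agent without instrumentation, one with) are up to you.

```python
import statistics
import time


def median_seconds(fn, runs=5):
    """Median wall-clock duration of fn() over several runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def overhead_ms(baseline_fn, instrumented_fn, runs=5):
    """Median extra milliseconds of instrumented_fn relative to baseline_fn."""
    delta = median_seconds(instrumented_fn, runs) - median_seconds(baseline_fn, runs)
    return delta * 1000
```

Because LLM latency varies far more than a few milliseconds between calls, use a generous number of runs, or a stubbed model, before drawing conclusions from the result.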

What Langfuse captures automatically

With the OpenInference instrumentation, you get the following without writing any additional code:

Signal	Captured
LLM calls (model, tokens, latency, cost)	✅
Tool invocations (name, input, output, duration)	✅
Handoffs (source agent, target agent, reason)	✅
Guardrail checks (passed/triggered, output_info)	✅
Agent instructions and model config	✅
Nested sub-agent calls	✅

What you need to add manually: custom scores, session/user IDs for filtering, and metadata tags for your internal taxonomy. All three are one-liners with the Langfuse SDK.
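
As a sketch of the session and user ID part, the helper below forwards filtering attributes to a span’s `update_trace` method. It assumes the Langfuse Python SDK v3, where the object yielded by `start_as_current_span()` exposes `update_trace` with these keyword arguments; the wrapper name `tag_trace` is hypothetical.

```python
def tag_trace(span, *, user_id, session_id, tags=(), metadata=None):
    """Attach filtering attributes to the trace that owns `span`.

    `span` is expected to be the object yielded by Langfuse's
    start_as_current_span() context manager (assumption: SDK v3).
    """
    span.update_trace(
        user_id=user_id,
        session_id=session_id,
        tags=list(tags),
        metadata=metadata or {},
    )
```

A typical call site would open a span with `get_client().start_as_current_span(name="support-request")`, call `tag_trace(span, user_id=..., session_id=...)`, and then run the agent inside the same `with` block so all agent spans land on the tagged trace.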

What’s Next

You now have a working multi-agent system with full observability. The Langfuse dashboard shows you exactly what each agent did, what tools it called, and where handoffs occurred. No more guessing why an agent gave the wrong answer.

From here, the natural next steps:

Add user feedback scoring — attach thumbs-up/thumbs-down signals to traces and track them over time
Set up evaluation datasets — capture interesting traces into datasets, annotate expected outcomes, and run automated evals before deploying agent changes
Wire in prompt management — Langfuse’s prompt management lets you version and A/B test agent instructions without redeploying

If you’re evaluating whether Langfuse is the right observability tool for your stack, we break it down against LangSmith and Arize Phoenix in our three-way comparison — pricing, strengths, and which team profile each one fits best.

The complete code from this tutorial is available as a single Python file. Run it, check your Langfuse dashboard, and you’ll have full trace visibility in under 10 minutes.
