
Guest Post: Building a Self-Improving Agent with Arize Phoenix and DSPy

Using prompt optimization, evaluation, and observability to create an automated improvement process

Vibe coding is taking over the industry. But vibe coding can only take you so far before you’re faced with actual development, or more specifically, debugging. This is especially true when building LLM agents. The unpredictable nature of LLMs has created an entire field dedicated to evaluating LLM agents and their responses. Often, coding an agent takes a tenth of the time it takes to refine it for production.

Normally when faced with this type of problem – one that requires some critical thinking and repeated action – we’d turn to AI.

So let’s do just that.

In this detailed tutorial by John Gilhuly, you’ll learn how to build a database querying agent that gets better over time by analyzing its own performance and optimizing its prompts, with or without human intervention. This example combines the paradigms of agent evaluation, tracing, experimentation, and prompt optimization to create a virtuous cycle of application improvement.

But first, if you’re interested in Agents, LLM Evaluation, Prompt Optimization, or the future of AI development, we recommend joining the annual Arize Observe conference on June 25th. Click here for a special community discount ticket.

The cycle will run as follows:

  1. Generate or manually create an initial set of test cases, with ground truth output labels

  2. Use DSPy to create an optimized prompt based on those test cases

  3. Save that prompt in Phoenix

  4. Run that prompt through an experiment suite of tests

  5. Once a certain level of performance has been achieved, tag the prompt for production

  6. Run the agent in production, capturing tracing on its invocations

  7. Use LLM evals to label traces at scale

  8. (Optionally) Verify these traces with a human labeler

  9. Add the new traces along with their labels to the training dataset.

  10. Return to step 2

Necessary Tools:

  • Arize Phoenix – an open-source LLM app development tool that provides the Prompt Management, Tracing, Evaluation, and Experimentation capabilities used in this tutorial.

  • OpenInference/OpenTelemetry – open-source instrumentation libraries used to capture tracing data and send it to Phoenix.

  • DSPy – an open-source library that enables multiple prompt optimization techniques.

Self-Improving Agent Tutorial

This tutorial will focus on optimizing the function calling step of an agent equipped with tools. We’ve chosen to optimize this step because it’s the most ubiquitous piece of the agents we’ve seen. The same cycle could be applied to any other aspect of an agent, from its tool performance to its path selection.

Agent Setup

To start, you need to create a workable agent. This section walks through how to configure and trace a basic tool calling agent.

For the sake of brevity, we’ll skip over the code defining the tools of your agent. See the completed notebook here for that full implementation.
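
To give a sense of what that code defines (a rough sketch only, with assumed schemas; the real lookup_sales_data, analyze_sales_data, and generate_visualization implementations are in the notebook), the agent needs an OpenAI-style tools list plus a tool_implementations dictionary mapping tool names to Python functions:

def lookup_sales_data(prompt: str) -> str:
    # Placeholder: the notebook version queries the sales dataset with DuckDB
    return "...query results..."

tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_sales_data",
            "description": "Retrieve or filter rows from the sales dataset.",
            "parameters": {
                "type": "object",
                "properties": {"prompt": {"type": "string"}},
                "required": ["prompt"],
            },
        },
    },
    # ... analyze_sales_data and generate_visualization are defined the same way ...
]

# Maps tool names returned by the model to the functions that implement them
tool_implementations = {"lookup_sales_data": lookup_sales_data}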

Environment Setup

Start by installing the required packages:

!pip install uv
!uv pip install -q openai "arize-phoenix>=8.8.0" "arize-phoenix-otel>=0.8.0" openinference-instrumentation-openai python-dotenv duckdb "openinference-instrumentation>=0.1.21" tqdm dspy

Import the necessary libraries:

import dotenv
dotenv.load_dotenv()

import json
import os
from getpass import getpass

import duckdb
import pandas as pd
from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.trace import StatusCode
from pydantic import BaseModel, Field
from tqdm import tqdm

from phoenix.otel import register
from phoenix.client import Client as PhoenixClient

Configure your API keys and set defaults:

if os.getenv("OPENAI_API_KEY") is None:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

client = OpenAI()
model = "gpt-4o-mini"
project_name = "self-improving-agent"

Any model could be used in place of GPT-4o-mini, provided it supports tool calling.

Setting Up Phoenix Tracing

Next you’ll need to connect your application to Phoenix. If you don’t have a Phoenix instance, you can either self-host one or create a free account online. This tutorial uses an online account, but either approach will work.

if os.getenv("PHOENIX_API_KEY") is None:
    os.environ["PHOENIX_API_KEY"] = getpass("Enter your Phoenix API key: ")

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com/"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.getenv('PHOENIX_API_KEY')}"

Then register your tracer:

tracer_provider = register(
    project_name=project_name,
    auto_instrument=True,
)

tracer = tracer_provider.get_tracer(__name__)

The code above will connect your application to your Phoenix instance. The auto_instrument flag scans your environment for any installed OpenInference tracing libraries and activates each one it finds, capturing the calls made to its corresponding framework. Because the openinference-instrumentation-openai library is present, this call will automatically record any calls to OpenAI.

Save your Initial Router Prompt in Phoenix

Next, store your prompt in Phoenix for versioning and traceability. This is helpful for two main reasons. First, Phoenix will automatically create a new version of your prompt when you save a new iteration of it, providing you with a comprehensive log of changes. This is especially useful when these iterations may happen automatically.

Second, Phoenix allows you to tag versions of your prompt as “production”, “staging”, or “development”, either from the dashboard or via code. Since your code can then be connected to a prompt tag, you can easily change which prompt is used by your agent without redeploying code.

import phoenix as px
from phoenix.client.types import PromptVersion
from openai.types.chat.completion_create_params import CompletionCreateParamsBase

params = CompletionCreateParamsBase(
    model="gpt-4o-mini",
    tools=tools,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."},
        {"role": "user", "content": "{user_query}"},
    ],
)

prompt_name = "self-improving-agent-router"
prompt = px.Client().prompts.create(
    name=prompt_name,
    version=PromptVersion.from_openai(params),
)

# Tag your prompt as ready for production
px.Client().prompts.tags.create(
    prompt_version_id=prompt.id,
    name="production",
    description="Ready for production environment"
)

Define your Agent Routing Logic

Now we'll implement the core agent logic that handles tool calls and routing for your agent:

@tracer.chain()
def handle_tool_calls(tool_calls, messages):
    for tool_call in tool_calls:
        function = tool_implementations[tool_call.function.name]
        function_args = json.loads(tool_call.function.arguments)
        result = function(**function_args)

        messages.append({"role": "tool", "content": result, "tool_call_id": tool_call.id})
    return messages

The @tracer.chain() decorator automatically instruments your function, so each call to it appears as a span in Phoenix.

And for the main agent execution loop:

def run_agent(messages):
    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]

    # Check and add system prompt if needed
    if not any(
        isinstance(message, dict) and message.get("role") == "system" for message in messages
    ):
        # Retrieve the production-tagged prompt from Phoenix
        phoenix_production_router_prompt = PhoenixClient().prompts.get(prompt_identifier="self-improving-agent-router", tag="production")
        
        system_prompt = {
            "role": "system",
            "content": phoenix_production_router_prompt,
        }
        messages.insert(0, system_prompt)  # the system prompt should lead the conversation

    while True:
        # Router call instrumentation
        with tracer.start_as_current_span(
            "router_call",
            openinference_span_kind="chain",
        ) as span:
            span.set_input(value=messages)

            # Call the agent router
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                tools=tools,
            )

            messages.append(response.choices[0].message.model_dump())
            tool_calls = response.choices[0].message.tool_calls
            span.set_status(StatusCode.OK)

            # Handle tool calls, or respond to the user
            if tool_calls:
                # Tool calls instrumentation
                messages = handle_tool_calls(tool_calls, messages)
                span.set_output(value=tool_calls)
            else:
                span.set_output(value=response.choices[0].message.content)
                return response.choices[0].message.content

Run Your Agent

Let's test your agent with some example questions:

ret = run_agent([{"role": "user", "content": "Create a line chart showing sales in 2021"}])

Now that you have your agent running and generating traces in Phoenix, you can open the project in the Phoenix UI to see a visualization of each run.

At this point, you now have a working agent, along with telemetry capture in Phoenix. The next few steps will walk through how to set up your experiment suite, prompt optimization, and production evaluations.

Testing your Agent in Development

Before deploying, you need to evaluate your agent's performance. You can do this by using experiments to compare an agent’s function calling selections against a labeled set of data.

import nest_asyncio

import phoenix as px
from phoenix.evals import TOOL_CALLING_PROMPT_TEMPLATE, OpenAIModel, llm_classify
from phoenix.experiments import run_experiment
from phoenix.experiments.types import Example
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery

nest_asyncio.apply()

px_client = px.Client()
eval_model = OpenAIModel(model="gpt-4o-mini")

Set up a Function Calling Experiment

Experiments are made up of 3 components:

  1. A dataset of test cases with known outputs

  2. A task to run across each test case - in this case, your router call

  3. One or more evaluators to apply to the result of those tasks.

First, create a dataset of test cases with known expected outputs:

import uuid
id = str(uuid.uuid4())

# Create a list of tuples with input_messages and next_tool_call
data = [
    (
        [
            {
                "role": "user",
                "content": "Plot daily sales volume over time"
            },
            {
                "role": "system",
                "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."
            },
            {
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call_1",
                        "type": "function",
                        "function": {
                            "name": "lookup_sales_data",
                            "arguments": "{\"prompt\":\"Plot daily sales volume over time\"}"
                        }
                    }
                ]
            },
            {
                "role": "tool",
                "tool_call_id": "call_1",
                "content": "     Sold_Date  Daily_Sales_Volume\n0   2021-11-01              1021.0\n1   2021-11-02              1035.0\n2   2021-11-03               900.0"
            }
        ],
        "analyze_sales_data"
    ),
    # ... more test cases here ...
]

dataframe = pd.DataFrame(data, columns=["input_messages", "next_tool_call"])

dataset = px_client.upload_dataset(
    dataframe=dataframe,
    dataset_name=f"tool_calling_ground_truth_{id}",
    input_keys=["input_messages"],
    output_keys=["next_tool_call"],
)

Then create a task function to run just the router step of your agent:

def run_router_step(example: Example) -> str:
    input_messages = example.input.get("input_messages")

    phoenix_production_router_prompt = PhoenixClient().prompts.get(prompt_identifier="self-improving-agent-router", tag="development")
    
    system_prompt = {
        "role": "system",
        "content": phoenix_production_router_prompt,
    }
    
    # Replace the system message in input_messages with the router prompt
    # retrieved from Phoenix, or add it if no system message exists
    system_message_index = None
    
    for i, message in enumerate(input_messages):
        if message.get("role") == "system":
            system_message_index = i
            break
    
    if system_message_index is not None:
        # Replace existing system message
        input_messages[system_message_index] = system_prompt
    else:
        # Add system message if none exists
        input_messages.insert(0, system_prompt)
    
    response = client.chat.completions.create(
        model=model,
        messages=input_messages,
        tools=tools,
    )
    
    if response.choices[0].message.tool_calls is None:
        return "no tool called"
    
    tool_calls = []
    for tool_call in response.choices[0].message.tool_calls:
        tool_calls.append(tool_call.function.name)
    return tool_calls

And finally an evaluator to compare with expected outputs:

def tools_match(expected: dict, output) -> bool:
    if not isinstance(output, list):
        return False
    
    # Check if all expected tools are in output and no additional tools are present
    expected_tools = expected.get("next_tool_call").split(", ")
    expected_set = set(expected_tools)
    output_set = set(output)
    
    # Return True if the sets are identical (same elements, no extras)
    return expected_set == output_set

Now you can run your experiment:

experiment = run_experiment(
    dataset,
    run_router_step,
    evaluators=[tools_match],
    experiment_name="Tool Calling Eval",
    experiment_description="Evaluating the tool calling step of the agent",
)

Following the supplied link will show you a summary of your experiment in Phoenix.

Optimizing Your Agent with DSPy

This is where the self-improvement magic happens. DSPy lets you optimize a prompt using a set of labeled test cases and a variety of optimization techniques, such as Bootstrap Few-Shot and MIPROv2.

First, use DSPy to create a signature that represents your agent’s router.

import dspy

# Configure DSPy to use OpenAI
dspy_lm = dspy.LM(model="gpt-4o-mini")
dspy.settings.configure(lm=dspy_lm)

# Define the prompt classification task
class RouterPromptSignature(dspy.Signature):
    """Route a user prompt to the correct tool based on the task requirements.
    
    Available tools:
    1. analyze_sales_data: Use for complex analysis of sales data, including trends, patterns, and insights
    2. lookup_sales_data: Use for simple data retrieval or filtering of sales records
    3. generate_visualization: Use when the user needs visual representation of data
    4. no tool called: Use when no tool is needed
    
    The tool selection should be based on:
    - The complexity of the analysis needed
    - Whether raw data or processed insights are required
    - If visualization would help communicate the results
    """

    input_messages = dspy.InputField(desc="The router's input messages. Can include the user's query and any tool calls that have already been made.")
    tool_call = dspy.OutputField(
        desc="A list of tool calls to execute in sequence. Each tool call should include: "
             "1. tool_name: The name of the tool to use "
    )

router = dspy.Predict(RouterPromptSignature)

Test your basic router:

result = router(input_messages=[{"role": "user", "content": "Which stores had the highest sales volume?"}])

Now create a training set from your previous examples. In practice, you might use just a sample of your data here; a small number of examples goes a long way!

trainset = []

for input_messages, next_tool_call in dataframe.values:
    trainset.append(dspy.Example(input_messages=input_messages, tool_call=next_tool_call).with_inputs("input_messages"))

print(trainset[:3])

And finally optimize your router with DSPy:

# Optimize via BootstrapFewShot.
optimizer = dspy.BootstrapFewShot(metric=(lambda x, y, trace=None: x.tool_call == y.tool_call))
optimized = optimizer.compile(router, trainset=trainset)

optimized(input_messages=[{"role": "user", "content": "Which stores had the highest sales volume?"}])

Here is where you could gate any prompt updates on performance. You could rerun your previous experiment and only proceed if performance has improved by a certain amount.
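
For example, a minimal gating sketch might compare the optimized router against the current one before promoting it (the score_router helper, the 5-point threshold, and scoring against the trainset below are illustrative assumptions; ideally you would score against a held-out set or rerun the Phoenix experiment):

def score_router(candidate, examples):
    # Fraction of examples where the candidate predicts the expected tool call
    correct = 0
    for example in examples:
        prediction = candidate(input_messages=example.input_messages)
        if prediction.tool_call == example.tool_call:
            correct += 1
    return correct / len(examples) if examples else 0.0

baseline_accuracy = score_router(router, trainset)
optimized_accuracy = score_router(optimized, trainset)

MIN_IMPROVEMENT = 0.05  # assumed threshold: require at least a 5-point accuracy gain
if optimized_accuracy >= baseline_accuracy + MIN_IMPROVEMENT:
    print("Optimized prompt passes the gate; save and tag it in Phoenix.")
else:
    print("Optimized prompt does not improve enough; keep the current version.")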

Once you’re happy with performance, extract your optimized prompt and save it to Phoenix:

# Get the prompt from the optimized router
new_prompt = optimized.signature.instructions
print(new_prompt)

params = CompletionCreateParamsBase(
    model="gpt-4o-mini",
    tools=tools,
    messages=[
        {"role": "system", "content": new_prompt},
        {"role": "user", "content": "{user_query}"},
    ],
)

# This will update the existing prompt in Phoenix
prompt_name = "self-improving-agent-router"
prompt = px.Client().prompts.create(
    name=prompt_name,
    prompt_description="Router prompt for the self-improving agent",
    version=PromptVersion.from_openai(params),
)

# Create a tag for a prompt version
px.Client().prompts.tags.create(
    prompt_version_id=prompt.id,
    name="production",
    description="Ready for production environment"
)

Because your agent pulls the production-tagged prompt from Phoenix at runtime, your newly saved prompt goes into effect immediately!

However, new issues always arise in production. Let’s now turn to how to incorporate production learnings into your agent’s routing step.

Evaluating Your Agent in Production

To evaluate your agent in production, one approach is to use human labelers. However, this can be time consuming and costly. An alternative is to use an LLM judge. While the judge won’t be 100% accurate, it can provide reliable directional feedback on which traces are correct or incorrect.

LLM Judges can also be combined with human labelers for maximum accuracy and efficiency.
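
One simple way to combine the two (a sketch only; it assumes the evaluation dataframe with label and explanation columns produced by llm_classify later in this section, and the CSV path is hypothetical) is to accept the labels the judge marks correct and queue everything else for human review:

def split_for_review(tool_call_eval):
    # Rows the LLM judge marked correct are accepted automatically
    auto_accepted = tool_call_eval[tool_call_eval["label"] == "correct"]
    # Everything else goes to a human labeler for verification
    needs_review = tool_call_eval[tool_call_eval["label"] != "correct"]
    needs_review.to_csv("human_review_queue.csv")
    return auto_accepted, needs_review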

To create an LLM Judge that evaluates function calling, start by exporting the data you wish to evaluate from Phoenix:

# Extract tool calls from Phoenix
def get_tool_calls():
    query = (
        SpanQuery()
        .where(
            "span_kind == 'LLM'",
        )
        .select(question="input.value", output_messages="llm.output_messages")
    )

    # The Phoenix Client can take this query and return the dataframe.
    tool_calls_df = px.Client().query_spans(query, project_name=project_name, timeout=None)
    tool_calls_df.dropna(subset=["output_messages"], inplace=True)

    def get_tool_call(outputs):
        if outputs[0].get("message").get("tool_calls"):
            return (
                outputs[0]
                .get("message")
                .get("tool_calls")[0]
                .get("tool_call")
                .get("function")
                .get("name")
            )
        else:
            return "No tool used"

    tool_calls_df["tool_call"] = tool_calls_df["output_messages"].apply(get_tool_call)
    tool_definitions_list = [tools] * len(tool_calls_df)
    tool_calls_df["tool_definitions"] = tool_definitions_list
    return tool_calls_df

Then define an LLM as a judge to evaluate the tool calls. Phoenix provides a built-in LLM prompt template that can do this for you:

def eval_tool_calls(dataframe):
    # Phoenix helper function that applies the tool calling template to each row of data
    tool_call_eval = llm_classify(
        data=dataframe,
        template=TOOL_CALLING_PROMPT_TEMPLATE,
        rails=["correct", "incorrect"],
        model=eval_model,
        provide_explanation=True,
    )

    tool_call_eval["score"] = tool_call_eval.apply(
        lambda x: 1 if x["label"] == "correct" else 0, axis=1
    )

    return tool_call_eval, dataframe

def eval_and_log_tool_calls():
    tool_calls_df = get_tool_calls()
    tool_call_eval, dataframe = eval_tool_calls(tool_calls_df)
    # Log evaluations to the Phoenix UI
    px.Client().log_evaluations(
        SpanEvaluations(eval_name="Tool Calling Eval", dataframe=tool_call_eval),
    )
    
    # Merge the evaluation results with the original dataframe on context.span_id
    merged_df = pd.merge(
        tool_call_eval,
        dataframe,
        left_index=True,
        right_index=True,
        how='inner'
    )
    
    # Return the merged dataframe
    return merged_df

Run your evaluation:

tool_call_eval = eval_and_log_tool_calls()

After running this, you should see the evaluation results appear in Phoenix.

Creating the Automated Improvement Loop

Now it’s time to tie everything together into a complete automated improvement loop.

Define a few more helper functions:

def create_trainset(tool_call_eval):
    trainset = []
    for _, row in tool_call_eval.iterrows():
        if row["label"] == "correct":
            trainset.append(dspy.Example(input_messages=row["question"], tool_call=row["tool_call"]).with_inputs("input_messages"))
    return trainset

def save_trainset(trainset):
    trainset_df = pd.DataFrame(trainset)
    px.Client().upload_dataset(
        dataframe=trainset_df,
        dataset_name="self-improving-agent-trainset-{}".format(uuid.uuid4()),
    )

def optimize_router(trainset):
    optimizer = dspy.BootstrapFewShot(metric=(lambda x, y, trace=None: x.tool_call == y.tool_call))
    optimized = optimizer.compile(router, trainset=trainset)
    new_prompt = optimized.signature.instructions
    return new_prompt

# Renamed so it does not shadow phoenix.experiments.run_experiment
def run_router_experiment():
    experiment = run_experiment(
        dataset,
        run_router_step,
        evaluators=[tools_match],
        experiment_name="Tool Calling Eval",
        experiment_description="Evaluating the tool calling step of the agent",
    )
    return experiment

def save_prompt(new_prompt):
    params = CompletionCreateParamsBase(
        model="gpt-4o-mini",
        tools=tools,
        messages=[
            {"role": "system", "content": new_prompt},
            {"role": "user", "content": "{user_query}"},
        ],
    )

    # This will update the existing prompt in Phoenix
    prompt_name = "self-improving-agent-router"
    prompt = px.Client().prompts.create(
        name=prompt_name,
        prompt_description="Router prompt for the self-improving agent",
        version=PromptVersion.from_openai(params),
    )
    
    px.Client().prompts.tags.create(
        prompt_version_id=prompt.id,
        name="production",
        description="Ready for production environment"
    )

And finally, the automated loop itself:

def automated_loop():
    # Step 1: Evaluate production performance
    tool_call_eval = eval_and_log_tool_calls()
    
    # Step 2: Create training set from successful examples
    trainset = create_trainset(tool_call_eval)
    save_trainset(trainset)
    
    # Step 3: Optimize router prompt using DSPy
    new_prompt = optimize_router(trainset)
    
    # Step 4: Run experiment to benchmark new prompt
    experiment_results = run_router_experiment()
    print(experiment_results.eval_summaries())
    
    # Step 5: Ask user if they want to apply the new prompt
    apply_prompt = input("Do you want to apply the new prompt? (yes/no): ")
    
    if apply_prompt.lower() not in ["yes", "y"]:
        print("Prompt update cancelled.")
        return
    
    print("Applying new prompt...")
    save_prompt(new_prompt)

Just like that, you now have a function that will run evaluations, export correct runs, add them to a training set, create an optimized prompt on that training set, and deploy the prompt into production.

You can set this function to run on a particular cadence, or trigger it and view the results yourself.
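
For example, a bare-bones scheduler might look like the sketch below (the daily cadence is an arbitrary assumption; in practice you would more likely use cron, Airflow, or another orchestrator, and remove the interactive input() gate for unattended runs):

import time

# Minimal sketch: run the improvement loop once a day
while True:
    automated_loop()
    time.sleep(24 * 60 * 60)  # assumed cadence: once every 24 hours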

Wrapping Up

You’ve built a self-improving agent that:

  1. Monitors its own performance using Phoenix tracing

  2. Evaluates its decisions with ground truth and LLM-as-judge approaches

  3. Automatically optimizes its prompts using DSPy

  4. Maintains version control of prompts in Phoenix

  5. Creates a continuous improvement loop that gets better with more usage

This approach can be applied to any agent system. By combining observability and optimization, you create a feedback loop that turns every user interaction into an opportunity for improvement.

You could expand and customize this approach in countless ways, from modifying the optimization technique used by DSPy, to optimizing different aspects of your agent, to changing the flow of applying prompt iterations in production.

I hope this walkthrough has given you a glimpse of how this cycle can be applied to improve nearly any agent.

For more on Agents or LLM Evaluation, check out Arize’s website.

Happy building!
