Evaluate AI agents

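Evaluation is essential for confirming that your agent meets quality and safety standards before deployment. By running evaluations during development, you establish a baseline for the agent's performance and can set acceptance thresholds, such as an 85% task adherence pass rate, before you release the agent to users.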

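This article shows how to run agent-targeted evaluations against a Foundry agent or hosted agent by using built-in evaluators for quality, safety, and agent behavior. Specifically, you:

Set up the SDK client for evaluation.
Choose evaluators for quality, safety, and agent behavior.
Create a test dataset and run an evaluation.
Interpret the results and integrate them into your workflow.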

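Tip

For general-purpose evaluation of generative AI models and applications, including custom evaluators, different data sources, and additional SDK options, see Run evaluations from the SDK.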

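Prerequisites

Python 3.8 or later.
A Foundry project with an agent or hosted agent.
An Azure OpenAI deployment of a GPT model that supports chat completion (for example, gpt-4o or gpt-4o-mini).
The Azure AI User role on the Foundry project.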

Note

Some evaluation features have regional restrictions. For details, see supported regions.

Set up the client

Install the Foundry SDK and set up authentication:

pip install "azure-ai-projects>=2.0.0"

Create the project client. The code samples that follow assume that they run in this context:

import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

endpoint = os.environ["AZURE_AI_PROJECT_ENDPOINT"]
model_deployment = os.environ["AZURE_AI_MODEL_DEPLOYMENT_NAME"]

credential = DefaultAzureCredential()
project_client = AIProjectClient(endpoint=endpoint, credential=credential)
client = project_client.get_openai_client()
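
The client setup above reads two environment variables and raises a bare KeyError if either is missing. As a minimal sketch, a fail-fast check like the following can report every missing variable at once. The helper names missing_env_vars and require_env_vars are hypothetical and are not part of the Foundry SDK.

```python
import os

# Variable names taken from the client setup sample in this article.
REQUIRED_VARS = ("AZURE_AI_PROJECT_ENDPOINT", "AZURE_AI_MODEL_DEPLOYMENT_NAME")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def require_env_vars(required=REQUIRED_VARS, environ=None):
    """Raise one error that lists every missing variable."""
    missing = missing_env_vars(required, environ)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling require_env_vars() before you construct AIProjectClient surfaces configuration problems before any network call is made.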

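Choose evaluators

Evaluators are functions that assess your agent's responses. Some evaluators use an AI model as a judge, while others use rules or algorithms. For agent evaluation, consider the following set: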

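Evaluator	What it measures
Task Adherence	Does the agent follow its system instructions?
Coherence	Is the response logical and well structured?
Violence	Does the response contain violent content?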

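For more built-in evaluators, see:

Agent evaluators - Assess how effectively the agent handles tasks, tools, and user intent.
Quality evaluators - Measure the overall quality of generated responses.
Text similarity evaluators - Compare generated text with reference answers by using NLP metrics.
Safety evaluators - Identify potential content and safety risks in generated output.

To build your own evaluator, see Custom evaluators.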

Create a test dataset

Create a JSONL file that contains test queries for your agent. Each line contains one JSON object with a query field:

{"query": "What's the weather in Seattle?"}
{"query": "Book a flight to Paris"}
{"query": "Tell me a joke"}
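
If you generate the queries programmatically, a small helper can write the file and check it before upload. The following is a minimal sketch that uses only the Python standard library. The function names are hypothetical, and the check covers only the query field described above.

```python
import json
from pathlib import Path


def write_queries_jsonl(queries, path):
    """Write one {"query": ...} JSON object per line and return the path."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for query in queries:
            f.write(json.dumps({"query": query}, ensure_ascii=False) + "\n")
    return path


def validate_queries_jsonl(path):
    """Return the number of rows; raise ValueError on the first malformed row."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)  # JSONDecodeError is a ValueError subclass
            query = row.get("query") if isinstance(row, dict) else None
            if not isinstance(query, str) or not query.strip():
                raise ValueError(
                    f"Line {line_number}: expected a JSON object with a non-empty string 'query' field"
                )
            count += 1
    return count
```

For example, write_queries_jsonl(["What's the weather in Seattle?", "Book a flight to Paris"], "./test-queries.jsonl") produces a file in the format shown above.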

Upload this file to your project as a dataset:

dataset = project_client.datasets.upload_file(
    name="agent-test-queries",
    version="1",
    file_path="./test-queries.jsonl",
)

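Run the evaluation

When you run an evaluation, the service sends each test query to the agent, captures the response, and applies the selected evaluators to score the result.

First, configure the evaluators. Each evaluator needs a data mapping that tells it where to find its inputs:

{{item.X}} refers to a field in your test data, such as query.
{{sample.output_items}} refers to the full agent response, including tool calls.
{{sample.output_text}} refers to the response message text only.

AI-assisted evaluators, such as Task Adherence and Coherence, require a model deployment name in initialization_parameters. The value must match the name of a GPT deployment in your project. That deployment is the judge model used to score responses. Some evaluators might require additional fields, such as ground_truth or tool definitions. For details, see the evaluator documentation.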

testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "Task Adherence",
        "evaluator_name": "builtin.task_adherence",
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{sample.output_items}}",
        },
        "initialization_parameters": {"deployment_name": model_deployment},
    },
    {
        "type": "azure_ai_evaluator",
        "name": "Coherence",
        "evaluator_name": "builtin.coherence",
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{sample.output_text}}",
        },
        "initialization_parameters": {"deployment_name": model_deployment},
    },
    {
        "type": "azure_ai_evaluator",
        "name": "Violence",
        "evaluator_name": "builtin.violence",
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{sample.output_text}}",
        },
    },
]
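
The three entries above differ only in name, evaluator, response mapping, and whether a judge model is needed. As a sketch, a small factory function can remove the repetition. make_criterion is a hypothetical helper, not an SDK function, and the deployment name in the example is a placeholder.

```python
def make_criterion(name, evaluator_name, response_field, deployment_name=None):
    """Build one testing-criteria entry in the shape used in this article.

    response_field is "output_items" (full response, including tool calls)
    or "output_text" (message text only).
    """
    criterion = {
        "type": "azure_ai_evaluator",
        "name": name,
        "evaluator_name": evaluator_name,
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{sample." + response_field + "}}",
        },
    }
    # Only AI-assisted evaluators need a judge model deployment.
    if deployment_name is not None:
        criterion["initialization_parameters"] = {"deployment_name": deployment_name}
    return criterion


judge = "gpt-4o-mini"  # placeholder; use your own deployment name
criteria = [
    make_criterion("Task Adherence", "builtin.task_adherence", "output_items", judge),
    make_criterion("Coherence", "builtin.coherence", "output_text", judge),
    make_criterion("Violence", "builtin.violence", "output_text"),
]
```

The resulting criteria list has the same content as the testing_criteria literal above.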

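Next, create the evaluation. An evaluation defines the test data schema and the testing criteria, and it acts as a container for multiple runs. All runs under the same evaluation conform to the same schema and produce the same set of metrics. This consistency matters when you compare results across runs.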

data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
        },
        "required": ["query"],
    },
    "include_sample_schema": True,
}

evaluation = client.evals.create(
    name="Agent Quality Evaluation",
    data_source_config=data_source_config,
    testing_criteria=testing_criteria,
)
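
Because every {{item.X}} reference in a data mapping has to name a field in item_schema, it can help to check the two against each other locally before you create the evaluation. The following sketch is an assumption about a useful pre-flight check, not something the service requires. The helper name and the "Example" criterion with a ground_truth mapping are hypothetical.

```python
import re

# Matches {{item.field}} references inside a data-mapping template string.
ITEM_REF = re.compile(r"\{\{\s*item\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def unknown_item_fields(testing_criteria, data_source_config):
    """Return (criterion name, mapping key, field) for item fields missing from the schema."""
    known = set(data_source_config["item_schema"]["properties"])
    problems = []
    for criterion in testing_criteria:
        for key, template in criterion.get("data_mapping", {}).items():
            for field in ITEM_REF.findall(template):
                if field not in known:
                    problems.append((criterion["name"], key, field))
    return problems


schema = {"item_schema": {"type": "object", "properties": {"query": {"type": "string"}}}}
example_criteria = [
    {"name": "Coherence", "data_mapping": {"query": "{{item.query}}", "response": "{{sample.output_text}}"}},
    {"name": "Example", "data_mapping": {"ground_truth": "{{item.ground_truth}}"}},
]
print(unknown_item_fields(example_criteria, schema))
```

Here the second criterion is reported because ground_truth isn't declared in the schema, while {{sample.*}} references are ignored because they come from the agent response rather than the dataset.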

Finally, create a run that sends the test queries to the agent and applies the evaluators:

eval_run = client.evals.runs.create(
    eval_id=evaluation.id,
    name="Agent Evaluation Run",
    data_source={
        "type": "azure_ai_target_completions",
        "source": {
            "type": "file_id",
            "id": dataset.id,
        },
        "input_messages": {
            "type": "template",
            "template": [{"type": "message", "role": "user", "content": {"type": "input_text", "text": "{{item.query}}"}}],
        },
        "target": {
            "type": "azure_ai_agent",
            "name": "my-agent",  # Replace with your agent name
            "version": "1",  # Optional; omit to use latest version
        },
    },
)

print(f"Evaluation run started: {eval_run.id}")
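
If you evaluate several agents or versions, you might build the data_source payload with a function so that the optional version field is left out when you want the latest version. This is a sketch only. build_agent_data_source is a hypothetical helper that returns the same dictionary shape as the sample above.

```python
def build_agent_data_source(dataset_id, agent_name, agent_version=None):
    """Return a data_source dictionary for an agent-targeted evaluation run."""
    target = {"type": "azure_ai_agent", "name": agent_name}
    # Omit "version" entirely to target the latest agent version.
    if agent_version is not None:
        target["version"] = agent_version
    return {
        "type": "azure_ai_target_completions",
        "source": {"type": "file_id", "id": dataset_id},
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "user",
                    "content": {"type": "input_text", "text": "{{item.query}}"},
                }
            ],
        },
        "target": target,
    }
```

You would then pass data_source=build_agent_data_source(dataset.id, "my-agent") to client.evals.runs.create.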

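Tip

This example works for prompt agents and for hosted agents that use the responses protocol. For hosted agents that use the invocations protocol, the input_messages format is different: you provide a free-form JSON object instead of a structured template. For details and code samples, see the hosted agent invocations protocol section in the cloud evaluation guide.

Tip

To evaluate agent interactions that have already occurred by using traces from Application Insights, see trace evaluation in the cloud evaluation guide.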

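Interpret results

Evaluations typically complete in a few minutes, depending on the number of queries. Poll for completion and retrieve the report URL to view the results in the Microsoft Foundry portal under the Evaluations tab: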

import time

# Poll until the run reaches a terminal state
while True:
    run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=evaluation.id)
    if run.status in ["completed", "failed", "canceled"]:
        break
    time.sleep(5)

print(f"Status: {run.status}")
print(f"Report URL: {run.report_url}")
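
A polling loop without a deadline can hang a script or pipeline if a run never finishes. The following sketch wraps the same polling idea in a helper with a timeout and a growing delay between checks. wait_for_run is a hypothetical helper, the timing defaults are arbitrary, and treating "canceled" as a terminal status is an assumption.

```python
import time

# Assumed terminal statuses; "canceled" is included as an assumption.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_run(fetch_run, timeout_s=900.0, initial_delay_s=2.0,
                 max_delay_s=30.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_run() until the returned object has a terminal status.

    fetch_run is any zero-argument callable that returns an object with a
    `status` attribute. Raises TimeoutError after timeout_s seconds.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        run = fetch_run()
        if run.status in TERMINAL_STATUSES:
            return run
        if clock() >= deadline:
            raise TimeoutError(f"Run still '{run.status}' after {timeout_s} seconds")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off, up to max_delay_s
```

With the objects from this article, the call would look like wait_for_run(lambda: client.evals.runs.retrieve(run_id=eval_run.id, eval_id=evaluation.id)).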

Screenshot that shows the evaluation results for an agent in the Microsoft Foundry portal.

Aggregated results

At the run level, you can view aggregated data, including pass and fail counts, token usage per model, and results per evaluator:

{
    "result_counts": {
        "total": 3,
        "passed": 1,
        "failed": 2,
        "errored": 0
    },
    "per_model_usage": [
        {
            "model_name": "gpt-4o-mini-2024-07-18",
            "invocation_count": 6,
            "total_tokens": 9285,
            "prompt_tokens": 8326,
            "completion_tokens": 959
        },
        ...
    ],
    "per_testing_criteria_results": [
        {
            "testing_criteria": "Task Adherence",
            "passed": 1,
            "failed": 2
        },
        ... // remaining testing criteria
    ]
}
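
To turn the per-evaluator counts into pass rates, a short function is enough. This sketch assumes you have the aggregate results as a plain dictionary in the shape shown above (for example, parsed from JSON). pass_rates is a hypothetical helper name.

```python
def pass_rates(per_testing_criteria_results):
    """Map each testing criterion to passed / (passed + failed).

    A criterion with no scored rows maps to None instead of dividing by zero.
    """
    rates = {}
    for entry in per_testing_criteria_results:
        scored = entry["passed"] + entry["failed"]
        rates[entry["testing_criteria"]] = entry["passed"] / scored if scored else None
    return rates


# Counts copied from the sample output above.
summary = {
    "per_testing_criteria_results": [
        {"testing_criteria": "Task Adherence", "passed": 1, "failed": 2},
    ]
}
rates = pass_rates(summary["per_testing_criteria_results"])
for name, rate in rates.items():
    print(name, "n/a" if rate is None else f"{rate:.0%}")
```

For the sample data, one of three Task Adherence rows passed, so the rate is about one third.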

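Row-level output

Each evaluation run returns an output item for every row in the test dataset, which gives you detailed insight into the agent's performance. An output item includes the original query, the agent response, the individual evaluator results with scores and reasoning, and token usage: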

{
    "object": "eval.run.output_item",
    "id": "1",
    "run_id": "evalrun_abc123",
    "eval_id": "eval_xyz789",
    "status": "completed",
    "datasource_item": {
        "query": "What's the weather in Seattle?",
        "response_id": "resp_abc123",
        "agent_name": "my-agent",
        "agent_version": "10",
        "sample.output_text": "I'd be happy to help with the weather! However, I need to check the current conditions. Let me look that up for you.",
        "sample.output_items": [
            ... // agent response messages with tool calls
        ]
    },
    "results": [
        {
            "type": "azure_ai_evaluator",
            "name": "Task Adherence",
            "metric": "task_adherence",
            "label": "pass",
            "reason": "Agent followed system instructions correctly",
            "threshold": 3,
            "passed": true,
            "sample":
            {
               ... // evaluator input/output and token usage
            }
        },
        ... // remaining evaluation results
    ]
}
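
When you review row-level output, the failing rows and their reasons are usually the most useful part. The following sketch filters them out. It assumes the output items are available as plain dictionaries in the shape shown above. failed_results is a hypothetical helper, and the sample row is made up for illustration.

```python
def failed_results(output_items):
    """Return one record per evaluator result whose `passed` field is False."""
    failures = []
    for item in output_items:
        query = item.get("datasource_item", {}).get("query")
        for result in item.get("results", []):
            if result.get("passed") is False:
                failures.append({
                    "query": query,
                    "evaluator": result.get("name"),
                    "reason": result.get("reason"),
                })
    return failures


# Illustrative rows only; not real evaluation output.
items = [
    {
        "datasource_item": {"query": "Book a flight to Paris"},
        "results": [
            {"name": "Task Adherence", "passed": False, "reason": "Example failure reason"},
            {"name": "Coherence", "passed": True, "reason": "Example pass reason"},
        ],
    },
]
for failure in failed_results(items):
    print(f"{failure['evaluator']}: {failure['query']} ({failure['reason']})")
```

Grouping these records by evaluator or by query is a simple next step when you look for recurring failure patterns.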

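Integrate into your workflow

CI/CD pipeline: Use evaluation as a quality gate in your deployment pipeline. For detailed integration steps, see Run evaluations with GitHub Actions.
Production monitoring: Monitor agents in production by using continuous evaluation. For setup instructions, see Set up continuous evaluation.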
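
As an illustration of a quality gate, a pipeline step can convert the run's result counts into a process exit code. The following is a sketch under stated assumptions: quality_gate is a hypothetical helper, the counts are passed in as a plain dictionary shaped like result_counts above, and the 0.85 default mirrors the 85% example threshold from the introduction. Treating errored rows as a failure is a design choice, not a service rule.

```python
def quality_gate(result_counts, min_pass_rate=0.85):
    """Return (exit_code, message); exit_code 0 means the gate passed."""
    total = result_counts["total"]
    if total == 0:
        return 1, "No rows were evaluated"
    if result_counts.get("errored", 0) > 0:
        # Errored rows have no score, so the pass rate would be misleading.
        return 1, f"{result_counts['errored']} row(s) errored"
    pass_rate = result_counts["passed"] / total
    if pass_rate < min_pass_rate:
        return 1, f"Pass rate {pass_rate:.0%} is below {min_pass_rate:.0%}"
    return 0, f"Pass rate {pass_rate:.0%} meets {min_pass_rate:.0%}"


code, message = quality_gate({"total": 3, "passed": 1, "failed": 2, "errored": 0})
print(code, message)
```

In a pipeline script, you would pass the returned code to sys.exit so that a failed gate stops the deployment.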

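Optimize and compare versions

Use evaluations to iterate on and improve your agent:

Run an evaluation to identify weak areas. Use cluster analysis to find patterns and errors.
Adjust the agent's instructions or tools based on the findings.
Re-evaluate and compare runs to measure the improvement.
Repeat until you meet your quality thresholds.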
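
The compare step in the loop above can be as simple as a per-evaluator difference in pass rate between two runs. The following sketch assumes you have already reduced each run to a dictionary that maps evaluator name to pass rate. compare_runs is a hypothetical helper, and the numbers are invented for illustration.

```python
def compare_runs(baseline, candidate):
    """Return (evaluator, before, after, delta) rows sorted by evaluator name.

    delta is None when an evaluator is missing from either run.
    """
    rows = []
    for name in sorted(set(baseline) | set(candidate)):
        before = baseline.get(name)
        after = candidate.get(name)
        delta = None if before is None or after is None else after - before
        rows.append((name, before, after, delta))
    return rows


# Invented pass rates for two runs of the same evaluation.
baseline = {"Task Adherence": 0.5, "Coherence": 1.0}
candidate = {"Task Adherence": 0.75, "Coherence": 1.0}
for name, before, after, delta in compare_runs(baseline, candidate):
    change = "n/a" if delta is None else f"{delta:+.2f}"
    print(f"{name}: {before} -> {after} ({change})")
```

Because all runs under one evaluation produce the same set of metrics, a comparison like this is only meaningful between runs that belong to the same evaluation.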

Last updated on 2026-05-08