Evaluation scenarios, quality signals, and acceptance criteria for AI agents. Browse by agent type or evaluation concern, adapt the examples to your use case, and build a comprehensive eval set.
This library gives you a head start on agent evaluation. Instead of designing test cases from scratch, you select from proven scenarios that cover business quality, architecture, compliance, and safety — then adapt them to your agent.
Use it to:
- Identify which quality dimensions apply to your agent
- Select evaluation scenarios that match your business needs and architecture
- Build test cases using the examples and patterns each scenario provides
- Validate that your eval set is comprehensive — beyond happy-path testing
The library organizes scenarios into two complementary categories:
- Business-problem scenarios — what your agent does for users (e.g., "employee asks about PTO policy," "customer troubleshoots a billing issue"). These capture the outcomes your stakeholders care about.
- Capability scenarios — how your agent's architecture behaves (e.g., "verify tool invocations," "validate knowledge grounding"). These confirm that each component works correctly, stays safe, and communicates clearly.
You need both. Business-problem scenarios verify that your agent solves the right problem. Capability scenarios verify that the underlying components work correctly. An agent can return the right answer from the wrong source, or call the right tool with the wrong parameters — only capability testing catches that.
Start with one of the two entry paths below:
- Entry Path A — "I have an agent that does X — what should I evaluate?"
- Entry Path B — "I have a specific evaluation concern — where do I go?"
Open the linked scenario files and read the "When to Use" section to confirm relevance. Most agents need 3–5 business-problem scenarios and 3–5 capability scenarios.
Each scenario provides everything you need to create test cases:
- Recommended Test Methods — which evaluation methods to use, why, and how to combine them
- Setup Steps — step-by-step instructions for creating test cases
- Evaluation Patterns — sub-patterns covering different angles of the scenario
- Practical Examples — sample test cases you can adapt directly
- Tips — coverage targets, thresholds, and best practices
Adapt the examples to your agent's specific knowledge sources, tools, and user base. Use multiple test methods per test case wherever relevant — each method catches a different failure mode, and Copilot Studio supports combining them. For detailed guidance on choosing methods, see the Evaluation Method Selection Guide.
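For instance, one test case might pair a verbatim check with a grounding check and an LLM judge. The sketch below is a hypothetical eval-set entry; the field names (`methods`, `must_contain`, `grounding_sources`) and all values are illustrative, not the schema used by `eval-set-template.md` or Copilot Studio:

```yaml
# Hypothetical eval-set entry combining multiple test methods.
# Field names and values are illustrative, not a prescribed schema.
- id: pto-policy-001
  scenario: information-retrieval-and-policy-qa
  input: "How many PTO days do new employees get per year?"
  methods:
    - exact_match       # the policy figure must appear verbatim
    - grounding_check   # the answer must cite the PTO policy document
    - llm_judge         # tone and completeness scored by a judge model
  expected:
    must_contain: "15 days"          # placeholder - use your policy's actual figure
    grounding_sources:
      - "HR/PTO-Policy-2024.docx"    # placeholder path
```

Each method in the entry catches a different failure mode: the exact match catches a wrong number, the grounding check catches a right answer from the wrong source, and the judge catches a correct but unhelpful response.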
After building your initial eval set, revisit the routing tables for missed dimensions. Common gaps:
- Testing only happy paths — add edge cases from Safety & Boundary Enforcement and Graceful Failure & Escalation
- Skipping compliance testing — see Compliance & Verbatim Content
- No regression baseline — see Regression Testing
"I have an agent that does X — what should I evaluate?"
| My agent... | Start with these scenarios |
|---|---|
| Answers questions using knowledge sources (docs, SharePoint, FAQ) | Information Retrieval & Policy Q&A + Knowledge Grounding + Compliance |
| Executes tasks via Power Automate, APIs, or connectors | Request Submission & Task Execution + Tool Invocations + Safety |
| Walks users through diagnostic or troubleshooting steps | Troubleshooting & Guided Diagnosis + Knowledge Grounding + Graceful Failure |
| Guides users through multi-step processes | Process Navigation & Multi-Step Guidance + Trigger Routing + Tone & Quality |
| Routes conversations across multiple topics | Triage & Routing + Trigger Routing + Graceful Failure |
| Serves external customers (not just internal employees) | Tone & Response Quality + Safety & Boundary + Compliance |
| Handles sensitive data (PII, financial, health) | Safety & Boundary + Compliance |
| Is about to be updated or republished | Regression Testing + re-run all previously passing scenarios |
| Needs adversarial/red-team safety testing | Red-Teaming & Adversarial Evaluation + Safety & Boundary |
Tip: Most agents match multiple rows. An agent that answers HR questions from SharePoint AND submits PTO requests via Power Automate would combine rows 1 and 2.
"I have a specific evaluation concern — where do I go?"
| I want to... | Go to |
|---|---|
| Test whether my agent answers business questions correctly | Information Retrieval & Policy Q&A |
| Verify my agent handles troubleshooting workflows | Troubleshooting & Guided Diagnosis |
| Test request submission and task execution | Request Submission & Task Execution |
| Evaluate multi-step process guidance | Process Navigation & Multi-Step Guidance |
| Check that my agent triages and routes correctly | Triage & Routing |
| Confirm my agent doesn't hallucinate or return ungrounded answers | Knowledge Grounding & Accuracy |
| Check that the right Power Automate flow, connector, or API fires | Tool & Connector Invocations |
| Verify my topic triggers route correctly | Trigger Routing |
| Confirm a legal disclaimer or policy appears word-for-word | Compliance & Verbatim Content |
| Test whether my agent handles adversarial or out-of-scope inputs safely | Safety & Boundary Enforcement |
| Evaluate tone, empathy, and response quality | Tone, Helpfulness & Response Quality |
| Confirm my agent escalates or declines appropriately when stuck | Graceful Failure & Escalation |
| Ensure nothing broke before I publish an update | Regression Testing |
| Make sure my agent tailors answers to user-specific context | Information Retrieval & Policy Q&A (personalization scenarios) |
| Measure my agent's attack success rate (ASR) and safety posture | Red-Teaming & Adversarial Evaluation (ASR baseline) |
| Test resistance to multi-turn manipulation and crescendo attacks | Red-Teaming & Adversarial Evaluation (multi-turn attacks) |
| Verify my agent resists indirect prompt injection through tool outputs | Red-Teaming & Adversarial Evaluation (XPIA) |
| Test encoding and obfuscation bypass resistance | Red-Teaming & Adversarial Evaluation (encoding attacks) |
| Automate adversarial testing in my CI/CD pipeline | Red-Teaming & Adversarial Evaluation (CI/CD integration) |
Every scenario follows a consistent structure:
| Section | What It Provides |
|---|---|
| When to Use | When this scenario applies, from the customer's perspective |
| Recommended Test Methods | Which evaluation methods to use and why |
| Setup Steps | Step-by-step instructions for creating test cases |
| Anti-Pattern | The most common mistake to avoid |
| Evaluation Patterns | Named sub-patterns covering different angles |
| Practical Examples | Concrete sample test cases in a table |
| Tips | Coverage targets, thresholds, and best practices |
```
ai-agent-eval-scenario-library/
│
├── README.md                              ← You are here
│
├── business-problem-scenarios/            ← Scenarios grounded in business value
│   ├── README.md
│   ├── information-retrieval-and-policy-qa.md
│   ├── troubleshooting-and-guided-diagnosis.md
│   ├── request-submission-and-task-execution.md
│   ├── process-navigation-and-multistep-guidance.md
│   └── triage-and-routing.md
│
├── capability-scenarios/                  ← Scenarios grounded in agent infrastructure
│   ├── README.md
│   ├── knowledge-grounding-and-accuracy.md
│   ├── tool-and-connector-invocations.md
│   ├── trigger-routing.md
│   ├── compliance-and-verbatim-content.md
│   ├── safety-and-boundary-enforcement.md
│   ├── tone-helpfulness-and-response-quality.md
│   ├── graceful-failure-and-escalation.md
│   └── regression-testing.md
│
└── resources/
    ├── scenario-index.csv                 ← Flat index of all scenarios (filterable)
    ├── eval-set-template.md               ← Template for building eval sets
    ├── agent-profile-template.yaml        ← Structured snapshot of your agent's config
    ├── agent-profile-guide.md             ← How to extract the profile from a solution export
    ├── eval-generation-prompt.md          ← Reusable prompt for LLM-based eval set generation
    └── evaluation-method-selection-guide.md ← How to choose the right evaluation methods
```
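To give a flavor of the agent-profile approach, a minimal profile snapshot might capture fields like the sketch below. All field names and values here are hypothetical; follow `agent-profile-template.yaml` and `agent-profile-guide.md` for the actual structure and extraction steps:

```yaml
# Illustrative agent profile - field names are hypothetical,
# not the schema defined in agent-profile-template.yaml.
agent:
  name: HR Q&A Agent
  audience: internal employees
knowledge_sources:
  - type: sharepoint
    path: "HR Policies"          # placeholder site/library name
tools:
  - type: power_automate_flow
    name: SubmitPTORequest       # placeholder flow name
topics:
  - pto-policy
  - onboarding
```

A structured snapshot like this makes it easy to check coverage: every knowledge source, tool, and topic listed in the profile should map to at least one scenario in your eval set.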
Example: HR Q&A Agent
```
BUSINESS-PROBLEM SCENARIOS:
├── "Employee asks about PTO policy"        (Information Retrieval)
├── "Employee submits PTO request"          (Request Submission)
└── "New hire asks about onboarding steps"  (Process Navigation)

CAPABILITY SCENARIOS:
├── Knowledge Grounding — are the right policy docs retrieved?
├── Tool Invocations    — does the PTO submission flow execute correctly?
├── Compliance          — is legally required language included?
├── Safety              — does the agent protect employee PII?
└── Tone                — is the agent empathetic for sensitive HR topics?

RESULT: Comprehensive eval set covering business quality + infrastructure +
        compliance + safety + communication quality.
```
Business-problem scenarios test what the agent delivers. Capability scenarios test how it delivers it. Together they ensure nothing falls through the cracks.
This library is a living resource. If you have scenario suggestions, corrections, or feedback, please open an issue or submit a pull request.
This project is licensed under the MIT License — see LICENSE for details.