This example demonstrates how to extend the agent-control Evaluator base class to create custom evaluators using external libraries like DeepEval.
DeepEval's GEval is an LLM-as-a-judge metric that uses chain-of-thought (CoT) reasoning to evaluate LLM outputs against custom criteria. This example shows how to:
- Extend the base Evaluator class - Create a custom evaluator by implementing the required interface
- Configure evaluation criteria - Define custom quality metrics (coherence, relevance, correctness, etc.)
- Register via entry points - Make the evaluator discoverable by the agent-control server
- Integrate with agent-control - Use the evaluator in controls to enforce quality standards
examples/deepeval/
├── __init__.py # Package initialization
├── config.py # DeepEvalEvaluatorConfig - Configuration model
├── evaluator.py # DeepEvalEvaluator - Main evaluator implementation
├── qa_agent.py # Q&A agent with DeepEval controls
├── setup_controls.py # Setup script to create controls on server
├── start_server_with_evaluator.sh # Helper script to start server with evaluator
├── pyproject.toml # Project config with entry point and dependencies
└── README.md # This file
Package Structure Notes:
- Uses a flat layout with Python files at the root (configured via packages = ["."] in pyproject.toml)
- Modules use absolute imports (e.g., from config import X) rather than relative imports
- Entry point evaluator:DeepEvalEvaluator references the module directly
- Install with uv pip install -e . to register the entry point for server discovery
- DeepEvalEvaluatorConfig (config.py)
  - Pydantic model defining configuration options
  - Based on DeepEval's GEval API parameters
  - Validates that either criteria or evaluation_steps is provided
- DeepEvalEvaluator (evaluator.py)
  - Extends Evaluator[DeepEvalEvaluatorConfig]
  - Implements the evaluate() method
  - Registered with the @register_evaluator decorator
  - Handles LLMTestCase creation and metric execution
- Q&A Agent Demo (qa_agent.py)
  - Complete working agent with DeepEval quality controls
  - Uses the @control() decorator for automatic evaluation
  - Demonstrates handling ControlViolationError
- Setup Script (setup_controls.py)
  - Creates agent and registers with server
  - Configures DeepEval-based controls
  - Creates 3 quality controls (coherence, relevance, correctness)
- Entry Point Registration (pyproject.toml)
  - Registers evaluator with server via project.entry-points
  - Depends on agent-control-evaluators>=5.0.0, agent-control-models>=5.0.0, and agent-control-sdk>=5.0.0
  - In monorepo: uses workspace dependencies (editable installs)
  - For third-party: can use published PyPI packages
  - Enables automatic discovery when server starts
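The rule that either criteria or evaluation_steps must be provided can be sketched without any dependencies. The block below uses a plain dataclass as a stand-in for the Pydantic model in config.py. The class name GEvalConfigSketch, the default values, and the threshold range check are assumptions made for illustration, not the example's actual code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GEvalConfigSketch:
    """Hypothetical stand-in for DeepEvalEvaluatorConfig (a Pydantic model in the example)."""

    name: str
    criteria: Optional[str] = None
    evaluation_steps: Optional[list[str]] = None
    # Assumed default: compare the question with the answer.
    evaluation_params: list[str] = field(
        default_factory=lambda: ["input", "actual_output"]
    )
    threshold: float = 0.5  # assumed default

    def __post_init__(self) -> None:
        # Mirror the documented rule: at least one of the two must be set.
        if not self.criteria and not self.evaluation_steps:
            raise ValueError("Provide either 'criteria' or 'evaluation_steps'")
        # Extra sanity check added for this sketch only.
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("'threshold' must be between 0 and 1")
```

Constructing GEvalConfigSketch(name="Coherence") with neither field raises ValueError, which is the behavior a Pydantic validator would surface as a validation error.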
The evaluator follows the standard pattern for all agent-control evaluators:
from agent_control_evaluators import Evaluator, EvaluatorMetadata, register_evaluator

@register_evaluator
class DeepEvalEvaluator(Evaluator[DeepEvalEvaluatorConfig]):
    # Define metadata
    metadata = EvaluatorMetadata(
        name="deepeval-geval",
        version="1.0.0",
        description="DeepEval GEval custom LLM-based evaluator",
        requires_api_key=True,
        timeout_ms=30000,
    )

    # Define config model
    config_model = DeepEvalEvaluatorConfig

    # Implement evaluate method
    async def evaluate(self, data: Any) -> EvaluatorResult:
        # matched=True triggers the deny action when quality fails
        # matched=False allows the request when quality passes
        return EvaluatorResult(
            matched=not is_successful,  # Trigger when quality fails
            confidence=score,
            message=reason,
        )

The evaluator is registered via pyproject.toml:

[project.entry-points."agent_control.evaluators"]
deepeval-geval = "evaluator:DeepEvalEvaluator"

This makes the evaluator automatically discoverable by the server when it starts. The pattern works with both workspace dependencies (for monorepo development) and published PyPI packages (for third-party evaluators).
DeepEval's GEval supports two modes:
With Criteria (auto-generates evaluation steps):
config = DeepEvalEvaluatorConfig(
    name="Coherence",
    criteria="Evaluate whether the response is coherent and logically consistent.",
    evaluation_params=["input", "actual_output"],
    threshold=0.6,
)

With Explicit Steps:

config = DeepEvalEvaluatorConfig(
    name="Correctness",
    evaluation_steps=[
        "Check whether facts in actual output contradict expected output",
        "Heavily penalize omission of critical details",
        "Minor wording differences are acceptable"
    ],
    evaluation_params=["input", "actual_output", "expected_output"],
    threshold=0.7,
)

Once registered, the evaluator can be used in control definitions:
control_definition = {
    "name": "check-coherence",
    "description": "Ensures responses are coherent and logically consistent",
    "definition": {
        "description": "Ensures responses are coherent",
        "enabled": True,
        "execution": "server",
        "scope": {"stages": ["post"]},  # Apply to all steps at post stage
        "selector": {},  # Pass full data (input + output)
        "evaluator": {
            "name": "deepeval-geval",  # From metadata.name
            "config": {
                "name": "Coherence",
                "criteria": "Evaluate whether the response is coherent",
                "evaluation_params": ["input", "actual_output"],
                "threshold": 0.6,
                "model": "gpt-4o",
            },
        },
        "action": {
            "decision": "deny",
            "message": "Response failed coherence check",
        },
    },
}

Key points:
- execution: "server" - Required field
- scope: {"stages": ["post"]} - Apply to all function calls at post stage
- selector: {} - Pass full data so evaluator gets both input and output
- evaluation_params: ["input", "actual_output"] - Both fields required for relevance checks
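The key points above can be turned into a small pre-flight check that runs before a control is sent to the server. This is a sketch only: check_control_definition is a hypothetical helper, not part of the agent-control SDK, and it checks just the four points listed here.

```python
from typing import Any


def check_control_definition(control: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the key points hold."""
    problems: list[str] = []
    definition = control.get("definition", {})

    if definition.get("execution") != "server":
        problems.append('definition.execution should be "server"')

    stages = definition.get("scope", {}).get("stages", [])
    if "post" not in stages:
        problems.append('definition.scope.stages should include "post"')

    if definition.get("selector") != {}:
        problems.append("definition.selector should be {} to pass full data")

    config = definition.get("evaluator", {}).get("config", {})
    params = config.get("evaluation_params", [])
    for required in ("input", "actual_output"):
        if required not in params:
            problems.append(f"evaluation_params is missing {required!r}")

    return problems
```

Calling it on a control definition shaped like the one shown earlier returns an empty list, while a definition that omits execution or narrows the selector returns one message per problem.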
This example demonstrates custom evaluator development within the agent-control monorepo. It uses workspace dependencies (editable installs) to work with the latest development versions of:
- agent-control-models - Base evaluator classes and types
- agent-control-sdk - Agent Control SDK for integration
- deepeval - DeepEval evaluation framework
Note: This is a development/monorepo example showing the evaluator architecture.
# Clone the repository
git clone https://github.com/agentcontrol/agent-control.git
cd agent-control

# Start PostgreSQL database and run migrations
cd server && docker-compose up -d && make alembic-upgrade && cd ..

# Start the agent-control server (from repository root)
make server-run

The server will be running at http://localhost:8000.
# Navigate to the DeepEval example directory
cd examples/deepeval
# Install the evaluator package itself in editable mode
uv pip install -e . --upgrade

This installs:
- Dependencies: deepeval>=1.0.0, openai>=1.0.0, pydantic>=2.0.0, etc.
- Workspace packages (as editable installs): agent-control-models, agent-control-sdk
- This evaluator package in editable mode, which registers the entry point for server discovery
The entry point deepeval-geval = "evaluator:DeepEvalEvaluator" makes the evaluator discoverable by the server.
NOTE: Set OPENAI_API_KEY both in the environment where the server runs and in the environment where you run your agent application.
# Required for DeepEval GEval (uses OpenAI models)
export OPENAI_API_KEY="your-openai-api-key"
# Optional: Disable DeepEval telemetry
export DEEPEVAL_TELEMETRY_OPT_OUT="true"

After installing the DeepEval example, restart the server so it can discover the new evaluator:

# Stop the server (Ctrl+C) and restart
cd ../../ # Back to repository root
make server-run

Verify the evaluator is registered:

curl http://localhost:8000/api/v1/evaluators | grep deepeval-geval

cd examples/deepeval
uv run setup_controls.py

This creates the agent registration and three quality controls (coherence, relevance, correctness).

uv run qa_agent.py

Try asking questions like "What is Python?" or test the controls with "Tell me about something trigger_irrelevant".
Once the agent is running, try these commands:
You: What is Python?
You: What is the capital of France?
You: Test trigger_incoherent response please
You: Tell me about something trigger_irrelevant
You: /test-good # Test with quality questions
You: /test-bad # Test quality control triggers
You: /help # Show all commands
You: /quit # Exit
The agent will:
- Accept questions with coherent, relevant responses
- Block questions that produce incoherent or irrelevant responses
- Show which control triggered when quality checks fail
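The blocking behavior described above amounts to catching a control violation and replying with a fallback message. The sketch below defines a local stand-in for ControlViolationError so it runs without the SDK. The control_name and message attributes and the answer helper are assumptions; the real SDK exception may expose different fields.

```python
from typing import Callable


class ControlViolationError(Exception):
    """Local stand-in for the SDK exception; the real class may carry other fields."""

    def __init__(self, control_name: str, message: str) -> None:
        super().__init__(message)
        self.control_name = control_name
        self.message = message


FALLBACK = (
    "I apologize, but my response didn't meet quality standards. "
    "Could you rephrase your question or ask something else?"
)


def answer(question: str, generate: Callable[[str], str]) -> str:
    """Call generate(question) and turn a control violation into a fallback reply."""
    try:
        return generate(question)
    except ControlViolationError as err:
        # Report which control fired, then degrade gracefully.
        print(f"Quality control triggered: {err.control_name}")
        print(f"Reason: {err.message}")
        return FALLBACK
```

In the real agent, generate would be the function wrapped by the @control() decorator, so the exception is raised by the SDK rather than by user code.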
Good Quality Responses (Pass controls):
You: What is Python?
Agent: Python is a high-level, interpreted programming language known for its
simplicity and readability. It was created by Guido van Rossum and first
released in 1991. Python supports multiple programming paradigms...
Poor Quality Responses (Blocked by controls):
You: Test trigger_incoherent response please
⚠️ Quality control triggered: check-coherence
Reason: Response failed coherence check
Agent: I apologize, but my response didn't meet quality standards.
Could you rephrase your question or ask something else?
The DeepEval controls evaluate responses in real-time and block those that don't meet quality thresholds.
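How a score becomes a block decision can be shown in a few lines. This sketch assumes, as GEval does, that a metric succeeds when its score is at or above the threshold; to_matched is a hypothetical helper name that mirrors the matched=not is_successful line in the evaluator.

```python
def to_matched(score: float, threshold: float) -> bool:
    """Return True (trigger the control) when the score falls below the threshold."""
    is_successful = score >= threshold
    return not is_successful


if __name__ == "__main__":
    # With the coherence threshold of 0.6 used in this example:
    for score in (0.35, 0.60, 0.92):
        decision = "deny" if to_matched(score, 0.6) else "allow"
        print(f"score={score:.2f} -> {decision}")
```

A score exactly equal to the threshold passes, so only scores strictly below 0.6 trigger the deny action here.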
DeepEval supports multiple test case parameters:
- input - The user query or prompt
- actual_output - The LLM's generated response
- expected_output - Reference/ground truth answer
- context - Additional context for evaluation
- retrieval_context - Retrieved documents (for RAG)
- tools_called - Tools invoked by the agent
- expected_tools - Expected tool usage
- Plus MCP-related parameters
Configure which parameters to use via the evaluation_params config field.
Important: For relevance checks, always include both input and actual_output so the evaluator can compare the question with the answer.
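One way to picture how evaluation_params selects test case fields is a helper that picks the requested keys from the incoming data and fails early when one is missing. This is an illustrative sketch: build_test_case_kwargs is hypothetical, the parameter set lists only the names given above, and the example's evaluator may map data to a test case differently.

```python
from typing import Any

# Parameter names listed in this README (MCP-related ones omitted).
KNOWN_PARAMS = {
    "input",
    "actual_output",
    "expected_output",
    "context",
    "retrieval_context",
    "tools_called",
    "expected_tools",
}


def build_test_case_kwargs(
    data: dict[str, Any], evaluation_params: list[str]
) -> dict[str, Any]:
    """Select the configured parameters from data, rejecting unknown or missing ones."""
    unknown = [p for p in evaluation_params if p not in KNOWN_PARAMS]
    if unknown:
        raise ValueError(f"Unknown evaluation params: {unknown}")

    missing = [p for p in evaluation_params if data.get(p) is None]
    if missing:
        raise ValueError(f"Data is missing required fields: {missing}")

    return {p: data[p] for p in evaluation_params}
```

The returned dictionary could then be passed as keyword arguments when constructing a test case. With an empty selector the data carries both input and output, which is why a relevance check configured with both parameters has what it needs.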
This example shows the evaluator architecture for extending agent-control. While this specific example is set up for monorepo development, the same pattern works for third-party evaluators using published packages.
To create your own evaluator:
- Extend the Evaluator base class from agent-control-evaluators (published on PyPI)
- Define a configuration model using Pydantic
- Register via entry points in your pyproject.toml
- Install your package so the server can discover the entry point
- Restart the server to load the new evaluator
For standalone packages outside the monorepo, use published versions:
[project]
dependencies = [
    "agent-control-evaluators>=5.0.0",  # From PyPI - base classes
    "agent-control-models>=5.0.0",  # From PyPI - data models
    "your-evaluation-library>=1.0.0"
]

See the Extending This Example section below for the complete pattern.

For production deployments, build your evaluator as a Python wheel and install it on your agent-control server:

Development (this example):

uv pip install -e . # Editable install for development

Production:

python -m build # Creates dist/*.whl
# Install wheel on production server where agent-control runs

Deployment Options:
- Self-Hosted Server (Full Control)
  - Deploy your own agent-control server instance
  - Install custom evaluator packages (wheel, source, or private PyPI)
  - Your agents connect to this server via the SDK
  - Complete control over evaluators and controls
- Managed Service (If Available)
  - Use a hosted agent-control service
  - May require coordination to install custom evaluators
  - Or use only built-in/approved evaluators
In both cases, evaluators run server-side (execution: "server"), so your agent applications only need the lightweight SDK installed. The evaluator package must be installed where the agent-control server runs, not in your agent application.
Follow this pattern to create evaluators for other libraries:
- Define a Config Model

  from pydantic import BaseModel

  class MyEvaluatorConfig(BaseModel):
      threshold: float = 0.5
      # Your config fields

- Implement the Evaluator

  from agent_control_evaluators import Evaluator, EvaluatorMetadata, register_evaluator

  @register_evaluator
  class MyEvaluator(Evaluator[MyEvaluatorConfig]):
      metadata = EvaluatorMetadata(name="my-evaluator", ...)
      config_model = MyEvaluatorConfig

      async def evaluate(self, data: Any) -> EvaluatorResult:
          score = ...  # Your evaluation logic
          return EvaluatorResult(
              matched=score < self.config.threshold,  # Trigger when fails
              confidence=score,
          )

- Register via Entry Point

  [project.entry-points."agent_control.evaluators"]
  my-evaluator = "evaluator:MyEvaluator"

- Install and Use

  uv sync # Server will discover it automatically
You can create specialized evaluators for specific use cases:
- Bias Detection: Evaluate responses for bias or fairness
- Safety: Check for harmful or unsafe content
- Style Compliance: Ensure responses match brand guidelines
- Technical Accuracy: Validate technical correctness
- Tone Assessment: Evaluate emotional tone and sentiment
- DeepEval Documentation: https://deepeval.com/docs/metrics-llm-evals
- G-Eval Guide: https://www.confident-ai.com/blog/g-eval-the-definitive-guide
- Agent Control Evaluators: Base evaluator class
- CrewAI Example: Using agent-control as a consumer
- Entry Points are Critical: The server discovers evaluators via project.entry-points, not PYTHONPATH
- Extensibility: The Evaluator base class makes it easy to integrate any evaluation library
- Configuration: Pydantic models provide type-safe, validated configuration
- Registration: The @register_evaluator decorator handles registration automatically
- Integration: Evaluators work seamlessly with agent-control's control system
- Control Logic: matched=True triggers the control's action (deny/allow), so the evaluator inverts the quality result: matched is True when quality fails and False when it passes
- Check that execution: "server" is in control definition
- Use scope: {"stages": ["post"]} instead of step_types
- Use empty selector {} to pass full data (input + output)
- Restart server after evaluator code changes
The server couldn't discover the evaluator. Check:
- Entry point registration in pyproject.toml:

  [project.entry-points."agent_control.evaluators"]
  deepeval-geval = "evaluator:DeepEvalEvaluator"

- Package is installed:

  cd examples/deepeval
  uv sync # Install dependencies
  uv pip install -e . # Install this package

- Server was restarted after package installation:

  # Stop server (Ctrl+C), then restart
  make server-run

- Verify registration:

  curl http://localhost:8000/api/v1/evaluators | grep deepeval-geval

- Check server logs for evaluator discovery messages during startup
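The registration check can also be scripted in Python, for example as part of a setup script. This sketch does the same thing as the curl and grep pipeline: it fetches the listing and looks for the evaluator name in the raw response text, so it makes no assumption about the response schema. The function names are hypothetical, and the URL is the one from the curl command.

```python
from urllib.request import urlopen

EVALUATORS_URL = "http://localhost:8000/api/v1/evaluators"  # from the curl command


def body_mentions(body: str, evaluator_name: str = "deepeval-geval") -> bool:
    """Same check as grep: does the raw response text contain the name?"""
    return evaluator_name in body


def evaluator_registered(url: str = EVALUATORS_URL, timeout: float = 5.0) -> bool:
    """Fetch the evaluator listing and look for the name in the response text."""
    with urlopen(url, timeout=timeout) as response:
        return body_mentions(response.read().decode("utf-8"))
```

Calling evaluator_registered() raises urllib.error.URLError if the server is not running, which distinguishes "server down" from "evaluator not discovered".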
- For relevance: include both input and actual_output in evaluation_params
- Check that matched logic is inverted (trigger when quality fails)
- Raise the threshold to be more strict (0.7 instead of 0.5), since a response passes when its score is at or above the threshold
If you see import errors like ImportError: cannot import name 'AgentRef':
- Stale editable install: Reinstall the package

  uv pip install -e /path/to/package --force-reinstall --no-deps

- For agent-control-models specifically:

  uv pip install -e ../../models --force-reinstall --no-deps

- Clear Python cache if issues persist:

  find . -name "*.pyc" -delete
  find . -name "__pycache__" -type d -exec rm -rf {} +

- Verify installation:

  python -c "from agent_control_models.server import AgentRef; print('Success')"
If you see attempted relative import with no known parent package:
- Ensure the package is installed:

  cd examples/deepeval
  uv pip install -e .

- Verify entry point registration:

  uv pip show agent-control-deepeval-example

- Check pyproject.toml has:

  [tool.hatch.build.targets.wheel]
  packages = ["."]
- DeepEval creates a .deepeval/ directory with telemetry files in the working directory
- When the evaluator runs on the server, files appear in server/.deepeval/
- These files don't need to be committed (add .deepeval/ to .gitignore)
- To disable telemetry: set environment variable DEEPEVAL_TELEMETRY_OPT_OUT="true"
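Adding .deepeval/ to .gitignore can be automated so the step is safe to repeat. The helper below is a hypothetical sketch (ensure_ignored is not part of this example's code) that appends the entry only when it is not already present.

```python
from pathlib import Path


def ensure_ignored(gitignore: Path, entry: str = ".deepeval/") -> bool:
    """Append entry to the .gitignore file if absent; return True if the file changed."""
    text = gitignore.read_text() if gitignore.exists() else ""
    if entry in (line.strip() for line in text.splitlines()):
        return False
    # Start on a fresh line if the existing file does not end with a newline.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    gitignore.write_text(text + separator + entry + "\n")
    return True
```

For example, ensure_ignored(Path(".gitignore")) run from the repository root adds the entry once and does nothing on later runs.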
This example is part of the agent-control project.