DuraGraph Evals provides a comprehensive evaluation framework for AI agents. Test your agents automatically, collect human feedback, and monitor quality over time.
AI agents can fail in subtle ways. Evaluations help you catch regressions automatically, quantify quality over time, and compare assistant versions. DuraGraph Evals supports four evaluation types:
- **Heuristic**: Rule-based checks like exact match, contains, regex, and JSON validation (see the sketch after this list)
- **LLM Judge**: Use GPT-4 or Claude to evaluate subjective quality
- **Human Feedback**: Collect ratings, thumbs up/down, and comments
- **Comparison**: A/B test between assistant versions
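To make the heuristic category concrete, here is what those rule-based checks boil down to in plain Python. This is an illustrative sketch, not DuraGraph's scorer API; the built-in scorers (see the Scorers Reference below) wrap this kind of logic with configuration and reporting.

```python
import json
import re

# Illustrative only: the kind of rule-based checks a heuristic scorer performs.
# No model calls, just deterministic pass/fail logic.
def contains_check(output: str, needle: str) -> bool:
    """True if the needle appears in the output, case-insensitively."""
    return needle.lower() in output.lower()

def regex_check(output: str, pattern: str) -> bool:
    """True if the regex pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None

def json_valid_check(output: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# Example: checking a password-reset reply from an agent
reply = "Click 'Forgot password' on the login page to reset it."
assert contains_check(reply, "reset")
assert regex_check(reply, r"[Ff]orgot password")
assert not json_valid_check(reply)
```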
First, create a dataset with test cases:
```bash
curl -X POST http://localhost:8081/api/v1/datasets \
  -H "Content-Type: application/json" \
  -d '{
    "name": "customer_support_cases",
    "description": "Common customer support scenarios"
  }'
```

Or using the Python SDK:

```python
from duragraph.evals import Dataset

dataset = Dataset.create(
    name="customer_support_cases",
    description="Common customer support scenarios",
)
```

Add input/expected output pairs:
```bash
curl -X POST http://localhost:8081/api/v1/datasets/{dataset_id}/examples \
  -H "Content-Type: application/json" \
  -d '{
    "examples": [
      {
        "input": {"message": "How do I reset my password?"},
        "expected_output": {"contains": "reset", "action": "password_reset"},
        "tags": ["password", "account"]
      },
      {
        "input": {"message": "I want a refund"},
        "expected_output": {"action": "refund_request"},
        "tags": ["billing", "refund"]
      }
    ]
  }'
```

Or using the Python SDK:

```python
dataset.add_examples([
    {
        "input": {"message": "How do I reset my password?"},
        "expected_output": {"contains": "reset", "action": "password_reset"},
        "tags": ["password", "account"],
    },
    {
        "input": {"message": "I want a refund"},
        "expected_output": {"action": "refund_request"},
        "tags": ["billing", "refund"],
    },
])
```

Create and run an evaluation against your assistant:
```bash
curl -X POST http://localhost:8081/api/v1/evals \
  -H "Content-Type: application/json" \
  -d '{
    "name": "support_agent_v1.2",
    "type": "heuristic",
    "assistant_id": "your-assistant-id",
    "dataset_id": "your-dataset-id",
    "criteria": [
      {
        "name": "contains_action",
        "scorer": "json_schema",
        "config": {
          "schema": {"type": "object", "required": ["action"]}
        }
      },
      {
        "name": "response_quality",
        "scorer": "llm_judge",
        "config": {"model": "gpt-4o", "rubric": "Rate helpfulness 1-5"}
      }
    ],
    "run_immediately": true
  }'
```

Or using the Python SDK:

```python
from duragraph.evals import EvalRunner, JSONSchema, LLMJudge

runner = EvalRunner(
    graph=support_agent,
    scorers=[
        JSONSchema(schema={"type": "object", "required": ["action"]}),
        LLMJudge(model="gpt-4o", rubric="Rate helpfulness 1-5"),
    ],
)

results = await runner.run(dataset)
print(f"Pass rate: {results.pass_rate:.1%}")
print(f"Avg score: {results.avg_score:.2f}")
```

Check evaluation results in the dashboard or via API:
```bash
curl http://localhost:8081/api/v1/evals/{eval_id}
```

Example response:

```json
{
  "id": "eval-123",
  "name": "support_agent_v1.2",
  "status": "completed",
  "summary": {
    "total_examples": 50,
    "passed": 45,
    "failed": 5,
    "pass_rate": 0.9,
    "avg_score": 4.2,
    "by_criterion": {
      "contains_action": { "pass_rate": 0.96 },
      "response_quality": { "avg_score": 4.2 }
    }
  }
}
```

The Evals Dashboard provides visualization and analysis of results.
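If you want to consume these numbers in a script rather than the dashboard (for example, to gate a CI pipeline), you can poll the same endpoint. A minimal sketch, assuming the response shape shown above and a locally running server; `EVAL_ID` and the 0.9 threshold are placeholders:

```python
import sys

import requests

# Minimal CI gate sketch: fetch a completed eval and fail the build on a low pass rate.
# Assumes the GET /api/v1/evals/{eval_id} response shape shown above.
EVAL_ID = "eval-123"   # replace with the eval you just ran
THRESHOLD = 0.9        # arbitrary example threshold

resp = requests.get(f"http://localhost:8081/api/v1/evals/{EVAL_ID}", timeout=30)
resp.raise_for_status()
body = resp.json()

if body["status"] != "completed":
    sys.exit(f"Eval {EVAL_ID} has not finished yet (status: {body['status']}).")

summary = body["summary"]
print(f"pass_rate={summary['pass_rate']:.1%} avg_score={summary['avg_score']:.2f}")
if summary["pass_rate"] < THRESHOLD:
    sys.exit("Pass rate below threshold; failing the pipeline.")
```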
Learn more:

- **Scorers Reference**: Learn about all available scorers
- **LLM Judge**: Configure LLM-as-judge evaluation
- **Human Feedback**: Set up human feedback collection
- **CI/CD Integration**: Add evals to your CI/CD pipeline