Evaluation Scorers

Scorers evaluate agent outputs and produce scores. DuraGraph supports both heuristic (rule-based) and LLM-based scorers.

Heuristic scorers are fast, deterministic checks for objective criteria.

The exact_match scorer checks whether the output exactly matches the expected value:

{
  "name": "answer_correct",
  "scorer": "exact_match",
  "config": {
    "field": "answer",
    "case_sensitive": false
  }
}

The contains scorer checks whether the output contains a substring:

{
  "name": "mentions_refund",
  "scorer": "contains",
  "config": {
    "substring": "refund",
    "case_sensitive": false
  }
}

The regex scorer checks whether the output matches a regular expression:

{
  "name": "valid_email",
  "scorer": "regex",
  "config": {
    "pattern": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
  }
}

The json_valid scorer checks whether the output is valid JSON:

{
  "name": "valid_json",
  "scorer": "json_valid"
}

The json_schema scorer validates the output against a JSON Schema:

{
  "name": "valid_response",
  "scorer": "json_schema",
  "config": {
    "schema": {
      "type": "object",
      "required": ["action", "message"],
      "properties": {
        "action": { "type": "string", "enum": ["reply", "transfer", "escalate"] },
        "message": { "type": "string", "minLength": 1 }
      }
    }
  }
}

The length scorer checks whether the output length is within bounds:

{
  "name": "appropriate_length",
  "scorer": "length",
  "config": {
    "min": 50,
    "max": 500
  }
}
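
The built-in scorers are also importable from the Python SDK (see the runner example further down this page). As a minimal sketch, and assuming the built-in scorers expose the same async score() method as the Scorer interface documented later on this page, a single output can be checked directly:

import asyncio

from duragraph.evals import Contains

async def main() -> None:
    # Assumption: built-in scorers implement the async Scorer.score() interface.
    scorer = Contains(substring="refund", case_sensitive=False)
    score = await scorer.score(
        output="We have issued a full refund to your card.",
        expected=None,
        config=None,
    )
    print(score.passed, score.explanation)

asyncio.run(main())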

The llm_judge scorer uses an LLM to evaluate subjective qualities. The simplest configuration names the criteria to judge:

{
  "name": "helpfulness",
  "scorer": "llm_judge",
  "config": {
    "model": "gpt-4o",
    "criteria": ["helpfulness", "accuracy", "clarity"]
  }
}

A free-form rubric can be supplied instead of a criteria list:

{
  "name": "custom_quality",
  "scorer": "llm_judge",
  "config": {
    "model": "gpt-4o",
    "rubric": "Evaluate the response on:\n1. Does it address the user's question?\n2. Is the tone professional?\n3. Is the information accurate?\n\nScore 1-5 for each criterion."
  }
}

Criteria can also be given as objects with a name and a description to guide the judge:

{
  "name": "comprehensive",
  "scorer": "llm_judge",
  "config": {
    "model": "claude-3-sonnet",
    "criteria": [
      {
        "name": "relevance",
        "description": "How relevant is the response to the query?"
      },
      {
        "name": "completeness",
        "description": "Does the response fully address the question?"
      },
      {
        "name": "safety",
        "description": "Is the response safe and appropriate?"
      }
    ]
  }
}
Heuristic scorers can also be constructed directly in the Python SDK and passed to an EvalRunner:

from duragraph.evals import (
    EvalRunner,
    ExactMatch,
    Contains,
    Regex,
    JSONSchema,
)

runner = EvalRunner(
    graph=my_agent,
    scorers=[
        ExactMatch(field="action"),
        Contains(substring="thank you", case_sensitive=False),
        JSONSchema(schema={"type": "object", "required": ["response"]}),
    ],
)

results = await runner.run(dataset)
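
The LLM judge is shown above only as a JSON configuration. The sketch below assumes a corresponding LLMJudge class in duragraph.evals with model and criteria keyword arguments; that class name and signature are assumptions, not confirmed API:

from duragraph.evals import EvalRunner, LLMJudge  # LLMJudge is assumed, not confirmed

# Hypothetical Python counterpart of the "helpfulness" JSON config above.
judge = LLMJudge(
    model="gpt-4o",
    criteria=["helpfulness", "accuracy", "clarity"],
)

runner = EvalRunner(graph=my_agent, scorers=[judge])
results = await runner.run(dataset)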

All scorers produce a standardized score:

from dataclasses import dataclass

@dataclass
class Score:
    criterion: str    # Name of the criterion
    value: float      # Score value (0-1 for pass/fail, 1-5 for ratings)
    passed: bool      # Whether the score passes the threshold
    explanation: str  # Human-readable explanation
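
As a usage sketch, individual Score objects can be aggregated into per-criterion pass rates. How scores are grouped inside the runner's results object is not shown here, so the flat list of Score values below is an assumption:

from collections import defaultdict

def pass_rates(scores: list[Score]) -> dict[str, float]:
    """Compute the fraction of passing scores for each criterion."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for score in scores:
        totals[score.criterion] += 1
        if score.passed:
            passes[score.criterion] += 1
    return {name: passes[name] / totals[name] for name in totals}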

Create custom scorers by implementing the Scorer interface:

from duragraph.evals import Scorer, Score

class SentimentScorer(Scorer):
    """Custom scorer that checks sentiment."""

    def __init__(self, target_sentiment: str = "positive"):
        self.target_sentiment = target_sentiment

    async def score(
        self,
        output: str,
        expected: str | None,
        config: dict | None,
    ) -> Score:
        # Your scoring logic here
        sentiment = analyze_sentiment(output)
        passed = sentiment == self.target_sentiment
        return Score(
            criterion="sentiment",
            value=1.0 if passed else 0.0,
            passed=passed,
            explanation=f"Detected sentiment: {sentiment}",
        )

# Use the custom scorer
runner = EvalRunner(
    graph=my_agent,
    scorers=[SentimentScorer(target_sentiment="positive")],
)
Some best practices when configuring scorers:

  1. Start simple: Begin with heuristic scorers and add an LLM judge for more nuanced evaluation
  2. Be specific: Clear criteria produce more reliable scores
  3. Calibrate thresholds: Adjust pass/fail thresholds to match your quality bar
  4. Monitor drift: Track scores over time to catch regressions (see the sketch after this list)
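
For the last point, a minimal drift check could compare the current run's pass rates against a stored baseline. The pass_rates output, the baseline file, and the 5% tolerance below are illustrative assumptions, not part of DuraGraph:

import json
from pathlib import Path

BASELINE = Path("eval_baseline.json")  # hypothetical baseline file
TOLERANCE = 0.05  # assumed acceptable drop in pass rate

def check_drift(current: dict[str, float]) -> list[str]:
    """Return criteria whose pass rate dropped more than TOLERANCE vs. the baseline."""
    baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    return [
        name
        for name, rate in current.items()
        if name in baseline and baseline[name] - rate > TOLERANCE
    ]

Criteria flagged this way can then gate a release or trigger an alert.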