Overview

The Evaluations system enables you to assess, track, and improve the quality of AI-generated responses. By creating evaluations, you can build datasets for fine-tuning models, monitor performance trends, and maintain consistently high-quality customer interactions.

Key Benefits

  • Quality Assurance: Monitor and improve response quality
  • Model Training: Export evaluations as JSONL for fine-tuning
  • Performance Tracking: Analyze trends and identify areas for improvement
  • Team Insights: Understand response patterns across different support types

Evaluation Status Types

Each evaluation is assigned a status that reflects the quality of the response:

Outstanding

Exceptional responses that exceed expectations. These serve as gold-standard examples for training.

Satisfactory

Good responses that meet quality standards and properly address customer needs.

Needs Further Review

Responses requiring additional assessment or minor improvements before final classification.

Unsatisfactory

Responses with significant issues that don’t meet quality standards.

Code Red

Critical failures requiring immediate attention and remediation.
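
These labels are stored as plain strings in the eval_status field (the creation example later in this guide uses "Outstanding"). If you handle evaluations in code, one way to keep the labels consistent is a small enum. This is an illustrative sketch, not a product API, and it assumes the stored strings match the display names above:

from enum import Enum

class EvalStatus(str, Enum):
    """Status labels as they appear in the eval_status field (assumed)."""
    OUTSTANDING = "Outstanding"
    SATISFACTORY = "Satisfactory"
    NEEDS_FURTHER_REVIEW = "Needs Further Review"
    UNSATISFACTORY = "Unsatisfactory"
    CODE_RED = "Code Red"

# Example: statuses you might treat as training-quality when exporting
TRAINING_QUALITY = {EvalStatus.OUTSTANDING, EvalStatus.SATISFACTORY}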

Creating Evaluations

Manual Evaluation Creation

To create a new evaluation manually:

  1. Navigate to the Evaluations Dashboard
  2. Click “Create New Evaluation”
  3. Fill in the evaluation details:
{
  "eval_name": "Customer Refund Request Handling",
  "eval_type": "Customer Service",
  "user_message": "I want to return my order #12345. It arrived damaged.",
  "preferred_output": "I'm sorry to hear your order arrived damaged. I'll help you with the return right away. I've initiated a return for order #12345 and you'll receive a prepaid shipping label via email within 24 hours. Once we receive the item, your refund will be processed within 3-5 business days.",
  "non_preferred_output": "You need to go to our website and fill out the return form.",
  "eval_status": "Outstanding",
  "description": "Exemplary handling of damaged product return with empathy and clear next steps"
}
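
If you prepare evaluation details in bulk (for example, from a spreadsheet) before entering them, a quick completeness check can catch gaps early. A minimal sketch, assuming the field names shown above; validate_evaluation is an illustrative helper, not part of the product:

REQUIRED_FIELDS = (
    "eval_name", "eval_type", "user_message",
    "preferred_output", "non_preferred_output", "eval_status",
)

def validate_evaluation(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks complete."""
    problems = [f"missing or empty field: {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    if record.get("preferred_output") and record.get("preferred_output") == record.get("non_preferred_output"):
        problems.append("preferred and non-preferred outputs are identical")
    return problems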

Evaluation Types

Choose the appropriate type for your evaluation:

  • Customer Service: General customer inquiries and support
  • Technical Support: Technical issues and troubleshooting
  • Sales: Sales-related interactions and inquiries
  • Product Support: Product-specific questions and guidance
  • General: Other types of interactions

Managing Evaluations

Dashboard Features

The Evaluations Dashboard provides tools for reviewing and managing your evaluations:

Performance Statistics

Monitor your evaluation metrics through dashboard cards:

  • Total Evaluations: Overall count of all evaluations
  • Outstanding: Count of exceptional responses
  • Needs Review: Evaluations requiring further assessment
  • This Week: Recent evaluation activity

Each metric includes trend indicators showing performance changes over time.
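
The dashboard's exact formula isn't documented here, but a week-over-week comparison is the usual approach if you want to reproduce the trend indicators in your own reporting. A sketch:

def trend(current_week: int, previous_week: int) -> str:
    """Week-over-week change as a signed percentage, e.g. '+25.0%'."""
    if previous_week == 0:
        return "new" if current_week else "flat"
    change = (current_week - previous_week) / previous_week * 100
    return f"{change:+.1f}%"

print(trend(15, 12))  # +25.0%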

Exporting for Model Training

JSONL Export Format

Evaluations can be exported in JSONL format for model fine-tuning:

{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "I want to return my order #12345. It arrived damaged."
      }
    ],
    "tools": [],
    "parallel_tool_calls": true
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "I'm sorry to hear your order arrived damaged..."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "You need to go to our website and fill out the return form."
    }
  ]
}
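
If you assemble an export yourself, each evaluation becomes one JSON object per line. A minimal sketch that converts a record with the fields from the creation example into the structure above (field names follow the examples in this guide):

import json

# One sample record; in practice these come from your stored evaluations.
evaluations = [
    {
        "user_message": "I want to return my order #12345. It arrived damaged.",
        "preferred_output": "I'm sorry to hear your order arrived damaged...",
        "non_preferred_output": "You need to go to our website and fill out the return form.",
    }
]

def to_jsonl_line(ev: dict) -> str:
    """Build one JSONL line in the export structure shown above."""
    record = {
        "input": {
            "messages": [{"role": "user", "content": ev["user_message"]}],
            "tools": ev.get("tools", []),
            "parallel_tool_calls": True,
        },
        "preferred_output": [{"role": "assistant", "content": ev["preferred_output"]}],
        "non_preferred_output": [{"role": "assistant", "content": ev["non_preferred_output"]}],
    }
    return json.dumps(record, ensure_ascii=False)

with open("evaluations.jsonl", "w", encoding="utf-8") as f:
    for ev in evaluations:
        f.write(to_jsonl_line(ev) + "\n")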

Export Methods

  1. Navigate to the Export tab in the dashboard
  2. Select “Create New” mode
  3. Enter the evaluation details:
    • User Message
    • Preferred Output
    • Non-Preferred Output
    • Tools (optional, for function calling)
  4. Click “Export to JSONL”
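
Before using an export for training, it's worth a quick sanity check that every line parses and carries the expected top-level keys. A small sketch:

import json

def check_jsonl(path: str) -> None:
    """Raise if any line is malformed JSON or missing a required top-level key."""
    required = {"input", "preferred_output", "non_preferred_output"}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)  # raises ValueError on malformed JSON
            missing = required - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")

check_jsonl("evaluations.jsonl")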

Advanced Export Options

For evaluations involving tool use or function calling, add the tool definitions to the export's tools array:

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "process_return",
        "description": "Process a product return request",
        "parameters": {
          "type": "object",
          "properties": {
            "order_id": {
              "type": "string",
              "description": "The order ID to process return for"
            },
            "reason": {
              "type": "string",
              "description": "Reason for the return"
            }
          },
          "required": ["order_id", "reason"]
        }
      }
    }
  ]
}
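
When a tools array is present, the preferred output can demonstrate the correct tool call rather than a plain text reply. One plausible shape, assuming the export follows the standard chat-completions message schema (the id value and arguments are illustrative):

{
  "preferred_output": [
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_001",
          "type": "function",
          "function": {
            "name": "process_return",
            "arguments": "{\"order_id\": \"12345\", \"reason\": \"arrived damaged\"}"
          }
        }
      ]
    }
  ]
}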

Best Practices

Creating High-Quality Evaluations

  1. Clear Naming: Use descriptive names that indicate the scenario being evaluated.
     ✅ "Damaged Product Return - Empathetic Response"
     ❌ "Eval 1"
  2. Realistic Scenarios: Base evaluations on actual customer interactions and common use cases.
  3. Comprehensive Coverage: Include both ideal responses and common mistakes to avoid.
  4. Consistent Standards: Apply evaluation criteria consistently across similar interaction types.
  5. Regular Review: Periodically review and update evaluations to maintain relevance.

Evaluation Criteria

When assessing responses, consider:

Accuracy

  • Correct information provided
  • Proper understanding of the issue
  • Appropriate solution offered

Tone & Empathy

  • Professional and friendly tone
  • Empathy for customer situation
  • Appropriate level of formality

Completeness

  • All questions answered
  • Clear next steps provided
  • No missing information

Efficiency

  • Concise yet comprehensive
  • Direct problem resolution
  • Minimal back-and-forth needed

Use Cases

Model Fine-Tuning

Export evaluations to create training datasets:

  1. Collect Examples: Build a corpus of high-quality evaluations
  2. Export to JSONL: Use the bulk export feature
  3. Prepare Dataset: Format according to your model’s requirements
  4. Fine-Tune: Use the dataset to improve model performance
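
The export structure (input / preferred_output / non_preferred_output) matches the preference fine-tuning (DPO) format used by OpenAI. If that is your target, a hedged sketch with the OpenAI Python SDK; the model name and beta value are placeholders, and you should confirm the method parameters against the current fine-tuning documentation:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the exported evaluations
training_file = client.files.create(
    file=open("evaluations.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a preference (DPO) fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder; pick a model that supports preference tuning
    method={"type": "dpo", "dpo": {"hyperparameters": {"beta": 0.1}}},
)
print(job.id, job.status)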

Quality Monitoring

Track response quality over time:

  • Monitor status distribution (Outstanding vs. Unsatisfactory)
  • Identify patterns in problematic responses
  • Track improvement after training or process changes
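
If you want to look at the status distribution outside the dashboard, a quick sketch, assuming you have the evaluations as a list of records with an eval_status field (the inline sample is for illustration only):

from collections import Counter

# A tiny inline sample; in practice, load your exported records.
evaluations = [
    {"eval_status": "Outstanding"},
    {"eval_status": "Satisfactory"},
    {"eval_status": "Unsatisfactory"},
]

counts = Counter(ev["eval_status"] for ev in evaluations)
total = sum(counts.values())
for status, n in counts.most_common():
    print(f"{status}: {n} ({n / total:.0%})")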

Team Training

Use evaluations for human agent training:

  • Share examples of outstanding responses
  • Highlight common mistakes to avoid
  • Create training materials from real scenarios
