Overview

The Evaluations system enables you to assess, track, and improve the quality of AI-generated responses. By creating evaluations, you can build datasets for fine-tuning models, monitor performance trends, and ensure consistent high-quality customer interactions.

Key Benefits

  • Quality Assurance: Monitor and improve response quality
  • Model Training: Export evaluations as JSONL for fine-tuning
  • Performance Tracking: Analyze trends and identify areas for improvement
  • Team Insights: Understand response patterns across different support types

Evaluation Status Types

Each evaluation is assigned a status that reflects the quality of the response:

Outstanding

Exceptional responses that exceed expectations. These serve as gold-standard examples for training.

Satisfactory

Good responses that meet quality standards and properly address customer needs.

Needs Further Review

Responses requiring additional assessment or minor improvements before final classification.

Unsatisfactory

Responses with significant issues that don’t meet quality standards.

Code Red

Critical failures requiring immediate attention and remediation.

Creating Evaluations

Manual Evaluation Creation

To create a new evaluation manually:
  1. Navigate to the Evaluations Dashboard
  2. Click “Create New Evaluation”
  3. Fill in the evaluation details:
{
  "eval_name": "Customer Refund Request Handling",
  "eval_type": "Customer Service",
  "user_message": "I want to return my order #12345. It arrived damaged.",
  "preferred_output": "I'm sorry to hear your order arrived damaged. I'll help you with the return right away. I've initiated a return for order #12345 and you'll receive a prepaid shipping label via email within 24 hours. Once we receive the item, your refund will be processed within 3-5 business days.",
  "non_preferred_output": "You need to go to our website and fill out the return form.",
  "eval_status": "Outstanding",
  "description": "Exemplary handling of damaged product return with empathy and clear next steps"
}

Evaluation Types

Choose the appropriate type for your evaluation:
  • Customer Service: General customer inquiries and support
  • Technical Support: Technical issues and troubleshooting
  • Sales: Sales-related interactions and inquiries
  • Product Support: Product-specific questions and guidance
  • General: Other types of interactions

Building Effective Test Sets

To create robust evaluations, build test sets that represent real-world scenarios:
  • Golden Test Sets: Curate 50-100 high-quality examples with expected responses.
  • Diverse Scenarios: Include in-scope queries, out-of-scope small talk, and adversarial questions.
  • Synthetic Data: Generate variations using LLMs to expand coverage, including edge cases and bias tests.
  • Iteration: Continuously update based on production data and user feedback.
Use these test sets to ensure comprehensive evaluation coverage.
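For the synthetic-data point above, the sketch below uses the OpenAI Python SDK to paraphrase an existing golden example into new variations. The model name and prompt are illustrative, and any chat-completion API can be substituted.

# Illustrative sketch: expand a golden test set with paraphrased variations.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# any chat-completion API can be substituted.
from openai import OpenAI

client = OpenAI()

seed_message = "I want to return my order #12345. It arrived damaged."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {
            "role": "system",
            "content": "Rewrite the user's message in 5 different ways. "
                       "Vary tone and phrasing but keep the intent and the "
                       "order number. Return one variation per line.",
        },
        {"role": "user", "content": seed_message},
    ],
)

for variation in response.choices[0].message.content.splitlines():
    print(variation)

Review generated variations before adding them to a test set so that low-quality paraphrases don't dilute your golden examples.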

Managing Evaluations

Dashboard Features

The Evaluations Dashboard provides comprehensive tools for management.

Search and Filtering

  • Smart Search: Search by name, type, ticket ID, or description
  • Type Filter: Filter by evaluation type (Customer Service, Technical Support, etc.)
  • Status Filter: View evaluations by their quality status
  • Date Range: Filter by creation date (last 7/30/90 days or all time)
  • Sorting: Sort by date, name, status, or type in ascending/descending order

Bulk Operations

  • Select Multiple: Use checkboxes to select multiple evaluations
  • Bulk Export: Export selected evaluations to JSONL format
  • Bulk Delete: Remove multiple evaluations at once
  • Select All: Quickly select all visible evaluations

View Options

  • Table View: Traditional table layout for efficient scanning
  • Card View: Visual card layout for detailed previews
  • Analytics View: Detailed insights and trends (coming soon)

Types of Evaluation Methods

Enhance your evaluation process using different methods:

Human Evaluation

  • Direct feedback from users or experts
  • Best for nuanced quality assessment
  • Example: Rate response empathy on a 1-5 scale

LLM-as-Judge

  • Use another AI to evaluate outputs
  • Scalable for large datasets
  • Example: Prompt an LLM to score factual accuracy

Automated Checks

  • Programmatic checks for specific criteria
  • Ideal for objective metrics
  • Example: Verify response format or keyword presence

Combine these methods for comprehensive quality assurance; a minimal automated check is sketched below.
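As a concrete illustration of the automated approach, here is a minimal rule-based check in Python. It assumes responses are available as plain strings; the specific patterns (order ID, timeframe, length) are illustrative and should be adapted to your own criteria.

# Minimal sketch of a programmatic (rule-based) check on a response string.
# The patterns below are illustrative, not prescribed criteria.
import re

def check_response(text: str) -> dict:
    """Run simple objective checks on a response."""
    return {
        "mentions_order_id": bool(re.search(r"#\d+", text)),      # order reference present
        "gives_timeframe": bool(re.search(r"\d+[-]?\d*\s*(business )?days?", text)),
        "reasonable_length": 10 <= len(text.split()) <= 150,      # not too short, not too long
    }

result = check_response(
    "I've initiated a return for order #12345 and your refund will be "
    "processed within 3-5 business days."
)
print(result)  # {'mentions_order_id': True, 'gives_timeframe': True, 'reasonable_length': True}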

Performance Statistics

Monitor your evaluation metrics through dashboard cards:
  • Total Evaluations: Overall count of all evaluations
  • Outstanding: Count of exceptional responses
  • Needs Review: Evaluations requiring further assessment
  • This Week: Recent evaluation activity
Each metric includes trend indicators showing performance changes over time.
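If you prefer to track these numbers outside the dashboard, the counts can be recomputed from exported records. The sketch below assumes each record carries an eval_status field (as in the creation example above) and an ISO-8601 created_at timestamp; the created_at field is an assumption about the export schema.

# Sketch: recompute the dashboard metrics from exported evaluation records.
# Field names follow the examples in this guide and may differ in practice.
from collections import Counter
from datetime import datetime, timedelta, timezone

def summarize(evaluations: list[dict]) -> dict:
    now = datetime.now(timezone.utc)
    week_ago = now - timedelta(days=7)
    statuses = Counter(e["eval_status"] for e in evaluations)
    this_week = sum(
        1 for e in evaluations
        if datetime.fromisoformat(e["created_at"]) >= week_ago
    )
    return {
        "total": len(evaluations),
        "outstanding": statuses.get("Outstanding", 0),
        "needs_review": statuses.get("Needs Further Review", 0),
        "this_week": this_week,
    }

print(summarize([
    {"eval_status": "Outstanding", "created_at": "2024-05-01T12:00:00+00:00"},
]))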

Exporting for Model Training

JSONL Export Format

Evaluations can be exported in JSONL format for model fine-tuning:
{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "I want to return my order #12345. It arrived damaged."
      }
    ],
    "tools": [],
    "parallel_tool_calls": true
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "I'm sorry to hear your order arrived damaged..."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "You need to go to our website and fill out the return form."
    }
  ]
}
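If you maintain evaluations outside the dashboard, a small script can produce the same JSONL structure. The sketch below assumes records shaped like the creation example earlier in this guide (user_message, preferred_output, non_preferred_output); it is not part of the dashboard's export feature.

# Sketch: convert stored evaluation records into JSONL lines matching the
# export format above. Adjust field names if your records differ.
import json

def to_jsonl_record(evaluation: dict) -> str:
    record = {
        "input": {
            "messages": [{"role": "user", "content": evaluation["user_message"]}],
            "tools": evaluation.get("tools", []),
            "parallel_tool_calls": True,
        },
        "preferred_output": [
            {"role": "assistant", "content": evaluation["preferred_output"]}
        ],
        "non_preferred_output": [
            {"role": "assistant", "content": evaluation["non_preferred_output"]}
        ],
    }
    return json.dumps(record)

evaluations = [
    {
        "user_message": "I want to return my order #12345. It arrived damaged.",
        "preferred_output": "I'm sorry to hear your order arrived damaged...",
        "non_preferred_output": "You need to go to our website and fill out the return form.",
    },
]

with open("evaluations.jsonl", "w", encoding="utf-8") as f:
    for e in evaluations:
        f.write(to_jsonl_record(e) + "\n")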

Export Methods

Manual Export

  1. Navigate to the Export tab in the dashboard
  2. Select “Create New” mode
  3. Enter the evaluation details:
    • User Message
    • Preferred Output
    • Non-Preferred Output
    • Tools (optional, for function calling)
  4. Click “Export to JSONL”

Bulk Export

  1. Select evaluations on the dashboard using the checkboxes (or Select All)
  2. Click “Bulk Export” to download the selected evaluations in JSONL format

Advanced Export Options

For evaluations involving tool use or function calling:
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "process_return",
        "description": "Process a product return request",
        "parameters": {
          "type": "object",
          "properties": {
            "order_id": {
              "type": "string",
              "description": "The order ID to process return for"
            },
            "reason": {
              "type": "string",
              "description": "Reason for the return"
            }
          },
          "required": ["order_id", "reason"]
        }
      }
    }
  ]
}
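Because exports fail when the tools field contains invalid JSON (see Troubleshooting below), it can help to validate tool definitions before exporting. The helper below is an illustrative sketch, not a dashboard feature; it only checks JSON validity and a few structural expectations.

# Illustrative pre-export check that the tools field is valid JSON and each
# entry looks like a function-tool definition. Not part of the dashboard.
import json

def validate_tools(tools_text: str) -> list[str]:
    problems = []
    try:
        tools = json.loads(tools_text)
    except json.JSONDecodeError as err:
        return [f"Invalid JSON: {err}"]
    if not isinstance(tools, list):
        return ["Tools must be a JSON array"]
    for i, tool in enumerate(tools):
        if not isinstance(tool, dict):
            problems.append(f"Tool {i}: expected an object")
            continue
        if tool.get("type") != "function":
            problems.append(f"Tool {i}: missing or unsupported 'type'")
        if "name" not in tool.get("function", {}):
            problems.append(f"Tool {i}: function definition has no 'name'")
    return problems

print(validate_tools('[{"type": "function", "function": {"name": "process_return"}}]'))  # []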

Best Practices

Creating High-Quality Evaluations

  1. Clear Naming: Use descriptive names that indicate the scenario being evaluated.
     ✅ "Damaged Product Return - Empathetic Response"
     ❌ "Eval 1"
  2. Realistic Scenarios: Base evaluations on actual customer interactions and common use cases.
  3. Comprehensive Coverage: Include both ideal responses and common mistakes to avoid.
  4. Consistent Standards: Apply evaluation criteria consistently across similar interaction types.
  5. Regular Review: Periodically review and update evaluations to maintain relevance.

Evaluation Criteria

When assessing responses, consider:

Accuracy

  • Correct information provided
  • Proper understanding of the issue
  • Appropriate solution offered

Tone & Empathy

  • Professional and friendly tone
  • Empathy for customer situation
  • Appropriate level of formality

Completeness

  • All questions answered
  • Clear next steps provided
  • No missing information

Efficiency

  • Concise yet comprehensive
  • Direct problem resolution
  • Minimal back-and-forth needed

Advanced Evaluation Metrics

To further refine your evaluations, consider these advanced metrics tailored for AI-generated customer service responses:

Relevance

  • How well the response addresses the specific query
  • Avoidance of unnecessary information
  • Alignment with customer needs

Coherence

  • Logical flow of information
  • Consistent language and structure
  • Easy to follow reasoning

Helpfulness

  • Enables customer action
  • Provides value beyond basic information
  • Anticipates follow-up needs

Safety & Bias

  • Absence of harmful content
  • Fair and unbiased responses
  • Compliance with ethical guidelines
Incorporate these metrics into your evaluation rubrics for more comprehensive assessments.
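One way to apply these rubrics at scale is an LLM-as-judge pass that scores every criterion. The sketch below uses the OpenAI Python SDK; the model name, prompt wording, and 1-5 scale are illustrative assumptions rather than a prescribed scoring scheme.

# Sketch of an LLM-as-judge rubric combining the criteria and metrics above.
# The model name, prompt wording, and 1-5 scale are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = [
    "Accuracy", "Tone & Empathy", "Completeness", "Efficiency",
    "Relevance", "Coherence", "Helpfulness", "Safety & Bias",
]

def judge(user_message: str, response_text: str) -> dict:
    prompt = (
        "Score the assistant response against each criterion on a 1-5 scale. "
        f"Criteria: {', '.join(RUBRIC)}. "
        "Return a JSON object mapping each criterion to its score.\n\n"
        f"Customer message: {user_message}\n"
        f"Assistant response: {response_text}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request machine-readable scores
    )
    return json.loads(completion.choices[0].message.content)

scores = judge(
    "I want to return my order #12345. It arrived damaged.",
    "I'm sorry to hear that. I've initiated a return for order #12345.",
)
print(scores)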

Use Cases

Model Fine-Tuning

Export evaluations to create training datasets:
  1. Collect Examples: Build a corpus of high-quality evaluations
  2. Export to JSONL: Use the bulk export feature
  3. Prepare Dataset: Format according to your model’s requirements
  4. Fine-Tune: Use the dataset to improve model performance
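As an illustration of the final step, the sketch below uploads an exported JSONL file and starts a fine-tuning job with the OpenAI Python SDK. The base model name is a placeholder, and the preference (DPO) method argument is an assumption about how preferred/non-preferred pairs are consumed; consult your provider's fine-tuning documentation for the exact parameters.

# Sketch: upload the exported JSONL and start a fine-tuning job.
# Model name is illustrative; the "method" argument for preference (DPO)
# tuning is an assumption -- check your provider's documentation.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("evaluations.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # illustrative base model
    method={"type": "dpo"},          # preferred/non-preferred pairs -> preference tuning
)
print(job.id, job.status)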

Quality Monitoring

Track response quality over time:
  • Monitor status distribution (Outstanding vs. Unsatisfactory)
  • Identify patterns in problematic responses
  • Track improvement after training or process changes

Team Training

Use evaluations for human agent training:
  • Share examples of outstanding responses
  • Highlight common mistakes to avoid
  • Create training materials from real scenarios

Troubleshooting

Evaluations Not Appearing

  • Ensure you’re logged into the correct organization
  • Check that filters aren’t hiding evaluations
  • Refresh the dashboard

Export Errors

  • Verify all required fields are filled
  • Check for valid JSON in the tools field
  • Ensure evaluations have the required data

Slow Dashboard Performance

  • Use filters to reduce displayed evaluations
  • Export in smaller batches
  • Clear the browser cache if needed

Next Steps
