Overview

The Evaluations system enables you to assess, track, and improve the quality of AI-generated responses. By creating evaluations, you can build datasets for fine-tuning models, monitor performance trends, and ensure consistent high-quality customer interactions.

Key Benefits

  • Quality Assurance: Monitor and improve response quality
  • Model Training: Export evaluations as JSONL for fine-tuning
  • Performance Tracking: Analyze trends and identify areas for improvement
  • Team Insights: Understand response patterns across different support types

Evaluation Status Types

Each evaluation is assigned a status that reflects the quality of the response:

Outstanding

Exceptional responses that exceed expectations. These serve as gold-standard examples for training.

Satisfactory

Good responses that meet quality standards and properly address customer needs.

Needs Further Review

Responses requiring additional assessment or minor improvements before final classification.

Unsatisfactory

Responses with significant issues that don’t meet quality standards.

Code Red

Critical failures requiring immediate attention and remediation.

Creating Evaluations

Manual Evaluation Creation

To create a new evaluation manually:
  1. Navigate to the Evaluations Dashboard
  2. Click “Create New Evaluation”
  3. Fill in the evaluation details:
{
  "eval_name": "Customer Refund Request Handling",
  "eval_type": "Customer Service",
  "user_message": "I want to return my order #12345. It arrived damaged.",
  "preferred_output": "I'm sorry to hear your order arrived damaged. I'll help you with the return right away. I've initiated a return for order #12345 and you'll receive a prepaid shipping label via email within 24 hours. Once we receive the item, your refund will be processed within 3-5 business days.",
  "non_preferred_output": "You need to go to our website and fill out the return form.",
  "eval_status": "Outstanding",
  "description": "Exemplary handling of damaged product return with empathy and clear next steps"
}

Evaluation Types

Choose the appropriate type for your evaluation:
  • Customer Service: General customer inquiries and support
  • Technical Support: Technical issues and troubleshooting
  • Sales: Sales-related interactions and inquiries
  • Product Support: Product-specific questions and guidance
  • General: Other types of interactions

Building Effective Test Sets

To create robust evaluations, build test sets that represent real-world scenarios:
  • Golden Test Sets: Curate 50-100 high-quality examples with expected responses.
  • Diverse Scenarios: Include in-scope queries, out-of-scope small talk, and adversarial questions.
  • Synthetic Data: Generate variations using LLMs to expand coverage, including edge cases and bias tests.
  • Iteration: Continuously update based on production data and user feedback.
Use these test sets to ensure comprehensive evaluation coverage.
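For the synthetic-data point above, the sketch below uses the OpenAI Python SDK to paraphrase an existing golden example into new variations. The model name and prompt are illustrative, and any chat-completion API can be substituted.

# Illustrative sketch: expand a golden test set with paraphrased variations.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# any chat-completion API can be substituted.
from openai import OpenAI

client = OpenAI()

seed_message = "I want to return my order #12345. It arrived damaged."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {
            "role": "system",
            "content": "Rewrite the user's message in 5 different ways. "
                       "Vary tone and phrasing but keep the intent and the "
                       "order number. Return one variation per line.",
        },
        {"role": "user", "content": seed_message},
    ],
)

for variation in response.choices[0].message.content.splitlines():
    print(variation)

Review generated variations before adding them to a test set so that low-quality paraphrases don't dilute your golden examples.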

Managing Evaluations

Dashboard Features

The Evaluations Dashboard provides comprehensive tools for management.

Search and Filtering

  • Smart Search: Search by name, type, ticket ID, or description
  • Type Filter: Filter by evaluation type (Customer Service, Technical Support, etc.)
  • Status Filter: View evaluations by their quality status
  • Date Range: Filter by creation date (last 7/30/90 days or all time)
  • Sorting: Sort by date, name, status, or type in ascending/descending order

Bulk Operations

  • Select Multiple: Use checkboxes to select multiple evaluations
  • Bulk Export: Export selected evaluations to JSONL format
  • Bulk Delete: Remove multiple evaluations at once
  • Select All: Quickly select all visible evaluations

View Options

  • Table View: Traditional table layout for efficient scanning
  • Card View: Visual card layout for detailed previews
  • Analytics View: Detailed insights and trends (coming soon)

Types of Evaluation Methods

Enhance your evaluation process using different methods:

Human Evaluation

  • Direct feedback from users or experts
  • Best for nuanced quality assessment
  • Example: Rate response empathy on a 1-5 scale

LLM-as-Judge

  • Use another AI to evaluate outputs
  • Scalable for large datasets
  • Example: Prompt an LLM to score factual accuracy

Automated Checks

  • Programmatic checks for specific criteria
  • Ideal for objective metrics
  • Example: Verify response format or keyword presence

Combine these methods for comprehensive quality assurance; a minimal automated check is sketched below.
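As a concrete illustration of the automated approach, here is a minimal rule-based check in Python. It assumes responses are available as plain strings; the specific patterns (order ID, timeframe, length) are illustrative and should be adapted to your own criteria.

# Minimal sketch of a programmatic (rule-based) check on a response string.
# The patterns below are illustrative, not prescribed criteria.
import re

def check_response(text: str) -> dict:
    """Run simple objective checks on a response."""
    return {
        "mentions_order_id": bool(re.search(r"#\d+", text)),      # order reference present
        "gives_timeframe": bool(re.search(r"\d+[-]?\d*\s*(business )?days?", text)),
        "reasonable_length": 10 <= len(text.split()) <= 150,      # not too short, not too long
    }

result = check_response(
    "I've initiated a return for order #12345 and your refund will be "
    "processed within 3-5 business days."
)
print(result)  # {'mentions_order_id': True, 'gives_timeframe': True, 'reasonable_length': True}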

Performance Statistics

Monitor your evaluation metrics through dashboard cards:
  • Total Evaluations: Overall count of all evaluations
  • Outstanding: Count of exceptional responses
  • Needs Review: Evaluations requiring further assessment
  • This Week: Recent evaluation activity
Each metric includes trend indicators showing performance changes over time.
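If you prefer to track these numbers outside the dashboard, the counts can be recomputed from exported records. The sketch below assumes each record carries an eval_status field (as in the creation example above) and an ISO-8601 created_at timestamp; the created_at field is an assumption about the export schema.

# Sketch: recompute the dashboard metrics from exported evaluation records.
# Field names follow the examples in this guide and may differ in practice.
from collections import Counter
from datetime import datetime, timedelta, timezone

def summarize(evaluations: list[dict]) -> dict:
    now = datetime.now(timezone.utc)
    week_ago = now - timedelta(days=7)
    statuses = Counter(e["eval_status"] for e in evaluations)
    this_week = sum(
        1 for e in evaluations
        if datetime.fromisoformat(e["created_at"]) >= week_ago
    )
    return {
        "total": len(evaluations),
        "outstanding": statuses.get("Outstanding", 0),
        "needs_review": statuses.get("Needs Further Review", 0),
        "this_week": this_week,
    }

print(summarize([
    {"eval_status": "Outstanding", "created_at": "2024-05-01T12:00:00+00:00"},
]))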

Exporting for Model Training

JSONL Export Format

Evaluations can be exported in JSONL format for model fine-tuning:
{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "I want to return my order #12345. It arrived damaged."
      }
    ],
    "tools": [],
    "parallel_tool_calls": true
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "I'm sorry to hear your order arrived damaged..."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "You need to go to our website and fill out the return form."
    }
  ]
}
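If you maintain evaluations outside the dashboard, a small script can produce the same JSONL structure. The sketch below assumes records shaped like the creation example earlier in this guide (user_message, preferred_output, non_preferred_output); it is not part of the dashboard's export feature.

# Sketch: convert stored evaluation records into JSONL lines matching the
# export format above. Adjust field names if your records differ.
import json

def to_jsonl_record(evaluation: dict) -> str:
    record = {
        "input": {
            "messages": [{"role": "user", "content": evaluation["user_message"]}],
            "tools": evaluation.get("tools", []),
            "parallel_tool_calls": True,
        },
        "preferred_output": [
            {"role": "assistant", "content": evaluation["preferred_output"]}
        ],
        "non_preferred_output": [
            {"role": "assistant", "content": evaluation["non_preferred_output"]}
        ],
    }
    return json.dumps(record)

evaluations = [
    {
        "user_message": "I want to return my order #12345. It arrived damaged.",
        "preferred_output": "I'm sorry to hear your order arrived damaged...",
        "non_preferred_output": "You need to go to our website and fill out the return form.",
    },
]

with open("evaluations.jsonl", "w", encoding="utf-8") as f:
    for e in evaluations:
        f.write(to_jsonl_record(e) + "\n")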

Export Methods

Manual Export

  1. Navigate to the Export tab in the dashboard
  2. Select “Create New” mode
  3. Enter the evaluation details:
    • User Message
    • Preferred Output
    • Non-Preferred Output
    • Tools (optional, for function calling)
  4. Click “Export to JSONL”

Bulk Export

  1. Select evaluations on the dashboard using the checkboxes (or Select All)
  2. Click “Bulk Export” to download the selected evaluations in JSONL format

Advanced Export Options

For evaluations involving tool use or function calling:
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "process_return",
        "description": "Process a product return request",
        "parameters": {
          "type": "object",
          "properties": {
            "order_id": {
              "type": "string",
              "description": "The order ID to process return for"
            },
            "reason": {
              "type": "string",
              "description": "Reason for the return"
            }
          },
          "required": ["order_id", "reason"]
        }
      }
    }
  ]
}
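Because exports fail when the tools field contains invalid JSON (see Troubleshooting below), it can help to validate tool definitions before exporting. The helper below is an illustrative sketch, not a dashboard feature; it only checks JSON validity and a few structural expectations.

# Illustrative pre-export check that the tools field is valid JSON and each
# entry looks like a function-tool definition. Not part of the dashboard.
import json

def validate_tools(tools_text: str) -> list[str]:
    problems = []
    try:
        tools = json.loads(tools_text)
    except json.JSONDecodeError as err:
        return [f"Invalid JSON: {err}"]
    if not isinstance(tools, list):
        return ["Tools must be a JSON array"]
    for i, tool in enumerate(tools):
        if not isinstance(tool, dict):
            problems.append(f"Tool {i}: expected an object")
            continue
        if tool.get("type") != "function":
            problems.append(f"Tool {i}: missing or unsupported 'type'")
        if "name" not in tool.get("function", {}):
            problems.append(f"Tool {i}: function definition has no 'name'")
    return problems

print(validate_tools('[{"type": "function", "function": {"name": "process_return"}}]'))  # []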

Best Practices

Creating High-Quality Evaluations

  1. Clear Naming: Use descriptive names that indicate the scenario being evaluated.
     ✅ "Damaged Product Return - Empathetic Response"
     ❌ "Eval 1"
  2. Realistic Scenarios: Base evaluations on actual customer interactions and common use cases.
  3. Comprehensive Coverage: Include both ideal responses and common mistakes to avoid.
  4. Consistent Standards: Apply evaluation criteria consistently across similar interaction types.
  5. Regular Review: Periodically review and update evaluations to maintain relevance.

Evaluation Criteria

When assessing responses, consider:

Accuracy

  • Correct information provided
  • Proper understanding of the issue
  • Appropriate solution offered

Tone & Empathy

  • Professional and friendly tone
  • Empathy for customer situation
  • Appropriate level of formality

Completeness

  • All questions answered
  • Clear next steps provided
  • No missing information

Efficiency

  • Concise yet comprehensive
  • Direct problem resolution
  • Minimal back-and-forth needed

Advanced Evaluation Metrics

To further refine your evaluations, consider these advanced metrics tailored for AI-generated customer service responses:

Relevance

  • How well the response addresses the specific query
  • Avoidance of unnecessary information
  • Alignment with customer needs

Coherence

  • Logical flow of information
  • Consistent language and structure
  • Easy to follow reasoning

Helpfulness

  • Enables customer action
  • Provides value beyond basic information
  • Anticipates follow-up needs

Safety & Bias

  • Absence of harmful content
  • Fair and unbiased responses
  • Compliance with ethical guidelines
Incorporate these metrics into your evaluation rubrics for more comprehensive assessments.
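One way to apply these rubrics at scale is an LLM-as-judge pass that scores every criterion. The sketch below uses the OpenAI Python SDK; the model name, prompt wording, and 1-5 scale are illustrative assumptions rather than a prescribed scoring scheme.

# Sketch of an LLM-as-judge rubric combining the criteria and metrics above.
# The model name, prompt wording, and 1-5 scale are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = [
    "Accuracy", "Tone & Empathy", "Completeness", "Efficiency",
    "Relevance", "Coherence", "Helpfulness", "Safety & Bias",
]

def judge(user_message: str, response_text: str) -> dict:
    prompt = (
        "Score the assistant response against each criterion on a 1-5 scale. "
        f"Criteria: {', '.join(RUBRIC)}. "
        "Return a JSON object mapping each criterion to its score.\n\n"
        f"Customer message: {user_message}\n"
        f"Assistant response: {response_text}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request machine-readable scores
    )
    return json.loads(completion.choices[0].message.content)

scores = judge(
    "I want to return my order #12345. It arrived damaged.",
    "I'm sorry to hear that. I've initiated a return for order #12345.",
)
print(scores)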

Use Cases

Model Fine-Tuning

Export evaluations to create training datasets:
  1. Collect Examples: Build a corpus of high-quality evaluations
  2. Export to JSONL: Use the bulk export feature
  3. Prepare Dataset: Format according to your model’s requirements
  4. Fine-Tune: Use the dataset to improve model performance
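As an illustration of the final step, the sketch below uploads an exported JSONL file and starts a fine-tuning job with the OpenAI Python SDK. The base model name is a placeholder, and the preference (DPO) method argument is an assumption about how preferred/non-preferred pairs are consumed; consult your provider's fine-tuning documentation for the exact parameters.

# Sketch: upload the exported JSONL and start a fine-tuning job.
# Model name is illustrative; the "method" argument for preference (DPO)
# tuning is an assumption -- check your provider's documentation.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("evaluations.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # illustrative base model
    method={"type": "dpo"},          # preferred/non-preferred pairs -> preference tuning
)
print(job.id, job.status)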

Quality Monitoring

Track response quality over time:
  • Monitor status distribution (Outstanding vs. Unsatisfactory)
  • Identify patterns in problematic responses
  • Track improvement after training or process changes

Team Training

Use evaluations for human agent training:
  • Share examples of outstanding responses
  • Highlight common mistakes to avoid
  • Create training materials from real scenarios

Troubleshooting

Evaluations Not Appearing

  • Ensure you’re logged into the correct organization
  • Check that filters aren’t hiding evaluations
  • Refresh the dashboard

Export Errors

  • Verify all required fields are filled
  • Check for valid JSON in the tools field
  • Ensure evaluations have the required data

Slow Dashboard Performance

  • Use filters to reduce displayed evaluations
  • Export in smaller batches
  • Clear the browser cache if needed

Next Steps
