Evaluations Guide
Create and manage evaluations to improve AI response quality and train custom models
Overview
The Evaluations system enables you to assess, track, and improve the quality of AI-generated responses. By creating evaluations, you can build datasets for fine-tuning models, monitor performance trends, and ensure consistent high-quality customer interactions.
Key Benefits
- Quality Assurance: Monitor and improve response quality
- Model Training: Export evaluations as JSONL for fine-tuning
- Performance Tracking: Analyze trends and identify areas for improvement
- Team Insights: Understand response patterns across different support types
Evaluation Status Types
Each evaluation is assigned a status that reflects the quality of the response:
Outstanding
Exceptional responses that exceed expectations. These serve as gold-standard examples for training.
Satisfactory
Good responses that meet quality standards and properly address customer needs.
Needs Further Review
Responses requiring additional assessment or minor improvements before final classification.
Unsatisfactory
Responses with significant issues that don’t meet quality standards.
Code Red
Critical failures requiring immediate attention and remediation.
Creating Evaluations
Manual Evaluation Creation
To create a new evaluation manually:
- Navigate to the Evaluations Dashboard
- Click “Create New Evaluation”
- Fill in the evaluation details:
Evaluation Types
Choose the appropriate type for your evaluation:
- Customer Service: General customer inquiries and support
- Technical Support: Technical issues and troubleshooting
- Sales: Sales-related interactions and inquiries
- Product Support: Product-specific questions and guidance
- General: Other types of interactions
Building Effective Test Sets
To create robust evaluations, build test sets that represent real-world scenarios:
- Golden Test Sets: Curate 50-100 high-quality examples with expected responses.
- Diverse Scenarios: Include in-scope queries, out-of-scope small talk, and adversarial questions.
- Synthetic Data: Generate variations using LLMs to expand coverage, including edge cases and bias tests.
- Iteration: Continuously update based on production data and user feedback.
Use these test sets to ensure comprehensive evaluation coverage.
Managing Evaluations
Dashboard Features
The Evaluations Dashboard provides comprehensive tools for management:
Search and Filter
Search and Filter
- Smart Search: Search by name, type, ticket ID, or description
- Type Filter: Filter by evaluation type (Customer Service, Technical Support, etc.)
- Status Filter: View evaluations by their quality status
- Date Range: Filter by creation date (Last 7/30/90 days or all time)
- Sorting: Sort by date, name, status, or type in ascending/descending order
Bulk Operations
Bulk Operations
- Select Multiple: Use checkboxes to select multiple evaluations
- Bulk Export: Export selected evaluations to JSONL format
- Bulk Delete: Remove multiple evaluations at once
- Select All: Quickly select all visible evaluations
View Modes
View Modes
- Table View: Traditional table layout for efficient scanning
- Card View: Visual card layout for detailed preview
- Analytics View: Coming soon - detailed insights and trends
Types of Evaluation Methods
Enhance your evaluation process using different methods:
Human Evaluations
Human Evaluations
- Direct feedback from users or experts
- Best for nuanced quality assessment
- Example: Rate response empathy on a 1-5 scale
LLM-as-Judge
LLM-as-Judge
- Use another AI to evaluate outputs
- Scalable for large datasets
- Example: Prompt an LLM to score factual accuracy
Code-based Evaluations
Code-based Evaluations
- Programmatic checks for specific criteria
- Ideal for objective metrics
- Example: Verify response format or keyword presence
Combine these methods for comprehensive quality assurance.
Performance Statistics
Monitor your evaluation metrics through dashboard cards:
- Total Evaluations: Overall count of all evaluations
- Outstanding: Count of exceptional responses
- Needs Review: Evaluations requiring further assessment
- This Week: Recent evaluation activity
Each metric includes trend indicators showing performance changes over time.
Exporting for Model Training
JSONL Export Format
Evaluations can be exported in JSONL format for model fine-tuning:
Export Methods
- Navigate to the Export tab in the dashboard
- Select “Create New” mode
- Enter the evaluation details:
- User Message
- Preferred Output
- Non-Preferred Output
- Tools (optional, for function calling)
- Click “Export to JSONL”
- Navigate to the Export tab in the dashboard
- Select “Create New” mode
- Enter the evaluation details:
- User Message
- Preferred Output
- Non-Preferred Output
- Tools (optional, for function calling)
- Click “Export to JSONL”
- Navigate to the Export tab
- Select “From Existing Evals” mode
- Choose evaluations to export:
- Use checkboxes to select specific evaluations
- Click “Select All” to export all evaluations
- Click “Export X Evaluations”
Advanced Export Options
For evaluations involving tool use or function calling:
Best Practices
Creating High-Quality Evaluations
Clear Naming
Use descriptive names that indicate the scenario being evaluated
Realistic Scenarios
Base evaluations on actual customer interactions and common use cases
Comprehensive Coverage
Include both ideal responses and common mistakes to avoid
Consistent Standards
Apply evaluation criteria consistently across similar interaction types
Regular Review
Periodically review and update evaluations to maintain relevance
Evaluation Criteria
When assessing responses, consider:
Accuracy
- Correct information provided
- Proper understanding of the issue
- Appropriate solution offered
Tone & Empathy
- Professional and friendly tone
- Empathy for customer situation
- Appropriate level of formality
Completeness
- All questions answered
- Clear next steps provided
- No missing information
Efficiency
- Concise yet comprehensive
- Direct problem resolution
- Minimal back-and-forth needed
Advanced Evaluation Metrics
To further refine your evaluations, consider these advanced metrics tailored for AI-generated customer service responses:
Relevance
- How well the response addresses the specific query
- Avoidance of unnecessary information
- Alignment with customer needs
Coherence
- Logical flow of information
- Consistent language and structure
- Easy to follow reasoning
Helpfulness
- Enables customer action
- Provides value beyond basic information
- Anticipates follow-up needs
Safety & Bias
- Absence of harmful content
- Fair and unbiased responses
- Compliance with ethical guidelines
Incorporate these metrics into your evaluation rubrics for more comprehensive assessments.
Use Cases
Model Fine-Tuning
Export evaluations to create training datasets:
- Collect Examples: Build a corpus of high-quality evaluations
- Export to JSONL: Use the bulk export feature
- Prepare Dataset: Format according to your model’s requirements
- Fine-Tune: Use the dataset to improve model performance
Quality Monitoring
Track response quality over time:
- Monitor status distribution (Outstanding vs. Unsatisfactory)
- Identify patterns in problematic responses
- Track improvement after training or process changes
Team Training
Use evaluations for human agent training:
- Share examples of outstanding responses
- Highlight common mistakes to avoid
- Create training materials from real scenarios
Troubleshooting
Evaluations not appearing
Evaluations not appearing
- Ensure you’re logged into the correct organization
- Check filters aren’t hiding evaluations
- Refresh the dashboard
Export failing
Export failing
- Verify all required fields are filled
- Check for valid JSON in tools field
- Ensure evaluations have required data
Performance issues
Performance issues
- Use filters to reduce displayed evaluations
- Export in smaller batches
- Clear browser cache if needed