Synthetic Data Studio
Generate high-quality synthetic data for training, testing, and improving your StateSet agents
Introduction
StateSet Synthetic Data Studio is a platform for generating realistic, diverse synthetic data at scale. Whether you’re training AI agents, testing systems, or building demos, the synthetic data engine creates production-quality data that preserves the statistical properties of real data while ensuring privacy compliance.
Why Synthetic Data?
Privacy Compliant
Generate data without exposing real customer information
Unlimited Scale
Create millions of records on-demand for any use case
Perfect Testing
Test edge cases and scenarios that are rare in production data
Getting Started
Prerequisites
- StateSet account with Synthetic Data Studio access
- API key from your dashboard
- Node.js 18+, Python 3.8+, or any HTTP client
Base Configuration
# Development
export SYNTHETIC_DATA_API="http://localhost:8000"
# Production
export SYNTHETIC_DATA_API="https://studio.stateset.app"
export STATESET_API_KEY="your_api_key_here"
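The examples on this page refer to `API_BASE` and `API_KEY`. A minimal helper for reading them from the environment (the helper name and fallback values are illustrative, not part of the API):

```javascript
// Resolve API settings from the environment, with local-dev fallbacks.
// API_BASE and API_KEY are the names used by the examples below.
function loadConfig(env = process.env) {
  return {
    apiBase: env.SYNTHETIC_DATA_API || 'http://localhost:8000',
    apiKey: env.STATESET_API_KEY || '',
  };
}

const { apiBase: API_BASE, apiKey: API_KEY } = loadConfig();
```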
Core Features
1. E-commerce Customer Generation
Generate realistic customer profiles with comprehensive demographic, behavioral, and predictive data.
Quick Start
// Generate 1000 diverse customer profiles
const generateCustomers = async () => {
const formData = new FormData();
formData.append('project_id', 'my-ecommerce-project');
formData.append('num_customers', '1000');
formData.append('output_format', 'json');
const response = await fetch(`${API_BASE}/synthdata/generate-ecommerce-customers`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${API_KEY}`
},
body: formData
});
const job = await response.json();
console.log(`Job started: ${job.job_id}`);
// Monitor progress via WebSocket
const ws = new WebSocket(`ws://localhost:8000/ws/jobs/${job.job_id}`); // use your deployment's wss:// host outside local development
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
console.log(`Progress: ${data.progress}% - ${data.message}`);
};
return job;
};
Customer Profile Schema
Each generated customer includes:
{
customer_id: string,
personal_info: {
first_name: string,
last_name: string,
gender: "male" | "female" | "other",
date_of_birth: string,
username: string,
avatar_url: string
}
}
{
demographics: {
customer_segment: "budget_conscious" | "value_seeker" | "premium_buyer" | "luxury_enthusiast",
income_range: { min: number, max: number },
occupation: string,
education_level: string,
interests: string[],
household_size: number
}
}
{
behavioral_data: {
preferred_device: string,
avg_session_duration: number,
preferred_shopping_time: string,
marketing_opt_in: {
email: boolean,
sms: boolean,
push_notifications: boolean
}
}
}
{
predictive_scores: {
lifetime_value_prediction: number,
churn_probability: number,
next_purchase_probability: number,
fraud_risk_score: number,
recommendation_responsiveness: number
}
}
Advanced Customer Generation
// Generate segment-specific customers with custom parameters
async function generateSegmentedCustomers() {
const segments = [
{
segment: 'premium_buyer',
count: 200,
config: {
min_income: 100000,
min_order_value: 150,
interests: ['luxury', 'fashion', 'technology']
}
},
{
segment: 'value_seeker',
count: 500,
config: {
price_sensitivity: 'high',
promotion_responsiveness: 0.9
}
}
];
const jobs = [];
for (const segment of segments) {
const formData = new FormData();
formData.append('project_id', 'segmented-customers');
formData.append('num_customers', segment.count.toString());
formData.append('segment_filter', segment.segment);
formData.append('custom_config', JSON.stringify(segment.config));
const response = await fetch(`${API_BASE}/synthdata/generate-ecommerce-customers`, {
method: 'POST',
headers: { 'Authorization': `Bearer ${API_KEY}` },
body: formData
});
jobs.push(await response.json());
}
return jobs;
}
2. QA Pair Generation
Create high-quality question-answer pairs from documents for training conversational AI.
Generate QA Pairs
async function generateQAPairs(documentPath, options = {}) {
const formData = new FormData();
formData.append('project_id', 'qa-generation');
formData.append('input_file', documentPath); // in the browser, append a File/Blob instead of a path string
formData.append('qa_type', options.qaType || 'qa'); // qa, cot, summary, extraction
formData.append('num_pairs', options.numPairs || '100');
formData.append('verbose', options.verbose || 'false');
const response = await fetch(`${API_BASE}/synthdata/create-qa`, {
method: 'POST',
headers: { 'Authorization': `Bearer ${API_KEY}` },
body: formData
});
return response.json();
}
// Generate different types of QA pairs
const qaTypes = {
standard: await generateQAPairs('/docs/product-manual.pdf', {
qaType: 'qa',
numPairs: 200
}),
chainOfThought: await generateQAPairs('/docs/technical-guide.pdf', {
qaType: 'cot',
numPairs: 100
}),
summaries: await generateQAPairs('/docs/company-reports.pdf', {
qaType: 'summary',
numPairs: 50
}),
extraction: await generateQAPairs('/docs/contracts.pdf', {
qaType: 'extraction',
numPairs: 150
})
};
Curate QA Pairs
Apply quality scoring and filtering to ensure high-quality training data:
async function curateQAPairs(inputFile, qualityThreshold = 8.0) {
const formData = new FormData();
formData.append('project_id', 'qa-curation');
formData.append('input_file', inputFile);
formData.append('threshold', qualityThreshold.toString());
formData.append('batch_size', '100');
const response = await fetch(`${API_BASE}/synthdata/curate-qa`, {
method: 'POST',
headers: { 'Authorization': `Bearer ${API_KEY}` },
body: formData
});
const job = await response.json();
// Wait for curation to complete
const result = await waitForJob(job.job_id);
console.log(`Curated ${result.kept_pairs} high-quality pairs`);
console.log(`Filtered out ${result.removed_pairs} low-quality pairs`);
return result;
}
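The snippet above relies on a `waitForJob` helper. A minimal polling sketch, assuming the GET `/jobs/{id}` status endpoint used elsewhere on this page and a `status` field of `completed`/`failed` (the exact response shape may differ):

```javascript
const API_BASE = process.env.SYNTHETIC_DATA_API || 'http://localhost:8000';
const API_KEY = process.env.STATESET_API_KEY || '';

// Poll the job status endpoint until the job completes, fails, or times out.
// `fetchFn` is injectable so the helper can be exercised without a server.
async function waitForJob(jobId, { intervalMs = 2000, timeoutMs = 600000, fetchFn = fetch } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const response = await fetchFn(`${API_BASE}/jobs/${jobId}`, {
      headers: { 'Authorization': `Bearer ${API_KEY}` }
    });
    const job = await response.json();
    if (job.status === 'completed') return job;
    if (job.status === 'failed') throw new Error(`Job ${jobId} failed: ${job.error}`);
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} timed out after ${timeoutMs}ms`);
}
```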
3. Fine-Tuning Data Preparation
Prepare and format data for fine-tuning language models:
class FineTuningDataPipeline {
constructor(apiClient) {
this.client = apiClient;
}
async prepareTrainingData(rawData, config) {
// Step 1: Generate synthetic examples if needed
if (config.augmentWithSynthetic) {
const synthetic = await this.generateSyntheticExamples(
rawData,
config.syntheticRatio
);
rawData = [...rawData, ...synthetic];
}
// Step 2: Format for fine-tuning
const formatted = this.formatForFineTuning(rawData, config.model);
// Step 3: Split into train/validation
const { train, validation } = this.splitData(formatted, config.validationSplit);
// Step 4: Upload files
const trainFile = await this.uploadTrainingFile(train);
const validationFile = await this.uploadTrainingFile(validation);
// Step 5: Create fine-tuning job
const job = await this.createFineTuningJob({
training_file: trainFile.id,
validation_file: validationFile.id,
model: config.model,
hyperparameters: config.hyperparameters
});
return job;
}
formatForFineTuning(data, model) {
return data.map(item => {
if (model.includes('gpt')) {
return {
messages: [
{ role: 'system', content: item.system || 'You are a helpful assistant.' },
{ role: 'user', content: item.prompt },
{ role: 'assistant', content: item.completion }
]
};
}
// Add other model formats as needed
return item;
});
}
async uploadTrainingFile(data) {
const jsonl = data.map(item => JSON.stringify(item)).join('\n');
const blob = new Blob([jsonl], { type: 'application/jsonl' });
const formData = new FormData();
formData.append('file', blob, 'training_data.jsonl');
const response = await fetch(`${API_BASE}/api/finetuning/upload-training-file`, {
method: 'POST',
headers: { 'Authorization': `Bearer ${API_KEY}` },
body: formData
});
return response.json();
}
}
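The pipeline also calls a `splitData` helper that is not shown above. A simple sketch, assuming `validationSplit` is a ratio such as 0.1: shuffle a copy of the data, then carve off the validation fraction.

```javascript
// Shuffle a copy of the data (Fisher-Yates), then split off a validation set.
function splitData(data, validationSplit = 0.1) {
  const shuffled = [...data];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const validationSize = Math.floor(shuffled.length * validationSplit);
  return {
    validation: shuffled.slice(0, validationSize),
    train: shuffled.slice(validationSize)
  };
}
```

Shuffling before the split avoids leaking any ordering in the raw data (for example, all synthetic examples appended at the end) into the validation set.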
Advanced Use Cases
1. Multi-Modal Data Generation
Generate coordinated datasets across multiple data types:
class MultiModalDataGenerator {
async generateEcommerceDataset(config) {
const dataset = {
customers: [],
products: [],
orders: [],
reviews: [],
support_tickets: []
};
// Step 1: Generate customers
const customerJob = await this.generateCustomers(config.numCustomers);
dataset.customers = await this.waitForJobCompletion(customerJob);
// Step 2: Generate products based on customer interests
const productJob = await this.generateProducts({
count: config.numProducts,
categories: this.extractCategories(dataset.customers)
});
dataset.products = await this.waitForJobCompletion(productJob);
// Step 3: Generate realistic order history
const orderJob = await this.generateOrders({
customers: dataset.customers,
products: dataset.products,
timeRange: config.orderTimeRange
});
dataset.orders = await this.waitForJobCompletion(orderJob);
// Step 4: Generate reviews based on orders
const reviewJob = await this.generateReviews({
orders: dataset.orders,
sentiment_distribution: config.reviewSentiment
});
dataset.reviews = await this.waitForJobCompletion(reviewJob);
// Step 5: Generate support tickets based on orders and reviews
const ticketJob = await this.generateSupportTickets({
orders: dataset.orders,
reviews: dataset.reviews.filter(r => r.rating < 3),
issue_probability: config.supportTicketRate
});
dataset.support_tickets = await this.waitForJobCompletion(ticketJob);
return dataset;
}
}
2. Time-Series Data Generation
Create realistic time-series data for analytics and forecasting:
async function generateTimeSeriesData(config) {
const generator = new TimeSeriesGenerator({
startDate: '2023-01-01',
endDate: '2024-12-31',
frequency: 'daily',
metrics: [
{
name: 'daily_revenue',
baseValue: 10000,
trend: 0.002, // 0.2% daily growth
seasonality: {
weekly: { sunday: 0.7, saturday: 1.3 },
monthly: { december: 1.8, january: 0.6 }
},
noise: 0.1
},
{
name: 'customer_count',
baseValue: 1000,
trend: 0.001,
correlation: { daily_revenue: 0.8 }
}
]
});
const data = await generator.generate();
// Add realistic anomalies
const anomalies = [
{ date: '2023-11-24', metric: 'daily_revenue', multiplier: 3.5 }, // Black Friday
{ date: '2023-12-26', metric: 'daily_revenue', multiplier: 2.0 }, // Boxing Day
];
return generator.injectAnomalies(data, anomalies);
}
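`TimeSeriesGenerator` is shown above as an interface; the core of the model it implies can be sketched as compound trend, times a seasonality factor, times multiplicative noise (the function name and formula here are illustrative):

```javascript
// One metric's value on a given day: base value compounded by the daily trend,
// scaled by a weekday seasonality factor, with optional multiplicative noise.
// `rng` is injectable so the sketch stays deterministic when noise is used.
function dailyValue(dayIndex, date, config, rng = Math.random) {
  const weekdays = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday'];
  const seasonal = config.seasonality?.weekly?.[weekdays[date.getDay()]] ?? 1;
  const trended = config.baseValue * Math.pow(1 + config.trend, dayIndex);
  const noiseFactor = 1 + (config.noise || 0) * (rng() * 2 - 1);
  return trended * seasonal * noiseFactor;
}
```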
3. Scenario Testing Data
Generate specific scenarios for testing edge cases:
class ScenarioDataGenerator {
async generateTestScenarios() {
const scenarios = {
highValueCustomerChurn: await this.generateScenario({
customerProfile: {
lifetime_value: { min: 10000 },
loyalty_points: { min: 5000 },
order_count: { min: 50 }
},
behavior: {
recent_activity: 'declining',
support_tickets: 'increasing',
satisfaction_trend: 'negative'
},
count: 100
}),
fraudulentPatterns: await this.generateScenario({
customerProfile: {
account_age_days: { max: 7 },
shipping_addresses: { min: 3 },
payment_methods: { min: 4 }
},
orderPatterns: {
high_value_items: true,
rush_shipping: true,
different_billing_shipping: true
},
count: 50
}),
seasonalSurge: await this.generateScenario({
timeframe: 'holiday_season',
traffic_multiplier: 5,
conversion_rate: 0.08,
average_order_value: 1.5,
support_ticket_rate: 2.0,
count: 10000
})
};
return scenarios;
}
}
Monitoring & Analytics
Real-Time Progress Monitoring
class SyntheticDataMonitor {
constructor(jobId) {
this.jobId = jobId;
this.metrics = {
recordsGenerated: 0,
qualityScore: 0,
estimatedTimeRemaining: 0
};
}
async monitor() {
// WebSocket connection for real-time updates
const ws = new WebSocket(`ws://localhost:8000/ws/jobs/${this.jobId}`);
ws.onmessage = (event) => {
const update = JSON.parse(event.data);
switch (update.type) {
case 'progress':
this.updateProgress(update);
break;
case 'quality_check':
this.updateQuality(update);
break;
case 'completed':
this.handleCompletion(update);
break;
case 'error':
this.handleError(update);
break;
}
};
// Periodic status checks via REST API
this.statusInterval = setInterval(async () => {
const status = await this.checkJobStatus();
this.updateMetrics(status);
}, 5000);
}
async checkJobStatus() {
const response = await fetch(`${API_BASE}/jobs/${this.jobId}`, {
headers: { 'Authorization': `Bearer ${API_KEY}` }
});
return response.json();
}
}
Quality Metrics Dashboard
async function getDataQualityMetrics(projectId) {
const response = await fetch(`${API_BASE}/projects/${projectId}/quality-metrics`, {
headers: { 'Authorization': `Bearer ${API_KEY}` }
});
const metrics = await response.json();
return {
overall_quality_score: metrics.overall_score,
data_distribution: {
statistical_validity: metrics.distribution.ks_test_score,
diversity_index: metrics.distribution.diversity,
balance_score: metrics.distribution.balance
},
field_quality: metrics.fields.map(field => ({
name: field.name,
completeness: field.completeness,
uniqueness: field.uniqueness,
validity: field.validity,
consistency: field.consistency
})),
recommendations: metrics.recommendations
};
}
Best Practices
1. Data Generation Strategy
// Good: Incremental generation with validation
async function generateDataIncrementally(totalRecords, batchSize = 1000) {
const batches = Math.ceil(totalRecords / batchSize);
const generatedData = [];
for (let i = 0; i < batches; i++) {
const batch = await generateBatch({
size: Math.min(batchSize, totalRecords - i * batchSize),
offset: i * batchSize
});
// Validate each batch
const validation = await validateBatch(batch);
if (validation.isValid) {
generatedData.push(...batch);
} else {
console.error(`Batch ${i} failed validation:`, validation.errors);
// Retry or handle error
}
// Progress update
console.log(`Generated ${generatedData.length}/${totalRecords} records`);
}
return generatedData;
}
// Bad: Generating all data at once
async function generateAllAtOnce(totalRecords) {
return generateBatch({ size: totalRecords }); // May timeout or OOM
}
2. Quality Assurance
class DataQualityAssurance {
async validateSyntheticData(data, requirements) {
const validations = {
schema: await this.validateSchema(data, requirements.schema),
statistics: await this.validateStatistics(data, requirements.statistics),
business_rules: await this.validateBusinessRules(data, requirements.rules),
privacy: await this.validatePrivacy(data)
};
const report = {
passed: Object.values(validations).every(v => v.passed),
validations,
recommendations: this.generateRecommendations(validations)
};
return report;
}
async validateStatistics(data, expectedStats) {
const actualStats = calculateStatistics(data);
const deviations = {};
for (const [metric, expected] of Object.entries(expectedStats)) {
const actual = actualStats[metric];
const deviation = Math.abs(actual - expected) / expected;
deviations[metric] = {
expected,
actual,
deviation,
acceptable: deviation < 0.1 // 10% tolerance
};
}
return {
passed: Object.values(deviations).every(d => d.acceptable),
deviations
};
}
}
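`validateStatistics` depends on a `calculateStatistics` helper. A minimal sketch that computes per-field means and standard deviations for numeric fields (the platform's real metric set is richer; the key naming here is an assumption):

```javascript
// Compute mean and population standard deviation for each numeric field,
// keyed as `<field>_mean` and `<field>_std`.
function calculateStatistics(data) {
  const stats = {};
  if (data.length === 0) return stats;
  const numericFields = Object.keys(data[0]).filter(k => typeof data[0][k] === 'number');
  for (const field of numericFields) {
    const values = data.map(row => row[field]);
    const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
    const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
    stats[`${field}_mean`] = mean;
    stats[`${field}_std`] = Math.sqrt(variance);
  }
  return stats;
}
```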
3. Performance Optimization
// Use streaming for large datasets
async function* streamSyntheticData(config) {
const pageSize = 1000;
let offset = 0;
while (offset < config.total) {
const response = await fetch(`${API_BASE}/synthdata/stream`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
...config,
offset,
limit: pageSize
})
});
const data = await response.json();
if (data.records.length === 0) break;
yield data.records;
offset += data.records.length;
}
}
// Process data as it's generated
async function processStreamingData() {
const stream = streamSyntheticData({
type: 'customers',
total: 1000000
});
for await (const batch of stream) {
await processBatch(batch);
console.log(`Processed ${batch.length} records`);
}
}
Error Handling
Comprehensive Error Management
class SyntheticDataErrorHandler {
async handleAPIError(error) {
const errorHandlers = {
RATE_LIMITED: async () => {
const retryAfter = error.headers['X-RateLimit-Reset'];
await this.delay(retryAfter * 1000);
return { retry: true };
},
INVALID_INPUT: () => {
console.error('Invalid input:', error.detail);
return { retry: false, fix: this.suggestInputFix(error) };
},
INTERNAL_ERROR: async () => {
await this.reportError(error);
return { retry: true, delay: 5000 };
},
SERVICE_UNAVAILABLE: () => {
return { retry: true, delay: 30000, useBackup: true };
}
};
const handler = errorHandlers[error.error_code] || errorHandlers.INTERNAL_ERROR;
return handler();
}
delay(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
async reportError(error) {
// Hook for your error tracking; a console fallback keeps the example runnable
console.error('Synthetic data API error:', error);
}
suggestInputFix(error) {
// Analyze error and suggest fixes
const suggestions = {
'missing_required_field': `Add required field: ${error.field}`,
'invalid_format': `Expected format: ${error.expected_format}`,
'value_out_of_range': `Value must be between ${error.min} and ${error.max}`
};
return suggestions[error.validation_error] || 'Check API documentation';
}
}
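One way to wire the handler into a request loop (a sketch; the attempt limit and backoff defaults are illustrative):

```javascript
// Retry a request according to the handler's decision: stop when the handler
// says not to retry, otherwise wait the suggested delay and try again.
async function withRetries(requestFn, handler, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await requestFn();
    } catch (error) {
      lastError = error;
      const decision = await handler.handleAPIError(error);
      if (!decision.retry || attempt === maxAttempts) throw error;
      if (decision.delay) await new Promise(resolve => setTimeout(resolve, decision.delay));
    }
  }
  throw lastError;
}
```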
Security & Compliance
Privacy-Preserving Generation
class PrivacyPreservingSynthData {
async generateCompliantData(config) {
const privacyRules = {
// No real PII patterns
email_format: 'synthetic_[hash]@example.com',
phone_format: '555-0[random]',
// Differential privacy for statistics
differential_privacy: {
epsilon: 1.0,
delta: 1e-5
},
// K-anonymity for demographics
k_anonymity: {
k: 5,
quasi_identifiers: ['age', 'zipcode', 'gender']
}
};
const data = await this.generateWithPrivacy(config, privacyRules);
// Validate compliance
const compliance = await this.validateCompliance(data, {
gdpr: true,
ccpa: true,
hipaa: config.industry === 'healthcare'
});
return {
data,
compliance_report: compliance
};
}
}
Pricing & Limits
- 10,000 records/month
- Basic customer profiles
- Standard QA generation
- Community support

- 1M records/month
- Advanced profiles with ML scores
- Custom data schemas
- Priority support
- $299/month

- Unlimited records
- Custom data generators
- Private deployment option
- SLA guarantee
- Contact sales
Next Steps
API Reference
Complete API documentation with all endpoints
Data Schemas
Detailed schemas for all data types
Pro Tip: Start with small batches to validate your data generation parameters, then scale up. Use preview endpoints to check data quality before generating large datasets.
For support and examples, visit our GitHub repository or contact support@stateset.com.