Introduction

StateSet Synthetic Data Studio is a powerful platform for generating realistic, diverse synthetic data at scale. Whether you're training AI agents, testing systems, or building demos, our synthetic data engine creates production-quality data that preserves the statistical properties of real data while ensuring privacy compliance.

Why Synthetic Data?

  • Privacy compliant: generate data without exposing real customer information
  • Unlimited scale: create millions of records on demand for any use case
  • Perfect testing: exercise edge cases and scenarios that are rare in production data

Getting Started

Prerequisites

  1. StateSet account with Synthetic Data Studio access
  2. API key from your dashboard
  3. Node.js 18+, Python 3.8+, or any HTTP client

Base Configuration

# Development
export SYNTHETIC_DATA_API="http://localhost:8000"

# Production
export SYNTHETIC_DATA_API="https://studio.stateset.app"
export STATESET_API_KEY="your_api_key_here"
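The code examples throughout this guide reference two constants, `API_BASE` and `API_KEY`. A minimal Node 18+ sketch that derives them from the environment variables above (adjust for your runtime):

```javascript
// Resolve the base URL and API key that the examples in this guide assume.
// Falls back to the local development server when the env vars are unset.
const API_BASE = process.env.SYNTHETIC_DATA_API || 'http://localhost:8000';
const API_KEY = process.env.STATESET_API_KEY || '';

if (!API_KEY && !API_BASE.includes('localhost')) {
  console.warn('STATESET_API_KEY is not set; production requests will fail');
}
```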

Core Features

1. E-commerce Customer Generation

Generate realistic customer profiles with comprehensive demographic, behavioral, and predictive data.

Quick Start

// Generate 1000 diverse customer profiles
const generateCustomers = async () => {
  const formData = new FormData();
  formData.append('project_id', 'my-ecommerce-project');
  formData.append('num_customers', '1000');
  formData.append('output_format', 'json');
  
  const response = await fetch(`${API_BASE}/synthdata/generate-ecommerce-customers`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_KEY}`
    },
    body: formData
  });
  
  const job = await response.json();
  console.log(`Job started: ${job.job_id}`);
  
  // Monitor progress via WebSocket (derive ws:// or wss:// from API_BASE)
  const ws = new WebSocket(`${API_BASE.replace(/^http/, 'ws')}/ws/jobs/${job.job_id}`);
  
  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    console.log(`Progress: ${data.progress}% - ${data.message}`);
  };
  
  return job;
};

Customer Profile Schema

Each generated customer includes fields such as:

{
  customer_id: string,
  personal_info: {
    first_name: string,
    last_name: string,
    gender: "male" | "female" | "other",
    date_of_birth: string,
    username: string,
    avatar_url: string
  }
}
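Before feeding generated records into downstream systems, it can help to sanity-check their shape. A minimal validator for the excerpt above (illustrative only; the full schema contains more fields than shown):

```javascript
// Minimal shape check for a generated customer record. Illustrative only:
// the full schema contains more fields than the excerpt above shows.
function isValidCustomer(record) {
  const info = record?.personal_info;
  return Boolean(
    typeof record?.customer_id === 'string' &&
    info &&
    typeof info.first_name === 'string' &&
    typeof info.last_name === 'string' &&
    ['male', 'female', 'other'].includes(info.gender) &&
    !Number.isNaN(Date.parse(info.date_of_birth))
  );
}
```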

Advanced Customer Generation

// Generate segment-specific customers with custom parameters
async function generateSegmentedCustomers() {
  const segments = [
    { 
      segment: 'premium_buyer', 
      count: 200,
      config: {
        min_income: 100000,
        min_order_value: 150,
        interests: ['luxury', 'fashion', 'technology']
      }
    },
    {
      segment: 'value_seeker',
      count: 500,
      config: {
        price_sensitivity: 'high',
        promotion_responsiveness: 0.9
      }
    }
  ];
  
  const jobs = [];
  
  for (const segment of segments) {
    const formData = new FormData();
    formData.append('project_id', 'segmented-customers');
    formData.append('num_customers', segment.count.toString());
    formData.append('segment_filter', segment.segment);
    formData.append('custom_config', JSON.stringify(segment.config));
    
    const response = await fetch(`${API_BASE}/synthdata/generate-ecommerce-customers`, {
      method: 'POST',
      headers: { 'Authorization': `Bearer ${API_KEY}` },
      body: formData
    });
    
    jobs.push(await response.json());
  }
  
  return jobs;
}

2. QA Pair Generation

Create high-quality question-answer pairs from documents for training conversational AI.

Generate QA Pairs

import fs from 'node:fs/promises';
import path from 'node:path';

async function generateQAPairs(documentPath, options = {}) {
  const formData = new FormData();
  formData.append('project_id', 'qa-generation');
  // input_file expects the file contents (a File/Blob), not a path string
  const fileBytes = await fs.readFile(documentPath);
  formData.append('input_file', new Blob([fileBytes]), path.basename(documentPath));
  formData.append('qa_type', options.qaType || 'qa'); // qa, cot, summary, extraction
  formData.append('num_pairs', String(options.numPairs || 100));
  formData.append('verbose', String(options.verbose ?? false));
  
  const response = await fetch(`${API_BASE}/synthdata/create-qa`, {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${API_KEY}` },
    body: formData
  });
  
  return response.json();
}

// Generate different types of QA pairs (top-level await requires an ES module)
const qaTypes = {
  standard: await generateQAPairs('/docs/product-manual.pdf', {
    qaType: 'qa',
    numPairs: 200
  }),
  
  chainOfThought: await generateQAPairs('/docs/technical-guide.pdf', {
    qaType: 'cot',
    numPairs: 100
  }),
  
  summaries: await generateQAPairs('/docs/company-reports.pdf', {
    qaType: 'summary',
    numPairs: 50
  }),
  
  extraction: await generateQAPairs('/docs/contracts.pdf', {
    qaType: 'extraction',
    numPairs: 150
  })
};

Curate QA Pairs

Apply quality scoring and filtering to ensure high-quality training data:

async function curateQAPairs(inputFile, qualityThreshold = 8.0) {
  const formData = new FormData();
  formData.append('project_id', 'qa-curation');
  formData.append('input_file', inputFile);
  formData.append('threshold', qualityThreshold.toString());
  formData.append('batch_size', '100');
  
  const response = await fetch(`${API_BASE}/synthdata/curate-qa`, {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${API_KEY}` },
    body: formData
  });
  
  const job = await response.json();
  
  // Wait for curation to complete
  const result = await waitForJob(job.job_id);
  
  console.log(`Curated ${result.kept_pairs} high-quality pairs`);
  console.log(`Filtered out ${result.removed_pairs} low-quality pairs`);
  
  return result;
}
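The snippet above relies on a `waitForJob` helper. One possible implementation that polls the job status endpoint until the job completes or fails (the `status`, `result`, and `error` field names here are assumptions; match them to your API's actual job schema):

```javascript
// Poll the job status endpoint until the job completes, fails, or times out.
// Field names (status, result, error) are assumptions about the job schema.
async function waitForJob(jobId, { intervalMs = 2000, timeoutMs = 600000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const response = await fetch(`${API_BASE}/jobs/${jobId}`, {
      headers: { 'Authorization': `Bearer ${API_KEY}` }
    });
    const status = await response.json();
    if (status.status === 'completed') return status.result ?? status;
    if (status.status === 'failed') throw new Error(status.error || `Job ${jobId} failed`);
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} timed out after ${timeoutMs} ms`);
}
```

For long-running generation jobs, prefer the WebSocket updates shown earlier and keep polling as a fallback.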

3. Fine-Tuning Data Preparation

Prepare and format data for fine-tuning language models:

class FineTuningDataPipeline {
  constructor(apiClient) {
    this.client = apiClient;
  }
  
  async prepareTrainingData(rawData, config) {
    // Step 1: Generate synthetic examples if needed
    if (config.augmentWithSynthetic) {
      const synthetic = await this.generateSyntheticExamples(
        rawData,
        config.syntheticRatio
      );
      rawData = [...rawData, ...synthetic];
    }
    
    // Step 2: Format for fine-tuning
    const formatted = this.formatForFineTuning(rawData, config.model);
    
    // Step 3: Split into train/validation
    const { train, validation } = this.splitData(formatted, config.validationSplit);
    
    // Step 4: Upload files
    const trainFile = await this.uploadTrainingFile(train);
    const validationFile = await this.uploadTrainingFile(validation);
    
    // Step 5: Create fine-tuning job
    const job = await this.createFineTuningJob({
      training_file: trainFile.id,
      validation_file: validationFile.id,
      model: config.model,
      hyperparameters: config.hyperparameters
    });
    
    return job;
  }
  
  formatForFineTuning(data, model) {
    return data.map(item => {
      if (model.includes('gpt')) {
        return {
          messages: [
            { role: 'system', content: item.system || 'You are a helpful assistant.' },
            { role: 'user', content: item.prompt },
            { role: 'assistant', content: item.completion }
          ]
        };
      }
      // Add other model formats as needed
      return item;
    });
  }
  
  async uploadTrainingFile(data) {
    const jsonl = data.map(item => JSON.stringify(item)).join('\n');
    const blob = new Blob([jsonl], { type: 'application/jsonl' });
    const formData = new FormData();
    formData.append('file', blob, 'training_data.jsonl');
    
    const response = await fetch(`${API_BASE}/api/finetuning/upload-training-file`, {
      method: 'POST',
      headers: { 'Authorization': `Bearer ${API_KEY}` },
      body: formData
    });
    
    return response.json();
  }
}
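The pipeline above calls a `splitData` helper that is not shown. A simple sketch: shuffle with Fisher-Yates, then slice off a validation fraction.

```javascript
// Sketch of the splitData step used in the pipeline above: shuffle
// (Fisher-Yates), then carve off a validation fraction.
function splitData(data, validationSplit = 0.1) {
  const shuffled = [...data];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const validationSize = Math.floor(shuffled.length * validationSplit);
  return {
    validation: shuffled.slice(0, validationSize),
    train: shuffled.slice(validationSize)
  };
}
```

For reproducible splits, swap `Math.random` for a seeded generator so reruns produce the same train/validation partition.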

Advanced Use Cases

1. Multi-Modal Data Generation

Generate coordinated datasets across multiple data types:

class MultiModalDataGenerator {
  async generateEcommerceDataset(config) {
    const dataset = {
      customers: [],
      products: [],
      orders: [],
      reviews: [],
      support_tickets: []
    };
    
    // Step 1: Generate customers
    const customerJob = await this.generateCustomers(config.numCustomers);
    dataset.customers = await this.waitForJobCompletion(customerJob);
    
    // Step 2: Generate products based on customer interests
    const productJob = await this.generateProducts({
      count: config.numProducts,
      categories: this.extractCategories(dataset.customers)
    });
    dataset.products = await this.waitForJobCompletion(productJob);
    
    // Step 3: Generate realistic order history
    const orderJob = await this.generateOrders({
      customers: dataset.customers,
      products: dataset.products,
      timeRange: config.orderTimeRange
    });
    dataset.orders = await this.waitForJobCompletion(orderJob);
    
    // Step 4: Generate reviews based on orders
    const reviewJob = await this.generateReviews({
      orders: dataset.orders,
      sentiment_distribution: config.reviewSentiment
    });
    dataset.reviews = await this.waitForJobCompletion(reviewJob);
    
    // Step 5: Generate support tickets based on orders and reviews
    const ticketJob = await this.generateSupportTickets({
      orders: dataset.orders,
      reviews: dataset.reviews.filter(r => r.rating < 3),
      issue_probability: config.supportTicketRate
    });
    dataset.support_tickets = await this.waitForJobCompletion(ticketJob);
    
    return dataset;
  }
}

2. Time-Series Data Generation

Create realistic time-series data for analytics and forecasting:

async function generateTimeSeriesData(config) {
  const generator = new TimeSeriesGenerator({
    startDate: '2023-01-01',
    endDate: '2024-12-31',
    frequency: 'daily',
    metrics: [
      {
        name: 'daily_revenue',
        baseValue: 10000,
        trend: 0.002, // 0.2% daily growth
        seasonality: {
          weekly: { sunday: 0.7, saturday: 1.3 },
          monthly: { december: 1.8, january: 0.6 }
        },
        noise: 0.1
      },
      {
        name: 'customer_count',
        baseValue: 1000,
        trend: 0.001,
        correlation: { daily_revenue: 0.8 }
      }
    ]
  });
  
  const data = await generator.generate();
  
  // Add realistic anomalies
  const anomalies = [
    { date: '2023-11-24', metric: 'daily_revenue', multiplier: 3.5 }, // Black Friday
    { date: '2023-12-26', metric: 'daily_revenue', multiplier: 2.0 }, // Boxing Day
  ];
  
  return generator.injectAnomalies(data, anomalies);
}
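`TimeSeriesGenerator` is illustrative, but the underlying per-day arithmetic is simple: compound trend times seasonal factors plus bounded noise. A minimal sketch of one metric's daily value (the series start date and field names mirror the config above and are assumptions, not the generator's actual internals):

```javascript
// Minimal sketch of the per-day arithmetic assumed above:
//   value(t) = base * (1 + trend)^t * weeklyFactor * monthlyFactor * (1 + noise)
function dailyValue(metric, date, random = Math.random) {
  const seriesStart = new Date('2023-01-01'); // matches the config above
  const t = Math.round((date - seriesStart) / 86400000); // days since start
  const weekday = date.toLocaleDateString('en-US', { weekday: 'long', timeZone: 'UTC' }).toLowerCase();
  const month = date.toLocaleDateString('en-US', { month: 'long', timeZone: 'UTC' }).toLowerCase();
  const weeklyFactor = metric.seasonality?.weekly?.[weekday] ?? 1;
  const monthlyFactor = metric.seasonality?.monthly?.[month] ?? 1;
  const noise = (random() * 2 - 1) * (metric.noise ?? 0); // symmetric around 0
  return metric.baseValue * Math.pow(1 + metric.trend, t) * weeklyFactor * monthlyFactor * (1 + noise);
}
```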

3. Scenario Testing Data

Generate specific scenarios for testing edge cases:

class ScenarioDataGenerator {
  async generateTestScenarios() {
    const scenarios = {
      highValueCustomerChurn: await this.generateScenario({
        customerProfile: {
          lifetime_value: { min: 10000 },
          loyalty_points: { min: 5000 },
          order_count: { min: 50 }
        },
        behavior: {
          recent_activity: 'declining',
          support_tickets: 'increasing',
          satisfaction_trend: 'negative'
        },
        count: 100
      }),
      
      fraudulentPatterns: await this.generateScenario({
        customerProfile: {
          account_age_days: { max: 7 },
          shipping_addresses: { min: 3 },
          payment_methods: { min: 4 }
        },
        orderPatterns: {
          high_value_items: true,
          rush_shipping: true,
          different_billing_shipping: true
        },
        count: 50
      }),
      
      seasonalSurge: await this.generateScenario({
        timeframe: 'holiday_season',
        traffic_multiplier: 5,
        conversion_rate: 0.08,
        average_order_value: 1.5,
        support_ticket_rate: 2.0,
        count: 10000
      })
    };
    
    return scenarios;
  }
}

Monitoring & Analytics

Real-Time Progress Monitoring

class SyntheticDataMonitor {
  constructor(jobId) {
    this.jobId = jobId;
    this.metrics = {
      recordsGenerated: 0,
      qualityScore: 0,
      estimatedTimeRemaining: 0
    };
  }
  
  async monitor() {
    // WebSocket connection for real-time updates (derive ws:// or wss:// from API_BASE)
    const ws = new WebSocket(`${API_BASE.replace(/^http/, 'ws')}/ws/jobs/${this.jobId}`);
    
    ws.onmessage = (event) => {
      const update = JSON.parse(event.data);
      
      switch (update.type) {
        case 'progress':
          this.updateProgress(update);
          break;
        case 'quality_check':
          this.updateQuality(update);
          break;
        case 'completed':
          this.handleCompletion(update);
          clearInterval(this.statusInterval); // stop polling once the job finishes
          ws.close();
          break;
        case 'error':
          this.handleError(update);
          clearInterval(this.statusInterval);
          ws.close();
          break;
      }
    };
    
    // Periodic status checks via REST API
    this.statusInterval = setInterval(async () => {
      const status = await this.checkJobStatus();
      this.updateMetrics(status);
    }, 5000);
  }
  
  async checkJobStatus() {
    const response = await fetch(`${API_BASE}/jobs/${this.jobId}`, {
      headers: { 'Authorization': `Bearer ${API_KEY}` }
    });
    return response.json();
  }
}

Quality Metrics Dashboard

async function getDataQualityMetrics(projectId) {
  const response = await fetch(`${API_BASE}/projects/${projectId}/quality-metrics`, {
    headers: { 'Authorization': `Bearer ${API_KEY}` }
  });
  
  const metrics = await response.json();
  
  return {
    overall_quality_score: metrics.overall_score,
    data_distribution: {
      statistical_validity: metrics.distribution.ks_test_score,
      diversity_index: metrics.distribution.diversity,
      balance_score: metrics.distribution.balance
    },
    field_quality: metrics.fields.map(field => ({
      name: field.name,
      completeness: field.completeness,
      uniqueness: field.uniqueness,
      validity: field.validity,
      consistency: field.consistency
    })),
    recommendations: metrics.recommendations
  };
}

Best Practices

1. Data Generation Strategy

// Good: Incremental generation with validation
async function generateDataIncrementally(totalRecords, batchSize = 1000) {
  const batches = Math.ceil(totalRecords / batchSize);
  const generatedData = [];
  
  for (let i = 0; i < batches; i++) {
    const batch = await generateBatch({
      size: Math.min(batchSize, totalRecords - i * batchSize),
      offset: i * batchSize
    });
    
    // Validate each batch
    const validation = await validateBatch(batch);
    if (validation.isValid) {
      generatedData.push(...batch);
    } else {
      console.error(`Batch ${i} failed validation:`, validation.errors);
      // Retry or handle error
    }
    
    // Progress update
    console.log(`Generated ${generatedData.length}/${totalRecords} records`);
  }
  
  return generatedData;
}

// Bad: Generating all data at once
async function generateAllAtOnce(totalRecords) {
  return generateBatch({ size: totalRecords }); // May timeout or OOM
}

2. Quality Assurance

class DataQualityAssurance {
  async validateSyntheticData(data, requirements) {
    const validations = {
      schema: await this.validateSchema(data, requirements.schema),
      statistics: await this.validateStatistics(data, requirements.statistics),
      business_rules: await this.validateBusinessRules(data, requirements.rules),
      privacy: await this.validatePrivacy(data)
    };
    
    const report = {
      passed: Object.values(validations).every(v => v.passed),
      validations,
      recommendations: this.generateRecommendations(validations)
    };
    
    return report;
  }
  
  async validateStatistics(data, expectedStats) {
    const actualStats = calculateStatistics(data);
    const deviations = {};
    
    for (const [metric, expected] of Object.entries(expectedStats)) {
      const actual = actualStats[metric];
      const deviation = Math.abs(actual - expected) / expected;
      
      deviations[metric] = {
        expected,
        actual,
        deviation,
        acceptable: deviation < 0.1 // 10% tolerance
      };
    }
    
    return {
      passed: Object.values(deviations).every(d => d.acceptable),
      deviations
    };
  }
}
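The `validateStatistics` method above calls a `calculateStatistics` helper that is not shown. One possible sketch for a single numeric field (a real implementation would compute these per field and per metric name):

```javascript
// One possible calculateStatistics helper for the check above, sketched for
// a single numeric field.
function calculateStatistics(values) {
  const n = values.length;
  const mean = values.reduce((sum, v) => sum + v, 0) / n;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
  return {
    mean,
    std_dev: Math.sqrt(variance),
    min: Math.min(...values),
    max: Math.max(...values)
  };
}
```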

3. Performance Optimization

// Use streaming for large datasets
async function* streamSyntheticData(config) {
  const pageSize = 1000;
  let offset = 0;
  
  while (offset < config.total) {
    const response = await fetch(`${API_BASE}/synthdata/stream`, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        ...config,
        offset,
        limit: pageSize
      })
    });
    
    const data = await response.json();
    
    if (data.records.length === 0) break;
    
    yield data.records;
    offset += data.records.length;
  }
}

// Process data as it's generated
async function processStreamingData() {
  const stream = streamSyntheticData({ 
    type: 'customers', 
    total: 1000000 
  });
  
  for await (const batch of stream) {
    await processBatch(batch);
    console.log(`Processed ${batch.length} records`);
  }
}

Error Handling

Comprehensive Error Management

class SyntheticDataErrorHandler {
  async handleAPIError(error) {
    const errorHandlers = {
      RATE_LIMITED: async () => {
        // Assumes X-RateLimit-Reset holds seconds until the window resets
        const retryAfter = Number(error.headers['X-RateLimit-Reset']);
        await this.delay(retryAfter * 1000);
        return { retry: true };
      },
      
      INVALID_INPUT: () => {
        console.error('Invalid input:', error.detail);
        return { retry: false, fix: this.suggestInputFix(error) };
      },
      
      INTERNAL_ERROR: async () => {
        await this.reportError(error);
        return { retry: true, delay: 5000 };
      },
      
      SERVICE_UNAVAILABLE: () => {
        return { retry: true, delay: 30000, useBackup: true };
      }
    };
    
    const handler = errorHandlers[error.error_code] || errorHandlers.INTERNAL_ERROR;
    return handler();
  }
  
  suggestInputFix(error) {
    // Analyze error and suggest fixes
    const suggestions = {
      'missing_required_field': `Add required field: ${error.field}`,
      'invalid_format': `Expected format: ${error.expected_format}`,
      'value_out_of_range': `Value must be between ${error.min} and ${error.max}`
    };
    
    return suggestions[error.validation_error] || 'Check API documentation';
  }
}

Security & Compliance

Privacy-Preserving Generation

class PrivacyPreservingSynthData {
  async generateCompliantData(config) {
    const privacyRules = {
      // No real PII patterns
      email_format: 'synthetic_[hash]@example.com',
      phone_format: '555-0[random]',
      
      // Differential privacy for statistics
      differential_privacy: {
        epsilon: 1.0,
        delta: 1e-5
      },
      
      // K-anonymity for demographics
      k_anonymity: {
        k: 5,
        quasi_identifiers: ['age', 'zipcode', 'gender']
      }
    };
    
    const data = await this.generateWithPrivacy(config, privacyRules);
    
    // Validate compliance
    const compliance = await this.validateCompliance(data, {
      gdpr: true,
      ccpa: true,
      hipaa: config.industry === 'healthcare'
    });
    
    return {
      data,
      compliance_report: compliance
    };
  }
}
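The k-anonymity rule above can be verified directly: every combination of quasi-identifier values must appear in at least k records. A minimal check (a sketch; production systems usually also generalize or suppress values to reach k):

```javascript
// Sketch of a k-anonymity check matching the rule above: every combination
// of quasi-identifier values must appear in at least k records.
function satisfiesKAnonymity(records, quasiIdentifiers, k) {
  const counts = new Map();
  for (const record of records) {
    const key = quasiIdentifiers.map(q => String(record[q])).join('|');
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.values()].every(count => count >= k);
}
```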

Pricing & Limits

The free tier includes:

  • 10,000 records/month
  • Basic customer profiles
  • Standard QA generation
  • Community support

Next Steps

Pro Tip: Start with small batches to validate your data generation parameters, then scale up. Use preview endpoints to check data quality before generating large datasets.

For support and examples, visit our GitHub repository or contact support@stateset.com.