Introduction
StateSet Synthetic Data Studio is a powerful platform for generating realistic, diverse synthetic data at scale. Whether you’re training AI agents, testing systems, or building demos, our synthetic data engine creates production-quality data that maintains statistical properties while ensuring privacy compliance.
Why Synthetic Data?
Privacy Compliant: Generate data without exposing real customer information
Unlimited Scale: Create millions of records on demand for any use case
Perfect Testing: Test edge cases and scenarios that are rare in production data
Getting Started
Prerequisites
StateSet account with Synthetic Data Studio access
API key from your dashboard
Node.js 18+, Python 3.8+, or any HTTP client
Base Configuration
Environment
# Development
export SYNTHETIC_DATA_API="http://localhost:8000"
# Production
export SYNTHETIC_DATA_API="https://studio.stateset.app"
export STATESET_API_KEY="your_api_key_here"
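The JavaScript examples in this guide reference a shared `SyntheticDataClient` object. It is not provided by an SDK; a minimal sketch built from the environment variables above:

```javascript
// Minimal client config used by the examples in this guide.
// The object shape is an assumption made for illustration,
// not something shipped by the platform.
const SyntheticDataClient = {
  baseURL: process.env.SYNTHETIC_DATA_API || 'http://localhost:8000',
  headers: {
    // Full header value, including the "Bearer " prefix
    Authorization: `Bearer ${process.env.STATESET_API_KEY || ''}`
  }
};
```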
Core Features
1. E-commerce Customer Generation
Generate realistic customer profiles with comprehensive demographic, behavioral, and predictive data.
Quick Start
// Generate 1000 diverse customer profiles
const generateCustomers = async () => {
  const formData = new FormData();
  formData.append('project_id', 'my-ecommerce-project');
  formData.append('num_customers', '1000');
  formData.append('output_format', 'json');

  const response = await fetch(`${SyntheticDataClient.baseURL}/synthdata/generate-ecommerce-customers`, {
    method: 'POST',
    headers: {
      'Authorization': SyntheticDataClient.headers.Authorization
    },
    body: formData
  });

  const job = await response.json();
  console.log(`Job started: ${job.job_id}`);

  // Monitor progress via WebSocket
  // (replacing the "http" prefix also turns "https" into "wss")
  const wsURL = SyntheticDataClient.baseURL.replace(/^http/, 'ws');
  const ws = new WebSocket(`${wsURL}/ws/jobs/${job.job_id}`);
  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    console.log(`Progress: ${data.progress}% - ${data.message}`);
  };

  return job;
};
Customer Profile Schema
Each generated customer includes personal info, demographics, behavioral data, and predictive scores. The personal info fields, for example:
{
  customer_id: string,
  personal_info: {
    first_name: string,
    last_name: string,
    gender: "male" | "female" | "other",
    date_of_birth: string,
    username: string,
    avatar_url: string
  }
}
Advanced Customer Generation
// Generate segment-specific customers with custom parameters
async function generateSegmentedCustomers() {
  const segments = [
    {
      segment: 'premium_buyer',
      count: 200,
      config: {
        min_income: 100000,
        min_order_value: 150,
        interests: ['luxury', 'fashion', 'technology']
      }
    },
    {
      segment: 'value_seeker',
      count: 500,
      config: {
        price_sensitivity: 'high',
        promotion_responsiveness: 0.9
      }
    }
  ];

  const jobs = [];
  for (const segment of segments) {
    const formData = new FormData();
    formData.append('project_id', 'segmented-customers');
    formData.append('num_customers', segment.count.toString());
    formData.append('segment_filter', segment.segment);
    formData.append('custom_config', JSON.stringify(segment.config));

    const response = await fetch(`${SyntheticDataClient.baseURL}/synthdata/generate-ecommerce-customers`, {
      method: 'POST',
      headers: { 'Authorization': SyntheticDataClient.headers.Authorization },
      body: formData
    });
    jobs.push(await response.json());
  }
  return jobs;
}
2. QA Pair Generation
Create high-quality question-answer pairs from documents for training conversational AI.
Generate QA Pairs
async function generateQAPairs(documentPath, options = {}) {
  const formData = new FormData();
  formData.append('project_id', 'qa-generation');
  // In the browser, pass a File/Blob here; in Node, a readable stream
  formData.append('input_file', documentPath);
  formData.append('qa_type', options.qaType || 'qa'); // qa, cot, summary, extraction
  formData.append('num_pairs', String(options.numPairs || 100));
  formData.append('verbose', String(options.verbose || false));

  const response = await fetch(`${SyntheticDataClient.baseURL}/synthdata/create-qa`, {
    method: 'POST',
    headers: { 'Authorization': SyntheticDataClient.headers.Authorization },
    body: formData
  });
  return response.json();
}

// Generate different types of QA pairs
const qaTypes = {
  standard: await generateQAPairs('/docs/product-manual.pdf', {
    qaType: 'qa',
    numPairs: 200
  }),
  chainOfThought: await generateQAPairs('/docs/technical-guide.pdf', {
    qaType: 'cot',
    numPairs: 100
  }),
  summaries: await generateQAPairs('/docs/company-reports.pdf', {
    qaType: 'summary',
    numPairs: 50
  }),
  extraction: await generateQAPairs('/docs/contracts.pdf', {
    qaType: 'extraction',
    numPairs: 150
  })
};
Curate QA Pairs
Apply quality scoring and filtering to ensure high-quality training data:
async function curateQAPairs(inputFile, qualityThreshold = 8.0) {
  const formData = new FormData();
  formData.append('project_id', 'qa-curation');
  formData.append('input_file', inputFile);
  formData.append('threshold', qualityThreshold.toString());
  formData.append('batch_size', '100');

  const response = await fetch(`${SyntheticDataClient.baseURL}/synthdata/curate-qa`, {
    method: 'POST',
    headers: { 'Authorization': SyntheticDataClient.headers.Authorization },
    body: formData
  });
  const job = await response.json();

  // Wait for curation to complete
  const result = await waitForJob(job.job_id);
  console.log(`Curated ${result.kept_pairs} high-quality pairs`);
  console.log(`Filtered out ${result.removed_pairs} low-quality pairs`);
  return result;
}
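`waitForJob` is used above but not defined in this guide; a minimal polling sketch against the `/jobs/{id}` status endpoint (the `status` and `result` fields are assumptions about the response shape):

```javascript
// Poll the job status endpoint until the job completes or fails.
// The `status`/`result`/`error` field names are assumptions based on
// the monitoring examples; adjust them to the actual API responses.
async function waitForJob(jobId, { intervalMs = 2000, timeoutMs = 600000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const response = await fetch(`${SyntheticDataClient.baseURL}/jobs/${jobId}`, {
      headers: { 'Authorization': SyntheticDataClient.headers.Authorization }
    });
    const status = await response.json();
    if (status.status === 'completed') return status.result;
    if (status.status === 'failed') throw new Error(status.error || 'Job failed');
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} timed out after ${timeoutMs}ms`);
}
```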
3. Fine-Tuning Data Preparation
Prepare and format data for fine-tuning language models:
class FineTuningDataPipeline {
  constructor(apiClient) {
    this.client = apiClient;
  }

  async prepareTrainingData(rawData, config) {
    // Step 1: Generate synthetic examples if needed
    if (config.augmentWithSynthetic) {
      const synthetic = await this.generateSyntheticExamples(
        rawData,
        config.syntheticRatio
      );
      rawData = [...rawData, ...synthetic];
    }

    // Step 2: Format for fine-tuning
    const formatted = this.formatForFineTuning(rawData, config.model);

    // Step 3: Split into train/validation
    const { train, validation } = this.splitData(formatted, config.validationSplit);

    // Step 4: Upload files
    const trainFile = await this.uploadTrainingFile(train);
    const validationFile = await this.uploadTrainingFile(validation);

    // Step 5: Create fine-tuning job
    const job = await this.createFineTuningJob({
      training_file: trainFile.id,
      validation_file: validationFile.id,
      model: config.model,
      hyperparameters: config.hyperparameters
    });
    return job;
  }

  formatForFineTuning(data, model) {
    return data.map(item => {
      if (model.includes('gpt')) {
        return {
          messages: [
            { role: 'system', content: item.system || 'You are a helpful assistant.' },
            { role: 'user', content: item.prompt },
            { role: 'assistant', content: item.completion }
          ]
        };
      }
      // Add other model formats as needed
      return item;
    });
  }

  async uploadTrainingFile(data) {
    const jsonl = data.map(item => JSON.stringify(item)).join('\n');
    const blob = new Blob([jsonl], { type: 'application/jsonl' });
    const formData = new FormData();
    formData.append('file', blob, 'training_data.jsonl');

    const response = await fetch(`${SyntheticDataClient.baseURL}/api/finetuning/upload-training-file`, {
      method: 'POST',
      headers: { 'Authorization': SyntheticDataClient.headers.Authorization },
      body: formData
    });
    return response.json();
  }
}
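The pipeline above calls `splitData`, `generateSyntheticExamples`, and `createFineTuningJob` without defining them. A minimal sketch of `splitData`, assuming `validationSplit` is a fraction between 0 and 1:

```javascript
// Shuffle a copy of the data (Fisher-Yates), then carve off a
// validation slice. `validationSplit` is assumed to be a 0-1 fraction.
function splitData(data, validationSplit = 0.1) {
  const shuffled = [...data];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.floor(shuffled.length * (1 - validationSplit));
  return { train: shuffled.slice(0, cut), validation: shuffled.slice(cut) };
}
```

Shuffling before splitting matters when the formatted examples arrive grouped by source document, so both splits see every document type.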
Advanced Use Cases
1. Multi-Modal Data Generation
Generate coordinated datasets across multiple data types:
class MultiModalDataGenerator {
  async generateEcommerceDataset(config) {
    const dataset = {
      customers: [],
      products: [],
      orders: [],
      reviews: [],
      support_tickets: []
    };

    // Step 1: Generate customers
    const customerJob = await this.generateCustomers(config.numCustomers);
    dataset.customers = await this.waitForJobCompletion(customerJob);

    // Step 2: Generate products based on customer interests
    const productJob = await this.generateProducts({
      count: config.numProducts,
      categories: this.extractCategories(dataset.customers)
    });
    dataset.products = await this.waitForJobCompletion(productJob);

    // Step 3: Generate realistic order history
    const orderJob = await this.generateOrders({
      customers: dataset.customers,
      products: dataset.products,
      timeRange: config.orderTimeRange
    });
    dataset.orders = await this.waitForJobCompletion(orderJob);

    // Step 4: Generate reviews based on orders
    const reviewJob = await this.generateReviews({
      orders: dataset.orders,
      sentiment_distribution: config.reviewSentiment
    });
    dataset.reviews = await this.waitForJobCompletion(reviewJob);

    // Step 5: Generate support tickets based on orders and reviews
    const ticketJob = await this.generateSupportTickets({
      orders: dataset.orders,
      reviews: dataset.reviews.filter(r => r.rating < 3),
      issue_probability: config.supportTicketRate
    });
    dataset.support_tickets = await this.waitForJobCompletion(ticketJob);

    return dataset;
  }
}
2. Time-Series Data Generation
Create realistic time-series data for analytics and forecasting:
async function generateTimeSeriesData(config) {
  const generator = new TimeSeriesGenerator({
    startDate: '2023-01-01',
    endDate: '2024-12-31',
    frequency: 'daily',
    metrics: [
      {
        name: 'daily_revenue',
        baseValue: 10000,
        trend: 0.002, // 0.2% daily growth
        seasonality: {
          weekly: { sunday: 0.7, saturday: 1.3 },
          monthly: { december: 1.8, january: 0.6 }
        },
        noise: 0.1
      },
      {
        name: 'customer_count',
        baseValue: 1000,
        trend: 0.001,
        correlation: { daily_revenue: 0.8 }
      }
    ]
  });

  const data = await generator.generate();

  // Add realistic anomalies
  const anomalies = [
    { date: '2023-11-24', metric: 'daily_revenue', multiplier: 3.5 }, // Black Friday
    { date: '2023-12-26', metric: 'daily_revenue', multiplier: 2.0 }  // Boxing Day
  ];
  return generator.injectAnomalies(data, anomalies);
}
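The trend, seasonality, and noise parameters above combine multiplicatively. A minimal sketch of that math for one daily metric (a simplified stand-in, not the actual `TimeSeriesGenerator` internals):

```javascript
// value(day) = base * (1 + trend)^day * weekdayMultiplier * noiseJitter
// Day 0 is treated as Sunday here purely for illustration.
function generateDailySeries({ days, baseValue, trend = 0, weekly = {}, noise = 0 }) {
  const names = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday'];
  const series = [];
  for (let day = 0; day < days; day++) {
    const trendFactor = Math.pow(1 + trend, day);       // compounding growth
    const seasonal = weekly[names[day % 7]] ?? 1;       // weekday multiplier
    const jitter = 1 + (Math.random() * 2 - 1) * noise; // uniform noise band
    series.push(baseValue * trendFactor * seasonal * jitter);
  }
  return series;
}
```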
3. Scenario Testing Data
Generate specific scenarios for testing edge cases:
class ScenarioDataGenerator {
  async generateTestScenarios() {
    const scenarios = {
      highValueCustomerChurn: await this.generateScenario({
        customerProfile: {
          lifetime_value: { min: 10000 },
          loyalty_points: { min: 5000 },
          order_count: { min: 50 }
        },
        behavior: {
          recent_activity: 'declining',
          support_tickets: 'increasing',
          satisfaction_trend: 'negative'
        },
        count: 100
      }),
      fraudulentPatterns: await this.generateScenario({
        customerProfile: {
          account_age_days: { max: 7 },
          shipping_addresses: { min: 3 },
          payment_methods: { min: 4 }
        },
        orderPatterns: {
          high_value_items: true,
          rush_shipping: true,
          different_billing_shipping: true
        },
        count: 50
      }),
      seasonalSurge: await this.generateScenario({
        timeframe: 'holiday_season',
        traffic_multiplier: 5,
        conversion_rate: 0.08,
        average_order_value: 1.5,
        support_ticket_rate: 2.0,
        count: 10000
      })
    };
    return scenarios;
  }
}
Monitoring & Analytics
Real-Time Progress Monitoring
class SyntheticDataMonitor {
  constructor(jobId) {
    this.jobId = jobId;
    this.metrics = {
      recordsGenerated: 0,
      qualityScore: 0,
      estimatedTimeRemaining: 0
    };
  }

  async monitor() {
    // WebSocket connection for real-time updates
    const wsURL = SyntheticDataClient.baseURL.replace(/^http/, 'ws');
    const ws = new WebSocket(`${wsURL}/ws/jobs/${this.jobId}`);
    ws.onmessage = (event) => {
      const update = JSON.parse(event.data);
      switch (update.type) {
        case 'progress':
          this.updateProgress(update);
          break;
        case 'quality_check':
          this.updateQuality(update);
          break;
        case 'completed':
          this.handleCompletion(update);
          break;
        case 'error':
          this.handleError(update);
          break;
      }
    };

    // Periodic status checks via REST API
    this.statusInterval = setInterval(async () => {
      const status = await this.checkJobStatus();
      this.updateMetrics(status);
    }, 5000);
  }

  async checkJobStatus() {
    const response = await fetch(`${SyntheticDataClient.baseURL}/jobs/${this.jobId}`, {
      headers: { 'Authorization': SyntheticDataClient.headers.Authorization }
    });
    return response.json();
  }
}
Quality Metrics Dashboard
async function getDataQualityMetrics(projectId) {
  const response = await fetch(`${SyntheticDataClient.baseURL}/projects/${projectId}/quality-metrics`, {
    headers: { 'Authorization': SyntheticDataClient.headers.Authorization }
  });
  const metrics = await response.json();

  return {
    overall_quality_score: metrics.overall_score,
    data_distribution: {
      statistical_validity: metrics.distribution.ks_test_score,
      diversity_index: metrics.distribution.diversity,
      balance_score: metrics.distribution.balance
    },
    field_quality: metrics.fields.map(field => ({
      name: field.name,
      completeness: field.completeness,
      uniqueness: field.uniqueness,
      validity: field.validity,
      consistency: field.consistency
    })),
    recommendations: metrics.recommendations
  };
}
Best Practices
1. Data Generation Strategy
// Good: Incremental generation with validation
async function generateDataIncrementally(totalRecords, batchSize = 1000) {
  const batches = Math.ceil(totalRecords / batchSize);
  const generatedData = [];

  for (let i = 0; i < batches; i++) {
    const batch = await generateBatch({
      size: Math.min(batchSize, totalRecords - i * batchSize),
      offset: i * batchSize
    });

    // Validate each batch
    const validation = await validateBatch(batch);
    if (validation.isValid) {
      generatedData.push(...batch);
    } else {
      console.error(`Batch ${i} failed validation:`, validation.errors);
      // Retry or handle error
    }

    // Progress update
    console.log(`Generated ${generatedData.length}/${totalRecords} records`);
  }
  return generatedData;
}

// Bad: Generating all data at once
async function generateAllAtOnce(totalRecords) {
  return generateBatch({ size: totalRecords }); // May timeout or OOM
}
2. Quality Assurance
class DataQualityAssurance {
  async validateSyntheticData(data, requirements) {
    const validations = {
      schema: await this.validateSchema(data, requirements.schema),
      statistics: await this.validateStatistics(data, requirements.statistics),
      business_rules: await this.validateBusinessRules(data, requirements.rules),
      privacy: await this.validatePrivacy(data)
    };

    const report = {
      passed: Object.values(validations).every(v => v.passed),
      validations,
      recommendations: this.generateRecommendations(validations)
    };
    return report;
  }

  async validateStatistics(data, expectedStats) {
    const actualStats = calculateStatistics(data);
    const deviations = {};

    for (const [metric, expected] of Object.entries(expectedStats)) {
      const actual = actualStats[metric];
      const deviation = Math.abs(actual - expected) / expected;
      deviations[metric] = {
        expected,
        actual,
        deviation,
        acceptable: deviation < 0.1 // 10% tolerance
      };
    }

    return {
      passed: Object.values(deviations).every(d => d.acceptable),
      deviations
    };
  }
}
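`calculateStatistics` is left undefined above. A minimal sketch over a single numeric field (mean, population standard deviation, and range), the kind of summary `validateStatistics` would compare against expected values:

```javascript
// Numeric summary over an array of numbers. Real validation would
// compute this per field across records; this is the core math only.
function calculateStatistics(values) {
  const n = values.length;
  const mean = values.reduce((sum, v) => sum + v, 0) / n;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
  return {
    mean,
    std: Math.sqrt(variance),
    min: Math.min(...values),
    max: Math.max(...values)
  };
}
```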
3. Streaming Large Datasets
// Use streaming for large datasets
async function* streamSyntheticData(config) {
  const pageSize = 1000;
  let offset = 0;

  while (offset < config.total) {
    const response = await fetch(`${SyntheticDataClient.baseURL}/synthdata/stream`, {
      method: 'POST',
      headers: {
        'Authorization': SyntheticDataClient.headers.Authorization,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        ...config,
        offset,
        limit: pageSize
      })
    });

    const data = await response.json();
    if (data.records.length === 0) break;
    yield data.records;
    offset += data.records.length;
  }
}

// Process data as it's generated
async function processStreamingData() {
  const stream = streamSyntheticData({
    type: 'customers',
    total: 1000000
  });

  for await (const batch of stream) {
    await processBatch(batch);
    console.log(`Processed ${batch.length} records`);
  }
}
Error Handling
Comprehensive Error Management
class SyntheticDataErrorHandler {
  async handleAPIError(error) {
    const errorHandlers = {
      RATE_LIMITED: async () => {
        const retryAfter = error.headers['X-RateLimit-Reset'];
        await this.delay(retryAfter * 1000);
        return { retry: true };
      },
      INVALID_INPUT: () => {
        console.error('Invalid input:', error.detail);
        return { retry: false, fix: this.suggestInputFix(error) };
      },
      INTERNAL_ERROR: async () => {
        await this.reportError(error);
        return { retry: true, delay: 5000 };
      },
      SERVICE_UNAVAILABLE: () => {
        return { retry: true, delay: 30000, useBackup: true };
      }
    };

    const handler = errorHandlers[error.error_code] || errorHandlers.INTERNAL_ERROR;
    return handler();
  }

  suggestInputFix(error) {
    // Analyze error and suggest fixes
    const suggestions = {
      'missing_required_field': `Add required field: ${error.field}`,
      'invalid_format': `Expected format: ${error.expected_format}`,
      'value_out_of_range': `Value must be between ${error.min} and ${error.max}`
    };
    return suggestions[error.validation_error] || 'Check API documentation';
  }
}
Troubleshooting
Common Issues and Solutions
API Authentication Errors
Symptom : 401 Unauthorized responses
Solution : Verify your API key is correctly set in the environment variables and not expired. Regenerate if necessary from your dashboard.
Job Timeout or No Progress
Symptom : WebSocket shows no updates, or job stuck at 0%
Solution : Check server status in the dashboard. For large jobs, increase timeout settings or split into smaller batches.
Invalid Data Format
Symptom : 400 Bad Request with format errors
Solution : Validate your input data against the schema. Use the preview endpoint to test small samples.
Rate Limit Exceeded
Symptom : 429 Too Many Requests
Solution : Implement exponential backoff in your client code. Upgrade your plan for higher limits.
WebSocket Disconnection
Symptom : Monitoring stops unexpectedly
Solution : Implement reconnection logic in your WebSocket handler with exponential backoff.
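The exponential backoff recommended in the last two solutions can be sketched as a small retry wrapper (all names here are illustrative, not part of any SDK):

```javascript
// Retry an async operation with exponential backoff plus jitter.
// The delay doubles on each failure and is capped at maxDelayMs.
async function withBackoff(operation, { retries = 5, baseDelayMs = 1000, maxDelayMs = 30000 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt === retries) throw err; // out of retries: surface the error
      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      const jitter = delay * Math.random() * 0.1; // up to 10% jitter
      await new Promise(resolve => setTimeout(resolve, delay + jitter));
    }
  }
}
```

Wrap any flaky call with it, e.g. `withBackoff(() => fetch(url).then(r => r.json()))`; the same pattern works for re-opening a dropped WebSocket.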
If issues persist, contact support with your job ID and error details.
Security & Compliance
Privacy-Preserving Generation
class PrivacyPreservingSynthData {
  async generateCompliantData(config) {
    const privacyRules = {
      // No real PII patterns
      email_format: 'synthetic_[hash]@example.com',
      phone_format: '555-0[random]',
      // Differential privacy for statistics
      differential_privacy: {
        epsilon: 1.0,
        delta: 1e-5
      },
      // K-anonymity for demographics
      k_anonymity: {
        k: 5,
        quasi_identifiers: ['age', 'zipcode', 'gender']
      }
    };

    const data = await this.generateWithPrivacy(config, privacyRules);

    // Validate compliance
    const compliance = await this.validateCompliance(data, {
      gdpr: true,
      ccpa: true,
      hipaa: config.industry === 'healthcare'
    });

    return {
      data,
      compliance_report: compliance
    };
  }
}
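A k-anonymity rule like the one above can also be checked after generation. A minimal sketch (not the platform's validator) that verifies every combination of quasi-identifier values appears at least k times:

```javascript
// Group records by their quasi-identifier values and flag any group
// smaller than k; records in such groups are potentially re-identifiable.
function checkKAnonymity(records, quasiIdentifiers, k) {
  const groups = new Map();
  for (const record of records) {
    const key = quasiIdentifiers.map(field => record[field]).join('|');
    groups.set(key, (groups.get(key) || 0) + 1);
  }
  const violations = [...groups.entries()].filter(([, count]) => count < k);
  return { satisfied: violations.length === 0, violations };
}
```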
Pricing & Limits
Plans: Free Tier, Growth, and Enterprise. The Free Tier includes:
10,000 records/month
Basic customer profiles
Standard QA generation
Community support
Next Steps
Pro Tip : Start with small batches to validate your data generation parameters, then scale up. Use preview endpoints to check data quality before generating large datasets.
For support and examples, visit our GitHub repository or contact support@stateset.com .