# StateSet Agents
StateSet Agents is a production‑oriented RL stack for training and serving LLM‑backed agents that improve through multi‑turn interaction. The library provides:

- Async‑first agent APIs (`MultiTurnAgent`, `ToolAgent`) with Hugging Face and stub backends.
- Environments for conversational and task‑oriented episodes.
- Trajectories and value/advantage utilities tailored to dialogue.
- Composable reward functions (heuristic, domain, multi‑objective, neural).
- A family of group‑based policy‑optimization trainers (GRPO, GSPO, GEPO, DAPO, VAPO) plus PPO and RLAIF.
- Offline RL algorithms for learning from logged conversations (BCQ, BEAR, CQL, IQL, Decision Transformer).
- Sim‑to‑Real transfer for training in simulation and deploying to real users (domain randomization, system identification, progressive transfer).
- Continual learning + long‑term planning utilities (replay/LwF/EWC, plan context injection).
- Optional performance layers (vLLM generation, Rust acceleration, distributed training, HPO, FastAPI service).
## Why group‑based optimization?
Traditional RLHF/PPO trains on one sampled response at a time. In long conversations this leads to high‑variance updates and brittle behavior. StateSet Agents implements group‑relative methods:
- GRPO (Group Relative Policy Optimization): sample a group of trajectories per prompt, compute advantages relative to the group baseline, then apply clipped policy‑gradient updates.
- GSPO (Group Sequence Policy Optimization): a more stable sequence‑level variant (Alibaba Qwen team) that avoids token‑level collapse on long outputs and MoE models.
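As a concrete illustration of the group‑relative baseline described above (a concept sketch, not the library's internal code):

```python
# Concept sketch: group-relative advantages as used by GRPO-style methods.
# Each prompt gets a group of sampled trajectories; advantages are the rewards
# normalized against that group's own mean and standard deviation.
from statistics import mean, pstdev

def group_relative_advantages(group_rewards: list[float], eps: float = 1e-8) -> list[float]:
    baseline = mean(group_rewards)
    scale = pstdev(group_rewards) + eps
    return [(r - baseline) / scale for r in group_rewards]

# Example: four sampled responses to the same prompt
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```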
## Core concepts
- Agent: wraps a causal LM and exposes `initialize()` and `generate_response()`. `MultiTurnAgent` handles conversation history and state. `ToolAgent` adds function/tool calling.
- Environment: defines episode reset/step logic and optional reward hooks. `ConversationEnvironment` ships with scenario‑driven multi‑turn conversations. `TaskEnvironment` is for goal‑oriented tasks.
- Trajectory: a multi‑turn record of turns, rewards, and metadata (`MultiTurnTrajectory`).
- Rewards: `RewardFunction` subclasses and factories; combined via `CompositeReward` or multi‑objective reward models.
- Training: trainers in `stateset_agents.training` implement GRPO‑family updates, GAE/value heads, KL regularization, LoRA support, and optional distributed/vLLM execution.
## Installation

- Core (lightweight, stub‑ready): `pip install stateset-agents`
- Training / real models: requires Torch and transformers.
- Optional extras: e.g. `stateset-agents[hpo]` for hyperparameter optimization (see below); vLLM generation, Rust acceleration, distributed training, and the FastAPI service are optional layers.
## Quick start
### 1) Stub hello world (no downloads)
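A minimal sketch of stub mode, assuming a top‑level `MultiTurnAgent` import and a `model="stub"` constructor argument (both unverified); `examples/hello_world.py` is the authoritative version:

```python
# Hedged sketch of stub mode (no Torch/transformers). The constructor keyword
# and "stub" backend name are assumptions; see examples/hello_world.py.
import asyncio
from stateset_agents import MultiTurnAgent  # top-level import path assumed

async def main():
    agent = MultiTurnAgent(model="stub")        # stub backend: no model downloads
    await agent.initialize()
    reply = await agent.generate_response("Hi there! What can you help with?")
    print(reply)

asyncio.run(main())
```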
Runs without Torch/transformers and is ideal for CI or prototyping.

### 2) Chat with a real model
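A hedged sketch of the same loop against a real Hugging Face model (the model id is just an example and the keyword name is assumed; see `examples/quick_start.py` for the exact call):

```python
# Same loop as above, but backed by a real Hugging Face model (downloads weights).
# The model id is illustrative; any small instruct-tuned chat model works.
import asyncio
from stateset_agents import MultiTurnAgent

async def main():
    agent = MultiTurnAgent(model="Qwen/Qwen2.5-0.5B-Instruct")  # keyword name assumed
    await agent.initialize()
    print(await agent.generate_response("Explain GRPO in one sentence."))

asyncio.run(main())
```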
## Train a multi‑turn agent with GRPO
The high‑level `train(...)` helper chooses single‑turn vs multi‑turn GRPO automatically.
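A hedged sketch of calling the helper, assuming an async entry point; keyword names other than `save_path` (which appears in the Checkpoints section below) and the scenario format are assumptions:

```python
# Hedged sketch of the high-level training helper. Only train(...) and
# save_path are confirmed by this README; the other keyword names, the import
# path, and the scenario format are assumptions — see
# examples/complete_grpo_training.py for the exact API.
import asyncio
from stateset_agents import MultiTurnAgent, ConversationEnvironment
from stateset_agents.training import train

async def main():
    agent = MultiTurnAgent(model="Qwen/Qwen2.5-0.5B-Instruct")
    env = ConversationEnvironment(
        scenarios=[{"topic": "order status", "user_goal": "locate a delayed package"}],
    )
    await train(agent=agent, environment=env, save_path="checkpoints/grpo_agent")

asyncio.run(main())
```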
See `examples/complete_grpo_training.py` and `examples/production_ready_customer_service.py` for end‑to‑end scripts.
## Continual learning + long‑term planning (optional)

Enable planning context and replay/LwF in the trainer with config overrides:
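The override keys below are purely illustrative placeholders, not the trainer's real field names; check the training configs in `stateset_agents.training` and the docs for the actual options:

```python
# Illustrative only: these keys are hypothetical placeholders for the trainer
# config fields that toggle plan-context injection and replay/LwF/EWC.
config_overrides = {
    "planning_context": True,        # inject a long-term plan into each prompt
    "continual_strategy": "replay",  # alternatives: "lwf", "ewc"
    "replay_buffer_size": 1024,
}
```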
## Other training algorithms
All algorithms are available under `stateset_agents.training` when the training dependencies are installed:
- GSPO: stable sequence‑level GRPO variant (`GSPOTrainer`, `GSPOConfig`, `train_with_gspo`)
- GEPO: expectation‑based group optimization for heterogeneous/distributed setups
- DAPO: decoupled clip + dynamic sampling for reasoning‑heavy tasks
- VAPO: value‑augmented group optimization (strong for math/reasoning)
- PPO baseline: standard PPO trainer for comparison
- RLAIF: RL from AI feedback via judge/reward models
See `docs/GSPO_GUIDE.md`, `docs/ADVANCED_RL_ALGORITHMS.md`, and `examples/train_with_gspo.py` for full configs.
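A hedged sketch reusing the exported names listed above; the `GSPOConfig` fields and the `train_with_gspo` signature are assumptions:

```python
# Hedged GSPO sketch. GSPOConfig, train_with_gspo, and the agent/environment
# classes are named in this README, but the config fields and call signature
# shown here are assumptions — see docs/GSPO_GUIDE.md for the real ones.
from stateset_agents import MultiTurnAgent, ConversationEnvironment
from stateset_agents.training import GSPOConfig, train_with_gspo

agent = MultiTurnAgent(model="Qwen/Qwen2.5-0.5B-Instruct")
env = ConversationEnvironment(scenarios=[{"topic": "refund request"}])

config = GSPOConfig(group_size=8, learning_rate=1e-6)  # illustrative fields only
train_with_gspo(agent=agent, environment=env, config=config)
```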
## Offline RL: Learn from logged conversations
Train agents from historical conversation logs without online interaction. Useful when:

- You have existing customer service transcripts
- Online training is expensive or risky
- You want to bootstrap before online fine‑tuning
### Available Algorithms
| Algorithm | Best For | Key Innovation |
|---|---|---|
| BCQ | Conservative learning | VAE‑constrained action space |
| BEAR | Distribution matching | MMD kernel regularization |
| CQL | Pessimistic Q‑values | Conservative Q‑function penalty |
| IQL | Expectile regression | Implicit value learning |
| Decision Transformer | Sequence modeling | Return‑conditioned generation |
### Quick Start
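As a framework‑agnostic sketch of the data side (the offline trainer classes themselves are documented in the guide, not shown here), logged transcripts can be flattened into transitions like this; the JSONL schema is an assumption about your logs:

```python
# Concept sketch: turn logged support transcripts into (state, action, reward)
# transitions that an offline algorithm such as CQL or IQL can consume.
# The JSONL schema below is an assumption about your own logs.
import json

def load_transitions(path: str) -> list[dict]:
    transitions = []
    with open(path) as f:
        for line in f:
            convo = json.loads(line)          # {"turns": [...], "reward": float}
            history = []
            for turn in convo["turns"]:
                if turn["role"] == "assistant":
                    transitions.append({
                        "state": list(history),        # dialogue so far
                        "action": turn["content"],     # agent's reply
                        "reward": convo.get("reward", 0.0),
                    })
                history.append(turn)
    return transitions

transitions = load_transitions("data/support_transcripts.jsonl")
print(f"{len(transitions)} transitions ready for an offline RL trainer")
```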
### Hybrid Offline + Online Training
Combine offline pretraining with online GRPO fine‑tuning; see `docs/OFFLINE_RL_SIM_TO_REAL_GUIDE.md` for complete documentation.
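A hedged two‑phase sketch: the offline phase is left as a comment (its API lives in the guide), and the online phase reuses the `train(...)` helper with assumed keyword names:

```python
# Hedged two-phase sketch. Phase 1 (offline pretraining from logs) uses the
# offline RL API from docs/OFFLINE_RL_SIM_TO_REAL_GUIDE.md; Phase 2 reuses the
# train(...) helper shown earlier. Keyword names are assumptions.
import asyncio
from stateset_agents import MultiTurnAgent, ConversationEnvironment
from stateset_agents.training import train

async def main():
    agent = MultiTurnAgent(model="Qwen/Qwen2.5-0.5B-Instruct")

    # Phase 1: offline pretraining from logged transcripts
    # (construct the offline dataset and trainer as described in the guide).

    # Phase 2: online GRPO fine-tuning of the pretrained agent.
    env = ConversationEnvironment(scenarios=[{"topic": "warranty claim"}])
    await train(agent=agent, environment=env, save_path="checkpoints/hybrid_agent")

asyncio.run(main())
```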
## Sim‑to‑Real Transfer
Train in simulation, deploy to real users. The framework provides domain randomization, a calibratable conversation simulator, and progressive transfer.

### Domain Randomization

Generate diverse training scenarios with randomized user personas:
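A plain‑Python concept sketch of persona randomization (not the framework's randomizer API):

```python
# Concept sketch of domain randomization over user personas: each training
# episode samples a different simulated-user profile so the policy does not
# overfit to one conversational style.
import random

PERSONA_SPACE = {
    "patience": ["low", "medium", "high"],
    "verbosity": ["terse", "normal", "chatty"],
    "goal": ["refund", "order status", "product question"],
    "typo_rate": [0.0, 0.05, 0.15],
}

def sample_persona(rng: random.Random) -> dict:
    return {key: rng.choice(values) for key, values in PERSONA_SPACE.items()}

rng = random.Random(0)
scenarios = [sample_persona(rng) for _ in range(4)]  # feed these into your environment
print(scenarios)
```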
### Conversation Simulator

Calibratable simulator with adjustable realism:
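A concept sketch of a calibratable simulated user with a single realism knob (again, not the framework's simulator class):

```python
# Concept sketch: a simulated user whose "realism" knob controls how often it
# goes off-script, so simulation fidelity can be calibrated against real logs.
import random

class SimulatedUser:
    def __init__(self, persona: dict, realism: float = 0.5, seed: int = 0):
        self.persona = persona
        self.realism = realism              # 0 = fully scripted, 1 = maximally noisy
        self.rng = random.Random(seed)

    def reply(self, agent_message: str) -> str:
        scripted = f"I still need help with my {self.persona['goal']}."
        if self.rng.random() < self.realism:
            return scripted + " Also, can you hurry? I'm in a rush."  # off-script noise
        return scripted

user = SimulatedUser({"goal": "refund"}, realism=0.8)
print(user.reply("How can I help you today?"))
```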
### Progressive Transfer

Gradually transition from simulation to real interactions:
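A concept sketch of a linear sim‑to‑real schedule (the framework's own transfer controller is documented in the guide):

```python
# Concept sketch of progressive transfer: early training runs mostly simulated
# episodes; later training shifts an increasing fraction to real user traffic.
def real_traffic_fraction(step: int, total_steps: int, start: float = 0.0, end: float = 0.8) -> float:
    """Linear schedule from `start` to `end` over training."""
    progress = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * progress

for step in range(0, 1001, 250):
    frac = real_traffic_fraction(step, total_steps=1000)
    print(f"step {step}: {frac:.0%} real episodes, {1 - frac:.0%} simulated")
```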
See `docs/OFFLINE_RL_SIM_TO_REAL_GUIDE.md` for complete documentation.
## Hyperparameter optimization (HPO)
Install with `stateset-agents[hpo]`, then:
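For illustration, a generic Optuna‑style sweep around a short training run; this is not necessarily the framework's own HPO entry point, and the evaluation helper below is a placeholder:

```python
# Generic Optuna sweep shown for illustration; the framework's HPO entry points
# (installed via the [hpo] extra) are documented in docs/HPO_GUIDE.md and
# examples/hpo_training_example.py. evaluate_short_run is a placeholder.
import optuna

def evaluate_short_run(lr: float, group_size: int) -> float:
    # Placeholder: run a short GRPO job with these hyperparameters (e.g. via
    # the train(...) helper) and return the mean evaluation reward.
    return 0.0

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True)
    group_size = trial.suggest_categorical("group_size", [4, 8, 16])
    return evaluate_short_run(lr=lr, group_size=group_size)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```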
See `docs/HPO_GUIDE.md` and `examples/hpo_training_example.py`.
## Custom rewards
Use the reward decorator for quick experiments, or subclass `RewardFunction` and combine multiple signals with `CompositeReward`:
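A hedged subclass‑based sketch; `RewardFunction` and `CompositeReward` are named in this README, but the method name, signature, turn format, and constructor arguments are assumptions:

```python
# Hedged sketch: RewardFunction and CompositeReward are real names from this
# README, but the method name/signature, turn format, and constructor
# arguments are assumptions — check the reward module for the exact interface.
from stateset_agents import RewardFunction, CompositeReward   # import path assumed

class PolitenessReward(RewardFunction):
    async def compute_reward(self, turns, **kwargs) -> float:  # signature assumed
        last_reply = turns[-1]["content"].lower()
        return 1.0 if ("please" in last_reply or "thank" in last_reply) else 0.0

class BrevityReward(RewardFunction):
    async def compute_reward(self, turns, **kwargs) -> float:
        return 1.0 if len(turns[-1]["content"].split()) <= 80 else 0.3

reward = CompositeReward([PolitenessReward(), BrevityReward()])  # weighting args assumed
```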
## Custom environments
Subclass `Environment` for task‑specific dynamics:
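A hedged sketch, assuming async `reset`/`step` methods with the signatures shown (the base‑class interface may differ):

```python
# Hedged sketch of a custom environment. Reset/step semantics come from this
# README, but the exact base-class method signatures and return shapes are
# assumptions — compare with ConversationEnvironment/TaskEnvironment.
from stateset_agents import Environment   # import path assumed

class RefundEnvironment(Environment):
    async def reset(self):
        # Start a new episode: pick a scenario and return the initial state.
        self.turns_left = 6
        return {"system_prompt": "You handle refund requests politely."}

    async def step(self, agent_response: str):
        # Apply task dynamics and return (observation, reward, done) style info.
        self.turns_left -= 1
        solved = "refund issued" in agent_response.lower()
        reward = 1.0 if solved else 0.0
        done = solved or self.turns_left == 0
        return {"user_message": "Any update on my refund?"}, reward, done
```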
## Checkpoints
- `train(..., save_path="...")` saves an agent checkpoint.
- Load later:
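The restore call below is a placeholder, since the loading API isn't named in this README:

```python
# Placeholder only: from_checkpoint is a hypothetical method name — check
# docs/USAGE_GUIDE.md or the examples for the real restore API.
from stateset_agents import MultiTurnAgent

agent = MultiTurnAgent.from_checkpoint("checkpoints/grpo_agent")  # hypothetical
```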
## CLI
The CLI is a thin wrapper around the Python API; see `docs/CLI_REFERENCE.md` for the available commands.

## Examples and docs
Good starting points:

- `examples/hello_world.py` – stub mode walkthrough
- `examples/quick_start.py` – basic agent + environment
- `examples/complete_grpo_training.py` – end‑to‑end GRPO training
- `examples/train_with_gspo.py` – GSPO + GSPO‑token training
- `examples/train_with_trl_grpo.py` – Hugging Face TRL GRPO integration
- `docs/USAGE_GUIDE.md`
- `docs/RL_FRAMEWORK_GUIDE.md`
- `docs/GSPO_GUIDE.md`
- `docs/OFFLINE_RL_SIM_TO_REAL_GUIDE.md`
- `docs/HPO_GUIDE.md`
- `docs/CLI_REFERENCE.md`
- `docs/ARCHITECTURE.md`
## Related Projects
- stateset-nsr - Neuro‑symbolic reasoning engine for explainable tools.
- stateset-api - Commerce/operations API that agents can drive.
- stateset-sync-server - Multi‑tenant orchestration and integrations.
- core - Cosmos SDK blockchain for on‑chain commerce.
- Public API docs: https://docs.stateset.com
## Contributing
See `CONTRIBUTING.md`. Please run `pytest -q` and format with `black`/`isort` before opening a PR.
## License
Business Source License 1.1. Non‑production use is permitted until 2029‑09‑03, then the license transitions to Apache 2.0. See `LICENSE`.