Usage
1. Defining nonlinear dynamics
The dynamics are defined in repr_control/define_problem.py.
The following items should be defined:
Dynamics
Reward function
Initial distributions
State and action bounds
Maximum rollout steps
Noise level
The following walks through defining the stochastic inverted pendulum dynamics in detail.
The pendulum dynamics are:
\[\ddot\theta = \frac{3g}{2l}\sin\theta + \frac{3}{ml^2}T,\]
where \(\theta\) is the angle, \(g\) is the gravity constant, \(m\) is the pendulum mass, \(l\) is the pendulum length, and \(T\) is the input torque. To deal with the unbounded \(\theta\), the observation is defined as \([\cos\theta, \sin\theta, \dot\theta]\).
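As a small illustration of why the \([\cos\theta, \sin\theta, \dot\theta]\) observation copes with an unbounded angle, the sketch below uses only the standard library. The helper names `to_observation` and `angle_from_observation` are made up for this example and are not part of the package. Angles that differ by full turns map to the same observation, and `atan2` recovers the angle wrapped to \((-\pi, \pi]\).

```python
import math


def to_observation(theta, theta_dot):
    """Map an (unbounded) angle and its rate to [cos, sin, rate]."""
    return (math.cos(theta), math.sin(theta), theta_dot)


def angle_from_observation(obs):
    """Recover the angle, wrapped to (-pi, pi], from the observation."""
    cos_th, sin_th, _ = obs
    return math.atan2(sin_th, cos_th)


theta = 0.5
# Two full turns later, the observation encodes the same wrapped angle.
wrapped = angle_from_observation(to_observation(theta + 4 * math.pi, 0.0))
print(wrapped)
```

This is the same `atan2` step that the dynamics and reward functions below apply to the first two observation entries.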
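We use Euler discretization, combined with the stochastic dynamics,
\[s_{t+1} = s_t + f(s_t, a_t)\,\Delta t + \epsilon,\]
where \(s_t\) and \(a_t\) are the state and action at step \(t\), \(\Delta t\) is the discretization step, \(f\) is the continuous-time nonlinear dynamics, and \(\epsilon\sim \mathcal N(0, \sigma^2 I_n)\).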
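The discretization described above can be sketched in PyTorch as follows. This is a minimal illustration, not the package's implementation: the function name `euler_step_with_noise`, the toy dynamics `toy_f`, and the choice to add the noise once per step without scaling it by the step size are all assumptions made for this example.

```python
import torch


def euler_step_with_noise(f, state, action, dt, sigma, generator=None):
    """One Euler step s + f(s, a) * dt plus additive Gaussian noise.

    state and action are batched tensors; sigma is the noise standard
    deviation (assumed here to be applied per step, unscaled by dt).
    """
    drift = f(state, action)
    noise = sigma * torch.randn(state.shape, generator=generator, dtype=state.dtype)
    return state + drift * dt + noise


def toy_f(state, action):
    # Hypothetical continuous-time dynamics ds/dt = -s + a, for illustration only.
    return -state + action


state = torch.ones(4, 2)    # [batch_size, state_dim]
action = torch.zeros(4, 2)  # [batch_size, action_dim]
# With sigma = 0 the step reduces to deterministic Euler integration.
next_state = euler_step_with_noise(toy_f, state, action, dt=0.05, sigma=0.0)
print(next_state)
```

Passing a seeded `torch.Generator` makes the noisy rollouts reproducible.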
Define dynamics and reward functions
Note that the dynamics must be written in PyTorch and all the inputs should be torch.Tensor. The dynamics must support batch operations, which means the input torch.Tensor objects should have shapes [batch_size, state_dim] and [batch_size, action_dim].
Define dynamics:
import torch


def dynamics(state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    g = 10.0        # gravity constant
    m = 1.          # pendulum mass
    l = 1.          # pendulum length
    max_a = 2.      # torque limit
    dt = 0.05       # discretization step
    max_speed = 8   # angular velocity limit
    cos_th, sin_th, thdot = state[:, 0], state[:, 1], state[:, 2]
    th = torch.atan2(sin_th, cos_th)  # recover the angle from the observation
    action = torch.reshape(action, (action.shape[0],))
    u = torch.clip(action, -max_a, max_a)
    newthdot = thdot + (3. * g / (2 * l) * torch.sin(th) + 3.0 / (m * l ** 2) * u) * dt
    newthdot = torch.clip(newthdot, -max_speed, max_speed)
    newth = th + newthdot * dt
    # Stack back into [batch_size, 3] observations.
    next_state = torch.vstack([torch.cos(newth), torch.sin(newth), newthdot]).T
    assert next_state.shape == state.shape
    return next_state

Define rewards:

def rewards(state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    cos_th, sin_th, thdot = state[:, 0], state[:, 1], state[:, 2]
    th = torch.atan2(sin_th, cos_th)
    action = torch.reshape(action, (action.shape[0],))
    # Quadratic penalty on angle, angular velocity, and torque.
    reward = -0.3 * (th ** 2 + 0.1 * thdot ** 2 + 0.001 * action ** 2)
    return reward
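Because both functions must accept batched tensors, a quick shape check before training can catch mistakes early. The sketch below is one possible way to do this. `check_batched` is a hypothetical helper (not part of the package), and `toy_dynamics` and `toy_rewards` are stand-ins so the example runs on its own.

```python
import torch


def check_batched(fn, state_dim, action_dim, out_tail, batch_size=5):
    """Call fn on random batched inputs and verify the output shape.

    out_tail is the expected shape after the batch dimension,
    e.g. (state_dim,) for dynamics and () for rewards.
    """
    state = torch.randn(batch_size, state_dim)
    action = torch.randn(batch_size, action_dim)
    out = fn(state, action)
    assert isinstance(out, torch.Tensor), "output must be a torch.Tensor"
    expected = (batch_size, *out_tail)
    assert tuple(out.shape) == expected, f"expected {expected}, got {tuple(out.shape)}"
    return tuple(out.shape)


def toy_dynamics(state, action):
    # Stand-in dynamics: the [B, 1] action broadcasts over the [B, 3] state.
    return state + 0.05 * action


def toy_rewards(state, action):
    # Stand-in reward: one scalar per batch element.
    return -(state ** 2).sum(dim=1) - 0.001 * action.reshape(-1) ** 2


print(check_batched(toy_dynamics, state_dim=3, action_dim=1, out_tail=(3,)))
print(check_batched(toy_rewards, state_dim=3, action_dim=1, out_tail=()))
```

To check the pendulum functions above, pass `dynamics` and `rewards` in place of the toy functions with `state_dim=3` and `action_dim=1`.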
2. Start training
Training can be started with a single command:
$ python solve.py
Define training hyperparameters
The hyperparameters can be set through command-line arguments, for example:
$ python solve.py --max_timesteps 2e5 --rf_num 1024
Here --max_timesteps 2e5 sets the total number of training iterations to 2e5, and --rf_num 1024 sets the truncated (finite) dimension of the random features to 1024.
To list all the hyperparameters that can be tuned, run
$ python solve.py --help
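For readers wiring similar flags into their own scripts, the sketch below shows one way a value such as `2e5` could be accepted from the command line. It is a hypothetical parser, not the actual contents of solve.py: only the flag names come from the commands above, and the default values are placeholders. Scientific notation is parsed with `type=float` and converted afterwards, since `int("2e5")` raises a `ValueError`.

```python
import argparse


def build_parser():
    # Hypothetical sketch; the real solve.py may define these flags differently.
    parser = argparse.ArgumentParser(description="Illustrative hyperparameter parser")
    # float accepts scientific notation such as 2e5; int would reject it.
    parser.add_argument("--max_timesteps", type=float, default=1e5,
                        help="total number of training iterations (placeholder default)")
    parser.add_argument("--rf_num", type=int, default=512,
                        help="random feature dimension (placeholder default)")
    return parser


args = build_parser().parse_args(["--max_timesteps", "2e5", "--rf_num", "1024"])
max_timesteps = int(args.max_timesteps)  # 2e5 -> 200000
print(max_timesteps, args.rf_num)
```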
Experimental: vectorized solution:
$ python solve_vec.py --max_timesteps 2e5 --device cuda
This uses vectorized rollouts and evaluation to speed up training.
3. Monitoring and evaluating the training results
After training starts, the results are saved in the following layout:
repr-control/
├── repr-control/
│ ├── log/
│ │ ├── rfsac/
│ │ │ ├── seed_SEED_DATE-TIME # folder title
│ │ │ │ ├── summary/ # save tensorboard summaries
│ │ │ │ ├── best_actor.pth # actor with the best evaluations
│ │ │ │ ├── best_critic.pth # critic with the best evaluations
│ │ │ │ ├── last_actor.pth # actor after all training steps
│ │ │ │ ├── last_critic.pth # critic after all training steps
└── └── └── └── └── train_params.yaml # training parameters
Run the following script to evaluate the trained results:
$ python scripts/eval.py $LOG_PATH
where $LOG_PATH is the path to the run folder named seed_SEED_DATE-TIME.
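When several runs accumulate under the same log directory, picking the most recent run folder by hand gets tedious. The helper below is a hypothetical convenience (not shipped with the package) that assumes folder names follow the pattern of the example path in this guide, such as `seed_0_2024-07-18-14-50-35`, and selects the folder with the latest timestamp.

```python
from datetime import datetime
from pathlib import Path


def parse_run_time(run_dir):
    """Parse the timestamp from a folder named seed_SEED_YYYY-MM-DD-HH-MM-SS."""
    # 'seed_0_2024-07-18-14-50-35' -> ['seed', '0', '2024-07-18-14-50-35']
    stamp = run_dir.name.split("_", 2)[2]
    return datetime.strptime(stamp, "%Y-%m-%d-%H-%M-%S")


def latest_run(log_root):
    """Return the most recent seed_* run folder under log_root."""
    runs = [p for p in Path(log_root).glob("seed_*") if p.is_dir()]
    if not runs:
        raise FileNotFoundError(f"no seed_* folders found under {log_root}")
    return max(runs, key=parse_run_time)
```

The returned path can then be passed to the evaluation script as $LOG_PATH.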
Monitoring the training process
$ tensorboard --logdir $LOG_PATH
You can inspect the training process in TensorBoard.
Note
Monitoring the training process is very helpful for tuning the hyperparameters. Some rules of thumb if you don’t have experience tuning RL hyperparameters:
If the value loss is too large, try scaling the rewards to be smaller (or increase the learning rate).
If the agent always gets stuck, try adapting the initial distribution to cover more of the state space.
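The first rule of thumb, scaling the rewards down, can be applied without rewriting the reward function itself. The sketch below wraps an existing reward function with a constant factor. `scale_rewards` and `toy_rewards` are names invented for this illustration, and the factor 0.1 is an arbitrary example value rather than a recommendation.

```python
def scale_rewards(reward_fn, scale):
    """Return a reward function whose output is multiplied by a positive constant."""
    if scale <= 0:
        raise ValueError("scale must be positive to keep the ordering of rewards")

    def scaled(state, action):
        return scale * reward_fn(state, action)

    return scaled


def toy_rewards(state, action):
    # Stand-in quadratic penalty; works on floats and on tensors alike.
    return -(state ** 2 + 0.001 * action ** 2)


scaled_rewards = scale_rewards(toy_rewards, 0.1)
print(scaled_rewards(2.0, 1.0))
```

A positive factor keeps the ordering of rewards unchanged, so it shrinks the magnitude of the value targets without changing which states and actions are preferred.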
Evaluating the training results:
$ python scripts/eval.py $LOG_PATH
Example results are included in the examples folder. Run the following to view them:
$ tensorboard --logdir ./examples/example_results/rfsac/Pendulum/seed_0_2024-07-18-14-50-35
$ python scripts/eval.py ./examples/example_results/rfsac/Pendulum/seed_0_2024-07-18-14-50-35
4. Use controller elsewhere
Add the following lines to your Python code to load the training results as a controller:

import numpy as np
from repr_control.scripts.eval import get_controller

log_path = '$LOG_PATH'
agent = get_controller(log_path)

To generate a control command from a state:

state = np.zeros([3])  # a sample state with all zeros
action = agent.select_action(state, explore=False)
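A loaded controller is typically called in a loop, feeding each new state back in. The sketch below shows such a closed-loop rollout under stated assumptions. `StubAgent` is a hand-written stand-in that only mimics the `select_action(state, explore=False)` call shown above (it is not the trained controller, and returning an array of shape [action_dim] is an assumption). `pendulum_step` is a single-state NumPy copy of the pendulum update from section 1.

```python
import numpy as np


class StubAgent:
    """Stand-in exposing the select_action call; replace with get_controller(log_path)."""

    def select_action(self, state, explore=False):
        cos_th, sin_th, thdot = state
        th = np.arctan2(sin_th, cos_th)
        # Simple proportional-derivative rule, clipped to the torque limit.
        return np.array([np.clip(-2.0 * th - 0.5 * thdot, -2.0, 2.0)])


def pendulum_step(state, action, dt=0.05):
    """Single-state NumPy version of the pendulum update (g=10, m=1, l=1)."""
    cos_th, sin_th, thdot = state
    th = np.arctan2(sin_th, cos_th)
    u = float(np.clip(action[0], -2.0, 2.0))
    new_thdot = np.clip(thdot + (15.0 * np.sin(th) + 3.0 * u) * dt, -8.0, 8.0)
    new_th = th + new_thdot * dt
    return np.array([np.cos(new_th), np.sin(new_th), new_thdot])


def rollout(agent, step_fn, init_state, n_steps):
    """Run the controller in closed loop and record states and actions."""
    states = [np.asarray(init_state, dtype=float)]
    actions = []
    for _ in range(n_steps):
        action = agent.select_action(states[-1], explore=False)
        actions.append(action)
        states.append(step_fn(states[-1], action))
    return np.stack(states), np.stack(actions)


init_state = np.array([np.cos(0.1), np.sin(0.1), 0.0])
states, actions = rollout(StubAgent(), pendulum_step, init_state, n_steps=10)
print(states.shape, actions.shape)
```

Swapping `StubAgent()` for the `agent` returned by `get_controller` should leave the loop unchanged, provided the controller returns an action with a matching shape.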