Usage

1. Defining nonlinear dynamics

The dynamics are defined in repr_control/define_problem.py. The following items must be defined:

Dynamics
Reward function
Initial distributions
State and action bounds
Maximum rollout steps
Noise level

The following walks through how to define the stochastic inverted pendulum dynamics.

The pendulum dynamics are:

\[\ddot \theta = \frac{3g}{2l}\sin\theta + \frac{3}{ml^2} T\]

where $\theta$ is the angle, $g$ is the gravitational acceleration, $m$ is the pendulum mass, $l$ is the pendulum length, and $T$ is the input torque. Because $\theta$ is unbounded, the observation is defined as $[\cos\theta,\sin\theta, \dot \theta]$.
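
A minimal sketch of why this encoding helps: angles that differ by a multiple of $2\pi$ map to the same observation, and the wrapped angle can be recovered with atan2, as the dynamics function later in this section does. The helper names `encode` and `decode` are illustrative and not part of repr_control.

```python
import math
import torch

def encode(theta: torch.Tensor, theta_dot: torch.Tensor) -> torch.Tensor:
    """Map (theta, theta_dot) to the bounded observation [cos, sin, theta_dot]."""
    return torch.stack([torch.cos(theta), torch.sin(theta), theta_dot], dim=-1)

def decode(obs: torch.Tensor) -> torch.Tensor:
    """Recover the wrapped angle in (-pi, pi] from an observation."""
    return torch.atan2(obs[..., 1], obs[..., 0])

# Three angles that differ only by full turns.
theta = torch.tensor([0.5, 0.5 + 2 * math.pi, 0.5 - 4 * math.pi])
obs = encode(theta, torch.zeros(3))   # shape [3, 3]
wrapped = decode(obs)                 # every entry is close to 0.5
```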

We use Euler discretization combined with additive noise to obtain the stochastic dynamics:

\[x' = x + f(x, u)\Delta t + \epsilon\]

where $f$ is the continuous-time nonlinear dynamics, $\Delta t$ is the discretization time step, and $\epsilon\sim \mathcal N(0, \sigma^2 I_n)$ is Gaussian noise.
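
A minimal sketch of one stochastic Euler step, in which the increment $f(x, u)\Delta t$ and the noise $\epsilon$ are added to the current state. The function `euler_step` and the toy dynamics `f` are hypothetical illustrations; where repr_control actually injects the noise is not shown here.

```python
import torch

def euler_step(f, x, u, dt, sigma):
    """One stochastic Euler step: x' = x + f(x, u) * dt + eps, eps ~ N(0, sigma^2 I)."""
    eps = sigma * torch.randn_like(x)   # sigma is the noise standard deviation
    return x + f(x, u) * dt + eps

# Hypothetical toy continuous-time dynamics: dx/dt = -x + u
def f(x, u):
    return -x + u

x = torch.ones(4, 2)    # batch of 4 states with state_dim = 2
u = torch.zeros(4, 2)
x_next = euler_step(f, x, u, dt=0.05, sigma=0.0)    # noise-free: 1 + (-1) * 0.05 = 0.95
x_noisy = euler_step(f, x, u, dt=0.05, sigma=0.05)  # same mean, perturbed by Gaussian noise
```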

Define problem-related constants

import torch
import numpy as np
########################################################################################################################
# 1. define problem-related constants
########################################################################################################################
state_dim = 3                       # state dimension
action_dim = 1                      # action dimension
state_range = [[-1, -1, -8],
               [1, 1, 8]]           # low and high. We set bound on the state to ensure stable training.
action_range = [[-2], [2]]          # low and high
max_step = 200                      # maximum rollout steps per episode
sigma = 0.05                        # noise standard deviation
env_name = 'Pendulum'
assert len(action_range[0]) == len(action_range[1]) == action_dim

The following constants are defined:

variable	format	meaning
state_dim	int	state dimension
action_dim	int	action dimension
state_range	[list, list]	state lower and upper bounds. Sampling is reset when a bound is reached.
action_range	[list, list]	action lower and upper bounds.
max_step	int	maximum number of steps per episode.
sigma	float	Gaussian noise standard deviation $\sigma$.
env_name	str	Name of the dynamics
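
The assert in the constants block only checks action_range. A minimal sketch of a fuller consistency check is given below; `check_constants` is a hypothetical helper, not part of repr_control, and the rules it enforces are assumptions based on the table above.

```python
def check_constants(state_dim, action_dim, state_range, action_range, max_step, sigma):
    """Raise AssertionError if the problem constants are inconsistent."""
    for name, rng, dim in [("state_range", state_range, state_dim),
                           ("action_range", action_range, action_dim)]:
        assert len(rng) == 2, f"{name} must be [low, high]"
        low, high = rng
        assert len(low) == len(high) == dim, f"{name} entries must have length {dim}"
        assert all(lo < hi for lo, hi in zip(low, high)), f"{name}: low must be below high"
    assert isinstance(max_step, int) and max_step > 0, "max_step must be a positive int"
    assert sigma >= 0, "sigma must be non-negative"
    return True

# Values copied from the pendulum example.
ok = check_constants(
    state_dim=3, action_dim=1,
    state_range=[[-1, -1, -8], [1, 1, 8]],
    action_range=[[-2], [2]],
    max_step=200, sigma=0.05,
)
```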

Define dynamics and reward functions

Note that the dynamics must be written in PyTorch and all inputs must be torch.Tensor objects. The dynamics must support batch operations: the state and action inputs have shapes [batch_size, state_dim] and [batch_size, action_dim], respectively.

Define dynamics:

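def dynamics(state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    g = 10.0        # gravitational acceleration
    m = 1.          # pendulum mass
    l = 1.          # pendulum length
    max_a = 2.      # torque limit
    dt = 0.05       # discretization time step
    max_speed = 8   # angular velocity limit
    cos_th, sin_th, thdot = state[:, 0], state[:, 1], state[:, 2]
    th = torch.atan2(sin_th, cos_th)                    # recover the angle from the observation
    action = torch.reshape(action, (action.shape[0],))  # [batch_size, 1] -> [batch_size]
    u = torch.clip(action, -max_a, max_a)
    # Euler step on the angular velocity, then on the angle using the updated velocity
    newthdot = thdot + (3. * g / (2 * l) * torch.sin(th) + 3.0 / (m * l ** 2) * u) * dt
    newthdot = torch.clip(newthdot, -max_speed, max_speed)
    newth = th + newthdot * dt
    # stack back into the [cos, sin, thdot] observation, shape [batch_size, 3]
    next_state = torch.vstack([torch.cos(newth), torch.sin(newth), newthdot]).T
    assert next_state.shape == state.shape
    return next_state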
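
Before training, it can help to confirm that a dynamics function really accepts batched inputs. The sketch below is one way to do that; `check_batched` and `toy_dynamics` are hypothetical names used only for illustration, and the toy function stands in for your own dynamics.

```python
import torch

def check_batched(fn, state_dim, action_dim, batch_size=5):
    """Call fn on random batched inputs and verify the output keeps the state shape."""
    state = torch.randn(batch_size, state_dim)
    action = torch.randn(batch_size, action_dim)
    next_state = fn(state, action)
    assert isinstance(next_state, torch.Tensor), "dynamics must return a torch.Tensor"
    assert tuple(next_state.shape) == (batch_size, state_dim), \
        f"unexpected shape {tuple(next_state.shape)}"
    return next_state

# Hypothetical toy dynamics used only to exercise the check.
def toy_dynamics(state, action):
    return state + 0.05 * action.sum(dim=1, keepdim=True)

out = check_batched(toy_dynamics, state_dim=3, action_dim=1)
```

Passing the pendulum `dynamics` function with state_dim=3 and action_dim=1 instead of `toy_dynamics` would exercise the real definition in the same way.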

Define rewards:

def rewards(state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    cos_th, sin_th, thdot = state[:, 0], state[:, 1], state[:, 2]
    th = torch.atan2(sin_th, cos_th)
    action = torch.reshape(action, (action.shape[0],))
    reward = -0.3 * (th ** 2 + 0.1 * thdot ** 2 + 0.001 * action ** 2)
    return reward

2. Start training

Training can be started with a single command:

$ python solve.py

Define training hyperparameters

The hyperparameters can be set through command-line arguments, for example:

$ python solve.py --max_timesteps 2e5 --rf_num 1024

Here, --max_timesteps 2e5 sets the total number of training iterations to 2e5, and --rf_num 1024 sets the number of random features (the truncated finite dimension) to 1024.
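
To give a rough sense of what a finite number of random features means, here is a generic sketch of random Fourier features for a Gaussian kernel. This is an assumption for illustration only: the feature construction actually used by repr_control may differ, and `random_fourier_features`, `lengthscale`, and `seed` are hypothetical names.

```python
import math
import torch

def random_fourier_features(x, rf_num, lengthscale=1.0, seed=0):
    """Map x of shape [batch, dim] to rf_num random Fourier features.

    The inner product of two feature vectors approximates the Gaussian kernel
    exp(-||x - y||^2 / (2 * lengthscale^2)).
    """
    gen = torch.Generator().manual_seed(seed)
    dim = x.shape[1]
    w = torch.randn(dim, rf_num, generator=gen) / lengthscale  # random frequencies
    b = 2 * math.pi * torch.rand(rf_num, generator=gen)        # random phases
    return math.sqrt(2.0 / rf_num) * torch.cos(x @ w + b)

x = torch.zeros(2, 3)
x[1, 0] = 1.0                        # two points at Euclidean distance 1
phi = random_fourier_features(x, rf_num=1024)
approx = (phi[0] * phi[1]).sum()     # Monte Carlo estimate of exp(-0.5)
```

A larger rf_num reduces the approximation error at the cost of wider feature vectors.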

To list all tunable hyperparameters, run

$ python solve.py --help
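
As a hedged illustration of how such options can be parsed, the sketch below uses argparse with the two flags mentioned above. It is not the actual contents of solve.py: the default values and help strings are hypothetical, and parsing --max_timesteps as a float is an assumption made so that scientific notation such as 2e5 is accepted.

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical training options")
    # type=float accepts scientific notation such as 2e5; cast to int afterwards.
    parser.add_argument("--max_timesteps", type=float, default=1e5,
                        help="total number of training iterations")
    parser.add_argument("--rf_num", type=int, default=512,
                        help="number of random features")
    args = parser.parse_args(argv)
    args.max_timesteps = int(args.max_timesteps)
    return args

args = parse_args(["--max_timesteps", "2e5", "--rf_num", "1024"])
```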

Experimental: vectorized solution:

$ python solve_vec.py --max_timesteps 2e5 --device cuda

This uses vectorized rollout and evaluation to speed up training.

3. Monitoring and evaluating the training results

After training starts, the results are saved in the following layout:

repr-control/
├── repr-control/
│   ├── log/
│   │   ├── rfsac/
│   │   │   ├── seed_SEED_DATE-TIME          # folder title
│   │   │   │   ├── summary/                 # save tensorboard summaries
│   │   │   │   ├── best_actor.pth           # actor with the best evaluations
│   │   │   │   ├── best_critic.pth          # critic with the best evaluations
│   │   │   │   ├── last_actor.pth           # actor after all training steps
│   │   │   │   ├── last_critic.pth          # critic after all training steps
│   │   │   │   └── train_params.yaml        # training parameters

Run the following script to evaluate the trained results:

$ python scripts/eval.py $LOG_PATH

where $LOG_PATH is the path to the folder named seed_SEED_DATE-TIME.
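
Before evaluating, you may want to confirm that a run folder contains the files listed in the tree above. A minimal sketch, assuming those file names; `missing_files` is a hypothetical helper and the temporary directory only stands in for a real $LOG_PATH.

```python
import tempfile
from pathlib import Path

# File names taken from the directory tree above.
EXPECTED_FILES = ["best_actor.pth", "best_critic.pth",
                  "last_actor.pth", "last_critic.pth", "train_params.yaml"]

def missing_files(log_path):
    """Return the expected files that are absent from a run folder."""
    folder = Path(log_path)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]

# Demonstration on a temporary folder that holds only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "best_actor.pth").touch()
    missing = missing_files(run_dir)   # every expected file except best_actor.pth
```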

Monitoring the training process

$ tensorboard --logdir $LOG_PATH

You can inspect the training process via TensorBoard.

Note

Monitoring the training process is very helpful for tuning the hyperparameters. Some rules of thumb if you don’t have experience tuning RL hyperparameters:

If the value loss is too large, try to scale the rewards to be smaller (or increase the learning rate).
If the agent always gets stuck, try adapting the initial distribution to cover more of the state space.
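
The first rule of thumb can be sketched as a thin wrapper around a batched reward function. The names `scale_rewards` and `raw_rewards` are hypothetical, and the factor 0.1 is an arbitrary example, not a recommended value.

```python
import torch

def scale_rewards(reward_fn, scale):
    """Wrap a batched reward function so its output is multiplied by `scale`."""
    def scaled(state, action):
        return scale * reward_fn(state, action)
    return scaled

# Hypothetical quadratic reward used only for illustration.
def raw_rewards(state, action):
    return -(state ** 2).sum(dim=1) - 0.001 * (action ** 2).sum(dim=1)

rewards = scale_rewards(raw_rewards, scale=0.1)
state = torch.ones(4, 3)
action = torch.zeros(4, 1)
r = rewards(state, action)   # each entry is 0.1 * (-3.0) = -0.3
```

Note that in entropy-regularized methods such as soft actor-critic, the reward scale also changes the relative weight of the entropy term, so it interacts with the temperature setting.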

Evaluating the training results:

$ python scripts/eval.py $LOG_PATH

I placed example results in the examples folder; you can run the following to see them:

$ tensorboard --logdir ./examples/example_results/rfsac/Pendulum/seed_0_2024-07-18-14-50-35

$ python scripts/eval.py ./examples/example_results/rfsac/Pendulum/seed_0_2024-07-18-14-50-35

4. Use controller elsewhere

Add the following lines to your Python code to load the training results as a controller:

import numpy as np
from repr_control.scripts.eval import get_controller
log_path = '$LOG_PATH'
agent = get_controller(log_path)

To generate a control command from a state:

state = np.zeros([3]) # a sample state with all zeros
action = agent.select_action(state, explore=False)
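
A minimal sketch of using such a controller in a closed loop is shown below. It assumes only the select_action(state, explore=False) call from the snippet above; `run_closed_loop`, `ZeroAgent`, and `toy_step` are hypothetical stand-ins so that the example runs without a trained model, and the return type of the real controller is assumed to be array-like.

```python
import numpy as np

def run_closed_loop(agent, step_fn, state, n_steps):
    """Roll out `agent` for n_steps; return the visited states and applied actions."""
    states, actions = [state], []
    for _ in range(n_steps):
        action = agent.select_action(state, explore=False)
        state = step_fn(state, action)
        states.append(state)
        actions.append(action)
    return np.array(states), np.array(actions)

class ZeroAgent:
    """Hypothetical stand-in exposing the same select_action call as the loaded agent."""
    def select_action(self, state, explore=False):
        return np.zeros(1)

def toy_step(state, action):
    """Hypothetical plant in which the state decays toward zero."""
    return 0.9 * state + 0.1 * action[0]

states, actions = run_closed_loop(ZeroAgent(), toy_step, np.ones(3), n_steps=10)
```

In practice, `ZeroAgent()` would be replaced by the agent returned from get_controller, and `toy_step` by your own system or simulator.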