Usage

1. Defining nonlinear dynamics

The problem is defined in repr_control/define_problem.py. The following items should be defined there:

  • Dynamics

  • Reward function

  • Initial distributions

  • State and action bounds

  • Maximum rollout steps

  • Noise level

The following walks through how to define the stochastic inverted pendulum dynamics.

The pendulum dynamics are:

\[\ddot \theta = \frac{3g}{2l}\sin\theta + \frac{3}{ml^2} T\]

where \(\theta\) is the angle, \(g\) is the gravity constant, \(m\) is the pendulum mass, \(l\) is the pendulum length, and \(T\) is the input torque. To deal with the unbounded \(\theta\), the observation is defined as \([\cos\theta,\sin\theta, \dot \theta]\).
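As a small illustration of this encoding, the sketch below converts an angle to the observation and back. The helper names to_observation and to_angle are hypothetical and are not part of repr_control; plain Python is used only for readability.

```python
import math


def to_observation(theta, theta_dot):
    """Encode (theta, theta_dot) as [cos(theta), sin(theta), theta_dot]."""
    return [math.cos(theta), math.sin(theta), theta_dot]


def to_angle(observation):
    """Recover theta in [-pi, pi] from an observation."""
    cos_th, sin_th, _ = observation
    return math.atan2(sin_th, cos_th)


# Angles that differ by a full turn map to the same bounded observation.
obs_a = to_observation(0.5, 0.0)
obs_b = to_observation(0.5 + 2 * math.pi, 0.0)
```

The dynamics function uses the same atan2 step to recover the angle from the observation.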

We use Euler discretization combined with additive noise, which gives the stochastic dynamics

\[x' = x + f(x, u)\Delta t + \epsilon\]

where \(f\) is the continuous time nonlinear dynamics, and \(\epsilon\sim \mathcal N(0, \sigma^2 I_n)\).
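For intuition, here is a minimal scalar sketch of one explicit Euler step with additive Gaussian noise, using the pendulum constants from this section. It works on the raw state (theta, theta_dot) in plain Python rather than on batched tensors; the names pendulum_f and euler_step and the sigma argument are assumptions made for illustration, not part of repr_control.

```python
import math
import random


def pendulum_f(x, u, g=10.0, m=1.0, l=1.0):
    """Continuous-time dynamics f(x, u) for x = (theta, theta_dot)."""
    theta, theta_dot = x
    theta_ddot = 3.0 * g / (2.0 * l) * math.sin(theta) + 3.0 / (m * l ** 2) * u
    return (theta_dot, theta_ddot)


def euler_step(x, u, dt=0.05, sigma=0.0, rng=random):
    """One step x' = x + f(x, u) * dt + eps, with eps ~ N(0, sigma^2 I)."""
    dx = pendulum_f(x, u)
    return tuple(xi + dxi * dt + rng.gauss(0.0, sigma) for xi, dxi in zip(x, dx))


# Noise-free step from rest at theta = 0 with unit torque.
next_x = euler_step((0.0, 0.0), 1.0)
# Noisy step; sigma is a hypothetical noise level.
noisy_x = euler_step((0.0, 0.0), 1.0, sigma=0.01)
```

Note that this sketch updates both components from the old state, whereas the dynamics function in this section updates the angle with the already-updated angular velocity.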

Define dynamics and reward functions

Note that the dynamics must be written in PyTorch and all inputs must be torch.Tensor objects. The dynamics must support batch operations, which means the state and action inputs have shapes [batch_size, state_dim] and [batch_size, action_dim], respectively.

Define dynamics:

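import torch

def dynamics(state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    g = 10.0        # gravity constant
    m = 1.          # pendulum mass
    l = 1.          # pendulum length
    max_a = 2.      # torque limit
    dt = 0.05       # discretization time step
    max_speed = 8   # angular velocity limit
    # Recover the angle from the observation [cos(th), sin(th), thdot].
    cos_th, sin_th, thdot = state[:, 0], state[:, 1], state[:, 2]
    th = torch.atan2(sin_th, cos_th)
    # Flatten the action from [batch_size, 1] to [batch_size] and clip it.
    action = torch.reshape(action, (action.shape[0],))
    u = torch.clip(action, -max_a, max_a)
    # Euler step on the angular velocity, then on the angle.
    newthdot = thdot + (3. * g / (2 * l) * torch.sin(th) + 3.0 / (m * l ** 2) * u) * dt
    newthdot = torch.clip(newthdot, -max_speed, max_speed)
    newth = th + newthdot * dt
    # Re-encode as [cos(th), sin(th), thdot] with shape [batch_size, 3].
    next_state = torch.vstack([torch.cos(newth), torch.sin(newth), newthdot]).T
    assert next_state.shape == state.shape
    return next_state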
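Before training, it can help to confirm that a dynamics function really accepts batched tensors. The sketch below is one possible check; check_batched_dynamics and toy_dynamics are hypothetical names used for illustration and are not provided by repr_control.

```python
import torch


def check_batched_dynamics(dynamics_fn, state_dim, action_dim, batch_size=4):
    """Call dynamics_fn on random batched inputs and verify the output shape."""
    state = torch.randn(batch_size, state_dim)
    action = torch.randn(batch_size, action_dim)
    next_state = dynamics_fn(state, action)
    assert isinstance(next_state, torch.Tensor)
    assert next_state.shape == (batch_size, state_dim)
    return next_state


def toy_dynamics(state, action):
    # Hypothetical stand-in: a damped system driven by the first action channel.
    return 0.9 * state + 0.1 * action[:, :1]


out = check_batched_dynamics(toy_dynamics, state_dim=3, action_dim=1)
```

Passing the pendulum dynamics function in place of toy_dynamics, with state_dim=3 and action_dim=1, would exercise the same check.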

Define rewards:

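def rewards(state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    # Recover the angle from the observation [cos(th), sin(th), thdot].
    cos_th, sin_th, thdot = state[:, 0], state[:, 1], state[:, 2]
    th = torch.atan2(sin_th, cos_th)
    # Flatten the action from [batch_size, 1] to [batch_size].
    action = torch.reshape(action, (action.shape[0],))
    # Quadratic penalty on angle, angular velocity, and torque.
    reward = -0.3 * (th ** 2 + 0.1 * thdot ** 2 + 0.001 * action ** 2)
    return reward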
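To make the reward scale concrete, the following sketch evaluates the same quadratic expression for single scalar values in plain Python. The function name pendulum_reward is hypothetical; the real rewards function operates on batched tensors.

```python
import math


def pendulum_reward(theta, theta_dot, torque):
    """Scalar version of -0.3 * (th^2 + 0.1 * thdot^2 + 0.001 * u^2)."""
    return -0.3 * (theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)


# The reward is largest (zero) at theta = 0, at rest, with zero torque.
at_target = pendulum_reward(0.0, 0.0, 0.0)
# The angle penalty is largest at theta = pi.
opposite = pendulum_reward(math.pi, 0.0, 0.0)
# A mixed case: angle 1 rad, angular velocity 2 rad/s, torque 1.
mixed = pendulum_reward(1.0, 2.0, 1.0)
```

The reward is never positive, so the episodic return is bounded above by zero.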

2. Start training

Training can be started with a single command:

$ python solve.py

Define training hyperparameters

The hyperparameters can be set through command line arguments, for example

$ python solve.py --max_timesteps 2e5 --rf_num 1024

Here --max_timesteps 2e5 sets the total number of training iterations to 2e5, and --rf_num 1024 sets the truncated (finite) dimension of the random features to 1024.
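As a rough illustration of how flags like these can be parsed, here is a sketch using Python's argparse. It is not the actual solve.py parser: the argument types and the default values below are placeholder assumptions, and only the two flag names come from the command above.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Hypothetical sketch, not the real solve.py parser")
    # A float type lets scientific notation such as 2e5 be accepted.
    parser.add_argument("--max_timesteps", type=float, default=1e5)  # placeholder default
    parser.add_argument("--rf_num", type=int, default=512)           # placeholder default
    return parser


args = build_parser().parse_args(["--max_timesteps", "2e5", "--rf_num", "1024"])
max_timesteps = int(args.max_timesteps)  # convert to an integer iteration count
```

Parsing the iteration count as a float and converting it afterwards is one way to accept values written as 2e5; the real script may handle this differently.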

To list all hyperparameters that can be tuned, run

$ python solve.py --help

  • Experimental: a vectorized solution that uses vectorized rollout and evaluation to speed up training:

$ python solve_vec.py --max_timesteps 2e5 --device cuda

3. Monitoring and evaluating the training results

After training starts, the results are saved in the following layout:

repr-control/
├── repr-control/
│   ├── log/
│   │   ├── rfsac/
│   │   │   ├── seed_SEED_DATE-TIME          # folder title
│   │   │   │   ├── summary/                 # save tensorboard summaries
│   │   │   │   ├── best_actor.pth           # actor with the best evaluations
│   │   │   │   ├── best_critic.pth          # critic with the best evaluations
│   │   │   │   ├── last_actor.pth           # actor after all training steps
│   │   │   │   ├── last_critic.pth          # critic after all training steps
│   │   │   │   └── train_params.yaml        # training parameters

Run the following script to evaluate the trained results:

$ python scripts/eval.py $LOG_PATH

where $LOG_PATH is the path to the seed_SEED_DATE-TIME folder.
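If several runs have accumulated, a small helper can pick the most recent run folder to use as $LOG_PATH. The sketch below assumes the log layout shown above (run folders whose names start with seed_ somewhere under log/rfsac); the function name latest_run is hypothetical.

```python
from pathlib import Path


def latest_run(log_root, algo="rfsac"):
    """Return the most recently modified seed_* folder under log_root/algo, or None."""
    runs = [p for p in (Path(log_root) / algo).rglob("seed_*") if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```

The returned path can then be passed to scripts/eval.py or to tensorboard --logdir.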

Monitoring the training process

$ tensorboard --logdir $LOG_PATH

You can inspect the training process in TensorBoard.

Note

Monitoring the training process is very helpful for tuning the hyperparameters. Some rules of thumb if you have little experience tuning RL hyperparameters:

  • If the value loss is too large, try to scale the rewards to be smaller (or increase the learning rate).

  • If the agent always gets stuck, try to adapt the initial distribution to cover more of the state space.
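As one way to apply the first rule of thumb, the reward function can be wrapped so that every reward is multiplied by a constant. The wrapper name scale_reward and the factor 0.1 are assumptions for illustration; the toy reward below is a placeholder that works on plain numbers, although the same wrapper also works with tensor-valued rewards.

```python
def scale_reward(reward_fn, scale):
    """Wrap reward_fn so that every reward is multiplied by a constant scale."""
    def scaled(state, action):
        return scale * reward_fn(state, action)
    return scaled


def toy_reward(state, action):
    # Hypothetical quadratic penalty used only to demonstrate the wrapper.
    return -(state ** 2 + action ** 2)


small_reward = scale_reward(toy_reward, 0.1)
value = small_reward(2.0, 1.0)
```

Scaling by a positive constant leaves the optimal policy of the underlying control problem unchanged but shrinks the magnitude of the value targets.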

Evaluating the training results:

$ python scripts/eval.py $LOG_PATH

I placed example results in the examples folder; you can run the following to see them:

$ tensorboard --logdir ./examples/example_results/rfsac/Pendulum/seed_0_2024-07-18-14-50-35
$ python scripts/eval.py ./examples/example_results/rfsac/Pendulum/seed_0_2024-07-18-14-50-35

4. Use controller elsewhere

Add the following lines to your Python code to load the training results as a controller:

import numpy as np
from repr_control.scripts.eval import get_controller
log_path = '$LOG_PATH'  # replace with the path to your seed_SEED_DATE-TIME folder
agent = get_controller(log_path)

To generate a control command from a state:

state = np.zeros([3])  # a sample state with all zeros
action = agent.select_action(state, explore=False)
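A controller is usually queried in a loop. The sketch below shows one possible closed-loop rollout that relies only on the select_action(state, explore=False) call shown above. The names run_closed_loop and step_fn are hypothetical, with step_fn standing for your own simulator or plant interface, and ZeroAgent is a stand-in so the sketch runs without a trained model.

```python
def run_closed_loop(agent, step_fn, initial_state, num_steps):
    """Roll out a controller for num_steps; return visited states and actions."""
    states, actions = [initial_state], []
    state = initial_state
    for _ in range(num_steps):
        action = agent.select_action(state, explore=False)
        state = step_fn(state, action)
        states.append(state)
        actions.append(action)
    return states, actions


class ZeroAgent:
    """Stand-in for a loaded controller; always outputs zero."""

    def select_action(self, state, explore=False):
        return 0.0


# Toy plant: the state simply increases by one each step.
states, actions = run_closed_loop(ZeroAgent(), lambda s, a: s + 1, 0, 3)
```

With a trained controller, the agent returned by get_controller would take the place of ZeroAgent.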