The ultimate goal in RL is to learn a policy $\pi(a \mid s)$ that maximises the expected return in an MDP $(S, A, P, r, \rho_0, \gamma)$, with the reward $r : S \times A \times S \to \mathbb{R}$ provided by the environment. At time $t$ the agent samples $a_t \sim \pi(\cdot \mid s_t)$ and the environment transitions to $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, where $a_t \in A$ and $s_t \in S$.
It’s convenient to bundle one full episode into a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$ for $t \in [0, T]$.
In general, each step of interaction follows this loop (a minimal code sketch follows the list):
1) Observe the current state $s_t$ (at episode start, $s_0 \sim \rho_0(\cdot)$)
2) Select the action $a_t$
3) Transition to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
4) Retrieve the reward from the environment, $r_t = r(s_t, a_t, s_{t+1})$
5) Repeat
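As a minimal sketch, here is that loop in code, assuming a Gymnasium-style environment (CartPole as a stand-in task) and a random policy in place of $\pi$:

```python
import gymnasium as gym

# A minimal sketch of the interaction loop above, assuming a Gymnasium-style
# environment; the random action stands in for sampling a_t ~ pi(.|s_t).
env = gym.make("CartPole-v1")

state, _ = env.reset()                    # 1) observe s_0 ~ rho_0(.)
done = False
while not done:
    action = env.action_space.sample()    # 2) select an action a_t
    next_state, reward, terminated, truncated, _ = env.step(action)  # 3) and 4)
    done = terminated or truncated
    state = next_state                    # 5) repeat from the new state
```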
So far I have described the stochastic setting. There is also a deterministic special case with the policy $a_t = \mu(s_t)$ and, when the environment is noise free, deterministic dynamics $s_{t+1} = f(s_t, a_t)$. However, in most cases, especially in the real world, stochasticity is impossible to ignore.
Of the oversimplified RL algorithm above, step 2) is the chief concern of most RL research. Since this post is titled “Policy Gradients” I’m going to focus on action selection. There’s plenty of room for later posts on reward functions (stay tuned).
Action Selection
Select the action $a_t$
The best action should, by definition, yield the best reward, or more concretely the best expected return, and it comes from the policy $a_t \sim \pi(\cdot \mid s_t)$. The effectiveness of this policy can be quantified with the value function,
$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\!\left[\left.\sum_{t=0}^{\infty} \gamma^t r_t \,\right|\, s_0 = s\right].$$
This quantifies how valuable a state $s$ is when every action is sampled directly from the policy, $a_t \sim \pi(\cdot \mid s_t)$, assuming the episode starts at $t = 0$.
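The expectation over trajectories can be approximated by brute force. A rough Monte Carlo sketch, assuming a Gymnasium CartPole environment and a random policy standing in for $\pi$: average the discounted returns of many rollouts that share the same start state.

```python
import gymnasium as gym
import numpy as np

# A rough Monte Carlo sketch of V^pi(s_0): average the discounted return of many
# rollouts that all begin from the same start state, acting with a random policy.
env = gym.make("CartPole-v1")
gamma, returns = 0.99, []

for _ in range(100):
    env.reset(seed=0)                    # same seed => every rollout shares s_0
    done, G, t = False, 0.0, 0
    while not done:
        _, r, terminated, truncated, _ = env.step(env.action_space.sample())
        G += gamma**t * r
        t += 1
        done = terminated or truncated
    returns.append(G)

v_estimate = np.mean(returns)            # ~ V^pi(s_0) for the random policy
```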
The value function above employs a reward discounted by $\gamma$, which determines how myopic ($\gamma \approx 0$) or foresighted ($\gamma \approx 1$) the agent should be. The infinite sum requires $\gamma \in [0, 1)$ to converge. If a present reward isn’t more valuable than a future reward, then discounting can be dropped in favour of a finite time horizon. Written for an episode that starts at $t = 0$, or more generally at any $t$, the return is

$$R(\tau) = \begin{cases} \sum_{k=0}^{\infty} \gamma^k r_{t+k} & \text{infinite-horizon discounted,} \\ \sum_{k=0}^{T-t} r_{t+k} & \text{finite-horizon undiscounted.} \end{cases}$$
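Both forms of the return are a couple of lines of Python to compute from a list of per-step rewards; the function names below are just for illustration.

```python
# Both forms of R(tau), computed from the rewards r_t, r_{t+1}, ... of one episode.
def discounted_return(rewards, gamma=0.99):
    """Discounted return: sum_k gamma^k * r_{t+k}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def finite_horizon_return(rewards):
    """Finite-horizon undiscounted return: sum_k r_{t+k}."""
    return sum(rewards)

rewards = [1.0, 1.0, 0.5, 0.0, 2.0]
print(discounted_return(rewards), finite_horizon_return(rewards))
```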
The expected value $\mathbb{E}_{\tau \sim \pi}[R(\tau)]$ measures how well the policy performs. Formally, we want to solve for the optimal policy,
$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi}[R(\tau)].$$
What if we want to evaluate an arbitrary action $a$ in a state $s$, rather than only the action a fixed policy would take as in $V^{\pi}$? Enter the Q-function,
$$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a].$$
Unless stated otherwise, $Q(s, a)$ denotes $Q^{\pi}(s, a)$ for the current policy; $Q^{*}$ denotes the optimal action-value.
If the action space $A$ and state space $S$ are both small and discrete, it’s feasible to sample enough experience to estimate a Q-value for every state-action pair and store them in a Q-table. The table can then be used to select the best action,
$$a = \arg\max_{a \in A} Q^{\pi}(s, a).$$
In discrete action spaces $\arg\max_{a \in A} Q^{\pi}(s, a)$ can be computed with a finite $O(|A|)$ scan of the Q-table, which is one line of numpy (see the sketch below).
In continuous spaces $\arg\max_{a \in A} Q^{\pi}(s, a)$ is only tractable when $Q^{\pi}(s, \cdot)$ has a concave form, allowing convex optimisation to solve for the best $a$.
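A minimal sketch of both cases, using a made-up tabular $Q$ and a made-up concave $Q^{\pi}(s, \cdot)$ purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Discrete case: the greedy action is a single O(|A|) scan of one Q-table row.
num_states, num_actions, s = 10, 4, 3          # made-up sizes and current state
Q_table = np.random.rand(num_states, num_actions)
best_discrete_action = int(np.argmax(Q_table[s]))

# Continuous case: if Q^pi(s, .) happens to be concave in a (a toy quadratic here),
# maximising it is a convex problem we can hand to an off-the-shelf optimiser.
def q_value(a, s):
    return -(a - 0.1 * s) ** 2                 # hypothetical concave Q^pi(s, a)

result = minimize(lambda a: -q_value(a[0], s), x0=np.zeros(1))  # max Q == min -Q
best_continuous_action = float(result.x[0])
```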
Most real-world problems (especially in robotics, the field I’m most interested in) unfortunately sit within extremely large continuous action spaces of torque commands, actuator controls and gripper commands. State spaces are likewise high dimensional and continuous, spanning joint angles and velocities, torques, motor currents and other sensor readings.
This means a pure $\arg\max_{a} Q^{\pi}(s, a)$ approach is often impractical, motivating actor-critic and deterministic policy gradients; I’ll cover when value-based control is the right tool (SARSA, Q-learning, DQN) in a separate post.
We’re here to talk about policy gradients.
Policy Gradients
Real environments are high dimensional and continuous, so we cannot sample the value function densely enough to reliably find a globally best action. Instead we can borrow concepts from supervised learning with gradient descent to optimise a policy parametrised by $\theta \in \mathbb{R}^d$ directly from interaction, updating $\pi_\theta$ with gradients of the expected return estimated from rollouts. In short, policy gradients learn by doing.
Given the expected return,
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)],$$
we can compute its gradient $\nabla_\theta J(\theta)$ and use gradient ascent to maximise the expected return,
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta).$$
By contrast, in supervised learning we use gradient descent, $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$, since we’re minimising a loss function.
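A toy sketch of the difference in sign, maximising a made-up concave $J(\theta)$ whose gradient we can write down by hand; in RL the gradient must instead be estimated from rollouts, which is what the rest of this section derives.

```python
# Toy gradient ascent on a made-up objective J(theta) = -(theta - 3)^2,
# whose gradient is known in closed form (unlike the RL case).
def grad_J(theta):
    return -2.0 * (theta - 3.0)

theta, alpha = 0.0, 0.1
for _ in range(100):
    theta = theta + alpha * grad_J(theta)   # ascent adds the gradient; descent subtracts

print(theta)                                # approaches the maximiser theta* = 3
```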
Computing $\nabla_\theta J(\theta)$ isn’t straightforward. First we need to expand out the expression,

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \nabla_\theta \int_\tau P(\tau \mid \theta) R(\tau)\, d\tau.$$
The bottleneck is the integral over all possible trajectories, $\int_\tau P(\tau \mid \theta) R(\tau)\, d\tau$. The sample space is enormous and there is no closed-form solution since the dynamics are not known up front. The policy gradient theorem simplifies the gradient to an expectation we can sample,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right].$$
To get there, we expand the definition of $P(\tau \mid \theta)$, where $\rho_0$ is the initial state distribution, and apply the log-derivative trick to remove the terms with unknown dynamics,