🖊️ writing down my thoughts here

proximal policy optimization

understanding the math behind proximal policy optimization (ppo)

hey there! welcome to this deep dive into proximal policy optimization (ppo), a powerhouse in reinforcement learning (rl). i’m stoked to break this down for you, especially if you’re new to rl or just want to geek out on the math. since ppo builds on rl, we’ll start with a refresher to get everyone on the same page, then plunge into ppo’s heavy-duty math and explore its variants like ppo-clip and ppo-penalty.

this post is gonna be thorough—i’ll explain every concept, notation, and equation step by step, starting with the intuition behind why we’re doing it. my goal is to make it feel like a chat with a friend, not a textbook. we’ll dig into the heaviest math i can muster, but i’ll keep it clear for beginners too. it’ll be long, so grab a coffee and let’s roll!

reinforcement learning refresher

let’s lay the groundwork with rl basics. this is the foundation for ppo, so we’ll make it solid.

what is reinforcement learning?

rl is about an agent learning to make decisions by interacting with an environment. the agent picks actions, gets rewards (or penalties), and aims to find a policy—a strategy—that maximizes total reward over time.

intuition: imagine training a dog to fetch. you throw the ball (environment sets the state), the dog runs (action), and you give a treat if it grabs the ball (reward). over time, the dog learns what earns treats. rl formalizes this with math.

key components

here’s the rl lingo, explained simply:

- agent: the learner and decision-maker (the dog).
- environment: everything the agent interacts with (the yard, the ball).
- state (st): a snapshot of the situation at time step t.
- action (at): a choice the agent makes in state st.
- reward (rt): a scalar signal telling the agent how good that step was.
- policy (π): the agent's rule for picking actions given states.

intuition: the agent moves through time steps. at step t, it’s in state st, picks action at using its policy, gets reward rt, and lands in state st+1. it’s a loop where the agent tries to max out its score.

maximizing cumulative reward

intuition: the agent wants to pile up as much reward as possible. but future rewards are less valuable—like $10 now beats $10 in a year. we use a discount factor γ (between 0 and 1) to weigh future rewards less, which also keeps the infinite sum of rewards finite.

why it’s needed: we need one number to represent all future rewards so the agent can optimize its choices. discounting balances short-term and long-term gains.

what we’re going to do: define the return (Gt), the total discounted reward from time t onward.

math:

Gt = rt + γ rt+1 + γ² rt+2 + … = Σ_{k=0}^∞ γ^k rt+k

math breakdown: this says: add all future rewards, but discount them more the further out they are. the agent’s goal is to maximize Gt.
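to make the return concrete, here's a minimal python sketch; the reward sequence and γ are made-up illustration values:

```python
# a minimal sketch of the discounted return; rewards and gamma are
# made-up illustration values.
def discounted_return(rewards, gamma):
    # G = r_0 + gamma * (r_1 + gamma * (r_2 + ...)), computed backwards
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```

computing backwards avoids raising γ to a power at every step.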

policies and value functions

intuition: the agent needs a plan (policy) and a way to judge its options (value functions). the policy is the playbook; value functions are the scorecard.

why it’s needed: the policy guides actions, and value functions evaluate their worth, helping the agent choose high-reward paths.

what we’re going to do: define policies and two value functions.

policies

a policy π maps states to actions. it can be:

- deterministic: always pick the same action in a given state, a = μ(s).
- stochastic: pick actions with some probability given the state.

we write a stochastic policy as π(a|s), the probability of action a in state s. e.g., π(run|ball far)=0.8.

intuition: stochastic policies enable exploration—trying new actions to find better ones.
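here's a toy sketch of a stochastic policy for a single state, modeled as a softmax over per-action scores; the scores are made-up, not learned:

```python
import numpy as np

# a toy stochastic policy for one state: softmax over per-action scores.
# the scores are made-up, not learned.
def softmax(scores):
    z = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return z / z.sum()

probs = softmax(np.array([2.0, 0.5]))  # e.g. actions: "run", "wait"
action = np.random.choice(len(probs), p=probs)  # exploration: sampling, not argmax
```

sampling (rather than always taking the argmax) is exactly what lets the agent stumble onto better actions.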

value functions

value functions estimate quality:

- the state-value function vπ(s): expected return starting from state s and following π afterward.
- the action-value function qπ(s,a): expected return after taking action a in state s, then following π.

intuition: vπ(s) is “how good is this state?” qπ(s,a) is “how good is this action here?”

math:

vπ(s) = 𝔼π[ Gt | st = s ]

qπ(s,a) = 𝔼π[ Gt | st = s, at = a ]

math breakdown:

- 𝔼π[·] is the expected value when actions are chosen by policy π.
- conditioning on st = s (and at = a) fixes the starting point of the return.
- so vπ averages over whatever actions π might pick, while qπ pins down the first action and lets π take over afterward.

policy gradient methods

intuition: ppo is a policy gradient method, so let’s understand these. instead of learning value functions first, we directly tweak the policy to favor high-reward actions. it’s like coaching the dog to fetch faster by adjusting its strategy.

why it’s needed: directly optimizing the policy can be faster than learning values then deriving actions. but big tweaks can destabilize learning.

what we’re going to do: introduce a parameterized policy and derive the policy gradient.

we model the policy as πθ(a|s), where θ is parameters (e.g., neural network weights). the goal is to maximize the expected return:

J(θ) = 𝔼[ vπθ(s0) ]

where s0 is the initial state.

intuition: J(θ) is the policy’s “score”—average reward starting from s0. we use gradient ascent to adjust θ to increase J(θ).

math: the policy gradient theorem gives:

∇θ J(θ) = 𝔼πθ[ Σ_{t=0}^∞ ∇θ log πθ(at|st) · qπ(st,at) ]

derivation intuition: we want to know how changing θ affects J. the log probability logπθ(atst) tells us how θ influences action choices. multiplying by qπ(st,at) weights updates by how good the action is.

math breakdown:

- ∇θ log πθ(at|st): the direction in parameter space that makes action at more likely in state st.
- qπ(st,at): how good that action was; it scales the update, so good actions get pushed up harder.
- the expectation averages over trajectories sampled from πθ itself.
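as a sanity check, here's a tiny sketch of this gradient for a softmax policy over two actions in a single state. the q-values are assumed illustration values, and we use the softmax identity ∇θ log πθ(a) = onehot(a) − πθ:

```python
import numpy as np

# a tiny sketch of the policy gradient for a softmax policy over two actions
# in one state. the q-values are assumed; for a softmax policy,
# grad_theta log pi(a) = onehot(a) - pi(theta).
theta = np.zeros(2)           # policy parameters (logits)
q = np.array([1.0, 0.0])      # assumed action values

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    g = -pi(theta)
    g[a] += 1.0
    return g

# exact policy gradient in this one-state case: sum_a pi(a) * grad log pi(a) * q(a)
p = pi(theta)
grad = sum(p[a] * grad_log_pi(theta, a) * q[a] for a in range(2))
theta = theta + 0.5 * grad    # one gradient-ascent step: action 0 becomes likelier
```

after one step the probability of the higher-value action rises above 0.5, which is the whole idea.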

problem: using qπ directly is noisy, and big updates can destabilize the policy. ppo fixes this with a trust region approach.

proximal policy optimization (ppo)

now for ppo—the star of the show! we’ll dive into its math, then explore variants.

what is ppo?

ppo, introduced by openai in 2017, is a policy gradient method balancing stability and efficiency. it’s popular for tasks like game-playing or robotics.

intuition: imagine coaching the dog to fetch, but you don’t overhaul its training daily—it’d get confused. ppo makes small, safe policy updates, improving without breaking what works.

why it’s needed: vanilla policy gradients can be unstable—big changes to θ might worsen performance. ppo constrains updates to a “trust region.”

the advantage function

intuition: we need the advantage function to measure how much better an action is than the average in a state. it’s like saying, “fetching now beats sniffing around.”

why it’s needed: advantages focus updates on specific actions, reducing noise compared to raw returns.

what we’re going to do: define the advantage and discuss estimation.

math:

At = qπ(st,at) − vπ(st)

math breakdown:

- At > 0: action at was better than π's average behavior in st, so make it more likely.
- At < 0: worse than average, so make it less likely.
- subtracting vπ(st) acts as a baseline, reducing variance without biasing the gradient.

estimation: we often use generalized advantage estimation (gae):

At^GAE(γ,λ) = Σ_{l=0}^∞ (γλ)^l δt+l

where δt = rt + γ v(st+1) − v(st) is the temporal difference error, and λ (between 0 and 1) balances bias and variance.

intuition: gae smooths advantage estimates, making updates more stable. λ controls how much we rely on multi-step estimates.

derivation intuition: δt measures prediction error in value estimates. summing discounted errors gives a robust advantage estimate.

math breakdown:

- λ = 0 reduces to the one-step td error δt: low variance, more bias.
- λ = 1 recovers the full discounted return minus the value baseline: low bias, more variance.
- values in between trade the two off; λ ≈ 0.95 is a common default.
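the sum above has a neat backward recursion, At = δt + (γλ)·At+1, which is how gae is usually computed. here's a minimal sketch with made-up rewards and value estimates, treating the rollout as a full episode (so v is 0 past the last step):

```python
import numpy as np

# a minimal sketch of generalized advantage estimation over a finite rollout.
# rewards and value estimates are made-up; the rollout is a full episode,
# so v(s) is treated as 0 past the last step.
def gae(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]  # td error
        running = delta + gamma * lam * running  # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = running
    return adv

adv = gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
```

earlier steps accumulate more discounted td errors, so their advantage estimates are larger here.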

ppo-clip: the clipped surrogate objective

intuition: ppo-clip (the standard ppo) updates the policy by comparing the new policy πθ to the old πθold, limiting changes with a clipping mechanism. it’s like tweaking the dog’s fetching but keeping it close to the old routine.

why it’s needed: unconstrained updates can lead to bad policies. clipping ensures stability by enforcing a trust region.

what we’re going to do: derive the clipped surrogate objective.

math:

Lclip(θ) = 𝔼t[ min( rt(θ) At, clip(rt(θ), 1−ϵ, 1+ϵ) At ) ]

where:

rt(θ) = πθ(at|st) / πθold(at|st)

math breakdown:

- rt(θ): the probability ratio between the new and old policies for the sampled action; rt = 1 means no change.
- clip(rt, 1−ϵ, 1+ϵ): limits the ratio to the range [1−ϵ, 1+ϵ] (a typical choice is ϵ = 0.2).
- the min picks the more pessimistic of the two terms, so the objective never rewards moving the ratio past the clip range.

derivation intuition: we want to maximize J(θ) but stay close to πθold. the ratio rt measures policy divergence, and clipping prevents rt from straying too far, enforcing a trust region.

how it works:

- if At > 0, the unclipped term grows with rt, but the clip caps the benefit at (1+ϵ)At, so there's no incentive to push rt above 1+ϵ.
- if At < 0, the clip floors rt at 1−ϵ, so one bad advantage estimate can't push the policy arbitrarily far away from an action.
- either way, updates that would step outside the trust region get zero gradient.
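here's a minimal numpy sketch of the clipped surrogate over a batch of samples; the ratios and advantages are illustrative:

```python
import numpy as np

# a minimal sketch of the clipped surrogate over a batch of samples;
# the ratios and advantages are illustrative.
def clipped_surrogate(ratio, adv, eps=0.2):
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.minimum(unclipped, clipped).mean()  # pessimistic of the two

ratio = np.array([1.5, 0.6, 1.0])  # pi_theta / pi_theta_old per sample
adv = np.array([2.0, -1.0, 0.5])
objective = clipped_surrogate(ratio, adv)  # 0.7
```

note how the first sample's contribution is capped at 1.2 × 2.0 = 2.4 rather than 1.5 × 2.0 = 3.0: that's the trust region at work.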

ppo-penalty: an alternative approach

intuition: ppo-penalty uses a penalty term to enforce the trust region, instead of clipping. it’s like adding a leash to the dog’s training—freedom to move, but not too far.

why it’s needed: clipping can be conservative, ignoring some good updates. ppo-penalty offers flexibility but is harder to tune.

what we’re going to do: define the ppo-penalty objective and derive the kl divergence.

math:

Lpenalty(θ) = 𝔼t[ rt(θ) At − β · KL(πθold, πθ) ]

math breakdown:

- rt(θ) At: the same surrogate objective as before, without clipping.
- β: a coefficient controlling how strongly divergence is punished; in openai's version it's adapted based on the measured kl.
- the penalty subtracts from the objective whenever the new policy drifts from the old one.

kl divergence:

KL(πθold, πθ) = 𝔼πθold[ log( πθold(a|s) / πθ(a|s) ) ]

derivation intuition: kl divergence quantifies how much πθ diverges from πθold. a high penalty keeps policies close.

math breakdown:

- kl divergence is zero when the two policies match and grows as they diverge.
- it's asymmetric: this direction heavily penalizes πθ for putting little probability where πθold put a lot.
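here's a minimal sketch of the penalty objective for discrete action distributions; the distributions, ratios, advantages, and β are all illustrative:

```python
import numpy as np

# a minimal sketch of the ppo-penalty objective for discrete action
# distributions; the distributions, ratios, advantages, and beta are illustrative.
def kl(p_old, p_new):
    # KL(pi_old || pi_new), averaged over states in the batch
    return np.sum(p_old * np.log(p_old / p_new), axis=-1).mean()

def penalty_objective(ratio, adv, p_old, p_new, beta=0.01):
    return (ratio * adv).mean() - beta * kl(p_old, p_new)

p_old = np.array([[0.5, 0.5]])  # old policy over two actions, one state
p_new = np.array([[0.6, 0.4]])  # new policy, drifted a little
obj = penalty_objective(np.array([1.2]), np.array([1.0]), p_old, p_new)
```

the drift costs a small kl penalty, so the objective comes out slightly below the raw surrogate value of 1.2.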

trade-offs:

- ppo-penalty: more flexible, but β is hard to tune; too small and updates blow up, too large and learning stalls.
- ppo-clip: simpler, with a single hyperparameter ϵ, and empirically more robust, which is why it's the default "ppo."

ppo algorithm

intuition: ppo’s workflow is like a training loop: collect data, assess actions, update safely, repeat.

why it’s needed: we need a practical process to apply the math.

what we’re going to do: outline the steps (for ppo-clip).

  1. run πθ to collect experiences (states, actions, rewards).
  2. estimate advantages At (e.g., using gae).
  3. for a few epochs, optimize Lclip(θ) with gradient ascent (using adam).
  4. update πθold ← πθ, repeat.
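putting the pieces together, here's a toy end-to-end ppo-clip loop on a one-state, two-action bandit (action 0 always pays 1, action 1 pays 0). everything here (the environment, learning rate, batch size, epoch count) is illustrative:

```python
import numpy as np

# a toy end-to-end ppo-clip loop on a one-state, two-action bandit:
# action 0 always pays reward 1, action 1 pays 0. all numbers are illustrative.
rng = np.random.default_rng(0)
theta = np.zeros(2)  # softmax logits: the policy's parameters

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for iteration in range(50):
    # step 1: collect a batch of experience with the current (old) policy
    p_old = pi(theta)
    actions = rng.choice(2, size=64, p=p_old)
    rewards = (actions == 0).astype(float)
    # step 2: a crude advantage estimate (reward minus batch-average baseline)
    adv = rewards - rewards.mean()

    # step 3: a few epochs of gradient ascent on the clipped surrogate
    for _ in range(4):
        p = pi(theta)
        ratio = p[actions] / p_old[actions]
        clipped = np.clip(ratio, 0.8, 1.2)  # epsilon = 0.2
        # where the clipped term is the min, its gradient w.r.t. theta is zero
        active = (ratio * adv) <= (clipped * adv)
        grad = np.zeros(2)
        for a, r_t, adv_t, keep in zip(actions, ratio, adv, active):
            if keep:
                g = -p.copy()
                g[a] += 1.0  # grad of log pi(a) for a softmax: onehot(a) - pi
                grad += r_t * adv_t * g
        theta = theta + 0.05 * grad / len(actions)
    # step 4: theta_old <- theta happens implicitly at the top of the next loop
```

after training, the policy should strongly prefer the rewarding action. a real implementation would use a neural network policy, gae for advantages, and an optimizer like adam, but the loop structure is the same.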

intuition: it’s like practicing fetching, checking what worked, tweaking slightly, and trying again.

conclusion

we’ve tackled rl basics, ppo’s heavy math, and its variants. ppo-clip’s clipping and ppo-penalty’s kl penalty both ensure stable updates, making ppo a go-to for rl. the math is intense, but with intuition, it’s approachable.

if you’re new, take it slow—each section builds on the last. got questions? want more? let me know!
