why reinforcement learning is weirder than you think
so i've been knee deep in RL stuff for a while now and i keep running into people who kind of half understand it. like they know the buzzwords: reward, policy, agent but then things get murky real fast once you move past the gridworld examples. so let me just brain dump what actually matters here.
the core loop (that nobody explains properly)
the basic RL setup is this: you have an agent living inside an environment. at every timestep, the agent looks at the current state s_t, takes an action a_t, the environment spits back a new state s_{t+1} and a reward r_t, and this repeats forever (or until some terminal condition).
the goal is to learn a policy π(a | s), a probability distribution over actions given a state that maximises cumulative future reward. specifically, we usually care about the expected discounted return:
G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
that γ (gamma) is the discount factor, somewhere between 0 and 1. it controls how myopic your agent is. γ = 0 means it only cares about immediate reward. γ → 1 means it cares almost as much about a reward 1000 steps from now as about the next one. in practice you want something like 0.99: patient, but not infinitely so.
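the return formula is just a backward recursion in code. here's a minimal sketch (the function name and example rewards are mine, not from any library):

```python
def discounted_return(rewards, gamma=0.99):
    """G_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ..., computed right-to-left."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # fold in one more step of lookahead
    return g

# with gamma = 0.5, three rewards of 1.0 give 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

the backward pass avoids recomputing powers of γ and is how most implementations do it.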
this seems clean. it is not.
the credit assignment problem is genuinely hard
here's the thing that breaks everyone's brain initially. suppose your agent plays a game of chess and wins. the reward comes at the very end: +1. but which of the 60 moves it made actually caused the win? was it move 3 that set up a tactical advantage? move 47 where it captured a critical piece? move 59 where it avoided a blunder?
you have no idea. this is the credit assignment problem, and it's one of the oldest open problems in RL. everything we do algorithmically is basically a heuristic attempt to solve this.
the simplest approach: just use the total return G_t from step t onward as the signal for every action taken at step t. this is the Monte Carlo policy gradient: it's unbiased, but it has horrific variance because G_t is a sum of many random variables.
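computing G_t for every timestep at once is the same backward recursion, carried through the whole episode. a minimal sketch (names are mine):

```python
def returns_to_go(rewards, gamma=1.0):
    """G_t for every t in one backward pass over the episode."""
    out = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

# sparse chess-style reward: +1 only at the end. with gamma = 1,
# every move gets the same credit -- that's exactly the problem.
print(returns_to_go([0.0, 0.0, 1.0], gamma=1.0))  # [1.0, 1.0, 1.0]
```

notice how the undiscounted version assigns identical credit to every action in a winning game; discounting shifts more credit to actions near the reward, which is a heuristic, not a solution.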
value functions are load-bearing
to actually make progress you need to think in terms of value functions. the state-value function:
V^π(s) = E_π[G_t | s_t = s]
is the expected return you'll get starting from state s if you follow policy π forever. the action-value function (Q-function):
Q^π(s, a) = E_π[G_t | s_t = s, a_t = a]
is the same but you also fix the first action. the reason Q-functions are useful is that once you have a good Q, you can derive a greedy policy directly: π(s) = argmax_a Q(s, a). no distribution to sample from, just take the best action.
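with a tabular Q, "derive the greedy policy" is literally one argmax. a toy sketch (the Q-table values here are made up for illustration):

```python
import numpy as np

# hypothetical Q-table: 3 states x 2 actions
Q = np.array([[0.1, 0.9],
              [0.5, 0.2],
              [0.0, 0.0]])

def greedy_policy(Q, s):
    """pi(s) = argmax_a Q(s, a) -- no sampling, just the best-looking action."""
    return int(np.argmax(Q[s]))

print(greedy_policy(Q, 0))  # 1
print(greedy_policy(Q, 1))  # 0
```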
the key identity connecting these is the Bellman equation:
Q^π(s, a) = E[r + γ · V^π(s')]
this is recursive. the value of a state-action pair equals the immediate reward plus the discounted value of wherever you end up. this gives you a way to bootstrap, use your current value estimates to update themselves — which is the foundation of everything in TD learning.
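the bootstrapping idea drops straight out of the Bellman equation. here's a minimal tabular Q-learning update, a sketch of the general pattern rather than any particular library's API:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One TD update: nudge Q[s, a] toward r + gamma * max_a' Q[s', a']."""
    # the target bootstraps off our *current* estimate of the next state's value
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])  # move a fraction alpha toward it
    return Q

Q = np.zeros((2, 2))
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1, done=True, alpha=0.5)
print(Q[0, 0])  # 0.5
```

the target uses the estimate itself, which is what "your value estimates update themselves" means in practice, and also why TD methods can be unstable.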
on-policy vs off-policy (this distinction actually matters)
one thing that trips people up: on-policy methods (SARSA, PPO, A3C) only learn from data generated by the current policy. you take a step, you learn from it, you update your policy, and now that data is stale. this is sample-inefficient because you throw data away constantly.
off-policy methods (Q-learning, DQN, SAC) can learn from data generated by any policy, including a completely random one or a policy you ran last week. you store everything in a replay buffer and sample random batches to train on. this is way more sample efficient.
the catch: off-policy methods need to handle the fact that the data distribution doesn't match your current policy. that's manageable mathematically (importance weighting handles it in general, and Q-learning's max operator sidesteps it), but in practice it introduces instability. DQN famously dealt with this by using a target network: a lagged copy of your Q-network that you freeze for a fixed number of steps, so your learning targets don't chase themselves in circles.
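the replay buffer itself is almost embarrassingly simple, just a ring buffer you sample uniformly from. a sketch (class and method names are mine):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; old data falls off the back."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)

    def push(self, transition):
        # transition is a (s, a, r, s_next, done) tuple
        self.buf.append(transition)

    def sample(self, batch_size):
        # uniform random batch -- decorrelates consecutive timesteps
        return random.sample(self.buf, batch_size)

    def __len__(self):
        return len(self.buf)
```

the target network is the other half of the trick: periodically copy your Q-network's weights into a frozen clone and compute TD targets against the clone, not the live network.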
policy gradient methods: directly optimizing what you care about
the value-based approach (learn Q, derive policy) is great for discrete action spaces. but what if your action space is continuous? you can't take argmax over all possible torques to apply to a robot joint.
enter policy gradient methods. instead of learning a value function and deriving the policy, you directly parameterize the policy as a neural network π_θ(a | s) and take gradient steps on expected return:
∇_θ J(θ) = E_π[∇_θ log π_θ(a|s) · G_t]
this is the REINFORCE estimator. it's beautiful and it's very high variance. you'd typically subtract a baseline (often the value function V(s)) to reduce variance without introducing bias:
∇_θ J(θ) = E_π[∇_θ log π_θ(a|s) · (G_t - V(s))]
G_t - V(s) is called the advantage — how much better was this action than what you'd expect on average from this state. this is the backbone of actor-critic methods (A2C, A3C, PPO).
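the advantage computation is one subtraction, but it's worth seeing why it helps: it recenters the learning signal around zero. a sketch with made-up numbers:

```python
import numpy as np

def advantages(returns, values):
    """A_t = G_t - V(s_t): how much better than expected each action was."""
    return np.asarray(returns) - np.asarray(values)

# two actions from states where the baseline expects a return of 1.5:
# the first beat expectations, the second fell short
print(advantages([2.0, 1.0], [1.5, 1.5]))  # [ 0.5 -0.5]
```

without the baseline, both actions would get positive gradient weight (both returns are positive) and you'd only learn the difference through variance-laden averaging. with it, better-than-average actions are pushed up and worse ones pushed down directly.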
PPO (Proximal Policy Optimization) in particular became the default workhorse because it adds a clipping trick that prevents your policy update from being too aggressive:
L_CLIP = E[min(r_t(θ) · A_t, clip(r_t(θ), 1-ε, 1+ε) · A_t)]
where r_t(θ) = π_θ(a|s) / π_θ_old(a|s) is the probability ratio between new and old policy. if the ratio wanders too far from 1.0, the clip kicks in. crude but very effective in practice.
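the clipped objective is simple enough to write out in a few lines of numpy. a sketch of the per-sample surrogate (not a full PPO implementation):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    # the min means clipping only ever *removes* incentive, never adds it
    return np.minimum(unclipped, clipped)

# ratio 2.0 with positive advantage: gradient incentive capped at 1.2 * A
print(ppo_clip_objective(np.array([2.0]), np.array([1.0])))
```

the asymmetry is the point: if the new policy already moved far in the direction the advantage suggests, the objective goes flat and the gradient vanishes, so one batch can't drag the policy arbitrarily far from the old one.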
the exploration problem is unsolved
i'd be lying if i ended here without mentioning exploration. none of this works if your agent doesn't visit enough of the state space. the greedy policy will exploit whatever it's learned, which might be a local optimum it stumbled into early on.
the classic fix is ε-greedy: with probability ε, take a random action instead of the greedy one. works okay in small discrete settings. terrible in anything sparse or high-dimensional.
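ε-greedy is a two-liner, which is a big part of why it's the default. a sketch:

```python
import random

def epsilon_greedy(q_values, eps=0.1, rng=random):
    """With probability eps explore uniformly; otherwise exploit the argmax."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))  # explore: any action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

the failure mode is visible right in the code: exploration is uniform and memoryless, so in a sparse-reward environment the random actions almost never string together into anything that reaches reward.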
modern approaches are weirder. intrinsic motivation gives the agent a bonus reward for visiting novel states. you can measure novelty as prediction error against a random network (RND), as information gain, or as visit counts in some compressed embedding space. the agent becomes curious by design.
curiosity-driven exploration (Pathak et al., 2017) learns a dynamics model in feature space and rewards the agent for transitions the model can't predict. it's elegant and works really well on games like Montezuma's Revenge that famously defeat ε-greedy agents.
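the RND-flavored bonus is easy to caricature in numpy: a fixed random network as the target, a trained predictor, and the squared error between them as the bonus. a toy sketch under those assumptions (linear "networks", all names mine):

```python
import numpy as np

rng = np.random.default_rng(0)
W_target = rng.normal(size=(4, 8))  # fixed random network, never trained
W_pred = np.zeros((4, 8))           # predictor, trained to match the target

def intrinsic_bonus(state):
    """Prediction error vs. the frozen random net: high on unfamiliar states."""
    target = state @ W_target
    pred = state @ W_pred
    return float(np.mean((target - pred) ** 2))

# an unvisited state yields a large bonus; as W_pred is trained on visited
# states, their bonus decays toward zero and novelty-seeking moves elsewhere
print(intrinsic_bonus(np.ones(4)) > 0.0)  # True
```

the frozen target is the clever bit: because it's deterministic, the predictor can actually drive its error to zero on familiar states, so the bonus is novelty rather than irreducible environment noise.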
still: in truly large state spaces, exploration remains one of the hard open problems. we don't have a general solution.
where this goes from here
if you want to go deeper: read Sutton & Barto (freely available online, seriously just go read it), then look at the Spinning Up in Deep RL repo from OpenAI for clean implementations of PPO, SAC, and DDPG. and once you've broken your first environment, look into proper logging with Weights & Biases because debugging RL without good metrics is basically impossible.
the field has gotten quietly enormous. RL is underneath AlphaGo and AlphaZero's self-play, it's behind a lot of modern LLM fine tuning (RLHF), it's in industrial robot control, chip design (Google's AlphaChip), compiler optimization. the gridworld stuff is just how you learn to walk. the real applications are somewhere between impressive and alarming.
anyway. that's the download.