
GSPO Gradient Derivation

Introduction

Consider the Group Sequence Policy Optimization (GSPO) objective for reinforcement learning with large language models. For a query $x$, let $\{ y_i \}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$ denote $G$ sampled responses from an old policy. The GSPO objective is defined as

$$
J_{\text{GSPO}}(\theta) = \mathbb{E}_{x \sim D,\, \{y_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^G \min\left( s_i(\theta) \hat{A}_i,\ \operatorname{clip}(s_i(\theta), 1 - \varepsilon, 1 + \varepsilon) \hat{A}_i \right) \right],
$$

where the importance ratio $s_i(\theta)$ is computed at the sequence level using

$$
s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1 / |y_i|} = \exp\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})} \right),
$$

and the normalized advantage is defined as

$$
\hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}(\{ r(x, y_j) \}_{j=1}^G)}{\operatorname{std}(\{ r(x, y_j) \}_{j=1}^G)}.
$$
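
As a concrete illustration (not part of the original derivation), a minimal PyTorch sketch of these two quantities might look as follows; the names, shapes, and the choice of standard-deviation convention are my own assumptions.

```python
import torch

def sequence_ratios_and_advantages(logp_new, logp_old, rewards):
    """Compute s_i(theta) and A_hat_i for one query's group of G responses.

    logp_new[i]: 1-D tensor of per-token log pi_theta(y_{i,t} | x, y_{i,<t}) (requires grad)
    logp_old[i]: 1-D tensor of per-token log pi_theta_old(...), treated as constants
    rewards:     1-D tensor of length G with r(x, y_i)
    """
    # s_i = exp( (1/|y_i|) * sum_t [log pi_theta - log pi_theta_old] )
    s = torch.stack([(ln - lo.detach()).mean().exp()
                     for ln, lo in zip(logp_new, logp_old)])
    # A_hat_i = (r_i - mean(r)) / std(r); biased vs. unbiased std is an implementation detail
    adv = (rewards - rewards.mean()) / rewards.std()
    return s, adv
```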

The final expression for a valid subgradient of the GSPO objective is:

$$
\nabla_\theta J_{\text{GSPO}}(\theta) = \mathbb{E}_{x \sim D,\, \{y_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^G \nabla_\theta L_i(\theta) \right],
$$

where $\nabla_\theta L_i(\theta)$ is a valid subgradient of the loss for a single response, given by

$$
\nabla_\theta L_i(\theta) = C_i(\theta) \cdot \hat{A}_i \cdot s_i(\theta) \cdot \left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \right),
$$

and $C_i(\theta)$ is an indicator function that captures the clipping effect:

$$
C_i(\theta) = \mathbf{1}_{\hat{A}_i > 0,\ s_i(\theta) \le 1+\varepsilon} + \mathbf{1}_{\hat{A}_i < 0,\ s_i(\theta) \ge 1-\varepsilon}.
$$

This indicator is $1$ if the update is not clipped and $0$ otherwise.
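
To make the clipping indicator concrete, here is a small illustrative helper (my own, not from the post) that evaluates the scalar weight $C_i(\theta)\,\hat{A}_i\,s_i(\theta)$ multiplying the averaged score function in $\nabla_\theta L_i(\theta)$:

```python
def gspo_grad_coefficient(s_i: float, adv_i: float, eps: float = 0.2) -> float:
    """Scalar factor C_i * A_hat_i * s_i in the per-response subgradient."""
    if adv_i > 0:
        c = 1.0 if s_i <= 1.0 + eps else 0.0   # clipped from above when s_i > 1 + eps
    elif adv_i < 0:
        c = 1.0 if s_i >= 1.0 - eps else 0.0   # clipped from below when s_i < 1 - eps
    else:
        c = 0.0                                # zero advantage: no update
    return c * adv_i * s_i
```

Multiplying this scalar by the averaged score function $\tfrac{1}{|y_i|}\sum_t \nabla_\theta \log \pi_\theta(y_{i,t}\mid x, y_{i,<t})$ recovers the stated $\nabla_\theta L_i(\theta)$.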

Proof Sketch

The derivation of the gradient of the GSPO objective function proceeds through the following logical steps:

  1. Interchange of Subgradient and Expectation: The GSPO objective $J_{\text{GSPO}}(\theta)$ is an expectation over data sampled from a distribution independent of the policy parameters $\theta$. The inner loss function is non-differentiable due to the min and clip operations. We use subgradient calculus and justify interchanging the subgradient and expectation operators by invoking the Dominated Convergence Theorem for subgradients.

  2. Rigorous Justification: To rigorously apply the theorem, we first prove a key lemma bounding the subgradient of the scalar loss function.

    • Lemma (Subgradient Bound): Let $F(s) = \min\bigl(s \hat{A}_i,\ \operatorname{clip}(s, 1-\varepsilon, 1+\varepsilon) \hat{A}_i\bigr)$. For any $s \in \mathbb{R}$ and any subgradient element $k \in \partial_s F(s)$, the bound $|k| \le |\hat{A}_i|$ holds. We provide a full proof of this lemma via case analysis on the sign of $\hat{A}_i$.
    • Dominating Function: Using this lemma and the chain rule for subgradients, we construct a dominating function $g(\omega)$ for the norm of any element in the subgradient set of the full loss term.
    • Integrability Proof: We prove that $g(\omega)$ is integrable under a set of standard, formal assumptions, thus validating the interchange of subgradient and expectation.
  3. Derivation of a Specific Subgradient: With the interchange justified, we compute a valid subgradient of the inner loss term, $L_i(\theta) = \min\bigl(s_i(\theta)\hat{A}_i,\ \operatorname{clip}(s_i(\theta),1-\varepsilon,1+\varepsilon)\hat{A}_i\bigr)$.

    • Complete Case Analysis: We perform a complete case analysis based on the sign of the advantage $\hat{A}_i$: the cases $\hat{A}_i > 0$, $\hat{A}_i < 0$, and $\hat{A}_i = 0$.
    • Subgradient at Boundaries: At points of non-differentiability, we select a specific, valid element from the subgradient set (a one-sided derivative), which is a standard and theoretically sound choice for optimization algorithms.
    • Indicator Function: The outcome of the case analysis is concisely expressed using an indicator function $C_i(\theta)$, which is $1$ when the gradient is passed through and $0$ when it is clipped or when $\hat{A}_i = 0$.
  4. Final Assembly: We derive the gradient of the importance ratio, $\nabla_\theta s_i(\theta)$, and substitute all components back into the main expression to obtain the final form of $\nabla_\theta J_{\text{GSPO}}(\theta)$; a numerical sanity check of this ratio gradient is sketched just after this list.
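
The sketch below is a small numerical sanity check of step 4 (my own, using a toy categorical "policy" with no conditioning context): autograd applied to $s_i(\theta)=\exp\bigl(\tfrac{1}{|y_i|}\sum_t\log\tfrac{\pi_\theta}{\pi_{\theta_{\text{old}}}}\bigr)$ should reproduce the closed form $s_i(\theta)\,\tfrac{1}{|y_i|}\sum_t\nabla_\theta\log\pi_\theta$.

```python
import torch

torch.manual_seed(0)
V, T = 5, 4                                        # toy vocab size and response length
theta = torch.randn(V, requires_grad=True)         # stand-in policy parameters
theta_old = (theta + 0.1 * torch.randn(V)).detach()
tokens = torch.randint(V, (T,))                    # a fixed "response" y_i

# s_i(theta) = exp( (1/T) * sum_t [log pi_theta(y_t) - log pi_theta_old(y_t)] )
logp = torch.log_softmax(theta, dim=0)[tokens]
logp_old = torch.log_softmax(theta_old, dim=0)[tokens]
s = (logp - logp_old).mean().exp()

grad_autograd, = torch.autograd.grad(s, theta, retain_graph=True)

# Closed form: s_i * (1/T) * sum_t grad_theta log pi_theta(y_t)
score, = torch.autograd.grad(logp.mean(), theta)
grad_closed_form = s.detach() * score

print(torch.allclose(grad_autograd, grad_closed_form, atol=1e-6))  # expected: True
```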

Detailed Proof

The GSPO objective function is defined as

$$
J_{\text{GSPO}}(\theta)= \mathbb{E}_{x \sim D,\; \{y_i\}\sim\pi_{\theta_{\text{old}}}} \Bigl[ \tfrac{1}{G}\sum_{i=1}^G \min\bigl( s_i(\theta)\hat{A}_i,\ \operatorname{clip}(s_i(\theta),1-\varepsilon,1+\varepsilon)\hat{A}_i \bigr) \Bigr].
$$

Let $\omega=(x,\{y_i\}_{i=1}^G)$ denote a sample from the data-generating distribution $P(\omega)$, independent of $\theta$. Let $L(\theta;\omega)=\tfrac{1}{G}\sum_{i=1}^G L_i(\theta;\omega)$, where $L_i(\theta;\omega)=\min\bigl(s_i(\theta)\hat{A}_i,\ \operatorname{clip}(s_i(\theta),1-\varepsilon,1+\varepsilon)\hat{A}_i\bigr)$. The objective is $J_{\text{GSPO}}(\theta)=\mathbb{E}_{\omega\sim P}[L(\theta;\omega)]$.

Justification for Interchanging Subgradient and Expectation

The function $L(\theta;\omega)$ is non-differentiable due to the min and clip operations, so we work with subgradients. To compute the gradient of the objective we must justify interchanging the subgradient and expectation operators:

$$
\partial_\theta\,\mathbb{E}[L(\theta;\omega)] \;=\; \mathbb{E}\bigl[\partial_\theta L(\theta;\omega)\bigr],
$$

where $\partial_\theta L$ denotes the subgradient set. The Dominated Convergence Theorem for subgradients allows this interchange if two conditions hold:

  1. For each sample $\omega$, the function $L(\theta;\omega)$ is locally Lipschitz in $\theta$.
  2. There exists an integrable function $g(\omega)$ with $\mathbb{E}[g(\omega)]<\infty$ that dominates every subgradient: $\|\zeta\|\le g(\omega)$ for all $\zeta\in\partial_\theta L(\theta;\omega)$ and all $\theta$.

Condition 1: Local Lipschitz Continuity

Under standard assumptions of policy smoothness, $s_i(\theta)$ is a composition of differentiable functions and hence locally Lipschitz. The clip and min functions are globally Lipschitz (with constant $1$). Their composition is therefore locally Lipschitz, so $L(\theta;\omega)$ satisfies Condition 1.

Condition 2: Dominating Integrable Function

We first establish a key lemma:

Lemma: Bounding the Subgradient of the Scalar Loss

Let $F(s)=\min\bigl(s\hat{A}_i,\ \operatorname{clip}(s,1-\varepsilon,1+\varepsilon)\hat{A}_i\bigr)$. For any $s\in\mathbb{R}$ and any $k\in\partial_s F(s)$ we have $|k|\le|\hat{A}_i|$.

Proof (sketch). We split on the sign of $\hat{A}_i$.

Case 1: $\hat{A}_i>0$. Then $F(s)=\hat{A}_i\min\bigl(s,\ \operatorname{clip}(s,1-\varepsilon,1+\varepsilon)\bigr)$. Write $G(s)=\min\bigl(s,\ \operatorname{clip}(s,1-\varepsilon,1+\varepsilon)\bigr)$. Then $\partial_s F(s)=\hat{A}_i\,\partial_s G(s)$ and $\partial_s G(s)\subseteq[0,1]$, so $|k|\le\hat{A}_i$.

Case 2: $\hat{A}_i<0$. Then $F(s)=\hat{A}_i\max\bigl(s,\ \operatorname{clip}(s,1-\varepsilon,1+\varepsilon)\bigr)$, and a symmetric argument gives $|k|\le|\hat{A}_i|$.

Case 3: $\hat{A}_i=0$. Then $F(s)\equiv 0$, so $k=0$. $\square$
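
As an informal numerical companion to the lemma (not part of the proof), the following sketch checks that one-sided difference quotients of $F$, which bracket every subgradient element of this piecewise-linear function, never exceed $|\hat{A}_i|$ on a grid:

```python
import numpy as np

def F(s, adv, eps=0.2):
    return min(s * adv, np.clip(s, 1.0 - eps, 1.0 + eps) * adv)

h = 1e-6
for adv in (1.7, -0.9, 0.0):                      # covers all three cases of the lemma
    for s in np.linspace(0.0, 2.5, 501):
        right = (F(s + h, adv) - F(s, adv)) / h   # one-sided slopes; F is piecewise
        left = (F(s, adv) - F(s - h, adv)) / h    # linear, so these bracket its subgradients
        assert max(abs(left), abs(right)) <= abs(adv) + 1e-6
print("subgradient bound |k| <= |A_hat| holds on the grid")
```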

Using the lemma and standard bounds (e.g. $\|\nabla_\theta\log\pi_\theta\|\le G_{\max}$ and $|\hat{A}_i|\le A_{\max}$) we construct

$$
g(\omega)\;=\;\frac{A_{\max}G_{\max}}{G}\sum_{i=1}^G\,\sup_{\theta'\in\Theta}s_i(\theta'),
$$

which dominates the norm of every element of $\partial_\theta L(\theta;\omega)$ and is integrable under the usual finite-importance-ratio assumption, establishing Condition 2.

Derivation of a One-Sample Subgradient

We compute a valid selection from $\partial_\theta L_i(\theta)$ by a case analysis on $\hat{A}_i$.

Case $\hat{A}_i>0$:

$$
\nabla_\theta L_i(\theta)=
\begin{cases}
\hat{A}_i\,\nabla_\theta s_i(\theta), & s_i(\theta)\le 1+\varepsilon,\\[4pt]
\mathbf{0}, & s_i(\theta)>1+\varepsilon.
\end{cases}
$$

Case $\hat{A}_i<0$:

$$
\nabla_\theta L_i(\theta)=
\begin{cases}
\mathbf{0}, & s_i(\theta)<1-\varepsilon,\\[4pt]
\hat{A}_i\,\nabla_\theta s_i(\theta), & s_i(\theta)\ge 1-\varepsilon.
\end{cases}
$$

Case $\hat{A}_i=0$: the gradient is zero.

These three branches combine into

$$
\nabla_\theta L_i(\theta)= C_i(\theta)\,\hat{A}_i\,\nabla_\theta s_i(\theta),
$$

with

$$
C_i(\theta)= \mathbf{1}\bigl[\hat{A}_i>0 \,\wedge\, s_i(\theta)\le 1+\varepsilon\bigr] + \mathbf{1}\bigl[\hat{A}_i<0 \,\wedge\, s_i(\theta)\ge 1-\varepsilon\bigr].
$$
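
In an implementation one rarely codes this case split by hand: writing $L_i$ directly with min and clamp and letting reverse-mode autodiff differentiate it yields the same selection, up to the arbitrary choice of subgradient exactly at the clipping boundaries. A minimal PyTorch sketch, with my own naming:

```python
import torch

def gspo_surrogate(s_i: torch.Tensor, adv_i: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """L_i(theta) = min( s_i * A_hat_i, clip(s_i, 1-eps, 1+eps) * A_hat_i ).

    `s_i` must carry gradients back to theta; `adv_i` is a detached scalar.
    Backpropagating through this expression produces C_i * A_hat_i * grad s_i.
    """
    clipped = torch.clamp(s_i, 1.0 - eps, 1.0 + eps)
    return torch.minimum(s_i * adv_i, clipped * adv_i)
```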

Gradient of the Importance Ratio

Since $s_i(\theta)=\exp\bigl(\tfrac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log\tfrac{\pi_\theta(y_{i,t}\mid x,y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}\mid x,y_{i,<t})}\bigr)$ and the old-policy log-probabilities do not depend on $\theta$, the chain rule gives

$$
\nabla_\theta s_i(\theta)= s_i(\theta)\,\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\nabla_\theta\log\pi_\theta(y_{i,t}\mid x,y_{i,<t}).
$$

Final Gradient

Substituting $\nabla_\theta s_i(\theta)$ into $C_i(\theta)\,\hat{A}_i\,\nabla_\theta s_i(\theta)$ and averaging over $i$ gives

$$
\nabla_\theta J_{\text{GSPO}}(\theta)= \mathbb{E}_{x,\{y_i\}} \Bigl[ \tfrac{1}{G}\sum_{i=1}^G C_i(\theta)\,\hat{A}_i\,s_i(\theta)\, \Bigl(\tfrac{1}{|y_i|}\sum_{t=1}^{|y_i|}\nabla_\theta\log\pi_\theta(y_{i,t}\mid x,y_{i,<t})\Bigr) \Bigr].
$$
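
For completeness, here is a minimal sketch of how the whole objective might be assembled in PyTorch for one query's group of responses, assuming per-token log-probabilities are already available; all names are illustrative, and the sign is flipped so that a standard descent optimizer maximizes $J_{\text{GSPO}}$.

```python
import torch

def gspo_group_loss(logp_new, logp_old, rewards, eps=0.2):
    """Negative GSPO objective for one query x and its G sampled responses.

    logp_new[i]: 1-D tensor of log pi_theta over y_i's tokens (requires grad)
    logp_old[i]: 1-D tensor of log pi_theta_old over the same tokens (no grad)
    rewards:     tensor of shape (G,) with r(x, y_i)
    """
    adv = (rewards - rewards.mean()) / rewards.std()           # A_hat_i
    per_response = []
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        s = (lp_new - lp_old.detach()).mean().exp()            # sequence-level ratio s_i
        clipped = torch.clamp(s, 1.0 - eps, 1.0 + eps)
        per_response.append(torch.minimum(s * a, clipped * a)) # L_i(theta)
    return -torch.stack(per_response).mean()                   # minimize the negation

# Backpropagating this loss gives a stochastic estimate of the negated gradient above:
# gspo_group_loss(...).backward(); optimizer.step()
```

Because the clipped branches contribute zero gradient, the indicator $C_i(\theta)$ is realized implicitly by autograd rather than coded explicitly.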
