Consider the Group Sequence Policy Optimization (GSPO) objective for reinforcement learning with large language models. For a query $x$, let $\{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$ denote $G$ sampled responses from an old policy, let $\hat{A}_i$ denote the advantage estimate of response $y_i$, and let $s_i(\theta) = \big(\pi_\theta(y_i \mid x)/\pi_{\theta_{\text{old}}}(y_i \mid x)\big)^{1/|y_i|}$ denote the sequence-level (length-normalized) importance ratio. The GSPO objective is defined as

$$J_{\text{GSPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\Big(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i\Big)\right],$$

and, in the gradient derived below, the clipping effect is captured by the indicator function $C_i(\theta)$:
$$C_i(\theta) = \mathbb{1}\big[\hat{A}_i > 0,\ s_i(\theta) \le 1+\varepsilon\big] + \mathbb{1}\big[\hat{A}_i < 0,\ s_i(\theta) \ge 1-\varepsilon\big].$$
This indicator is 1 if the update is not clipped and 0 otherwise.
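The per-sequence surrogate and the indicator can be sketched directly from these definitions (a minimal illustration; the function name and argument names are ours, not from any library):

```python
import numpy as np

def gspo_terms(s, adv, eps=0.2):
    """Per-sequence clipped surrogate and clipping indicator C_i.

    s   -- sequence-level importance ratios s_i(theta)
    adv -- advantage estimates A_hat_i
    """
    s = np.asarray(s, dtype=float)
    adv = np.asarray(adv, dtype=float)
    clipped = np.clip(s, 1.0 - eps, 1.0 + eps)
    # min(s_i * A_i, clip(s_i, 1-eps, 1+eps) * A_i)
    loss = np.minimum(s * adv, clipped * adv)
    # C_i = 1[A_i > 0, s_i <= 1+eps] + 1[A_i < 0, s_i >= 1-eps]
    C = ((adv > 0) & (s <= 1.0 + eps)) | ((adv < 0) & (s >= 1.0 - eps))
    return loss, C.astype(float)
```

For example, with eps = 0.2 a ratio of 1.5 with positive advantage is clipped (C = 0), while a ratio inside the trust region passes the gradient through (C = 1); a zero advantage yields C = 0, matching the convention above.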
Proof Sketch
The derivation of the gradient of the GSPO objective function proceeds through the following logical steps:
Interchange of Subgradient and Expectation: The GSPO objective $J_{\text{GSPO}}(\theta)$ is an expectation over data sampled from a distribution independent of the policy parameters $\theta$. The inner loss function is non-differentiable due to the min and clip operations. We use subgradient calculus and justify interchanging the subgradient and expectation operators by invoking the Dominated Convergence Theorem for subgradients.
Rigorous Justification: To rigorously apply the theorem, we first prove a key lemma bounding the subgradient of the scalar loss function.
Lemma (Subgradient Bound): Let $F(s) = \min\big(s\hat{A}_i,\ \operatorname{clip}(s, 1-\varepsilon, 1+\varepsilon)\,\hat{A}_i\big)$. For any $s \in \mathbb{R}$ and any subgradient element $k \in \partial_s F(s)$, the bound $|k| \le |\hat{A}_i|$ holds. We provide a full proof of this lemma via case analysis on the sign of $\hat{A}_i$.
Dominating Function: Using this lemma and the chain rule for subgradients, we construct a dominating function g(ω) for the norm of any element in the subgradient set of the full loss term.
Integrability Proof: We prove that g(ω) is integrable under a set of standard, formal assumptions, thus validating the interchange of subgradient and expectation.
Derivation of a Specific Subgradient: With the interchange justified, we compute a valid subgradient of the inner loss term, $L_i(\theta) = \min\big(s_i(\theta)\hat{A}_i,\ \operatorname{clip}(s_i(\theta), 1-\varepsilon, 1+\varepsilon)\,\hat{A}_i\big)$.
Complete Case Analysis: We perform a complete case analysis based on the sign of the advantage $\hat{A}_i$: $\hat{A}_i > 0$, $\hat{A}_i < 0$, and $\hat{A}_i = 0$.
Subgradient at Boundaries: At points of non-differentiability, we select a specific, valid element from the subgradient set (a one-sided derivative), which is a standard and theoretically sound choice for optimization algorithms.
Indicator Function: The outcome of the case analysis is concisely expressed using an indicator function $C_i(\theta)$, which is 1 when the gradient is passed through and 0 when it is clipped or when $\hat{A}_i = 0$.
Final Assembly: We derive the gradient of the importance ratio, $\nabla_\theta s_i(\theta)$, and substitute all components back into the main expression to obtain the final form of $\nabla_\theta J_{\text{GSPO}}(\theta)$.
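Assembled, these steps yield a gradient of the following form (a sketch, using the indicator $C_i(\theta)$ defined above; the sections below justify each step):

```latex
\nabla_\theta J_{\text{GSPO}}(\theta)
  = \mathbb{E}_{\omega \sim P}\!\left[
      \frac{1}{G} \sum_{i=1}^{G} \hat{A}_i \, C_i(\theta)\, \nabla_\theta s_i(\theta)
    \right]
```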
Let $\omega = (x, \{y_i\}_{i=1}^{G})$ denote a sample from the data-generating distribution $P(\omega)$, independent of $\theta$. Let $L(\theta;\omega) = \frac{1}{G}\sum_{i=1}^{G} L_i(\theta;\omega)$, where $L_i(\theta;\omega) = \min\big(s_i(\theta)\hat{A}_i,\ \operatorname{clip}(s_i(\theta), 1-\varepsilon, 1+\varepsilon)\,\hat{A}_i\big)$.
The objective is $J_{\text{GSPO}}(\theta) = \mathbb{E}_{\omega \sim P}[L(\theta;\omega)]$.
Justification for Interchanging Subgradient and Expectation
The function L(θ;ω) is non-differentiable due to the min and clip operations. We therefore work with subgradients. To compute the gradient of the objective we must justify interchanging the subgradient and expectation operators:
$$\partial_\theta\, \mathbb{E}[L(\theta;\omega)] = \mathbb{E}\big[\partial_\theta L(\theta;\omega)\big],$$

where $\partial_\theta L$ denotes the subgradient set. The Dominated Convergence Theorem for subgradients allows this interchange if two conditions hold:
For each sample $\omega$, the function $L(\theta;\omega)$ is locally Lipschitz in $\theta$.
There exists an integrable function $g(\omega)$ with $\mathbb{E}[g(\omega)] < \infty$ that dominates every subgradient: $\|\zeta\| \le g(\omega)$ for all $\zeta \in \partial_\theta L(\theta;\omega)$ and all $\theta$.
Condition 1: Local Lipschitz Continuity
Under standard assumptions of policy smoothness, $s_i(\theta)$ is a composition of differentiable functions, hence locally Lipschitz. The clip and min functions are globally Lipschitz (constant 1). Their composition is therefore locally Lipschitz, so $L(\theta;\omega)$ satisfies Condition 1.
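The global 1-Lipschitz property of clip can be probed numerically (an illustrative random check, not part of the argument):

```python
import numpy as np

# clip is the Euclidean projection onto [lo, hi], hence nonexpansive:
# |clip(a) - clip(b)| <= |a - b| for all a, b (globally 1-Lipschitz).
rng = np.random.default_rng(1)
a, b = rng.normal(size=(2, 100_000))
lo, hi = 0.8, 1.2  # i.e. 1 - eps and 1 + eps with eps = 0.2
assert np.all(np.abs(np.clip(a, lo, hi) - np.clip(b, lo, hi))
              <= np.abs(a - b) + 1e-12)
```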
Condition 2: Dominating Integrable Function
We first establish a key lemma:
Lemma: Bounding the Subgradient of the Scalar Loss
Let $F(s) = \min\big(s\hat{A}_i,\ \operatorname{clip}(s, 1-\varepsilon, 1+\varepsilon)\,\hat{A}_i\big)$.
For any $s \in \mathbb{R}$ and any $k \in \partial_s F(s)$ we have $|k| \le |\hat{A}_i|$.
Proof (sketch). We split on the sign of $\hat{A}_i$.
Case 1: $\hat{A}_i > 0$. Then $F(s) = \hat{A}_i \min\big(s,\ \operatorname{clip}(s, 1-\varepsilon, 1+\varepsilon)\big)$.
Write $G(s) = \min\big(s,\ \operatorname{clip}(s, 1-\varepsilon, 1+\varepsilon)\big)$.
Then $\partial_s F(s) = \hat{A}_i\, \partial_s G(s)$ and $\partial_s G(s) \subseteq [0,1]$, so $|k| \le \hat{A}_i$.
Case 2: $\hat{A}_i < 0$. Then $F(s) = \hat{A}_i \max\big(s,\ \operatorname{clip}(s, 1-\varepsilon, 1+\varepsilon)\big)$.
A symmetric argument gives $|k| \le |\hat{A}_i|$.
Case 3: $\hat{A}_i = 0$. Then $F(s) \equiv 0$, so $k = 0$.
□
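As a quick numerical sanity check of the lemma (illustrative only, not part of the proof), central finite differences confirm the bound $|F'(s)| \le |\hat{A}_i|$ across advantage signs:

```python
import numpy as np

def F(s, A, eps=0.2):
    """Scalar loss F(s) = min(s*A, clip(s, 1-eps, 1+eps)*A)."""
    return np.minimum(s * A, np.clip(s, 1.0 - eps, 1.0 + eps) * A)

# Central differences approximate an element of dF/ds; the lemma predicts
# |F'(s)| <= |A| wherever the derivative exists, and the bound also holds
# for averaged one-sided slopes near the kinks at s = 1 +/- eps.
rng = np.random.default_rng(0)
h = 1e-6
for A in (1.7, -0.9, 0.0):
    s = rng.uniform(0.0, 2.0, size=10_000)
    slope = (F(s + h, A) - F(s - h, A)) / (2.0 * h)
    assert np.all(np.abs(slope) <= abs(A) + 1e-4)
```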
Using the lemma and standard bounds (e.g. $\|\nabla_\theta \log \pi_\theta\| \le G_{\max}$ and $|\hat{A}_i| \le A_{\max}$) we construct

$$g(\omega) = \frac{A_{\max}\, G_{\max}}{G} \sum_{i=1}^{G} \sup_{\theta' \in \Theta} s_i(\theta'),$$

which dominates $\|\partial_\theta L(\theta;\omega)\|$ and is integrable under the usual finite-importance-ratio assumption, establishing Condition 2.
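The domination step can be spelled out term by term (a sketch, assuming the chain-rule bound $\|\nabla_\theta s_i(\theta)\| \le G_{\max}\, s_i(\theta)$, which follows from the log-gradient bound since $\nabla_\theta s_i = s_i \nabla_\theta \log s_i$):

```latex
\zeta \in \partial_\theta L_i(\theta;\omega)
  \;\Longrightarrow\;
  \zeta = k\, \nabla_\theta s_i(\theta), \quad k \in \partial_s F\big(s_i(\theta)\big),
\qquad
\|\zeta\| \;\le\; |\hat{A}_i|\, \|\nabla_\theta s_i(\theta)\|
         \;\le\; A_{\max}\, G_{\max} \sup_{\theta' \in \Theta} s_i(\theta').
```

Averaging over the $G$ responses recovers $g(\omega)$.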
Derivation of a One-Sample Subgradient
We compute a valid selection from $\partial_\theta L_i(\theta)$ by a case analysis on $\hat{A}_i$.
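As a preview of what such a selection looks like for the scalar part of the loss, here is a minimal sketch (an illustrative function of ours; the one-sided choices at $s = 1 \pm \varepsilon$ are the ones encoded by the indicator $C_i$):

```python
def subgradient_wrt_s(s, A, eps=0.2):
    """A valid selection k in dF/ds for F(s) = min(s*A, clip(s, 1-eps, 1+eps)*A).

    At the kinks s = 1 +/- eps we pick the one-sided derivative that keeps
    the gradient "on", matching the <= and >= conventions of C_i.
    """
    if A > 0:
        return A if s <= 1.0 + eps else 0.0  # clipped above once s > 1+eps
    if A < 0:
        return A if s >= 1.0 - eps else 0.0  # clipped below once s < 1-eps
    return 0.0                               # A == 0: F is identically zero
```

By the chain rule, the corresponding parameter-space selection is $k\,\nabla_\theta s_i(\theta) = \hat{A}_i\, C_i(\theta)\, \nabla_\theta s_i(\theta)$.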