Motivated by recent advances in reinforcement learning and direct data-driven control, we propose policy gradient adaptive control (PGAC) for the linear quadratic regulator (LQR), which uses online closed-loop data to improve the control policy while maintaining stability. Our method adaptively updates the policy in feedback by descending the gradient of the LQR cost and is categorized as indirect, when gradients are computed via an estimated model, or direct, when gradients are derived from data using a sample covariance parameterization. Beyond the vanilla gradient, we also showcase the merits of the natural gradient and Gauss-Newton methods for the policy update. Notably, natural gradient descent bridges indirect and direct PGAC, and the Gauss-Newton method for indirect PGAC leads to an adaptive version of Hewer's celebrated algorithm. To account for the uncertainty arising from noise, we propose a regularization method for both indirect and direct PGAC. For all the considered PGAC approaches, we show closed-loop stability and convergence of the policy to the optimal LQR gain. Simulations validate our theoretical findings and demonstrate the robustness and computational efficiency of PGAC.
Problem 1
Design a gradient-based method that leverages online closed-loop data such that the control policy converges to the optimal LQR gain, while ensuring the stability of the closed-loop system.
Assumption 1
The covariance of the process noise \( w_t\) is upper-bounded, i.e., \( \|W_{0,t}W_{0,t}^{\top}/t\| \leq \delta_t^2\) for some \( \delta_t \geq 0\) .
Assumption 2
The control input is persistently exciting such that \( \underline{\sigma}(\Phi_t) \geq \gamma_t^2 \) for some \( \gamma_t > 0\) .
Lemma 1
Consider the identification problem (9) with data \( (X_{0,t}, U_{0,t}, X_{1,t})\) and denote the model estimate as \( (\hat{A}_t, \hat{B}_t)\) . Then, it holds that
Remark 1
We discuss the information \( \text{SNR}_t\) under different noise and excitation settings that are widely adopted in adaptive control [1, 2, 3]. For time-uniformly bounded noise \( \delta_t \sim \mathcal{O}(1)\) and constant excitation \( \gamma_t \sim \mathcal{O}(1)\) , we have a constant \( \text{SNR}_t \sim \mathcal{O}(1)\) . For i.i.d. Gaussian noise, it holds that \( \delta_t \sim \mathcal{O}(1/\sqrt{t})\) with high probability [3]. In this case, we have \( \text{SNR}_t \sim \mathcal{O}(\sqrt{t})\) for constant excitation \( \gamma_t \sim \mathcal{O}(1)\) and \( \text{SNR}_t \sim \mathcal{O}(t^{\frac{1}{4}})\) for diminishing excitation \( \gamma_t \sim \mathcal{O}(t^{-\frac{1}{4}})\) .
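As a concrete sketch of the identification step in Lemma 1, assuming (as is standard) that problem (9) is ordinary least squares over the data matrices, the model estimate can be computed as below. The function name `identify` and the exact data layout are illustrative, not the paper's notation.

```python
import numpy as np

def identify(X0, U0, X1):
    """Least-squares model estimate from the data matrices of Lemma 1.

    Sketch only: problem (9) is assumed to be ordinary least squares,
    i.e., [A_hat, B_hat] = X1 @ pinv([X0; U0]).
    """
    Z = np.vstack([X0, U0])           # stacked regressor [X0; U0]
    Theta = X1 @ np.linalg.pinv(Z)    # least-squares estimate [A_hat, B_hat]
    n = X0.shape[0]
    return Theta[:, :n], Theta[:, n:]
```

With persistently exciting inputs (Assumption 2) and no noise, the estimate recovers \( (A,B)\) exactly; the noise-dependent error is what Lemma 1 quantifies.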
Assumption 3
The initial gain is stabilizing, i.e., \( K_{t_0} \in \mathcal{S}\) .
Lemma 2 ([4, Lemma 1])
For any \( K\in\mathcal{S}\) , the gradient of \( C(K)\) is given by
\( \nabla C(K) = 2\big( (R+B^{\top}PB)K + B^{\top}PA \big)\Sigma =: 2E\Sigma, \) (18)
where \( {\Sigma}\) satisfies (4), and \( P\) is the positive definite solution to the Lyapunov equation
\( P = Q + K^{\top}RK + (A+BK)^{\top}P(A+BK). \) (19)
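A minimal model-based sketch of Lemma 2, under the following assumptions (not stated in this excerpt): the convention \( u_t = Kx_t\) so the closed loop is \( A+BK\) as in Definition 1; equation (4) has the form \( \Sigma = W + (A+BK)\Sigma(A+BK)^{\top}\) with noise covariance \( W\); and \( C(K) = \operatorname{tr}(PW)\). Under these assumptions the gradient is \( 2((R+B^{\top}PB)K + B^{\top}PA)\Sigma\).

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_cost_and_grad(A, B, Q, R, K, W):
    # Closed loop for the assumed convention u_t = K x_t.
    Acl = A + B @ K
    # P solves the Lyapunov equation P = Q + K'RK + Acl' P Acl (cf. (19)).
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    # Sigma solves Sigma = W + Acl Sigma Acl' (assumed form of (4)).
    Sigma = solve_discrete_lyapunov(Acl, W)
    # Gradient 2*E*Sigma with E = (R + B'PB)K + B'PA (cf. (18)).
    E = (R + B.T @ P @ B) @ K + B.T @ P @ A
    return np.trace(P @ W), 2 * E @ Sigma
```

A finite-difference check of the returned gradient is a quick sanity test of the sign convention.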
Lemma 3 (Gradient dominance)
For \( K \in \mathcal{S}\) , it holds that
where \( \mu := \|\Sigma^*\|/\underline{\sigma}(R) \) is the gradient dominance constant, and \( \Sigma^*\) is the closed-loop covariance matrix (4) with \( K^*\) .
Lemma 4 (Local smoothness)
Define the sublevel set \( \mathcal{S}(a):=\{K\in \mathcal{S}|C(K)\leq a\}\) . For any \( K,K' \in \mathcal{S}(a)\) satisfying \( K+b(K'-K) \in \mathcal{S}(a), \forall b \in [0,1]\) , it holds that
where the smoothness constant \( l(a)\) is a polynomial in \( a\) .
Definition 1 (Strong stability)
A policy \( K\) is \( (\kappa, \alpha)\) -strongly stable if there exist constants \( \kappa \geq 1\) , \( 0< \alpha \leq 1\) , and matrices \( H \succ 0\) and \( L\) , such that \( A +BK = HLH^{-1}\) with \( \|L\|\leq 1-\alpha\) , \( \|H\|\|H^{-1}\| \leq \kappa\) , and \( \|K\|\leq \kappa\) .
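Definition 1 can be certified constructively. The sketch below uses a standard construction (not necessarily the paper's): with \( \Sigma\) solving \( \Sigma = I + A_{\mathrm{cl}}\Sigma A_{\mathrm{cl}}^{\top}\) and \( H = \Sigma^{1/2}\), the similarity \( L = H^{-1}A_{\mathrm{cl}}H\) satisfies \( \|L\| \leq \sqrt{1-1/\|\Sigma\|} < 1\), so \( \alpha\) can be taken as \( 1/(2\|\Sigma\|)\).

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, sqrtm

def strong_stability_certificate(Acl):
    """Construct (H, L) certifying strong stability of a Schur-stable A + BK.

    Standard construction (a sketch, not the paper's proof): take
    Sigma = I + Acl Sigma Acl', H = Sigma^{1/2}, L = H^{-1} Acl H;
    then ||L|| <= sqrt(1 - 1/||Sigma||) < 1 and Acl = H L H^{-1}.
    """
    Sigma = solve_discrete_lyapunov(Acl, np.eye(Acl.shape[0]))
    H = np.real(sqrtm(Sigma))              # H > 0 since Sigma > 0
    L = np.linalg.solve(H, Acl @ H)        # similarity transform of Acl
    return H, L
```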
Definition 2 (Sequential stability)
A sequence of policies \( \{K_t\}\) for the system (1) is sequentially stable if there exist constants \( \kappa \geq 1\) , \( 0< \alpha \leq 1\) , and matrices \( \{H_t\succ 0\}\) and \( \{L_t\}, \forall t\) , such that \( A+BK_t = H_tL_tH_t^{-1}\) , and
Theorem 1
There exist constants \( \nu_i>0,i\in\{1,2,\dots,5\}\) with \( \nu_4 <1\) depending on \( (A,B,Q,R,K_{t_0})\) , such that, if \( \text{SNR}_t \geq \nu_1\) for all \( t\) and \( \eta \leq \min\{\nu_2, 2\mu\}\) , then the sequence \( \{K_t\}\) is sequentially stable, and the state is bounded as
(22)
Moreover, the policy \( \{K_t\}\) converges in the sense that
(23)
Lemma 5
For \( V \in \mathcal{V}\) , the gradient of \( J(V)\) is
(24)
where \( \Sigma\) is the solution to (16), and \( P\) is the positive definite solution to the Lyapunov equation
\( P = Q + V^{\top}\overline{U}_0^{\top}R\overline{U}_0V + V^{\top}\overline{X}_1^{\top}P\overline{X}_1V. \) (25)
Further, we define the projection \( \Pi_{\overline{X}_{0}} := I_{n+m}-\overline{X}_{0}^{\dagger}\overline{X}_{0}\) onto the constraint \( \overline{X}_0V= I_n\) in (15).
Lemma 6
The policy update (27)-(29) is equivalent to
(26)
where \( M_t:= \overline{U}_{0,t} \Pi_{\overline{X}_{0,t}} \overline{U}_{0,t}^{\top}\) is a positive definite matrix depending only on data. Moreover, the minimum singular value of \( M_t\) satisfies \( \underline{\sigma}(M_t) \geq \gamma_t^4\) .
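The projector \( \Pi_{\overline{X}_0}\) and the data matrix \( M_t\) of Lemma 6 can be formed directly from data. A minimal sketch, assuming the sample-covariance shapes \( \overline{X}_0 \in \mathbb{R}^{n\times(n+m)}\) (full row rank) and \( \overline{U}_0 \in \mathbb{R}^{m\times(n+m)}\); the function name is illustrative.

```python
import numpy as np

def projection_and_M(X0bar, U0bar):
    """Sketch of the data matrices in Lemma 6.

    Pi = I_{n+m} - pinv(X0bar) @ X0bar is the orthogonal projector onto
    the nullspace of X0bar (the tangent space of the constraint
    X0bar V = I_n), and M = U0bar @ Pi @ U0bar.T is symmetric PSD,
    positive definite under persistency of excitation.
    """
    d = X0bar.shape[1]                               # d = n + m
    Pi = np.eye(d) - np.linalg.pinv(X0bar) @ X0bar   # orthogonal projector
    M = U0bar @ Pi @ U0bar.T
    return Pi, M
```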
Theorem 2
There exist constants \( \nu_i>0,i\in\{1,2,\dots,6\}\) with \( \nu_4<1\) depending on \( (A,B,Q,R,K_{t_0})\) , such that, if \( \text{SNR}_t \geq \max\{\nu_1, \nu_2\|M_{t}\|/\underline{\sigma}(M_{t}) \} \) and \( \eta_t \leq \|M_{t}\|^{-1} \min \{\nu_5, 2\mu\}\) for all \( t\) , then \( \{K_t\}\) is sequentially stable, and (22) holds. Moreover, the policy converges as
Remark 2
Compared with the vanilla gradient descent, the natural gradient descent (31) is computationally more efficient, as it requires solving only one Lyapunov equation (19) instead of the two needed for the vanilla gradient in Lemma 2. Moreover, the stepsize for the natural gradient (31) can be chosen more aggressively than that of the vanilla gradient, leading to an improved convergence rate [4].
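A model-based sketch of one natural-gradient step, under the same assumptions as before (convention \( u_t = Kx_t\), gradient \( \nabla C(K) = 2E\Sigma\) with \( E = (R+B^{\top}PB)K + B^{\top}PA\)): since \( \nabla C(K)\Sigma^{-1} = 2E\), the covariance cancels and only the single Lyapunov equation (19) must be solved, which is the computational point of Remark 2.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def natural_gradient_step(A, B, Q, R, K, eta):
    """One natural-gradient step K+ = K - eta * grad C(K) * Sigma^{-1}.

    Sketch: grad C(K) = 2 E Sigma, so the update reduces to K - 2*eta*E
    and needs only the Lyapunov equation for P, not Sigma.
    """
    Acl = A + B @ K
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    E = (R + B.T @ P @ B) @ K + B.T @ P @ A
    return K - 2 * eta * E
```

For a small enough stepsize, one step strictly decreases the cost whenever \( E \neq 0\), since \( -E\) is a descent direction.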
Lemma 7
For Algorithm 2, the policy update (27)-(29) with (28) replaced by natural gradient descent is equivalent to (32).
Theorem 3
With different values of \( \nu_i, i \in \{1,2,\dots,5\}\) , the results in Theorem 1 also apply to PGAC with the natural gradient descent (32).
Theorem 4
With different values of \( \nu_i, i \in \{1,2,\dots,5\}\) and \( \eta < 2\|\Sigma^*\|\) , the results in Theorem 1 apply to indirect PGAC with the Gauss-Newton policy update (34), with the convergence certificate
(35)
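The model-based Gauss-Newton update with unit stepsize recovers Hewer's policy iteration, as noted in the abstract. A sketch under the assumed convention \( u_t = Kx_t\): the step \( K^+ = K - (R+B^{\top}PB)^{-1}E\) with \( E = (R+B^{\top}PB)K + B^{\top}PA\) collapses to \( K^+ = -(R+B^{\top}PB)^{-1}B^{\top}PA\), which converges to the optimal LQR gain.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

def hewer(A, B, Q, R, K, iters=15):
    """Gauss-Newton update (34) with unit stepsize = Hewer's policy iteration.

    Each pass evaluates P for the current gain via the Lyapunov equation,
    then performs the greedy policy improvement step.
    """
    for _ in range(iters):
        Acl = A + B @ K
        P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K
```

The limit matches the DARE-based gain \( K^* = -(R+B^{\top}P^*B)^{-1}B^{\top}P^*A\) with \( P^*\) from `solve_discrete_are(A, B, Q, R)`.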
Theorem 5
There exist constants \( \nu_i>0,i\in\{1,2,\dots,5\}\) with \( \nu_4<1\) depending on \( (A,B,Q,R,K_{t_0})\) , such that, if \( \text{SNR}_t \geq \nu_1, \forall t\) and \( C(K_{t_0})-C^*\leq \nu_2\) , then \( \{K_t\}\) is sequentially stable, and (22) holds. Moreover, the policy \( \{K_t\}\) converges in the sense that
(36)
Lemma 8
For \( K\in \{K|\rho(\hat{A}+\hat{B}K)<1\}\) , the gradient of \( \hat{C}(K;\lambda)\) is given by
where \( \Sigma\) and \( P\) are the solutions to (11) and \( P = Q_{\lambda} + K^{\top}R_{\lambda}K + \lambda K^{\top}(\Phi^{-1})_{ux} + \lambda(\Phi^{-1})_{xu}K + (\hat{A}+\hat{B}K)^{\top}P (\hat{A}+\hat{B}K) \) , respectively.
Theorem 6
Consider the regularized indirect PGAC algorithm, where we replace the gradient \( \nabla\hat{C}_{t+1}(K_t)\) in (20) with \( \nabla\hat{C}_{t+1}(K_t;\lambda_{t+1})\) . With different values of \( \nu_i, i \in \{1,2,\dots,5\}\) and the condition \( \lambda_t \leq \nu_6 \delta_t\gamma_t\) for \( \nu_6>0\) , the results in Theorem 1 apply, i.e., the boundedness of the state (22) and the convergence certificate of \( C(K_t)\) (23) hold.
Lemma 9
For \( V\in \mathcal{V}\) , the gradient of \( J(V;\lambda)\) is
where \( \Sigma\) and \( P\) are the solutions to (16) and \( P = Q + V^{\top}(\lambda \Phi + \overline{U}_0^{\top}R\overline{U}_0)V + V^{\top}\overline{X}_1^{\top}P\overline{X}_1V \) , respectively.
Theorem 7
Consider the regularized direct PGAC algorithm, where we replace the gradient \( \nabla J_{t+1}(V_{t+1})\) in (28) with \( \nabla J_{t+1}(V_{t+1};\lambda_{t+1})\) in Lemma 9. With different values of \( \nu_i, i \in \{1,2,\dots,6\}\) and the condition \( \lambda_t \leq \nu_7 \delta_t\gamma_t\) for \( \nu_7>0\) , the boundedness of the state and the convergence certificate of \( C(K_t)\) in Theorem 2 apply.
Lemma 10
For \( K \in \mathcal{S}\) , it holds that (i) \( \|\Sigma\| \leq C(K)/\underline{\sigma}(Q), \) (ii) \( \|P\| \leq C(K), \) and (iii) \( \|K\|_F \leq (C(K)/\underline{\sigma}(R))^{1/2}. \)
Lemma 11
Let \( A\in \mathbb{R}^{n \times n}\) be stable and \( \Sigma(A)\) be the unique positive definite solution to \( \Sigma(A) = I_n + A\Sigma(A) A^{\top}\) . If \( \|A'-A\| \leq 1/({4\|\Sigma(A)\|(1+\|A\|)}), \) then \( A'\) is stable and \( \|\Sigma(A')-\Sigma(A)\| \leq 4 \|\Sigma(A)\|^2(1+\|A\|)\|A'-A\| . \)
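The perturbation bound of Lemma 11 can be checked numerically on a concrete example (the matrices below are hypothetical, chosen only for illustration): pick \( A'\) inside the admissible ball and compare \( \|\Sigma(A')-\Sigma(A)\|\) against the bound \( 4\|\Sigma(A)\|^2(1+\|A\|)\|A'-A\|\).

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Numeric illustration of Lemma 11 on a hypothetical example.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
S = solve_discrete_lyapunov(A, np.eye(2))    # Sigma(A) = I + A Sigma(A) A'
nA = np.linalg.norm(A, 2)
nS = np.linalg.norm(S, 2)
radius = 1 / (4 * nS * (1 + nA))             # admissible perturbation size
Ap = A + (radius / 2) * np.eye(2)            # ||Ap - A|| <= radius
Sp = solve_discrete_lyapunov(Ap, np.eye(2))
gap = np.linalg.norm(Sp - S, 2)
bound = 4 * nS**2 * (1 + nA) * np.linalg.norm(Ap - A, 2)
```

Per the lemma, \( A'\) remains stable and `gap` stays below `bound`.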
Lemma 12
Let \( K \in \mathcal{S}\) . Then, there exists a polynomial \( p_3 = \text{poly}(C(K)/\underline{\sigma}(Q), \|A\|, \|B\|, \|R\|, 1/\underline{\sigma}(R))\) such that, if \( \delta / \gamma \leq p_2\) , then \( \| \nabla{C}(K) - \nabla\hat{C}(K) \| \leq {p_3\delta}/{\gamma}.\)
Lemma 13
Let \( K \in \mathcal{S}\) . There exist polynomials \( p_4 = \text{poly}(C(K)/\underline{\sigma}(Q), \|A\|^{-1}, \|B\|^{-1}, \|R\|^{-1}, \underline{\sigma}(R) )\) and \( p_5,p_6 = \text{poly}(C(K)/\underline{\sigma}(Q), \|A\|, \|B\|, \|R\|, 1/\underline{\sigma}(R))\) such that, if \( \|\widetilde{K} - K\| \leq p_4\) , then \( \widetilde{K} \in \mathcal{S}\) and
Lemma 14
Let \( K \in \mathcal{S}\) . There exists a polynomial \( p_7\) in \( (\underline{\sigma}(Q)/C(K), \|A\|^{-1}, \|B\|^{-1}, \|R\|^{-1}, \underline{\sigma}(R))\) such that, if
then it holds that \( \left| C(K'') - C(K')\right| \leq \eta p_3p_6 \delta/{\gamma}. \)
Lemma 15 (Boundedness of the cost)
If
(48)
then \( C(K_t)\) of Algorithm 1 has a uniform upper bound, i.e., \( C(K_t) \leq \overline{C}\) with \( \overline{C}\) defined in (47).
Lemma 16
The policy \( K \in \mathcal{S}\) is \( (\kappa, \alpha)\) -strongly stable with
Lemma 17
There exists \( \bar{p}_{8}\) as a function of \( \overline{C}\) such that, if
(49)
then \( \{K_t\}\) of Algorithm 1 is sequentially strongly stable with parameters \( (\bar{\kappa}, \underline{\alpha})\) , where \( \bar{\kappa}, \underline{\alpha}\) are the quantities of Lemma 16 evaluated at \( \overline{C}\) .
Lemma 18
Under the condition (49), it holds that
(50)
Lemma 19
For \( K\in \mathcal{S}\) and a stepsize \( \eta \leq 1/(l\|M\|)\) , the policy gradient update (52) satisfies
Lemma 20
Let \( K \in \mathcal{S}\) . If
then it holds that \( \left| C(K'') - C(K')\right| \leq {\eta p_6p_3 \delta \|M\|}/{\gamma}. \)
Lemma 21 (Boundedness of the cost)
If
(53)
then \( C(K_t)\) has a uniform upper bound, i.e., \( C(K_t) \leq \overline{C}\) .
Lemma 22
If
then \( \{K_t\}\) of Algorithm 2 is sequentially strongly stable with parameters \( (\bar{\kappa}, \underline{\alpha})\) . Moreover, the state is bounded as (50).
Lemma 23
Let \( K \in \mathcal{S}\) . There exists a polynomial \( p_9=\text{poly}(C(K)/\underline{\sigma}(Q), \|A\|, \|B\|, \|R\|, 1/\underline{\sigma}(R))\) such that if \( \delta / \gamma \leq p_2\) , then \( \| \hat{E}-E \| \leq {p_9\delta}/{\gamma}. \)
Lemma 24
Let \( K \in \mathcal{S}\) and \( p_{10} = {1}/{\|R+B^{\top}PB\|}\) . If
then it holds that \( \left| C(K'') - C(K')\right| \leq {\eta p_6p_9 \delta}/{\gamma}. \)
Lemma 25 (Boundedness of the cost)
If
(59)
then \( C(K_t)\) of PGAC with natural gradient has a uniform upper bound, i.e., \( C(K_t) \leq \overline{C}, \forall t\geq t_0\) .
Lemma 26
There exists \( \bar{p}_{11}\) as a function of \( \overline{C}\) such that, if
(60)
then \( \{K_t\}\) is sequentially strongly stable with parameters \( (\bar{\kappa}, \underline{\alpha})\) , where \( \bar{\kappa}, \underline{\alpha}\) are the quantities of Lemma 16 evaluated at \( \overline{C}\) . Moreover, the state is bounded as (50).
Lemma 27
Let \( K \in \mathcal{S}\) . Then, there exist \( p_{12}\) as a monotonically decreasing function of \( C(K)\) and \( p_{13}=\text{poly}(C(K)/\underline{\sigma}(Q), \|A\|, \|B\|, \|R\|, 1/\underline{\sigma}(R))\) , such that, if \( \delta / \gamma \leq p_{12}\) , then \( \| (R+\hat{B}^{\top}\hat{P}\hat{B})^{-1}\hat{E}-(R+B^{\top}PB)^{-1}E \| \leq {p_{13}\delta}/{\gamma}.\)
Lemma 28
Let \( K \in \mathcal{S}\) . If \( {\delta}/{\gamma} \leq \min\{p_{12}, {p_4}/{p_{13}} \}, \) then it holds that \( | C(K'') - C(K')| \leq { p_6p_{13} \delta}/{\gamma}. \)
Lemma 29 (Boundedness of the cost)
If
(64)
then \( C(K_t)\) of Algorithm 3 satisfies \( C(K_t) \leq \overline{C}\) .
Lemma 30
There exist a constant \( c\) and \( \bar{p}_{13}\) as a function of \( \overline{C}\) such that, if
(65)
then \( \{K_t\}\) of Algorithm 3 is sequentially strongly stable with parameters \( (\bar{\kappa}, \underline{\alpha})\) , where \( \bar{\kappa}, \underline{\alpha}\) are the quantities of Lemma 16 evaluated at \( \overline{C}\) .
Lemma 31
Let \( K \in \mathcal{S}\) . Then, there exist \( p_{14}=\text{poly}(C(K)/\underline{\sigma}(Q), \|A\|, \|B\|, \|R\|, 1/\underline{\sigma}(R))\) and a constant \( c_0>0\) such that, if \( \delta / \gamma \leq p_2\) and \( \lambda \leq c_0\delta \gamma\) , then \( \| \nabla\hat{C}_{\lambda}(K) - \nabla{C}(K) \| \leq {p_{14}\delta}/{\gamma}. \)
[1] Naive exploration is optimal for online LQR. In International Conference on Machine Learning, PMLR, 2020, pp. 8937–8948.
[2] Data-enabled policy optimization for direct adaptive learning of the LQR. arXiv preprint arXiv:2401.14871, 2024.
[3] Almost surely \( \sqrt{T}\) regret bound for adaptive LQR. IEEE Transactions on Automatic Control (early access), 2025.
[4] Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, 2018, pp. 1467–1476.