First published on Wednesday, Feb 26, 2025 and last modified on Thursday, Apr 10, 2025
Division of Decision and Control Systems, Digital Futures, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden and Laboratory for Information and Decision Systems, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
Division of Decision and Control Systems, Digital Futures, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden
Division of Decision and Control Systems, Digital Futures, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden
Nonlinear systems, state estimation, KKL observers, physics-informed learning
This paper proposes a novel learning approach for designing Kazantzis-Kravaris/Luenberger (KKL) observers for autonomous nonlinear systems. The design of a KKL observer involves finding an injective map that transforms the system state into a higher-dimensional observer state, whose dynamics is linear and stable. The observer’s state is then mapped back to the original system coordinates via the inverse map to obtain the state estimate. However, finding this transformation and its inverse is quite challenging. We propose to sequentially approximate these maps by neural networks that are trained using physics-informed learning. We generate synthetic data for training by numerically solving the system and observer dynamics. Theoretical guarantees for the robustness of state estimation against approximation error and system uncertainties are provided. Additionally, a systematic method for optimizing observer performance through parameter selection is presented. The effectiveness of the proposed approach is demonstrated through numerical simulations on benchmark examples and its application to sensor fault detection and isolation in a network of Kuramoto oscillators using learned KKL observers.
Observers play a crucial role in applications such as output feedback control [1], fault diagnosis [2], and digital twins [3]. Accurate state estimation is critical for effective control, monitoring, and decision-making in various engineering domains. Full-state measurement of real-world systems is often impractical, or even impossible, due to limitations in sensing resources and capabilities. This necessitates the use of observers that employ available sensor measurements and the system’s model to estimate the state. A variety of observer designs exist, including Luenberger-like observers, extended Kalman filters, and high-gain observers, each with its own strengths and weaknesses depending on the specific application and system characteristics [4, 5, 6, 7]. This work focuses on Kazantzis-Kravaris/Luenberger (KKL) observers due to their applicability to a very general class of nonlinear systems and the well-developed theoretical framework for their design and analysis [8, 9, 10, 11, 12].
Despite the advancements, synthesizing KKL observers remains a challenging task. The existing techniques assume the knowledge of an injective map that transforms a general nonlinear system into a special form required by the KKL observer. However, analytically finding this map is an inherently difficult problem. Moreover, in cases when this transformation map is known [13], finding its inverse so that one can obtain the state estimate in the original, physically meaningful coordinates is also a difficult problem [14]. This highlights the need for novel approaches that can overcome these limitations.
In this paper, we present a novel learning approach for designing KKL observers for autonomous nonlinear systems. At the core of KKL observer design is an injective transformation that lifts the nonlinear system to a higher-dimensional space. The transformed system exhibits two key properties: its dynamics is linear up to output injection and is bounded-input-bounded-state stable. The KKL observer operates by replicating the transformed system in the higher-dimensional space. To reconstruct the state estimate in the original coordinates, the left-inverse of the transformation map is applied to the observer’s state. The transformation map’s injectivity property guarantees the accuracy of the state estimate.
Given a KKL observer in an appropriate high-dimensional space, the transformation map can be obtained by solving a specific partial differential equation (PDE). However, this PDE is computationally challenging to solve in practice, and finding the left-inverse of the transformation map in real time presents additional difficulties. To address these challenges, we develop a framework for learning these maps using synthetic data generated from system dynamics and sensor measurements. We employ physics-informed neural networks to directly integrate the PDE constraint into the learning process for the transformation map and its inverse.
This paper makes several key contributions to the learning-based design and implementation of KKL observers. In addition to proposing a physics-informed learning method for designing KKL observers, we establish theoretical robustness guarantees of the learned KKL observer against approximation errors and system uncertainties. To optimize observer performance, we develop a systematic method for parameter selection that enhances robustness while preserving the learning characteristics of the observer. Through extensive numerical simulations, we validate the effectiveness of our learning-based approach to KKL observer design. Additionally, we extend the practical utility of our method by demonstrating its application to sensor fault detection and isolation.
The rest of the paper is organized as follows. Section 2 provides background on KKL observers and situates our work within the existing literature. Section 3 reviews the theoretical foundations of KKL observers. Section 4 presents our learning method for KKL observer design, and Section 5 develops theoretical robustness guarantees. Section 6 introduces our methodology for optimizing KKL observer parameters to find a trade-off between robustness and learnability. The effectiveness of our approach is demonstrated through numerical examples in Section 7 and through an application to fault detection and isolation in Section 8. Finally, Section 9 concludes the paper with a discussion of implications and future research directions.
KKL observers generalize the theory of Luenberger observers [15, 16, 17] to nonlinear systems. Although the idea was initially proposed in [18] and [19], KKL observers were subsequently rediscovered by Kazantzis and Kravaris [8], who provided local guarantees around an equilibrium point of the estimation error dynamics via Lyapunov’s Auxiliary Theorem. Later, [20] relaxed the restrictive assumptions of [8] to some extent; however, the analysis remained local until [21] proposed the first global result under the assumption of so-called finite complexity, which also turned out to be quite restrictive for general nonlinear systems. Subsequently, Andrieu and Praly [9] provided a comprehensive treatment of the KKL observer problem in 2006. Their key contribution lies in relating the existence of an injective transformation map – a requirement for KKL observers – to an observability-like property of the system known as backward distinguishability. In addition, Andrieu [10] proved that KKL observers converge exponentially and are tunable if the system is also differentially observable. The existence conditions are further refined in [12], and KKL observers with contracting dynamics are developed in [22]. When the observability conditions are not satisfied, [23] proposes a set-valued KKL observer that estimates a set of possible state trajectories that could have produced the observed output trajectories. Finally, extensions of KKL observers to non-autonomous and controlled nonlinear systems are presented in [24, 25, 13]; however, such systems are outside the scope of the present paper.
The main challenge in synthesizing KKL observers for autonomous nonlinear systems is not only finding the transformation map that puts the system into a normal form but also finding an inverse map. Both problems turn out to be challenging in practice. To this end, [26, 27, 28, 29, 30, 31] have proposed several methods to approximate the transformation map and its inverse via deep neural networks.
By fixing the dynamics of the KKL observer, [26] proposes to generate synthetic data trajectories by numerically integrating both the system model and the KKL observer, where both are initialized at multiple points in their corresponding state spaces. Then, using a supervised learning approach, a neural network is trained on the synthetic data to approximate the transformation map and its left-inverse. For discrete-time nonlinear systems, [27] proposes an unsupervised learning approach that enables proper exploration of the state space during training, whereas [31] extends a similar approach by allowing for switching observers. However, noisy measurements can lead to erroneous switching decisions, causing the observer to select the wrong observer mode or introduce instability due to chattering. Similarly, [28] proposes another unsupervised learning approach for tuning a KKL observer while adding the PDE associated with the transformation map as a design constraint. The unsupervised approach exhibits poor generalization capabilities. Specifically, when the real system’s initial state deviates significantly from the conditions observed during training, the observer’s performance deteriorates considerably. When the system dynamics are partially or fully unknown, [30] proposes a neural ODE-based approach to design KKL observers, which allows analyzing the trade-off between convergence speed and robustness, leading to improved observer training for robust performance.
Our prior work [29] leveraged physics-informed learning to design KKL observers with improved accuracy, generalization, and training efficiency. In contrast to other learning-based approaches, that method avoids overfitting and achieves better generalization performance across the entire state space. However, [29] employed a joint encoder-decoder neural network architecture to simultaneously learn the transformation map and its inverse, resulting in conflicting gradients between different components of the objective function. This caused the optimization to sometimes get stuck in a bad local minimum, which degraded the observer’s accuracy in certain cases [31]. In our current work, one of our contributions is to introduce a sequential learning approach for the KKL observer. We first learn the transformation map and subsequently utilize it to learn its inverse. This sequential approach significantly improves the observer’s accuracy.
For \( v\in\mathbb R^n\), \( \|v\|\) denotes its Euclidean norm and \( \|v\|_\infty\) its max norm. For a matrix \( M\in\mathbb R^{m\times n}\), \( \|M\|\) denotes its spectral norm and \( \text{cond}(M)\) its condition number. For a square matrix \( M\in\mathbb R^{n\times n}\), \( \text{eig}(M)\subset\mathbb C\) denotes the set of its eigenvalues and \( \lambda_{\min}(M) \in \arg\min_{\lambda\in\text{eig}(M)} |\lambda| \) denotes an eigenvalue of \( M\) closest to the origin. We denote the set \( [p]\mathrm{:=}\{1,\dots, p\}\) for some \( p\in\mathbb Z_{>0}\). Given a signal \( s:\mathbb R_{\geq 0}\to \mathbb R^n\), we denote its restriction to \( [0,t]\) by \( s_{[0,t]}\), where \( t\in\mathbb R_{\geq 0}\). Moreover, its essential supremum (or \( L^\infty\)) norm is defined as \( \|s_{[0,t]}\|_{L^\infty}\mathrm{:=}\inf\{c\in\mathbb R_{\geq 0}: \|s(\tau)\|_\infty\leq c \text{ for almost all } \tau\in[0,t]\}\).
Consider a compact set \( \mathcal X\subset\mathbb R^{n_x}\) and a nonlinear system
\[ \dot x(t) = f(x(t)), \quad x(0) = x_0 \tag{1.a} \]
\[ y(t) = h(x(t)) \tag{1.b} \]
where \( x(t)\in\mathcal X\) is the state at time \( t\in\mathbb R_{\geq 0}\) , \( x_0\in\mathcal X\) is an unknown point from where the system’s state is initialized, \( y(t)\in\mathbb R^{n_y}\) is the measured output, and \( f:\mathcal X\to\mathbb R^{n_x}\) and \( h:\mathcal X\to\mathbb R^{n_y}\) are smooth functions. The state observation problem involves designing an observer
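\[ \dot{\hat z}(t) = \Phi(\hat z(t), y(t)), \quad \hat z(0) = \hat z_0 \tag{2.a} \]
\[ \hat x(t) = \Psi(\hat z(t), y(t)) \tag{2.b} \]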
with an internal state \( \hat z(t)\in\mathbb R^{n_z}\) initialized at \( \hat z_0\in\mathbb R^{n_z}\) , which takes the output \( y(t)\) of (1) as its input and provides an estimate \( \hat x(t)\in\mathbb R^{n_x}\) of the state \( x(t)\) as its output. Designing an observer means choosing functions \( \Phi:\mathbb{R}^{n_z}\times\mathbb{R}^{n_y}\to\mathbb{R}^{n_z}\) and \( \Psi:\mathbb{R}^{n_z}\times\mathbb{R}^{n_y}\to\mathbb{R}^{n_x}\) so that the estimation error
\[ \xi(t) \mathrm{:=} x(t) - \hat x(t) \tag{3} \]
globally asymptotically converges to zero as \( t\to\infty\) . That is, for all \( x_0\in\mathcal X\) and \( \hat z_0\in\mathbb R^{n_z}\) ,
\[ \lim_{t\to\infty} \|\xi(t)\| = 0. \tag{4} \]
Designing a KKL observer involves transforming the nonlinear system (1) to a higher-dimensional state space where its dynamics is linear up to output injection and bounded-input bounded-state stable. This transformation \( \mathcal T : \mathcal X\to\mathcal Z\), which must be injective, maps every point \( x\) in the state space \( \mathcal X\subseteq\mathbb{R}^{n_x}\) of (1) to a point \( z=\mathcal T(x)\) in the new state space \( \mathcal Z\subseteq\mathbb{R}^{n_z}\), where in general \( n_z\gg n_x\). The dynamics in the new state space \( \mathcal Z\) is linear up to output injection and given by
\[ \dot z(t) = A z(t) + B h(x(t)), \quad z(0) = \mathcal T(x_0) \tag{5} \]
where \( A\in\mathbb{R}^{n_z\times n_z}\) and \( B\in\mathbb{R}^{n_z\times n_y}\) are chosen such that \( A\) is Hurwitz and the pair \( (A,B)\) is controllable. Although the transformed system (5) in the new coordinates is bounded-input bounded-state stable, the original nonlinear system (1) need not be Lyapunov stable. Moreover, (5) is not a linear system; it is linear only up to output injection, i.e., it is linear when the injected output \( y=h(x)=h(\mathcal T^*(z))\) is ignored, where \( \mathcal T^*\) is the left-inverse of \( \mathcal T\).
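Since, by the chain rule along trajectories of (1),
\[ \dot z(t) = \frac{\partial \mathcal T}{\partial x}(x(t))\,\dot x(t) = \frac{\partial \mathcal T}{\partial x}(x(t))\, f(x(t)), \]
it follows from (5) that, for a given \( A\in\mathbb{R}^{n_z\times n_z}\) and \( B\in\mathbb{R}^{n_z\times n_y}\), \( \mathcal T\) must satisfy the following PDE:
\[ \frac{\partial \mathcal T}{\partial x}(x)\, f(x) = A\,\mathcal T(x) + B\, h(x) \tag{6} \]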
with \( \mathcal T(0_{n_x})=0_{n_z}\) .
To obtain a state estimate \( \hat x(t)\) in the original coordinates \( \mathbb{R}^{n_x}\) , the map \( \mathcal T\) must be injective, which implies the existence of its left-inverse \( \mathcal T^*\) , i.e., \( \mathcal T^*(\mathcal T(x))=x\) . When the inverse exists, the KKL observer is obtained as
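\[ \dot{\hat z}(t) = A \hat z(t) + B y(t), \quad \hat z(0) = \hat z_0 \tag{7.a} \]
\[ \hat x(t) = \mathcal T^*(\hat z(t)) \tag{7.b} \]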
where \( \hat z(t)\) denotes the estimate of \( z(t)\) . Notice that KKL observer (7) is a special case of (2), where \( \Phi(\hat z,y)\) is affine in \( \hat z\) and \( \Psi(\hat z,y) \mathrm{:=} \mathcal T^*(\hat z)\) .
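To make the structure of a KKL observer concrete, the following minimal sketch simulates one on a toy scalar system for which the transformation is known in closed form. Everything in it is an illustrative assumption rather than part of the paper: the system \( \dot x=-x\), \( y=x\), the scalar choice \( A=-2\), \( B=1\) (a one-dimensional observer state, used here only for brevity), and forward-Euler integration. For this linear case the map \( \mathcal T(x)=cx\) with \( c=B/(-1-A)\) solves the PDE relating \( \mathcal T\), \( f\), \( h\), \( A\), and \( B\), so the left-inverse is simply \( z/c\).

```python
# Toy system (an assumption for illustration): x' = -x, y = x on X = [-1, 1].
def f(x):
    return -x


def h(x):
    return x


# Observer parameters: A is Hurwitz (scalar and negative), B is nonzero.
A, B = -2.0, 1.0

# For this linear toy case, T(x) = c * x satisfies T'(x) f(x) = A T(x) + B h(x)
# when -c = A * c + B, i.e. c = B / (-1 - A).
c = B / (-1.0 - A)


def T_inv(z):
    """Left-inverse of T(x) = c * x."""
    return z / c


def simulate(x0, z_hat0, dt=1e-3, t_end=5.0):
    """Forward-Euler simulation of the system and of the observer
    z_hat' = A z_hat + B y, followed by x_hat = T_inv(z_hat)."""
    x, z_hat = x0, z_hat0
    for _ in range(int(round(t_end / dt))):
        y = h(x)  # the observer only sees the output
        x, z_hat = x + dt * f(x), z_hat + dt * (A * z_hat + B * y)
    return x, T_inv(z_hat)


# The observer starts far from T(x0); the estimate still converges.
x_final, x_hat_final = simulate(x0=0.8, z_hat0=-3.0)
error = abs(x_final - x_hat_final)
```

In this sketch the observer never uses \( x\) directly, only \( y\); the estimation error shrinks because the mismatch in the \( z\)-coordinate decays at the rate set by \( A\).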
We explore sufficient conditions that ensure the existence of an injective transformation map \( \mathcal T\) such that the estimate \( \hat x(t)\) obtained from the KKL observer (7) satisfies the estimation error requirement (4).
Let \( x(t;x_0)\) denote the solution trajectory of (1.a) initialized at \( x_0\in\mathcal X\) .
Definition 1
The system (1) is forward complete in \( \mathcal X\) if, for every \( x_0\in\mathcal X\) , the solution \( x(t;x_0)\) exists for every \( t\in\mathbb{R}_{\geq 0}\) and remains inside \( \mathcal X\) .
Assumption 1
There exists a compact set \( \mathcal X\subset\mathbb{R}^{n_x}\) such that the system (1) is forward complete in \( \mathcal X\) .
This assumption restricts our attention to nonlinear systems whose state \( x(t)\) remains bounded in forward time.
Definition 2
A map \( \mathcal T:\mathcal X\to\mathcal Z\) is said to be uniformly injective if there exists a class \( \mathcal K\) function \( \rho:\mathbb R_{\geq 0}\to\mathbb R_{\geq 0}\) such that, for every \( x,\hat x\in\mathcal X\) ,
\[ \|x-\hat x\| \leq \rho\big(\|\mathcal T(x)-\mathcal T(\hat x)\|\big). \tag{8} \]
Notice that (8) implies the existence of a class \( \mathcal K\) function \( \varrho\) such that, for every \( z,\hat z\in\mathcal Z\) ,
\[ \|\mathcal T^*(z)-\mathcal T^*(\hat z)\| \leq \varrho\big(\|z-\hat z\|\big). \tag{9} \]
Remark 1
For the existence of a KKL observer (7) satisfying the estimation error requirement (4), it is sufficient that (1) is forward complete and the map \( \mathcal T\) satisfying the PDE (6) is uniformly injective; see [9, Theorem 1]. \( \diamond\)
We have \( \dot{\hat z}(t)-\dot z(t)=A[\hat z(t)-z(t)]\). Since \( A\) is chosen to be a Hurwitz matrix, \( \|\hat z(t)-z(t)\|\) converges to zero exponentially. Thus, if \( \mathcal T\) is uniformly injective, the estimation error \( \xi(t)\) also converges to zero asymptotically because, from (8), we have
\[ \|\xi(t)\| = \|x(t)-\hat x(t)\| \leq \rho\big(\|z(t)-\hat z(t)\|\big) \]
and \( \rho(\|z(t)-\hat z(t)\|)\rightarrow 0\) as \( \|z(t)-\hat z(t)\|\to 0\).
Definition 3
Given an open set \( \mathcal O\supset\mathcal X\), the system (1) is backward \( \mathcal O\)-distinguishable in \( \mathcal X\) if, for every pair of distinct initial conditions \( x_0^1,x_0^2\in\mathcal X\), there exists \( \tau\in\mathbb R_{<0}\) such that the backward solutions \( x(t;x_0^1), x(t;x_0^2)\in\mathcal O\) exist for \( t\in[\tau,0]\), and
\[ h(x(\tau;x_0^1)) \neq h(x(\tau;x_0^2)). \]
Although we assumed that (1) is forward complete in \( \mathcal X\) , it may not be backward complete in \( \mathcal X\) . That is, the state trajectories of (1) may leave \( \mathcal X\) , and may even go unbounded, in negative time (\( t\in\mathbb R_{<0}\) ). The notion of backward \( \mathcal O\) -distinguishability guarantees the existence of a finite negative time such that the output maps, corresponding to a pair of trajectories initialized at different points in \( \mathcal X\) , can be distinguished before any of the trajectories leaves \( \mathcal O\supset\mathcal X\) in backward time. Backward \( \mathcal O\) -distinguishability is related to the notions of determinability or constructability in linear systems [32, 11].
Assumption 2
There exists an open bounded set \( \mathcal O\supset\mathcal X\) such that (1) is backward \( \mathcal O\) -distinguishable in \( \mathcal X\) .
It turns out that Assumptions 1 and 2 are sufficient for the existence of a uniformly injective map \( \mathcal T\) satisfying (6). This result is obtained in slightly different forms in [9] and [11]. The most recent result provided by [12] can be restated as follows.
Theorem 1 (Brivadis et al. [12])
Let Assumptions 1 and 2 hold. Then, for almost any \( (A,B)\in(\mathbb{R}^{n_z\times n_z},\mathbb{R}^{n_z\times n_y})\) such that
there exists a uniformly injective map \( \mathcal T:\mathcal X\to\mathcal Z\) that satisfies the PDE (6).
Theorem 1 shows that there exists an injective map \( \mathcal T\) and its left-inverse \( \mathcal T^*\) such that the estimate \( \hat x(t)\) obtained from the KKL observer (7) asymptotically converges to the true state \( x(t)\) . Therefore, relying on Assumptions 1 and 2, and fixing \( n_z = n_y(2n_x+1)\) , \( A\in\mathbb{R}^{n_z\times n_z}\) a Hurwitz matrix, and \( B\in\mathbb{R}^{n_z\times n_y}\) such that \( (A,B)\) is controllable, we can learn the transformation map \( \mathcal T\) and its left-inverse \( \mathcal T^*\) .
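As a small illustration of the design choices just described, the sketch below computes \( n_z = n_y(2n_x+1)\) and builds one admissible pair \( (A,B)\). The diagonal structure of \( A\), the eigenvalue spacing, and the all-ones \( B\) are assumptions made here for simplicity, not choices prescribed by the paper. For a diagonal \( A\) with distinct eigenvalues, the Popov-Belevitch-Hautus test reduces to checking that no row of \( B\) is zero, which is what the helper verifies.

```python
def kkl_dimension(n_x, n_y):
    """Observer dimension n_z = n_y * (2 * n_x + 1) used in the text."""
    return n_y * (2 * n_x + 1)


def diagonal_observer_matrices(n_x, n_y, slowest=-1.0, spacing=1.0):
    """Build a diagonal Hurwitz A with distinct eigenvalues and B of ones.

    The eigenvalues are slowest, slowest - spacing, slowest - 2 * spacing, ...
    (an illustrative choice; requires slowest < 0 and spacing > 0).
    """
    n_z = kkl_dimension(n_x, n_y)
    eigs = [slowest - spacing * i for i in range(n_z)]
    A = [[eigs[i] if i == j else 0.0 for j in range(n_z)] for i in range(n_z)]
    B = [[1.0] * n_y for _ in range(n_z)]
    return A, B


def is_hurwitz_diagonal(A):
    """A diagonal matrix is Hurwitz iff every diagonal entry is negative."""
    return all(A[i][i] < 0.0 for i in range(len(A)))


def is_controllable_diagonal(A, B):
    """PBH test specialised to diagonal A with distinct eigenvalues:
    (A, B) is controllable iff no row of B is zero.

    Returns False when eigenvalues repeat, which is conservative: such a
    pair may still be controllable when B has several columns.
    """
    diag = [A[i][i] for i in range(len(A))]
    distinct = len(set(diag)) == len(diag)
    rows_nonzero = all(any(b != 0.0 for b in row) for row in B)
    return distinct and rows_nonzero


# Example: a two-state system with one measured output gives n_z = 5.
A, B = diagonal_observer_matrices(n_x=2, n_y=1)
```

A diagonal \( A\) keeps the observer modes decoupled, which makes the effect of each eigenvalue on the transient easy to reason about; other Hurwitz, controllable pairs are equally valid.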
The critical steps of designing KKL observers involve finding the injective map \( \mathcal T: \mathcal X\to \mathcal Z\) that satisfies PDE (6), so that the nonlinear system (1) admits a linear up to output injection representation (5), and finding its left-inverse \( \mathcal T^*\) , so that a state estimate can be obtained in the original state coordinates. For finding \( \mathcal T\) , one must solve PDE (6), whose explicit solution derived in [9] is given by
\[ \mathcal T(x) = \int_{-\infty}^{0} \exp(-A\tau)\, B\, h(\breve{x}(\tau;x))\, d\tau \tag{10} \]
where \( \breve{x}(\tau;x)\in\mathcal X\) is the backward solution trajectory initialized at \( x\in\mathcal X\) , for \( \tau\in\mathbb R_{\leq 0}\) , to the modified dynamics \( \dot{\breve{x}}(\tau)=g(\breve{x}(\tau))\) with \( g(\breve{x}(\tau))=f(\breve{x}(\tau))\) if \( \breve{x}(\tau)\in\mathcal X\) and \( g(\breve{x}(\tau))=0\) otherwise.
However, computing (10) is not practically possible (see [11]) due to the inaccessibility of the backward output map \( h(\breve{x}(\tau;x))\) for \( \tau<0\) and the infeasibility of computing the integral (10) for every initial point \( x\in\mathcal X\). Moreover, even if \( \mathcal T\) is known in some form other than (10), finding the left-inverse \( \mathcal T^*\) is difficult both analytically and numerically [14]. To avoid these challenges, we propose to instead learn (approximate) the maps \( \mathcal T\) and \( \mathcal T^*\) using neural networks.
Let \( \hat{\mathcal T}_{\theta}\) and \( \hat{\mathcal T}^*_{\eta}\) be parametrized neural networks that approximate \( \mathcal T\) and \( \mathcal T^*\) , respectively. Here, \( \theta\in\mathbb{R}^{n_\theta}\) and \( \eta\in\mathbb{R}^{n_\eta}\) are vectors containing all the weights and biases of the neural networks. Then, the optimal values of \( \theta\) and \( \eta\) are defined as follows:
(11.a)
(11.b)
where \( x(\tau;\chi)\in\mathcal X\) denotes the state at time \( \tau\in[0,\infty)\) initialized at \( x(0)=\chi\in\mathcal X\) , and
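\[ \mathcal P_\theta(x) \mathrm{:=} \frac{\partial \hat{\mathcal T}_\theta}{\partial x}(x)\, f(x) - A\,\hat{\mathcal T}_\theta(x) - B\, h(x) \tag{12} \]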
is the residual of PDE (6) at \( x\in\mathcal X\) .
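The PDE residual can be evaluated numerically for any candidate map, which is what makes it usable as a training penalty. The sketch below is a minimal illustration under stated assumptions: a scalar state and output, a diagonal \( A\) passed as its diagonal `a_diag`, a single-column \( B\) passed as `b_col`, the sign convention "left side minus right side", and a central finite difference in place of the automatic differentiation a neural-network implementation would normally use. The toy system \( \dot x=-x\), \( y=x\) and both candidate maps are hypothetical.

```python
def pde_residual(T, f, h, a_diag, b_col, x, eps=1e-6):
    """Residual dT/dx(x) * f(x) - A T(x) - B h(x) at a scalar point x.

    T maps a float to a list of n_z floats; A is diagonal with entries
    a_diag and B is a single column with entries b_col.
    """
    T_plus, T_minus, T_x = T(x + eps), T(x - eps), T(x)
    fx, y = f(x), h(x)
    return [
        (tp - tm) / (2.0 * eps) * fx - a * tz - b * y  # central difference
        for tp, tm, tz, a, b in zip(T_plus, T_minus, T_x, a_diag, b_col)
    ]


def f(x):
    return -x


def h(x):
    return x


A_DIAG = [-2.0, -3.0, -4.0]  # n_z = n_y * (2 * n_x + 1) = 3
B_COL = [1.0, 1.0, 1.0]


def T_exact(x):
    """Closed-form solution for the toy system: T_i(x) = b_i / (-1 - a_i) * x."""
    return [b / (-1.0 - a) * x for a, b in zip(A_DIAG, B_COL)]


def T_wrong(x):
    """A candidate that is injective but does not satisfy the PDE."""
    return [x, x, x]


res_exact = pde_residual(T_exact, f, h, A_DIAG, B_COL, x=0.5)
res_wrong = pde_residual(T_wrong, f, h, A_DIAG, B_COL, x=0.5)
```

The exact map yields a residual that vanishes up to rounding, whereas the wrong candidate is flagged in the second and third components even though it fits the first one.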
The optimization problems in (11) are mathematically well-defined and can be numerically solved up to local minima by sampling solution trajectories of (1) and (5). Since the matrices \( A\) and \( B\) are fixed and the function \( f\) is known, it is possible to generate trajectories of \( x\) and \( z\) numerically. The only problem comes from the fact that, given \( x_0\in\mathcal X\), we do not know the initial condition of \( z\) because \( z_0 = {\mathcal T} (x_0)\). For an arbitrary \( z_0\), note that the solution of (5) satisfies
\[ z(t;z_0) = \exp(At)\, z_0 + \int_0^t \exp(A(t-s))\, B\, h(x(s;x_0))\, ds. \]
Since \( A\) is Hurwitz, \( \|\exp(At)z_0\| \to 0\) exponentially fast, and the effect of the initial condition \( z_0\) vanishes from \( z(t;z_0)\) over time \( t\) . Consequently, for any \( \varepsilon > 0\) and \( t > t_*(\varepsilon, z_0)\) with
we have
with \( \varepsilon>0\) the upper bound of the impact of the initial condition \( z_0\) on the norm \( \|z(t;z_0)\|\) , \( \lambda_{\min}(A)\) the real part of the eigenvalue of \( A\) with smallest magnitude, and \( V\) obtained from the eigendecomposition of \( A=V\Lambda V^{-1}\) , which is assumed to be diagonalizable. Therefore,
where \( t_*\mathrm{:=} t_*(\varepsilon,z_0)\) . As time progresses, the trajectory \( z(t;z_0^i)\) becomes almost independent of the choice of the initial condition \( z_0\) . This observation leads to the following data generation procedure.
Initial condition sampling: We sample \( p\in\mathbb Z_{>0}\) initial points \( \left\{ (x_0^i,z_0^i) \right\}_{i \in [p]}\) uniformly in \( \mathcal X \times \mathcal Z\) using Latin hypercube, orthogonal, or random sampling methods.
Trajectory generation: For all \( i \in [p]\) , we simulate (1) and (5) with initial condition \( (x_0^i, z_0^i)\) from \( t_1 = 0\) to \( t_{\tau} = T\) , where \( T\in\mathbb R_{>0}\) is chosen to be large.
Truncation: For all \( i \in [p]\) , the values of \( z\) -trajectory for \( t_k \geq t_{k_*}\) are kept and others are discarded, where
The truncation leads to the following datasets:
(13.a)
Note that, for all \( i\in[p]\) and \( t_k\geq t_{k_*}\) , we have \( z(t_k,z_0^i) \approx \mathcal T(x(t_k,x_0^i))\) .
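The three steps above (sampling, trajectory generation, truncation) can be sketched as follows. This is a toy illustration under several assumptions that are not taken from the paper: the system is a harmonic oscillator \( \dot x_1=x_2\), \( \dot x_2=-x_1\) with \( y=x_1\); \( A\) is diagonal and \( B\) is a column of ones; sampling is plain uniform; integration uses a hand-written fourth-order Runge-Kutta step; and the truncation index is taken as the first step at which \( e^{a_{\text{slow}} t}\) times a bound on \( \|z_0\|_\infty\) falls below \( \varepsilon\), which is exact for diagonal \( A\). Because the toy system is linear, \( \mathcal T\) is available in closed form and is used only to check the generated data.

```python
import math
import random

A_DIAG = [-1.0, -2.0, -3.0, -4.0, -5.0]  # diagonal Hurwitz A, n_z = 5
B_COL = [1.0] * 5                        # B is a column of ones (n_y = 1)


def dynamics(s):
    """Stacked vector field: oscillator state x (2 entries) and z (5 entries)."""
    x1, x2, z = s[0], s[1], s[2:]
    y = x1  # measured output y = h(x) = x_1
    return [x2, -x1] + [a * zi + b * y for a, b, zi in zip(A_DIAG, B_COL, z)]


def rk4_step(s, dt):
    """One classical Runge-Kutta step for the stacked state."""
    def shifted(u, v, c):
        return [ui + c * vi for ui, vi in zip(u, v)]

    k1 = dynamics(s)
    k2 = dynamics(shifted(s, k1, dt / 2))
    k3 = dynamics(shifted(s, k2, dt / 2))
    k4 = dynamics(shifted(s, k3, dt))
    return [si + dt / 6 * (a + 2 * b + 2 * c + d)
            for si, a, b, c, d in zip(s, k1, k2, k3, k4)]


def truncation_index(eps, z0_bound, dt):
    """First step k with exp(a_slow * t_k) * z0_bound <= eps."""
    a_slow = max(A_DIAG)  # slowest-decaying observer mode
    t_star = max(0.0, math.log(eps / z0_bound) / a_slow)
    return math.ceil(t_star / dt)


def generate_dataset(p=8, t_end=10.0, dt=0.01, eps=1e-3, seed=0):
    rng = random.Random(seed)
    n_steps = int(round(t_end / dt))
    k_star = truncation_index(eps, z0_bound=1.0, dt=dt)
    X, Z = [], []
    for _ in range(p):
        # Uniform sampling of (x_0, z_0) in [-1, 1]^2 x [-1, 1]^5.
        s = [rng.uniform(-1.0, 1.0) for _ in range(2 + len(A_DIAG))]
        for k in range(n_steps + 1):
            if k >= k_star:  # keep only post-transient samples
                X.append(s[:2])
                Z.append(s[2:])
            s = rk4_step(s, dt)
    return X, Z, k_star


def T_exact(x):
    """Closed-form T for this linear toy case (used only to check the data)."""
    return [(-a * b * x[0] - b * x[1]) / (1.0 + a * a)
            for a, b in zip(A_DIAG, B_COL)]


X, Z, k_star = generate_dataset()
worst_gap = max(abs(zi - ti)
                for x, z in zip(X, Z)
                for zi, ti in zip(z, T_exact(x)))
```

After truncation, `worst_gap` is on the order of \( \varepsilon\): the retained \( z\)-samples are close to \( \mathcal T(x)\) even though every \( z_0\) was drawn without knowledge of \( \mathcal T\). Latin hypercube sampling could replace the uniform draw without changing the rest of the procedure.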
The optimization problems in (11) minimize the loss functions that account for the deviation of the neural networks \( \hat{\mathcal T}_\theta\) and \( \hat{\mathcal T}_\eta^*\) with respect to the data trajectories. To this end, we can exploit the sampled data
for learning \( \hat{\mathcal T}_\theta\) and \( \hat{\mathcal T}_\eta^*\) . We first learn \( \hat{\mathcal T}_\theta\) and then use it to learn \( \hat{\mathcal T}_\eta^*\) by approximating the learning problems (11.a) and (11.b), respectively.
The empirical risk associated with (11.a) is given by
(14)
where \( \upsilon>0\) is a hyperparameter, \( X\) and \( Z\) are datasets given in (13), and \( \mathcal P_\theta\) is the PDE residual defined in (12). The estimate of \( \theta^\star\) is given by minimizing the empirical risk \( \mathcal L_\theta\) in (14), i.e.,
(15)
The empirical risk associated with (11.b) is given by
(16)
where the input to the neural network \( \hat{\mathcal T}_\eta^*\) is generated from the neural network \( \hat{\mathcal T}_{\hat\theta^\star}\) already learned in (15). The estimate of \( \eta^\star\) is given by minimizing the empirical risk \( \mathcal L_\eta\) in (16), i.e.,
(17)
Remark 2
Without the explicit supervision to connect the system’s state space \( \mathcal X\) to the observer’s state space \( \mathcal Z\) through PDE residual (12) in empirical risk (14), the neural network \( \hat{\mathcal T}_\eta^*\) would only minimize the reconstruction loss \( \|x-\hat{\mathcal T}_\eta^*(z)\|\) using the limited number of training points. This makes the model overfit on the training data and hinders the generalization to the unseen data, as is observed in [28]. Moreover, in the extreme case, without adding the PDE, if the neural network \( \hat{\mathcal T}_\eta^*\) is rich enough, one could essentially recover the \( x\) observation from arbitrary noise \( z\) . Thus, the learned \( \hat{\mathcal T}_\eta^*\) may not necessarily map \( \mathcal T_\theta(x')\) close to the point \( x'\) , where \( x'\) pertains to the test data. \( \diamond\)
Remark 3
In our previous work [29], we proposed to simultaneously learn both neural networks \( \hat{\mathcal T}_\theta\) and \( \hat{\mathcal T}_\eta^*\) . However, it may result in suboptimal parameters \( \eta\) because in that case, the empirical risk \( \mathcal L_\eta\) would also depend on \( \theta\) , which is being simultaneously updated during learning. Therefore, as also reported in [31], the method presented in [29] results in larger approximation (as well as estimation) errors. In the current paper, we exploit the strict hierarchy in the learning problems (11.a) and (11.b). That is, we observe that learning \( \hat{\mathcal T}_\theta\) is completely decoupled from \( \hat{\mathcal T}_\eta^*\) . Thus, we propose to first learn \( \hat{\mathcal T}_\theta\) and then use the learned \( \hat{\mathcal T}_{\hat\theta^\star}\) to generate \( z\) -data by passing every \( x\) -data point through the learned neural network. Once the \( \hat{\mathcal T}_{\hat\theta^\star}\) is fixed, we can optimally learn the neural network \( \hat{\mathcal T}_\eta^*\) pertaining to the inverse map \( \mathcal T^*\) . \( \diamond\)
Fig. 1 illustrates the workflow of our proposed learning algorithm. Our procedure deconstructs the autoencoder framework used in previous works such as [27, 29] into learning the transformations \( \hat{\mathcal T}_\theta\) and \( \hat{\mathcal T}_\eta^*\) disjointly. This deconstruction serves to address two fundamental issues with the previous design choices:
In our decoupled learning procedure, we first learn \( \hat\theta^\star\) in (15) for the forward map \( \hat{\mathcal{T}}_\theta:\mathcal{X} \rightarrow \mathcal{Z}\) by fitting the dataset \( (X,Z)\) using (14) as a loss function. The learned forward map is then used to generate a new dataset \( (Z',X')\) , with sampled initial conditions
The change in order between \( Z'\) and \( X'\) in the dataset notation means that \( Z'\) serves in this instance as the features and \( X'\) as the labels. Our access to the forward map \( \hat{\mathcal{T}}_{\hat\theta^\star}:\mathcal{X}\to\mathcal Z\) makes us independent of the truncation process: we can directly map the sampled initial conditions in \( \mathcal{X}\) to their values in \( \mathcal{Z}\), which lets us choose the distribution of the initial conditions while avoiding system blow-ups in backward time. The parameters \( \hat\eta^\star\) in (17) for the inverse map \( \hat{\mathcal{T}}_\eta^*\) are then learned by minimizing (16) on the new dataset \( (Z',X')\). Decoupling the training of \( \hat{\mathcal{T}}_\theta\) and \( \hat{\mathcal{T}}_\eta^*\) therefore avoids the risk of conflicting gradients between (14) and (16), while ensuring that \( \hat{\mathcal{T}}_\eta^*\) is indeed trained using data from the correct distribution.
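The two-stage structure described above can be illustrated with a deliberately tiny stand-in for the neural networks. In the sketch below, one-parameter linear models \( \hat{\mathcal T}_\theta(x)=\theta x\) and \( \hat{\mathcal T}^*_\eta(z)=\eta z\) replace the networks, plain gradient descent replaces a deep-learning optimizer, and the empirical risks are assumed to be a mean-squared data misfit plus \( \upsilon\) times the mean-squared PDE residual for the forward map, and a mean-squared reconstruction error for the inverse. The toy system \( \dot x=-x\), \( y=x\) with \( A=-2\), \( B=1\) (so that \( \mathcal T(x)=x\)), the synthetic data, and all hyperparameter values are hypothetical.

```python
import random


def f(x):
    return -x


def h(x):
    return x


A, B = -2.0, 1.0  # scalar observer parameters for the toy case


def fit_forward(X, Z, upsilon=1.0, lr=0.5, steps=500):
    """Stage 1: minimise mean (z - theta x)^2 + upsilon * mean P_theta(x)^2."""
    theta, n = 0.0, len(X)
    for _ in range(steps):
        grad = 0.0
        for x, z in zip(X, Z):
            data_err = z - theta * x
            # PDE residual for T_theta(x) = theta * x (so dT/dx = theta).
            pde_res = theta * f(x) - A * theta * x - B * h(x)
            grad += -2.0 * x * data_err
            grad += upsilon * 2.0 * pde_res * (f(x) - A * x)
        theta -= lr * grad / n
    return theta


def fit_inverse(theta_hat, X_new, lr=0.5, steps=500):
    """Stage 2: theta_hat is frozen; z-features come from the learned forward map."""
    Z_new = [theta_hat * x for x in X_new]
    eta, n = 0.0, len(X_new)
    for _ in range(steps):
        grad = sum(-2.0 * z * (x - eta * z) for x, z in zip(X_new, Z_new))
        eta -= lr * grad / n
    return eta


rng = random.Random(0)
# Synthetic training pairs: z is T(x) = x plus a small leftover transient.
X = [rng.uniform(-1.0, 1.0) for _ in range(200)]
Z = [x + 0.01 * rng.uniform(-1.0, 1.0) for x in X]
theta_hat = fit_forward(X, Z)

# Fresh points in the state space, mapped through the frozen forward model.
X_new = [rng.uniform(-1.0, 1.0) for _ in range(200)]
eta_hat = fit_inverse(theta_hat, X_new)
```

Because the second stage sees only the frozen \( \hat\theta\), its gradient never competes with the forward-map loss, and the learned inverse satisfies \( \hat\eta\,\hat\theta\approx 1\) on the new samples.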
The neural networks \( \hat{\mathcal T}_\theta\) and \( \hat{\mathcal T}_\eta^*\) are mere approximations of \( \mathcal T\) and \( \mathcal T^*\) , respectively. The approximation errors will influence the estimation performance of the observer if we use the neural networks instead of the original transformation maps. However, only the approximation error associated with the inverse \( \hat{\mathcal T}_\eta^*\) explicitly affects the state estimation error, and the approximation error associated with \( \hat{\mathcal T}_\theta\) only implicitly affects the state estimation error as is apparent in the proof of Proposition 1 below.
For any \( z\in\mathcal Z\) , the true inverse \( \mathcal T^*(z)\) can be written as
\[ \mathcal T^*(z) = \hat{\mathcal T}^*_\eta(z) + \mathcal{E}^*(z) \tag{18} \]
where \( \mathcal{E}^*(z)\) is the approximation error of the learned inverse map \( \hat{\mathcal T}_\eta^*\) at \( z\in\mathcal Z\) . Using (18), the KKL observer (7) can be written as
(19)
where the approximation error \( \mathcal{E}^*(\hat z(t))\) is an unknown signal corrupting the state estimate \( \hat x(t)\) . Moreover, the model (1) of the system is never perfect in real-world applications, and there are underlying uncertainties that could also influence the state estimation error. In this section, we provide robustness guarantees in terms of explicit input-to-state stability bounds for the estimation error under both the approximation error and the system uncertainties.
In practice, there are uncertainties in system dynamics, i.e.,
\[ \dot x(t) = f(x(t)) + w(t), \quad x(0) = x_0 \tag{20.a} \]
\[ y(t) = h(x(t)) + v(t) \tag{20.b} \]
where \( w(t)\in\mathbb R^{n_x}\) and \( v(t)\in\mathbb R^{n_y}\) denote the model uncertainty and the measurement noise, respectively. In the presence of uncertainties, the objective of the observer changes from steering the estimation error to zero as in (4) to steering it “close” to zero. That is, the observer must ensure that the estimation error (3) satisfies an input-to-state stability bound
(21)
where \( \xi(t)\) is the estimation error defined in (3), \( \xi_0=\xi(0)\) is the initial estimation error, \( \beta\) is a class-\( \mathcal{KL}\) function that asymptotically converges to zero as \( t\to\infty\) , and \( \gamma_1,\gamma_2\) are class-\( \mathcal K_\infty\) functions. From (21), it is evident that in the absence of uncertainty \( w(t)\) and noise \( v(t)\) , the estimate \( \hat x(t)\) asymptotically converges to the true state \( x(t)\) , thus satisfying (4). Moreover, (21) imposes a robustness criterion on the observer’s state estimate under uncertainties. In other words, as the magnitudes of the uncertainties increase, (21) ensures that the state estimate degrades gracefully, so the estimation error remains bounded as long as the uncertainties are bounded.
Let \( x(t;x_0,w)\) denote the state of (20.a) at time \( t\in\mathbb R_{\geq 0}\) initialized at \( x(0)=x_0\) and driven by \( w_{[0,t]}\) .
Assumption 3
The model uncertainty \( w(t)\) and the measurement noise \( v(t)\) satisfy the following:
Bounded effect of \( w\) : There exists a class \( \mathcal K_\infty\) function \( \psi\) such that, for every \( t\in\mathbb{R}_{\geq 0}\) ,
Assumption 3(i) is a standard assumption in robust state estimation [33]. Assumption 3(ii) requires that the deterministic system (1) does a good job of describing the uncertain system (20). Providing such a guarantee is widely studied in system identification, and the reader is referred to [34, 35, 36, 37] for more details.
Define the error in the \( z\) -coordinate as \( \tilde z(t)\mathrm{:=} z(t)-\hat z(t)\) . Then, from (5) and (7), we have
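\[ \dot{\tilde z}(t) = A\tilde z(t) + B\sigma(t) \tag{22} \]
where
\[ \sigma(t) \mathrm{:=} h(\bar{x}(t)) - h(x(t)) - v(t) \tag{23} \]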
with \( x(t)\mathrm{:=} x(t;x_0,w)\) and \( \bar{x}(t)\mathrm{:=} x(t;x_0,0)\) .
Note that \( \sigma(t)\in\mathbb{R}^{n_y}\) in (23) remains bounded for bounded uncertainty \( w(t)\) and noise \( v(t)\) due to Assumption 3. That is, for every \( x(t),\bar{x}(t)\in\mathcal X\) and \( t\in\mathbb{R}_{\geq 0}\) ,
(24)
(25)
where \( \overline w,\overline v\) are given in Assumption 3(i) and \( \ell_h<\infty\) with
(26)
because \( h(\cdot)\) is smooth and \( \mathcal X\) is a compact set. In the last step of (24), we used
The lemmas below, though familiar in various forms, are presented here with precise quantification of parameters, which are essential for deriving input-to-state stability bounds of the estimation error in the following sections. Moreover, these bounds are also used in Section 8 to explicitly compute the thresholds for detecting and isolating sensor faults.
Recall \( z=\mathcal T(x)\) , where
with \( \bar{x}(t)\mathrm{:=} x(t;x_0,0)\) the noise-free state trajectory. Also, recall the dynamics of \( \hat z(t)=\hat{\mathcal T}_\theta(\hat x(t))\) given in (19).
Lemma 1
Suppose the matrix \( A\) is Hurwitz and diagonalizable with eigenvalue decomposition \( A=V\Lambda V^{-1}\) . Then,
and
where \( \text{cond}(V)=\|V\|\|V^{-1}\|\) is the condition number of \( V\) .
The proof of this lemma is straightforward. A similar result can be obtained when \( A\) is not diagonalizable by using the Jordan form of \( A\) ; see [38, Appendix C.5]. However, since diagonalizable matrices are dense in the space of square matrices and also since \( A\) is a design matrix, the diagonalizability assumption is mild.
Lemma 2
Let Assumption 3 hold. Then, the error \( \tilde z(t)\mathrm{:=} z(t)-\hat z(t)\) in the \( z\) -coordinate satisfies
(27)
where \( \overline w, \overline v, \psi\) are given in Assumption 3, and \( \ell_h\) is the Lipschitz constant of \( h(x)\) for \( x\in\mathcal X\) given in (26).
Proof 1
Define \( \sigma(t)\mathrm{:=} h(\bar{x}(t))-y(t) = h(\bar{x}(t)) - h(x(t))-v(t)\) , where \( x(t)\mathrm{:=} x(t;x_0,w)\) is the solution of (20). Then, from the dynamics \( \dot{\tilde z}(t)=A\tilde z(t) + B\sigma(t)\) , we have that the solution \( \tilde z(t;\tilde z_0,\sigma)\) satisfies
(28)
The second term on the right-hand side of (28) satisfies
(29)
From (24), it follows that
The result follows by substituting the above inequality in (29), then in (28), and using Lemma 1.
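The mechanism behind this bound can be checked numerically in the scalar case, where \( \text{cond}(V)=1\) . The sketch below is an illustration only, not the statement of Lemma 2: it integrates \( \dot{\tilde z}=a\tilde z+b\sigma(t)\) with \( a<0\) and a bounded input \( |\sigma(t)|\leq\bar\sigma\) , and compares the result with the elementary variation-of-constants bound \( e^{at}|\tilde z_0| + |b|\bar\sigma(1-e^{at})/|a|\) . The values of `a`, `b`, `z0`, `sigma_bar`, and the sinusoidal disturbance are arbitrary assumptions chosen for the example.

```python
import math


def simulate_scalar_error(a, b, z0, sigma, t_end, dt=1e-3):
    """Integrate dz/dt = a*z + b*sigma(t) with the classical RK4 scheme."""
    def f(t, z):
        return a * z + b * sigma(t)

    t, z = 0.0, z0
    traj = [(t, z)]
    for _ in range(int(round(t_end / dt))):
        k1 = f(t, z)
        k2 = f(t + dt / 2, z + dt / 2 * k1)
        k3 = f(t + dt / 2, z + dt / 2 * k2)
        k4 = f(t + dt, z + dt * k3)
        z += dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
        traj.append((t, z))
    return traj


# Assumed example values: stable pole a, input gain b, initial error z0,
# and a disturbance bounded in magnitude by sigma_bar.
a, b, z0, sigma_bar = -2.0, 1.0, 3.0, 0.5


def sigma(t):
    return sigma_bar * math.sin(5.0 * t)


def iss_bound(t):
    """Transient term plus the worst-case contribution of the bounded input."""
    return math.exp(a * t) * abs(z0) + abs(b) * sigma_bar / abs(a) * (1.0 - math.exp(a * t))


traj = simulate_scalar_error(a, b, z0, sigma, t_end=5.0)
violations = [(t, z) for t, z in traj if abs(z) > iss_bound(t) + 1e-9]
```

The list `violations` collects every sample at which the simulated error exceeds the bound. In the matrix case, the same argument carries the extra factor \( \text{cond}(V)\) from Lemma 1.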
Given that the activation functions used in the neural network \( \hat{\mathcal T}_\eta^*\) are Lipschitz continuous, e.g., Rectified Linear Unit (ReLU) networks, we show that the KKL observer (7) for the noise-free system (1) is robust to the neural network approximation error. We base our result on the following lemma.
Lemma 3
Consider a neural network \( \hat{\mathcal T}_\eta^*\) with \( l\) layers, where the weight matrix and the activation function at each layer \( i\) are denoted by \( W_\eta^{(i)}\) and \( \sigma^{(i)}:\mathbb{R}\to\mathbb{R}\) , respectively. If \( \sigma^{(i)}\) is Lipschitz continuous with constant \( \ell_\eta^i\) for every \( i\in\{1,\dots,l\}\) , then \( \hat{\mathcal T}_\eta^*\) is Lipschitz continuous with constant \( \ell_\eta = \ell_\eta^1 \dots \ell_\eta^l \|W_\eta^{(1)}\| \dots \|W_\eta^{(l)}\|\) , i.e., \( \forall\) \( \hat z,z\in\mathbb{R}^{n_z}\) ,
\[ \|\hat{\mathcal T}_\eta^*(\hat z) - \hat{\mathcal T}_\eta^*(z)\| \leq \ell_\eta \|\hat z - z\| \tag{30} \]
Proof 2
Let \( \Sigma^{(i)}(b^{(i)}) = [\begin{array}{ccc} \sigma^{(i)}(b_1^{(i)}) & \dots & \sigma^{(i)}(b_{m^{(i)}}^{(i)}) \end{array}]^\top\) be the output at layer \( i\) , where \( b^{(i)}=W_\eta^{(i)}a^{(i-1)}\in\mathbb{R}^{m^{(i)}}\) with \( m^{(i)}\) the number of neurons at layer \( i\) , and \( a^{(i-1)}\) the input \( z\in\mathbb{R}^{n_z}\) if \( i=1\) or the activated output of layer \( i-1\) if \( i\geq 2\) . Then, for every \( \hat{a}^{(i-1)},a^{(i-1)}\) ,
The result follows by applying the above inequality at each layer.
Remark 4
Consider the ReLU activation function given by \( \Sigma(b) = [\begin{array}{ccc} \max(0,b_1) & \dots & \max(0,b_m) \end{array}]^\top\) , which is Lipschitz continuous with a Lipschitz constant equal to \( 1\) as \( \|\Sigma(\hat{b}) - \Sigma(b)\| \leq \|\hat{b}-b\|, \) where \( b\in\mathbb{R}^m\) . Thus, for ReLU networks, the Lipschitz constant \( \ell_\eta\) is equal to the product of the maximum singular values of the weight matrices. Moreover, the weights \( W_\eta^{(i)}\) for layer \( i\in[l]\) can be regularized while training \( \hat{\mathcal T}_\eta^*\) to restrict their singular values. Nevertheless, the Lipschitz constant given by (30) could be conservative and may result in very large values if the number of layers is large. Although it is an NP-hard problem [39], estimating a tighter bound of the Lipschitz constant of neural networks is a topic of interest in the machine learning community [40, 39, 41, 42]. If one is interested in the local Lipschitz continuity of a feedforward ReLU network, then [43] shows that the local Lipschitz constant estimation problem can be reduced to a semi-definite programming problem. \( \diamond\)
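The product-of-norms bound for ReLU networks can be illustrated with a small, dependency-free sketch. Everything below is an assumption made for illustration: the layer sizes, the random weights, the absence of bias terms, and the use of a linear (identity) output layer, which is also 1-Lipschitz. The largest singular value of each weight matrix is approximated by power iteration on \( W^\top W\) , which approaches the true value from below, so the comparison at the end uses a small tolerance.

```python
import math
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def spectral_norm(W, iters=500):
    """Approximate the largest singular value by power iteration on W^T W."""
    Wt = transpose(W)
    v = [1.0] * len(W[0])
    for _ in range(iters):
        u = matvec(Wt, matvec(W, v))
        n = norm(u)
        if n == 0.0:
            return 0.0
        v = [ui / n for ui in u]
    return norm(matvec(W, v))


def relu_net(weights, z):
    """Bias-free feedforward net: ReLU on hidden layers, identity on the output."""
    a = z
    for i, W in enumerate(weights):
        a = matvec(W, a)
        if i < len(weights) - 1:
            a = [max(0.0, v) for v in a]
    return a


rng = random.Random(0)
sizes = [4, 8, 8, 2]  # assumed toy architecture
weights = [
    [[rng.gauss(0.0, 0.5) for _ in range(sizes[i])] for _ in range(sizes[i + 1])]
    for i in range(len(sizes) - 1)
]

# Product of the (approximate) maximum singular values of the weight matrices.
lipschitz_bound = math.prod(spectral_norm(W) for W in weights)

# Empirical check on random input pairs: the observed ratio should not exceed the bound.
worst_ratio = 0.0
for _ in range(200):
    z1 = [rng.uniform(-1.0, 1.0) for _ in range(sizes[0])]
    z2 = [rng.uniform(-1.0, 1.0) for _ in range(sizes[0])]
    out_diff = norm([p - q for p, q in zip(relu_net(weights, z1), relu_net(weights, z2))])
    in_diff = norm([p - q for p, q in zip(z1, z2)])
    worst_ratio = max(worst_ratio, out_diff / in_diff)
```

On random inputs the observed ratio is typically well below the bound, which is consistent with the remark that the product of norms can be conservative.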
Assumption 4
The neural networks \( \hat{\mathcal T}_\theta\) and \( \hat{\mathcal T}_\eta^*\) are Lipschitz continuous over \( \mathcal X\) and \( \mathcal Z\) , respectively, where \( \mathcal Z\supseteq \mathcal T(\mathcal X)\) . In particular, for every \( x,\hat x\in\mathcal X\) , there exists \( \ell_\theta\in\mathbb{R}_{>0}\) such that
and, for every \( z,\hat z\in\mathcal Z\) , there exists \( \ell_\eta\in\mathbb{R}_{>0}\) such that
There exists a class \( \mathcal K_\infty\) function \( \alpha\) such that, for every \( z\in\mathcal Z\) , the neural network approximation error \( \mathcal{E}^*(z)\) appearing in (18) satisfies
This is because \( \|\mathcal{E}^*(z)\|\leq \|\mathcal T^*(z)\| + \|\hat{\mathcal T}_\eta^*(z)\|\) . Then, the above inequality follows from (8) and (30) with \( \alpha(\|z\|)=\rho(\|z\|)+\ell_\eta \|z\|\) .
Because the state space \( \mathcal X\subset\mathbb{R}^{n_x}\) is bounded, \( h(\cdot)\) is a smooth map, and \( A\) is Hurwitz, there exists a compact set \( \mathcal Z\subset\mathbb{R}^{n_z}\) containing the trajectory \( z(t;\mathcal T(x_0))\) of (5) for every \( t\in\mathbb{R}_{\geq 0}\) and every \( x_0\in\mathcal X\) . Thus, as a consequence of (8) and (30), there exists a finite approximation bound \( \epsilon^*>0\) satisfying
(31)
There have been several attempts [44, 45, 46, 47] to estimate \( \epsilon^*\) and show that it can be reduced by improving the design and learning technique of the neural network, and also by increasing the size (number of initialization points \( p\) and sampling frequency \( \tau/T\) ) of the dataset \( (X,Z)\) (see [48]).
Proposition 1
Let Assumptions 1 and 2 hold. Suppose \( A\) is diagonalizable with eigendecomposition \( A=V\Lambda V^{-1}\) . Let \( x(t)\) be the state of the noise-free system (1) and \( \hat x(t)\) be the estimate provided by (19). Then, for every \( t\in\mathbb{R}_{\geq 0}\) , the estimation error \( \xi(t)=x(t)-\hat x(t)\) satisfies
(32)
where \( \epsilon^* \) is given in (31) and \( a=\ell_\eta \|\tilde z_0\| \text{cond}(V)\) with \( \tilde z_0=z(0)-\hat z(0)\) the initialization error in the \( z\) -coordinate.
Proof 3
We have
(33)
where the first step is due to (18), the second step is due to the triangle inequality, and the last step is due to (30) and (31). Finally, using the inequality (27) with \( \overline w=\overline v=0\) in (33), we obtain (32).
Inequality (32) says that the estimation error exponentially converges to a ball whose radius is dictated by the supremum of the approximation error of the inverse map. This indicates that one can achieve better estimation performance by improving the learning of \( \hat{\mathcal T}_\eta^*\) . A general rule of thumb for improving the learning of \( \hat{\mathcal T}_\eta^*\) is to choose a deeper neural network and generate more training data through simulations.
Remark 5
Notice that uniformly injective \( \mathcal T\) implies the uniform injectivity of its inverse \( \mathcal T^*\) . It is important to note that uniform injectivity and Assumption 4 imply boundedness of approximation errors, i.e.,
and
where \( \mathbb{R}^{n_z}\supset\mathcal Z\supseteq \mathcal T(\mathcal X)\) . This holds because \( \mathcal X\subset\mathbb{R}^{n_x}\) is a compact set and
where \( \varrho\) is given in (9) and \( \ell_\theta\) is given in Assumption 4. Further improvement on this bound can be achieved practically by using the universal approximation property of neural networks, given that an ample amount of data is generated and an appropriate network architecture is chosen. \( \diamond\)
We now analyze the robustness of the learned observer (19) to estimate the state of an uncertain nonlinear system (20). Note that the design method of KKL observers as presented in Sections 3 and 4 remains the same for (20).
Proposition 2
Let Assumptions 1, 2, 3, and 4 hold. Suppose \( A\) is diagonalizable with eigendecomposition \( A=V\Lambda V^{-1}\) . Let \( x(t)\) be the state of the uncertain system (20) and \( \hat x(t)\) be the estimate provided by (19). Then, for every \( t\in\mathbb{R}_{\geq 0}\) , the estimation error \( \xi(t)=x(t)-\hat x(t)\) satisfies
(34)
where \( \overline w,\overline v,\psi\) are given in Assumption 3, \( \epsilon^*\) is given in (31), and
(35.a)
Proof 4
By substituting (27) into (33), we obtain (34).
Given that the model uncertainties and sensor noise are bounded, the above result shows that the KKL observer is robust in terms of input-to-state stability of the estimation error; see [49]. Moreover, it can be observed that improving the learning of \( \hat{\mathcal T}_\eta^*\) while choosing \( |\lambda_{\min}(A)|\) to be large and \( \text{cond}(V)\) to be small reduces the estimation error.
Choosing matrices \( A\) and \( B\) such that KKL observer (19) is robust to uncertainties is crucial. However, increasing robustness may inadvertently degrade learning accuracy and introduce substantial approximation errors. From (34), we observe that the stability of \( A\) yields robustness, but it degrades controllability of the pair \( (A,B)\) , which is also a crucial requirement for KKL observers. To elucidate, more stable \( A\) means more energy is required to steer the state \( z(t)\) of (5) through the injected output \( y(t)\) . In this section, we present a design method for \( A\) and \( B\) that finds a trade-off between robustness and learnability of a KKL observer subject to the constraints that \( A\) is Hurwitz and \( (A,B)\) is controllable. To this end, we adopt techniques from \( \mathcal{H}_\infty\) -observer design and minimum energy control [50].
Recall the error dynamics (22) in the \( z\) -coordinates. To minimize the effect of \( \sigma\) on the error \( \tilde z\) , we use the \( \mathcal{H}_\infty\) -based design of \( A\) and \( B\) under the constraint that \( (A,B)\) is controllable. Consider the transfer function \( G(s) = (sI-A)^{-1}B\) of (22). Then, one has
if and only if [50, Theorem 5.3] the bilinear matrix inequality (BMI)
(36)
is feasible for matrices \( A\in\mathbb{R}^{n_z\times n_z}\) and \( B\in\mathbb{R}^{n_z\times n_y}\) , and a positive definite matrix \( P=P^\top\in\mathbb R^{n_z\times n_z}\) . Because \( PA+A^\top P\) must be negative definite for (36) to hold, the resulting matrix \( A\) will be Hurwitz by the Lyapunov theorem [51, Theorem 2.2.1].
Notice that merely solving (36) will yield a Hurwitz \( A\) typically with very large eigenvalues, which may destroy the “practical” controllability from an energy point-of-view. That is, when the magnitude of the eigenvalues of \( A\) is large, the required energy of the signal \( y(t)\) to steer the state \( z(t)\) from \( z(0)=0\) to some \( z(\tau)\in\mathbb{R}^{n_z}\) , \( \tau\in\mathbb R_{>0}\) , would be impractically large. This will decrease the effect of the measured output \( y(t)\) on the state estimation algorithm (7), leading to slow convergence to the true state. Thus, it is recommended to maximize the minimum eigenvalue \( \lambda_{\min}(W)\) of the controllability gramian \( W\) of (22), which is a worst-case controllability metric [52]. The controllability gramian \( W=W^\top>0\) is the unique solution of the Lyapunov equation
We approximately rewrite this Lyapunov equation as a BMI
(37)
for a small \( \varepsilon > 0\) . The minimum eigenvalue of \( W\) is lower bounded by \( \lambda>0\) , i.e., \( \lambda\leq \lambda_{\min}(W)\) , if and only if [50, Lemma 1.1]
(38)
The BMI feasibility problem for optimal \( A\) and \( B\) is then formulated as follows:
(39)
with respect to \( \gamma>0\) , \( \lambda>0\) , \( A\in\mathbb R^{n_z\times n_z}\) , \( B\in\mathbb{R}^{n_z\times n_y}\) , \( P=P^\top>0\) , \( W=W^\top>0\) , where
Here, \( \gamma\) is the \( \mathcal H_\infty\) gain or a level of noise attenuation, \( \lambda\) is the worst-case controllability metric, and \( c \in [0, 1]\) is a trade-off parameter. The choice of \( c\) close to one indicates better robustness but slower observer transient while \( c\) close to zero indicates better controllability but larger sensitivity to noise.
The problem (39) is not convex, quasi-convex, or even local-global [53]. However, there are several tools (e.g., PENLAB [54]) that can be used to obtain a locally optimal solution. Another strategy to solve (39) locally is to introduce certain relaxations such as iterative convex overbounding [55] and Young’s relation [56, 57], which convert the BMIs (36) and (37) into linear matrix inequalities (LMIs). Nonetheless, the simplest way to convert (39) into an LMI is to choose an appropriate positive definite matrix \( P\) and controllability gramian \( W\) , then solve (39) to obtain \( A\) and \( B\) . The matrices \( A\) and \( B\) obtained from (39) are guaranteed to satisfy the condition of Theorem 1 while also having good noise attenuation properties.
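The tension between noise attenuation and controllability is already visible in the scalar case \( A=-a\) , \( B=1\) with \( a>0\) . The sketch below is a hand-worked illustration and does not solve (39): it assumes the standard form \( AW+WA^\top+BB^\top=0\) of the controllability Lyapunov equation, which in the scalar case gives \( W=1/(2a)\) , and it evaluates the peak gain of \( G(s)=1/(s+a)\) on a frequency grid, where the peak \( 1/a\) is attained at zero frequency. The function names and the grid are assumptions made for the example.

```python
import math


def hinf_gain_scalar(a, b=1.0, omegas=None):
    """Peak of |b / (j*omega + a)| over a frequency grid (the peak sits at omega = 0)."""
    if omegas is None:
        omegas = [0.0] + [10 ** (k / 10.0) for k in range(-30, 31)]
    return max(abs(b) / math.hypot(w, a) for w in omegas)


def gramian_scalar(a, b=1.0):
    """Solution W of the scalar Lyapunov equation -2*a*W + b**2 = 0."""
    return b * b / (2.0 * a)


# Sweep the decay rate a: a faster pole attenuates the disturbance more,
# but it also shrinks the controllability gramian.
rows = [(a, hinf_gain_scalar(a), gramian_scalar(a)) for a in (1.0, 5.0, 20.0)]
for a, gamma, w in rows:
    print(f"a={a:5.1f}  H-infinity gain={gamma:.4f}  gramian={w:.4f}")
```

Both quantities decrease as \( a\) grows, which is why a single objective that only minimizes the gain tends to push the eigenvalues of \( A\) far into the left half-plane.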
In this section, we demonstrate the effectiveness of our proposed methodology for the learning-based design of KKL observers using the following examples:
Reverse Duffing oscillator: The state \( x\in\mathbb R^2\) and the dynamics is given by
(40)
Van der Pol oscillator: The state \( x\in\mathbb R^2\) , a parameter \( \mu \in \mathbb{R}\) , and the dynamics is given by
(41)
Rössler attractor: The state \( x\in\mathbb R^3\) , parameters \( a,b,c\in\mathbb R\) , and the dynamics is given by
(42)
Lorenz attractor: The state \( x\in\mathbb R^3\) , parameters \( p,q,r\in\mathbb R\) , and the dynamics is given by
(43)
Using the benchmark examples listed above, we show the effectiveness of our method in estimating the states of nonlinear, chaotic systems in the presence of model uncertainties and measurement noise.
We generate synthetic datasets and train models for each of the systems (40)–(43) according to our proposed methodology in Section 4. For all systems, we construct the dataset \( (X,Z)\) by sampling initial conditions over \( [-1,1]^{n_x}\) using Latin hypercube sampling, with \( n_x\) denoting the system dimension, which is 2 and 3 for (40)–(41) and (42)–(43), respectively. We sampled 100 initial conditions for (40), (41), and (43), while we used 200 samples for (42). From each sampled initial condition, we generate system trajectories using Runge-Kutta 4 over a time interval of \( [0, 50]\) .
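The state part of this data-generation step can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the text: the van der Pol right-hand side is written in its textbook form with \( \mu=1\) (the parametrization of (41) may differ), the step size `dt` is arbitrary, only 20 initial conditions are drawn to keep the example fast, and the corresponding observer trajectories in \( z\) are not simulated here.

```python
import random


def latin_hypercube(num_samples, dim, low=-1.0, high=1.0, seed=0):
    """One point per stratum in every coordinate, with strata shuffled independently."""
    rng = random.Random(seed)
    samples = [[0.0] * dim for _ in range(num_samples)]
    for d in range(dim):
        strata = list(range(num_samples))
        rng.shuffle(strata)
        for i, s in enumerate(strata):
            u = (s + rng.random()) / num_samples
            samples[i][d] = low + (high - low) * u
    return samples


def rk4_trajectory(f, x0, t_end, dt):
    """Classical fourth-order Runge-Kutta integration of dx/dt = f(x)."""
    x = list(x0)
    traj = [list(x)]
    for _ in range(int(round(t_end / dt))):
        k1 = f(x)
        k2 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)])
        k3 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)])
        k4 = f([xi + dt * ki for xi, ki in zip(x, k3)])
        x = [
            xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
        ]
        traj.append(list(x))
    return traj


def van_der_pol(x, mu=1.0):
    # Textbook form (assumption); the paper's equation (41) may be parametrized differently.
    return [x[1], mu * (1.0 - x[0] ** 2) * x[1] - x[0]]


initial_conditions = latin_hypercube(num_samples=20, dim=2)
trajectories = [
    rk4_trajectory(van_der_pol, x0, t_end=50.0, dt=0.05) for x0 in initial_conditions
]
```

Latin hypercube sampling guarantees that each coordinate axis is covered evenly, which plain uniform sampling does not for small sample counts.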
We model the forward map \( \hat{\mathcal{T}}_\theta\) and the inverse map \( \hat{\mathcal{T}}_\eta^*\) as multi-layer perceptrons, trained using the synthetic datasets generated for each system. The hyperparameter specifications for each model are presented in Table 1.
| System | Layers | Layer Size | Learning Rate | Epochs |
| --- | --- | --- | --- | --- |
| Duffing | 3 | 150 | \( 10^{-3}\) | 15 |
| VdP | 2 | 350 | \( 10^{-3}\) | 15 |
| Rössler | 3 | 250 | \( 10^{-3}\) | 15 |
| Lorenz | 2 | 350 | \( 10^{-3}\) | 15 |
Simulation results for each system are illustrated in Fig. 2. Each trajectory is generated from a random initial condition, different from the ones in the dataset \( X\) , and each component of the model uncertainty and measurement noise is sampled from \( \mathcal{N}(0, 0.1)\) for the reverse Duffing oscillator, van der Pol oscillator and Rössler attractor, whereas for the Lorenz attractor, it is sampled from \( \mathcal{N}(0,2)\) . The simulations demonstrate the capability of our method to estimate the unmeasured states with remarkable accuracy even in the presence of model uncertainties and measurement noise.
In this section, we demonstrate the effectiveness of the learned KKL observer in detecting and isolating sensor faults in nonlinear systems. We consider two types of sensor faults: failure, where a sensor stops transmitting measurement data, and degradation, where the measurement noise level of a sensor increases. For these faults, the output equation (20.b) can be rewritten as
(44)
where
models the failure of sensor \( i\) when \( \phi_i(t)=0\) , and
models the degradation of sensor \( i\) , affecting its measurement accuracy, when \( \zeta_i(t)\neq 0\) . Note that a system with fault-free sensors at time \( t\) has \( \phi(t) = I_{n_y}\) and \( \zeta(t) = 0_{n_y}\) . Any other value of these signals represents a degradation when \( \zeta_i(t) \neq 0\) , for \( t\geq t_i^\text{fault}\) , or a failure when \( \phi_i(t) = 0\) , for \( t\geq t_i^\text{fault}\) , where \( i\in [n_y]\) and \( t_i^\text{fault}\in\mathbb{R}_{>0}\) is the time at which a fault occurs in sensor \( i\) .
Given measured output (44) from fault-prone sensors, the output predicted by the learned KKL observer (19) is given by
(45)
Observer-based fault detection and isolation (FDI) of nonlinear systems relies on the design of observers that must exhibit accurate state estimation and output prediction performance. Using the learned KKL observer (19) for FDI, we address the following problems in this section:
To solve these problems, we define residuals to compare the measured output of the system with the predicted output of the learned KKL observer (19). The residual \( r_i\) corresponding to the \( i\) -th sensor is defined as
(46)
where \( y_i(t)\) and \( \hat y_i(t)\) are the \( i\) -th components of \( y(t)\) in (44) and \( \hat y(t)\) in (45), respectively.
Furthermore, consider an increasing sequence of discrete time samples \( t_0<t_1<t_2<\dots\) with \( \delta_k=t_k-t_{k-1}\) for \( k\in\mathbb{N}\) . Then, we define a differentiated residual \( \tilde r_i\) as the absolute value of the numerical differentiation of the residual \( r_i\) , i.e.,
(47)
Let \( \tau_i\in\mathbb{R}_{>0}\) denote the threshold for residual \( r_i\) such that, in the steady state, \( \limsup_{t\to\infty} r_i(t) \leq \tau_i\) when there are no faults. Similarly, let \( \tilde r_\Delta\in\mathbb{R}_{>0}\) denote the threshold for all differentiated residuals such that \( \limsup_{t\to\infty} \tilde r_i(t)\leq \tilde r_\Delta\) when sensor \( i\) is not faulty.
We derive upper bounds on the residuals \( r_i(t)\) .
Proposition 3
Suppose Assumptions 3 and 4 hold and the matrix \( A\) is diagonalizable with eigenvalue decomposition \( A=V\Lambda V^{-1}\) . Then, in a fault-less case,
(48)
where \( \overline w, \overline v, \psi\) are from Assumption 3, \( \ell_{h_i}\) is the Lipschitz constant of \( h_i(x)\) for \( x\in\mathcal X\) , \( \epsilon^*\) is defined in (31), and \( a\) and \( b(t)\) are given in (35).
Proof 5
In a fault-less case, \( \phi(t)=I_{n_y}\) and \( \zeta(t)=0_{n_y}\) . Thus, from (46),
Then, using (34), we obtain (48).
Remark 6
In the right-hand side of (48), the approximation error bound \( \epsilon^*\) is an unknown quantity. However, it can be numerically estimated as \( \hat\epsilon^*\) by generating test data and evaluating the maximum error \( \|x-\hat{\mathcal T}_{\hat\eta^\star}^*(\hat{\mathcal T}_{\hat\theta^\star}(x))\|\) where \( x\) are points in the test dataset. \( \diamond\)
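The numerical estimate described in this remark amounts to a maximum over reconstruction errors. The sketch below shows the computation with hypothetical stand-ins: `toy_forward` and `toy_inverse` are not the learned networks, and the inverse is made deliberately imperfect by a constant offset of 0.01 so that the estimate is nonzero.

```python
import math


def estimate_epsilon(forward_map, inverse_map, test_points):
    """Largest reconstruction error ||x - inverse_map(forward_map(x))|| over the test set."""
    worst = 0.0
    for x in test_points:
        x_rec = inverse_map(forward_map(x))
        err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_rec)))
        worst = max(worst, err)
    return worst


def toy_forward(x):
    # Hypothetical stand-in for the learned forward map.
    return [2.0 * x[0], x[0] + x[1]]


def toy_inverse(z):
    # Hypothetical stand-in for the learned inverse, with a deliberate 0.01 offset.
    x0 = 0.5 * z[0]
    return [x0 + 0.01, z[1] - x0]


test_points = [[-1.0, 0.5], [0.0, 0.0], [0.3, -0.7]]
eps_hat = estimate_epsilon(toy_forward, toy_inverse, test_points)
```

Such an estimate is only a lower bound on the true supremum, since it is taken over finitely many test points, so a dense test set is advisable.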
We use Proposition 3 for computing the threshold \( \tau_i(t)\) on the residuals since
where \( \hat\epsilon^*\) is an estimate of \( \epsilon^*\) . Therefore, on the test dataset, we can choose the threshold as
(49)
In a steady state when \( t\) becomes large, it holds that \( r_i(t) \leq \tau_i\) when there are no sensor faults. When the residual \( r_i(t)\) surpasses \( \tau_i\) , it signifies that the measured output is significantly different from the predicted output. Since \( \tau_i\) is the threshold in a fault-free case, sensor faults can be detected by measuring the residuals \( r_i(t)\) and raising an alert whenever \( r_i(t) > \tau_i\) for any \( i\) and any \( t\in\mathbb{R}_{\geq 0}\) .
A sensor fault is detected whenever the residual \( r_i(t)\) surpasses the threshold \( \tau_i\) . However, \( r_i(t)\) exceeding \( \tau_i\) and having the highest value among other residuals do not mean that sensor \( i\) is faulty. This is because a fault in sensor \( j\) distorts the output \( y_j(t)\) , which is filtered through the observer dynamics and then transformed by \( (h_j\circ \hat{\mathcal T}_\eta^*)(\cdot)\) . Since the neural network \( \hat{\mathcal T}_\eta^*\) is not necessarily a diagonal function, the fault in sensor \( j\) can induce a large residual \( r_i(t)\) , \( i\neq j\) , even when sensor \( i\) is not faulty. This phenomenon can also be observed in Fig. 7(c) and 8(c), which are discussed below. Due to interdependencies caused by the non-diagonal structure of \( \hat{\mathcal T}_\eta^*\) , the transients after the occurrence of a fault may persist above the threshold, which causes difficulty in precisely isolating sensor faults merely from the residuals \( r_i(t)\) .
Based on our empirical observations, differentiated residuals \( \tilde r_i(t_k)\) are better suited for fault isolation. A spike is generated in the differentiated residual \( \tilde r_i(t_k)\) whenever an abrupt fault (\( \phi_i(t)=0\) or \( \zeta_i(t)\neq 0\) ) is introduced in sensor \( i\) at time \( t_k\) . The explanation for this spike is straightforward because, in the differentiated residual, the fault in the output also gets numerically differentiated. Therefore, an abrupt fault results in a spike in the differentiated residual. However, when the fault signal \( \zeta_i(t)\) is smooth and of low magnitude, it is difficult to isolate it. Nevertheless, in many cases, faults are irregular, non-smooth signals.
We generate test datasets, where we denote state trajectories \( x^j(t_k)\mathrm{:=} x(t_k;x_0^j,0)\) that are initialized at \( q\) initial points \( x(0)=x_0^j\) , for \( j=1,\dots,q\) , and the corresponding observer estimates as \( \hat x^j(t_k)\mathrm{:=} \hat x(t_k;\hat{\mathcal T}_\eta^*(z_0^j))\) . We simulate the KKL observer and compute the residuals \( r_1^j(t_k),\dots,r_{n_y}^j(t_k)\) , for \( k=0,\dots,T\) , where \( T\in\mathbb{N}\) and
Here, \( t_T>0\) is chosen large enough to allow the observer to converge to a neighborhood of the original state. Then, the corresponding differentiated residuals \( \tilde r_1^j(t_k),\dots,\tilde r_{n_y}^j(t_k)\) , for \( k=1,\dots,T\) , are obtained using (47). Let \( t_c\) , for \( c<T\) , denote the minimum time at which the observer is in the steady state. Then an empirical threshold for differentiated residuals is computed on the test dataset as follows:
(50)
Given that the learned KKL observer estimates the state of the system accurately, the bound \( \tilde r_i(t_k)\leq \tilde r_\Delta\) holds in the fault-free case once \( t_k\) becomes sufficiently large. Therefore, whenever the differentiated residual \( \tilde r_i(t_k)\) surpasses the empirical threshold \( \tilde r_\Delta\) given by (50), an alert can be raised that sensor \( i\) is faulty. We validate this observation via simulations in the next section.
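The isolation logic can be sketched on synthetic data. The code below rests on assumptions that should be checked against (46), (47), and (50): the residual is taken as the absolute output prediction error, the differentiated residual as the absolute backward difference quotient, and the empirical threshold as the maximum differentiated residual over all test runs, sensors, and samples \( k\geq c\) . The signals are toy values chosen to be exactly representable in floating point, and the fault is a constant bias switched on at sample 6.

```python
def residuals(y, y_hat):
    """r(t_k) = |y(t_k) - y_hat(t_k)| for a single sensor (assumed form)."""
    return [abs(a - b) for a, b in zip(y, y_hat)]


def differentiated_residuals(r, t):
    """|r(t_k) - r(t_{k-1})| / (t_k - t_{k-1}) for k = 1, ..., T."""
    return [abs(r[k] - r[k - 1]) / (t[k] - t[k - 1]) for k in range(1, len(r))]


def empirical_threshold(runs, c):
    """Maximum over runs, sensors, and samples k >= c (list index k - 1)."""
    return max(v for run in runs for sensor in run for v in sensor[c - 1:])


# Toy fault-free test run with one sensor: a small steady-state prediction mismatch.
t = [0.5 * k for k in range(11)]
y_hat = [0.0] * 11
y_clean = [0.25 if k % 2 else 0.0 for k in range(11)]
r_clean = residuals(y_clean, y_hat)
threshold = empirical_threshold([[differentiated_residuals(r_clean, t)]], c=3)

# Same sensor with an abrupt bias of 4.0 introduced at sample k = 6.
y_fault = [v + (4.0 if k >= 6 else 0.0) for k, v in enumerate(y_clean)]
dr_fault = differentiated_residuals(residuals(y_fault, y_hat), t)
alarms = [k + 1 for k, v in enumerate(dr_fault) if v > threshold]
```

The list `alarms` holds the sample indices at which the differentiated residual exceeds the threshold. An abrupt fault produces an isolated spike at its onset, whereas the plain residual would stay elevated for the whole duration of the bias.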
We consider a network of Kuramoto oscillators and show that FDI of a highly nonlinear system can be achieved using the learned KKL observer (19), even in the presence of model uncertainties and measurement noise. Both sensor failure and degradation faults are demonstrated.
The network of Kuramoto oscillators describes synchronization phenomena in interconnected systems and has applications in electrical power networks, multi-agent coordination, and distributed software systems [58]. The dynamics of a network of Kuramoto oscillators with \( n\) nodes is given as
(51)
where \( \theta_{i}(t)\in\mathbb{S}^1\) is the phase angle of node \( i = 1,\dots,n\) , \( \omega_{i}\in\mathbb{R}\) is the natural frequency of node \( i\) , and \( a_{ij}\geq 0\) denotes the coupling between node \( i\) and \( j\) . In the literature, the state trajectories of (51) are often represented graphically as \( \sin(\theta_{i})\) to better illustrate their synchronization. We follow the same convention in our simulations.
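For readers who wish to reproduce the qualitative behavior, the following sketch simulates a small Kuramoto network. It assumes the standard form \( \dot\theta_i=\omega_i+\sum_j a_{ij}\sin(\theta_j-\theta_i)\) , which may differ in normalization from (51), and it uses three identical oscillators with all-to-all unit coupling rather than the 10-node network of the experiments. The step size and horizon are arbitrary choices.

```python
import math


def kuramoto_rhs(theta, omega, coupling):
    """Standard Kuramoto right-hand side (assumed form of the network dynamics)."""
    n = len(theta)
    return [
        omega[i] + sum(coupling[i][j] * math.sin(theta[j] - theta[i]) for j in range(n))
        for i in range(n)
    ]


def simulate_kuramoto(theta0, omega, coupling, t_end, dt):
    """Classical RK4 integration; returns the list of phase vectors."""
    def f(x):
        return kuramoto_rhs(x, omega, coupling)

    x = list(theta0)
    traj = [list(x)]
    for _ in range(int(round(t_end / dt))):
        k1 = f(x)
        k2 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)])
        k3 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)])
        k4 = f([xi + dt * ki for xi, ki in zip(x, k3)])
        x = [
            xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
        ]
        traj.append(list(x))
    return traj


def spread(theta):
    """Largest phase difference (phases are kept unwrapped in this sketch)."""
    return max(theta) - min(theta)


theta0 = [0.0, 0.5, 1.0]
omega = [1.0, 1.0, 1.0]  # identical natural frequencies (assumption)
coupling = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
thetas = simulate_kuramoto(theta0, omega, coupling, t_end=10.0, dt=0.01)

# Graphical convention: plot sin(theta_i) rather than theta_i itself.
sin_traces = [[math.sin(th[i]) for th in thetas] for i in range(len(theta0))]
```

With identical frequencies and initial phases inside a half circle, the phase spread shrinks over time, so the traces of \( \sin(\theta_i)\) merge into a single curve.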
We consider a network of \( n=10\) nodes with randomly generated natural frequencies \( \omega_{i}\) and couplings \( a_{ij}\) . The measurements are chosen as \( y= \begin{bmatrix} \theta_{1} & \theta_{2} & \theta_{3} & \theta_{4} & \theta_{5} \end{bmatrix}^\top\) . A set of 100 initial conditions is generated in \( [-2,2]^{10}\) using Latin hypercube sampling. We choose Runge-Kutta as our numerical ODE solver to simulate (51) and (7) over a time interval \( [0,30]\) , partitioned into \( 4000\) sample points for each trajectory. The neural network \( \hat{\mathcal{T}}^*_{\eta}\) is chosen as a fully connected feed-forward network, consisting of 3 hidden layers of 250 neurons with ReLU activation functions. Model training is facilitated by data standardization and learning rate scheduling. Following [11], the parameters of KKL observer (19) are chosen as
where \( \Lambda \in \mathbb{R}^{(2n_{x}+1) \times (2n_{x}+1)}\) is a diagonal matrix with diagonal elements linearly distributed in \( [-21,-15]\) , \( \Gamma \in \mathbb{R}^{2n_{x}+1}\) is a column vector of ones, and \( I_{n_{y}}\) is the identity matrix of size \( n_{y} \times n_{y}\) . Here, \( n_x\) and \( n_{y}\) are 10 and 5, respectively, and \( n_z = n_y(2n_x+1)=105\) .
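The dimensions stated above can be verified with a short construction. The sketch assumes the Kronecker structure \( A=\Lambda\otimes I_{n_y}\) and \( B=\Gamma\otimes I_{n_y}\) , which is consistent with \( n_z=n_y(2n_x+1)\) but is an assumption about the displayed formula, and it spaces the diagonal entries of \( \Lambda\) evenly between \( -15\) and \( -21\) . The helper `kron` is written out by hand to keep the example dependency-free.

```python
def kron(P, Q):
    """Kronecker product of two matrices given as lists of rows."""
    return [[p * q for p in row_p for q in row_q] for row_p in P for row_q in Q]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


n_x, n_y = 10, 5
m = 2 * n_x + 1  # number of diagonal entries of Lambda

# Diagonal entries evenly spaced from -15 down to -21 (assumed ordering).
eigs = [-15.0 - 6.0 * k / (m - 1) for k in range(m)]
Lam = [[eigs[i] if i == j else 0.0 for j in range(m)] for i in range(m)]
Gam = [[1.0] for _ in range(m)]  # column vector of ones

A = kron(Lam, identity(n_y))  # assumed structure: Lambda (x) I_{n_y}
B = kron(Gam, identity(n_y))  # assumed structure: Gamma  (x) I_{n_y}
n_z = len(A)
```

Since \( A\) is diagonal with distinct negative entries repeated \( n_y\) times, it is Hurwitz and trivially diagonalizable with \( \text{cond}(V)=1\) .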
Figures 7–9 show the results of FDI for a network of Kuramoto oscillators using the learned KKL observer. The theoretical thresholds \( \tau_i\) of the residuals \( r_i(t)\) computed by (49) are shown in the figures in the third column, while the empirical threshold \( \tilde r_\Delta=4.74\) is computed using (50) according to the method described in Section 8.2. Going from left to right, the figures show (a) the predicted output \( \hat{y}_i\) obtained by the learned KKL observer and the true output \( y_i\) measured by the faulty sensors, (b) the active fault-inducing signals \( \phi_i(t)\) or \( \zeta_i(t)\) , (c) the generated residuals \( r_i\) and their respective thresholds \( \tau_i\) , and (d) the differentiated residuals \( \tilde{r}_i\) and their common threshold \( \tilde r_\Delta\) .
Fig. 7 shows that our method is capable of detecting and isolating sensor failures, which is demonstrated by inducing sensor \( {4}\) ’s failure with \( \phi_{4}(t) = 0\) for \( t\geq 10\) . Before the fault, Fig. 7(a) shows the observer successfully tracking the noisy output of the system after a short transient period induced by the initialization \( \hat{z}(0) = 0\) . The introduction of sensor 4’s failure at time \( t=10\) (Fig. 7(b)) causes the predicted output to diverge from the actual output, producing large residuals that are observed in Fig. 7(c). Although sensor 4 failed, we notice in Fig. 7(c) that the residuals corresponding to sensors 3 and 5 crossed their respective thresholds, whereas the residual corresponding to sensor 4 remained under the threshold. This indicates that monitoring the residuals can allow fault detection but not isolation. For fault isolation, we monitor differentiated residuals in Fig. 7(d) and observe the spike in sensor 4’s differentiated residual at \( t=10\) , accurately indicating its failure.
Fig. 8 and 9 demonstrate the detection and isolation of sensor degradation. In Fig. 8(b), a bias is introduced in sensors \( {1}\) and \( {5}\) , with \( \zeta_{1}(t)=1\) for \( t\geq 5\) and \( \zeta_{5}(t)=1\) for \( t \geq 15\) . Fig. 8(a) shows the output prediction. Each fault is detected at the moment of occurrence from the residuals in Fig. 8(c) and is isolated using the differentiated residuals in Fig. 8(d). Lastly, Fig. 9(b) shows the fault signal on sensor \( 3\) , which is a gradually increasing white noise. Fig. 9(a) shows the output prediction, and Fig. 9(c) and Fig. 9(d) illustrate fault detection and isolation, respectively.
We proposed a learning method to design KKL observers for autonomous nonlinear systems. We trained neural networks using physics-informed learning to approximate both the transformation map and its inverse, which are required for synthesizing a KKL observer. By fixing the KKL observer, we synthetically generated the training data for neural networks by numerically solving system and observer equations. Instead of simultaneously learning the transformation and its inverse, we proposed to learn these maps sequentially to avoid conflicting gradients during training. First, the transformation map is learned by a physics-informed neural network regularized by the PDE associated with the KKL observer. Then, using the learned transformation map, its inverse is learned by a neural network with an autoencoder-type architecture.
We provided theoretical robustness guarantees and showed that the state estimation error obtained by the learned KKL observer is input-to-state stable against approximation errors and system uncertainties, i.e., model uncertainties and measurement noise. In addition, we presented a method for optimal KKL observer parameter selection that achieves a trade-off between robustness and learnability. The effectiveness of our approach was validated through comprehensive state estimation experiments across multiple benchmark systems: the reverse Duffing oscillator, van der Pol oscillator, Rössler attractor, and Lorenz attractor. Furthermore, we successfully demonstrated the learned KKL observer’s capability in fault detection and isolation within a network of Kuramoto oscillators.
Our future research will address key open problems. The primary challenge lies in extending KKL observer learning to non-autonomous and controlled nonlinear systems, where the transformation maps and their inverses are time-varying. This temporal dependency results in time-dependent PDEs that probably require more complex neural network architectures. Another challenge lies in incorporating our KKL observer learning framework into the design of output feedback controllers while maintaining stability guarantees.
Acknowledgments
This work is supported by the Swedish Research Council and the Knut and Alice Wallenberg Foundation, Sweden. It has also received funding from the European Union’s Horizon Research and Innovation Programme under Marie Skłodowska-Curie grant agreement No. 101062523.
[1] A. Teel and L. Praly. Tools for semiglobal stabilization by partial state and output feedback. SIAM Journal on Control and Optimization, 33(5):1443–1488, 1995.
[2] W. Chen and M. Saif. Observer-based strategies for actuator fault detection, isolation and estimation for certain class of uncertain nonlinear systems. IET Control Theory & Applications, 1(6):1672–1680, 2007.
[3] F. Dettù, S. Formentin, and S. Matteo Savaresi. Joint vehicle state and parameters estimation via twin-in-the-loop observers. Vehicle System Dynamics, 62(9):2423–2449, 2024.
[4] J.-P. Gauthier and I. Kupka. Deterministic Observation Theory and Applications. Cambridge University Press, 2001.
[5] H. K. Khalil and L. Praly. High-gain observers in nonlinear feedback control. International Journal of Robust and Nonlinear Control, 24(6):993–1015, 2014.
[6] P. Bernard. Observer Design for Nonlinear Systems, volume 479. Springer, 2019.
[7] D. Boutat and G. Zheng. Observer Design for Nonlinear Dynamical Systems. Springer, 2021.
[8] N. Kazantzis and C. Kravaris. Nonlinear observer design using Lyapunov’s auxiliary theorem. Systems & Control Letters, 34(5):241–247, 1998.
[9] V. Andrieu and L. Praly. On the existence of a Kazantzis–Kravaris/Luenberger observer. SIAM Journal on Control and Optimization, 45(2):432–456, 2006.
[10] V. Andrieu. Convergence speed of nonlinear Luenberger observers. SIAM Journal on Control and Optimization, 52(5):2831–2856, 2014.
[11] P. Bernard, V. Andrieu, and D. Astolfi. Observer design for continuous-time dynamical systems. Annual Reviews in Control, 53:224–248, 2022.
[12] L. Brivadis, V. Andrieu, P. Bernard, and U. Serres. Further remarks on KKL observers. Systems & Control Letters, 172:105429, 2023.
[13] P. Bernard and V. Andrieu. Luenberger observers for nonautonomous nonlinear systems. IEEE Transactions on Automatic Control, 64(1):270–281, 2018.
[14] V. Andrieu and P. Bernard. Remarks about the numerical inversion of injective nonlinear maps. In 60th IEEE Conference on Decision and Control (CDC), pages 5428–5434, 2021.
[15] D. Luenberger. Observing the state of a linear system. IEEE Transactions on Military Electronics, 8(2):74–80, 1964.
[16] D. Luenberger. Observers for multivariable systems. IEEE Transactions on Automatic Control, 11(2):190–197, 1966.
[17] D. Luenberger. An introduction to observers. IEEE Transactions on Automatic Control, 16(6):596–602, 1971.
[18] A. Shoshitaishvili. Singularities for projections of integral manifolds with applications to control and observation problems. In Theory of Singularities and its Applications, pages 295–333. American Mathematical Society, 1990.
[19] A. Shoshitaishvili. On control branching systems with degenerate linearization. In IFAC Symposium on Nonlinear Control Systems, pages 495–500, 1992.
[20] A. J. Krener and M. Xiao. Nonlinear observer design in the Siegel domain. SIAM Journal on Control and Optimization, 41(3):932–953, 2002.
[21] G. Kreisselmeier and R. Engel. Nonlinear observers for autonomous Lipschitz continuous systems. IEEE Transactions on Automatic Control, 48(3):451–464, 2003.
[22] V. Pachy, V. Andrieu, P. Bernard, L. Brivadis, and L. Praly. On the existence of KKL observers with nonlinear contracting dynamics. IFAC-PapersOnLine, 58(21):262–267, 2024.
[23] P. Bernard and M. Maghenem. Reconstructing indistinguishable solutions via a set-valued KKL observer. Automatica, 166:111703, 2024.
[24] R. Engel. Nonlinear observers for Lipschitz continuous systems with inputs. International Journal of Control, 80(4):495–508, 2007.
[25] P. Bernard. Luenberger observers for nonlinear controlled systems. In 56th IEEE Conference on Decision and Control (CDC), pages 3676–3681, 2017.
[26] L. d. C. Ramos, F. Di Meglio, V. Morgenthaler, L. F. F. da Silva, and P. Bernard. Numerical design of Luenberger observers for nonlinear systems. In 59th IEEE Conference on Decision and Control (CDC), pages 5435–5442, 2020.
[27] J. Peralez and M. Nadri. Deep learning-based Luenberger observer design for discrete-time nonlinear systems. In 60th IEEE Conference on Decision and Control (CDC), pages 4370–4375, 2021.
[28] M. Buisson-Fenet, L. Bahr, V. Morgenthaler, and F. Di Meglio. Towards gain tuning for numerical KKL observers. IFAC-PapersOnLine, 56(2):4061–4067, 2023.
[29] M. U. B. Niazi, J. Cao, X. Sun, A. Das, and K. H. Johansson. Learning-based design of Luenberger observers for autonomous nonlinear systems. In American Control Conference (ACC), pages 3048–3055, 2023.
[30] K. Miao and K. Gatsis. Learning robust state observers using neural ODEs. In Learning for Dynamics and Control Conference, pages 208–219, 2023.
[31] J. Peralez and M. Nadri. Deep model-free KKL observer: A switching approach. In Learning for Dynamics and Control Conference, 2024.
[32] P. J. Antsaklis and A. N. Michel. A Linear Systems Primer. Springer Birkhäuser Boston, MA, 2007.
[33] B. Chen and G. Hu. Nonlinear state estimation under bounded noises. Automatica, 98:159–168, 2018.
[34] M. Milanese, J. Norton, H. Piet-Lahanier, and É. Walter. Bounding Approaches to System Identification. Springer New York, NY, 1996.
[35] H. Mania, M. I. Jordan, and B. Recht. Active learning for nonlinear system identification with guarantees. Journal of Machine Learning Research, 23:1–30, 2022.
[36] L. Ljung. System Identification: Theory for the User. Prentice-Hall, Upper Saddle River, NJ, 2 edition, 1999.
[37] M. Abudia, J. A. Rosenfeld, and R. Kamalapurkar. Carleman lifting for nonlinear system identification with guaranteed error bounds. In American Control Conference (ACC), pages 929–934, 2023.
[38] E. D. Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems, volume 6. Springer New York, NY, 2 edition, 2013.
[39] A. Virmaux and K. Scaman. Lipschitz regularity of deep neural networks: Analysis and efficient estimation. Advances in Neural Information Processing Systems, 31, 2018.
[40] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[41] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. Pappas. Efficient and accurate estimation of Lipschitz constants for deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[42] M. Jordan and A. G. Dimakis. Exactly computing the local Lipschitz constant of ReLU networks. Advances in Neural Information Processing Systems, 33:7344–7353, 2020.
[43] Y. Ebihara, X. Dai, T. Yuno, V. Magron, D. Peaucelle, and S. Tarbouriech. Local Lipschitz constant computation of ReLU-FNNs: Upper bound computation with exactness verification. In European Control Conference (ECC), pages 2506–2511, 2024.
[44] J. Sokolić, R. Giryes, G. Sapiro, and M. R. Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.
[45] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.
[46] D. Jakubovitz, R. Giryes, and M. R. Rodrigues. Generalization error in deep learning. In Compressed sensing and its applications, pages 153–193. Springer, 2019.
[47] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[48] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT press, 2018.
[49] E. D. Sontag and Y. Wang. On characterizations of the input-to-state stability property. Systems & Control Letters, 24(5):351–359, 1995.
[50] G.-R. Duan and H.-H. Yu. LMIs in Control Systems: Analysis, Design and Applications. CRC press, 2013.
[51] H. Roger and R. J. Charles. Topics in Matrix Analysis. Cambridge University Press, 1991.
[52] F. Pasqualetti, S. Zampieri, and F. Bullo. Controllability metrics, limitations and algorithms for complex networks. IEEE Transactions on Control of Network Systems, 1(1):40–52, 2014.
[53] K.-C. Goh, M. G. Safonov, and G. P. Papavassilopoulos. A global optimization approach for the BMI problem. In 33rd IEEE Conference on Decision and Control, volume 3, pages 2009–2014, 1994.
[54] J. Fiala, M. Kočvara, and M. Stingl. PENLAB: A MATLAB solver for nonlinear semidefinite optimization. arXiv preprint arXiv:1311.5240, 2013.
[55] E. Warner and J. Scruggs. Iterative convex overbounding algorithms for BMI optimization problems. IFAC-PapersOnLine, 50(1):10449–10455, 2017.
[56] K. Zhou and P. P. Khargonekar. Robust stabilization of linear systems with norm-bounded time-varying uncertainty. Systems & Control Letters, 10(1):17–20, 1988.
[57] A. Zemouche, R. Rajamani, B. Boulkroune, H. Rafaralahy, and M. Zasadzinski. \( \mathcal{H}_\infty\) circle criterion observer design for Lipschitz nonlinear systems with enhanced LMI conditions. In American Control Conference (ACC), pages 131–136, 2016.
[58] F. Dörfler and F. Bullo. Synchronization in complex networks of phase oscillators: A survey. Automatica, 50(6):1539–1564, 2014.