

Deterministic Global Optimization over trained Kolmogorov Arnold Networks
arXiv
Published version: 10.48550/arXiv.2503.02807

Tanuj Karia, Department of Chemical Engineering, Delft University of Technology, Van der Maasweg 9, Delft, 2629 HZ, The Netherlands

Giacomo Lastrucci, Department of Chemical Engineering, Delft University of Technology, Van der Maasweg 9, Delft, 2629 HZ, The Netherlands

Artur M. Schweidtmann, Department of Chemical Engineering, Delft University of Technology, Van der Maasweg 9, Delft, 2629 HZ, The Netherlands

Keywords: Deterministic global optimization, Mixed-Integer Nonlinear Programming, Kolmogorov Arnold Networks, Machine Learning

Abstract

To address the challenge of tractability for optimizing mathematical models in science and engineering, surrogate models are often employed. Recently, a new class of machine learning models named Kolmogorov Arnold Networks (KANs) has been proposed. It was reported that KANs can approximate a given input/output relationship with a high level of accuracy, requiring significantly fewer parameters than multilayer perceptrons. Hence, we aim to assess the suitability of deterministic global optimization of trained KANs by proposing their Mixed-Integer Nonlinear Programming (MINLP) formulation. We conduct extensive computational experiments for different KAN architectures. Additionally, we propose an alternative convex hull reformulation, local support cuts, and redundant constraints aimed at improving the effectiveness of the MINLP formulation of the KAN. KANs demonstrate high accuracy while requiring relatively modest computational effort to optimize them, particularly for cases with fewer than five inputs or outputs. For cases with more inputs or outputs, carefully considering the KAN architecture during training may improve its effectiveness when optimizing over a trained KAN. Overall, we observe that KANs offer a promising alternative as surrogate models for deterministic global optimization.

1 Introduction

Surrogate models are frequently employed for mathematical optimization in various disciplines of science and engineering [1, 2, 3, 4]. Often, when solving optimization problems, one encounters an objective or a set of constraints that makes the problem challenging to solve. In such cases, surrogate models are trained by querying samples from the objective or constraints to approximate the underlying input-output relationship they exhibit. Typically, the surrogate model has a simpler functional form and requires less computational effort to optimize, possibly at some loss of accuracy.

Various kinds of surrogate models have been used for mathematical optimization, including linear regression models [5], polynomial regression models [6], and other algebraic models [7, 8] derived via sparse regression techniques [9]. Other surrogate model forms [3] include convex region linear surrogate models [10], piecewise polynomial [11] and spline functions [12, 13], Gaussian processes [14], support vector machines [15, 16], ensemble tree models [17, 18, 19, 20], graph neural networks [21, 22], and artificial neural networks or multilayer perceptrons (MLPs) [23, 24, 25, 26, 27, 28].

Besides, the development of various modeling frameworks has greatly facilitated the integration of surrogate models into optimization formulations. For embedding ReLU networks into optimization models, several tools, namely JANOS [29], optiCL [30], and reluMIP [31], have been developed. Catering to more general machine learning models, OMLT [32] is capable of incorporating MLPs, tree ensemble models, and graph neural networks [21, 22] into optimization models formulated in Pyomo [33]. OMLT allows for seamless switching between various formulations for the same machine learning model. MeLON [23, 14] is capable of handling Gaussian processes, MLPs, and support vector machines; it interfaces, via C++, with the global solver MAiNGO [34]. Recently, more solver-specific interfaces have also been developed for GUROBI [35] and SCIP [36]. Additionally, a package named MathOptAI has been published for embedding machine learning models in JuMP [37]. For a more detailed review of modeling frameworks for integrating surrogate models into optimization formulations, the reader is referred to the article by [66].

The popularity of deep learning, in combination with the availability of a wide plethora of tools for integrating MLPs, has made MLPs a popular choice as surrogate models. The most popular activation functions employed for the training of MLPs are the tanh and ReLU activations. Significant work has been done in the literature to facilitate the optimization of deeper (more layers) and wider (more neurons) MLPs with these activation functions. [23] proposed convex and concave envelopes for the tanh function and employed a factorable reduced space formulation of the trained MLP to exploit lower dimensionality to effectively optimize ANNs with tanh activation. Alternatively, using the ReLU activation function results in a mixed-integer linear formulation [24] because of the use of the big-M method to model the max function. Stronger [25] and more effective partition-based [26] mixed-integer linear formulations have further improved the tractability of optimizing over trained ReLU networks. Recently, a complementarity constraint-based formulation for modelling the ReLU activation, leading to a nonlinear formulation, was also proposed [38]. Finally, convex and concave relaxations for other activation functions such as sigmoid, GeLU, and SiLU have been developed [27, 39] to facilitate efficient optimization of MLPs with these activation functions. Generally, it has been reported in the literature that the larger the ANN, both in terms of the number of neurons and the number of layers, the more challenging it becomes to solve the resulting optimization model [23, 35].

Recently, a new class of machine-learning models, Kolmogorov-Arnold Networks (KANs), has been proposed [40]. [40] demonstrated the superiority of KANs over MLPs for different machine-learning tasks in terms of the number of parameters required to achieve a similar level of accuracy, at the expense of slower training times for KANs. Herein, we investigate the suitability of KANs as surrogate models in mathematical optimization, particularly considering that KANs are expected to require smaller architectures than MLPs to approximate a given input/output relationship. KANs are network-type models similar to multilayer perceptrons, with two features distinguishing them from MLPs: (i) the activation lies on the edge (weight) connecting two neurons in a KAN rather than on the neuron itself as in an MLP, and (ii) each activation is learnable via B-spline functions.

Thus, we consider optimization problems of the form:

\[ \begin{align} \min_{\mathbf{x}} & ~ f(\mathbf{x}, \textsf{KAN}(\mathbf{x})) \end{align} \]

(1)

\[ \begin{align} \text{s.t.} & ~ \mathbf{h}(\mathbf{x},\textsf{KAN}(\mathbf{x})) = 0 \end{align} \]

(2)

\[ \begin{align} & ~ \mathbf{g}(\mathbf{x},\textsf{KAN}(\mathbf{x})) \leq 0 \end{align} \]

(3)

\[ \begin{align} & ~ \mathbf{x} \in \mathcal{X} \subset \mathbb{R}^{n_0} \end{align} \]

(4)

Here, \( f\) is the objective function; \( \mathbf{h}\) and \( \mathbf{g}\) represent the sets of equality and inequality constraints, respectively, which include the formulation of the trained KAN; and \( \sf{KAN}(\mathbf{x}) \in Y \subset \mathbb{R}^{n_L}\) is the output of the KAN at \( \mathbf{x}\) .

We propose a MINLP formulation for trained Kolmogorov-Arnold Networks, based on the Mixed-Integer Quadratically Constrained Programming (MIQCP) formulation proposed for B-splines by [13]. Further, we propose enhancements that exploit properties of KANs to improve the effectiveness of the proposed MINLP formulation. The proposed formulation is implemented in Pyomo [33] and tested on various optimization test functions of varying dimensionality. Different architectures of KANs are trained and optimized using the global MINLP solver SCIP [41] to rigorously examine whether KANs are suitable as surrogate models. MLPs of varying architectures are also trained, and the computational effort needed to solve both MLPs and KANs to global optimality with SCIP is compared.

The rest of the paper is organized as follows: Section 2 provides a brief introduction to KANs. Section 3 outlines the methodology, explaining the MINLP formulation for KANs and the enhancements to the formulation; the implementation details in Pyomo and the details of the computational experiments are also provided there. Section 4 provides the results and a discussion on optimizing the trained KANs. Finally, we conclude and provide recommendations for future research in Section 5.

2 Background

In this section, we provide the requisite background on Kolmogorov Arnold Networks (KANs) and on the B-spline functions that act as activations in a KAN.

2.1 Kolmogorov Arnold Networks

Kolmogorov Arnold Networks are based on the Kolmogorov-Arnold representation theorem [42, 43], which establishes that a continuous function \( f:\mathcal{X} \rightarrow \mathbb{R}\) of \( n_0\) variables \( \mathbf{x}\) on a bounded domain \( \mathcal{X} \subset \mathbb{R}^{n_0}\) can be expressed as a sum of \( 2n_0 + 1\) univariate outer functions \( \Phi_j\) , each applied to a sum of \( n_0\) univariate functions \( \phi_{j,i}\) . Specifically,

\[ \begin{equation} f(\mathbf{x}) = f(x_1, \ldots, x_{n_0}) = \sum_{j=1}^{2n_{0}+1} \Phi_j \left( \sum_{i=1}^{n_0} \phi_{j,i}(x_i) \right) \end{equation} \]

(5)

In Equation (5), \( \phi_{j,i}(x_i):[x_i^\mathrm{ L}, x_i^\mathrm{ U}] \rightarrow \mathbb{R}\) , and \( \Phi_j:\mathbb{R} \rightarrow \mathbb{R}\) acts on the sum of the univariate functions \( \phi_{j,i}\) . The Kolmogorov-Arnold representation theorem thereby opens up the possibility of learning any function in \( n_0\) dimensions using a polynomial number of 1D functions. Hence, a network based on the Kolmogorov-Arnold representation theorem comprises one input layer of size \( n_0\) , one output layer with output \( f(\mathbf{x})\) , and one hidden layer with (\( 2n_0+1\) ) neurons. However, the 1D functions can be non-smooth or fractal, rendering them unlearnable in practice [44, 45].

Figure 1. Representation of a Kolmogorov-Arnold Network as a directed acyclic graph. Dashed gray line on a neuron represents the edge with the activation connecting two neurons. The network is fully connected.

To address this issue, [40] proposed generalizing Equation (5) by considering an arbitrarily large number of hidden layers (depth of the network) and neurons (width) of the hidden layers and applied modern backpropagation techniques to train them. A generalised KAN is illustrated as a directed acyclic graph in Figure 1.

Assume a set of \( N_\mathrm{ data}\) input-output pairs \( (\mathbf{x}, \mathbf{y})\) is available; the aim is to approximate the relationship \( \mathbf{y} \approx f(\mathbf{x})\) using a KAN. Consider a KAN of arbitrary width and depth comprising \( L+1\) layers, including the input (\( l=0\) ) and output (\( l=L\) ) layers, such that the number of neurons in each layer \( l\) is defined by the vector \( \mathbf{N} := [n_0, n_1, \ldots, n_l, \ldots, n_{L-1}, n_L]\) of length \( L+1\) . As seen in Figure 1, the univariate activation function \( \phi\) of a KAN lies on the edge connecting a neuron \( i\) in layer \( l\) to a neuron \( j\) in the next layer \( l+1\) . Hence, the univariate activation function \( \phi\) is indexed as

\[ \begin{equation} \phi_{l,j,i}(x_{l,i}),\, l=0,\ldots,\,L-1, \, j=1,\ldots, \, n_{l+1},\, \text{and } i=1,\ldots,\, n_{l} \end{equation} \]

(6)

In Equation (6), \( x_{l,i}\) refers to the input to the activation function from neuron \( i\) in layer \( l\) . Based on the Kolmogorov-Arnold representation theorem introduced earlier, the input (\( x_{l+1,j}\) ) to neuron \( j\) in the next layer \( l+1\) is the sum of all incoming activation outputs from layer \( l\) to neuron \( j\) ,

\[ \begin{equation} x_{l+1,j} = \sum_{i=1}^{n_l} \phi_{l,j,i}(x_{l,i}), \, j=1,\ldots,\,n_{l+1} \end{equation} \]

(7)

Equivalently, Equation (7) can be written in matrix form,

\[ \begin{equation} \begin{pmatrix} x_{l+1,1} \\ \vdots \\ x_{l+1,n_{l+1}} \end{pmatrix} = \begin{pmatrix} \phi_{l,1,1}(x_{l,1}) + \phi_{l,1,2}(x_{l,2}) + \cdots + \phi_{l,1,n_l}(x_{l,n_l}) \\ \phi_{l,2,1}(x_{l,1}) + \phi_{l,2,2}(x_{l,2}) + \cdots + \phi_{l,2,n_l}(x_{l,n_l}) \\ \vdots \\ \phi_{l,n_{l+1},1}(x_{l,1}) + \phi_{l,n_{l+1},2}(x_{l,2}) + \cdots + \phi_{l,n_{l+1},n_l}(x_{l,n_l}) \end{pmatrix} \end{equation} \]

(8)

Equation (8) can be equivalently represented as,

\[ \begin{equation} \begin{pmatrix} x_{l+1,1} \\ \vdots \\ x_{l+1,n_{l+1}} \end{pmatrix} = \begin{pmatrix} \phi_{l,1,1}(\cdot) & \phi_{l,1,2}(\cdot) & \cdots & \phi_{l,1,n_l}(\cdot) \\ \phi_{l,2,1}(\cdot) & \phi_{l,2,2}(\cdot) & \cdots & \phi_{l,2,n_l}(\cdot) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{l,n_{l+1},1}(\cdot) & \phi_{l,n_{l+1},2}(\cdot) & \cdots & \phi_{l,n_{l+1},n_l}(\cdot) \end{pmatrix} \begin{pmatrix} x_{l,1} \\ \vdots \\ x_{l,n_{l}} \end{pmatrix} \end{equation} \]

(9)

It is important to note that in Equation (9), the vector on the right-hand side is not multiplied by the matrix of functions in the usual sense. Rather, each function in a given row of the matrix takes the element from the corresponding position of the vector as its input to the activation \( \phi\) , and the results in each row are summed. Compactly, Equation (9) is represented as,

\[ \begin{equation} \mathbf{x}_{l+1} = \Phi_l(\mathbf{x}_{l}) \end{equation} \]

(10)

In Equation (10), \( \Phi_l\) is an (\( n_{l+1} \times n_l\) ) matrix of the univariate functions \( \phi\) , whose action sums all activations from neurons \( i\) in layer \( l\) to each neuron \( j\) in layer \( l+1\) . \( n_l\) represents the number of inputs to layer \( l\) , and \( n_{l+1}\) represents the number of outputs from layer \( l\) , i.e., the number of inputs to layer \( l+1\) . Hence, the predicted output from the KAN (\( \mathbf{y}^\mathrm{ pred}\) ) can be expressed compactly as

\[ \begin{equation} \mathbf{y}^\mathrm{ pred} \equiv \mathbf{x}_{L} = \Phi_{L-1}(\cdots(\Phi_1(\Phi_0(\mathbf{x}_0)))) \end{equation} \]

(11)

where \( \mathbf{x}\) is equivalent to \( \mathbf{x}_0\) , representing the inputs to the network.
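The "matrix of functions" semantics of Equations (7) – (11) can be made concrete with a short sketch (a hypothetical plain-Python illustration, not the authors' implementation; the toy activations below are invented for the example):

```python
# Sketch of Equations (7)-(10): a KAN layer is a matrix of univariate
# functions. Each row is applied elementwise to the input vector and
# summed -- NOT an ordinary matrix-vector product.

def kan_layer(phi, x):
    """phi: n_{l+1} rows, each a list of n_l callables phi[j][i].
    x: list of n_l inputs. Returns the n_{l+1} outputs x_{l+1,j} (Eq. 7)."""
    return [sum(phi_ji(x_i) for phi_ji, x_i in zip(row, x)) for row in phi]

def kan_forward(layers, x0):
    """Equation (11): compose the layer maps Phi_{L-1} o ... o Phi_0."""
    x = x0
    for phi in layers:
        x = kan_layer(phi, x)
    return x

# Toy network (2 inputs -> 2 hidden -> 1 output) with invented activations.
layers = [
    [[lambda u: u**2, lambda u: -u],      # phi_{0,1,1}, phi_{0,1,2}
     [lambda u: u,    lambda u: 2.0 * u]],  # phi_{0,2,1}, phi_{0,2,2}
    [[lambda u: u + 1.0, lambda u: 0.5 * u]],
]
y = kan_forward(layers, [3.0, 1.0])
```

Replacing the toy lambdas with trained activations of the form of Equation (12) would give a plain reference evaluator for a KAN, useful for checking an optimization formulation against the network it embeds.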

As described in Figure 1, in a KAN, each activation function (\( \phi\) ) is trainable; thereby, each activation in the network is distinct. Each univariate activation function \( \phi\) is defined as:

\[ \begin{align} \phi(x_{l,i}) & = w^\mathrm{ b} b(x_{l,i}) + w^\mathrm{ s}s(x_{l,i}) \end{align} \]

(12)

\[ \begin{align} b(x_{l,i}) & = \text{silu}(x_{l,i}) = \frac{x_{l,i}}{1 + \exp{(-x_{l,i})}} \end{align} \]

(13)

\[ \begin{align} s(x_{l,i}) & = \sum_{g=0}^{G+k-1} c_g B_{g,k}(x_{l,i}) \end{align} \]

(14)

In Equations (12) - (14), the indices \( l,j,i\) are dropped for simplicity, but these equations apply for all \( l=0,\ldots,L-1\) , \( j=1,\ldots,n_{l+1}\) , and \( i=1,\ldots,n_{l}\) . A KAN activation is a weighted sum of a univariate base activation function (\( b(x_{l,i})\) ) and a univariate spline activation function (\( s(x_{l,i})\) ), as described in Equation (12), where \( w^\mathrm{ b}\) and \( w^\mathrm{ s}\) are the respective weights. In principle, any activation function can be chosen as the base activation function, but [40] propose the SiLU activation in Equation (13) to ensure stability during training. Each activation \( \phi\) in a KAN is distinct due to the presence of the univariate spline activation function (\( s(x_{l,i})\) ). The univariate spline function in Equation (14) is a weighted sum of \( k\) th-order B-spline basis functions (\( B_{g,k}\) ) and their respective coefficients (\( c_g\) ), indexed by \( g\) . The basis functions are defined over a non-decreasing sequence of real numbers called knots; the number of basis functions and knots depends on the number of grid intervals \( G\) and the order \( k\) of the B-spline chosen by the user, which thereby act as hyperparameters. The bias-variance tradeoff for KANs is determined by the number of grid intervals \( G\) used to train them [40].

For a given input-output pair \( (\mathbf{x}, \mathbf{y})\) and number of grid intervals \( G\) , the input \( \mathbf{x}_{0}\) corresponds to \( \mathbf{x}\) . The objective is to ensure the network output \( \mathbf{x}_{L} \equiv \mathbf{y}^\mathrm{ pred}\) closely matches the value of \( \mathbf{y}\) . This is done by adjusting the values of \( c_g\) , \( w^\mathrm{ b}\) , and \( w^\mathrm{ s}\) , which thus represent the trainable parameters of a KAN. During the training of a KAN, the \( B_{g,k}\) are not trainable; their values are determined by the input (\( x_{0,i}\) ) to the network via the de Boor algorithm [46], described in Equations (15) and (16).

\[ \begin{align} B_{g,0}(x_{l,i};\mathbf{t}) &= \left\{ \begin{array}{ll} 1, & t_g \leq x_{l,i} < t_{g+1}, \\ 0, & \text{otherwise}. \end{array}\right. \end{align} \]

(15)

\[ \begin{align} B_{g,d}(x_{l,i};\mathbf{t}) &= \frac{x_{l,i} - t_g}{t_{g+d} - t_g} B_{g,d-1}(x_{l,i};\mathbf{t}) + \frac{t_{g+d+1} - x_{l,i}}{t_{g+d+1} - t_{g+1}} B_{g+1,d-1}(x_{l,i};\mathbf{t}), \, \forall \, d=1,\ldots,k \end{align} \]

(16)

Both Equations (15) and (16) depend on the knot sequence \( \mathbf{t}\) , which is a sequence of non-decreasing numbers. For a KAN, the definition of the knot vector (\( \mathbf{t}\) ) depends on the number of grid intervals (\( G\) ). Considering that the input to an activation \( \phi\) is bounded between \( [x_{l,i}^\mathrm{ L}, x_{l,i}^\mathrm{ U}]\) , the knot vector \( \mathbf{t}\) is defined as,

\[ \begin{align} t_g &= \left\{ \begin{array}{ll} x_{l,i}^\mathrm{ L} + \frac{(g - k)(x_{l,i}^\mathrm{ U} - x_{l,i}^\mathrm{ L})}{G}, & 0 \leq g \leq k-1, \\ t_{g-1} + \frac{x_{l,i}^\mathrm{ U} - x_{l,i}^\mathrm{ L}}{G}, & k \leq g \leq G+k, \\ x_{l,i}^\mathrm{ U} + \frac{(g - G - k)(x_{l,i}^\mathrm{ U} - x_{l,i}^\mathrm{ L})}{G}, & G+k+1 \leq g \leq G+2k \end{array}\right. \end{align} \]

(17)

For the input layer (\( l=0\) ), it is straightforward to determine the bounds of the input to the network. For the subsequent layers, the knot vector (\( \mathbf{t}\) ) is updated online during training [40]. Once a KAN is trained, the knot vector \( \mathbf{t}\) is fixed. Finally, even though the defined number of grid intervals is \( G\) , the knot vector \( \mathbf{t}\) comprises \( G+2k+1\) elements, with the first and last \( k\) elements extending the grid beyond the bounds \( [x_{l,i}^\mathrm{ L}, x_{l,i}^\mathrm{ U}]\) . No justification for this extension is provided in [40]; herein, we adopt their implementation.
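Equations (15) – (17) can be exercised with a small sketch that builds a uniform extended knot vector and evaluates the basis functions by the de Boor recursion (the bounds and hyperparameters below are hypothetical, and this is an illustration, not the pykan implementation):

```python
# Sketch of Equations (15)-(17): extended knot vector for an activation
# bounded on [xL, xU], and B-spline basis values by the de Boor recursion.

def knot_vector(xL, xU, G, k):
    """Equation (17) for a uniform grid: knots t_0, ..., t_{G+2k},
    spaced (xU - xL)/G apart and extended k intervals beyond each bound."""
    h = (xU - xL) / G
    return [xL + (g - k) * h for g in range(G + 2 * k + 1)]

def basis(x, t, g, d):
    """Equations (15)-(16): B_{g,d}(x; t), recursing on the degree d.
    Assumes distinct knots, so no zero denominators arise."""
    if d == 0:
        return 1.0 if t[g] <= x < t[g + 1] else 0.0
    left = (x - t[g]) / (t[g + d] - t[g]) * basis(x, t, g, d - 1)
    right = (t[g + d + 1] - x) / (t[g + d + 1] - t[g + 1]) * basis(x, t, g + 1, d - 1)
    return left + right

G, k = 5, 3                      # hypothetical hyperparameters
t = knot_vector(-1.0, 1.0, G, k)
x = 0.3                          # a point inside [xL, xU]
B = [basis(x, t, g, k) for g in range(G + k)]  # the G+k k-th order functions
total = sum(B)                   # partition of unity: 1 inside [xL, xU]
```

The final sum checks the partition-of-unity property that Equation (22) later imposes as a constraint: inside the training bounds the \( k\) th-order basis values are non-negative and sum to one.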

Assuming \( n_1 = \ldots = n_{L-1} = N\) , a KAN is characterised by \( \mathcal{O}(N^2L(G+k))\) parameters. By contrast, for an MLP with \( N\) neurons in each of \( L\) layers, the number of parameters characterising the network is \( \mathcal{O}(N^2L)\) , which is lower than for KANs. Since KANs can approximate a given input/output relationship with a polynomial number of one-dimensional spline functions, the residual rate, i.e., the deviation between the function prediction and the data used to train the function, is independent of the number of dimensions (inputs) needed to approximate the function (c.f. Theorem 2.1 in [40]). As a result, KANs typically require a lower \( N\) than MLPs; therefore, KANs need fewer parameters on balance to exhibit performance similar to that of an MLP. For additional details, the reader is referred to Section 2.3 in [40].
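The two scalings can be made concrete with back-of-the-envelope counts. Assume, purely for illustration, that every edge of a fully connected KAN carries \( G+k\) spline coefficients plus the two weights \( w^\mathrm{ b}\) and \( w^\mathrm{ s}\) , and that every MLP edge carries a single weight (biases ignored):

```python
# Illustrative parameter counts behind O(N^2 L (G+k)) vs. O(N^2 L).
# Simplifying assumptions: N neurons in every layer, fully connected,
# G+k spline coefficients plus two weights per KAN edge, biases ignored.

def kan_params(N, L, G, k):
    edges = N * N * L            # N x N edges between each pair of layers
    return edges * (G + k + 2)   # spline coefficients + w_b + w_s per edge

def mlp_params(N, L):
    return N * N * L             # one weight per edge

N, L, G, k = 5, 3, 5, 3
print(kan_params(N, L, G, k))    # grows with the grid size G and order k
print(mlp_params(N, L))          # independent of G and k
```

For these (hypothetical) sizes the KAN carries ten times the parameters of the MLP per edge, which is exactly why a KAN must get away with a smaller \( N\) to be competitive on parameter count.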

2.2 MIQCP formulation for B-splines

The previous sub-section outlined the architecture of a KAN and how it achieves superior parameter efficiency compared to an MLP; it also described the calculations for the B-splines used as activations in KANs. In this sub-section, we describe the Mixed-Integer Quadratically Constrained Programming (MIQCP) formulation proposed by [13] for univariate splines and adapt it to describe the spline activations in KANs. For a trained KAN, \( w^\mathrm{ b}\) , \( w^\mathrm{ s}\) , and the set of spline coefficients \( c_g\) are fixed parameters. Additionally, for a trained KAN, the knot vectors \( \mathbf{t}\) for all activations \( s_{l,j,i}\) are fixed. When embedding a KAN into an optimization formulation, however, the \( k\) th-order basis function \( B_{g,k}(x_{l,i})\) is no longer a fixed parameter and needs to be determined.

The spline basis functions (\( B_{g,d}(x_{l,i})\) ) of degree \( d\) can be determined using Equations (15) and (16). As described in Equation (15), a zeroth-order basis function (\( B_{g,0}(x_{l,i})\) ) is equal to one if the input to the spline activation falls within the knot interval \( [t_g, t_{g+1})\) ; otherwise, it is equal to zero. For all equations in this sub-section, we drop the layer and neuron indices of the spline activation function for simplicity. [13] introduce binary variables to represent the zeroth-order spline basis functions (Equation (18)).

\[ \begin{equation} B_{g,0} \in \{ 0,1 \}, ~ g = 0,\ldots, G+2k-1 \end{equation} \]

(18)

To model the discontinuity in Equation (15) in the optimization formulation, two linear constraints (Equations (19) and (20)) are introduced,

\[ \begin{equation} (t_g - t_0) B_{g,0} + t_0 \leq x, ~ g = 0,\ldots, G+2k-1 \end{equation} \]

(19)

\[ \begin{equation} (t_{g+1} - t_{G+2k}) B_{g,0} + t_{G+2k} \geq x, ~ g = 0,\ldots, G+2k-1 \end{equation} \]

(20)

These linear inequality constraints enforce the corresponding zeroth-order spline basis function to be active when the input lies in the interval \( t_g \leq x \leq t_{g+1}\) , and are equivalent to the big-M formulation of the discontinuous function in Equation (15), with the M values corresponding to the sizes of the intervals. Higher-order basis functions (\( B_{g,d}(x_{l,i})\) ) are determined by adding Equation (16), a quadratic constraint, to the optimization formulation. Following Lemma 2 in [13], which describes the convex combination property of B-spline basis functions, we constrain the basis functions to be non-negative (Equation (21)).

\[ \begin{equation} B_{g,d} \geq 0, ~ d=0,\ldots,k \end{equation} \]

(21)

Additionally, we also consider the partition of unity cuts proposed by [13] for all degrees \( d=0,…,k\) in our formulation,

\[ \begin{equation} \sum_{g=0}^{G+2k-1-d} B_{g,d} = 1, ~ d=0,\ldots, k \end{equation} \]

(22)

For the exact representation of a univariate B-spline, the partition of unity cut for \( d=0\) is necessary; the partition of unity cuts for higher orders are included to strengthen the formulation [13]. Finally, the \( k\) th-order basis functions, along with their respective coefficients, determine the output of the spline activation as described in Equation (14). Other strengthening cuts proposed in [13] are not included in the base formulation and will be discussed in Section 3.2.3. For additional details about the mathematical properties of B-spline basis functions, the reader is referred to [13]. In the next section, we build on this formulation to conduct the spline calculations in a KAN formulated as a MINLP.
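The big-M role of Equations (19) and (20) can be verified numerically: with \( B_{g,0}=0\) , both constraints relax to the trivial bounds \( t_0 \leq x \leq t_{G+2k}\) , while \( B_{g,0}=1\) forces \( x \in [t_g, t_{g+1}]\) . A plain-Python check with a hypothetical knot vector (the paper's actual model is built in Pyomo):

```python
# Numeric check of the big-M pair (19)-(20): for a fixed input x, which
# zeroth-order binaries B_{g,0} = b remain feasible?

def feasible(g, b, x, t):
    """Equations (19)-(20) for interval g with B_{g,0} = b in {0, 1}."""
    lhs_lo = (t[g] - t[0]) * b + t[0]         # Equation (19): lhs_lo <= x
    lhs_hi = (t[g + 1] - t[-1]) * b + t[-1]   # Equation (20): lhs_hi >= x
    return lhs_lo <= x and lhs_hi >= x

t = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical knots t_0, ..., t_4
x = 2.5                          # lies in the interval [t_2, t_3]

active = [g for g in range(len(t) - 1) if feasible(g, 1, x, t)]
relaxed = [g for g in range(len(t) - 1) if feasible(g, 0, x, t)]
```

Here `active` contains only the interval holding `x`, while `relaxed` contains every interval, confirming that setting a binary to zero deactivates its constraints. Combined with the \( d=0\) partition of unity cut (22), which forces exactly one binary to one, the input thus selects exactly one knot interval.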

3 A MINLP formulation for KANs

So far, we have provided a brief background on Kolmogorov Arnold Networks and introduced the MIQCP formulation proposed by [13] for univariate B-spline functions. In this section, we build on the MIQCP formulation for B-splines to describe the spline activation functions in KANs and propose a MINLP formulation for a fully connected KAN. Additionally, we propose several enhancements aimed at improving the effectiveness of the MINLP formulation of the KAN.

3.1 Description of the formulation

Considering the optimization problems described by Equations (1) – (4), the aim is to describe a MINLP formulation for \( \sf{KAN}(\mathbf{x})\) . We consider a KAN with \( L+1\) layers containing neurons defined by the vector \( N := [n_0,n_1,\ldots,n_l,\ldots,n_{L-1},n_L]\) , with \( l=0\) and \( l=L\) corresponding to the input and the output layer, respectively. A neuron \( i\) in layer \( l\) is connected by an activation \( \phi_{l,j,i}\) to neuron \( j\) in layer \( l+1\) . To model the connections in a KAN, including the activations \( \phi_{l,j,i}\) , we introduce additional variables \( \tilde{x}_{l,j,i}\) representing the pre-activation value, i.e., the value entering activation \( \phi_{l,j,i}\) . For every layer, the pre-activation value equals the output of the source neuron:

\[ \begin{equation} \tilde{x}_{l,j,i} = x_{l,i}, \, l=0,\ldots, L-1, \, i = 1,\ldots,n_l, \, \text{and } j=1,\ldots,n_{l+1} \end{equation} \]

(23)

The activated value for a given neuron pair (\( j\) ,\( i\) ) in layer \( l\) is a weighted sum of the base activation function (\( b\) ) and the spline activation function (\( s\) ),

\[ \begin{equation} \begin{aligned} & \phi_{l,j,i}(\tilde{x}_{l,j,i}) = w^\mathrm{ b}_{l,j,i} b_{l,j,i}(\tilde{x}_{l,j,i}) + w^\mathrm{ s}_{l,j,i} s_{l,j,i}(\tilde{x}_{l,j,i}), \\ & l=0,\ldots,L-1, \, j=1,\ldots,n_{l+1}, \text{ and } i=1,\ldots,n_l \end{aligned} \end{equation} \]

(24)

Typically, SiLU activation is used for the base function \( b\) and it is described in Equation (13). For determining the spline value, the MIQCP formulation described in Equations (18) - (20), (16), (22), and (14) is employed with \( \tilde{x}_{l,j,i}\) as inputs. All the activated outputs for the layer \( l\) are combined as described in Equation (7) and can be formulated as,

\[ \begin{equation} x_{l+1,j} = \left[ \sum_{i=1}^{n_l} \phi_{l,j,i} \right] + b_{l,j}^\mathrm{ l}, l=0,\ldots,L-1 \text{ and } j=1,\ldots,n_{l+1} \end{equation} \]

(25)

While [40] do not explicitly mention the presence of a bias term (\( \mathbf{b}^\mathrm{ l}\) ) for a layer \( l\) , it is included in their implementation for sparsity regularization [47] and is a trainable parameter. For embedding the KAN into an optimization formulation, \( \mathbf{b}^\mathrm{ l}\) is a fixed parameter. This completes the description of the MINLP formulation representing a trained KAN. It is important to note that, except for Equation (13), all equations in the formulation are either linear or quadratic. In the next sub-section, we propose enhancements to strengthen the MINLP formulation of a KAN presented in this section.

3.2 Strengthening the formulation

In this sub-section, we propose several enhancements aimed at improving the efficiency of the MINLP formulation for a trained KAN.

3.2.1 Feasibility-based bounds tightening (FBBT)

The training of a KAN provides us with the range of values of both inputs and outputs for a given activation \( \phi_{l,j,i}\) [48]. These ranges bound the values of \( \tilde{x}_{l,j,i}\) in the formulation and can be used to infer bounds for the output \( x_{l+1,j}\) of layer \( l\) via feasibility-based bounds tightening [49, 50]. Bounds can be inferred using:

\[ \begin{equation} \begin{aligned} x_{l+1,j}^\mathrm{ L} &= \left[ \sum_{i=1}^{n_l} \phi_{l,j,i}^\mathrm{ L} \right] + b_{l,j}^\mathrm{ l} \\ x_{l+1,j}^\mathrm{ U} &= \left[ \sum_{i=1}^{n_l} \phi_{l,j,i}^\mathrm{ U} \right] + b_{l,j}^\mathrm{ l} \end{aligned} \end{equation} \]

(26)

Moreover, the bounds for \( b\) and \( s\) can be determined using the bounds on \( \tilde{x}_{l,j,i}\) . For the base activation function \( b\) , assuming the SiLU activation is used, the global minimum of the SiLU function is known. Its value, -0.278465 [51], is specified as the lower bound for \( b_{l,j,i}\) . Since SiLU provides a smooth approximation of the ReLU activation function [51], the upper bound for \( b_{l,j,i}\) is defined as (Equation (27)),

\[ \begin{equation} b_{l,j,i}^\mathrm{ U} = \left\{ \begin{array}{ll} 0, & \tilde{x}_{l,j,i}^\mathrm{ U} \leq 0 \\ \tilde{x}_{l,j,i}^\mathrm{ U}, & \tilde{x}_{l,j,i}^\mathrm{ U} > 0 \end{array}\right. \end{equation} \]

(27)

To bound the spline output, the bounds for the \( k\) th-order basis functions need to be determined. From Equation (21), we know that \( B_{g,d}\) for any activation (\( l,j,i\) ) is always non-negative. Due to the constraint \( \sum_{g=0}^{G+k-1} B_{g,k} = 1\) and this non-negativity, the upper bound on \( B_{g,k}\) is one. Hence, for a set of coefficients \( \mathbf{c}\) , the spline output can be bounded as,

\[ \begin{equation} \begin{aligned} s_{l,j,i}^\mathrm{ L} &= \sum_{g=0}^{G+k-1} \min\{ 0, c_g\} \\ s_{l,j,i}^\mathrm{ U} &= \sum_{g=0}^{G+k-1} \max\{ 0, c_g\} \end{aligned} \end{equation} \]

(28)

Tighter bounds for \( x_{l+1,j}\) , \( b_{l,j,i}\) , and \( s_{l,j,i}\) can be obtained using optimality-based bounds tightening techniques [49], but this is not considered in this study.
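The bound computations of Equations (27) and (28) reduce to a few lines. A sketch with hypothetical numbers (the SiLU minimum is the constant quoted above):

```python
# Sketch of the FBBT bound rules in Section 3.2.1.

SILU_MIN = -0.278465   # global minimum of silu(x) = x / (1 + exp(-x)) [51]

def silu_bounds(x_up):
    """Equation (27): the lower bound is the SiLU global minimum; the
    upper bound is 0 if the input is never positive, else the input's
    upper bound (SiLU approximates ReLU from below for large x)."""
    return SILU_MIN, (0.0 if x_up <= 0 else x_up)

def spline_bounds(c):
    """Equation (28): bounds on s = sum_g c_g B_{g,k}, using only
    0 <= B_{g,k} <= 1 and the partition of unity sum_g B_{g,k} = 1."""
    lo = sum(min(0.0, cg) for cg in c)
    hi = sum(max(0.0, cg) for cg in c)
    return lo, hi

print(silu_bounds(2.0))                  # hypothetical input bound
print(spline_bounds([0.5, -1.2, 0.8]))   # hypothetical coefficients
```

Both rules use only fixed, trained quantities (input bounds and spline coefficients), so the tightening can be performed once, before the solver is invoked.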

3.2.2 Convex hull reformulation

In the MIQCP formulation proposed by [13], binary variables are introduced for each interval and, based on the value of \( \tilde{x}_{l,j,i}\) , the corresponding binary variable is activated via the big-M formulation. Herein, we also consider the equivalent convex hull reformulation derived from disjunctive programming for modeling Equation (15), which may tighten the formulation [52]. Equation (15) can be modelled as a disjunctive program,

\[ \begin{equation} \left[ \begin{array}{c} B_{l,j,i,g,0}(\tilde{x}_{l,j,i}) = 0 \\ \tilde{x}_{l,j,i} \notin [t_g, t_{g+1}] \end{array} \right] \vee \left[ \begin{array}{c} B_{l,j,i,g,0}(\tilde{x}_{l,j,i}) = 1 \\ \tilde{x}_{l,j,i} \in [t_g, t_{g+1}] \end{array} \right] \end{equation} \]

(29)

For an activation \( \phi_{l,j,i}\) , we introduce a set of auxiliary variables (\( \mathbf{z}\) ) indexed over each interval in the knot vector \( \mathcal{G}\) , such that \( z_{l,j,i,g} = B_{l,j,i,g,0} \tilde{x}_{l,j,i}\) . To implement the convex hull reformulation, Equations (30) – (32) are introduced:

\[ \begin{align} z_{l,j,i,g} & \leq t_{l,j,i,g+1} B_{l,j,i,g,0} \end{align} \]

(30)

\[ \begin{align} z_{l,j,i,g} & \geq t_{l,j,i,g} B_{l,j,i,g,0} \end{align} \]

(31)

\[ \begin{align} \sum_{g=0}^{G+k-1} z_{l,j,i,g} & = \tilde{x}_{l,j,i} \end{align} \]

(32)

Similar to the Big-M formulation, the convex hull formulation introduces mixed-integer linear constraints to model Equation (15). The convex hull reformulation is the tightest possible formulation for the disjunction in Equation (29) [52]. Whether it yields any computational benefit over the Big-M formulation will be assessed in Section 4 of this paper.
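To see how constraints (30) – (32) pin the input to the selected interval, consider the following illustrative check in plain Python (the helper `convex_hull_feasible` is hypothetical; in the actual formulation these are linear constraints handed to the solver):

```python
def convex_hull_feasible(x, knots, active, z, tol=1e-9):
    """Check constraints (30)-(32) of the convex hull reformulation.

    `knots[g]` and `knots[g+1]` delimit interval g, `active[g]` plays the
    role of the binary B_{g,0}, and `z[g]` is the auxiliary variable
    z_g = B_{g,0} * x. (Hypothetical helper, for illustration only.)"""
    for g in range(len(active)):
        # (30) and (31): z_g is squeezed to 0 when interval g is inactive
        if not (knots[g] * active[g] - tol <= z[g] <= knots[g + 1] * active[g] + tol):
            return False
    return abs(sum(z) - x) <= tol  # (32): the z_g sum back to x

# With exactly one active interval g*, z_g = 0 for every inactive g, so
# sum(z) = z_{g*} = x and (30)-(31) force knots[g*] <= x <= knots[g*+1].
knots = [0.0, 1.0, 2.0, 3.0]
x = 1.4                          # lies in interval g = 1
feasible = convex_hull_feasible(x, knots, [0, 1, 0], [0.0, x, 0.0])  # True
```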

3.2.3 Local support cuts

[13] proposed the inclusion of strengthening cuts for the MIQCP formulation for B-splines. One class of strengthening cuts proposed by [13] is the local support cuts, which are derived from the local support property of B-splines. These cuts exploit the property that a basis function \( B\) is bounded in \( [0,1]\) [13]. Local support cuts can be appended to the formulation via the following constraint:

\[ \begin{equation} B_{l,j,i,g,d} \leq \sum_{h=g}^{g+d+1-\check{d}} B_{l,j,i,h,\check{d}}, \, \check{d}=0,\ldots,k-1, \, d=\check{d}+1,\ldots,k, \, g=0,\ldots,G+k-1 \end{equation} \]

(33)

In Section 4 of this study, we will investigate the impact of appending local support cuts when optimizing over trained KANs.

3.2.4 Redundant cuts

Following the idea of local support cuts, herein we propose a new class of strengthening cuts. Again, we exploit the property that any basis function \( B\) is bounded in \( [0,1]\) . The constraint determining the higher-order spline basis functions (Equation (16)) is a sum of two bilinear terms. The factors participating in Equation (16) are the previous-order spline basis functions (\( B_{l,j,i,g,d-1}\) and \( B_{l,j,i,g+1,d-1}\) ) and the input to the spline \( \tilde{x}_{l,j,i}\) normalized depending on where it lies in the interval \( [t_{l,j,i,g}, t_{l,j,i,g+1}]\) . Since the knot vector \( \mathbf{t}\) is a sequence of non-decreasing numbers, both factors in each bilinear term are also bounded in \( [0,1]\) , where \( g=0,\ldots, G+2k-1-d\) and \( d=1,\ldots,k\) . Hence, we can introduce the linear cuts:

\[ \begin{equation} B_{l,j,i,g,d} \leq B_{l,j,i,g,d-1} + B_{l,j,i,g+1,d-1}, \, g = 0,\ldots, G+2k-1-d, \, d=1,\ldots,k \end{equation} \]

(34)

In Equation (34), the terms normalizing the input \( \tilde{x}_{l,j,i}\) based on where it lies in the interval \( [t_{l,j,i,g}, t_{l,j,i,g+1}]\) are assumed to be equal to one. However, depending on the value of \( \tilde{x}_{l,j,i}\) , these two normalizing terms cannot both take the value of one simultaneously. Moreover, since a basis function \( B\) is bounded in \( [0,1]\) , Equation (34) always holds. These inequalities are redundant because they do not change the solution obtained from a B-spline; however, they may strengthen the proposed formulation [53].
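The validity of Equation (34) can be checked numerically against the Cox-de Boor recursion. The sketch below implements the standard recursion (cf. Equation (16)) and is illustrative only, not the paper's implementation:

```python
def bspline_basis(g, d, x, t):
    """Cox-de Boor recursion: B_{g,d}(x) combines B_{g,d-1} and B_{g+1,d-1}
    with normalizing ratios that lie in [0, 1] wherever the corresponding
    lower-order basis function is nonzero."""
    if d == 0:
        return 1.0 if t[g] <= x < t[g + 1] else 0.0
    left = right = 0.0
    if t[g + d] > t[g]:
        left = (x - t[g]) / (t[g + d] - t[g]) * bspline_basis(g, d - 1, x, t)
    if t[g + d + 1] > t[g + 1]:
        right = ((t[g + d + 1] - x) / (t[g + d + 1] - t[g + 1])
                 * bspline_basis(g + 1, d - 1, x, t))
    return left + right

# Check the redundant cut B_{g,d} <= B_{g,d-1} + B_{g+1,d-1} on a
# uniform knot vector for a cubic (k = 3) spline.
knots = [float(i) for i in range(10)]
for x in (0.3, 2.7, 5.5, 8.2):
    for d in (1, 2, 3):
        for g in range(len(knots) - d - 1):
            lhs = bspline_basis(g, d, x, knots)
            rhs = (bspline_basis(g, d - 1, x, knots)
                   + bspline_basis(g + 1, d - 1, x, knots))
            assert lhs <= rhs + 1e-12
```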

Compared to the local support cuts defined in Equation (33), the redundant cuts introduce fewer linear constraints to the formulation. For a KAN having two inputs trained with \( G=6\) grid points using a cubic (\( k=3\) ) B-spline, 180 redundant cuts are introduced, whereas 324 local support cuts are introduced for the same KAN. Thus, local support cuts result in a formulation with a higher number of constraints than redundant cuts. Later in this paper, a more detailed analysis based on the numerical experiments conducted is presented to assess the effectiveness of these cuts.

3.2.5 Exploiting sparsity of B-splines in KANs

In the implementation of KAN training, [40] extend the grid by \( k\) points to both the left and the right of the lower and upper bounds within which the input to a given activation lies [54]. We exploit this implementation detail to ensure that the input to an activation remains within the range determined during training by fixing to zero the binary variables representing the zeroth-order basis functions on the extended grid points. Consequently, the dependent higher-order basis functions determined via Equation (16) are also fixed to zero.

\[ \begin{equation} B_{\hat{d},d} = 0, \, d=0,\ldots,k \text{, } \hat{d}=0,\ldots,k-d \text{ and } \hat{d}= G+k,\ldots,G+2k-d \end{equation} \]

(35)

3.2.6 Bounding the SiLU contribution to the activation in a KAN

Any activation in a KAN consists of a contribution from the trained B-spline and a base activation function (c.f. Equation (12)). Typically, the SiLU activation is used as the base activation function in a KAN. Section 3.2.1 outlined how bounds were provided for the SiLU contribution to an activation in KANs. However, tighter bounds can be easily derived for the SiLU activation function. [27] derived convex and concave envelopes for the SiLU activation function. The implementation of convex and concave envelopes is more suitable within a spatial branch-and-bound framework, where the bounds on variables are updated continuously. Herein, we employ the McCormick inequalities [55] to derive tighter bounds in the formulation for the SiLU activation function. The SiLU activation function can be expressed as a bilinear term, where

\[ \begin{equation} b(x) = x \times y \end{equation} \]

(36)

\[ \begin{equation} y(x) = \text{sigmoid}(x) = \frac{1}{1 + \exp{(-x)}} \end{equation} \]

(37)

The range of the sigmoid function is bounded in \( [0, 1]\) , which can be used to bound \( y\) . Since the sigmoid function increases monotonically, tighter bounds on \( y\) can be derived as \( [y(x^\mathrm{ L}), y(x^\mathrm{ U})]\) . The four linear inequalities that are added to the formulation are:

\[ \begin{align} b &\geq x^\mathrm{ L} y + x y(x^\mathrm{ L}) - x^\mathrm{ L} y(x^\mathrm{ L}) \end{align} \]

(38.a)

\[ \begin{align} b &\geq x^\mathrm{ U} y + x y(x^\mathrm{ U}) - x^\mathrm{ U} y(x^\mathrm{ U}) \end{align} \]

(38.b)

\[ \begin{align} b &\leq x^\mathrm{ U} y + x y(x^\mathrm{ L}) - x^\mathrm{ U} y(x^\mathrm{ L}) \end{align} \]

(38.c)

\[ \begin{align} b &\leq x^\mathrm{ L} y + x y(x^\mathrm{ U}) - x^\mathrm{ L} y(x^\mathrm{ U}) \end{align} \]

(38.d)
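Equations (38.a) – (38.d) can be evaluated pointwise to confirm that they envelop the SiLU function. A minimal sketch, assuming the input interval is known (the helper name is ours):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu_mccormick_bounds(x, x_lo, x_hi):
    """Bounds on b = x * y with y = sigmoid(x), per Equations (38.a)-(38.d).

    Because sigmoid is monotonically increasing, y is bounded by
    [sigmoid(x_lo), sigmoid(x_hi)] rather than the looser [0, 1]."""
    y_lo, y_hi = sigmoid(x_lo), sigmoid(x_hi)
    y = sigmoid(x)                                   # auxiliary variable (Equation (37))
    lower = max(x_lo * y + x * y_lo - x_lo * y_lo,   # (38.a)
                x_hi * y + x * y_hi - x_hi * y_hi)   # (38.b)
    upper = min(x_hi * y + x * y_lo - x_hi * y_lo,   # (38.c)
                x_lo * y + x * y_hi - x_lo * y_hi)   # (38.d)
    return lower, upper
```

At any `x` in the box, `lower <= x * sigmoid(x) <= upper` holds by the McCormick envelope property.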

In Figure 2 we compare the different bounding strategies for the SiLU activation function. The introduction of auxiliary variable \( y\) leads to tighter bounds with McCormick inequalities [56] than with the nominal McCormick approach [55, 57], as also shown in Figure 2 (c.f. the dashed green lines for bounding inequalities with an auxiliary variable for sigmoid function and the dashed yellow line for the one where no auxiliary variable is introduced). [27] proposed the use of secant estimator as concave envelope to over-estimate the SiLU function. However, the use of McCormick inequalities via the introduction of auxiliary variable \( y\) for \( \text{sigmoid}(x)\) leads to a tighter over-estimation of SiLU function.

Figure 2. Comparison of different bounding strategies for the SiLU function in the range \( [-5,5]\) . The McCormick under and over-estimators without the introduction of an auxiliary variable are derived using MC++ [58].

3.3 Implementation

The MINLP formulation for a trained KAN described in Section 3.1 has been implemented using Pyomo [33]. A Python class named NeuronBlockRule is defined to create a Pyomo block modelling a KAN activation connecting two neurons. Equations (13), (14), (16), and (18) - (24) are implemented in this class, forming the Default formulation for a KAN. Additionally, the proposed enhancements described in Sections 3.2.2 – 3.2.6 are also implemented in this class. The proposed enhancements are activated based on the input provided by the user via an options file. The options users must provide are described in Table 1.

Table 1 List of options that need to be passed through an options file
Option name | Description | Values allowed | Default value
reformulation | Formulation to model the \( 0^{\text{th}}\) -order basis functions in a B-spline in a KAN activation | Big-M, Convex Hull | Big-M
exploit_sparsity | Fixing basis functions corresponding to extended grid points to 0 (Equation (35)) | 0, 1 | 0
redundant_cuts | Appending redundant cuts to the formulation (Equation (34)) | 0, 1 | 0
local_support | Appending local support cuts to the formulation (Equation (33)) | 0, 1 | 0
silu_mccormick | Appending McCormick linear inequalities to the formulation (Equations (37) – (38)) | 0, 1 | 0

Hence, the Default configuration corresponds to all options in Table 1 being at their default values. The ConvexHull configuration is obtained when the reformulation option is changed to “Convex Hull" while the others remain at their default values. Similarly, the ExploitSparsity configuration is obtained when the exploit_sparsity option is set to 1 and the others are at their default values. The Redundant configuration is obtained by setting the option redundant_cuts to 1, the LocalSupport configuration by setting local_support to 1, and, finally, the McCormick configuration by setting silu_mccormick to 1.
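For concreteness, the mapping from option values to configuration names can be sketched as follows; the defaults mirror Table 1, but the lookup logic is our illustration, not the package's code:

```python
# Default option values, mirroring Table 1.
DEFAULTS = {
    "reformulation": "Big-M",
    "exploit_sparsity": 0,
    "redundant_cuts": 0,
    "local_support": 0,
    "silu_mccormick": 0,
}

def configuration_name(options):
    """Name the configuration implied by an options dictionary:
    each named configuration changes exactly one option from its default."""
    changed = frozenset(k for k, v in options.items() if v != DEFAULTS[k])
    names = {
        frozenset(): "Default",
        frozenset({"reformulation"}): "ConvexHull",
        frozenset({"exploit_sparsity"}): "ExploitSparsity",
        frozenset({"redundant_cuts"}): "Redundant",
        frozenset({"local_support"}): "LocalSupport",
        frozenset({"silu_mccormick"}): "McCormick",
    }
    return names.get(changed, "Custom")
```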

The LayerBlockRule class inherits from the NeuronBlockRule class to create a Pyomo block for a KAN layer comprising its neurons along with Equation (25). Finally, the optimization formulation of a trained KAN is instantiated via the Python script create_KAN by importing LayerBlockRule. Additionally, in this script the user can define the objective function and constraints relating to the scaling of inputs and outputs during training.

A Python script reads an instance of the model object resulting from training a KAN and exports the values of its trained parameters, which include: the number of layers \( L+1\) ; the vector of the number of neurons in each layer \( \mathbf{N}\) ; the coefficients \( c_{l,j,i,g}\) determining the spline activation value; the knot vectors \( \mathbf{t}_{l,j,i}\) based on Equation (17); the weights for the base and spline activation functions, \( w_{l,j,i}^\mathrm{ b}\) and \( w_{l,j,i}^\mathrm{ s}\) , respectively; the biases for each layer \( b_{l,j}^\mathrm{ l}\) ; the lower and upper bounds on all activations \( \phi_{l,j,i}\) as provided by the trained KAN model object; and the scaler parameters used to normalize the training data. In addition, this script performs one round of FBBT calculations as outlined in Section 3.2.1 to derive bounds on \( x_{l+1,j}\) , \( b_{l,j,i}\) , and \( s_{l,j,i}\) . The script returns a JSON file comprising the values of the trained parameters and the bounds described above. For instantiating a KAN, the user needs to provide the name of the JSON file in the create_KAN script.
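The exported data can be illustrated with a small JSON round-trip sketch; all field names below are assumptions for illustration, not the actual schema produced by the export script:

```python
import json

# Hypothetical export of a tiny KAN; keys such as "spline_coefficients"
# are illustrative placeholders, not the real schema.
kan_data = {
    "num_layers": 2,                                     # L + 1
    "neurons_per_layer": [2, 2, 1],                      # vector N
    "spline_coefficients": {"0,0,0": [0.1, -0.3, 0.2]},  # c_{l,j,i,g}
    "knot_vectors": {"0,0,0": [-2.0, -1.0, 0.0, 1.0, 2.0]},
    "base_weights": {"0,0,0": 0.5},                      # w^b
    "spline_weights": {"0,0,0": 1.2},                    # w^s
    "activation_bounds": {"0,0,0": [-2.0, 2.0]},
}

blob = json.dumps(kan_data, indent=2)  # what the export script writes to file
restored = json.loads(blob)            # what create_KAN reads back
```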

4 Results and discussion

Herein, we describe the computational experiments carried out to assess the effectiveness of the proposed formulation and all the enhancements. Moreover, we discuss the results of running the six different configurations on all trained instances. This is followed by a comparison of the effect of KAN architecture on the computational effort required to globally optimize over trained KANs. We discuss the impact of grid size, number of neurons, layers, and inputs. Finally, we compare the computational effort needed to optimize over trained MLPs and KANs.

4.1 Computational experiments

To understand the computational effort required to optimize over a trained KAN of varying architecture, two standard optimization test functions are considered, namely the Rosenbrock and peaks functions. The peaks function is a two-dimensional function expressed as:

\[ \begin{align} f_\mathrm{ peaks}(x_1, x_2) &= 3(1 - x_1)^2 \exp[{-x_1^2 - (x_2 + 1)^2}] - 10 \left( \frac{x_1}{5} - x_1^3 - x_2^5 \right) \exp[{-x_1^2 - x_2^2}] \\ & - \frac{\exp[{-(x_1 + 1)^2 - x_2^2}]}{3} \end{align} \]

(39)

The domain \( D\) considered is \( \{x_1, x_2 \in \mathbb{R}: -3 \leq x_1, x_2 \leq 3\}\) . In this domain the global minimum for \( f_\mathrm{ peaks}\) is -6.551 at \( x_1 = 0.228\) and \( x_2 = -1.626\) [23].
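For reference, the peaks function and its reported minimum can be checked with a direct implementation of Equation (39):

```python
import math

def f_peaks(x1, x2):
    """Peaks test function (Equation (39))."""
    return (3 * (1 - x1) ** 2 * math.exp(-x1 ** 2 - (x2 + 1) ** 2)
            - 10 * (x1 / 5 - x1 ** 3 - x2 ** 5) * math.exp(-x1 ** 2 - x2 ** 2)
            - math.exp(-(x1 + 1) ** 2 - x2 ** 2) / 3)
```

Evaluating at the reported minimizer, `f_peaks(0.228, -1.626)` is approximately -6.551.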

The Rosenbrock function is an \( n\) -dimensional function expressed as:

\[ \begin{equation} f_\mathrm{ ros}(\mathbf{x}) = \sum_{i=1}^{n-1} [100(x_{i+1} - x_i^2)^2 + (x_i - 1)^2] \end{equation} \]

(40)

For the Rosenbrock function, \( D = \{x_1, …, x_n \in \mathbb{R}: -2.048 \leq x_1, …, x_n \leq 2.048\}\) , and the global minimum of \( f_\mathrm{ ros}\) is 0, at \( x_1 = … = x_n = 1\) [59]. Computational experiments are conducted with the Rosenbrock function for \( n_0=\{3,5,10\}\) .
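Likewise, Equation (40) in code:

```python
def f_ros(x):
    """n-dimensional Rosenbrock function (Equation (40))."""
    return sum(100 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1) ** 2
               for i in range(len(x) - 1))
```

The minimum value 0 is attained at the all-ones vector for any dimension.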

Table 2 Details of varying architectures used for computational experiments
Function | # grid points | # neurons (# grid points) | # layers (# grid points)
\( f_\mathrm{ peaks}\) | \( \{3,6,…,24,27\}\) | \( \{2,3,…,9,10\}\) (15) | \( \{1,2,3,4,5\}\) (15)
\( f_\mathrm{ ros} (n_0=3)\) | \( \{3,6,…,24,27\}\) | \( \{2,3,…,9,10\}\) (12) | \( \{1,2,3,4,5\}\) (12)
\( f_\mathrm{ ros} (n_0=5)\) | \( \{3,6,…,24,27\}\) | \( \{2,3,…,7,8\}\) (6) | \( \{1,2,3,4,5\}\) (6)
\( f_\mathrm{ ros} (n_0=10)\) | \( \{3,6,…,24,27\}\) | \( \{2,3,4,5,6\}\) (3) | \( \{1,2,3,4,5\}\) (3)

Data is generated for both the Rosenbrock and peaks functions, and KANs are trained. The number of samples used for training and testing, the resulting root mean square error on both subsets, and the time needed to train each KAN are provided in the Appendix. The effects of grid size, the number of neurons in a layer, and the number of layers on the computational effort required to optimize a trained KAN are investigated. The details of the varying architectures are provided in Table 2. For both functions, the grid size is varied from 3 to 27 in steps of 3 for a KAN with one layer and two neurons. The number of neurons in one layer is varied from 2 to 10 for a fixed grid size, and finally, the number of layers is varied from 1 to 5, with two neurons in each layer, for a fixed grid size. The fixed grid size used for investigating the effect of the number of neurons and layers is reported in the corresponding column of Table 2.

By considering different architectures, 86 instances of KANs are generated and tested on six different configurations as outlined in Table 1. All instances are solved using the global MINLP solver SCIP (v9.0.1) [41] via the AMPL Solver Library (ASL) in Pyomo. A default gap tolerance of zero is used for convergence of the solver [60]. A time limit of 2 hours is set in case SCIP is unable to converge to the gap tolerance used. Additionally, multi-layer perceptrons with ReLU activation are trained using the same dataset used to train the KANs and solved using SCIP with the same settings by importing the trained network using OMLT [32]. A performance comparison of the computational effort needed to optimize over trained MLPs and KANs is then carried out. All computational experiments are carried out on a Dell Latitude 7440 laptop with a 13th generation Intel® Core® i7-1365U processor @ 1.80 GHz with 16 GB of RAM, running Windows 10.

4.2 Comparison of configurations

For an overall comparison of the different configurations considered, a modified performance profile is presented [61] in Figure 3, showing the proportion of instances solved by a configuration at a given time within the time limit of 2 hours.

Figure 3. Performance profiles comparing the different formulation configurations of KANs

A large proportion (around \( 33\% - 45\%\) ) of the instances are solved within 10 minutes, depending on the configuration. However, regardless of the configuration, only a few additional instances are solved in more than 10 minutes and less than 2 hours, with around \( 60\%\) of the instances solved in total. This is a modest increase over the proportion solved at 10 minutes, indicating that the computational effort required to optimize over a trained KAN increases significantly with its size (a more detailed discussion is provided in the upcoming subsections).

Among the different configurations, the largest number of instances at the end of the time limit is solved using the LocalSupport configuration. However, for relatively simple instances (those solved in less than 10 minutes), the LocalSupport configuration takes longer to converge: while local support cuts strengthen the formulation, they increase the problem's size due to the additional inequalities. Similarly, owing to the extra constraints appended in the Redundant configuration, a smaller proportion of instances is solved with it at any given time. Moreover, the results indicate that the cuts appended in the Redundant configuration are relatively weaker than those of the LocalSupport configuration, as corroborated by the lower proportion of instances solved at the time limit with the Redundant configuration relative to the LocalSupport configuration.

Neither the ExploitSparsity nor the McCormick configuration offers any benefit overall, as seen in Figure 3. With the ConvexHull configuration, a larger proportion of instances is solved within 1000 to 3000 seconds, indicating that for moderately difficult instances the use of the convex hull reformulation over Big-M is beneficial. Overall, the Default configuration performs best, as corroborated by the ‘Shifted Geometric Wall-Time Mean’ (SGWM) [62] reported in Table 3. The SGWM is calculated as,

\[ \begin{equation} \text{SGWM} = \exp{\left[ \frac{\sum_{m=1}^{n_\mathrm{ ins}} \ln(\tau_m + \alpha)}{n_\mathrm{ ins}} \right]} - \alpha \end{equation} \]

(41)

Table 3 SGWM for different configurations over 86 instances of KANs
Configuration | Default | ConvexHull | ExploitSparsity | Redundant | LocalSupport | McCormick
SGWM (s) | 799.63 | 809.05 | 832.35 | 1140.80 | 1235.14 | 841.75

In Equation (41), \( \tau_m\) refers to the wall time required to optimize an instance of a KAN, \( n_\mathrm{ ins}\) is the number of instances (86 instances of KANs are solved in this study), and \( \alpha\) is the shift parameter, set to 5. Henceforth, all subsequent discussions are based on the Default configuration of KANs.
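Equation (41) translates directly into code, with the shift parameter defaulting to the value used in this study:

```python
import math

def sgwm(times, shift=5.0):
    """Shifted geometric wall-time mean (Equation (41)): the exponential of
    the mean log of shifted times, with the shift subtracted back out."""
    n = len(times)
    return math.exp(sum(math.log(t + shift) for t in times) / n) - shift
```

The shift damps the influence of very small runtimes; for widely dispersed times the SGWM sits well below the arithmetic mean.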

4.3 Effect of grid-size

For KANs, the number of grid points used to approximate a function using B-splines is a hyperparameter linked to the bias-variance tradeoff: a coarse grid leads to underfitting and a fine grid leads to overfitting [40]. In this section, we discuss the impact of varying grid size on the root mean square error (RMSE) on the testing set and on the computational effort required to optimize the trained KAN. From Table 4 it is evident that, as the number of grid points increases, the test error first decreases and then starts increasing. Such behaviour indicates overfitting and is consistent with the results reported in [40]. However, this is not uniformly observed (c.f. the test RMSE for \( G=12, …, \, 24\) ) for the case of \( f_\mathrm{ peaks}\) , possibly due to the stochastic nature of the training [40]. For all the test functions, it is generally observed that increasing the grid size increases the computational effort required to optimize the KAN.

Table 4 Trade-off between RMSE on the testing set and wall time (in seconds) denoted as ‘Time’ required to solve a trained KAN with one hidden layer containing two neurons by varying the grid size \( G\) .
\( G\) | \( f_\mathrm{ peaks} (n_0=2)\) | \( f_\mathrm{ ros} (n_0=3)\) | \( f_\mathrm{ ros} (n_0=5)\) | \( f_\mathrm{ ros} (n_0=10)\)
 | RMSE / Time (s) | RMSE / Time (s) | RMSE / Time (s) | RMSE / Time (s)
3 | 4.12E-01 / 5 | 1.30E-03 / 6 | 3.21E-01 / 40 | 3.71E-01 / 7200
6 | 3.87E-01 / 5 | 1.19E-03 / 16 | 3.20E-01 / 46 | 3.73E-01 / 7200
9 | 3.68E-01 / 12 | 1.20E-03 / 28 | 3.18E-01 / 95 | 3.74E-01 / 7200
12 | 3.52E-01 / 13 | 1.20E-03 / 37 | 3.19E-01 / 7200 | 3.74E-01 / 347
15 | 6.49E-01 / 25 | 1.21E-03 / 89 | 3.18E-01 / 577 | 3.73E-01 / 364
18 | 5.42E-01 / 30 | 1.21E-03 / 130 | 3.17E-01 / 7200 | 3.73E-01 / 7200
21 | 3.58E-01 / 27 | 1.22E-03 / 84 | 3.19E-01 / 341 | 3.70E-01 / 7200
24 | 2.95E-01 / 82 | 1.23E-03 / 85 | 3.23E-01 / 5053 | 3.97E-01 / 7200
27 | 4.86E-01 / 210 | 1.23E-03 / 185 | 3.49E-01 / 7200 | 4.16E-01 / 2756

Again, some outliers are observed in Table 4; see, for example, the wall time taken for the case of \( f_\mathrm{ ros}(n_0=3)\) for \( G=15,\,…,\,21\) . A possible reason for the occurrence of these outliers is that some of the trained activations may exhibit non-smooth behaviour, which adversely impacts the computational effort required to optimize a trained KAN. A more detailed analysis is not conducted in this work.

4.4 Effect of number of neurons

As in the previous subsection, a similar analysis is presented in Table 5 to investigate the impact of increasing the number of neurons in a KAN.

Table 5 Trade-off between RMSE on the testing set and wall time (in seconds) denoted as ‘Time’ required to solve a trained KAN for all cases except \( f_\mathrm{ ros}(n_0=10)\) and the relative gap denoted as ‘Gap’ at the end of time limit is reported for the \( f_\mathrm{ ros}(n_0=10)\) case. Trained KANs with one hidden layer containing \( n_1\) neurons with grid points as defined in Table 2 are under consideration.
\( n_1\) | \( f_\mathrm{ peaks} (n_0=2)\) | \( f_\mathrm{ ros} (n_0=3)\) | \( f_\mathrm{ ros} (n_0=5)\) | \( f_\mathrm{ ros} (n_0=10)\)
 | RMSE / Time (s) | RMSE / Time (s) | RMSE / Time (s) | RMSE / Gap (%)
2 | 6.49E-01 / 17 | 1.20E-03 / 36 | 3.20E-01 / 44 | 3.71E-01 / 156.90
3 | 6.11E-01 / 121 | 7.14E-06 / 151 | 2.63E-01 / 7200 | 3.46E-01 / 452.02
4 | 1.23E-02 / 207 | 7.79E-06 / 3459 | 6.76E-02 / 7200 | 3.19E-01 / 229.22
5 | 4.12E-03 / 391 | 5.44E-05 / 5038 | 1.18E-04 / 7200 | 2.86E-01 / 354.60
6 | 4.80E-03 / 1292 | 7.56E-05 / 7200 | 2.00E-04 / 7200 | 2.60E-01 / 4868.61
7 | 2.81E-03 / 2066 | 9.84E-05 / 7200 | 1.08E-04 / 7200 |
8 | 1.32E-03 / 3355 | 8.32E-05 / 7200 | 1.06E-04 / 7200 |
9 | 1.99E-03 / 2978 | 8.38E-03 / 7200 | |
10 | 4.64E-03 / 3578 | 1.53E-04 / 7200 | |

Again, on increasing the number of neurons, trends of overfitting are seen for all test functions except \( f_\mathrm{ ros} (n=5)\) . Moreover, the computational effort required to optimize the KAN increases rapidly with the number of neurons. In Table 5, the trade-off between the wall time needed to achieve convergence (or the relative gap at the time limit) and the test RMSE is markedly more evident than for the effect of grid size. For each neuron added in layer \( l\) , \( n_{l-1} + n_{l+1}\) activations are introduced in the KAN. For each activation, B-spline and SiLU equations need to be introduced into the formulation, thereby significantly increasing the computational effort required to solve the KAN. For the KAN instances considered here, the subsequent layer contains one neuron representing the output and the preceding layer comprises all the inputs. Hence, for KANs with a higher number of inputs, the computational effort needed to optimize increases more rapidly. For instance, considering \( f_\mathrm{ ros} (n=3)\) and \( f_\mathrm{ ros} (n=5)\) , instances with more than 5 and 2 neurons, respectively, cannot converge within the time limit. Remarkably, for \( f_\mathrm{ ros} (n=3)\) with 3 neurons, one can obtain a highly accurate solution. Moreover, the low wall time (151 seconds) needed to solve this KAN to global optimality provides an added benefit, highlighting the tremendous potential of KANs as surrogate models that can be solved with little computational effort while maintaining a high level of accuracy. Hence, if the goal is to optimize over a trained KAN, pruning of neurons [40] should be considered to reduce the number of neurons while maintaining a similar level of accuracy.

4.5 Effect of layers

Continuing with the discussion on the impact of KAN architecture, here we discuss the impact of the number of layers on the trade-off between test RMSE and the wall time required to optimize a trained KAN.

Table 6 Trade-off between RMSE on the testing set and wall time (in seconds) denoted as ‘Time’ required to solve a trained KAN for all cases except \( f_\mathrm{ ros}(n_0=10)\) and the relative gap denoted as ‘Gap’ at the end of time limit is reported for the \( f_\mathrm{ ros}(n_0=10)\) case. Trained KANs with \( L+1\) layers including the input and output layers, with each hidden layer containing two neurons with grid points as defined in Table 2, are under consideration.
\( L+1\) | \( f_\mathrm{ peaks} (n_0=2)\) | \( f_\mathrm{ ros} (n_0=3)\) | \( f_\mathrm{ ros} (n_0=5)\) | \( f_\mathrm{ ros} (n_0=10)\)
 | RMSE / Time (s) | RMSE / Time (s) | RMSE / Time (s) | RMSE / Gap (%)
3 | 6.49E-01 / 17 | 9.82E-02 / 45 | 3.53E-01 / 102 | 3.71E-01 / 156.90
4 | 1.80E-01 / 346 | 1.22E-03 / 1568 | 3.40E-01 / 188 | 5.55E-01 / 736.69
5 | 2.65E-01 / 1119 | 6.82E-02 / 7200 | 3.22E-01 / 2208 | 3.74E-01 / 43.83
6 | 1.69E-01 / 648 | 5.35E-01 / 7200 | 8.65E-01 / 7200 | 3.73E-01 / 25.26
7 | 1.59E+00 / 7200 | 1.33E-03 / 288 | 4.87E-01 / 7200 | 3.85E-01 / 95.12

In Table 6, the total number of layers \( L+1\) , which includes the input and output layers, is reported. Hence, if the reported value of \( L+1\) is three, the number of hidden layers in the KAN is one. On varying the number of layers, there is a marked deterioration in the test RMSE for all the functions. All KANs are trained using the same set of hyperparameters, and tuning them may improve the performance of KANs; however, efficient training of KANs is not the focus of this study. A trade-off as discernible as the one seen previously is not observed in terms of computational effort on increasing the number of layers. To explain this, we consider a KAN with one hidden layer containing two neurons. The number of inputs is assumed to be the same as in the computational experiments, i.e., \( 2,\, 3,\, 5, \text{ and }10\) . For a KAN with one hidden layer of two neurons and these numbers of inputs, there are 6, 8, 12 and 22 activations, respectively. On adding one neuron to the hidden layer, the number of activations increases to 9, 12, 18 and 33, respectively. On the other hand, if a hidden layer with two neurons is appended to this KAN, the number of activations increases to 10, 12, 16 and 26, respectively. Thus, for KANs with fewer inputs, increasing the number of neurons in the hidden layer immediately after the input layer introduces fewer activations than adding a hidden layer with two neurons; for KANs with five or more inputs, however, adding an additional hidden layer introduces fewer activations. The computational effort required depends crucially on the number of activations in the network and the number of grid points used to define each activation. This is evident for \( f_\mathrm{ ros}(n=5)\) and \( f_\mathrm{ ros}(n=10)\) : a high proportion of deeper networks can be solved for \( f_\mathrm{ ros}(n=5)\) , and the relative gaps at the end of the time limit are smaller for \( f_\mathrm{ ros}(n=10)\) than for a KAN with a higher number of neurons.
With a more careful approach to training, better test RMSEs may be achieved with deeper networks.
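The activation counts quoted above follow from the KAN connectivity, where each pair of neurons in adjacent layers is joined by one learnable activation. A one-line sketch:

```python
def num_activations(widths):
    """Number of activations in a KAN with layer widths [n_0, ..., n_L]:
    every neuron in layer l connects to every neuron in layer l+1
    through one learnable activation."""
    return sum(widths[l] * widths[l + 1] for l in range(len(widths) - 1))
```

For example, a KAN with 10 inputs, one hidden layer of two neurons, and one output has `num_activations([10, 2, 1]) == 22` activations; appending a second two-neuron hidden layer gives 26, while widening the single hidden layer to three neurons gives 33.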

4.6 Effect of number of inputs

Table 7 The effect of the number of inputs on different configurations of KAN. This table reports the shifted geometric wall time mean (SGWM) in seconds and number of instances that are solved within the time limit for different cases of number of inputs. The total number of instances solved are shown in parentheses next to the number of inputs.
Configuration | SGWM (s) for 2 / 3 / 5 / 10 inputs | # solved for 2 (23) / 3 (23) / 5 (21) / 10 (19) inputs
Default | 181.89 / 453.25 / 1407.98 / 4992.79 | 22 / 16 / 10 / 3
ConvexHull | 207.36 / 412.83 / 1488.19 / 4743.12 | 22 / 16 / 10 / 3
ExploitSparsity | 213.81 / 501.14 / 1397.83 / 4414.79 | 22 / 16 / 10 / 3
Redundant | 342.36 / 706.92 / 1786.42 / 5270.19 | 22 / 15 / 10 / 3
LocalSupport | 399.11 / 726.47 / 1942.14 / 5540.67 | 22 / 17 / 11 / 3
McCormick | 210.68 / 459.15 / 1383.96 / 5307.08 | 21 / 16 / 11 / 2

In the discussion of results in the previous subsection, the effect of the number of inputs to the KAN was noted. In this section, a deeper analysis is conducted on the impact of the number of inputs on the computational effort needed to optimize over a KAN. Table 7 reports the number of instances solved and the SGWM for KANs with 2, 3, 5 and 10 inputs. It is clearly evident that, for all the configurations, the number of instances that can be solved decreases, and the SGWM increases, with an increase in the number of inputs. Similar behaviour is expected if there are multiple outputs in a KAN, due to the nature of the KAN architecture. While the benefits of the ExploitSparsity configuration were not clearly evident in Section 4.2, for KANs with a large number of inputs, fixing the basis functions corresponding to extended grid points results in smaller models and leads to lower mean runtimes for KANs with five and ten inputs. Similarly, minor benefits are also observed for the McCormick configuration for KANs with five inputs. The earlier observation regarding the benefits of the convex hull formulation over the Big-M formulation is also corroborated here (c.f. the SGWM for the ConvexHull configuration for 3 and 10 inputs in Table 7). Finally, as seen earlier, no computational benefit is derived from the Redundant configuration.

4.7 Comparison with multilayer perceptrons

Table 8 Description of architectures of different MLPs trained and how they are denoted
MLP | Description
1 | One hidden layer with 16 neurons
2 | One hidden layer with 64 neurons
3 | Two hidden layers with 16 neurons each
4 | Two hidden layers with 64 neurons each
5 | Three hidden layers with 16 neurons each
6 | Three hidden layers with 64 neurons each
Table 9 Performance comparison between the best KAN and MLPs of varying architectures. The best KAN for \( f_\mathrm{ peaks}\) , \( f_\mathrm{ ros}(n=3)\) , \( f_\mathrm{ ros}(n=5)\) , and \( f_\mathrm{ ros}(n=10)\) is one layer with seven neurons with McCormick configuration, one layer with four neurons with Exploitsparsity configuration, one layer with five neurons with Exploitsparsity configuration, and one layer with two neurons having 15 grid points with ConvexHull configuration, respectively. The description of architectures of MLPs follows Table 8. Best known indicates the value of the objective at convergence or at the time limit denoted as ‘PB’, and best possible solution represents the dual bound at convergence or at the time limit denoted as ‘DB’.
 | KAN | MLP 1 | MLP 2 | MLP 3 | MLP 4 | MLP 5 | MLP 6
\( f_\mathrm{ peaks}\) :
RMSE | 2.8E-3 | 7.3E-1 | 6.1E-1 | 5.2E-1 | 2.2E-1 | 2.7E-1 | 5.7E-2
Time (s) | 1138 | 0.01 | 2 | 2 | 60 | 45 | 7200
PB | -6.551 | -2.126 | -2.844 | -3.305 | -6.293 | -5.564 | -6.775
DB | -6.551 | -2.126 | -2.844 | -3.305 | -6.293 | -5.564 | -21.98
\( f_\mathrm{ ros}(n=3)\) :
RMSE | 7.8E-6 | 3.4E-1 | 2.0E-1 | 2.0E-1 | 5.3E-2 | 9.1E-2 | 4.2E-2
Time (s) | 3096 | 0.01 | 1 | 0.05 | 7200 | 2 | 7200
PB | 0.00091 | -519.37 | -195.24 | -222.84 | -6.12 | -68.82 | -11.24
DB | 0.00091 | -519.37 | -195.24 | -222.84 | -707.5 | -68.82 | -4705.21
\( f_\mathrm{ ros}(n=5)\) :
RMSE | 1.2E-4 | 4.9E-1 | 2.4E-1 | 2.8E-1 | 1.1E-1 | 1.7E-1 | 8.2E-2
Time (s) | 7200 | 0.01 | 2 | 1 | 7200 | 1 | 7200
PB | 0.27 | -1249.75 | -724.71 | -866.48 | -276.77 | -6.38 | -102.15
DB | 0.19 | -1249.75 | -724.71 | -866.48 | -1834.25 | -6.38 | -9856.67
\( f_\mathrm{ ros}(n=10)\) :
RMSE | 3.7E-1 | 6.3E-1 | 3.4E-1 | 4.7E-1 | 2.6E-1 | 3.9E-1 | 2.4E-1
Time (s) | 244 | 0.01 | 3 | 1 | 7200 | 1 | 7200
PB | -667.31 | -648.3 | -3156.46 | -1299.29 | -1013.9 | -544.3 | 558.89
DB | -667.31 | -648.3 | -3156.46 | -1299.29 | -7981.19 | -544.3 | -26727.8

For a comprehensive evaluation of the suitability of KANs as surrogate models, they are compared with MLPs. Using the same dataset, six MLPs of varying architectures are trained for each test function considered. The architectures considered are MLPs with one to three hidden layers, each with 16 or 64 neurons. The MLPs are trained using TensorFlow [63]. The trained MLPs are loaded as Pyomo blocks using OMLT [32] and then optimized using SCIP [41] on the same machine with the same settings as described in Section 4.1. The results of the optimization of the MLPs are reported in Table 9.

For \( f_\mathrm{ peaks}\) , \( f_\mathrm{ ros}(n=3)\) , and \( f_\mathrm{ ros}(n=5)\) , while the smaller MLPs can be optimized very quickly, the objective obtained with those MLPs is not accurate. The known global minimum for \( f_\mathrm{ peaks}\) is -6.551, and for \( f_\mathrm{ ros}\) it is 0. In comparison, optimizing over a KAN yields a more accurate objective, and the KAN can be solved in less time than the larger MLPs (MLP_2_64 and MLP_3_64). Although the KAN did not converge for \( f_\mathrm{ ros}(n=5)\) , both the best known and best possible objectives are close to the actual value of zero, indicating the potential benefits that KANs offer as surrogate models over MLPs. However, for \( f_\mathrm{ ros}(n=10)\) , none of the networks is able to capture the objective accurately. In general, it is observed that the MLPs do not provide sufficient accuracy with the current set of data points and hyperparameters used for training. Using more training data or conducting hyperparameter tuning may improve the performance of the MLPs; however, this is not the focus of this work. The computational effort to solve MLPs with a large number of inputs does not scale as drastically as for KANs, indicating that for surrogate models with more than five inputs, MLPs are a more suitable choice. All test functions considered in this work have one output. Similar behaviour is expected if more outputs were to be predicted using a KAN surrogate model, since on increasing the number of outputs the number of activations increases in the same way as for the inputs.

5 Conclusion

In this study, we proposed an MINLP formulation to optimize over a trained KAN, together with several enhancements aimed at improving its effectiveness. We found that the formulation based on the Default configuration offers the best performance on balance. The addition of LocalSupport cuts, while beneficial for larger KANs, slows down convergence for smaller KANs; selectively adding these constraints by ranking their effectiveness may make these cuts more useful [64]. We also observed that the formulation based on the ConvexHull configuration offered computational benefits for moderately difficult instances, and partition-based formulations may improve computational performance further [65]. Finally, providing tighter bounds for the different activations via optimization-based bounds tightening [49] may further strengthen the formulation.

Additionally, we conducted a thorough investigation of how the size of a KAN (the number of grid points, neurons, layers, and inputs) affects the computational effort required to optimize it. We observed that this effort depends crucially on the number of activations in the network and less strongly on the number of grid points used to define each activation. The number of activations, in turn, depends strongly on the number of inputs and outputs of the network. KANs with fewer than five inputs offer high accuracy with relatively tractable runtimes and are therefore a promising alternative as surrogate models in such cases. With more than five inputs, KANs become less attractive as surrogate models due to the prohibitive computational cost of optimizing them; in these cases, careful architectural choices, such as training deeper rather than wider KANs, can help improve tractability.

Declarations

Funding All authors gratefully acknowledge financial support from Shell Global Solutions International B.V. for conducting the research in this study.

Competing interests The authors have no relevant financial or non-financial interests to disclose.

Supplementary information The Zenodo repository for electronic supplementary information containing all the log files from all optimization runs can be accessed from https://zenodo.org/records/14961066.

Code availability The Pyomo formulation presented in this study can be accessed from https://github.com/process-intelligence-research/optimization-over-KANs.

Author contribution All authors contributed to the study conception and design. The formulation development, implementation, testing and analysis were performed by Tanuj Karia. The first draft of the manuscript was written by Tanuj Karia and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript. Funding was acquired by Artur M. Schweidtmann.


References

[1] Fischetti, M., Fraccaro, M.: Machine learning meets mathematical optimization to predict the optimal production of offshore wind parks. Computers & Operations Research 106, 289– 297 ( 2019) https://doi.org/10.1016/j.cor.2018.04.006

[2] Abbasi, B., Babaei, T., Hosseinifard, Z., Smith-Miles, K., Dehghani, M.: Predicting solutions of large-scale optimization problems via machine learning: A case study in blood supply chain management. Computers & Operations Research 119, 104941 ( 2020) https://doi.org/10.1016/j.cor.2020.104941

[3] Schweidtmann, A.M., Bongartz, D., Mitsos, A.: In: Pardalos, P.M., Prokopyev, O.A. (eds.) Optimization with Trained Machine Learning Models Embedded, pp. 1–8. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-54621-2_735-1

[4] Misener, R., Biegler, L.: Formulating data-driven surrogate models for process optimization. Computers & Chemical Engineering 179, 108411 ( 2023) https://doi.org/10.1016/j.compchemeng.2023.108411

[5] Dowling, A.W., Eason, J.P., Ma, J., Miller, D.C., Biegler, L.T.: Coal oxycombustion power plant optimization using first principles and surrogate boiler models. Energy Procedia 63, 352– 361 ( 2014) https://doi.org/10.1016/j.egypro.2014.11.038 . 12th International Conference on Greenhouse Gas Control Technologies, GHGT-12

[6] Palmer, K., Realff, M.: Optimization and validation of steady-state flowsheet simulation metamodels. Chemical Engineering Research and Design 80( 7), 773– 782 ( 2002) https://doi.org/10.1205/026387602320776849

[7] Ma, K., Sahinidis, N.V., Bindlish, R., Bury, S.J., Haghpanah, R., Rajagopalan, S.: Data-driven strategies for extractive distillation unit optimization. Computers & Chemical Engineering 167, 107970 ( 2022) https://doi.org/10.1016/j.compchemeng.2022.107970

[8] Ma, K., Sahinidis, N.V., Amaran, S., Bindlish, R., Bury, S.J., Griffith, D., Rajagopalan, S.: Data-driven strategies for optimization of integrated chemical plants. Computers & Chemical Engineering 166, 107961 ( 2022) https://doi.org/10.1016/j.compchemeng.2022.107961

[9] Cozad, A., Sahinidis, N.V.: A global MINLP approach to symbolic regression. Mathematical Programming 170( 1), 97– 119 ( 2018) https://doi.org/10.1007/s10107-018-1289-x

[10] Zhang, Q., Grossmann, I.E., Sundaramoorthy, A., Pinto, J.M.: Data-driven construction of Convex Region Surrogate models. Optimization and Engineering 17( 2), 289– 332 ( 2016) https://doi.org/10.1007/s11081-015-9288-8

[11] Grimstad, B., Knudsen, B.R.: Mathematical programming formulations for piecewise polynomial functions. Journal of Global Optimization 77( 3), 455– 486 ( 2020) https://doi.org/10.1007/s10898-020-00881-4

[12] Grimstad, B., Sandnes, A.: Global optimization with spline constraints: a new branch-and-bound method based on B-splines. Journal of Global Optimization 65( 3), 401– 439 ( 2016) https://doi.org/10.1007/s10898-015-0358-4

[13] Grimstad, B.: A MIQCP formulation for B-spline constraints. Optimization Letters 12( 4), 713– 725 ( 2018) https://doi.org/10.1007/s11590-017-1190-1

[14] Schweidtmann, A.M., Bongartz, D., Grothe, D., Kerkenhoff, T., Lin, X., Najman, J., Mitsos, A.: Deterministic global optimization with Gaussian processes embedded. Mathematical Programming Computation 13( 3), 553– 581 ( 2021) https://doi.org/10.1007/s12532-021-00204-y

[15] Beykal, B., Onel, M., Onel, O., Pistikopoulos, E.N.: A data-driven optimization algorithm for differential algebraic equations with numerical infeasibilities. AIChE Journal 66( 10), 16657 ( 2020) https://doi.org/10.1002/aic.16657 https://aiche.onlinelibrary.wiley.com/doi/pdf/10.1002/aic.16657

[16] Schweidtmann, A.M., Weber, J.M., Wende, C., Netze, L., Mitsos, A.: Obey validity limits of data-driven models through topological data analysis and one-class classification. Optimization and Engineering 23( 2), 855– 876 ( 2022) https://doi.org/10.1007/s11081-021-09608-0

[17] Mišić, V.V.: Optimization of tree ensembles. Operations Research 68(5), 1605–1624 (2020) https://doi.org/10.1287/opre.2019.1928

[18] Mistry, M., Letsios, D., Krennrich, G., Lee, R.M., Misener, R.: Mixed-integer convex nonlinear optimization with gradient-boosted trees embedded. INFORMS Journal on Computing 33(3), 1103–1119 (2021) https://doi.org/10.1287/ijoc.2020.0993

[19] Thebelt, A., Kronqvist, J., Mistry, M., Lee, R.M., Sudermann-Merx, N., Misener, R.: Entmoot: A framework for optimization over ensemble tree models. Computers & Chemical Engineering 151, 107343 ( 2021) https://doi.org/10.1016/j.compchemeng.2021.107343

[20] Ammari, B.L., Johnson, E.S., Stinchfield, G., Kim, T., Bynum, M., Hart, W.E., Pulsipher, J., Laird, C.D.: Linear model decision trees as surrogates in optimization of engineering applications. Computers & Chemical Engineering 178, 108347 ( 2023) https://doi.org/10.1016/j.compchemeng.2023.108347

[21] McDonald, T., Tsay, C., Schweidtmann, A.M., Yorke-Smith, N.: Mixed-integer optimisation of graph neural networks for computer-aided molecular design. Computers & Chemical Engineering 185, 108660 ( 2024) https://doi.org/10.1016/j.compchemeng.2024.108660

[22] Zhang, S., Campos, J.S., Feldmann, C., Sandfort, F., Mathea, M., Misener, R.: Augmenting optimization-based molecular design with graph neural networks. Computers & Chemical Engineering 186, 108684 ( 2024) https://doi.org/10.1016/j.compchemeng.2024.108684

[23] Schweidtmann, A.M., Mitsos, A.: Deterministic Global Optimization with Artificial Neural Networks Embedded. Journal of Optimization Theory and Applications 180( 3), 925– 948 ( 2019) https://doi.org/10.1007/s10957-018-1396-0

[24] Grimstad, B., Andersson, H.: ReLU networks as surrogate models in mixed-integer linear programs. Computers & Chemical Engineering 131, 106580 ( 2019) https://doi.org/10.1016/j.compchemeng.2019.106580

[25] Anderson, R., Huchette, J., Ma, W., Tjandraatmadja, C., Vielma, J.P.: Strong mixed-integer programming formulations for trained neural networks. Mathematical Programming 183( 1), 3– 39 ( 2020) https://doi.org/10.1007/s10107-020-01474-5

[26] Tsay, C., Kronqvist, J., Thebelt, A., Misener, R.: Partition-based formulations for mixed-integer optimization of trained relu neural networks. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 3068– 3080. Curran Associates, Inc., Red Hook, NY, USA ( 2021). https://proceedings.neurips.cc/paper_files/paper/2021/file/17f98ddf040204eda0af36a108cbdea4-Paper.pdf

[27] Wilhelm, M.E., Wang, C., Stuber, M.D.: Convex and concave envelopes of artificial neural network activation functions for deterministic global optimization. Journal of Global Optimization 85( 3), 569– 594 ( 2023) https://doi.org/10.1007/s10898-022-01228-x

[28] Wang, K., Lozano, L., Cardonha, C., Bergman, D.: Optimizing over an ensemble of trained neural networks. INFORMS Journal on Computing 35(3), 652–674 (2023) https://doi.org/10.1287/ijoc.2023.1285

[29] Bergman, D., Huang, T., Brooks, P., Lodi, A., Raghunathan, A.U.: JANOS: An Integrated Predictive and Prescriptive Modeling Framework. INFORMS Journal on Computing 34(2), 807–816 (2022) https://doi.org/10.1287/ijoc.2020.1023

[30] Maragno, D., Wiberg, H., Bertsimas, D., Birbil, S.I., Hertog, D., Fajemisin, A.: Mixed-Integer Optimization with Constraint Learning (2023). https://arxiv.org/abs/2111.04469

[31] Lueg, L., Grimstad, B., Mitsos, A., Schweidtmann, A.M.: reluMIP: Open Source Tool for MILP Optimization of ReLU Neural Networks. https://doi.org/10.5281/zenodo.5601907

[32] Ceccon, F., Jalving, J., Haddad, J., Thebelt, A., Tsay, C., Laird, C.D., Misener, R.: OMLT: Optimization & Machine Learning Toolkit. Journal of Machine Learning Research 23( 349), 1– 8 ( 2022)

[33] Bynum, M.L., Hackebeil, G.A., Hart, W.E., Laird, C.D., Nicholson, B.L., Siirola, J.D., Watson, J.-P., Woodruff, D.L.: Pyomo–Optimization Modeling in Python vol. 67, 3rd edn. Springer (2021)

[34] Bongartz, D., Najman, J., Sass, S., Mitsos, A.: MAiNGO–McCormick-based Algorithm for mixed-integer Nonlinear Global Optimization. Process Systems Engineering (AVT.SVT), RWTH Aachen University (2018)

[35] Bonami, P.: Using Trained Machine Learning Predictors in Gurobi (2023). https://www.youtube.com/watch?v=jaux5Oo4qHU

[36] Turner, M., Chmiela, A., Koch, T., Winkler, M.: PySCIPOpt-ML: Embedding Trained Machine Learning Models into Mixed-Integer Programs. arXiv preprint arXiv:2312.08074 (2023)

[37] MathOptAI.jl: Documentation. https://lanl-ansi.github.io/MathOptAI.jl/stable/. Accessed: 2025-02-25

[38] Yang, D., Balaprakash, P., Leyffer, S.: Modeling design and control problems involving neural network surrogates. Computational Optimization and Applications 83( 3), 759– 800 ( 2022) https://doi.org/10.1007/s10589-022-00404-9

[39] Carrasco, P., Muñoz, G.: Tightening convex relaxations of trained neural networks: a unified approach for convex and S-shaped activations (2024)

[40] Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T.Y., Tegmark, M.: KAN: Kolmogorov-Arnold Networks (2024). https://arxiv.org/abs/2404.19756

[41] Bolusani, S., Besançon, M., Bestuzheva, K., Chmiela, A., Dionísio, J., Donkiewicz, T., Doornmalen, J., Eifler, L., Ghannam, M., Gleixner, A., Graczyk, C., Halbig, K., Hedtke, I., Hoen, A., Hojny, C., Hulst, R., Kamp, D., Koch, T., Kofler, K., Lentz, J., Manns, J., Mexi, G., Mühmer, E., Pfetsch, M.E., Schlösser, F., Serrano, F., Shinano, Y., Turner, M., Vigerske, S., Weninger, D., Xu, L.: The SCIP Optimization Suite 9.0 (2024). https://arxiv.org/abs/2402.17702

[42] Kolmogorov, A.N.: On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR 114, 953– 956 ( 1957)

[43] Givental, A.B., Khesin, B.A., Marsden, J.E., Varchenko, A.N., Vassiliev, V.A., Viro, O.Y., Zakalyukin, V.M. (eds.): On the representation of functions of several variables as a superposition of functions of a smaller number of variables, pp. 25–46. Springer, Berlin, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01742-1_5

[44] Girosi, F., Poggio, T.: Representation properties of networks: Kolmogorov's theorem is irrelevant. Neural Computation 1( 4), 465– 469 ( 1989)

[45] Poggio, T., Banburski, A., Liao, Q.: Theoretical issues in deep networks. Proceedings of the National Academy of Sciences of the United States of America 117(48), 30039–30045 (2020) https://doi.org/10.1073/pnas.1907369117

[46] de Boor, C.: Package for calculating with B-splines. SIAM Journal on Numerical Analysis 14(3), 441–472 (1977)

[47] Liu, Z.: How to get a specific activation function's equation? https://github.com/KindXiaoming/pykan/issues/197

[48] Liu, Z.: Demo 4: Extracting activation functions. Documentation for Kolmogorov Arnold Networks. https://kindxiaoming.github.io/pykan/API_demo/API_4_extract_activations.html

[49] Puranik, Y., Sahinidis, N.V.: Domain reduction techniques for global NLP and MINLP optimization. Constraints 22( 3), 338– 376 ( 2017) https://doi.org/10.1007/s10601-016-9267-5

[50] Gleixner, A.M., Berthold, T., Müller, B., Weltge, S.: Three enhancements for optimization-based bound tightening. Journal of Global Optimization 67( 4), 731– 757 ( 2017) https://doi.org/10.1007/s10898-016-0450-4

[51] Elfwing, S., Uchibe, E., Doya, K.: Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks 107, 3– 11 ( 2018) https://doi.org/10.1016/j.neunet.2017.12.012 . Special issue on deep reinforcement learning

[52] Balas, E.: Disjunctive Programming. Springer (2018). https://doi.org/10.1007/978-3-030-00148-3

[53] Ruiz, J.P., Grossmann, I.E.: Using redundancy to strengthen the relaxation for the global optimization of minlp problems. Computers & Chemical Engineering 35( 12), 2729– 2740 ( 2011) https://doi.org/10.1016/j.compchemeng.2011.01.035

[54] Liu, Z.: KAN spline module: extend_grid method. Documentation for Kolmogorov Arnold Networks. https://kindxiaoming.github.io/pykan/kan.html#kan.spline.extend_grid

[55] McCormick, G.P.: Computability of global solutions to factorable nonconvex programs: Part I — Convex underestimating problems. Mathematical Programming 10( 1), 147– 175 ( 1976) https://doi.org/10.1007/BF01580665

[56] Najman, J., Bongartz, D., Mitsos, A.: Linearization of McCormick relaxations and hybridization with the auxiliary variable method. Journal of Global Optimization 80( 4), 731– 756 ( 2021) https://doi.org/10.1007/s10898-020-00977-x

[57] Mitsos, A., Chachuat, B., Barton, P.I.: McCormick-based relaxations of algorithms. SIAM Journal on Optimization 20(2), 573–601 (2009) https://doi.org/10.1137/080717341

[58] Chachuat, B., Houska, B., Paulen, R., Peri'c, N., Rajyaguru, J., Villanueva, M.E.: Set-theoretic approaches in analysis, estimation and control of nonlinear systems. IFAC-PapersOnLine 48( 8), 981– 995 ( 2015) https://doi.org/10.1016/j.ifacol.2015.09.097 . 9th IFAC Symposium on Advanced Control of Chemical Processes ADCHEM 2015

[59] Rosenbrock, H.H.: An automatic method for finding the greatest or least value of a function. The Computer Journal 3( 3), 175– 184 ( 1960) https://doi.org/10.1093/comjnl/3.3.175 https://academic.oup.com/comjnl/article-pdf/3/3/175/988633/030175.pdf

[60] AMPL Optimization Inc.: AMPL documentation: List of SCIP options. https://dev.ampl.com/solvers/scip/options.html

[61] Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Mathematical Programming 91( 2), 201– 213 ( 2002) https://doi.org/10.1007/s101070100263

[62] Mittelmann, H.D.: Benchmarking Optimization Software - a (Hi)Story. SN Operations Research Forum 1( 1), 2 ( 2020) https://doi.org/10.1007/s43069-020-0002-0

[63] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow (2015). Software available from tensorflow.org

[64] Baltean-Lugojan, R., Bonami, P., Misener, R., Tramontani, A.: Scoring positive semidefinite cutting planes for quadratic optimization via trained neural networks (2019). Preprint: http://www.optimization-online.org/DB_HTML/2018/11/6943.html

[65] Kronqvist, J., Misener, R., Tsay, C.: P-split formulations: A class of intermediate formulations between big-M and convex hull for disjunctive constraints (2024)

[66] López-Flores, F.J., Ramírez-Márquez, C., Ponce-Ortega, J.M.: Process systems engineering tools for optimization of trained machine learning models: Comparative and perspective. Industrial & Engineering Chemistry Research 63(32), 13966–13979 (2024) https://doi.org/10.1021/acs.iecr.4c00632
