0.010 0.008 0.006 0.004 0.002 0.000 30 25 0 20 5 15 Ind 10 10 ex on 15 margi 20 5 nal 25 0 Entropic Proximal Gradient Method for Generalized Optimal Transport Problems Master’s Thesis in Applied Mathematics GUSTAV SVENSSON SARA NILSSON DEPARTMENT OF MATHEMATICAL SCIENCES CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2024 www.chalmers.se www.gu.se Marginal number t Value of marginal t Master’s Thesis in Applied Mathematics 2024 Entropic Proximal Gradient Method for Generalized Optimal Transport Problems GUSTAV SVENSSON SARA NILSSON Department of Mathematical Sciences Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2024 Entropic Proximal Gradient Method for Generalized Optimal Transport Problems GUSTAV SVENSSON SARA NILSSON © GUSTAV SVENSSON, SARA NILSSON 2024. Supervisor: Axel Ringh, Applied Mathematics and Statistics Examiner: Ann-Brith Strömberg, Applied Mathematics and Statistics Master’s Thesis in Applied Mathematics 2024 Department of Mathematical Sciences Chalmers University of Technology University of Gothenburg SE-412 96 Gothenburg Sweden Telephone +46 31 772 1000 Cover: The entropic proximal gradient method for a generalized multi-marginal optimal transport problem, with T = 32 and N = 26. No penalty from ∇F (M) for marginal t > N/4 and t < N/2, ϵ = δ = 0.1, see Section 4.2.1. Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Printed by Chalmers Reproservice Gothenburg, Sweden 2024 iv Entropic Proximal Gradient Method for Generalized Optimal Transport Problems GUSTAV SVENSSON SARA NILSSON Department of Mathematical Sciences Chalmers University of Technology University of Gothenburg Abstract Optimal transport, a fundamental problem in applied mathematics, involves finding the most efficient way to move mass from multiple sources to multiple destina- tions. Previously known approaches employ entropic regularization combined with the Sinkhorn iterations, a technique known for its efficiency in solving large-scale optimal transport problems. This thesis presents a new method for solving general- ized optimal transport problems using the entropic proximal gradient method. The method breaks down the complex problem into a sequence of standard optimal trans- port problems, solved by the Sinkhorn iterations. We provide theoretical founda- tions, including proof of convergence and termination criteria, along with a detailed implementation and numerical experiments showing the algorithm’s applicability. The results of this thesis may offer improvements in computational performance for generalized optimal transport problems, making it a valuable tool for applications in economics, machine learning, and other fields where optimal transport is utilized. Keywords: generalized multi-marginal optimal transport, Sinkhorn iterations, en- tropic regularization, proximal gradient, optimization, graph-structure, log-sum-exp. v Acknowledgments We would like to express our deepest gratitude to our supervisor, Axel, whose un- wavering support, insightful guidance, boundless patience, and overall awesomeness made this thesis possible. Additionally, we extend our heartfelt thanks to our friends, Georg Kyhn, Carl Elofsson and Vilhelm Larsson, for their steadfast support and constant presence throughout this journey. And to our families - we do not know where we would be without your support. Gustav Svensson, Gothenburg, May 2024 Sara Nilsson, Gothenburg, May 2024 vii Nomenclature Below is the nomenclature of indices, sets, parameters, variables, and sums that have been used throughout this thesis. Indices k Iteration index ks Sinkhorn iteration index ko Outer iteration index (t) Marginal t (t1, t2) Joint marginal t1 and t2 i General index it Index for marginal t Sets R̄ R ∪ {∞} R+ {x ∈ R ∶ x ≥ 0} ∂Ψ Subdifferential set of Ψ Parameters ϵ Regularization parameter δ Step size in the entropic proximal gradient algorithm η The sum ϵ + δ ix Variables A Bold capital letter represent tensors A Capital letter represent matrices a Lower case letter represent vectors µt Marginal/distribution t of optimal transport problems (t) ui Variable corresponding to marginal t, index i (t),k ui Variable corresponding to marginal t, index i, iteration k Υ General tree G Path graph, special instance of a tree V Vertices, or nodes, of a tree VE Vertices, or nodes, of a tree with equality constraints VI Vertices, or nodes, of a tree with inequality constraints E Edges of a tree Pt(M) Projection of tensor M on marginal t Pt1,t2(M) Bi-marginal projection of tensor M on joint marginal t1, t2 (t) λi Lagrangian multipliers for marginal t, index i, bi-marginal case Λ(t)i Lagrangian multipliers for marginal t, index it, multi-marginal caset Sums Sums can be written in different ways, depending on convenience. N N N N ∑ = ∑ ⋅ ⋅ ⋅ ∑ ∑ ⋅ ⋅ ⋅ ∑ i1,...,it−1,it+1,...,iT i1=1 it−1=1 it+1=1 iT =1 T ∑ = ∑, where V = {1,2, . . . ,T } t∈V t=1 x Contents Nomenclature ix 1 Introduction 1 1.1 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Background and theory 3 2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1.2 Fundamental theory and notation . . . . . . . . . . . . . . . . . 3 2.1.3 Convex optimization theory . . . . . . . . . . . . . . . . . . . . . 4 2.1.4 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Optimal transport . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Bi-marginal optimal transport . . . . . . . . . . . . . . . . . . . 9 2.2.1.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.2 Multi-marginal optimal transport . . . . . . . . . . . . . . . . . 10 2.2.3 Entropic regularization and the Sinkhorn iterations . . . . . . . 11 2.2.3.1 Entropic regularization in the bi-marginal case . . . . 11 2.2.3.2 Entropic regularization in the multi-marginal case . . 14 2.2.4 Graph-structured problems . . . . . . . . . . . . . . . . . . . . . 16 2.2.5 Generalized multi-marginal optimal transport . . . . . . . . . . 18 2.3 The entropic proximal gradient method . . . . . . . . . . . . . . . . . . 19 3 The entropic proximal gradient method for solving generalized op- timal transport problems 23 3.1 Deriving (3.2) and proving that C̃ko has the same graph-structure as C 23 3.1.1 ∇F (Mko) has a graph-structure . . . . . . . . . . . . . . . . . . 24 3.1.2 Merging ϵD(M) and δKL(M∥Mko) into a single expression . . 28 3.1.3 The graph-structure of C̃ko . . . . . . . . . . . . . . . . . . . . . 29 3.2 The Sinkhorn iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.3 The computation of C̃ko . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.4 Convergence and termination criterion for the entropic proximal gra- dient method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.4.1 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.4.2 Termination criterion . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.5 The algorithm in pseudo code . . . . . . . . . . . . . . . . . . . . . . . . 35 3.5.1 General trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 xi Contents 3.5.2 Path graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4 Implementation and numerical results 39 4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.1.1 Defining a generalized multi-marginal optimal transport prob- lem and its components . . . . . . . . . . . . . . . . . . . . . . . 39 4.1.2 Transformation from nodes to edges . . . . . . . . . . . . . . . . 41 4.1.2.1 Example of cost-transformation . . . . . . . . . . . . . 41 4.1.3 The log-sum-exp-method . . . . . . . . . . . . . . . . . . . . . . . 42 4.2 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2.1 No penalty from ∇F (M) for marginal t > N/4 and t < N/2 . . 44 4.2.2 No penalty from ∇F (M) if marginal t mod 2 = 1 . . . . . . . 45 4.2.3 No penalty from ∇F (M) if marginal t mod 2 = 1 and/or if marginal t > N/4 and t < N/2 . . . . . . . . . . . . . . . . . . . . 46 5 Discussion and conclusions 49 5.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Bibliography 51 xii 1 Introduction When reading the title of this thesis, ”Entropic Proximal Gradient Method for Gen- eralized Optimal Transport Problems”, one might feel overwhelmed by the formula- tion, as these words might not have been frequently used in a standard mathematical education at university level, and certainly not together. So in this introduction, we begin by giving a brief explanation of the different components before delving into the more advanced theory. Counter-intuitively, we begin with the end of the title, namely ”optimal transport problems”. What are these? Shortly explained, an optimal transport problem entails identifying a transport plan that minimizes the cost of transferring mass from one distribution to another, subject to specific constraints. Historically, optimal transport has revolved around challenges related to efficient transportation and resource allocation. Its origins trace back to 1781 when the mathematician Gaspard Monge first addressed the task of relocating one distribution to another while minimizing expenses, as described in [Tho18]. Today, optimal transport finds application in a broad spectrum of fields extending beyond traditional economic and operational contexts. Its utility spans domains such as image processing, machine learning, cancer detection, and color modeling, as documented in [HRCK21] and references therein. There are different versions of optimal transport problems. The classical version is the bi-marginal version, which consists of only two distributions, i.e., marginals. These marginals are typically subject to equality constraints, but they can also be governed by both equality and inequality constraints. These constraints dictate which locations are permitted or prohibited at various instances. Analogously, the ”standard” multi-marginal optimal transport problem consists of multiple marginals and quickly become more complex. In order to solve these multi-marginal problems, entropic regularization and the Sinkhorn iterations can be applied, see [Sin67], and has been done so successfully for some time. In brief, implementing the Sinkhorn iterations involves performing coordinate ascent in the Lagrangian dual of the optimal transport problem. However, this is only part of the solution to the problem of the computational complexity. While computing the Sinkhorn iterations, projections of the marginals needs to be computed. This is computationally challenging, unless the problem has a graph-structure. Now, these were some more fancy words that most mathematical master students might not know. However, just accept this for now, as it will be explained thoroughly in the following theory chapter. Furthermore, multi-marginal optimal transport problems can be generalized by having constraints on some of the marginals, while others are associated with cost functions that fulfill certain assumptions. These generalized optimal transport prob- 1 1. Introduction lems have been solved in [RHCK21]. Some of the steps in this solution can be com- putationally expensive, depending on the cost functions that are associated with the marginals, and are thus more of an ”abstract algorithm”. This is where the focus of our thesis begins. We wish to develop a new method for solving these generalized optimal transport problems by applying the entropic proximal gradient method, which is the beginning of the title. But what is this? It is an optimization method that is commonly used for solving convex optimization problems that con- sists of smooth and possibly non-smooth components - these components are the cost functions we mentioned above. Shortly explained, this allows us to rewrite the generalized optimal transport problem into a sequence of ”standard” multi-marginal optimal transport problem, which are then solved using the Sinkhorn iterations we mentioned earlier. Now that the entire title has been ”explained”, we might feel like experts. If only that were true. There is still much more to learn, so let us take a look at the outline of this thesis. 1.1 Outline of the thesis We begin with the fundamentals. In Chapter 2 we start with preliminaries, laying the groundwork before delving into optimal transport theory, exploring its various iterations. Subsequently, we delve into the theory behind the entropic proximal gradient method. In Chapter 3, we develop the entropic proximal method for solving generalized op- timal transport problems. We demonstrate that such problems can be decomposed into a sequence of standard optimal transport problems, and we employ Sinkhorn iterations for their solution. We establish the graph-structure of this sequence of problems, explore their computation, and provide proofs for convergence and termi- nation criteria of the entropic proximal gradient method. The chapter culminates in the presentation of the algorithm through pseudo code. Moving forward to Chapter 4, we transition to the practical realm, detailing our implementation efforts and showcasing numerical results. Finally, in Chapter 5, we draw conclusions from our findings and discuss potential future avenues for exploration. 2 2 Background and theory In this chapter, we introduce relevant concepts that are needed in Chapter 3. We begin with the preliminaries, where we set up the notation used in the thesis and introduce basic definitions and results from, amongst other things, convex optimiza- tion theory. We continue with optimal transport theory, and conclude the chapter with theory regarding the entropic proximal gradient method. 2.1 Preliminaries This section introduces used notation, relevant definitions and results on which the following sections are based on. We begin with a short summary of the notation used in this thesis. 2.1.1 Notation Let exp(⋅), log(⋅),⊙,⊘ and ⊗ denote the element wise exponential, logarithm, multi- plication, division and outer product, respectively, of vectors, matrices, and tensors1. We denote a vector of ones as 1, with the size clear from the context, the identity ma- trix as I, and diag(v) denotes a square matrix with the elements of a vector v on the main diagonal. We define the sets R̄ ∶= R∪{∞} and R+ ∶= {x ∈ R ∶ x ≥ 0}. Let dom(f) denote the domain of a function f , defined as dom(f) = {x ∈ RN ∶ f(x) < ∞}, [Bec17, page 14]. Moreover, let int(E) denote the interior of E ⊆ RN , which defines the largest open set contained in E, see [AEP+16, page 40]. 2.1.2 Fundamental theory and notation We begin by defining the inner product, which holds an important role and is used frequently throughout the thesis. Definition 2.1.1. (Inner product)[Bec17, pages 2–3] An inner product of a real vector space V is a function that associates to each pair of vectors x, y ∈ V a real number, which is denoted by ⟨x, y⟩ and satisfies the following properties: 1. (commutativity) ⟨x, y⟩ = ⟨y, x⟩ for any x, y ∈ V. 2. (linearity) ⟨α1x1 + α2x2, y⟩ = α1⟨x1, y⟩ + α2⟨x2, y⟩ for any α1, α2 ∈ R and x1, x2, y ∈ V . 1Multi-dimensional matrices. 3 2. Background and theory 3. (positive definiteness) ⟨x,x⟩ ≥ 0 for any x ∈ V and ⟨x,x⟩ = 0 if and only if x = 0. We also need to define the norm of a vector. Definition 2.1.2. (Norm of a vector)[AEP+16, page 36] Let v ∈ RN be a vector. The norm of v, denoted ∥v∥, is defined as the square root of the associated inner product: √ ∥v∥ ∶= ⟨v, v⟩. Another important definition is that of an adjoint operator, which is used to prove the graph-structure of a function in Chapter 3 . Definition 2.1.3. (Adjoint operator) [Lue69, page 150] Given a linear operator A ∶ RN → RM , the adjoint operator is defined as the unique operator A∗ ∶ RM → RN such that the relation ⟨y,A(x)⟩RM = ⟨A ∗(y), x⟩RN , holds for all x ∈ RN and y ∈ RM . In order to ease notation, the subscript used to specify the space in which the elements of the inner product is, is omitted whenever it is clear from the context. Moreover, the following theorem states that the adjoint operator is also a linear operator. Theorem 2.1.1. [Lue69, page 151] The adjoint operator A∗ of the linear operator A ∶ RN → RM is linear and bounded with ∥A∗∥ = ∥A∥. We proceed with some relevant convex optimization theory. 2.1.3 Convex optimization theory Consider the optimization problem minimize f(x), (2.1) subject to x ∈ E, where E ⊆ RN is a nonempty set and f ∶ RN → R̄ is a given function. Of particular interest here are so-called convex problems. We first define what it means for f to be a convex function. Definition 2.1.4. (Convex function) [AEP+16, Definition 3.39] Suppose that E ⊆ RN . A function f ∶ RN → R̄ is convex at x ∈ E if y ∈ E ⎪⎫ λ ∈ (0 ⎪,1) ⎬ Ô⇒ f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y), λx + (1 − λ)y ∈ E ⎭⎪⎪ where λ ∈ (0,1)⇔ 0 < λ < 1. The function f is convex on E if it is convex at every x ∈ E. 4 2. Background and theory We also introduce the definition of a convex set. Definition 2.1.5. (Convex set) [AEP+16, Definition 3.1] Let E ⊆ RN . The set E is convex if ⎫ x, y ∈ E ⎪⎪ 0 1 ⎬ Ô⇒ λx + (1 − λ)y ∈ Eλ ∈ ( , ) ⎪⎪⎭ holds. The problem (2.1) is called convex if E is a convex set and f is a convex function on E. The following definition is a stricter definition of a convex function, namely that it is σ-strongly convex. Definition 2.1.6. (σ-strongly convex function)[Bec17, Definition 5.16] A function f ∶ E → (−∞,∞] is called σ-strongly convex for a given σ > 0 if dom(f) is a convex set and the following inequality holds for any x, y ∈ dom(f) and λ ∈ [0,1], σ f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) − λ(1 − λ)∣∣x − y∣∣22 . An important result regarding convex optimization is the Fundamental Theorem of Global Optimality. Theorem 2.1.2. [AEP+16, Theorem 4.3] Consider the optimization problem (2.1), where E is a convex set and f is convex on E. Then every local minimum of f over E is also a global minimum. For unconstrained optimization problems, i.e., (2.1) without the constraint, we have the following theorem, which gives necessary and sufficient global optimality conditions for optimality of an optimization problem. Theorem 2.1.3. [AEP+16, Theorem 4.18] Suppose that f ∶ RN → R is convex and in C1 on RN . Then, x∗ is a global minimum of f over RN ⇐⇒ ∇f(x∗) = 0N . If we have constraints it is not necessarily true that there exists a x ∈ E such that ∇f(x) = 0 in all optimal solutions. However, we have the following result. Theorem 2.1.4. Let f and E in (2.1) be convex. If ∇f(x) = 0, for some x ∈ E, then x is a global optimal solution to (2.1). Proof. This follows for example from Theorem 2.1.3 and the Relaxation Theorem, [AEP+16, Theorem 6.1], by relaxing the constraint x ∈ E to x ∈ RN . 2.1.4 Functions In this thesis we work a lot with functions, and we need a cornucopia of theory regarding them. We begin with some simple definitions. 5 2. Background and theory Definition 2.1.7. (Proper function) [Bec17, page 14] A function f ∶ E → [−∞,∞] is called proper if it does not attain the value −∞ and there exists at least one x ∈ E such that f(x) <∞. Definition 2.1.8. (Indicator function) The function δA ∶ RN → {0,∞} given by ⎪⎪ ⎧0 if x ∈ A, δA(x) = ⎨ (2.2) ⎪⎪∞ if x ∉ A,⎩ is called an indicator function of a set A, where A ⊆ RN . Note that this is a convex function if and only if A is a convex set. Definition 2.1.9. (Subgradient) [AEP+16, Definition 6.17] Let f ∶ RN → R be a convex function. We say that a vector g ∈ RN is a subgradient of f at x ∈ RN if f(y) ≥ f(x) + gT(y − x), y ∈ RN . The set of such vectors g defines the subdifferential of f at x, and is denoted ∂f(x). Definition 2.1.10. (Lower semi-continuous function) [Bec17, Definition 2.5] A function f ∶ E→ [−∞,∞] is called lower semi-continuous at x ∈ E if f(x) ≤ lim inf f(xn) (2.3) n→∞ for any sequence {xn}n≥1 ⊆ E for which xn → x as n → ∞. A function f ∶ E → [−∞,∞] is called lower semi-continuous if it is lower semi-continuous at each point in E. Definition 2.1.11. (Lf -smoothness)[Bec17, Definition 5.1] Let Lf ≥ 0. A function f : E→ (−∞,∞] is said to be Lf -smooth over a set D ⊆ E if it is differentiable over D and satisfies ∣∣∇f(x) −∇f(y)∣∣ ≤ Lf ∣∣x − y∣∣ for all x, y ∈D. The constant Lf is called the smoothness parameter. Definition 2.1.12. (Conjugate function)[Bec17, Definition 4.1] Let f ∶ E→ [−∞,∞] be an extended real-valued function. The function ḟ ∶ Ė→ [−∞,∞], defined by ḟ(y) =max{⟨y, x⟩ − f(x)}, y ∈ Ė, x∈E is called the conjugate function of f . In this thesis, we will also be using the function f ∶ RN → (−∞,∞] defined by f N(x) = ∑i=1 xi logxi, and need the following result. Theorem 2.1.5. [Bec17, Example 5.27] Consider the function f ∶ RN → (−∞,∞] given by ⎧ N ⎪⎪∑ f x i=1 xi logxi, x ∈∆N , ( ) = ⎨ ⎪⎪∞ else,⎩ 6 2. Background and theory where ∆N denotes the unit simplex2 in the space RN . Then, f is 1-strongly3 convex w.r.t. both the ℓ1- and ℓ2-norm. In order to prove this theorem, we require the following two lemmas. Lemma 2.1.6. [Bec17, pages 97-98] Let f ∶ RN → (−∞,∞] be given by ⎪⎧ N⎪∑i=1 xi logxi, x ∈∆N ,f(x) = ⎨ ⎪⎪∞ else.⎩ Then, the conjugate of f is N ḟ(y) = log( ey∑ j), j=1 also known as the log-sum-exp function. Lemma 2.1.7. (1-smoothness of the log-sum-exp function)[Bec17, Example 5.15] Let f ∶ RN → (−∞,∞] be given by N f(x) = log( ex∑ i). i=1 Then f is 1-smooth4 with respect to the ℓ2- and ℓ∞-norms. Proof. We will first show f is 1-smooth w.r.t. the ℓ2-norm. The partial derivatives of f are ∂f exi (x) = , i = 1,2, . . . ,N, ∂x Ni ∑ xk=1 e k and the second-order partial derivatives are ⎧ xi xj ∂2f ⎪⎪− e e (∑N x= 2 , i ≠ j, (x) = ⎨ k 1 e k) exiexi ex∂x ∂x ii j ⎪⎪−⎩ (∑N exk)2 + ∑N ex , i = j.k=1 kk=1 We can thus write the Hessian matrix5 as ∇2f(x) = diag(w) −wwT , where xw e ii = ∑N x . To show that f is 1-smooth w.r.t. the ℓ2-norm, note that fork=1 e k any x ∈ RN , ∇2f(x) = diag(w) −wwT ⪯ 6 diag(w) ⪯ I. 2∆N = {x ∈ RN ∶ x ≥ 0, eTx = 1 }, where e denotes the unit vector, [Bec17, page 5]. 3Definition 2.1.6 with σ = 1. 4Definition 2.1.11 with Lf = 1. 5Symmetric matrix denoted by ∇2f(x0), see [AEP+16, page 41]. 6For symmetric matrices A and B, A ⪯ B is defined as A −B being positive semi-definite. 7 2. Background and theory Hence λ 2max(∇ f(x)) ≤ 1 for any x ∈ RN . Noting that the log-sum-exp function is convex, we can invoke [Bec17, Corollary 5.13] and conclude that f is 1-smooth w.r.t. the ℓ2-norm. We will show that f is 1-smooth w.r.t. the ℓ∞-norm. For that, we begin by proving that for any d ∈ RN , dT∇2f(x)d ≤ ∣∣d∣∣2∞. (2.4) Indeed, dT∇2f(x)d = dT (diag(w) −wwT )d = dT diag(w)d − (wTd)2 ≤ dT diag(w)d N = 2∑widi i=1 N ≤ ∣∣d∣∣2∞∑wi i=1 = ∣∣d∣∣2∞. Now, since f is twice continuously differentiable over RN , it follows by [Bec17, Theorem 5.10] that for any x, y ∈ RN there exists ξ ∈ [x, y] for which 1 f(y) = f(x) +∇f(x)T (y − x) + 2(y − x) T∇2f(ξ)(y − x). (2.5) Combining (2.5), taking d = y − x, and (2.4), we obtain the inequality 1 f(y) ≤ f(x) +∇f(x)T (y − x) + 2 ∣∣y − x∣∣ 2 ∞, which by [Bec17, Theorem 5.8] implies the 1-smoothness of f w.r.t. the ℓ∞-norm. We can now prove Theorem 2.1.5. Proof of Theorem 2.1.5. Let f be denoted as in the theorem and let ḟ denote the conjugate function of f . Then by Lemma 2.1.6 the conjugate to f is the log-sum- exp function ḟ N(y) ∶= log(∑i=1 exp(yi)), which by Lemma 2.1.7 is a 1-smooth function w.r.t. both the ℓ∞- and the ℓ2-norm. Using [Bec17, Theorem 5.26] the smoothness of ḟ implies that f is 1-strongly convex w.r.t. both the ℓ1- and the ℓ2-norm. With these definitions, lemmas, and theorems established, we are now ready to delve into the realm of optimal transport. 2.2 Optimal transport As we mentioned in the introduction, the objective of an optimal transport problem is to determine a transportation plan that minimizes the cost of transferring mass from one distribution to another, i.e., finding an optimal transport plan. These distributions are also called marginals, and a problem can contain two or more marginals. If there are only two marginals, it is called a bi-marginal optimal trans- port problem. 8 2. Background and theory 2.2.1 Bi-marginal optimal transport A simple example of an optimal transport problem arises in the bi-marginal scenario, involving just two distributions: the initial given distribution and the final target distribution. Let µ1 ∈ RN+ and µ2 ∈ RN+ be two non-negative vectors with equal total mass, i.e., N N∑i=1 µ1,i = ∑j=1 µ2,j. A transportation plan is represented by a matrix M ∈ RN×N+ , where Mij signifies the amount of mass transported from µ1,i to µ . Thus, M is a valid transport plan between µ and µ if N2,j 1,i 2,j ∑j=1Mij = µ1,i for i = 1, . . . ,N and N∑i=1Mij = µ2,j for j = 1, . . . ,N . Given a cost matrix C ∈ RN×N , where Cij denotes the cost of transporting a unit mass from µ1,i to µ2,j, the objective of the optimal transport problem is to identify a valid transport plan that minimizes the total cost of transportation. This problem can be formulated mathematically as min ⟨C,M⟩ M∈R+N×N N s.t. ∑Mij = µ1,i for i = 1, . . . ,N, (2.6) j=1 N ∑Mij = µ2,j for j = 1, . . . ,N. i=1 Here, ⟨⋅, ⋅⟩ denotes the inner product ⟨C,M N N⟩ = ∑i=1∑j=1CijMij. Note that the constraints can also be written as M1 = µ1 and MT1 = µ2, and that the constraints are symmetric, i.e., the marginals can swap place. The transport plan would be the same, except that the transport matrix is transposed. We illustrate the bi-marginal optimal transport problem with a simple example. 2.2.1.1 Example For simplicity, assume we wish to find an integer solution to an optimal transport problem defined as in (2.6). Let N = 3, and assume that the marginals and the cost matrix are given by: ⎡3⎤ ⎡0⎤ ⎡0 1 2⎤⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢0⎥ ⎢ ⎥ ⎢µ1 = ⎢ ⎥ , µ2 = ⎢2⎥ , and C = ⎢1 0 1⎥⎥ .⎢1⎥ ⎢ ⎥ ⎢ ⎥⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎣ ⎦ ⎣2⎦ ⎣2 1 0⎦ The marginals µ1 and µ2 represent the start and final distribution, respectively. Our objective is to construct a transport matrix M such that the constraints in (2.6) are fulfilled, while also minimizing the cost. Since each element Mij in M represents the amount of mass transported from node i to j, we can rewrite the constraints as ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎢M11 M12 M13⎥ ⎢1⎥ ⎢3⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢M21 M22 M23⎥ ⋅ ⎢1⎥ = ⎢0⎥ ,⎢ ⎥ ⎢ ⎥ ⎢ ⎢ ⎣M31 M32 M33 ⎥ ⎢1⎥ ⎢⎦ ⎣ ⎦ ⎣1 ⎥ ⎥ ⎦ and ⎡M M ⎤ ⎡ ⎤ ⎡ ⎤⎢ 11 21 M31⎥ ⎢1⎥ ⎢0⎥ ⎢ ⎥ ⎢1⎥ ⎢ ⎥⎢M12 M22 M32⎥ ⋅ ⎢ ⎥ = ⎢2⎥ .⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢M13 M23 M33⎥ ⎢1⎥ ⎢ ⎥⎣ ⎦ ⎣ ⎦ ⎣2⎦ 9 2. Background and theory This gives the system of equations M11 +M12 +M13 = 3, M21 +M22 +M23 = 0, M31 +M32 +M33 = 1, M11 +M21 +M31 = 0, M12 +M22 +M32 = 2, M13 +M23 +M33 = 2. Since Mij ≥ 0 for all i, j, we have that M21 = M22 = M23 = M11 = M31 = 0, and the equations that remain are M12 +M13 = 3, M32 +M33 = 1, M12 +M32 = 2, M13 +M33 = 2. This yields two integer solutions;M12 = 1,M13 = 2,M32 = 1,M33 = 0 orM12 = 2,M13 = 1,M32 = 0,M33 = 1, which gives the transport matrices ⎡ ⎤ ⎡ ⎢0 1 2⎥ ⎢0 2 1⎤⎥ a ⎢0 0 0⎥ or b ⎢0 0 0⎥M = ⎢ ⎥ M = ⎢ ⎥ . ⎢ ⎥ ⎢ ⎥ ⎢0 1 0⎥ ⎢⎣ ⎦ ⎣0 0 1⎥⎦ These can be interpreted as follows. Given Ma, 1 unit of mass is transported from µ1,1 to µ2,2, 2 units of mass is transported from µ1,1 to µ2,3 and 1 unit of mass is transported from µ1 3 to µ2 2. For M b, , , 2 units of mass is transported from µ1,1 to µ2,2, 1 unit of mass is transported from µ1,1 to µ2,3 and 1 unit of mass is transported from µ1,3 to µ2,3, i.e., it stays in place. In order to find the optimal transport plan, we calculate ⟨C,M⟩ for Ma and M b, respectively. The resulting costs of the problem are ∑ a bij CijMij = 10 and ∑ij CijMij = 4. Thus, M b yields the lowest cost and is therefore our optimal transport plan. An extension beyond the conventional bi-marginal optimal transport problem is multi-marginal optimal transport, aiming to devise a transportation plan involving not just two, but multiple distributions. 2.2.2 Multi-marginal optimal transport As previously mentioned, a multi-marginal optimal transport problem have multiple marginals. These marginal distributions are defined as a finite set of T non-negative vectors µ1, . . . , µT ∈ RN+ , i.e., there are T marginals. The transport plan and the corresponding cost of moving mass are both represented by T -mode tensors, M ∈ RNT+ and C ∈ RN T , respectively, where RNT = RN×N×...×N . With this notation, a square N ×N matrix is an element in RN2 . We have that Mi1,...,iT and Ci1,...,iT represent the transported mass and the cost of moving mass, respectively, associated with the tuple (i1, . . . , iT ). The total cost 10 2. Background and theory of transport is therefore given by ⟨C,M⟩ ∶= ∑ Ci1,...,iTMi1,...,iT . i1,...,iT Remember, we wish to find the optimal transport plan M. This of course means that M needs to be a feasible transport plan, which means that it must have the given distributions as its marginals. To this end, the marginals of M are given by the projections Pt(M) ∈ RN+ onto the t:th marginal defined by [Pt(M)]it ∶= ∑ Mi1,...,iT , it = 1, . . . ,N. (2.7) i1,...,it−1,it+1,...,iT Hence M is feasible if Pt(M) = µt, for t = 1, . . . ,T . A generalization of the multi- marginal optimal transport problem is to not necessarily impose marginal constraints on all projections Pt(M), but only for an index subset Γ ⊂ {1, . . . ,T }. This gen- eralization of the discrete multi-marginal optimal transport problem can thus be formulated as min ⟨C,M⟩ M∈RNT+ (2.8) s.t. Pt(M) = µt, t ∈ Γ ⊂ {1, . . . ,T }. Here, the marginals are subjected to equality constraints, but inequality constraints can also be implemented. Solving (2.8) is challenging. It is a linear program, but the number of vari- ables is NT , which quickly becomes overwhelming as the number T of marginals increases. One approach to solving this is to add a small entropy term to the objec- tive function and use the Sinkhorn iterations, see [RHCK21]. This is presented in the following sections. 2.2.3 Entropic regularization and the Sinkhorn iterations As previously stated, solving the linear program by ”brute-force” is not realistic, since the number of variables increases exponentially in the number of marginals. This can be solved by adding an entropy term to the objective function, and applying the Sinkhorn iterations. We start by explaining the bi-marginal case, even though the multi-marginal is of more interest, since the bi-marginal case facilitates a better understanding. 2.2.3.1 Entropic regularization in the bi-marginal case An entropy term, defined as N N D(M) ∶=∑∑ [Mij log (Mij) −Mij + 1], i=1 j=1 11 2. Background and theory is introduced to (2.6), altering the formulation of the optimal transport problem to min ⟨C,M⟩ + ϵD(M) (2.9a) M∈R+N×N N s.t. ∑Mij = µ1,i for i = 1, . . . ,N, (2.9b) j=1 N ∑Mij = µ2,j for j = 1, . . . ,N, (2.9c) i=1 where ϵ > 0 is a regularization parameter. While the addition of the entropy term might appear arbitrary initially, it plays a crucial role in deriving the Sinkhorn iterations. It turns out to lead to a ”sparse” representation in terms of that the optimal transport problem is parameterized by fewer variables (the Lagrangian dual variables) compared to the original formulation. Specifically, the number of parameters is reduced from NT to N ⋅T . These variables can then be efficiently computed using coordinate ascent steps in the Lagrangian dual, see [RHCK21, page 1]. In order to derive an optimal solution to (2.9), we first consider the Lagrangian obtained by relaxing the two equality constraints, L(M,λ(1), λ(2)) = ⟨C,M⟩ + ϵD(M) + (λ(1))T (M1 − µ1) + (λ(2))T (MT1 − µ2), where λ(1), λ(2) ∈ RN are Lagrange multipliers corresponding to the constraints (2.9b) and (2.9c), respectively. The Lagrange dual function ϕ(λ(1), λ(2)) is defined as ϕ(λ(1), λ(2)) = min L(M,λ(1), λ(2)), (2.10) M∈R+N×N and the Lagrange dual problem as max ϕ(λ(1), λ(2)). λ(1),λ(2)∈RN To find the optimal transport plan M , denoted M∗, we exploit the fact that (2.9) is a convex problem, and convex optimization theory applies. By Theorem 2.1.4 we have that if the derivative of L with respect toM is zero in the feasible set, then this is an optimal solution to the optimal transport problem. To this end, we compute the derivative of the function with respect to M and set it to zero, 0 ∂= L(M,λ(1), λ(2) (1) (2)) = Cij + λi + λj + ϵ logMij,∂Mij yielding (1) (2) ∗ (1) (2) Cij + λi + λjMij(λi , λj ) = exp ( − ) ≥ 0.ϵ By defining (1) (2)λ Kij ∶= exp Cij λ j ( − ), u ii ∶= exp( − ) and vj ∶= exp( − ) (2.11) ϵ ϵ ϵ 12 2. Background and theory we have that M∗ takes the form M∗ij =Kijuivj. Moreover, since the minimum of (2.10) is obtained at ∗ (1) (2)M (λi , λj ), we have that ϕ(λ(1), λ(2)) =L(M∗ (1) (2) (λi , λ ), λ (1), λ(2)j ) ∗ (1) (2) ∗ (1) (2) (1) T ∗ (1) (2)=⟨C,M (λi , λj )⟩ + ϵD(M (λi , λj ) − (λ ) (µ1 −M (λi , λj )1) − (λ(2) T (1) (2) ) (µ2 − (M ∗(λi , λj )) T1) N N (1) (2)C + λ + λ N N exp ij i j (1) (2)= − ϵ∑∑ [ ( − ) − 1] −∑λi µ1,i −∑λj µ2,j. i=1 j=1 ϵ i=1 j=1 (2.12) In order to examine whether an optimal solution to the bi-marginal optimal trans- port problem exists, the partial derivatives of (2.12) with respect to (1) and (2)λi λj are set to zero. We obtain the following equations for (1)λi , (1) (2) ∂ϕ N ⎛ Cij + λi + λ0 j ⎞ = (1) =∑ [ exp − − µ1,i], for i = 1, . . . ,N, (2.13) ∂λi j=1 ⎝ ϵ ⎠ and for (2)λj , N (1) (2)⎛ C + λ 0 ∂ϕ exp ij i + λj ⎞ = (2) =∑ [ − − µ2,j], for j = 1, . . . ,N. ∂λ ⎝ ϵ ⎠j i=1 Here, we can utilize coordinate ascent, where one variable vector is chosen to be optimized while the other variable is fixed. We begin with λ(1) by writing (2.13) in terms of K,u and v, which yields 0 ∂ϕ N = (1) =∑ (Kijuivj) − µ1,i ∂λi j=1 N = ui∑ (Kijvj) − µ1,i j=1 = ui(Kv)i − µ1,i. Here, K and µ1 are constant parameters, and the variable vector v is kept fixed. We re-organize the equation in terms of u and get uk+1 = µ1 ⊘ (Kv k). (2.14) Analogously for λ(2), we get vk+1 = µ ⊘ (KTuk2 ). (2.15) The procedure outlined in (2.14) and (2.15) forms the foundation of the Sinkhorn iterations, where k represents each iteration. Initially, arbitrary values for u and 13 2. Background and theory v, denoted as u0 and v0, are chosen. The Sinkhorn iterations begin by computing an initial approximation u1 based on (2.14). Next, (2.15) is used to compute an approximation for v1. This process constitutes the first iteration, and we proceed to the second iteration of the Sinkhorn iterations, where v1 is implemented in (2.14), and so on. Importantly, this algorithm converges linearly and independently of the initial value chosen for v0. For a proof, we refer to [PC19, Theorem 4.2]. With this groundwork laid, we are ready to derive the Sinkhorn iterations for the multi-marginal case. 2.2.3.2 Entropic regularization in the multi-marginal case The entropy term D for the multi-marginal optimal transport problem is defined as D(M) ∶= ∑ (Mi1,...,iT log(Mi1,...,iT ) −Mi1,...,iT + 1), (2.16) i1,...,iT which together with problem (2.8), including inequality constraints, gives the new optimal transport problem min ⟨C,M⟩ + ϵD(M), M∈RNT+ s.t. Pt(M) ≤ µt, for t ∈ Γ≤, (2.17) Pt(M) = µt, for t ∈ Γ=, Pt(M) ≥ µt, for t ∈ Γ≥, where Γ = Γ≤∪Γ=∪Γ≥7, and ϵ > 0 is a regularization parameter. An optimal solution to (2.17) can be noted as M∗ϵ where M∗ϵ →M∗ when ϵ → 0, see [PC19, Proposition 4.1]. The derivation of the Sinkhorn iterations for the multi-marginal case follows the same structure as the bi-marginal case. We begin by defining the Lagrange function of (2.17) as follows: L(M,Λ) = ⟨C,M⟩ + ϵD(M) + Λ(t)∑ (Pt(M) − µt), t∈Γ where Λ(t) ∈ RN for t ∈ Γ are Lagrangian multipliers. The corresponding Lagrange dual function is ϕ(Λ) ∶= min L(M,Λ), (2.18) M∈R+N T and the Lagrange dual problem max ϕ(Λ) Λ(t), t=1,...,T when Λ(t) ∈ RN+ , for t ∈ Γ≤, Λ(t) ∈ RN , for t ∈ Γ=, Λ(t) ∈ RN− , for t ∈ Γ≥. 7Where the pairwise intersection between each subset of Γ is ∅ and Γ ⊆ {1, . . . ,T }. 14 2. Background and theory As in the bi-marginal case, we compute the partial derivative w.r.t. Mi1,...,iT of the dual function with respect toM and set it to zero in order to find an expression for the optimal transport plan, M∗(Λ(t)) 0 ∂ (t)= L(M,Λ) =C M i1,...,iT + ϵ logMi1,...,i +∑Λ ,∂ T iti1,...,iT t∈Γ yielding (t) ∗ ( ) ⎛ CM Λ t exp i1,...,iT +∑t∈ΓΛi ⎞t i1,...,iT ( ) = − .⎝ ϵ ⎠ We define K and u similarly to the terms presented in the bi-marginal case: ⎡ C ⎤ ⎡ ⎤⎢ i1,...,i ⎥ ⎢ (t)⎥ K ∶= exp⎢ − T ⎥ and (t) Λi1,...,iT ⎢ ⎥ ui ∶= exp⎢ i ⎥ϵ ⎢ − ϵ ⎥. (2.19)⎢ ⎥ ⎢ ⎥ ⎣ ⎦ ⎣ ⎦ Our transport tensor can then be expressed as M∗(Λ(t)) = K ⊙U, where U is a rank-one tensor defined by U (t)i1,...,iT =∏ui , (2.20)t t∈Γ which means that we can write U = u(1) ⊗ u(2) ⊗ ⋅ ⋅ ⋅ ⊗ u(T ). The minimum of (2.18) is obtained at M∗(Λ(t)), and we have that ϕ(Λ) = L(M∗(Λ(t)),Λ) N N ⎛ C (t)i1,...,iT +∑ ∈ Λ ⎞ T Nt Γ = −ϵ i⋅ ⋅ ⋅ exp [ − t (t)∑ ∑ ] − 1 −∑∑Λi µt,i. i1=1 iT =1⎝ ϵ ⎠ t=1 i=1 As in the bi-marginal case, we set the derivatives with respect to the elements of Λ to zero, N N N N C (t) 0 ∂ϕ i1,...,iT +∑t∈ΓΛi = ( ) = ∑ ⋅ ⋅ ⋅ ∑ ∑ ⋅ ⋅ ⋅ ∑ exp [ − t ] − µ t t,i . ∂Λi i1=1 it−1=1 it+1=1 it T =1 ϵ Here, we utilize coordinate ascent and, using the notation defined in (2.19), we rearrange the terms as follows u(t),ks+1 = u(t),ks ⊙ µt ⊘ Pt(K⊙Uks), (2.21) where ks represents the Sinkhorn iteration. Note thatU is updated in each Sinkhorn iteration as well, since it depends on u(t), hence the iteration index ks. However, unlike in the bi-marginal case, (2.17) includes inequality conditions. As a result, (2.21) might fall outside the allowed set. If this occurs, we will need to project it back into the allowed set. This results in three different update rules for u(t), depending on the sign of the Lagrangian multipliers. Similar to (2.14) and (2.15), our formulation then takes the form u(t),ks+1 =min(u(t),ks ⊙ µt ⊘ P kst(K⊙U ),1), for t ∈ Γ≤, u(t),ks+1 = u(t),ks ⊙ µt ⊘ P (K⊙Ukst ), for t ∈ Γ=, (2.22) u(t),ks+1 =max(u(t),ks ⊙ µ ⊘ P (K⊙Ukst t ),1), for t ∈ Γ≥, 15 2. Background and theory where min and max denote element wise minimum and maximum, respectively. These iterations continue until they have converged to a feasible solution of the optimal transport problem. This can be measured in multiple ways, but a common way is to proceed until the value ∥u(t),ks+1 − u(t),ks∥2 is smaller than some chosen tolerance level. This is still a challenging computation since the number of terms that must be summed in order to compute Pt grows exponentially with the number of marginals in the problem. However, if the underlying cost tensorC is of a certain graph-structure, the projections can be computed efficiently by vector/matrix multiplications. 2.2.4 Graph-structured problems The graph-structure of interest in this thesis is the tree-structure. The following definition outlines a general tree, and this is what the theory in this thesis is built on. Definition 2.2.1. [HRCK21, Definition 3.1] A graph Υ = (V ,E), with vertices V and edges E , is a tree if it is acyclic and connected. The vertices with degree 1 are called leaves, and we denote the set of leaves with L. For a vertex t ∈ V, the set of neighbors Nt is defined as the set of vertices that have a common edge with t. The path between two vertices t1 and tL is a sequence of edges that connect t1 and tL, such that all edges are distinct. We also denote a path by the set of intermediate vertices. An illustration of a simple tree graph with five nodes can be seen in Figure 2.1 below. The vertices are the nodes, and the edges are the lines that connects the nodes. 2 1 4 3 5 Figure 2.1: A simple tree graph with five nodes. Suppose now that Υ = (V ,E) is a connected graph with T = ∣V ∣ nodes, and consider the optimization problem min ⟨C,M⟩ + ϵD(M) M∈RNT+ (2.23) s.t. Pt(M) = µt, t ∈ Ṽ , where Ṽ ⊆ V is a set of vertices. Now let the cost tensor C have the structure C (t ,t )i1,...,iT = ∑ C 1 2 i , (2.24)t ( )∈E 1 ,it2 t1,t2 16 2. Background and theory where C(t1,t2) ∈ RN×N . This means that the cost term takes the form ⟨C,M⟩ = ∑ Ci1,...,iTMi1,...,iT i1,...,iT (t ,t ) = C 1 2∑ ∑ it ,i Mt i1,...,i1 2 T i1,...,iT (t1,t2)∈E (t ,t ) = 1 2∑ ∑ Cit ,i ∑ Mt i1,...,iT (2.25) ( 1 2t1,t2)∈E it1 ,it2 {i1,...,iT }/{it1 ,it2} (t1,t= C 2 ) ∑ ∑ i [Pt ,t (M)]t i ,i1 ,it2 1 2 t1 t2(t1,t2)∈E it1 ,it2 = ⟨C(t1,t2)∑ , Pt1,t2(M)⟩, (t1,t2)∈E where Pt1,t2(M) denotes the joint, i.e., bi-marginal, projection of the tensor M on the two marginals t1 and t2, given by [Pt1,t2(M)]it ,it ∶= ∑ Mi1,...,iT . (2.26)1 2 {i1,...,iT }/{it1 ,it2} This is called an entropy regularized graph-structured multi-marginal optimal trans- port problem when (2.23) has a cost tensor structured according to (2.24). In this thesis, we refer to this as the standard optimal transport form. The graph-structure of the optimal transport problem allows us to calculate (2.22) by applying the following theorem, which holds for marginal projections, i.e., for (2.7). Theorem 2.2.1. [HRCK21, Theorem 3.2] Let K = exp (−C/ϵ) where C decouples as (2.24) for the tree Υ = (V ,E), and let U = u(1)⊗u(2)⊗⋯⊗u(Υ). Define K(t1,t2) = exp (−C(t1,t2)/ϵ) for (t1, t2) ∈ E . Then the projection on the t:th marginal of K⊙U is of the form Pt(K⊙U) = u(t) ⊙ ⊙α(t,r). r∈Nt The vectors α(t,r), for all ordered tuples (t, r) ∈ E , can be computed recursively start- ing in the leaves of the tree according to α =K(t,r)u(r)(t,r) for r ∈ L, α (t,r) (r)(t,r) =K (u ⊙ ⊙ α(r,ℓ)) for r ∉ L. ℓ∈Nr∖{t} The Sinkhorn iterations can be used to efficiently solve these kinds of optimal transport problems. Bi-marginal projections, as in (2.26), can be computed in a similar way. Proposition 2.2.1. [HRCK21, Proposition 3.3] Let the assumptions in Theorem 2.2.1 hold, let t1, tL ∈ V, and let t1, t2, . . . , tL denote the path between t1 and tL. Then the bi-marginal projection of K ⊙U on the marginals t1 and tL, denoted by Pt1,t (K⊙U), isL ⎛ L−1 ⎞ ∏ diag (u(tℓ) ⊙ ⊙ α(t ,k))K(tℓ,tℓ+1) diag (u(tL) ⊙ℓ ⊙ α(tL,r)), (2.27) ⎝ ℓ=1 r∈Nt ⎠ℓ r∈Nt /{tL L−1} r∉{t1,...,tL} 17 2. Background and theory where α(t1,t2), for (t1, t2) ∈ E , are defined as in Theorem 2.2.1. Theorem 2.2.1 and Proposition 2.2.1 can be simplified in the special case of a so-called path graph, which is a simpler version of a tree. It is defined as follows. Definition 2.2.2. [GY05, page 18] The path graph Pn is a tree characterized by two nodes with a vertex degree of 1, while the remaining n−2 nodes possess a vertex degree of 2. Consequently, a path graph can be visually represented in such a way that all its vertices and edges are positioned along a single straight line. An illustration of a simple path graph with four nodes can be seen in Figure 2.2 below. This type of graph can for example appear when the nodes represent different time instances. 1 2 3 4 Figure 2.2: A simple path graph with four nodes. Operating on path graphs enables a simplification of Theorem 2.2.1 and Propo- sition 2.2.1, which are defined on general trees, to the theorem below. Theorem 2.2.2. [EHJK20, Proposition 2] Given the notations above, it holds that Pt(K⊙U) = u(t) ⊙ φ̂(t) ⊙ φ(t), Pt,t+1(K⊙U) = diag(φ̂(t))diag(u(t))Kt,t+1 diag(u(t+1))diag(φ(t+1)) where φ̂(t) = (K(t−1,t) T ) diag (u(t−1))φ̂(t−1), for t = 2, . . . ,T , φ(t) =K(t,t+1) diag (u(t+1))φ(t+1), for t = 1, . . . ,T − 1. Here, the starting values φ̂(1) and φ(T ) are defined as vectors of ones. 2.2.5 Generalized multi-marginal optimal transport As mentioned in the introduction, our focus in this thesis is generalized optimal transport problems. We repeat the description here for convenience, and comple- ment this with the mathematical expression. A generalization of (2.17) is where not all the marginals are given and fixed. Instead they can be associated with convex cost functions. In the generalized prob- lem, we also impose costs on the bi-marginal projections of the tensor. As before, let Υ T= (V ,E) be a connected graph with T = ∣V ∣ nodes, and let C ∈ R̄N be a cost tensor that takes the form (2.24). The convex graph-structured tensor optimization problems of interest are problems of the form min ⟨C,M⟩ + ϵD(M) + g(t)∑ (Pt(M)) + f (t1,t2)∑ (Pt1,t2(M)) M∈RNT+ t∈V (t1,t2)∈E s.t. Pt(M) = µt, for t , (2.28)∈ VE ⊆ V Pt(M) ≤ µt, for t ∈ VI ⊆ V , 18 2. Background and theory where g(t) and f (t1,t2) are proper, convex, and lower semi-continuous8 functions affecting nodes and edges respectively. In particular, g(t) is the convex cost for the mass in the nodes and f (t1,t2) is the convex cost of moving mass on the edges. This is a convex problem since the functions are convex on a convex domain, and known optimization theory for convex problems applies. A solution algorithm for problems of the form (2.28) has been developed, based on coordinate ascent in the Lagrangian dual of an equivalent problem, see [RHCK21]. In this thesis, we develop a new method. To this end, we examine the entropic proximal gradient method in the next section. 2.3 The entropic proximal gradient method The entropic proximal gradient algorithm, which we wish to apply to (2.28), is an optimization method commonly used for solving convex optimization problems. It combines the gradient descent method with a proximal operator to handle the potentially non-smooth part of the objective function. While the function does not necessarily have to be non-smooth, the method is capable of addressing such cases. To gain a clearer understanding of the entropic proximal gradient method, which does not assume a Euclidean underlying space, we first review the proximal gradient method that operates within a Euclidean space, see [Bec17, Chapter 10]. Consider the following problem min E(y) +G(y), (2.29) y∈RN where RN denotes the feasible set and we assume the following. Assumption 1. [Bec17, Assumption 10.77] (A) G ∶ RN → (−∞,∞] is proper, closed and convex. (B) E ∶ RN → (−∞,∞] is proper, closed and convex; dom(G) ⊆ int(dom(E)) and E is L -smooth9E over int(dom(E)). (C) Let H(y) ∶= E(y) + G(y). The set of optimal solutions of problem (2.29) is nonempty and denoted by Y ∗. The optimal value of the problem is denoted by H(y)opt. The proximal gradient method consists of a gradient step followed by a proximal mapping, which is defined as follows. Definition 2.3.1. (Proximal mapping)[Bec17, Definition 6.1] Given a function f ∶ RN → (−∞,∞], the proximal mapping of f is the operator given by prox 1f(x) = argmin{f(u) + ∥u − x∥2} u∈RN 2 for any x ∈ RN . 8See Definitions 2.1.7, 2.1.4 and 2.1.10. 9See Definition 2.1.11. 19 2. Background and theory The update step for the proximal gradient method is given by yk+1 = argmin 1E(yk) + ⟨∇E(yk), y − yk⟩ +G(y) + 2 ∣∣y − y k∣∣2, y∈RN tk where tk is the step size at iteration k. This can be transformed into yk+1 = argmin 1tkG(y) + k2 ∣∣y − (y − t k k∇E(y ))∣∣ 2, y∈RN which by Definition 2.3.1 is the same as yk+1 = proxt G(yk − tk∇E(yk)).k In this context, we can clearly observe the combination of the gradient step and the proximal mapping. With the proximal gradient method defined, we can proceed to define the gener- alized entropic proximal gradient method. However, we first need to introduce the Bregman distance. Definition 2.3.2. (Bregman distance)[Bec17, Definition 9.2] Let Ψ ∶ RN → (−∞,∞] be a proper, closed and convex function that is differentiable over dom(∂Ψ). The Bregman distance associated with Ψ is the function DΨ ∶ dom(Ψ)×dom(∂Ψ)→ R given by DΨ(x, y) = Ψ(x) −Ψ(y) − ⟨x − y,∇Ψ(y)⟩. The generalized entropic proximal gradient algorithm then reads yk+1 = argmin 1E(yk) +G(y) + DΨ(y, yk) + ⟨∇E(yk), y − yk⟩, (2.30) y∈RN tk where tk denotes the step size. We can express this method as taking a gradient step with respect to E(y) and an entropic proximal step with respect to G(y). To ensure convergence of (2.30), Ψ has to fulfill some assumptions as well. These are listed below. Assumption 2. [Bec17, Assumption 10.78] • Ψ is proper, closed and convex. • Ψ is differentiable over dom(∂Ψ). • dom(G) ⊆ dom(Ψ). • Ψ + δdom(G) is 1-strongly convex. We can see that the first and second point of the assumption is redundant, since it is already required in Definition 2.3.2. If we let ⎪⎧ n Ψ ⎪∑z i=1 zi log zi, for all i such that zi ≥ 0,( ) = ⎨ (2.31) ⎪⎪∞, else,⎩ the normalized Kullback-Liebler relative entropy is obtained as the Bregman dis- tance DΨ(x, y), see [Teb92], n xi DΨ(x, y) = KL(x∥y) =∑xi log + yi − xi. (2.32) i=1 yi 20 2. Background and theory The observant reader might notice the similarity with Definition 2.16 – this is actu- ally its general form, where x =M and y = 1. This choice of function, (2.31), fulfills the requirements in Assumption 2, which will be shown i Section 3.4.1. In the theorem provided below, the convergence rate of the entropic proximal gradient method is given. For proof, we refer to [Bec17, Theorem 10.81]. Theorem 2.3.1. (O(1/k) rate of convergence of the entropic proximal gradient method). Suppose that Assumptions 1 and 2 hold. Let {yk}k≥0 be the sequence of (2.30) generated by the entropic proximal gradient method for solving problem (2.29) with constant step size in which LE ∶= 1t for all k ≥ 0. Thenk (a) the sequence {H(yk)}k≥0 is non-increasing; (b) for any k ≥ 1 and y∗ ∈ Y ∗, LEDΨ(x∗, x0) H(xk) −Hopt ≤ . k 21 2. Background and theory 22 3 The entropic proximal gradient method for solving generalized optimal transport problems In this chapter, we derive a new method for solving problems of the form (2.28), repeated here for convenience, min ⟨C,M⟩ + ϵD(M) + (t)∑ g (Pt(M)) + f (t1,t2)∑ (PT t1,t2(M)),M∈RN+ t∈V (t1,t2)∈E s.t. P (3.1)t(M) = µt, for t ∈ VE ⊆ V , Pt(M) ≤ µt, for t ∈ VI ⊆ V , where we define F (M) ∶= ∑ g(t)t∈V (Pt(M))+∑ (t1,t )( 2t1,t2)∈E f (Pt1,t2(M)). Our method consists of solving a sequence of problems on standard optimal transport form, where the sequence is indexed by ko, i.e., Mko+1 = argmin ⟨C̃ko ,M⟩ + ηD(M), M∈RNT+ s.t. M (3.2)Pt( ) = µt, for t ∈ VE, Pt(M) ≤ µt, for t ∈ VI , where C̃ko ∶= C + ∇F (Mko) − δ log(Mko) and η ∶= ϵ + δ. Here ϵ is the regularization parameter and δ1 denotes the step size in the entropic proximal gradient algorithm. Moreover, the fact that we use the gradient of F means that we restrict ourselves to functions g and f that are continuously differentiable. We have that, under certain conditions, Mko converges to a solution of (3.1), see Theorem 3.4.1. In the upcoming sections, we derive this algorithm, and show that C̃ko has the same graph structure as the underlying constant cost matrix C. Thereafter, each problem (3.2) can be solved using the Sinkhorn iterations. 3.1 Deriving (3.2) and proving that C̃ko has the same graph-structure as C In order to derive (3.2), we utilize the entropic proximal gradient method, as outlined in Section 2.3, for solving (3.1). We know that g and f are proper, convex and 1Note that this is not the indicator function. 23 3. The entropic proximal gradient method for solving generalized optimal transport problems lower semi-continuous functions. However, in order to apply the entropic proximal gradient, it is also necessary for g and f to be Lg,f -smooth, see Assumption 1. This enables the application of (2.30) to (3.1). We take an entropic proximal step with respect to ⟨C,M⟩+ ϵD(M) and the constraints, and a gradient step with respect to F (M), which yields Mko+1 = argmin F (Mko) + ⟨C,M⟩ + ϵD(M) + δKL(M∥Mko) + ⟨∇F (Mko),M −Mko⟩, M∈RNT+ s.t. Pt(M) = µt, for t ∈ VE, Pt(M) ≤ µt, for t ∈ VI . (3.3) Remark 1. We can rewrite ⟨∇F (Mko),M−Mko⟩ = ⟨∇F (Mko),M⟩−⟨∇F (Mko),Mko⟩. Here ⟨∇F (Mko),Mko⟩ is only a constant and is not relevant to solving our optimiza- tion problem since we are only interested in the argument and not the actual value. Thus only ⟨∇F (Mko),M⟩ remains. The same applies to the term F (Mko) – it is a constant and can be ignored in the solving of the optimization problem. However, (3.3) is not on the standard optimal transport form, i.e., not on the form of (2.17). This is the subject of the following sections. 3.1.1 ∇F (Mko) has a graph-structure In order to write (3.3) in the standard optimal transport form, we begin by interpret- ing ⟨∇F (Mko),M⟩. The goal is to show that ∇F (Mko) has the same graph-structure as C, as defined in (2.24). To this end, consider the following definition. Definition 3.1.1. (Fréchet differentiable)[Lue69, page 172] Let T be a transforma- tion defined on an open domain D in a normed space X and having range in a normed space Y . If for fixed x ∈ D and each h ∈ X there exists σT (x;h) ∈ Y which is linear and continuous with respect to h such that lim ∥T (x + h) − T (x) − σT (x;h)∥ = 0, ∥h∥→0 ∥h∥ then T is said to be Fréchet differentiable at x and σT (x;h) is said to be the Fréchet differential of T at x with increment h. Let T ∶ RN → R be a function, and assume that T is Fréchet differentiable throughout RN . We can interpret the variables of Definition 3.1.1 as follows: At a fixed point x ∈ RN , the Fréchet differential σT (x;h) is, by definition, of the form Axh, where Ax is a bounded linear operator from RN to R. Thus, as x varies over RN , the correspondence x→ Ax defines a transformation from RN into R; this transformation is called the Fréchet derivative ∇T of T . Thus we have, by definition, σT (x;h) = ⟨∇T (x), h⟩. This means that the Fréchet derivative can be interpreted as providing directional information about how the function T behaves in the vicinity of a given point x, which is summarized in the following definition. 24 3. The entropic proximal gradient method for solving generalized optimal transport problems Definition 3.1.2. The gradient of T ∶ RN → R in a point x is the element ∇T (x) s.t. for all h ∈ RN , ⟨∇T (x), h⟩ is the Fréchet derivative with increment h. This interpretation aligns with the notion of a derivative in calculus, where it rep- resents the rate of change of a function in a particular direction. In our case, F is a transformation such as T in Definition 3.1.1, which means that lim ∥F (M + δM) − F (M) − σF (M; δM)∥ = 0, (3.4) ∥δM∥→0 ∥δM∥ where δM represents a small change in the input M. To prove that ∇F (Mko) is a tensor with the same graph-structure as C, we will utilize Taylor’s Theorem for multivariate functions, listed below. Theorem 3.1.1. [MT03, Page 160](Multivariate version of Taylor’s theorem) Let f ∶ RN → R be a k-times continuously differentiable function at the point a ∈ RN . Then there exist functions hα ∶ RN → R, where ∣α∣ = k, such that Dαf(a) f(x) = (x − a)α + h (x)(x − a)α∑ ∑ α , ∣α∣≤k α! ∣α∣=k and lim hα(x) = 0. x→a Since we are only interested in the first order Taylor polynomial, the gradient of F is our only concern. Hence, F needs to be at least C1, which is covered by the assumption that g and f are Lg,f -smooth. In the following theorem, we implement Taylor’s Theorem in order to rewrite (3.4) and show that ∇F (Mko) is a tensor with the same graph-structure as C. Theorem 3.1.2. Define F (M) ∶= ∑ g(t)t∈V (Pt(M)) + ∑ (t1,t2)(t1,t2)∈E f (Pt1,t2(M)) where g(t) and f (t1,t2) are proper, convex, lower semi-continuous, and Lg,f -smooth functions. We have that ∇F (Mko) =∑P ∗t (∇g(t)(P kot(M ))) + ∗ (t∑ P (∇f 1,t2) kot1,t2 (Pt1,t2(M ))), t∈V (t1,t2)∈E which is a tensor with the same graph-structure as C, defined as (2.24). Proof. We begin by expressing F (M+δM) in terms of the functions g and f . These functions follow the same pattern, and we focus on g for better readability, and keep f implicit. A difference to keep in mind is that g is a function on the vertices, i.e., nodes, and f on the edges. Doing a Taylor series expansion, we have that F (M + δM) = g(t)∑ (P (M) + P (δM)) + f (t1,t2)t t ∑ (Pt1,t2(M) + Pt1,t2(δM)) t∈V (t1,t2)∈E = g(t)∑ (Pt(M)) + ⟨∇g(t)(Pt(M)), Pt(δM)⟩ + f (t1,t2)∑ (. . . ) +O(δM) t∈V (t1,t2)∈E = F (M) + ⟨∇g(t)∑ (Pt(M)), Pt(δM)⟩ + ∑ ⟨⋅, ⋅⟩ +O(δM), t∈V (t1,t2)∈E 25 3. The entropic proximal gradient method for solving generalized optimal transport problems where O(δM) = h1(M + δM)(δM). When δM → 0, then O(δM)/δM → 0. More- over, we see that F (M+δM) = F (M)+∑t∈V⟨∇g(t)(Pt(M)), Pt(δM)⟩+∑(t1,t2)∈E⟨⋅, ⋅⟩+ O(δM), which gives F (M + δM) − F (M) = (t)∑⟨∇g (Pt(M)), Pt(δM)⟩ + ∑ ⟨⋅, ⋅⟩ +O(δM). t∈V (t1,t2)∈E To understand the element ⟨∇g(t)(Pt(M)), Pt(δM)⟩, first note that this inner prod- uct lives in RN . We consider the inner product with general variables in order to ease notation, which gives N ⟨y,Pt(x)⟩RN =∑ yiPt(x)i i=1 N N N N N =∑ yi ∑ ⋅ ⋅ ⋅ ∑ ∑ ⋅ ⋅ ⋅ ∑ xi1,...,iT (3.5) i=1 i1=1 it−1=1 it+1=1 iT =1 N N = ∑ ⋅ ⋅ ⋅ ∑ yitxi1,...,iT , i1=1 iT =1 where the definition of Pt(x), (2.7), is applied. By the definition of an adjoint oper- ator2, we know that ⟨y,A(x)⟩RN = ⟨A∗(y), x⟩RNT . Reverting back to our notation, with A = Pt, y = ∇g(t)(Pt(M)) and x = δM, we get ⟨∇g(t)(Pt(M)), Pt(δM)⟩RN = ⟨P ∗t (∇g(t)(Pt(M))), δM⟩RNT . (3.6) Comparing the right hand side of (3.6) to Definition 3.1.2, we see that P ∗(∇g(t)t (Pt(M))) is part of the gradient of F (M). We also add the function f to the expression, which follow the same pattern, but is a function on the edges. This means that the gradient of F (M) can be written as ∇F (M) = ∗ (t)∑Pt (∇g (Pt(M))) + P ∗ (∇f (t ,t )∑ 1 2t1,t2 (Pt1,t2(M))). (3.7) t∈V (t1,t2)∈E 2See Definition 2.1.3. 26 3. The entropic proximal gradient method for solving generalized optimal transport problems We now have an expression for the gradient of F , and can express ⟨∇F (Mko),M⟩ by ⟨∇F (Mko),M⟩ = ⟨ P ∗ (t) k∑ ot (∇g (Pt(M ))) t∈V + P ∗ (∇f (t1,t∑ 2)t1,t2 (Pt1,t2(M ko))),M⟩ (t1,t2)∈E RNT = ⟨∑P ∗ (t)t (∇g (Pt(Mko))),M⟩ t∈V RNT + ⟨ P ∗ (t1,t2) k∑ ot1,t2(∇f (Pt1,t2(M ))),M⟩ (t1,t2)∈E RNT (3.8) = ⟨ ∇g(t)∑ (P (Mkot )), Pt(M)⟩ t∈V RN + ⟨ ∇f (t1,t2)∑ (P (Mkot1,t2 )), Pt1,t2(M)⟩ (t1,t2)∈E RN = ⟨∇g(t)(P (Mk∑ ot )), Pt(M)⟩ t∈V RN + ⟨∇f (t1,t ) k∑ 2 (P ot1,t2(M )), Pt1,t2(M)⟩ . (t1,t2)∈E RN Comparing (3.8) to (2.25), we see that ∇F (Mko) has the same graph-structure as C, where ∇g acts on nodes and ∇f acts on edges. It does not matter whether we act on both nodes and edges because we can always rewrite computations on the nodes as computations on the edges by appropriate transformations, see N ⟨Ct, Pt(M)⟩ = ∑ Ct,it[Pt(M)]it it=1 N N = ∑ Ct,it ∑[Pt,t̃(M)]it,it̃ (3.9) it=1 it̃=1 N = ∑ Ct,it[Pt,t̃(M)]it,i ,t̃ it,it̃=1 and we are done. Remark 2. The choice of t̃ in (3.9) is not unique. Furthermore, we have an ex- pression in order to convert marginal projections to bi-marginal projections by N Pt(M) = ∑ αt̃∑[Pt,t̃(M)]it,i ,t̃ t̃∈Nt it̃=1 where ∑t̃∈N αt̃ = 1 and Nt as in Definition 2.2.1.t By Theorem 3.1.2, ∇F (Mko) has the same graph-structure as C, we can combine 27 3. The entropic proximal gradient method for solving generalized optimal transport problems these two, i.e., we can rewrite (3.3) as Mko+1 = argmin ⟨C +∇F (Mko),M⟩ + ϵD(M) + δKL(M∥Mko), M∈RNT+ s.t. M (3.10)Pt( ) = µt, for t ∈ VE, Pt(M) ≤ µt, for t ∈ VI , and we are one step closer to rewriting (3.3) to standard optimal transport form. However, it remains to merge ϵD(M) and δKL(M∥Mko) into a single expression. 3.1.2 Merging ϵD(M) and δKL(M∥Mko) into a single expres- sion Consider the following theorem. Theorem 3.1.3. We have that ϵD(M) + δKL(M∥Mko) = (ϵ + δ)( ∑ Mi1,...,iT log (Mi1,...,iT ) −Mi1,...,iT ) i1,...,iT +M koi1,...,iT log(Mi1,...,iT ) + ϵ + δMkoi ,...,iT ,1 i.e., an entropy term, a linear term, and constants. Proof. The sum of the two entropy terms is ϵD(M) + δKL(M∣∣Mko) =ϵ( ∑ Mi1,...,iT log(Mi1,...,iT ) −Mi1,...,iT + 1) i1,...,iT M log Mi1,...,i+ δ( ( T∑ i1,...,iT ) −M koMk i1,...,iT +M o i1,...,iT ). i1,...,iT i1,...,iT We can rewrite the δ-term as δ( k∑ oi1,...,iT Mi1,...,iT log(Mi1,...,iT )−Mi1,...,iT log(Mi1,...,iT )− M ko koi1,...,iT +Mi ,...,iT ). Here, Mi1,...,iT log(M1 i1,...,iT ) is a linear term. This gives us an entropy term, (ϵ + δ)( ∑ Mi1,...,iT log (Mi1,...,iT ) −Mi1,...,iT ), i1,...,iT a linear term, M ko koi1,...,iT log(Mi1,...,iT ), and constants ϵ + δMi ,...,iT , and we are done.1 We have shown that ϵD(M) and δKL(M) can be merged into a single expression, consisting of an entropy term, a linear term, and constants. Next, we need to incor- porate these into (3.10). The linear term from Theorem 3.1.3,Mi1,...,iT log(Mkoi1,...,iT ), can be added to ⟨C +∇F (Mko),M⟩, which gives ⟨C +∇F (Mko) − δ log(Mko),M⟩. We define C̃ko ∶=C +∇F (Mko) − δ log(Mko), 28 3. The entropic proximal gradient method for solving generalized optimal transport problems and get ⟨C̃ko ,M⟩. As in Remark 1, ϵ and δMkoi ,...,iT are constants and they are therefore not relevant1 to solving our optimization problem and can be removed. We are left with (ϵ + δ)(∑i1,...,iT Mi1,...,iT log (Mi1,...,iT )−Mi1,...,iT ) = (ϵ+δ)(D(M)−NT ). As before, here we handle (ϵ+ δ)NT as in Remark 1, i.e., we remove it. We define η ∶= ϵ+ δ, and we can write (3.10) as Mko+1 = argmin ⟨C̃ko ,M⟩ + ηD(M), M∈RNT+ s.t. M for (3.11)Pt( ) = µt, t ∈ VE, Pt(M) ≤ µt, for t ∈ VI . We have thus reformulated (3.3) into the standard formulation of an optimal trans- port problem, enabling us to apply established theories for its solution. Left to show is that (3.11) is a graph-structured problem with the same underlying graph as (3.1). This is the focus in the next section. 3.1.3 The graph-structure of C̃ko We know that ⟨C̃ko ,M⟩ = ⟨C +∇F (Mko) − δ log(Mko),M⟩ = ⟨C,M⟩ + ⟨∇F (Mko),M⟩ − ⟨δ log(Mko),M⟩. We also know that C has a graph-structure, so computing ⟨C,M⟩ is straightforward. In Section 3.1.1 we proved that ∇F (Mko) also have a graph-structure. Next, we prove that also δ log(Mko) have a graph-structure. Theorem 3.1.4. Given the notation above, C̃ko has the same graph-structure as C, defined as (2.24). Proof. We have that C̃ko = C + ∇F (Mko) − δ log(Mko). By Theorem 3.1.2 we have that ∇F (Mko) is a tensor with the same graph-structure as C. Remains to prove that log(Mko) does also. To demonstrate that log(Mko+1) has a graph-structure, we use proof by induction. First, we can easily define an initial transport matrixM0 asM0 = 1⇒ log(M0) = 0. Assume now that log(Mko) has a graph-structure. This implies that C̃ko also has a graph-structure. From the Sinkhorn iterations in Section 2.2.3.2, we know that Mko+1 can be computed as Kko ⊙Uko , where ⎡ C̃ko ⎤⎢ ⎥ Kko ∶= exp⎢ i1,...,iT ⎥ and Uko (t),ki ,...,iT ⎢ − ⎥ i ,...,iT = u o , (3.12) 1 ∏⎢ η i⎥ 1 ⎣ ⎦ t∈V where ⎡ ⎤ (t),k ⎢ Λ(t),ko ⎥ u oi ∶= exp⎢⎢ − i ⎥⎥. ⎢ η ⎥ ⎣ ⎦ 29 3. The entropic proximal gradient method for solving generalized optimal transport problems Note that the notation Mko+1 = Kko ⊙ Uko originates from the fact that we are working on a sequence of problems, and our cost matrix C̃ko changes with each iteration. We thus have log(Mko+1 ko koi ,...,iT ) = log([K ⊙U ]1 i1,...,iT ) = log([ Kko ⊙ u(t),k∏ ot1,t2 ∏ ] ) (t1,t2)∈E t∈V i1,...,iT C̃ko (t),ko = log exp t1,t2 Λ( ([ ∑ − −∑ ] )) (t1,t2)∈E η t∈V η i1,...,iT C̃ko (t),kot1,t2 Λ = ∑ −∑ . (t1,t2)∈E η t∈V η This has a graph-structure, and as in Remark 2, we can transform this to computa- tions on the edges. Therefore, we conclude that log(Mko+1) has a graph-structure, completing the induction proof. We have now proved that C̃ko has the same graph-structure as C, and previously known theory applies. 3.2 The Sinkhorn iterations At this point, we have our generalized optimal transport problem on standard form, as given in (3.11), and we need to perform Lagrange relaxation on the problem in order to utilize the Sinkhorn iterations. This process was presented in detail in Section 2.2.3.2, and we just present the final result here. We define Kko and U as in (3.12) and (2.20), respectively. As mentioned in the proof of Theorem 3.1.4, we are solving a sequence of problems where our cost matrix C̃ko changes with each iteration, and we need to keep track of the current iteration ko. Note that we do not assign an outer-iteration index to U because it is the variable being updated. The notation Uko denotes the optimal solution to (3.2), i.e., the tensor that the Sinkhorn iterations converge to. We update u(t),ks based on the equality/inequality conditions on our marginals, u(t),ks+1 =min(u(t),ks ⊙ µ ko kst ⊘ Pt(K ⊙U ),1), for t ∈ VI , u(t),ks+1 ( ) (3.13) = u t ,ks ⊙ µt ⊘ Pt(Kko ⊙Uks), for t ∈ VE. With the Sinkhorn iterations derived, we need to clarify the computation of C̃ko , which we do in the next section. 3.3 The computation of C̃ko We have shown that C̃ko has the required graph-structure. However, as mentioned in the proof of Theorem 3.1.2, some functions in (3.11) need a transformation in 30 3. The entropic proximal gradient method for solving generalized optimal transport problems order to correspond dimensionally, since some act on nodes and others on edges. We have that, when writing all functions explicitly, ⟨C̃ko ,M⟩ = ∑ ∑ Cit [P (M)]1 ,it2 t1,t2 it1 ,it2 (t1,t2)∈E it1 ,it2 + ∇g(t)∑∑ (P kot(M ))[Pt(M)]it t∈V it + ∇f (t1,t2)∑ ∑ (Pt1,t2(Mko))[Pt1,t2(M)]it1 ,it2 (3.14) (t1,t2)∈E it1 ,it2 − δ k −1∑ ∑ log(K ot1,t2 )[Pt1,t2(M)]it1 ,it2 (t1,t2)∈E it1 ,it2 + δ log(u(t),ko−1∑∑ )[Pt(M)]it . t∈V it The functions ∇g(t) and log(u(t),ko) only acts on specific nodes, not edges. Therefore we must redefine the way they act on our cost function C̃(t,t+1). This can be done as shown in Remark 2. In the next section we verify the convergence of our algorithm, along with deter- mining an appropriate termination criterion. 3.4 Convergence and termination criterion for the entropic proximal gradient method Now that our algorithm has been formulated, it is crucial to verify its convergence to an optimal solution and determine an appropriate termination criterion for the outer iterations. We commence by proving convergence and subsequently address the termination criterion. 3.4.1 Convergence To prove convergence, we verify that the assumptions outlined in Section 2.3 hold. We begin by splitting the system (3.1) into the two functions G and E as G(M) ∶= ⟨C,M⟩ + ϵD(M) + δA(M), and E(M) ∶= g(t)(P (M)) + f (t ,t )∑ ∑ 1 2t (Pt1,t2(M)), (3.15) t∈V (t1,t2)∈E where A RNT= {M ∈ + ∶ Pt(M) = µt, t ∈ VE ⊆ V , Pt(M) ≤ µt, t ∈ VI ⊆ V}. This is the same as the notation used in Section 2.3. We can assume that a feasible transport matrix M exists, otherwise the transport problem would not be of interest. We first take a look at G(M) that consists of four parts; ⟨C,M⟩, ϵD(M), the constraints andM T∈ RN+ . It is sufficient to show that ⟨C,M⟩ and ϵD(M) fulfill the conditions, since the constraints can be seen as an indicator function δA(M), as in Definition 2.1.8, where A is the set that represents the constraints. This function is obviously proper and lower semi-continuous when the set A, as defined by the 31 3. The entropic proximal gradient method for solving generalized optimal transport problems constraints, is non-empty. Moreover, δA is convex when the set A is convex, as is the case here. The inner product ⟨C,M⟩ is linear in M, ensuring its convexity and lower semi- continuity. Furthermore, to guarantee the existence of a solution to the transport problem such that ⟨C,M⟩ < ∞, we assume that C does not contain ∞. Even if a feasible transport matrix exists, this condition is necessary. Consequently, ⟨C,M⟩ is considered proper, as we previously assumed the existence of a feasible transport matrix. The function ϵD(M) consists of the terms∑i1,...,iT (Mi1,...,iT log(Mi1,...,iT )−Mi1,...,iT + 1), which is just a sum of functions of the type f(x) ∶= x log(x) − x + 1. Since the second derivative of f is positive for all x > 0, the function f is convex and obvi- ously proper. We also know that each component of f is continuous, hence f is also continuous and obviously lower semi-continuous. Before stating the arguments for g(t) and f (t1,t2) it is important to note that they do not operate on the space RNT on their own, as seen in (3.15), where for example g(t) ∶ RN → R, f (t1,t2) RN2∶ → R. But as a composite function with Pt and Pt1,t2 respectively, we get (g(t) T ○ P ) ∶ RNt → R, (f (t1,t2) T ○ Pt1,t2) ∶ RN → R. Both g(t) and f (t1,t2) can handle M ∈ RNT since the tensors are projected onto vectors and matrices before being processed by these functions. To satisfy the requirements in Assumption 1, both functions f and g must be proper, convex, closed, and LE-smooth over the interior of the domain of E, denoted as int(dom(E)). To demonstrate that dom(G) ⊆ int(dom(E)), it is sufficient to show that RN+ ⊆ int(dom(g(t) 2 T )) and RN+ ⊆ int(dom(f (t1,t2))), since dom(G) ⊆ RN+ . As mentioned in Section 2.3, choosing Ψ(z N) = ∑i=1 zi log zi in Definition 2.3.2 will produce the normalized Kullback-Liebler relative entropy term (2.32). It now remains to show that the chosen Ψ fulfills the conditions in Assumption 2. As mentioned before, the second derivative of Ψ is positive for all x > 0 and hence Ψ is proper convex for all x > 0. Ψ is also continuous since the product of two continuous functions are continuous. It is obviously differentiable over dom(∂Ψ) and dom(G) ⊆ dom(Ψ), given that the domain of Ψ is RNT+ . Lastly, Ψ + δdom(G) is 1-strongly convex by Theorem 2.1.5. We have now proved the following theorem. Theorem 3.4.1. Assume that there exists a feasible transport tensor M, C does not contain ∞, both functions g ∶ RN → R and RN2f ∶ → R are Lg,f -smooth over int(dom(g + f)). Then, if RN ⊆ int(dom(g(t)+ )) and RN 2 + ⊆ int(dom(f (t1,t2))), we have that the algorithm Mko+1 = argmin ⟨C̃ko ,M⟩ + ηD(M), M∈RNT+ s.t. Pt(M) = µt, for t ∈ VE, Pt(M) ≤ µt, for t ∈ VI , 32 3. The entropic proximal gradient method for solving generalized optimal transport problems converges to an optimal solution of (3.1). Lastly, we need a termination criterion for our method, which is presented in the next section. 3.4.2 Termination criterion Next we examine what termination criterion we need for the outer level of the algorithm. For the inner level, we can use the square error, ∥u(t),ks+1 − u(t),ks∥2, compared to some chosen tolerance level. As a criterion to terminate the outer level, we can implement ∑i1,...,iT (M ko+1 k 2o i1,...,iT −Mi1,...,iT ) . If this is less than ε > 0, then the difference between Mko+1 and Mko are small, and we can terminate the algorithm. We can calculate this efficiently by utilizing the following theorem, where M2i ,...,iT denotes the element wise square.1 Theorem 3.4.2. Following the notation above, we have that 2 ∑ (Mko+1 koi1,...,iT −Mi ,...,iT )1 i1,...,iT can be more efficiently computed by N N N ∑ [P1((Mko+1)2)] − 2 [P (Mko+1∑ 1 ⊙Mko)] +∑ [P k 21((M o) )] . i=1 i i=1 i i=1 i Proof. We have that T −1 T M (t,t+1) (t)i1,...,iT =Ki1,...,iTUi1,...,iT = (∏ Ki ,i + )(∏ui ),t t 1 t t=1 t=1 and T −1 T M2 (t,t+1) 2 (t) 2 i1,...,iT = (∏ (Ki ,i ) )(∏ (u ) ).t t+1 it t=1 t=1 This means that M2 has the same graph-structure as M. For ease of notation, denote Mko+1 as M̂ koi ,...,iT i1,...,iT and Mi ,...,iT as Mi1,...,iT . We have1 1 2 ⎡ T −1 T T −1 T ⎤ 2 ⎢ M̂ M ⎢ (t,t+1) (t) (t,t+1) (t) ⎥ ( ⎥i1,...,iT − i1,...,iT ) =⎢(∏ K̂it,i )( û ) − ( Kt+1 ∏ i ∏t it,i )( u )⎢ t+1 ∏ it ⎥ ⎣ t=1 t=1 t=1 t=1 ⎥ ⎦ T −1 T (t,t+1) 2 (t) 2 =(∏ (K̂i ,i + ) )(∏ (ûi ) )t t 1 t t=1 t=1 T −1 T T −1 T (3.16) 2 (t,t+1) (t) (t,t+1) (t)− (∏ K̂it,it+ )(∏ ûi )( K1 ∏t i ,i + )(∏ut t 1 i )t t=1 t=1 t=1 t=1 T −1 ( 2 T 2 t,t+1) (t) + (∏ (Ki ,i + ) )(∏ (u ) )t t 1 it t=1 t=1 = M̂2i1,...,iT − 2M̂ 2 i1,...,iTMi1,...,iT +Mi1,...,iT . 33 3. The entropic proximal gradient method for solving generalized optimal transport problems Moreover, we have that [Pt(M)]it = ∑ Mi1,...,iT . i1,...,iT /it If we here sum over an arbitrary index, say 1, we get N ∑ [P1(M)]i1 = ∑ Mi1,...,iT . i1=1 i1,...,iT This, together with (3.16), gives us N N ∑ (M̂ 2 2 i1,...,iT −Mi1,...,iT ) =∑ [P1((M̂) )] − 2∑ [P1(M̂⊙M)] i1,...,iT i=1 i i=1 i N + [P ((M)2∑ 1 )] , i=1 i and we are done. Next we present the algorithm in pseudo code, in order to facilitate better un- derstanding. 34 3. The entropic proximal gradient method for solving generalized optimal transport problems 3.5 The algorithm in pseudo code In order to facilitate a better understanding of the algorithm, we present it in pseudo code format, for both general trees and path graphs. The algorithm has two ”levels”, repeatedly solving (3.11) using the Sinkhorn iterations. Let us call the Sinkhorn iterations as the inner level, and solving (3.11) repeatedly, the outer level. The outer level is almost the same for both general trees and path graphs. The difference lies in the projections of the bi-marginal projections, and we present the algorithms in the following sections. 3.5.1 General trees The inner level for general trees is presented in Algorithm 1, which is from [HRCK21, Algorithm 3.1], and the outer level is presented in Algorithm 2. Algorithm 1 Sinkhorn method for the tree-structured multimarginal optimal trans- port problem Given: Tree Υ = (V ,E) with leaves L = {1,2, . . . , ∣L∣} Initial guess u(t) for t ∈ L for all ordered tuples (t1, t2) ∈ E do Initialize α(t1,t2) by α =K(t,r)u(r)(t,r) for r ∈ L, α =K(t,r)(t,r) (u (r) ⊙ ⊙ α(r,ℓ)) for r ∉ L ℓ∈Nr∖{t} end for Initialize t ∈ L while Sinkhorn not converged do u(t) ← µt ⊘⊙k∈N αt (t,r) for (t1, t2) ∈ E on the path from t to (t + 1 mod ∣L∣) do Update α(t1,t2) by α =K(t,r)(t,r) u (r) for r ∈ L, α =K(t,r)(t,r) (u (r) ⊙ ⊙ α(r,ℓ)) for r ∉ L ℓ∈Nr∖{t} end for t← t + 1 mod ∣L∣ end while return u(t) for t ∈ L 35 3. The entropic proximal gradient method for solving generalized optimal transport problems Algorithm 2 Entropic proximal gradient method for the graph-structured multi- marginal optimal transport problem Function Generate ∇F (Mko) ∇g(P (Mkot ))← α∇g1(P (Mkot )) + (1 − α)∇g2(P (Mkot )) ∇F (Mko)← ∇g(Pt(Mko)) +∇f(P kot1,t2(M )) return ∇F (Mko) Function Generate C̃(t,t+1),ko Generate ∇F (Mko) u1 ← u(t) ⋅ 1T u2 ← 1 ⋅ u(t) T u← αu1 + (1 − α)u2 C̃(t,t+1),ko ← C(t,t+1) +∇F (Mko) − δ(log(K(t,t+1),ko−1) − log(u)) return C̃(t,t+1),ko Initial guess K0 = 1, Pt1,t2(Mko) = 1; while M not converged do for t in range(T ) do Generate C̃(t,t+1),ko ( +1) exp C̃(t,t+1),kt,t ,k oK o ← ( − )η end for u, φ̂, φ← Algorithm 1 for t in range(T − 1) do ⎛ L−1 ⎞ Pt1,t2(Mko)← diag (u(t )∏ ℓ ⊙ ⊙ α(t ,k))K(tℓ,tℓ+1) diag (u(tL)⊙ℓ ⊙ α(t )L,r) ⎝ ℓ=1 r∈Nt ⎠ℓ r∈Nt /{tL L−1} r∉{t1,...,tL} end for ko ← ko + 1 end while return P (Mkot ) 3.5.2 Path graphs The inner algorithm for path graphs is presented in Algorithm 3, which is based on [HRCK21, Algorithm 3.1], but adapted for path graphs. The outer algorithm is presented in Algorithm 4. 36 3. The entropic proximal gradient method for solving generalized optimal transport problems Algorithm 3 Sinkhorn method for the path graph-structured multimarginal opti- mal transport problem Function Generate φ φ(T −1) = 1 φ(T −2) ←K(T −2,T −1) diag (u(T −1)) for t in range(T -3, -1, -1) do φ(t) ←K(t,t+1) diag (u(t+1))φ(t+1) end for return φ Given: K, T marginals denoted µ, node size N , types = ’<’,’=’; Initial guess u(t) = 1, for t = 1, . . . ,T , φ̂(0) = 1; while Sinkhorn not converged do Generate φ for t in range(T ) do Pt(Mko)← u(t) ⊙ φ̂(t) ⊙ φ(t) u (t) (t)tmp ← u ⊙ µ ⊘ P (Mkot ) if type = ’<’ then u(t) =min(utmp,1) else if type = ’=’ then u(t) = utmp else if type = ’>’ then u(t) =max(utmp,1) end if if t = 0 then T φ̂(t+1) ← (K(t−1,t)) (u(t−1))φ̂(t−1) else if t ≠ T − 1 then T φ̂(t+1) ← (K(t−1,t)) diag (u(t−1))φ̂(t−1) end if end for ks ← ks + 1 end while Generate φ return u, φ̂, φ 37 3. The entropic proximal gradient method for solving generalized optimal transport problems Algorithm 4 Entropic proximal gradient method for the graph-structured multi- marginal optimal transport problem Function Generate ∇F (Mko) ∇g(Pt(Mko))← α∇g1(P (Mko)) + (1 − α)∇g2(P (Mkot t )) ∇F (Mko)← ∇g(P kot(M )) +∇f(Pt,t+1(Mko)) return ∇F (Mko) Function Generate C̃(t,t+1),ko Generate ∇F (Mko) u1 ← u(t) ⋅ 1T T u2 ← 1 ⋅ u(t) u← αu1 + (1 − α)u2 C̃(t,t+1),ko ← C(t,t+1) +∇F (Mko) − δ(log(K(t,t+1),ko−1) − log(u)) return C̃(t,t+1),ko Initial guess K0 = 1, P kot,t+1(M ) = 1; while M not converged do for t in range(T ) do Generate C̃(t,t+1),ko ( +1) exp C̃(t,t+1),kK t,t ,k oo ← ( − )η end for u, φ̂, φ← Algorithm 3 for t in range(T − 1) do Pt,t+1(Mko)← diag(φ̂(t))diag(u(t))K(t,t+1),ko diag(u(t+1))diag(φ(t+1)) end for ko ← ko + 1 end while return P (Mkot ) 38 4 Implementation and numerical results In this chapter, we present the implementation of Algorithms 3 and 4, choice of parameter value, and generated numerical results. While the theory in this the- sis is developed for general trees, our implementation will be on path graphs, for simplicity. 4.1 Implementation The implementation of the entropic proximal gradient method has been done in Python. We continue by defining a generalized multi-marginal optimal transport problem by defining each parameter, such as the constant cost tensor C, functions g and f . 4.1.1 Defining a generalized multi-marginal optimal trans- port problem and its components We define a generalized multi-marginal optimal transport problem as in (3.1), but on a path graph. Let G = (V ,E) be a path graph with T = ∣V ∣ marginals of size N , i.e., the marginals are 1-dimensional. In order to rewrite this problem to the form (3.2), we need to define all functions and constraints. We begin by defining an underlying constant cost tensor C with the structure (2.24). We take the interval [0,1], and discretize it into N equidistant points, and define C as the tensor of all distances between points, i.e., the value of entry Cij is given by j∣ i − 2N N ∣ . We choose a variety of values for the regularization parameter ϵ and the step size δ for the entropic gradient method. The marginals associated with the problem are 1-dimensional, i.e., vectors. Their values are generated from a normal distribution, meaning the highest concentration of mass is at the center of the marginals. The start and end marginals are restricted by equality constraints while the marginals in between are restricted by inequality constraints. In this setting, we choose to set each value of µt, t ∈ VI , to be less or equal to the total sum of µt, t ∈ VE. As mentioned in Section 3.4.1, g and f must be chosen carefully so that Assump- tion 1 is not violated. We opt for the ℓ2-norm as g(t) and the Frobenius norm for f (t1,t2). Both these norms treat elements in vectors/matrices the same, as exempli- fied by 39 4. Implementation and numerical results ∣∣x∣∣2ℓ = x 2 1 + x 2 2 +⋯ + x 2 2 N , N N ∣∣A∣∣2F =∑∑A 2 ij. i=1 j=1 Moreover, all p-norms (1 < p < ∞) are smooth, including the Frobenius norm. They are proper, convex, and continuous, thereby ensuring lower semi-continuity as well. The domain of G is obviously included in the interior of the domain of E, since the domain of E is RNT , where the interior is the set itself. Consequently, the relation dom(G) ⊆ int(dom(E)) holds true. Now, we explicitly present what g and f are. As previously stated, we implement the ℓ2-norm for g, but we also introduce a parameter νt. This represents a shift in where the mass is centered. Without νt, the µt-term would only smooth the mass around the center, which is not of interest. We can define νt in various ways. For example, we can define it as a vector where the first half is zeros, and the second half is ones, imposing a heavy penalty for the mass on the zero side. We get g(t)(P (Mk 1o)) = ∥P (Mkot 2 t ) − ν 2 t∥ℓ .2 We need the gradient which is ∇g(t)(Pt(Mko)) = Pt(Mko) − νt. For f , we implement the Frobenius norm, f (t1,t2)(P (Mko)) = ∥P (Mko)∥2t1,t2 t1,t2 F , with the gradient ∇f (t1,t2)(Pt1,t2(Mko)) = 2P kot1,t2(M ). With these functions and constraints, we can write our problem as in (3.2): Mko+1 = argmin ⟨C̃ko ,M⟩ + ηD(M), M∈R+N T s.t. (4.1)Pt(M) = µt, for t ∈ VE, Pt(M) ≤ µt, for t ∈ VI . We select the stopping criterion for the inner loop as the iteration at which the absolute square of the difference between u(t),ks+1 and u(t),ks is less than a predeter- mined tolerance level. For the outer level, we implement a surrogate to Theorem 3.4.2 due to time limitations. Instead of comparing Mko+1 and Mko , we compare P (Mko+1t ) and P (Mkot ). This approach does not yield the same accuracy as the theorem, but it provides a sufficiently good approximation. As mentioned in Section 3.3, ∇g and log(u(t),ks) are both vectors, differing in dimension from the other functions. How this can be handled is shown in the next section. 40 4. Implementation and numerical results 4.1.2 Transformation from nodes to edges The problem with ∇g and log(u(t),ks) is that both are vectors, differing in dimension from C. However, referring back to Remark 2, we know that this is not an issue. By multiplying the g-function with a vector of ones and the correct constants, we obtain a matrix with the desired proportions. The same modifications can be applied analogously to log(u(t),ks). In the following example, we demonstrate this concept, focusing on the g-function. 4.1.2.1 Example of cost-transformation For simplicity, assume that we have a problem that only depend on C and ∇g(t), ignoring the other functions present in C̃ko , (4.1). Let the cost matrix C(t,t+1) be defined by ⎡ ⎢0 1 2 3⎤⎥ ⎢ ⎥ Ct,t+1 ⎢1 0 1 2⎥ = ⎢ ⎢ ⎢2 1 0 1 ⎥ , ⎥ ⎥ ⎢ ⎣3 2 1 0⎥⎦ and ∇g(t) and ∇g(t+1) as ⎡ ⎢1⎤ ⎡ ⎤⎥ ⎢5⎥ ⎢ ⎥ ⎢ ⎥ (t) ⎢2⎥ ⎢6⎥∇g = ⎢ ⎥ , and ∇g(t+1)3 = ⎢ ⎥ .⎢ ⎥ ⎢⎢ ⎥ ⎢7⎥⎥ ⎢ ⎣4⎥ ⎢8⎥⎦ ⎣ ⎦ In this context, ∇g(t) and ∇g(t+1) represent additional costs associated with a pair of marginals. Specifically, ∇g(t) indicates that it costs 1 additional cost-unit to transport mass from position 1 to any other position, including remaining in place. This principle applies uniformly across all positions; for example, moving mass from position 4 to any other position incurs an extra cost of 4. Similarly, ∇g(t+1) applies to marginal t+1, meaning that transporting mass to position 1 incurs an additional cost of 5. To facilitate computation, we need to transform ∇g(t) such that every row of C(t,t+1) is incremented by ∇g(t). First off, by multiplying ∇g(t) by the vector 1T will result in ⎡1⎤ ⎡⎢ ⎥ ⎢1 1 1 1⎤⎥ ⎢ ⎥ ⎢ ⎥ (t)1T ⎢2⎥ 1 1 1 1 ⎢2 2 2 2⎥∇g = ⎢3⎥ [ ] = ⎢3 3 3 3⎥ .⎢ ⎥ ⎢ ⎥⎢ ⎥ ⎢ ⎥ ⎢4⎥ ⎢4 4 4 4⎥⎣ ⎦ ⎣ ⎦ Similarly we get ⎡ ⎢1⎤ ⎡⎥ ⎢5 6 7 8⎤⎥ ⎢1⎥ ⎢1 (t+1) T ⎢ ⎥ 5 6 7 8 ⎢5 6 7 8 ⎥ ⎥ (∇g ) = ⎢1⎥ [ ] = ⎢⎢ ⎥ ⎢⎢ ⎥ ⎢5 6 7 8 ⎥ . ⎥ ⎥ ⎢ ⎥ ⎢ ⎣1⎦ ⎣5 6 7 8⎥⎦ 41 4. Implementation and numerical results Our calculations are now complete and the costs can be added to C(t,t+1) without complications. Denote the final cost as (t,t+1)C , and we get that (t,t+1) C = C(t,t+1) T + α∇g(t)1T + (1 − α)1(∇g(t+1)) ⎡ ⎤ ⎡ ⎢0 1 2 3⎥ ⎢1 1 1 1⎤ ⎡ ⎤⎥ ⎢5 6 7 8⎥ ⎢1 0 1 2⎥ ⎢⎢ ⎥ ⎢2 2 2 2⎥ ⎢⎥ ⎢5 6 7 8⎥⎥ = ⎢2 1 0 1⎥ + α ⎢3 3 3 3⎥ + (1 − α) ⎢ ⎥⎢ ⎥ ⎢ ⎥ ⎢5 6 7 8⎥⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎣3 2 1 0⎦ ⎣4 4 4 4⎥ ⎢ ⎥⎦ ⎣5 6 7 8⎦ ⎡ ⎢5 − 4α 7 − 5α 9 − 6α 11 − 7α⎤⎥ ⎢ ⎢6 − 3α 6 − 4α 8 − 5α 10 − 6α⎥⎥ = ⎢ ⎢ ⎢7 − 2α 7 − 3α 7 − 4α 9 5 ⎥ . − α ⎥⎥ ⎢ ⎣ 8 − α 8 − 2α 8 − 3α 8 − 4α ⎥⎦ The constant α ensures that calculations are not repeated. Specifically, the cost ∇g(t+1) has already been incorporated into the next cost matrix (t+1,t+2)C during the previous iteration. This adjustment is unnecessary at the leaves, as they do not receive the cost twice. 4.1.3 The log-sum-exp-method It is known that optimal transport problems can experience numerical instability for small values of ϵ. A way to make the algorithms more numerically stable is to apply the log-sum-exp-method in order to do calculations in the log-domain with addition and subtraction, instead of matrix/vector multiplication, see [Sch19] and [Win19]. This means that we need to take the logarithm of our functions and variables, and adjust initial values. As many as possible of our calculations is done in the log-domain, and are then converted back to the real domain by applying the exponential function. Our variables are straightforward to compute, for example taking log(u(t)) is taking the logarithm of the elements in u(t). However, our functions require some calculations. From Theorem 2.2.2 we have that the marginal projections can be calculated like P (K⊙U) = u(t)⊙φ̂(t)t ⊙φ(t) and the bi-marginal P +1(K⊙U) = diag(φ̂(t))diag(u(t))K(t,t+1)t,t diag(u(t+1))diag(φ(t+1)). For simplicity, we do this for projections with the standardM from Section 2.2.4, i.e., without the iteration index – it also holds for Mko . We need to take the logarithm of the projections, log(Pt(K⊙U))) = log(u(t) ⊙ φ̂(t) ⊙ φ(t)) = log(u(t)) + log(φ̂(t)) + log(φ(t)), and log(Pt,t+1(K⊙U)) = log (diag(φ̂(t))diag(u(t))K(t,t+1) diag(u(t+1))diag(φ(t+1))). Now we examine what it means to take the logarithm of diag(a)B, for a general 42 4. Implementation and numerical results vector a and matrix B. We have that, by definition of the logarithm, diag N(a)B = [aiBij] ⇒i,j=1 log(diag(a)B) = [ log N(aiBij)]i,j=1⇒ (4.2) log N N[ (a T NiBij)]i,j=1 = [ log(ai)1 ] =1 + [log(Bij)] .i,j i,j=1 Analogously we get [ log N N T N(Bijaj)] =1 = [ log(Bij)] =1 + [1( log(aj)) ] =1. (4.3)i,j i,j i,j Repeating (4.2) and (4.3) for log(Pt,t+1(K⊙U)), we get log(Pt,t+1(K⊙U)) = [ log (t) N (t) N (t,t+1) N (φ̂i )1T ] =1 + [ log (u T i )1 ] =1 + [ log (Ki,j i,j ij )]i,j=1 1 log (t+1) T N 1 log (t+1) T N+ [ ( (uj )) ] + [ ( (φ )) ]i,j=1 j i,j=1. (4.4) This is the same idea as shown in Example 4.1.2.1, which is an implementation Remark 2. The calculations of log(φ̂(t)) and log(φ(t)) are a bit more complex. We begin with log(φ(t)) and get log(φ(t)) = log(K(t,t+1) diag(u(t+1))φ(t+1)) N ⎛ log exp Cij (t+1) ⎞ = [ ( − )uj ] φ (t+1) ⎝ ϵ i,j=1 ⎠ N ⎛ T ⎞ = log exp Cij[ (− + 1( log (t+1)(uj )) )] φ(t+1) ⎝ ϵ i,j=1 ⎠ ⎛ N T ⎞ = log ∑ exp Cij (− + 1( log (t+1) (t+1)(uj )) )φj ⎝ j=1 ϵ ⎠ ⎛ N T log Cij (t+1) (t+1) T ⎞ = ∑ exp(− + 1( log (u )) + 1( log (φ )) ) . ⎝ j=1 ϵ j j ⎠ Analogously we get that ⎛ N CT log (t) log exp ij 1 log (t−1) T ( Tt−1) ⎞ (φ̂ ) = ∑ (− + ( (uj )) + 1( log (φj )) ) . ⎝ j=1 ϵ ⎠ An important thing to note is that everything that was initialized as ones, should now be initialized as zeros, since log(1) = 0. Furthermore, 1 in (3.13) is replaced by 0, i.e., a vector of zeros. The observant reader might have already noticed, but when applying the log-sum-exp-method we no longer implement K. Instead we operate directly on −Cϵ . Next we present our numerical results, with and without the log-sum-exp-method. 43 4. Implementation and numerical results 4.2 Numerical results In this section, we present several numerical examples to illustrate the functionality of our algorithm. We define our problem to have T = 32 marginals, each of size N = 26. As mentioned in Section 4.1.1, for marginal t = 1 and t = T equality constraints hold, while inequality constraints holds for the remaining marginals. Throughout the examples, we assume a discretized normal distribution for both the initial, t = 1, and final, t = T , marginals. This setup implies that the mass is centered at the marginals with varying means, while the standard deviation is fixed at σ = 4. Specifically, the initial marginal has a mean of m1 = N/3, and the final marginal has a mean ofm2 = N/3+T /3. Moreover, the parameter ν is employed to ensure that the resulting distributions approximate normal distributions with the specified means. In Figure 4.1 we have implemented the algorithm with the simple cost functions g and f , defined in Section 4.1.1, with and without the log-sum-exp-method. 0.010 0.010 0.008 0.008 0.006 0.006 0.004 0.004 0.002 0.002 0.000 0.000 30 30 25 25 0 20 20 5 15 0 15 In 10 10 5 d 10 10 ex on 15 m 20 5 Index a on 15 rg i m 20 5 na 25 0 al rginal 25 0 (a) Outer count: 178 (b) log-sum-exp. Outer count: 176 Figure 4.1: Reference picture with simple cost functions. Both the regular method and log-sum-exp-method yield similar results. The reg- ular method requires two more iterations than the log-sum-exp-method; otherwise, their performance is very similar. Note that the outer count refers to the number of outer iterations performed before the algorithm converges. 4.2.1 No penalty from ∇F (M) for marginal t > N/4 and t < N/2 In this section we start to experiment with the penalties. For marginals where t > N/4 and t < N/2 we chose to not let ∇F (M) have an impact on the cost. This is achieved by setting g(t) and f (t,t+1) to zero at each node, or edge, associated with node t, where N/4 < t < N/2. The result can be seen in Figure 4.2. As in Figure 4.1, we get very similar results between the two methods. 44 Marginal number t Value of marginal t Marginal number t Value of marginal t 4. Implementation and numerical results 0.010 0.010 0.008 0.008 0.006 0.006 0.004 0.004 0.002 0.002 0.000 0.000 30 30 25 25 0 20 20 5 15 0 15 In 10 5 10 d 10 Ind 10ex on 15 5 ex 15 m 5argi 20 on m 20 nal 25 0 arginal 25 0 (a) Outer count: 178. (b) log-sum-exp. Outer count: 176. Figure 4.2: No penalty from ∇F (M) for marginal t > N/4 and t < N/2. ϵ = 0.1, δ = 0.1 We then continued with the same conditions but experimented with different values for ϵ and δ. We began by decreasing ϵ as much as possible before the methods diverged and then proceeded to decrease δ. This approach led to different optimal values of the problem (4.1) for different values of ϵ and δ, highlighting that the methods require different parameter settings to maintain stability and convergence. 0.010 0.010 0.008 0.008 0.006 0.006 0.004 0.004 0.002 0.002 0.000 0.000 3025 30 25 0 20 5 15 0 20 I 10 10 5 15 ndex on 15 5 Ind 10 10 margi 20n 0 ex on 15 a 25 m 5 l argi 20nal 25 0 (b) log-sum-exp. Outer count: 69. (a) Outer count: 26. ϵ = 0.07, δ = 0.07 ϵ = 0.09, δ = 0.07 Figure 4.3: No penalty from ∇F (M) for marginal t > N/4 and t < N/2. While observing Figure 4.3, we can see that the methods can handle similar values of ϵ and δ, but the log-sum-exp method requires a larger ϵ to converge. Moreover, the log-sum-exp method requires more outer iterations to achieve convergence. 4.2.2 No penalty from ∇F (M) if marginal t mod 2 = 1 Next we implement the condition that ∇F (M) does not yield a penalty if marginal t mod 2 = 1. We begin with standard values ϵ = δ = 0.1, see Figure 4.4. Here, both methods yield the same result, with the same number of outer iterations. 45 Ma Mrg ai rn ga inl n alu nm ub me br et r t Value of marginal t Value of marginal t Ma Mrg ai rn ga inl n alu nm ub me br et r t Value of marginal t Value of marginal t 4. Implementation and numerical results 0.010 0.010 0.008 0.008 0.006 0.006 0.004 0.004 0.002 0.002 0.000 0.000 30 30 25 25 0 20 0 20 5 15 15 In 10 10 5 d 10 10 ex I no dn 15 e m 5 x a or 20 n 15 m 20 5gina al 25 0 rginal 25 0 (a) Outer count: 255. (b) log-sum-exp. Outer count: 255. Figure 4.4: No penalty from ∇F (M) for marginal t mod 2 = 1. ϵ = 0.1, δ = 0.1. As before, we decreased ϵ as much as possible before the methods diverged and then proceeded to decrease δ. We see that the log-sum-exp can handle slightly smaller values for ϵ. Moreover, it manages smaller values of δ and requires more outer iterations to converge. 0.010 0.010 0.008 0.008 0.006 0.006 0.004 0.004 0.002 0.002 0.000 0.000 3025 30 0 2025 5 15 0 20 I 10 10 5 15 ndex on 15 10 10 marg 20 5 Ind ine 15 al 25 0 x on m 5argi 20nal 25 0 (b) log-sum-exp. Outer count: 44. (a) Outer count: 29. ϵ = 0.09, δ = 0.1. ϵ = 0.08, δ = 0.03. Figure 4.5: No penalty from ∇F (M) for marginal t mod 2 = 1. 4.2.3 No penalty from ∇F (M) if marginal t mod 2 = 1 and/or if marginal t > N/4 and t < N/2 Let us now combine the previous two sections into one by implementing a condition where there is no penalty from ∇F (M) if the marginal t mod 2 = 1 and/or if t > N/4 and t < N/2. As before, we begin with standard values ϵ = δ = 0.1. This yielded Figure 4.6, where we observe that both algorithms produce the same result at the same outer count. 46 Ma Mrg ai rn ga inl n alu nm ub me br et r t Value of marginal t Value of marginal t Ma Mrg ai rn ga inl n alu nm ub me br et r t Value of marginal t Value of marginal t 4. Implementation and numerical results 0.010 0.010 0.008 0.008 0.006 0.006 0.004 0.004 0.002 0.002 0.000 0.000 30 30 25 25 0 2015 0 20 5 5 15 In 10 10d 10 10ex 15 5 In o dn e m x on 15 5 arg i 20 mnal 25 0 argi 20nal 25 0 (a) Outer count: 255. (b) log-sum-exp. Outer count: 255. Figure 4.6: No penalty from ∇F (M) if marginal t mod 2 = 1 and/or if marginal t > N/4 and t < N/2. ϵ = 0.1, δ = 0.1. When we decrease ϵ and δ, see Figure 4.7, we observe that the log-sum-exp- method can handle smaller values of δ while still converging. The outer count is larger for the log-sum-exp-method, which is expected due to the smaller proximal step size δ. 0.010 0.010 0.008 0.008 0.006 0.006 0.004 0.004 0.002 0.002 0.000 0.000 30 30 25 25 0 20 0 20 5 15 5 15 In 10 10d 10ex I 10 no dn 15 5 e m x o 15 5 ar ngi 20nal 25 0 margi 20nal 25 0 (a) Outer count: 29. ϵ = 0.07, δ = (b) log-sum-exp. Outer count: 94. 0.08. ϵ = 0.08, δ = 0.009. Figure 4.7: No penalty from ∇F (M) if marginal t mod 2 = 1 and/or if marginal t > N/4 and t < N/2. 47 Ma Mrg ai rn ga inl n alu nm ub me br et r t Value of marginal t Value of marginal t Ma Mrg ai rn ga inl n alu nm ub me br et r t Value of marginal t Value of marginal t 4. Implementation and numerical results 48 5 Discussion and conclusions The goal of this thesis was to develop a new method for solving generalized op- timal transport problems of the form (2.28). We have successfully done this by transforming the generalized problem into a sequence of standard optimal transport problems, and presenting rigorous theory to support our method. The method has been successfully implemented using Python, but it currently lacks numerical sta- bility and requires further refinement to, for example, handle marginals of higher dimensions effectively. Given that the Sinkhorn iterations are known to be numerically unstable for small values of ϵ, we also implemented the log-sum-exp-method. In our case, we did not observe any significant differences between the cases of with or without the log- sum-exp-method, and it did not seem to lessen the observed numerical instability. However, in some instances, it allowed us to reduce the step size δ. We have theoretically proven the assumptions under which our method converges and identified a suitable termination criterion. However, predicting whether the al- gorithm would converge in practical settings remains challenging. This difficulty may stem from the assumption in Theorem 3.4.1 that a feasible transport matrixM exists, which cannot always be confirmed due to the complexity of optimal trans- port problems. Moreover, since we did not explicitly implement Theorem 3.4.2 and instead used a surrogate, our method may not achieve the required accuracy. Ad- ditionally, fine-tuning the variables, adding or removing marginals, or changing the size of the marginals sometimes resulted in a solution – understanding why requires further investigation. Finally, while the effectiveness of the method in its current state is uncertain, it has the potential to improve computational performance for generalized opti- mal transport problems. Further investigation is needed to validate and refine the method, ensuring its robustness and reliability in various practical applications. 5.1 Future work The numerical instability requires further examination, particularly to determine whether the log-sum-exp-method can effectively handle these types of problems, as we did not observe any significant differences in our experiments. Exploring the impact of marginal sizes and variables such as ϵ, δ and cost functions ∇g and ∇f might provide more insight into the reasons behind this instability. Besides addressing the numerical instability, there are numerous avenues for im- provement and future work. One interesting direction would be to implement the 49 5. Discussion and conclusions algorithm on general trees instead of limiting it to path graphs. This extension would enable the algorithm to address real-world problems that are often more complex than what path graphs can handle. Additionally, exploring more complex optimal transport problems featuring multiple marginals and intricate cost functions could yield valuable insights. Another avenue for future research would be to compare our algorithm to the one derived in [HRCK21], which also handles generalized optimal transport problems. Although we have not conducted a comparison, we understand that the algorithm in [HRCK21] is more abstract in nature. In contrast, our algorithm may offer greater practical usability, though this remains to be validated through further experimen- tation and comparison. Overall, while our method shows promise, extensive testing and refinement are needed to fully realize its potential and ensure its robustness in a variety of prac- tical applications. Future research should focus on these aspects to enhance the algorithm’s performance and applicability. 50 Bibliography [AEP+16] N. Andréasson, A. Evgrafov, M. Patriksson, E. Gustavsson, Z. Nedělková, K. C. Sou, and M. Önnheim. An Introduction to Con- tinuous Optimization. Studentlitteratur AB, 2016. [Bec17] A. Beck. First-order methods in optimization. MOS-SIAM series on optimization. MOS-SIAM, 2017. [EHJK20] F. Elvander, I. Haasler, A. Jakobsson, and J. Karlsson. Multi-marginal optimal transport using partial information with applications in robust localization and sensor fusion. Signal Processing, 171:107474, 2020. [GY05] J.L. Gross and J. Yellen. Graph Theory and Its Applications. Chapman and Hall/CRC, 2nd edition, 2005. [HRCK21] I. Haasler, A. Ringh, Y. Chen, and J. Karlsson. Multimarginal opti- mal transport with a tree-structured cost and the Schrödinger bridge problem. SIAM Journal on Control and Optimization, 59(4):2428–2453, 2021. [Lue69] D. G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, Inc, 1969. [MT03] J. E. Marsden and A. Tromba. Vector Calculus. Macmillan, 2003. [PC19] G. Peyré and M. Cuturi. Computational optimal transport: with appli- cations to data science. Foundations and Trends® in Machine Learning, 11(5–6):355–607, 2019. [RHCK21] A. Ringh, I. Haasler, Y. Chen, and J. Karlsson. Graph-structured tensor optimization for nonlinear density control and mean field games. arXiv preprint arXiv:2112.05645, 2021. [Sch19] B. Schmitzer. Stabilized sparse scaling algorithms for entropy regu- larized transport problems. SIAM Journal on Scientific Computing, 41(3):A1443–A1481, 2019. [Sin67] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967. [Teb92] M. Teboulle. Entropic proximal mappings with applications to nonlin- ear programming. Mathematics of Operations Research, 17(3):670–690, 1992. [Tho18] M. Thorpe. Introduction to optimal transport. F2.08, Centre for Math- ematical Sciences, University of Cambridge, 2018. [Win19] L. Winkler. Sinkhorn iterations, 2019. https://ludwigwinkler. github.io/blog/Sinkhorn/ [Accessed: 2024-03-12]. 51 DEPARTMENT OF MATHEMATICAL SCIENCES CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden www.chalmers.se