0.010
0.008
0.006
0.004
0.002
0.000
30
25
0 20
5 15
Ind 10
10
ex on 15 margi 20
5
nal 25 0
Entropic Proximal Gradient Method for
Generalized Optimal Transport Problems
Master’s Thesis in Applied Mathematics
GUSTAV SVENSSON
SARA NILSSON
DEPARTMENT OF MATHEMATICAL SCIENCES
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2024
www.chalmers.se
www.gu.se
Marginal number t
Value of marginal t

Master’s Thesis in Applied Mathematics 2024
Entropic Proximal Gradient Method for
Generalized Optimal Transport Problems
GUSTAV SVENSSON
SARA NILSSON
Department of Mathematical Sciences
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2024
Entropic Proximal Gradient Method for Generalized Optimal Transport Problems
GUSTAV SVENSSON
SARA NILSSON
© GUSTAV SVENSSON, SARA NILSSON 2024.
Supervisor: Axel Ringh, Applied Mathematics and Statistics
Examiner: Ann-Brith Strömberg, Applied Mathematics and Statistics
Master’s Thesis in Applied Mathematics 2024
Department of Mathematical Sciences
Chalmers University of Technology
University of Gothenburg
SE-412 96 Gothenburg
Sweden
Telephone +46 31 772 1000
Cover: The entropic proximal gradient method for a generalized multi-marginal
optimal transport problem, with T = 32 and N = 26. No penalty from ∇F (M) for
marginal t > N/4 and t < N/2, ϵ = δ = 0.1, see Section 4.2.1.
Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2024
iv
Entropic Proximal Gradient Method for Generalized Optimal Transport Problems
GUSTAV SVENSSON
SARA NILSSON
Department of Mathematical Sciences
Chalmers University of Technology
University of Gothenburg
Abstract
Optimal transport, a fundamental problem in applied mathematics, involves finding
the most efficient way to move mass from multiple sources to multiple destina-
tions. Previously known approaches employ entropic regularization combined with
the Sinkhorn iterations, a technique known for its efficiency in solving large-scale
optimal transport problems. This thesis presents a new method for solving general-
ized optimal transport problems using the entropic proximal gradient method. The
method breaks down the complex problem into a sequence of standard optimal trans-
port problems, solved by the Sinkhorn iterations. We provide theoretical founda-
tions, including proof of convergence and termination criteria, along with a detailed
implementation and numerical experiments showing the algorithm’s applicability.
The results of this thesis may offer improvements in computational performance for
generalized optimal transport problems, making it a valuable tool for applications
in economics, machine learning, and other fields where optimal transport is utilized.
Keywords: generalized multi-marginal optimal transport, Sinkhorn iterations, en-
tropic regularization, proximal gradient, optimization, graph-structure, log-sum-exp.
v

Acknowledgments
We would like to express our deepest gratitude to our supervisor, Axel, whose un-
wavering support, insightful guidance, boundless patience, and overall awesomeness
made this thesis possible.
Additionally, we extend our heartfelt thanks to our friends, Georg Kyhn, Carl
Elofsson and Vilhelm Larsson, for their steadfast support and constant presence
throughout this journey.
And to our families - we do not know where we would be without your support.
Gustav Svensson, Gothenburg, May 2024
Sara Nilsson, Gothenburg, May 2024
vii

Nomenclature
Below is the nomenclature of indices, sets, parameters, variables, and sums that
have been used throughout this thesis.
Indices
k Iteration index
ks Sinkhorn iteration index
ko Outer iteration index
(t) Marginal t
(t1, t2) Joint marginal t1 and t2
i General index
it Index for marginal t
Sets
R̄ R ∪ {∞}
R+ {x ∈ R ∶ x ≥ 0}
∂Ψ Subdifferential set of Ψ
Parameters
ϵ Regularization parameter
δ Step size in the entropic proximal gradient algorithm
η The sum ϵ + δ
ix
Variables
A Bold capital letter represent tensors
A Capital letter represent matrices
a Lower case letter represent vectors
µt Marginal/distribution t of optimal transport problems
(t)
ui Variable corresponding to marginal t, index i
(t),k
ui Variable corresponding to marginal t, index i, iteration k
Υ General tree
G Path graph, special instance of a tree
V Vertices, or nodes, of a tree
VE Vertices, or nodes, of a tree with equality constraints
VI Vertices, or nodes, of a tree with inequality constraints
E Edges of a tree
Pt(M) Projection of tensor M on marginal t
Pt1,t2(M) Bi-marginal projection of tensor M on joint marginal t1, t2
(t)
λi Lagrangian multipliers for marginal t, index i, bi-marginal case
Λ(t)i Lagrangian multipliers for marginal t, index it, multi-marginal caset
Sums
Sums can be written in different ways, depending on convenience.
N N N N
∑ = ∑ ⋅ ⋅ ⋅ ∑ ∑ ⋅ ⋅ ⋅ ∑
i1,...,it−1,it+1,...,iT i1=1 it−1=1 it+1=1 iT =1
T
∑ = ∑, where V = {1,2, . . . ,T }
t∈V t=1
x
Contents
Nomenclature ix
1 Introduction 1
1.1 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background and theory 3
2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Fundamental theory and notation . . . . . . . . . . . . . . . . . 3
2.1.3 Convex optimization theory . . . . . . . . . . . . . . . . . . . . . 4
2.1.4 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Optimal transport . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Bi-marginal optimal transport . . . . . . . . . . . . . . . . . . . 9
2.2.1.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Multi-marginal optimal transport . . . . . . . . . . . . . . . . . 10
2.2.3 Entropic regularization and the Sinkhorn iterations . . . . . . . 11
2.2.3.1 Entropic regularization in the bi-marginal case . . . . 11
2.2.3.2 Entropic regularization in the multi-marginal case . . 14
2.2.4 Graph-structured problems . . . . . . . . . . . . . . . . . . . . . 16
2.2.5 Generalized multi-marginal optimal transport . . . . . . . . . . 18
2.3 The entropic proximal gradient method . . . . . . . . . . . . . . . . . . 19
3 The entropic proximal gradient method for solving generalized op-
timal transport problems 23
3.1 Deriving (3.2) and proving that C̃ko has the same graph-structure as C 23
3.1.1 ∇F (Mko) has a graph-structure . . . . . . . . . . . . . . . . . . 24
3.1.2 Merging ϵD(M) and δKL(M∥Mko) into a single expression . . 28
3.1.3 The graph-structure of C̃ko . . . . . . . . . . . . . . . . . . . . . 29
3.2 The Sinkhorn iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 The computation of C̃ko . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Convergence and termination criterion for the entropic proximal gra-
dient method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Termination criterion . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 The algorithm in pseudo code . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.1 General trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
xi
Contents
3.5.2 Path graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Implementation and numerical results 39
4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 Defining a generalized multi-marginal optimal transport prob-
lem and its components . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.2 Transformation from nodes to edges . . . . . . . . . . . . . . . . 41
4.1.2.1 Example of cost-transformation . . . . . . . . . . . . . 41
4.1.3 The log-sum-exp-method . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1 No penalty from ∇F (M) for marginal t > N/4 and t < N/2 . . 44
4.2.2 No penalty from ∇F (M) if marginal t mod 2 = 1 . . . . . . . 45
4.2.3 No penalty from ∇F (M) if marginal t mod 2 = 1 and/or if
marginal t > N/4 and t < N/2 . . . . . . . . . . . . . . . . . . . . 46
5 Discussion and conclusions 49
5.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Bibliography 51
xii
1
Introduction
When reading the title of this thesis, ”Entropic Proximal Gradient Method for Gen-
eralized Optimal Transport Problems”, one might feel overwhelmed by the formula-
tion, as these words might not have been frequently used in a standard mathematical
education at university level, and certainly not together. So in this introduction, we
begin by giving a brief explanation of the different components before delving into
the more advanced theory. Counter-intuitively, we begin with the end of the title,
namely ”optimal transport problems”. What are these?
Shortly explained, an optimal transport problem entails identifying a transport
plan that minimizes the cost of transferring mass from one distribution to another,
subject to specific constraints. Historically, optimal transport has revolved around
challenges related to efficient transportation and resource allocation. Its origins
trace back to 1781 when the mathematician Gaspard Monge first addressed the task
of relocating one distribution to another while minimizing expenses, as described in
[Tho18]. Today, optimal transport finds application in a broad spectrum of fields
extending beyond traditional economic and operational contexts. Its utility spans
domains such as image processing, machine learning, cancer detection, and color
modeling, as documented in [HRCK21] and references therein.
There are different versions of optimal transport problems. The classical version
is the bi-marginal version, which consists of only two distributions, i.e., marginals.
These marginals are typically subject to equality constraints, but they can also be
governed by both equality and inequality constraints. These constraints dictate
which locations are permitted or prohibited at various instances.
Analogously, the ”standard” multi-marginal optimal transport problem consists
of multiple marginals and quickly become more complex. In order to solve these
multi-marginal problems, entropic regularization and the Sinkhorn iterations can
be applied, see [Sin67], and has been done so successfully for some time. In brief,
implementing the Sinkhorn iterations involves performing coordinate ascent in the
Lagrangian dual of the optimal transport problem. However, this is only part of
the solution to the problem of the computational complexity. While computing the
Sinkhorn iterations, projections of the marginals needs to be computed. This is
computationally challenging, unless the problem has a graph-structure. Now, these
were some more fancy words that most mathematical master students might not
know. However, just accept this for now, as it will be explained thoroughly in the
following theory chapter.
Furthermore, multi-marginal optimal transport problems can be generalized by
having constraints on some of the marginals, while others are associated with cost
functions that fulfill certain assumptions. These generalized optimal transport prob-
1
1. Introduction
lems have been solved in [RHCK21]. Some of the steps in this solution can be com-
putationally expensive, depending on the cost functions that are associated with
the marginals, and are thus more of an ”abstract algorithm”. This is where the
focus of our thesis begins. We wish to develop a new method for solving these
generalized optimal transport problems by applying the entropic proximal gradient
method, which is the beginning of the title. But what is this? It is an optimization
method that is commonly used for solving convex optimization problems that con-
sists of smooth and possibly non-smooth components - these components are the
cost functions we mentioned above. Shortly explained, this allows us to rewrite the
generalized optimal transport problem into a sequence of ”standard” multi-marginal
optimal transport problem, which are then solved using the Sinkhorn iterations we
mentioned earlier.
Now that the entire title has been ”explained”, we might feel like experts. If only
that were true. There is still much more to learn, so let us take a look at the outline
of this thesis.
1.1 Outline of the thesis
We begin with the fundamentals. In Chapter 2 we start with preliminaries, laying
the groundwork before delving into optimal transport theory, exploring its various
iterations. Subsequently, we delve into the theory behind the entropic proximal
gradient method.
In Chapter 3, we develop the entropic proximal method for solving generalized op-
timal transport problems. We demonstrate that such problems can be decomposed
into a sequence of standard optimal transport problems, and we employ Sinkhorn
iterations for their solution. We establish the graph-structure of this sequence of
problems, explore their computation, and provide proofs for convergence and termi-
nation criteria of the entropic proximal gradient method. The chapter culminates
in the presentation of the algorithm through pseudo code.
Moving forward to Chapter 4, we transition to the practical realm, detailing
our implementation efforts and showcasing numerical results. Finally, in Chapter
5, we draw conclusions from our findings and discuss potential future avenues for
exploration.
2
2
Background and theory
In this chapter, we introduce relevant concepts that are needed in Chapter 3. We
begin with the preliminaries, where we set up the notation used in the thesis and
introduce basic definitions and results from, amongst other things, convex optimiza-
tion theory. We continue with optimal transport theory, and conclude the chapter
with theory regarding the entropic proximal gradient method.
2.1 Preliminaries
This section introduces used notation, relevant definitions and results on which the
following sections are based on. We begin with a short summary of the notation
used in this thesis.
2.1.1 Notation
Let exp(⋅), log(⋅),⊙,⊘ and ⊗ denote the element wise exponential, logarithm, multi-
plication, division and outer product, respectively, of vectors, matrices, and tensors1.
We denote a vector of ones as 1, with the size clear from the context, the identity ma-
trix as I, and diag(v) denotes a square matrix with the elements of a vector v on the
main diagonal. We define the sets R̄ ∶= R∪{∞} and R+ ∶= {x ∈ R ∶ x ≥ 0}. Let dom(f)
denote the domain of a function f , defined as dom(f) = {x ∈ RN ∶ f(x) < ∞},
[Bec17, page 14]. Moreover, let int(E) denote the interior of E ⊆ RN , which defines
the largest open set contained in E, see [AEP+16, page 40].
2.1.2 Fundamental theory and notation
We begin by defining the inner product, which holds an important role and is used
frequently throughout the thesis.
Definition 2.1.1. (Inner product)[Bec17, pages 2–3] An inner product of a real
vector space V is a function that associates to each pair of vectors x, y ∈ V a real
number, which is denoted by ⟨x, y⟩ and satisfies the following properties:
1. (commutativity) ⟨x, y⟩ = ⟨y, x⟩ for any x, y ∈ V.
2. (linearity) ⟨α1x1 + α2x2, y⟩ = α1⟨x1, y⟩ + α2⟨x2, y⟩ for any α1, α2 ∈ R and
x1, x2, y ∈ V .
1Multi-dimensional matrices.
3
2. Background and theory
3. (positive definiteness) ⟨x,x⟩ ≥ 0 for any x ∈ V and ⟨x,x⟩ = 0 if and only if
x = 0.
We also need to define the norm of a vector.
Definition 2.1.2. (Norm of a vector)[AEP+16, page 36] Let v ∈ RN be a vector.
The norm of v, denoted ∥v∥, is defined as the square root of the associated inner
product: √
∥v∥ ∶= ⟨v, v⟩.
Another important definition is that of an adjoint operator, which is used to prove
the graph-structure of a function in Chapter 3 .
Definition 2.1.3. (Adjoint operator) [Lue69, page 150] Given a linear operator A ∶
RN → RM , the adjoint operator is defined as the unique operator A∗ ∶ RM → RN
such that the relation
⟨y,A(x)⟩RM = ⟨A
∗(y), x⟩RN ,
holds for all x ∈ RN and y ∈ RM .
In order to ease notation, the subscript used to specify the space in which the
elements of the inner product is, is omitted whenever it is clear from the context.
Moreover, the following theorem states that the adjoint operator is also a linear
operator.
Theorem 2.1.1. [Lue69, page 151] The adjoint operator A∗ of the linear operator
A ∶ RN → RM is linear and bounded with ∥A∗∥ = ∥A∥.
We proceed with some relevant convex optimization theory.
2.1.3 Convex optimization theory
Consider the optimization problem
minimize f(x),
(2.1)
subject to x ∈ E,
where E ⊆ RN is a nonempty set and f ∶ RN → R̄ is a given function. Of particular
interest here are so-called convex problems. We first define what it means for f to
be a convex function.
Definition 2.1.4. (Convex function) [AEP+16, Definition 3.39] Suppose that E ⊆
RN . A function f ∶ RN → R̄ is convex at x ∈ E if
y ∈ E ⎪⎫
λ ∈ (0 ⎪,1) ⎬ Ô⇒ f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y),
λx + (1 − λ)y ∈ E ⎭⎪⎪
where λ ∈ (0,1)⇔ 0 < λ < 1. The function f is convex on E if it is convex at every
x ∈ E.
4
2. Background and theory
We also introduce the definition of a convex set.
Definition 2.1.5. (Convex set) [AEP+16, Definition 3.1] Let E ⊆ RN . The set E is
convex if
⎫
x, y ∈ E ⎪⎪
0 1 ⎬ Ô⇒ λx + (1 − λ)y ∈ Eλ ∈ ( , ) ⎪⎪⎭
holds.
The problem (2.1) is called convex if E is a convex set and f is a convex function
on E. The following definition is a stricter definition of a convex function, namely
that it is σ-strongly convex.
Definition 2.1.6. (σ-strongly convex function)[Bec17, Definition 5.16] A function
f ∶ E → (−∞,∞] is called σ-strongly convex for a given σ > 0 if dom(f) is a
convex set and the following inequality holds for any x, y ∈ dom(f) and λ ∈ [0,1],
σ
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) − λ(1 − λ)∣∣x − y∣∣22 .
An important result regarding convex optimization is the Fundamental Theorem
of Global Optimality.
Theorem 2.1.2. [AEP+16, Theorem 4.3] Consider the optimization problem (2.1),
where E is a convex set and f is convex on E. Then every local minimum of f over
E is also a global minimum.
For unconstrained optimization problems, i.e., (2.1) without the constraint, we
have the following theorem, which gives necessary and sufficient global optimality
conditions for optimality of an optimization problem.
Theorem 2.1.3. [AEP+16, Theorem 4.18] Suppose that f ∶ RN → R is convex and
in C1 on RN . Then,
x∗ is a global minimum of f over RN ⇐⇒ ∇f(x∗) = 0N .
If we have constraints it is not necessarily true that there exists a x ∈ E such that
∇f(x) = 0 in all optimal solutions. However, we have the following result.
Theorem 2.1.4. Let f and E in (2.1) be convex. If ∇f(x) = 0, for some x ∈ E,
then x is a global optimal solution to (2.1).
Proof. This follows for example from Theorem 2.1.3 and the Relaxation Theorem,
[AEP+16, Theorem 6.1], by relaxing the constraint x ∈ E to x ∈ RN .
2.1.4 Functions
In this thesis we work a lot with functions, and we need a cornucopia of theory
regarding them. We begin with some simple definitions.
5
2. Background and theory
Definition 2.1.7. (Proper function) [Bec17, page 14] A function f ∶ E → [−∞,∞]
is called proper if it does not attain the value −∞ and there exists at least one x ∈ E
such that f(x) <∞.
Definition 2.1.8. (Indicator function) The function δA ∶ RN → {0,∞} given by
⎪⎪
⎧0 if x ∈ A,
δA(x) = ⎨ (2.2)
⎪⎪∞ if x ∉ A,⎩
is called an indicator function of a set A, where A ⊆ RN .
Note that this is a convex function if and only if A is a convex set.
Definition 2.1.9. (Subgradient) [AEP+16, Definition 6.17] Let f ∶ RN → R be a
convex function. We say that a vector g ∈ RN is a subgradient of f at x ∈ RN if
f(y) ≥ f(x) + gT(y − x), y ∈ RN .
The set of such vectors g defines the subdifferential of f at x, and is denoted
∂f(x).
Definition 2.1.10. (Lower semi-continuous function) [Bec17, Definition 2.5] A
function f ∶ E→ [−∞,∞] is called lower semi-continuous at x ∈ E if
f(x) ≤ lim inf f(xn) (2.3)
n→∞
for any sequence {xn}n≥1 ⊆ E for which xn → x as n → ∞. A function f ∶ E →
[−∞,∞] is called lower semi-continuous if it is lower semi-continuous at each
point in E.
Definition 2.1.11. (Lf -smoothness)[Bec17, Definition 5.1] Let Lf ≥ 0. A function
f : E→ (−∞,∞] is said to be Lf -smooth over a set D ⊆ E if it is differentiable over
D and satisfies
∣∣∇f(x) −∇f(y)∣∣ ≤ Lf ∣∣x − y∣∣ for all x, y ∈D.
The constant Lf is called the smoothness parameter.
Definition 2.1.12. (Conjugate function)[Bec17, Definition 4.1] Let f ∶ E→ [−∞,∞]
be an extended real-valued function. The function ḟ ∶ Ė→ [−∞,∞], defined by
ḟ(y) =max{⟨y, x⟩ − f(x)}, y ∈ Ė,
x∈E
is called the conjugate function of f .
In this thesis, we will also be using the function f ∶ RN → (−∞,∞] defined by
f N(x) = ∑i=1 xi logxi, and need the following result.
Theorem 2.1.5. [Bec17, Example 5.27] Consider the function f ∶ RN → (−∞,∞]
given by
⎧ N
⎪⎪∑
f x i=1
xi logxi, x ∈∆N ,
( ) = ⎨
⎪⎪∞ else,⎩
6
2. Background and theory
where ∆N denotes the unit simplex2 in the space RN . Then, f is 1-strongly3 convex
w.r.t. both the ℓ1- and ℓ2-norm.
In order to prove this theorem, we require the following two lemmas.
Lemma 2.1.6. [Bec17, pages 97-98] Let f ∶ RN → (−∞,∞] be given by
⎪⎧ N⎪∑i=1 xi logxi, x ∈∆N ,f(x) = ⎨
⎪⎪∞ else.⎩
Then, the conjugate of f is
N
ḟ(y) = log( ey∑ j),
j=1
also known as the log-sum-exp function.
Lemma 2.1.7. (1-smoothness of the log-sum-exp function)[Bec17, Example 5.15]
Let f ∶ RN → (−∞,∞] be given by
N
f(x) = log( ex∑ i).
i=1
Then f is 1-smooth4 with respect to the ℓ2- and ℓ∞-norms.
Proof. We will first show f is 1-smooth w.r.t. the ℓ2-norm. The partial derivatives
of f are
∂f exi
(x) = , i = 1,2, . . . ,N,
∂x Ni ∑ xk=1 e k
and the second-order partial derivatives are
⎧ xi xj
∂2f ⎪⎪−
e e
(∑N x= 2
, i ≠ j,
(x) = ⎨ k 1 e
k)
exiexi ex∂x ∂x ii j ⎪⎪−⎩ (∑N exk)2 + ∑N ex , i = j.k=1 kk=1
We can thus write the Hessian matrix5 as
∇2f(x) = diag(w) −wwT ,
where xw e ii = ∑N x . To show that f is 1-smooth w.r.t. the ℓ2-norm, note that fork=1 e k
any x ∈ RN ,
∇2f(x) = diag(w) −wwT ⪯ 6 diag(w) ⪯ I.
2∆N = {x ∈ RN ∶ x ≥ 0, eTx = 1 }, where e denotes the unit vector, [Bec17, page 5].
3Definition 2.1.6 with σ = 1.
4Definition 2.1.11 with Lf = 1.
5Symmetric matrix denoted by ∇2f(x0), see [AEP+16, page 41].
6For symmetric matrices A and B, A ⪯ B is defined as A −B being positive semi-definite.
7
2. Background and theory
Hence λ 2max(∇ f(x)) ≤ 1 for any x ∈ RN . Noting that the log-sum-exp function
is convex, we can invoke [Bec17, Corollary 5.13] and conclude that f is 1-smooth
w.r.t. the ℓ2-norm.
We will show that f is 1-smooth w.r.t. the ℓ∞-norm. For that, we begin by
proving that for any d ∈ RN ,
dT∇2f(x)d ≤ ∣∣d∣∣2∞. (2.4)
Indeed,
dT∇2f(x)d = dT (diag(w) −wwT )d
= dT diag(w)d − (wTd)2
≤ dT diag(w)d
N
= 2∑widi
i=1
N
≤ ∣∣d∣∣2∞∑wi
i=1
= ∣∣d∣∣2∞.
Now, since f is twice continuously differentiable over RN , it follows by [Bec17,
Theorem 5.10] that for any x, y ∈ RN there exists ξ ∈ [x, y] for which
1
f(y) = f(x) +∇f(x)T (y − x) + 2(y − x)
T∇2f(ξ)(y − x). (2.5)
Combining (2.5), taking d = y − x, and (2.4), we obtain the inequality
1
f(y) ≤ f(x) +∇f(x)T (y − x) + 2 ∣∣y − x∣∣
2
∞,
which by [Bec17, Theorem 5.8] implies the 1-smoothness of f w.r.t. the ℓ∞-norm.
We can now prove Theorem 2.1.5.
Proof of Theorem 2.1.5. Let f be denoted as in the theorem and let ḟ denote the
conjugate function of f . Then by Lemma 2.1.6 the conjugate to f is the log-sum-
exp function ḟ N(y) ∶= log(∑i=1 exp(yi)), which by Lemma 2.1.7 is a 1-smooth function
w.r.t. both the ℓ∞- and the ℓ2-norm. Using [Bec17, Theorem 5.26] the smoothness
of ḟ implies that f is 1-strongly convex w.r.t. both the ℓ1- and the ℓ2-norm.
With these definitions, lemmas, and theorems established, we are now ready to
delve into the realm of optimal transport.
2.2 Optimal transport
As we mentioned in the introduction, the objective of an optimal transport problem
is to determine a transportation plan that minimizes the cost of transferring mass
from one distribution to another, i.e., finding an optimal transport plan. These
distributions are also called marginals, and a problem can contain two or more
marginals. If there are only two marginals, it is called a bi-marginal optimal trans-
port problem.
8
2. Background and theory
2.2.1 Bi-marginal optimal transport
A simple example of an optimal transport problem arises in the bi-marginal scenario,
involving just two distributions: the initial given distribution and the final target
distribution. Let µ1 ∈ RN+ and µ2 ∈ RN+ be two non-negative vectors with equal
total mass, i.e., N N∑i=1 µ1,i = ∑j=1 µ2,j. A transportation plan is represented by a
matrix M ∈ RN×N+ , where Mij signifies the amount of mass transported from µ1,i to
µ . Thus, M is a valid transport plan between µ and µ if N2,j 1,i 2,j ∑j=1Mij = µ1,i for
i = 1, . . . ,N and N∑i=1Mij = µ2,j for j = 1, . . . ,N . Given a cost matrix C ∈ RN×N ,
where Cij denotes the cost of transporting a unit mass from µ1,i to µ2,j, the objective
of the optimal transport problem is to identify a valid transport plan that minimizes
the total cost of transportation. This problem can be formulated mathematically as
min ⟨C,M⟩
M∈R+N×N
N
s.t. ∑Mij = µ1,i for i = 1, . . . ,N, (2.6)
j=1
N
∑Mij = µ2,j for j = 1, . . . ,N.
i=1
Here, ⟨⋅, ⋅⟩ denotes the inner product ⟨C,M N N⟩ = ∑i=1∑j=1CijMij. Note that the
constraints can also be written as M1 = µ1 and MT1 = µ2, and that the constraints
are symmetric, i.e., the marginals can swap place. The transport plan would be the
same, except that the transport matrix is transposed. We illustrate the bi-marginal
optimal transport problem with a simple example.
2.2.1.1 Example
For simplicity, assume we wish to find an integer solution to an optimal transport
problem defined as in (2.6). Let N = 3, and assume that the marginals and the cost
matrix are given by:
⎡3⎤ ⎡0⎤ ⎡0 1 2⎤⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢0⎥ ⎢ ⎥ ⎢µ1 = ⎢ ⎥ , µ2 = ⎢2⎥ , and C = ⎢1 0 1⎥⎥ .⎢1⎥ ⎢ ⎥ ⎢ ⎥⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎣ ⎦ ⎣2⎦ ⎣2 1 0⎦
The marginals µ1 and µ2 represent the start and final distribution, respectively. Our
objective is to construct a transport matrix M such that the constraints in (2.6) are
fulfilled, while also minimizing the cost. Since each element Mij in M represents
the amount of mass transported from node i to j, we can rewrite the constraints as
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
⎢M11 M12 M13⎥ ⎢1⎥ ⎢3⎥
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢M21 M22 M23⎥ ⋅ ⎢1⎥ = ⎢0⎥ ,⎢ ⎥ ⎢ ⎥ ⎢
⎢
⎣M31 M32 M33
⎥ ⎢1⎥ ⎢⎦ ⎣ ⎦ ⎣1
⎥
⎥
⎦
and
⎡M M ⎤ ⎡ ⎤ ⎡ ⎤⎢ 11 21 M31⎥ ⎢1⎥ ⎢0⎥
⎢ ⎥ ⎢1⎥ ⎢ ⎥⎢M12 M22 M32⎥ ⋅ ⎢ ⎥ = ⎢2⎥ .⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢M13 M23 M33⎥ ⎢1⎥ ⎢ ⎥⎣ ⎦ ⎣ ⎦ ⎣2⎦
9
2. Background and theory
This gives the system of equations
M11 +M12 +M13 = 3,
M21 +M22 +M23 = 0,
M31 +M32 +M33 = 1,
M11 +M21 +M31 = 0,
M12 +M22 +M32 = 2,
M13 +M23 +M33 = 2.
Since Mij ≥ 0 for all i, j, we have that M21 = M22 = M23 = M11 = M31 = 0, and the
equations that remain are
M12 +M13 = 3,
M32 +M33 = 1,
M12 +M32 = 2,
M13 +M33 = 2.
This yields two integer solutions;M12 = 1,M13 = 2,M32 = 1,M33 = 0 orM12 = 2,M13 =
1,M32 = 0,M33 = 1, which gives the transport matrices
⎡ ⎤ ⎡
⎢0 1 2⎥ ⎢0 2 1⎤⎥
a ⎢0 0 0⎥ or b ⎢0 0 0⎥M = ⎢ ⎥ M = ⎢ ⎥ .
⎢ ⎥ ⎢ ⎥
⎢0 1 0⎥ ⎢⎣ ⎦ ⎣0 0 1⎥⎦
These can be interpreted as follows. Given Ma, 1 unit of mass is transported from
µ1,1 to µ2,2, 2 units of mass is transported from µ1,1 to µ2,3 and 1 unit of mass is
transported from µ1 3 to µ2 2. For M b, , , 2 units of mass is transported from µ1,1 to
µ2,2, 1 unit of mass is transported from µ1,1 to µ2,3 and 1 unit of mass is transported
from µ1,3 to µ2,3, i.e., it stays in place.
In order to find the optimal transport plan, we calculate ⟨C,M⟩ for Ma and M b,
respectively. The resulting costs of the problem are ∑ a bij CijMij = 10 and ∑ij CijMij =
4. Thus, M b yields the lowest cost and is therefore our optimal transport plan.
An extension beyond the conventional bi-marginal optimal transport problem is
multi-marginal optimal transport, aiming to devise a transportation plan involving
not just two, but multiple distributions.
2.2.2 Multi-marginal optimal transport
As previously mentioned, a multi-marginal optimal transport problem have multiple
marginals. These marginal distributions are defined as a finite set of T non-negative
vectors µ1, . . . , µT ∈ RN+ , i.e., there are T marginals. The transport plan and the
corresponding cost of moving mass are both represented by T -mode tensors, M ∈
RNT+ and C ∈ RN
T , respectively, where RNT = RN×N×...×N . With this notation, a
square N ×N matrix is an element in RN2 .
We have that Mi1,...,iT and Ci1,...,iT represent the transported mass and the cost
of moving mass, respectively, associated with the tuple (i1, . . . , iT ). The total cost
10
2. Background and theory
of transport is therefore given by
⟨C,M⟩ ∶= ∑ Ci1,...,iTMi1,...,iT .
i1,...,iT
Remember, we wish to find the optimal transport plan M. This of course means
that M needs to be a feasible transport plan, which means that it must have the
given distributions as its marginals. To this end, the marginals of M are given by
the projections Pt(M) ∈ RN+ onto the t:th marginal defined by
[Pt(M)]it ∶= ∑ Mi1,...,iT , it = 1, . . . ,N. (2.7)
i1,...,it−1,it+1,...,iT
Hence M is feasible if Pt(M) = µt, for t = 1, . . . ,T . A generalization of the multi-
marginal optimal transport problem is to not necessarily impose marginal constraints
on all projections Pt(M), but only for an index subset Γ ⊂ {1, . . . ,T }. This gen-
eralization of the discrete multi-marginal optimal transport problem can thus be
formulated as
min ⟨C,M⟩
M∈RNT+ (2.8)
s.t. Pt(M) = µt, t ∈ Γ ⊂ {1, . . . ,T }.
Here, the marginals are subjected to equality constraints, but inequality constraints
can also be implemented.
Solving (2.8) is challenging. It is a linear program, but the number of vari-
ables is NT , which quickly becomes overwhelming as the number T of marginals
increases. One approach to solving this is to add a small entropy term to the objec-
tive function and use the Sinkhorn iterations, see [RHCK21]. This is presented in
the following sections.
2.2.3 Entropic regularization and the Sinkhorn iterations
As previously stated, solving the linear program by ”brute-force” is not realistic,
since the number of variables increases exponentially in the number of marginals.
This can be solved by adding an entropy term to the objective function, and applying
the Sinkhorn iterations. We start by explaining the bi-marginal case, even though
the multi-marginal is of more interest, since the bi-marginal case facilitates a better
understanding.
2.2.3.1 Entropic regularization in the bi-marginal case
An entropy term, defined as
N N
D(M) ∶=∑∑ [Mij log (Mij) −Mij + 1],
i=1 j=1
11
2. Background and theory
is introduced to (2.6), altering the formulation of the optimal transport problem to
min ⟨C,M⟩ + ϵD(M) (2.9a)
M∈R+N×N
N
s.t. ∑Mij = µ1,i for i = 1, . . . ,N, (2.9b)
j=1
N
∑Mij = µ2,j for j = 1, . . . ,N, (2.9c)
i=1
where ϵ > 0 is a regularization parameter.
While the addition of the entropy term might appear arbitrary initially, it plays
a crucial role in deriving the Sinkhorn iterations. It turns out to lead to a ”sparse”
representation in terms of that the optimal transport problem is parameterized by
fewer variables (the Lagrangian dual variables) compared to the original formulation.
Specifically, the number of parameters is reduced from NT to N ⋅T . These variables
can then be efficiently computed using coordinate ascent steps in the Lagrangian
dual, see [RHCK21, page 1].
In order to derive an optimal solution to (2.9), we first consider the Lagrangian
obtained by relaxing the two equality constraints,
L(M,λ(1), λ(2)) = ⟨C,M⟩ + ϵD(M) + (λ(1))T (M1 − µ1) + (λ(2))T (MT1 − µ2),
where λ(1), λ(2) ∈ RN are Lagrange multipliers corresponding to the constraints
(2.9b) and (2.9c), respectively. The Lagrange dual function ϕ(λ(1), λ(2)) is defined
as
ϕ(λ(1), λ(2)) = min L(M,λ(1), λ(2)), (2.10)
M∈R+N×N
and the Lagrange dual problem as
max ϕ(λ(1), λ(2)).
λ(1),λ(2)∈RN
To find the optimal transport plan M , denoted M∗, we exploit the fact that (2.9)
is a convex problem, and convex optimization theory applies. By Theorem 2.1.4 we
have that if the derivative of L with respect toM is zero in the feasible set, then this
is an optimal solution to the optimal transport problem. To this end, we compute
the derivative of the function with respect to M and set it to zero,
0 ∂= L(M,λ(1), λ(2) (1) (2)) = Cij + λi + λj + ϵ logMij,∂Mij
yielding
(1) (2)
∗ (1) (2) Cij + λi + λjMij(λi , λj ) = exp ( − ) ≥ 0.ϵ
By defining
(1) (2)λ
Kij ∶= exp
Cij λ j
( − ), u ii ∶= exp( − ) and vj ∶= exp( − ) (2.11)
ϵ ϵ ϵ
12
2. Background and theory
we have that M∗ takes the form
M∗ij =Kijuivj.
Moreover, since the minimum of (2.10) is obtained at ∗ (1) (2)M (λi , λj ), we have that
ϕ(λ(1), λ(2)) =L(M∗
(1) (2)
(λi , λ ), λ
(1), λ(2)j )
∗ (1) (2) ∗ (1) (2) (1) T ∗ (1) (2)=⟨C,M (λi , λj )⟩ + ϵD(M (λi , λj ) − (λ ) (µ1 −M (λi , λj )1)
− (λ(2) T
(1) (2)
) (µ2 − (M
∗(λi , λj ))
T1)
N N (1) (2)C + λ + λ N N
exp ij i j (1) (2)= − ϵ∑∑ [ ( − ) − 1] −∑λi µ1,i −∑λj µ2,j.
i=1 j=1 ϵ i=1 j=1
(2.12)
In order to examine whether an optimal solution to the bi-marginal optimal trans-
port problem exists, the partial derivatives of (2.12) with respect to (1) and (2)λi λj
are set to zero. We obtain the following equations for (1)λi ,
(1) (2)
∂ϕ N ⎛ Cij + λi + λ0 j
⎞
= (1) =∑ [ exp − − µ1,i], for i = 1, . . . ,N, (2.13)
∂λi j=1 ⎝ ϵ ⎠
and for (2)λj ,
N (1) (2)⎛ C + λ
0 ∂ϕ exp ij i
+ λj ⎞
= (2) =∑ [ − − µ2,j], for j = 1, . . . ,N.
∂λ ⎝ ϵ ⎠j i=1
Here, we can utilize coordinate ascent, where one variable vector is chosen to be
optimized while the other variable is fixed. We begin with λ(1) by writing (2.13) in
terms of K,u and v, which yields
0 ∂ϕ
N
= (1) =∑ (Kijuivj) − µ1,i
∂λi j=1
N
= ui∑ (Kijvj) − µ1,i
j=1
= ui(Kv)i − µ1,i.
Here, K and µ1 are constant parameters, and the variable vector v is kept fixed. We
re-organize the equation in terms of u and get
uk+1 = µ1 ⊘ (Kv
k). (2.14)
Analogously for λ(2), we get
vk+1 = µ ⊘ (KTuk2 ). (2.15)
The procedure outlined in (2.14) and (2.15) forms the foundation of the Sinkhorn
iterations, where k represents each iteration. Initially, arbitrary values for u and
13
2. Background and theory
v, denoted as u0 and v0, are chosen. The Sinkhorn iterations begin by computing
an initial approximation u1 based on (2.14). Next, (2.15) is used to compute an
approximation for v1. This process constitutes the first iteration, and we proceed to
the second iteration of the Sinkhorn iterations, where v1 is implemented in (2.14),
and so on. Importantly, this algorithm converges linearly and independently of the
initial value chosen for v0. For a proof, we refer to [PC19, Theorem 4.2].
With this groundwork laid, we are ready to derive the Sinkhorn iterations for the
multi-marginal case.
2.2.3.2 Entropic regularization in the multi-marginal case
The entropy term D for the multi-marginal optimal transport problem is defined as
D(M) ∶= ∑ (Mi1,...,iT log(Mi1,...,iT ) −Mi1,...,iT + 1), (2.16)
i1,...,iT
which together with problem (2.8), including inequality constraints, gives the new
optimal transport problem
min ⟨C,M⟩ + ϵD(M),
M∈RNT+
s.t. Pt(M) ≤ µt, for t ∈ Γ≤, (2.17)
Pt(M) = µt, for t ∈ Γ=,
Pt(M) ≥ µt, for t ∈ Γ≥,
where Γ = Γ≤∪Γ=∪Γ≥7, and ϵ > 0 is a regularization parameter. An optimal solution
to (2.17) can be noted as M∗ϵ where M∗ϵ →M∗ when ϵ → 0, see [PC19, Proposition
4.1]. The derivation of the Sinkhorn iterations for the multi-marginal case follows
the same structure as the bi-marginal case. We begin by defining the Lagrange
function of (2.17) as follows:
L(M,Λ) = ⟨C,M⟩ + ϵD(M) + Λ(t)∑ (Pt(M) − µt),
t∈Γ
where Λ(t) ∈ RN for t ∈ Γ are Lagrangian multipliers. The corresponding Lagrange
dual function is
ϕ(Λ) ∶= min L(M,Λ), (2.18)
M∈R+N
T
and the Lagrange dual problem
max ϕ(Λ)
Λ(t), t=1,...,T
when Λ(t) ∈ RN+ , for t ∈ Γ≤,
Λ(t) ∈ RN , for t ∈ Γ=,
Λ(t) ∈ RN− , for t ∈ Γ≥.
7Where the pairwise intersection between each subset of Γ is ∅ and Γ ⊆ {1, . . . ,T }.
14
2. Background and theory
As in the bi-marginal case, we compute the partial derivative w.r.t. Mi1,...,iT of the
dual function with respect toM and set it to zero in order to find an expression for
the optimal transport plan, M∗(Λ(t))
0 ∂ (t)= L(M,Λ) =C
M i1,...,iT
+ ϵ logMi1,...,i +∑Λ ,∂ T iti1,...,iT t∈Γ
yielding
(t)
∗ ( ) ⎛ CM Λ t exp i1,...,iT
+∑t∈ΓΛi ⎞t
i1,...,iT ( ) = − .⎝ ϵ ⎠
We define K and u similarly to the terms presented in the bi-marginal case:
⎡ C ⎤ ⎡ ⎤⎢ i1,...,i ⎥ ⎢
(t)⎥
K ∶= exp⎢ − T ⎥ and (t) Λi1,...,iT ⎢ ⎥ ui ∶= exp⎢ i ⎥ϵ ⎢ − ϵ ⎥. (2.19)⎢ ⎥ ⎢ ⎥
⎣ ⎦ ⎣ ⎦
Our transport tensor can then be expressed as M∗(Λ(t)) = K ⊙U, where U is a
rank-one tensor defined by
U (t)i1,...,iT =∏ui , (2.20)t
t∈Γ
which means that we can write U = u(1) ⊗ u(2) ⊗ ⋅ ⋅ ⋅ ⊗ u(T ).
The minimum of (2.18) is obtained at M∗(Λ(t)), and we have that
ϕ(Λ) = L(M∗(Λ(t)),Λ)
N N ⎛ C (t)i1,...,iT +∑ ∈ Λ ⎞ T Nt Γ
= −ϵ i⋅ ⋅ ⋅ exp [ − t (t)∑ ∑ ] − 1 −∑∑Λi µt,i.
i1=1 iT =1⎝ ϵ ⎠ t=1 i=1
As in the bi-marginal case, we set the derivatives with respect to the elements of Λ
to zero,
N N N N C (t)
0 ∂ϕ i1,...,iT
+∑t∈ΓΛi
= ( ) = ∑ ⋅ ⋅ ⋅ ∑ ∑ ⋅ ⋅ ⋅ ∑ exp [ −
t ] − µ
t t,i
.
∂Λi i1=1 it−1=1 it+1=1 it T =1 ϵ
Here, we utilize coordinate ascent and, using the notation defined in (2.19), we
rearrange the terms as follows
u(t),ks+1 = u(t),ks ⊙ µt ⊘ Pt(K⊙Uks), (2.21)
where ks represents the Sinkhorn iteration. Note thatU is updated in each Sinkhorn
iteration as well, since it depends on u(t), hence the iteration index ks. However,
unlike in the bi-marginal case, (2.17) includes inequality conditions. As a result,
(2.21) might fall outside the allowed set. If this occurs, we will need to project
it back into the allowed set. This results in three different update rules for u(t),
depending on the sign of the Lagrangian multipliers. Similar to (2.14) and (2.15),
our formulation then takes the form
u(t),ks+1 =min(u(t),ks ⊙ µt ⊘ P kst(K⊙U ),1), for t ∈ Γ≤,
u(t),ks+1 = u(t),ks ⊙ µt ⊘ P (K⊙Ukst ), for t ∈ Γ=, (2.22)
u(t),ks+1 =max(u(t),ks ⊙ µ ⊘ P (K⊙Ukst t ),1), for t ∈ Γ≥,
15
2. Background and theory
where min and max denote element wise minimum and maximum, respectively.
These iterations continue until they have converged to a feasible solution of the
optimal transport problem. This can be measured in multiple ways, but a common
way is to proceed until the value ∥u(t),ks+1 − u(t),ks∥2 is smaller than some chosen
tolerance level.
This is still a challenging computation since the number of terms that must be
summed in order to compute Pt grows exponentially with the number of marginals in
the problem. However, if the underlying cost tensorC is of a certain graph-structure,
the projections can be computed efficiently by vector/matrix multiplications.
2.2.4 Graph-structured problems
The graph-structure of interest in this thesis is the tree-structure. The following
definition outlines a general tree, and this is what the theory in this thesis is built on.
Definition 2.2.1. [HRCK21, Definition 3.1] A graph Υ = (V ,E), with vertices V
and edges E , is a tree if it is acyclic and connected. The vertices with degree 1 are
called leaves, and we denote the set of leaves with L. For a vertex t ∈ V, the set of
neighbors Nt is defined as the set of vertices that have a common edge with t. The
path between two vertices t1 and tL is a sequence of edges that connect t1 and tL,
such that all edges are distinct. We also denote a path by the set of intermediate
vertices.
An illustration of a simple tree graph with five nodes can be seen in Figure 2.1
below. The vertices are the nodes, and the edges are the lines that connects the
nodes.
2
1 4
3
5
Figure 2.1: A simple tree graph with five nodes.
Suppose now that Υ = (V ,E) is a connected graph with T = ∣V ∣ nodes, and
consider the optimization problem
min ⟨C,M⟩ + ϵD(M)
M∈RNT+ (2.23)
s.t. Pt(M) = µt, t ∈ Ṽ ,
where Ṽ ⊆ V is a set of vertices. Now let the cost tensor C have the structure
C (t ,t )i1,...,iT = ∑ C
1 2
i , (2.24)t
( )∈E 1
,it2
t1,t2
16
2. Background and theory
where C(t1,t2) ∈ RN×N . This means that the cost term takes the form
⟨C,M⟩ = ∑ Ci1,...,iTMi1,...,iT
i1,...,iT
(t ,t )
= C 1 2∑ ∑ it ,i Mt i1,...,i1 2 T
i1,...,iT (t1,t2)∈E
(t ,t )
= 1 2∑ ∑ Cit ,i ∑ Mt i1,...,iT (2.25)
( 1 2t1,t2)∈E it1 ,it2 {i1,...,iT }/{it1 ,it2}
(t1,t= C 2
)
∑ ∑ i [Pt ,t (M)]t i ,i1 ,it2 1 2 t1 t2(t1,t2)∈E it1 ,it2
= ⟨C(t1,t2)∑ , Pt1,t2(M)⟩,
(t1,t2)∈E
where Pt1,t2(M) denotes the joint, i.e., bi-marginal, projection of the tensor M on
the two marginals t1 and t2, given by
[Pt1,t2(M)]it ,it ∶= ∑ Mi1,...,iT . (2.26)1 2
{i1,...,iT }/{it1 ,it2}
This is called an entropy regularized graph-structured multi-marginal optimal trans-
port problem when (2.23) has a cost tensor structured according to (2.24). In this
thesis, we refer to this as the standard optimal transport form.
The graph-structure of the optimal transport problem allows us to calculate
(2.22) by applying the following theorem, which holds for marginal projections,
i.e., for (2.7).
Theorem 2.2.1. [HRCK21, Theorem 3.2] Let K = exp (−C/ϵ) where C decouples
as (2.24) for the tree Υ = (V ,E), and let U = u(1)⊗u(2)⊗⋯⊗u(Υ). Define K(t1,t2) =
exp (−C(t1,t2)/ϵ) for (t1, t2) ∈ E . Then the projection on the t:th marginal of K⊙U
is of the form
Pt(K⊙U) = u(t) ⊙ ⊙α(t,r).
r∈Nt
The vectors α(t,r), for all ordered tuples (t, r) ∈ E , can be computed recursively start-
ing in the leaves of the tree according to
α =K(t,r)u(r)(t,r) for r ∈ L,
α (t,r) (r)(t,r) =K (u ⊙ ⊙ α(r,ℓ)) for r ∉ L.
ℓ∈Nr∖{t}
The Sinkhorn iterations can be used to efficiently solve these kinds of optimal
transport problems. Bi-marginal projections, as in (2.26), can be computed in a
similar way.
Proposition 2.2.1. [HRCK21, Proposition 3.3] Let the assumptions in Theorem
2.2.1 hold, let t1, tL ∈ V, and let t1, t2, . . . , tL denote the path between t1 and tL.
Then the bi-marginal projection of K ⊙U on the marginals t1 and tL, denoted by
Pt1,t (K⊙U), isL
⎛ L−1 ⎞
∏ diag (u(tℓ) ⊙ ⊙ α(t ,k))K(tℓ,tℓ+1) diag (u(tL) ⊙ℓ ⊙ α(tL,r)), (2.27)
⎝ ℓ=1 r∈Nt ⎠ℓ r∈Nt /{tL L−1}
r∉{t1,...,tL}
17
2. Background and theory
where α(t1,t2), for (t1, t2) ∈ E , are defined as in Theorem 2.2.1.
Theorem 2.2.1 and Proposition 2.2.1 can be simplified in the special case of a
so-called path graph, which is a simpler version of a tree. It is defined as follows.
Definition 2.2.2. [GY05, page 18] The path graph Pn is a tree characterized by
two nodes with a vertex degree of 1, while the remaining n−2 nodes possess a vertex
degree of 2. Consequently, a path graph can be visually represented in such a way
that all its vertices and edges are positioned along a single straight line.
An illustration of a simple path graph with four nodes can be seen in Figure 2.2
below. This type of graph can for example appear when the nodes represent different
time instances.
1 2 3 4
Figure 2.2: A simple path graph with four nodes.
Operating on path graphs enables a simplification of Theorem 2.2.1 and Propo-
sition 2.2.1, which are defined on general trees, to the theorem below.
Theorem 2.2.2. [EHJK20, Proposition 2] Given the notations above, it holds that
Pt(K⊙U) = u(t) ⊙ φ̂(t) ⊙ φ(t),
Pt,t+1(K⊙U) = diag(φ̂(t))diag(u(t))Kt,t+1 diag(u(t+1))diag(φ(t+1))
where
φ̂(t) = (K(t−1,t)
T
) diag (u(t−1))φ̂(t−1), for t = 2, . . . ,T ,
φ(t) =K(t,t+1) diag (u(t+1))φ(t+1), for t = 1, . . . ,T − 1.
Here, the starting values φ̂(1) and φ(T ) are defined as vectors of ones.
2.2.5 Generalized multi-marginal optimal transport
As mentioned in the introduction, our focus in this thesis is generalized optimal
transport problems. We repeat the description here for convenience, and comple-
ment this with the mathematical expression.
A generalization of (2.17) is where not all the marginals are given and fixed.
Instead they can be associated with convex cost functions. In the generalized prob-
lem, we also impose costs on the bi-marginal projections of the tensor. As before,
let Υ T= (V ,E) be a connected graph with T = ∣V ∣ nodes, and let C ∈ R̄N be a cost
tensor that takes the form (2.24). The convex graph-structured tensor optimization
problems of interest are problems of the form
min ⟨C,M⟩ + ϵD(M) + g(t)∑ (Pt(M)) + f (t1,t2)∑ (Pt1,t2(M))
M∈RNT+ t∈V (t1,t2)∈E
s.t. Pt(M) = µt, for t , (2.28)∈ VE ⊆ V
Pt(M) ≤ µt, for t ∈ VI ⊆ V ,
18
2. Background and theory
where g(t) and f (t1,t2) are proper, convex, and lower semi-continuous8 functions
affecting nodes and edges respectively. In particular, g(t) is the convex cost for
the mass in the nodes and f (t1,t2) is the convex cost of moving mass on the edges.
This is a convex problem since the functions are convex on a convex domain, and
known optimization theory for convex problems applies. A solution algorithm for
problems of the form (2.28) has been developed, based on coordinate ascent in the
Lagrangian dual of an equivalent problem, see [RHCK21]. In this thesis, we develop
a new method. To this end, we examine the entropic proximal gradient method in
the next section.
2.3 The entropic proximal gradient method
The entropic proximal gradient algorithm, which we wish to apply to (2.28), is
an optimization method commonly used for solving convex optimization problems.
It combines the gradient descent method with a proximal operator to handle the
potentially non-smooth part of the objective function. While the function does not
necessarily have to be non-smooth, the method is capable of addressing such cases.
To gain a clearer understanding of the entropic proximal gradient method, which
does not assume a Euclidean underlying space, we first review the proximal gradient
method that operates within a Euclidean space, see [Bec17, Chapter 10]. Consider
the following problem
min E(y) +G(y), (2.29)
y∈RN
where RN denotes the feasible set and we assume the following.
Assumption 1. [Bec17, Assumption 10.77]
(A) G ∶ RN → (−∞,∞] is proper, closed and convex.
(B) E ∶ RN → (−∞,∞] is proper, closed and convex; dom(G) ⊆ int(dom(E)) and
E is L -smooth9E over int(dom(E)).
(C) Let H(y) ∶= E(y) + G(y). The set of optimal solutions of problem (2.29) is
nonempty and denoted by Y ∗. The optimal value of the problem is denoted by
H(y)opt.
The proximal gradient method consists of a gradient step followed by a proximal
mapping, which is defined as follows.
Definition 2.3.1. (Proximal mapping)[Bec17, Definition 6.1] Given a function f ∶
RN → (−∞,∞], the proximal mapping of f is the operator given by
prox 1f(x) = argmin{f(u) + ∥u − x∥2}
u∈RN 2
for any x ∈ RN .
8See Definitions 2.1.7, 2.1.4 and 2.1.10.
9See Definition 2.1.11.
19
2. Background and theory
The update step for the proximal gradient method is given by
yk+1 = argmin 1E(yk) + ⟨∇E(yk), y − yk⟩ +G(y) + 2 ∣∣y − y
k∣∣2,
y∈RN tk
where tk is the step size at iteration k. This can be transformed into
yk+1 = argmin 1tkG(y) + k2 ∣∣y − (y − t
k
k∇E(y ))∣∣
2,
y∈RN
which by Definition 2.3.1 is the same as
yk+1 = proxt G(yk − tk∇E(yk)).k
In this context, we can clearly observe the combination of the gradient step and the
proximal mapping.
With the proximal gradient method defined, we can proceed to define the gener-
alized entropic proximal gradient method. However, we first need to introduce the
Bregman distance.
Definition 2.3.2. (Bregman distance)[Bec17, Definition 9.2] Let Ψ ∶ RN → (−∞,∞]
be a proper, closed and convex function that is differentiable over dom(∂Ψ). The
Bregman distance associated with Ψ is the function DΨ ∶ dom(Ψ)×dom(∂Ψ)→ R
given by
DΨ(x, y) = Ψ(x) −Ψ(y) − ⟨x − y,∇Ψ(y)⟩.
The generalized entropic proximal gradient algorithm then reads
yk+1 = argmin 1E(yk) +G(y) + DΨ(y, yk) + ⟨∇E(yk), y − yk⟩, (2.30)
y∈RN tk
where tk denotes the step size. We can express this method as taking a gradient
step with respect to E(y) and an entropic proximal step with respect to G(y). To
ensure convergence of (2.30), Ψ has to fulfill some assumptions as well. These are
listed below.
Assumption 2. [Bec17, Assumption 10.78]
• Ψ is proper, closed and convex.
• Ψ is differentiable over dom(∂Ψ).
• dom(G) ⊆ dom(Ψ).
• Ψ + δdom(G) is 1-strongly convex.
We can see that the first and second point of the assumption is redundant, since
it is already required in Definition 2.3.2. If we let
⎪⎧ n
Ψ ⎪∑z i=1 zi log zi, for all i such that zi ≥ 0,( ) = ⎨ (2.31)
⎪⎪∞, else,⎩
the normalized Kullback-Liebler relative entropy is obtained as the Bregman dis-
tance DΨ(x, y), see [Teb92],
n xi
DΨ(x, y) = KL(x∥y) =∑xi log + yi − xi. (2.32)
i=1 yi
20
2. Background and theory
The observant reader might notice the similarity with Definition 2.16 – this is actu-
ally its general form, where x =M and y = 1.
This choice of function, (2.31), fulfills the requirements in Assumption 2, which
will be shown i Section 3.4.1. In the theorem provided below, the convergence rate
of the entropic proximal gradient method is given. For proof, we refer to [Bec17,
Theorem 10.81].
Theorem 2.3.1. (O(1/k) rate of convergence of the entropic proximal gradient
method). Suppose that Assumptions 1 and 2 hold. Let {yk}k≥0 be the sequence of
(2.30) generated by the entropic proximal gradient method for solving problem (2.29)
with constant step size in which LE ∶= 1t for all k ≥ 0. Thenk
(a) the sequence {H(yk)}k≥0 is non-increasing;
(b) for any k ≥ 1 and y∗ ∈ Y ∗,
LEDΨ(x∗, x0)
H(xk) −Hopt ≤ .
k
21
2. Background and theory
22
3
The entropic proximal gradient
method for solving generalized
optimal transport problems
In this chapter, we derive a new method for solving problems of the form (2.28),
repeated here for convenience,
min ⟨C,M⟩ + ϵD(M) + (t)∑ g (Pt(M)) + f (t1,t2)∑ (PT t1,t2(M)),M∈RN+ t∈V (t1,t2)∈E
s.t. P (3.1)t(M) = µt, for t ∈ VE ⊆ V ,
Pt(M) ≤ µt, for t ∈ VI ⊆ V ,
where we define F (M) ∶= ∑ g(t)t∈V (Pt(M))+∑ (t1,t )( 2t1,t2)∈E f (Pt1,t2(M)). Our method
consists of solving a sequence of problems on standard optimal transport form, where
the sequence is indexed by ko, i.e.,
Mko+1 = argmin ⟨C̃ko ,M⟩ + ηD(M),
M∈RNT+
s.t. M (3.2)Pt( ) = µt, for t ∈ VE,
Pt(M) ≤ µt, for t ∈ VI ,
where C̃ko ∶= C + ∇F (Mko) − δ log(Mko) and η ∶= ϵ + δ. Here ϵ is the regularization
parameter and δ1 denotes the step size in the entropic proximal gradient algorithm.
Moreover, the fact that we use the gradient of F means that we restrict ourselves to
functions g and f that are continuously differentiable. We have that, under certain
conditions, Mko converges to a solution of (3.1), see Theorem 3.4.1.
In the upcoming sections, we derive this algorithm, and show that C̃ko has the
same graph structure as the underlying constant cost matrix C. Thereafter, each
problem (3.2) can be solved using the Sinkhorn iterations.
3.1 Deriving (3.2) and proving that C̃ko has the
same graph-structure as C
In order to derive (3.2), we utilize the entropic proximal gradient method, as outlined
in Section 2.3, for solving (3.1). We know that g and f are proper, convex and
1Note that this is not the indicator function.
23
3. The entropic proximal gradient method for solving generalized optimal
transport problems
lower semi-continuous functions. However, in order to apply the entropic proximal
gradient, it is also necessary for g and f to be Lg,f -smooth, see Assumption 1. This
enables the application of (2.30) to (3.1). We take an entropic proximal step with
respect to ⟨C,M⟩+ ϵD(M) and the constraints, and a gradient step with respect to
F (M), which yields
Mko+1 = argmin F (Mko) + ⟨C,M⟩ + ϵD(M) + δKL(M∥Mko) + ⟨∇F (Mko),M −Mko⟩,
M∈RNT+
s.t. Pt(M) = µt, for t ∈ VE,
Pt(M) ≤ µt, for t ∈ VI .
(3.3)
Remark 1. We can rewrite ⟨∇F (Mko),M−Mko⟩ = ⟨∇F (Mko),M⟩−⟨∇F (Mko),Mko⟩.
Here ⟨∇F (Mko),Mko⟩ is only a constant and is not relevant to solving our optimiza-
tion problem since we are only interested in the argument and not the actual value.
Thus only ⟨∇F (Mko),M⟩ remains. The same applies to the term F (Mko) – it is a
constant and can be ignored in the solving of the optimization problem.
However, (3.3) is not on the standard optimal transport form, i.e., not on the
form of (2.17). This is the subject of the following sections.
3.1.1 ∇F (Mko) has a graph-structure
In order to write (3.3) in the standard optimal transport form, we begin by interpret-
ing ⟨∇F (Mko),M⟩. The goal is to show that ∇F (Mko) has the same graph-structure
as C, as defined in (2.24). To this end, consider the following definition.
Definition 3.1.1. (Fréchet differentiable)[Lue69, page 172] Let T be a transforma-
tion defined on an open domain D in a normed space X and having range in a
normed space Y . If for fixed x ∈ D and each h ∈ X there exists σT (x;h) ∈ Y which
is linear and continuous with respect to h such that
lim ∥T (x + h) − T (x) − σT (x;h)∥ = 0,
∥h∥→0 ∥h∥
then T is said to be Fréchet differentiable at x and σT (x;h) is said to be the
Fréchet differential of T at x with increment h.
Let T ∶ RN → R be a function, and assume that T is Fréchet differentiable
throughout RN . We can interpret the variables of Definition 3.1.1 as follows: At a
fixed point x ∈ RN , the Fréchet differential σT (x;h) is, by definition, of the form
Axh, where Ax is a bounded linear operator from RN to R. Thus, as x varies
over RN , the correspondence x→ Ax defines a transformation from RN into R; this
transformation is called the Fréchet derivative ∇T of T . Thus we have, by definition,
σT (x;h) = ⟨∇T (x), h⟩. This means that the Fréchet derivative can be interpreted as
providing directional information about how the function T behaves in the vicinity
of a given point x, which is summarized in the following definition.
24
3. The entropic proximal gradient method for solving generalized optimal
transport problems
Definition 3.1.2. The gradient of T ∶ RN → R in a point x is the element ∇T (x)
s.t. for all h ∈ RN , ⟨∇T (x), h⟩ is the Fréchet derivative with increment h.
This interpretation aligns with the notion of a derivative in calculus, where it rep-
resents the rate of change of a function in a particular direction.
In our case, F is a transformation such as T in Definition 3.1.1, which means that
lim ∥F (M + δM) − F (M) − σF (M; δM)∥ = 0, (3.4)
∥δM∥→0 ∥δM∥
where δM represents a small change in the input M.
To prove that ∇F (Mko) is a tensor with the same graph-structure as C, we will
utilize Taylor’s Theorem for multivariate functions, listed below.
Theorem 3.1.1. [MT03, Page 160](Multivariate version of Taylor’s theorem) Let
f ∶ RN → R be a k-times continuously differentiable function at the point a ∈ RN .
Then there exist functions hα ∶ RN → R, where ∣α∣ = k, such that
Dαf(a)
f(x) = (x − a)α + h (x)(x − a)α∑ ∑ α ,
∣α∣≤k α! ∣α∣=k
and
lim hα(x) = 0.
x→a
Since we are only interested in the first order Taylor polynomial, the gradient of
F is our only concern. Hence, F needs to be at least C1, which is covered by the
assumption that g and f are Lg,f -smooth.
In the following theorem, we implement Taylor’s Theorem in order to rewrite
(3.4) and show that ∇F (Mko) is a tensor with the same graph-structure as C.
Theorem 3.1.2. Define F (M) ∶= ∑ g(t)t∈V (Pt(M)) + ∑ (t1,t2)(t1,t2)∈E f (Pt1,t2(M))
where g(t) and f (t1,t2) are proper, convex, lower semi-continuous, and Lg,f -smooth
functions. We have that
∇F (Mko) =∑P ∗t (∇g(t)(P kot(M ))) + ∗ (t∑ P (∇f 1,t2) kot1,t2 (Pt1,t2(M ))),
t∈V (t1,t2)∈E
which is a tensor with the same graph-structure as C, defined as (2.24).
Proof. We begin by expressing F (M+δM) in terms of the functions g and f . These
functions follow the same pattern, and we focus on g for better readability, and keep
f implicit. A difference to keep in mind is that g is a function on the vertices, i.e.,
nodes, and f on the edges. Doing a Taylor series expansion, we have that
F (M + δM) = g(t)∑ (P (M) + P (δM)) + f (t1,t2)t t ∑ (Pt1,t2(M) + Pt1,t2(δM))
t∈V (t1,t2)∈E
= g(t)∑ (Pt(M)) + ⟨∇g(t)(Pt(M)), Pt(δM)⟩ + f (t1,t2)∑ (. . . ) +O(δM)
t∈V (t1,t2)∈E
= F (M) + ⟨∇g(t)∑ (Pt(M)), Pt(δM)⟩ + ∑ ⟨⋅, ⋅⟩ +O(δM),
t∈V (t1,t2)∈E
25
3. The entropic proximal gradient method for solving generalized optimal
transport problems
where O(δM) = h1(M + δM)(δM). When δM → 0, then O(δM)/δM → 0. More-
over, we see that F (M+δM) = F (M)+∑t∈V⟨∇g(t)(Pt(M)), Pt(δM)⟩+∑(t1,t2)∈E⟨⋅, ⋅⟩+
O(δM), which gives
F (M + δM) − F (M) = (t)∑⟨∇g (Pt(M)), Pt(δM)⟩ + ∑ ⟨⋅, ⋅⟩ +O(δM).
t∈V (t1,t2)∈E
To understand the element ⟨∇g(t)(Pt(M)), Pt(δM)⟩, first note that this inner prod-
uct lives in RN . We consider the inner product with general variables in order to
ease notation, which gives
N
⟨y,Pt(x)⟩RN =∑ yiPt(x)i
i=1
N N N N N
=∑ yi ∑ ⋅ ⋅ ⋅ ∑ ∑ ⋅ ⋅ ⋅ ∑ xi1,...,iT (3.5)
i=1 i1=1 it−1=1 it+1=1 iT =1
N N
= ∑ ⋅ ⋅ ⋅ ∑ yitxi1,...,iT ,
i1=1 iT =1
where the definition of Pt(x), (2.7), is applied. By the definition of an adjoint oper-
ator2, we know that ⟨y,A(x)⟩RN = ⟨A∗(y), x⟩RNT . Reverting back to our notation,
with A = Pt, y = ∇g(t)(Pt(M)) and x = δM, we get
⟨∇g(t)(Pt(M)), Pt(δM)⟩RN = ⟨P ∗t (∇g(t)(Pt(M))), δM⟩RNT . (3.6)
Comparing the right hand side of (3.6) to Definition 3.1.2, we see that P ∗(∇g(t)t (Pt(M)))
is part of the gradient of F (M). We also add the function f to the expression, which
follow the same pattern, but is a function on the edges. This means that the gradient
of F (M) can be written as
∇F (M) = ∗ (t)∑Pt (∇g (Pt(M))) + P ∗ (∇f (t ,t )∑ 1 2t1,t2 (Pt1,t2(M))). (3.7)
t∈V (t1,t2)∈E
2See Definition 2.1.3.
26
3. The entropic proximal gradient method for solving generalized optimal
transport problems
We now have an expression for the gradient of F , and can express ⟨∇F (Mko),M⟩ by
⟨∇F (Mko),M⟩ = ⟨ P ∗ (t) k∑ ot (∇g (Pt(M )))
t∈V
+ P ∗ (∇f (t1,t∑ 2)t1,t2 (Pt1,t2(M
ko))),M⟩
(t1,t2)∈E RNT
= ⟨∑P ∗ (t)t (∇g (Pt(Mko))),M⟩
t∈V RNT
+ ⟨ P ∗ (t1,t2) k∑ ot1,t2(∇f (Pt1,t2(M ))),M⟩
(t1,t2)∈E RNT (3.8)
= ⟨ ∇g(t)∑ (P (Mkot )), Pt(M)⟩
t∈V RN
+ ⟨ ∇f (t1,t2)∑ (P (Mkot1,t2 )), Pt1,t2(M)⟩
(t1,t2)∈E RN
= ⟨∇g(t)(P (Mk∑ ot )), Pt(M)⟩
t∈V RN
+ ⟨∇f (t1,t ) k∑ 2 (P ot1,t2(M )), Pt1,t2(M)⟩ .
(t1,t2)∈E RN
Comparing (3.8) to (2.25), we see that ∇F (Mko) has the same graph-structure as
C, where ∇g acts on nodes and ∇f acts on edges. It does not matter whether we
act on both nodes and edges because we can always rewrite computations on the
nodes as computations on the edges by appropriate transformations, see
N
⟨Ct, Pt(M)⟩ = ∑ Ct,it[Pt(M)]it
it=1
N N
= ∑ Ct,it ∑[Pt,t̃(M)]it,it̃ (3.9)
it=1 it̃=1
N
= ∑ Ct,it[Pt,t̃(M)]it,i ,t̃
it,it̃=1
and we are done.
Remark 2. The choice of t̃ in (3.9) is not unique. Furthermore, we have an ex-
pression in order to convert marginal projections to bi-marginal projections by
N
Pt(M) = ∑ αt̃∑[Pt,t̃(M)]it,i ,t̃
t̃∈Nt it̃=1
where ∑t̃∈N αt̃ = 1 and Nt as in Definition 2.2.1.t
By Theorem 3.1.2, ∇F (Mko) has the same graph-structure as C, we can combine
27
3. The entropic proximal gradient method for solving generalized optimal
transport problems
these two, i.e., we can rewrite (3.3) as
Mko+1 = argmin ⟨C +∇F (Mko),M⟩ + ϵD(M) + δKL(M∥Mko),
M∈RNT+
s.t. M (3.10)Pt( ) = µt, for t ∈ VE,
Pt(M) ≤ µt, for t ∈ VI ,
and we are one step closer to rewriting (3.3) to standard optimal transport form.
However, it remains to merge ϵD(M) and δKL(M∥Mko) into a single expression.
3.1.2 Merging ϵD(M) and δKL(M∥Mko) into a single expres-
sion
Consider the following theorem.
Theorem 3.1.3. We have that
ϵD(M) + δKL(M∥Mko) = (ϵ + δ)( ∑ Mi1,...,iT log (Mi1,...,iT ) −Mi1,...,iT )
i1,...,iT
+M koi1,...,iT log(Mi1,...,iT )
+ ϵ + δMkoi ,...,iT ,1
i.e., an entropy term, a linear term, and constants.
Proof. The sum of the two entropy terms is
ϵD(M) + δKL(M∣∣Mko) =ϵ( ∑ Mi1,...,iT log(Mi1,...,iT ) −Mi1,...,iT + 1)
i1,...,iT
M log Mi1,...,i+ δ( ( T∑ i1,...,iT ) −M koMk i1,...,iT
+M
o i1,...,iT
).
i1,...,iT i1,...,iT
We can rewrite the δ-term as δ( k∑ oi1,...,iT Mi1,...,iT log(Mi1,...,iT )−Mi1,...,iT log(Mi1,...,iT )−
M ko koi1,...,iT +Mi ,...,iT ). Here, Mi1,...,iT log(M1 i1,...,iT ) is a linear term.
This gives us an entropy term,
(ϵ + δ)( ∑ Mi1,...,iT log (Mi1,...,iT ) −Mi1,...,iT ),
i1,...,iT
a linear term, M ko koi1,...,iT log(Mi1,...,iT ), and constants ϵ + δMi ,...,iT , and we are done.1
We have shown that ϵD(M) and δKL(M) can be merged into a single expression,
consisting of an entropy term, a linear term, and constants. Next, we need to incor-
porate these into (3.10). The linear term from Theorem 3.1.3,Mi1,...,iT log(Mkoi1,...,iT ),
can be added to ⟨C +∇F (Mko),M⟩, which gives
⟨C +∇F (Mko) − δ log(Mko),M⟩.
We define
C̃ko ∶=C +∇F (Mko) − δ log(Mko),
28
3. The entropic proximal gradient method for solving generalized optimal
transport problems
and get ⟨C̃ko ,M⟩.
As in Remark 1, ϵ and δMkoi ,...,iT are constants and they are therefore not relevant1
to solving our optimization problem and can be removed. We are left with (ϵ +
δ)(∑i1,...,iT Mi1,...,iT log (Mi1,...,iT )−Mi1,...,iT ) = (ϵ+δ)(D(M)−NT ). As before, here
we handle (ϵ+ δ)NT as in Remark 1, i.e., we remove it. We define η ∶= ϵ+ δ, and we
can write (3.10) as
Mko+1 = argmin ⟨C̃ko ,M⟩ + ηD(M),
M∈RNT+
s.t. M for (3.11)Pt( ) = µt, t ∈ VE,
Pt(M) ≤ µt, for t ∈ VI .
We have thus reformulated (3.3) into the standard formulation of an optimal trans-
port problem, enabling us to apply established theories for its solution. Left to show
is that (3.11) is a graph-structured problem with the same underlying graph as (3.1).
This is the focus in the next section.
3.1.3 The graph-structure of C̃ko
We know that
⟨C̃ko ,M⟩ = ⟨C +∇F (Mko) − δ log(Mko),M⟩
= ⟨C,M⟩ + ⟨∇F (Mko),M⟩ − ⟨δ log(Mko),M⟩.
We also know that C has a graph-structure, so computing ⟨C,M⟩ is straightforward.
In Section 3.1.1 we proved that ∇F (Mko) also have a graph-structure. Next, we
prove that also δ log(Mko) have a graph-structure.
Theorem 3.1.4. Given the notation above, C̃ko has the same graph-structure as C,
defined as (2.24).
Proof. We have that C̃ko = C + ∇F (Mko) − δ log(Mko). By Theorem 3.1.2 we have
that ∇F (Mko) is a tensor with the same graph-structure as C. Remains to prove
that log(Mko) does also.
To demonstrate that log(Mko+1) has a graph-structure, we use proof by induction.
First, we can easily define an initial transport matrixM0 asM0 = 1⇒ log(M0) = 0.
Assume now that log(Mko) has a graph-structure. This implies that C̃ko also has a
graph-structure.
From the Sinkhorn iterations in Section 2.2.3.2, we know that Mko+1 can be
computed as Kko ⊙Uko , where
⎡ C̃ko ⎤⎢ ⎥
Kko ∶= exp⎢ i1,...,iT ⎥ and Uko (t),ki ,...,iT ⎢ − ⎥ i ,...,iT = u
o , (3.12)
1 ∏⎢ η i⎥ 1
⎣ ⎦ t∈V
where
⎡ ⎤
(t),k ⎢ Λ(t),ko ⎥
u oi ∶= exp⎢⎢ − i ⎥⎥.
⎢ η ⎥
⎣ ⎦
29
3. The entropic proximal gradient method for solving generalized optimal
transport problems
Note that the notation Mko+1 = Kko ⊙ Uko originates from the fact that we are
working on a sequence of problems, and our cost matrix C̃ko changes with each
iteration. We thus have
log(Mko+1 ko koi ,...,iT ) = log([K ⊙U ]1 i1,...,iT )
= log([ Kko ⊙ u(t),k∏ ot1,t2 ∏ ] )
(t1,t2)∈E t∈V i1,...,iT
C̃ko (t),ko
= log exp t1,t2 Λ( ([ ∑ − −∑ ] ))
(t1,t2)∈E η t∈V η i1,...,iT
C̃ko (t),kot1,t2 Λ
= ∑ −∑ .
(t1,t2)∈E η t∈V η
This has a graph-structure, and as in Remark 2, we can transform this to computa-
tions on the edges.
Therefore, we conclude that log(Mko+1) has a graph-structure, completing the
induction proof.
We have now proved that C̃ko has the same graph-structure as C, and previously
known theory applies.
3.2 The Sinkhorn iterations
At this point, we have our generalized optimal transport problem on standard form,
as given in (3.11), and we need to perform Lagrange relaxation on the problem in
order to utilize the Sinkhorn iterations. This process was presented in detail in
Section 2.2.3.2, and we just present the final result here.
We define Kko and U as in (3.12) and (2.20), respectively. As mentioned in
the proof of Theorem 3.1.4, we are solving a sequence of problems where our cost
matrix C̃ko changes with each iteration, and we need to keep track of the current
iteration ko. Note that we do not assign an outer-iteration index to U because it is
the variable being updated. The notation Uko denotes the optimal solution to (3.2),
i.e., the tensor that the Sinkhorn iterations converge to. We update u(t),ks based on
the equality/inequality conditions on our marginals,
u(t),ks+1 =min(u(t),ks ⊙ µ ko kst ⊘ Pt(K ⊙U ),1), for t ∈ VI ,
u(t),ks+1 ( )
(3.13)
= u t ,ks ⊙ µt ⊘ Pt(Kko ⊙Uks), for t ∈ VE.
With the Sinkhorn iterations derived, we need to clarify the computation of C̃ko ,
which we do in the next section.
3.3 The computation of C̃ko
We have shown that C̃ko has the required graph-structure. However, as mentioned
in the proof of Theorem 3.1.2, some functions in (3.11) need a transformation in
30
3. The entropic proximal gradient method for solving generalized optimal
transport problems
order to correspond dimensionally, since some act on nodes and others on edges. We
have that, when writing all functions explicitly,
⟨C̃ko ,M⟩ = ∑ ∑ Cit [P (M)]1 ,it2 t1,t2 it1 ,it2
(t1,t2)∈E it1 ,it2
+ ∇g(t)∑∑ (P kot(M ))[Pt(M)]it
t∈V it
+ ∇f (t1,t2)∑ ∑ (Pt1,t2(Mko))[Pt1,t2(M)]it1 ,it2 (3.14)
(t1,t2)∈E it1 ,it2
− δ k −1∑ ∑ log(K ot1,t2 )[Pt1,t2(M)]it1 ,it2
(t1,t2)∈E it1 ,it2
+ δ log(u(t),ko−1∑∑ )[Pt(M)]it .
t∈V it
The functions ∇g(t) and log(u(t),ko) only acts on specific nodes, not edges. Therefore
we must redefine the way they act on our cost function C̃(t,t+1). This can be done
as shown in Remark 2.
In the next section we verify the convergence of our algorithm, along with deter-
mining an appropriate termination criterion.
3.4 Convergence and termination criterion for the
entropic proximal gradient method
Now that our algorithm has been formulated, it is crucial to verify its convergence
to an optimal solution and determine an appropriate termination criterion for the
outer iterations. We commence by proving convergence and subsequently address
the termination criterion.
3.4.1 Convergence
To prove convergence, we verify that the assumptions outlined in Section 2.3 hold.
We begin by splitting the system (3.1) into the two functions G and E as
G(M) ∶= ⟨C,M⟩ + ϵD(M) + δA(M),
and
E(M) ∶= g(t)(P (M)) + f (t ,t )∑ ∑ 1 2t (Pt1,t2(M)), (3.15)
t∈V (t1,t2)∈E
where A RNT= {M ∈ + ∶ Pt(M) = µt, t ∈ VE ⊆ V , Pt(M) ≤ µt, t ∈ VI ⊆ V}.
This is the same as the notation used in Section 2.3. We can assume that a
feasible transport matrix M exists, otherwise the transport problem would not be
of interest. We first take a look at G(M) that consists of four parts; ⟨C,M⟩, ϵD(M),
the constraints andM T∈ RN+ . It is sufficient to show that ⟨C,M⟩ and ϵD(M) fulfill
the conditions, since the constraints can be seen as an indicator function δA(M), as
in Definition 2.1.8, where A is the set that represents the constraints. This function
is obviously proper and lower semi-continuous when the set A, as defined by the
31
3. The entropic proximal gradient method for solving generalized optimal
transport problems
constraints, is non-empty. Moreover, δA is convex when the set A is convex, as is
the case here.
The inner product ⟨C,M⟩ is linear in M, ensuring its convexity and lower semi-
continuity. Furthermore, to guarantee the existence of a solution to the transport
problem such that ⟨C,M⟩ < ∞, we assume that C does not contain ∞. Even if a
feasible transport matrix exists, this condition is necessary. Consequently, ⟨C,M⟩
is considered proper, as we previously assumed the existence of a feasible transport
matrix.
The function ϵD(M) consists of the terms∑i1,...,iT (Mi1,...,iT log(Mi1,...,iT )−Mi1,...,iT +
1), which is just a sum of functions of the type f(x) ∶= x log(x) − x + 1. Since the
second derivative of f is positive for all x > 0, the function f is convex and obvi-
ously proper. We also know that each component of f is continuous, hence f is also
continuous and obviously lower semi-continuous.
Before stating the arguments for g(t) and f (t1,t2) it is important to note that they
do not operate on the space RNT on their own, as seen in (3.15), where for example
g(t) ∶ RN → R,
f (t1,t2) RN2∶ → R.
But as a composite function with Pt and Pt1,t2 respectively, we get
(g(t)
T
○ P ) ∶ RNt → R,
(f (t1,t2)
T
○ Pt1,t2) ∶ RN → R.
Both g(t) and f (t1,t2) can handle M ∈ RNT since the tensors are projected onto
vectors and matrices before being processed by these functions.
To satisfy the requirements in Assumption 1, both functions f and g must be
proper, convex, closed, and LE-smooth over the interior of the domain of E, denoted
as int(dom(E)). To demonstrate that dom(G) ⊆ int(dom(E)), it is sufficient to
show that RN+ ⊆ int(dom(g(t)
2 T
)) and RN+ ⊆ int(dom(f (t1,t2))), since dom(G) ⊆ RN+ .
As mentioned in Section 2.3, choosing Ψ(z N) = ∑i=1 zi log zi in Definition 2.3.2
will produce the normalized Kullback-Liebler relative entropy term (2.32). It now
remains to show that the chosen Ψ fulfills the conditions in Assumption 2. As
mentioned before, the second derivative of Ψ is positive for all x > 0 and hence
Ψ is proper convex for all x > 0. Ψ is also continuous since the product of two
continuous functions are continuous. It is obviously differentiable over dom(∂Ψ)
and dom(G) ⊆ dom(Ψ), given that the domain of Ψ is RNT+ . Lastly, Ψ + δdom(G) is
1-strongly convex by Theorem 2.1.5.
We have now proved the following theorem.
Theorem 3.4.1. Assume that there exists a feasible transport tensor M, C does
not contain ∞, both functions g ∶ RN → R and RN2f ∶ → R are Lg,f -smooth over
int(dom(g + f)). Then, if RN ⊆ int(dom(g(t)+ )) and RN
2
+ ⊆ int(dom(f (t1,t2))), we
have that the algorithm
Mko+1 = argmin ⟨C̃ko ,M⟩ + ηD(M),
M∈RNT+
s.t. Pt(M) = µt, for t ∈ VE,
Pt(M) ≤ µt, for t ∈ VI ,
32
3. The entropic proximal gradient method for solving generalized optimal
transport problems
converges to an optimal solution of (3.1).
Lastly, we need a termination criterion for our method, which is presented in the
next section.
3.4.2 Termination criterion
Next we examine what termination criterion we need for the outer level of the
algorithm. For the inner level, we can use the square error, ∥u(t),ks+1 − u(t),ks∥2,
compared to some chosen tolerance level. As a criterion to terminate the outer
level, we can implement ∑i1,...,iT (M
ko+1 k 2o
i1,...,iT −Mi1,...,iT ) . If this is less than ε > 0,
then the difference between Mko+1 and Mko are small, and we can terminate the
algorithm. We can calculate this efficiently by utilizing the following theorem, where
M2i ,...,iT denotes the element wise square.1
Theorem 3.4.2. Following the notation above, we have that
2
∑ (Mko+1 koi1,...,iT −Mi ,...,iT )1
i1,...,iT
can be more efficiently computed by
N N N
∑ [P1((Mko+1)2)] − 2 [P (Mko+1∑ 1 ⊙Mko)] +∑ [P k 21((M o) )] .
i=1 i i=1 i i=1 i
Proof. We have that
T −1 T
M (t,t+1) (t)i1,...,iT =Ki1,...,iTUi1,...,iT = (∏ Ki ,i + )(∏ui ),t t 1 t
t=1 t=1
and
T −1 T
M2 (t,t+1)
2 (t) 2
i1,...,iT = (∏ (Ki ,i ) )(∏ (u ) ).t t+1 it
t=1 t=1
This means that M2 has the same graph-structure as M. For ease of notation,
denote Mko+1 as M̂ koi ,...,iT i1,...,iT and Mi ,...,iT as Mi1,...,iT . We have1 1
2 ⎡ T −1 T T −1 T ⎤
2
⎢
M̂ M ⎢ (t,t+1) (t) (t,t+1) (t)
⎥
( ⎥i1,...,iT − i1,...,iT ) =⎢(∏ K̂it,i )( û ) − ( Kt+1 ∏ i ∏t it,i )( u )⎢ t+1
∏ it ⎥
⎣ t=1 t=1 t=1 t=1
⎥
⎦
T −1 T
(t,t+1) 2 (t) 2
=(∏ (K̂i ,i + ) )(∏ (ûi ) )t t 1 t
t=1 t=1
T −1 T T −1 T (3.16)
2 (t,t+1) (t) (t,t+1) (t)− (∏ K̂it,it+ )(∏ ûi )( K1 ∏t i ,i + )(∏ut t 1 i )t
t=1 t=1 t=1 t=1
T −1
( 2
T 2
t,t+1) (t)
+ (∏ (Ki ,i + ) )(∏ (u ) )t t 1 it
t=1 t=1
= M̂2i1,...,iT − 2M̂
2
i1,...,iTMi1,...,iT +Mi1,...,iT .
33
3. The entropic proximal gradient method for solving generalized optimal
transport problems
Moreover, we have that
[Pt(M)]it = ∑ Mi1,...,iT .
i1,...,iT /it
If we here sum over an arbitrary index, say 1, we get
N
∑ [P1(M)]i1 = ∑ Mi1,...,iT .
i1=1 i1,...,iT
This, together with (3.16), gives us
N N
∑ (M̂
2 2
i1,...,iT −Mi1,...,iT ) =∑ [P1((M̂) )] − 2∑ [P1(M̂⊙M)]
i1,...,iT i=1 i i=1 i
N
+ [P ((M)2∑ 1 )] ,
i=1 i
and we are done.
Next we present the algorithm in pseudo code, in order to facilitate better un-
derstanding.
34
3. The entropic proximal gradient method for solving generalized optimal
transport problems
3.5 The algorithm in pseudo code
In order to facilitate a better understanding of the algorithm, we present it in pseudo
code format, for both general trees and path graphs. The algorithm has two ”levels”,
repeatedly solving (3.11) using the Sinkhorn iterations. Let us call the Sinkhorn
iterations as the inner level, and solving (3.11) repeatedly, the outer level. The outer
level is almost the same for both general trees and path graphs. The difference lies
in the projections of the bi-marginal projections, and we present the algorithms in
the following sections.
3.5.1 General trees
The inner level for general trees is presented in Algorithm 1, which is from [HRCK21,
Algorithm 3.1], and the outer level is presented in Algorithm 2.
Algorithm 1 Sinkhorn method for the tree-structured multimarginal optimal trans-
port problem
Given: Tree Υ = (V ,E) with leaves L = {1,2, . . . , ∣L∣}
Initial guess u(t) for t ∈ L
for all ordered tuples (t1, t2) ∈ E do
Initialize α(t1,t2) by
α =K(t,r)u(r)(t,r) for r ∈ L,
α =K(t,r)(t,r) (u
(r) ⊙ ⊙ α(r,ℓ)) for r ∉ L
ℓ∈Nr∖{t}
end for
Initialize t ∈ L
while Sinkhorn not converged do
u(t) ← µt ⊘⊙k∈N αt (t,r)
for (t1, t2) ∈ E on the path from t to (t + 1 mod ∣L∣) do
Update α(t1,t2) by
α =K(t,r)(t,r) u
(r) for r ∈ L,
α =K(t,r)(t,r) (u
(r) ⊙ ⊙ α(r,ℓ)) for r ∉ L
ℓ∈Nr∖{t}
end for
t← t + 1 mod ∣L∣
end while
return u(t) for t ∈ L
35
3. The entropic proximal gradient method for solving generalized optimal
transport problems
Algorithm 2 Entropic proximal gradient method for the graph-structured multi-
marginal optimal transport problem
Function Generate ∇F (Mko)
∇g(P (Mkot ))← α∇g1(P (Mkot )) + (1 − α)∇g2(P (Mkot ))
∇F (Mko)← ∇g(Pt(Mko)) +∇f(P kot1,t2(M ))
return ∇F (Mko)
Function Generate C̃(t,t+1),ko
Generate ∇F (Mko)
u1 ← u(t) ⋅ 1T
u2 ← 1 ⋅ u(t)
T
u← αu1 + (1 − α)u2
C̃(t,t+1),ko ← C(t,t+1) +∇F (Mko) − δ(log(K(t,t+1),ko−1) − log(u))
return C̃(t,t+1),ko
Initial guess K0 = 1, Pt1,t2(Mko) = 1;
while M not converged do
for t in range(T ) do
Generate C̃(t,t+1),ko
( +1) exp C̃(t,t+1),kt,t ,k oK o ← ( − )η
end for
u, φ̂, φ← Algorithm 1
for t in range(T − 1) do
⎛ L−1 ⎞
Pt1,t2(Mko)← diag (u(t )∏ ℓ ⊙ ⊙ α(t ,k))K(tℓ,tℓ+1) diag (u(tL)⊙ℓ ⊙ α(t )L,r)
⎝ ℓ=1 r∈Nt ⎠ℓ r∈Nt /{tL L−1}
r∉{t1,...,tL}
end for
ko ← ko + 1
end while
return P (Mkot )
3.5.2 Path graphs
The inner algorithm for path graphs is presented in Algorithm 3, which is based
on [HRCK21, Algorithm 3.1], but adapted for path graphs. The outer algorithm is
presented in Algorithm 4.
36
3. The entropic proximal gradient method for solving generalized optimal
transport problems
Algorithm 3 Sinkhorn method for the path graph-structured multimarginal opti-
mal transport problem
Function Generate φ
φ(T −1) = 1
φ(T −2) ←K(T −2,T −1) diag (u(T −1))
for t in range(T -3, -1, -1) do
φ(t) ←K(t,t+1) diag (u(t+1))φ(t+1)
end for
return φ
Given: K, T marginals denoted µ, node size N , types = ’<’,’=’;
Initial guess u(t) = 1, for t = 1, . . . ,T , φ̂(0) = 1;
while Sinkhorn not converged do
Generate φ
for t in range(T ) do
Pt(Mko)← u(t) ⊙ φ̂(t) ⊙ φ(t)
u (t) (t)tmp ← u ⊙ µ ⊘ P (Mkot )
if type = ’<’ then
u(t) =min(utmp,1)
else if type = ’=’ then
u(t) = utmp
else if type = ’>’ then
u(t) =max(utmp,1)
end if
if t = 0 then
T
φ̂(t+1) ← (K(t−1,t)) (u(t−1))φ̂(t−1)
else if t ≠ T − 1 then
T
φ̂(t+1) ← (K(t−1,t)) diag (u(t−1))φ̂(t−1)
end if
end for
ks ← ks + 1
end while
Generate φ
return u, φ̂, φ
37
3. The entropic proximal gradient method for solving generalized optimal
transport problems
Algorithm 4 Entropic proximal gradient method for the graph-structured multi-
marginal optimal transport problem
Function Generate ∇F (Mko)
∇g(Pt(Mko))← α∇g1(P (Mko)) + (1 − α)∇g2(P (Mkot t ))
∇F (Mko)← ∇g(P kot(M )) +∇f(Pt,t+1(Mko))
return ∇F (Mko)
Function Generate C̃(t,t+1),ko
Generate ∇F (Mko)
u1 ← u(t) ⋅ 1T
T
u2 ← 1 ⋅ u(t)
u← αu1 + (1 − α)u2
C̃(t,t+1),ko ← C(t,t+1) +∇F (Mko) − δ(log(K(t,t+1),ko−1) − log(u))
return C̃(t,t+1),ko
Initial guess K0 = 1, P kot,t+1(M ) = 1;
while M not converged do
for t in range(T ) do
Generate C̃(t,t+1),ko
( +1) exp C̃(t,t+1),kK t,t ,k oo ← ( − )η
end for
u, φ̂, φ← Algorithm 3
for t in range(T − 1) do
Pt,t+1(Mko)← diag(φ̂(t))diag(u(t))K(t,t+1),ko diag(u(t+1))diag(φ(t+1))
end for
ko ← ko + 1
end while
return P (Mkot )
38
4
Implementation and numerical
results
In this chapter, we present the implementation of Algorithms 3 and 4, choice of
parameter value, and generated numerical results. While the theory in this the-
sis is developed for general trees, our implementation will be on path graphs, for
simplicity.
4.1 Implementation
The implementation of the entropic proximal gradient method has been done in
Python. We continue by defining a generalized multi-marginal optimal transport
problem by defining each parameter, such as the constant cost tensor C, functions
g and f .
4.1.1 Defining a generalized multi-marginal optimal trans-
port problem and its components
We define a generalized multi-marginal optimal transport problem as in (3.1), but
on a path graph. Let G = (V ,E) be a path graph with T = ∣V ∣ marginals of size
N , i.e., the marginals are 1-dimensional. In order to rewrite this problem to the
form (3.2), we need to define all functions and constraints. We begin by defining an
underlying constant cost tensor C with the structure (2.24). We take the interval
[0,1], and discretize it into N equidistant points, and define C as the tensor of all
distances between points, i.e., the value of entry Cij is given by j∣ i − 2N N ∣ . We
choose a variety of values for the regularization parameter ϵ and the step size δ for
the entropic gradient method.
The marginals associated with the problem are 1-dimensional, i.e., vectors. Their
values are generated from a normal distribution, meaning the highest concentration
of mass is at the center of the marginals. The start and end marginals are restricted
by equality constraints while the marginals in between are restricted by inequality
constraints. In this setting, we choose to set each value of µt, t ∈ VI , to be less or
equal to the total sum of µt, t ∈ VE.
As mentioned in Section 3.4.1, g and f must be chosen carefully so that Assump-
tion 1 is not violated. We opt for the ℓ2-norm as g(t) and the Frobenius norm for
f (t1,t2). Both these norms treat elements in vectors/matrices the same, as exempli-
fied by
39
4. Implementation and numerical results
∣∣x∣∣2ℓ = x
2
1 + x
2
2 +⋯ + x
2
2 N
,
N N
∣∣A∣∣2F =∑∑A
2
ij.
i=1 j=1
Moreover, all p-norms (1 < p < ∞) are smooth, including the Frobenius norm.
They are proper, convex, and continuous, thereby ensuring lower semi-continuity
as well.
The domain of G is obviously included in the interior of the domain of E, since
the domain of E is RNT , where the interior is the set itself. Consequently, the
relation dom(G) ⊆ int(dom(E)) holds true.
Now, we explicitly present what g and f are. As previously stated, we implement
the ℓ2-norm for g, but we also introduce a parameter νt. This represents a shift in
where the mass is centered. Without νt, the µt-term would only smooth the mass
around the center, which is not of interest. We can define νt in various ways. For
example, we can define it as a vector where the first half is zeros, and the second
half is ones, imposing a heavy penalty for the mass on the zero side. We get
g(t)(P (Mk 1o)) = ∥P (Mkot 2 t ) − ν
2
t∥ℓ .2
We need the gradient which is
∇g(t)(Pt(Mko)) = Pt(Mko) − νt.
For f , we implement the Frobenius norm,
f (t1,t2)(P (Mko)) = ∥P (Mko)∥2t1,t2 t1,t2 F ,
with the gradient
∇f (t1,t2)(Pt1,t2(Mko)) = 2P kot1,t2(M ).
With these functions and constraints, we can write our problem as in (3.2):
Mko+1 = argmin ⟨C̃ko ,M⟩ + ηD(M),
M∈R+N
T
s.t. (4.1)Pt(M) = µt, for t ∈ VE,
Pt(M) ≤ µt, for t ∈ VI .
We select the stopping criterion for the inner loop as the iteration at which the
absolute square of the difference between u(t),ks+1 and u(t),ks is less than a predeter-
mined tolerance level. For the outer level, we implement a surrogate to Theorem
3.4.2 due to time limitations. Instead of comparing Mko+1 and Mko , we compare
P (Mko+1t ) and P (Mkot ). This approach does not yield the same accuracy as the
theorem, but it provides a sufficiently good approximation.
As mentioned in Section 3.3, ∇g and log(u(t),ks) are both vectors, differing in
dimension from the other functions. How this can be handled is shown in the next
section.
40
4. Implementation and numerical results
4.1.2 Transformation from nodes to edges
The problem with ∇g and log(u(t),ks) is that both are vectors, differing in dimension
from C. However, referring back to Remark 2, we know that this is not an issue.
By multiplying the g-function with a vector of ones and the correct constants, we
obtain a matrix with the desired proportions. The same modifications can be applied
analogously to log(u(t),ks). In the following example, we demonstrate this concept,
focusing on the g-function.
4.1.2.1 Example of cost-transformation
For simplicity, assume that we have a problem that only depend on C and ∇g(t),
ignoring the other functions present in C̃ko , (4.1). Let the cost matrix C(t,t+1) be
defined by
⎡
⎢0 1 2 3⎤⎥
⎢ ⎥
Ct,t+1
⎢1 0 1 2⎥
= ⎢
⎢
⎢2 1 0 1
⎥ ,
⎥
⎥
⎢
⎣3 2 1 0⎥⎦
and ∇g(t) and ∇g(t+1) as
⎡
⎢1⎤ ⎡ ⎤⎥ ⎢5⎥
⎢ ⎥ ⎢ ⎥
(t) ⎢2⎥ ⎢6⎥∇g = ⎢ ⎥ , and ∇g(t+1)3 = ⎢ ⎥ .⎢ ⎥ ⎢⎢ ⎥ ⎢7⎥⎥
⎢
⎣4⎥ ⎢8⎥⎦ ⎣ ⎦
In this context, ∇g(t) and ∇g(t+1) represent additional costs associated with a
pair of marginals. Specifically, ∇g(t) indicates that it costs 1 additional cost-unit to
transport mass from position 1 to any other position, including remaining in place.
This principle applies uniformly across all positions; for example, moving mass from
position 4 to any other position incurs an extra cost of 4. Similarly, ∇g(t+1) applies
to marginal t+1, meaning that transporting mass to position 1 incurs an additional
cost of 5.
To facilitate computation, we need to transform ∇g(t) such that every row of
C(t,t+1) is incremented by ∇g(t). First off, by multiplying ∇g(t) by the vector 1T will
result in
⎡1⎤ ⎡⎢ ⎥ ⎢1 1 1 1⎤⎥
⎢ ⎥ ⎢ ⎥
(t)1T ⎢2⎥ 1 1 1 1 ⎢2 2 2 2⎥∇g = ⎢3⎥ [ ] = ⎢3 3 3 3⎥ .⎢ ⎥ ⎢ ⎥⎢ ⎥ ⎢ ⎥
⎢4⎥ ⎢4 4 4 4⎥⎣ ⎦ ⎣ ⎦
Similarly we get
⎡
⎢1⎤ ⎡⎥ ⎢5 6 7 8⎤⎥
⎢1⎥ ⎢1 (t+1) T ⎢ ⎥ 5 6 7 8 ⎢5 6 7 8
⎥
⎥
(∇g ) = ⎢1⎥ [ ] = ⎢⎢ ⎥ ⎢⎢ ⎥ ⎢5 6 7 8
⎥ .
⎥
⎥
⎢ ⎥ ⎢
⎣1⎦ ⎣5 6 7 8⎥⎦
41
4. Implementation and numerical results
Our calculations are now complete and the costs can be added to C(t,t+1) without
complications. Denote the final cost as (t,t+1)C , and we get that
(t,t+1)
C = C(t,t+1)
T
+ α∇g(t)1T + (1 − α)1(∇g(t+1))
⎡ ⎤ ⎡
⎢0 1 2 3⎥ ⎢1 1 1 1⎤ ⎡ ⎤⎥ ⎢5 6 7 8⎥
⎢1 0 1 2⎥ ⎢⎢ ⎥ ⎢2 2 2 2⎥ ⎢⎥ ⎢5 6 7 8⎥⎥
= ⎢2 1 0 1⎥ + α ⎢3 3 3 3⎥ + (1 − α) ⎢ ⎥⎢ ⎥ ⎢ ⎥ ⎢5 6 7 8⎥⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢
⎣3 2 1 0⎦ ⎣4 4 4 4⎥ ⎢ ⎥⎦ ⎣5 6 7 8⎦
⎡
⎢5 − 4α 7 − 5α 9 − 6α 11 − 7α⎤⎥
⎢
⎢6 − 3α 6 − 4α 8 − 5α 10 − 6α⎥⎥
= ⎢
⎢
⎢7 − 2α 7 − 3α 7 − 4α 9 5
⎥ .
− α ⎥⎥
⎢
⎣ 8 − α 8 − 2α 8 − 3α 8 − 4α ⎥⎦
The constant α ensures that calculations are not repeated. Specifically, the cost
∇g(t+1) has already been incorporated into the next cost matrix (t+1,t+2)C during the
previous iteration. This adjustment is unnecessary at the leaves, as they do not
receive the cost twice.
4.1.3 The log-sum-exp-method
It is known that optimal transport problems can experience numerical instability
for small values of ϵ. A way to make the algorithms more numerically stable is to
apply the log-sum-exp-method in order to do calculations in the log-domain with
addition and subtraction, instead of matrix/vector multiplication, see [Sch19] and
[Win19]. This means that we need to take the logarithm of our functions and
variables, and adjust initial values. As many as possible of our calculations is done
in the log-domain, and are then converted back to the real domain by applying the
exponential function. Our variables are straightforward to compute, for example
taking log(u(t)) is taking the logarithm of the elements in u(t).
However, our functions require some calculations. From Theorem 2.2.2 we have
that the marginal projections can be calculated like P (K⊙U) = u(t)⊙φ̂(t)t ⊙φ(t) and
the bi-marginal P +1(K⊙U) = diag(φ̂(t))diag(u(t))K(t,t+1)t,t diag(u(t+1))diag(φ(t+1)).
For simplicity, we do this for projections with the standardM from Section 2.2.4, i.e.,
without the iteration index – it also holds for Mko . We need to take the logarithm
of the projections,
log(Pt(K⊙U))) = log(u(t) ⊙ φ̂(t) ⊙ φ(t))
= log(u(t)) + log(φ̂(t)) + log(φ(t)),
and
log(Pt,t+1(K⊙U)) = log (diag(φ̂(t))diag(u(t))K(t,t+1) diag(u(t+1))diag(φ(t+1))).
Now we examine what it means to take the logarithm of diag(a)B, for a general
42
4. Implementation and numerical results
vector a and matrix B. We have that, by definition of the logarithm,
diag N(a)B = [aiBij] ⇒i,j=1
log(diag(a)B) = [ log N(aiBij)]i,j=1⇒ (4.2)
log N N[ (a T NiBij)]i,j=1 = [ log(ai)1 ] =1 + [log(Bij)] .i,j i,j=1
Analogously we get
[ log N N T N(Bijaj)] =1 = [ log(Bij)] =1 + [1( log(aj)) ] =1. (4.3)i,j i,j i,j
Repeating (4.2) and (4.3) for log(Pt,t+1(K⊙U)), we get
log(Pt,t+1(K⊙U)) = [ log (t)
N (t) N (t,t+1) N
(φ̂i )1T ] =1 + [ log (u
T
i )1 ] =1 + [ log (Ki,j i,j ij )]i,j=1
1 log (t+1) T N 1 log (t+1) T N+ [ ( (uj )) ] + [ ( (φ )) ]i,j=1 j i,j=1.
(4.4)
This is the same idea as shown in Example 4.1.2.1, which is an implementation
Remark 2. The calculations of log(φ̂(t)) and log(φ(t)) are a bit more complex. We
begin with log(φ(t)) and get
log(φ(t)) = log(K(t,t+1) diag(u(t+1))φ(t+1))
N
⎛
log exp Cij (t+1)
⎞
= [ ( − )uj ] φ
(t+1)
⎝ ϵ i,j=1 ⎠
N
⎛ T ⎞
= log exp Cij[ (− + 1( log (t+1)(uj )) )] φ(t+1)
⎝ ϵ i,j=1 ⎠
⎛ N T ⎞
= log ∑ exp
Cij
(− + 1( log (t+1) (t+1)(uj )) )φj
⎝ j=1 ϵ ⎠
⎛ N T
log Cij (t+1) (t+1)
T ⎞
= ∑ exp(− + 1( log (u )) + 1( log (φ )) ) .
⎝ j=1 ϵ
j j
⎠
Analogously we get that
⎛ N CT
log (t) log exp ij 1 log (t−1)
T ( Tt−1) ⎞
(φ̂ ) = ∑ (− + ( (uj )) + 1( log (φj )) ) .
⎝ j=1 ϵ ⎠
An important thing to note is that everything that was initialized as ones, should
now be initialized as zeros, since log(1) = 0. Furthermore, 1 in (3.13) is replaced by
0, i.e., a vector of zeros. The observant reader might have already noticed, but when
applying the log-sum-exp-method we no longer implement K. Instead we operate
directly on −Cϵ .
Next we present our numerical results, with and without the log-sum-exp-method.
43
4. Implementation and numerical results
4.2 Numerical results
In this section, we present several numerical examples to illustrate the functionality
of our algorithm. We define our problem to have T = 32 marginals, each of size
N = 26. As mentioned in Section 4.1.1, for marginal t = 1 and t = T equality
constraints hold, while inequality constraints holds for the remaining marginals.
Throughout the examples, we assume a discretized normal distribution for both the
initial, t = 1, and final, t = T , marginals. This setup implies that the mass is centered
at the marginals with varying means, while the standard deviation is fixed at σ = 4.
Specifically, the initial marginal has a mean of m1 = N/3, and the final marginal has
a mean ofm2 = N/3+T /3. Moreover, the parameter ν is employed to ensure that the
resulting distributions approximate normal distributions with the specified means.
In Figure 4.1 we have implemented the algorithm with the simple cost functions
g and f , defined in Section 4.1.1, with and without the log-sum-exp-method.
0.010 0.010
0.008 0.008
0.006 0.006
0.004 0.004
0.002 0.002
0.000 0.000
30 30
25 25
0 20 20
5 15
0 15
In 10 10
5
d 10
10
ex on 15 m 20 5
Index 
a on
15
rg  i m 20
5
na 25 0 al rginal 25 0
(a) Outer count: 178 (b) log-sum-exp. Outer count: 176
Figure 4.1: Reference picture with simple cost functions.
Both the regular method and log-sum-exp-method yield similar results. The reg-
ular method requires two more iterations than the log-sum-exp-method; otherwise,
their performance is very similar. Note that the outer count refers to the number of
outer iterations performed before the algorithm converges.
4.2.1 No penalty from ∇F (M) for marginal t > N/4 and t <
N/2
In this section we start to experiment with the penalties. For marginals where
t > N/4 and t < N/2 we chose to not let ∇F (M) have an impact on the cost. This
is achieved by setting g(t) and f (t,t+1) to zero at each node, or edge, associated with
node t, where N/4 < t < N/2. The result can be seen in Figure 4.2. As in Figure
4.1, we get very similar results between the two methods.
44
Marginal number t
Value of marginal t
Marginal number t
Value of marginal t
4. Implementation and numerical results
0.010 0.010
0.008 0.008
0.006 0.006
0.004 0.004
0.002 0.002
0.000 0.000
30 30
25 25
0 20 20
5 15
0 15
In 10
5 10
d 10 Ind 10ex on 15 5 ex 15 m 5argi 20
on m 20
nal 25 0
arginal 25 0
(a) Outer count: 178. (b) log-sum-exp. Outer count: 176.
Figure 4.2: No penalty from ∇F (M) for marginal t > N/4 and t < N/2.
ϵ = 0.1, δ = 0.1
We then continued with the same conditions but experimented with different
values for ϵ and δ. We began by decreasing ϵ as much as possible before the methods
diverged and then proceeded to decrease δ. This approach led to different optimal
values of the problem (4.1) for different values of ϵ and δ, highlighting that the
methods require different parameter settings to maintain stability and convergence.
0.010
0.010 0.008
0.008 0.006
0.006 0.004
0.004 0.002
0.002 0.000
0.000 3025
30
25 0
20
5 15
0 20 I
10 10
5 15
ndex on 15 5
Ind 10
10 margi 20n 0
ex on 15
a 25
 m 5
l
argi 20nal 25 0 (b) log-sum-exp. Outer count: 69.
(a) Outer count: 26. ϵ = 0.07, δ = 0.07 ϵ = 0.09, δ = 0.07
Figure 4.3: No penalty from ∇F (M) for marginal t > N/4 and t < N/2.
While observing Figure 4.3, we can see that the methods can handle similar values
of ϵ and δ, but the log-sum-exp method requires a larger ϵ to converge. Moreover,
the log-sum-exp method requires more outer iterations to achieve convergence.
4.2.2 No penalty from ∇F (M) if marginal t mod 2 = 1
Next we implement the condition that ∇F (M) does not yield a penalty if marginal
t mod 2 = 1. We begin with standard values ϵ = δ = 0.1, see Figure 4.4. Here, both
methods yield the same result, with the same number of outer iterations.
45
Ma Mrg ai rn ga inl n alu  nm ub me br et r t
Value of marginal t Value of marginal t
Ma Mrg ai rn ga inl n alu  nm ub me br et r t
Value of marginal t Value of marginal t
4. Implementation and numerical results
0.010 0.010
0.008 0.008
0.006 0.006
0.004 0.004
0.002 0.002
0.000 0.000
30 30
25 25
0 20 0 20
5 15 15
In 10 10
5
d 10
10
ex I no dn 15 e m 5 x a or 20 n
15
 m 20 5gina al 25 0 rginal 25 0
(a) Outer count: 255. (b) log-sum-exp. Outer count: 255.
Figure 4.4: No penalty from ∇F (M) for marginal t mod 2 = 1. ϵ = 0.1, δ = 0.1.
As before, we decreased ϵ as much as possible before the methods diverged and
then proceeded to decrease δ. We see that the log-sum-exp can handle slightly
smaller values for ϵ. Moreover, it manages smaller values of δ and requires more
outer iterations to converge.
0.010
0.010 0.008
0.008 0.006
0.006 0.004
0.004 0.002
0.002 0.000
0.000 3025
30 0 2025 5 15
0 20 I
10 10
5 15
ndex on 15
10 10
 marg 20
5
Ind ine 15 al 25
0
x on m 5argi 20nal 25 0 (b) log-sum-exp. Outer count: 44.
(a) Outer count: 29. ϵ = 0.09, δ = 0.1. ϵ = 0.08, δ = 0.03.
Figure 4.5: No penalty from ∇F (M) for marginal t mod 2 = 1.
4.2.3 No penalty from ∇F (M) if marginal t mod 2 = 1 and/or
if marginal t > N/4 and t < N/2
Let us now combine the previous two sections into one by implementing a condition
where there is no penalty from ∇F (M) if the marginal t mod 2 = 1 and/or if t > N/4
and t < N/2. As before, we begin with standard values ϵ = δ = 0.1. This yielded
Figure 4.6, where we observe that both algorithms produce the same result at the
same outer count.
46
Ma Mrg ai rn ga inl n alu  nm ub me br et r t
Value of marginal t Value of marginal t
Ma Mrg ai rn ga inl n alu  nm ub me br et r t
Value of marginal t Value of marginal t
4. Implementation and numerical results
0.010 0.010
0.008 0.008
0.006 0.006
0.004 0.004
0.002 0.002
0.000 0.000
30 30
25 25
0 2015 0
20
5 5 15
In 10 10d 10 10ex 15 5
In
o dn e m x on
15 5
arg  i 20 mnal 25 0
argi 20nal 25 0
(a) Outer count: 255. (b) log-sum-exp. Outer count: 255.
Figure 4.6: No penalty from ∇F (M) if marginal t mod 2 = 1 and/or if marginal
t > N/4 and t < N/2. ϵ = 0.1, δ = 0.1.
When we decrease ϵ and δ, see Figure 4.7, we observe that the log-sum-exp-
method can handle smaller values of δ while still converging. The outer count is
larger for the log-sum-exp-method, which is expected due to the smaller proximal
step size δ.
0.010 0.010
0.008 0.008
0.006 0.006
0.004 0.004
0.002 0.002
0.000 0.000
30 30
25 25
0 20 0 20
5 15 5 15
In 10 10d 10ex I
10
 no dn 15 5 e m x o
15 5
ar ngi 20nal 25 0
 margi 20nal 25 0
(a) Outer count: 29. ϵ = 0.07, δ = (b) log-sum-exp. Outer count: 94.
0.08. ϵ = 0.08, δ = 0.009.
Figure 4.7: No penalty from ∇F (M) if marginal t mod 2 = 1 and/or if marginal
t > N/4 and t < N/2.
47
Ma Mrg ai rn ga inl n alu  nm ub me br et r t
Value of marginal t Value of marginal t
Ma Mrg ai rn ga inl n alu  nm ub me br et r t
Value of marginal t Value of marginal t
4. Implementation and numerical results
48
5
Discussion and conclusions
The goal of this thesis was to develop a new method for solving generalized op-
timal transport problems of the form (2.28). We have successfully done this by
transforming the generalized problem into a sequence of standard optimal transport
problems, and presenting rigorous theory to support our method. The method has
been successfully implemented using Python, but it currently lacks numerical sta-
bility and requires further refinement to, for example, handle marginals of higher
dimensions effectively.
Given that the Sinkhorn iterations are known to be numerically unstable for small
values of ϵ, we also implemented the log-sum-exp-method. In our case, we did not
observe any significant differences between the cases of with or without the log-
sum-exp-method, and it did not seem to lessen the observed numerical instability.
However, in some instances, it allowed us to reduce the step size δ.
We have theoretically proven the assumptions under which our method converges
and identified a suitable termination criterion. However, predicting whether the al-
gorithm would converge in practical settings remains challenging. This difficulty
may stem from the assumption in Theorem 3.4.1 that a feasible transport matrixM
exists, which cannot always be confirmed due to the complexity of optimal trans-
port problems. Moreover, since we did not explicitly implement Theorem 3.4.2 and
instead used a surrogate, our method may not achieve the required accuracy. Ad-
ditionally, fine-tuning the variables, adding or removing marginals, or changing the
size of the marginals sometimes resulted in a solution – understanding why requires
further investigation.
Finally, while the effectiveness of the method in its current state is uncertain,
it has the potential to improve computational performance for generalized opti-
mal transport problems. Further investigation is needed to validate and refine the
method, ensuring its robustness and reliability in various practical applications.
5.1 Future work
The numerical instability requires further examination, particularly to determine
whether the log-sum-exp-method can effectively handle these types of problems, as
we did not observe any significant differences in our experiments. Exploring the
impact of marginal sizes and variables such as ϵ, δ and cost functions ∇g and ∇f
might provide more insight into the reasons behind this instability.
Besides addressing the numerical instability, there are numerous avenues for im-
provement and future work. One interesting direction would be to implement the
49
5. Discussion and conclusions
algorithm on general trees instead of limiting it to path graphs. This extension would
enable the algorithm to address real-world problems that are often more complex
than what path graphs can handle. Additionally, exploring more complex optimal
transport problems featuring multiple marginals and intricate cost functions could
yield valuable insights.
Another avenue for future research would be to compare our algorithm to the one
derived in [HRCK21], which also handles generalized optimal transport problems.
Although we have not conducted a comparison, we understand that the algorithm in
[HRCK21] is more abstract in nature. In contrast, our algorithm may offer greater
practical usability, though this remains to be validated through further experimen-
tation and comparison.
Overall, while our method shows promise, extensive testing and refinement are
needed to fully realize its potential and ensure its robustness in a variety of prac-
tical applications. Future research should focus on these aspects to enhance the
algorithm’s performance and applicability.
50
Bibliography
[AEP+16] N. Andréasson, A. Evgrafov, M. Patriksson, E. Gustavsson,
Z. Nedělková, K. C. Sou, and M. Önnheim. An Introduction to Con-
tinuous Optimization. Studentlitteratur AB, 2016.
[Bec17] A. Beck. First-order methods in optimization. MOS-SIAM series on
optimization. MOS-SIAM, 2017.
[EHJK20] F. Elvander, I. Haasler, A. Jakobsson, and J. Karlsson. Multi-marginal
optimal transport using partial information with applications in robust
localization and sensor fusion. Signal Processing, 171:107474, 2020.
[GY05] J.L. Gross and J. Yellen. Graph Theory and Its Applications. Chapman
and Hall/CRC, 2nd edition, 2005.
[HRCK21] I. Haasler, A. Ringh, Y. Chen, and J. Karlsson. Multimarginal opti-
mal transport with a tree-structured cost and the Schrödinger bridge
problem. SIAM Journal on Control and Optimization, 59(4):2428–2453,
2021.
[Lue69] D. G. Luenberger. Optimization by Vector Space Methods. John Wiley
& Sons, Inc, 1969.
[MT03] J. E. Marsden and A. Tromba. Vector Calculus. Macmillan, 2003.
[PC19] G. Peyré and M. Cuturi. Computational optimal transport: with appli-
cations to data science. Foundations and Trends® in Machine Learning,
11(5–6):355–607, 2019.
[RHCK21] A. Ringh, I. Haasler, Y. Chen, and J. Karlsson. Graph-structured tensor
optimization for nonlinear density control and mean field games. arXiv
preprint arXiv:2112.05645, 2021.
[Sch19] B. Schmitzer. Stabilized sparse scaling algorithms for entropy regu-
larized transport problems. SIAM Journal on Scientific Computing,
41(3):A1443–A1481, 2019.
[Sin67] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row
and column sums. The American Mathematical Monthly, 74(4):402–405,
1967.
[Teb92] M. Teboulle. Entropic proximal mappings with applications to nonlin-
ear programming. Mathematics of Operations Research, 17(3):670–690,
1992.
[Tho18] M. Thorpe. Introduction to optimal transport. F2.08, Centre for Math-
ematical Sciences, University of Cambridge, 2018.
[Win19] L. Winkler. Sinkhorn iterations, 2019. https://ludwigwinkler.
github.io/blog/Sinkhorn/ [Accessed: 2024-03-12].
51
DEPARTMENT OF MATHEMATICAL SCIENCES
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden
www.chalmers.se