Static and Dynamic Optimization (42111)
Building 303b, room 048, Section for Dynamical Systems
Dept. of Applied Mathematics and Computer Science The Technical University of Denmark
Email: nkpo@dtu.dk phone: +45 4525 3356 mobile: +45 9351 1161
2019-11-24 14:37
Lecture 12: Stochastic Dynamic Programming
Outline of lecture
Recap L11: Deterministic Dynamic Programming (D), Dynamic Programming (C)
Stochastics (random variables)
Stochastic Dynamic Programming
Booking profiles
Stochastic Bellman equation
Stochastic optimal stepping (SDD)

Reading guidance: DO p. 83-92.
Dynamic Programming (D)
Find a sequence of decisions u_i, i = 0, 1, ..., N-1, which takes the system

    x_{i+1} = f_i(x_i, u_i),    x_0 = x^0

along a trajectory, such that the cost function

    J = φ_N(x_N) + Σ_{i=0}^{N-1} L_i(x_i, u_i)

is minimized.
Dynamic Programming
The Bellman function (the optimal cost to go) is defined as:

    V_i(x_i) = min_{u_i^{N-1}} J_i(x_i, u_i^{N-1})

and is a function of the present state, x_i, and index, i. In particular

    V_N(x_N) = φ_N(x_N)
Theorem
The Bellman function V_i is given by the backwards recursion

    V_i(x_i) = min_{u_i} [ L_i(x_i, u_i) + V_{i+1}(x_{i+1}) ]

    x_{i+1} = f_i(x_i, u_i),    x_0 = x^0

with the boundary condition

    V_N(x_N) = φ_N(x_N)

The Bellman equation is a functional equation; it gives a sufficient condition, and V_0(x_0) = J*.
Dynamic programming
    u_i* = arg min_{u_i} [ L_i(x_i, u_i) + V_{i+1}(f_i(x_i, u_i)) ]

where f_i(x_i, u_i) = x_{i+1} and the bracketed expression is W_i(x_i, u_i).

If a maximization problem: min → max.
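The backward recursion translates directly into a small table-based program. Below is a minimal sketch in Python; the horizon, state grid, decisions, dynamics and costs are made-up illustration values (not from the lecture), and decisions that would leave the state grid are simply skipped:

```python
# Tabular backward DP: V_i(x) = min_u [ L_i(x,u) + V_{i+1}(f_i(x,u)) ], V_N(x) = phi(x).
# All problem data below are illustrative assumptions.

def backward_dp(N, states, decisions, f, L, phi):
    """Return the Bellman tables V[i][x] and the optimal policies u_opt[i][x]."""
    V = {N: {x: phi(x) for x in states}}
    u_opt = {}
    for i in range(N - 1, -1, -1):
        V[i], u_opt[i] = {}, {}
        for x in states:
            # W_i(x,u) = L_i(x,u) + V_{i+1}(f_i(x,u)); skip u if f_i(x,u) leaves the grid
            W = {u: L(x, u) + V[i + 1][f(x, u)]
                 for u in decisions if f(x, u) in states}
            u_best = min(W, key=W.get)
            V[i][x], u_opt[i][x] = W[u_best], u_best
    return V, u_opt

V, u_opt = backward_dp(
    N=3,
    states=[0, 1, 2],
    decisions=[-1, 0, 1],
    f=lambda x, u: x + u,            # dynamics x_{i+1} = x_i + u_i
    L=lambda x, u: x**2 + u**2,      # stage cost
    phi=lambda x: x**2,              # terminal cost
)
print(V[0], u_opt[0])
```

Running the sketch prints the optimal cost to go and the optimal feedback decision for every initial state.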
Type of solutions
[Figure: the Bellman function V_t(x) plotted against the state x and the time index i.]
Fish bone method (graphical method)
Schematic method (tables) -> programming
Analytical (e.g. separation of variables)

Analytical:
Guess the functional form of V_i(x), i.e. up to a number of parameters. Check whether it satisfies the Bellman equation. This results in a (number of) recursion(s) for the parameter(s).
Continuous Dynamic Programming
Find the input function u_t, t ∈ R (more precisely {u}_0^T), that takes the system

    ẋ = f_t(x_t, u_t),    x_0 = x^0,    t ∈ [0, T]        (1)

such that the cost function

    J = φ_T(x_T) + ∫_0^T L_t(x_t, u_t) dt                  (2)

is minimized. Define the truncated performance index (cost to go)

    J_t(x_t, {u}_t^T) = φ_T(x_T) + ∫_t^T L_s(x_s, u_s) ds

The Bellman function (optimal cost to go) is defined by

    V_t(x_t) = min_{{u}_t^T} J_t(x_t, {u}_t^T)

We have the following theorem, which states a sufficient condition.
Theorem
The Bellman function V_t(x_t) satisfies the equation

    -∂V_t(x_t)/∂t = min_{u_t} [ L_t(x_t, u_t) + (∂V_t(x_t)/∂x) f_t(x_t, u_t) ]        (3)

(the Hamilton-Jacobi-Bellman equation). This is a PDE with the boundary condition

    V_T(x_T) = φ_T(x_T)
Continuous Dynamic Programming
Proof.
In discrete time we have the Bellman equation

    V̄_i(x_i) = min_{u_i} [ L̄_i(x_i, u_i) + V̄_{i+1}(x_{i+1}) ]

with the boundary condition

    V̄_N(x_N) = φ_N(x_N)

Identify the index i with the time t and the index i + 1 with t + Δt. Then

    V_t(x_t) = min_{u_t} [ ∫_t^{t+Δt} L_t(x_t, u_t) dt + V_{t+Δt}(x_{t+Δt}) ]

Apply a Taylor expansion on V_{t+Δt}(x_{t+Δt}):

    V_t(x_t) = min_{u_t} [ L_t(x_t, u_t) Δt + V_t(x_t) + (∂V_t(x_t)/∂x) f_t Δt + (∂V_t(x_t)/∂t) Δt + o(|Δt|) ]
Continuous Dynamic Programming
Proof.
    V_t(x_t) = min_{u_t} [ L_t(x_t, u_t) Δt + V_t(x_t) + (∂V_t(x_t)/∂x) f_t Δt + (∂V_t(x_t)/∂t) Δt + o(|Δt|) ]

(just a copy)

Collect the terms which do not depend on the decision (u_t):

    V_t(x_t) = V_t(x_t) + (∂V_t(x_t)/∂t) Δt + min_{u_t} [ L_t(x_t, u_t) Δt + (∂V_t(x_t)/∂x) f_t(x_t, u_t) Δt ] + o(|Δt|)

In the limit Δt → 0 (and after dividing by Δt):

    -∂V_t(x_t)/∂t = min_{u_t} [ L_t(x_t, u_t) + (∂V_t(x_t)/∂x) f_t(x_t, u_t) ]
The HJB equation:

    -∂V_t(x_t)/∂t = min_{u_t} [ L_t(x_t, u_t) + (∂V_t(x_t)/∂x) f_t(x_t, u_t) ]

(just a copy)

The Hamiltonian function

    H_t(x_t, u_t, λ_t^T) = L_t(x_t, u_t) + λ_t^T f_t(x_t, u_t)

The HJB equation can also be formulated as

    -∂V_t(x_t)/∂t = min_{u_t} H_t(x_t, u_t, ∂V_t(x_t)/∂x)

Link to Pontryagin's maximum principle: with λ_t^T = ∂V_t(x_t)/∂x,

    ẋ_t = f_t(x_t, u_t)               State equation
    -λ̇_t^T = ∂H_t/∂x_t                Costate equation
    u_t* = arg min_{u_t} H_t           Optimality condition
Motion control
Consider the system
    ẋ_t = u_t,    x_0 = x^0

and the performance index

    J = (1/2) p x_T² + ∫_0^T (1/2) u_t² dt

The HJB equation, (3), gives:

    -∂V_t(x_t)/∂t = min_{u_t} [ (1/2) u_t² + (∂V_t(x_t)/∂x) u_t ],    V_T(x_T) = (1/2) p x_T²

The minimization can be carried out and gives the solution w.r.t. u_t:

    u_t = -∂V_t(x_t)/∂x

So if the Bellman function is known, the control action (the decision) can be determined from this.
If the result above is inserted in the HJB equation we get

    -∂V_t(x_t)/∂t = (1/2) (∂V_t(x_t)/∂x)² - (∂V_t(x_t)/∂x)² = -(1/2) (∂V_t(x_t)/∂x)²

which is a partial differential equation with the boundary condition

    V_T(x_T) = (1/2) p x_T²

PDE:

    -∂V_t(x_t)/∂t = -(1/2) (∂V_t(x_t)/∂x)²

(just a copy)
Inspired by the boundary condition we guess a candidate function of the type

    V_t(x) = (1/2) s_t x²

where the time dependence is in the function s_t. Since

    ∂V/∂x = s_t x,    ∂V/∂t = (1/2) ṡ_t x²

the following equation

    -(1/2) ṡ_t x² = -(1/2) (s_t x)²

must be valid for any x, i.e. we can find s_t by solving the ODE

    ṡ_t = s_t²,    s_T = p

backwards. This is actually (a simple version of) the continuous-time Riccati equation. The solution can be found analytically or by means of numerical methods. Knowing the function s_t, we can find the control input

    u_t = -∂V_t(x_t)/∂x = -s_t x_t
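For this scalar Riccati equation a closed form exists: since d(1/s_t)/dt = -ṡ_t/s_t² = -1 and 1/s_T = 1/p, we get s_t = p/(1 + p(T - t)). The numerical route can be sketched with an explicit Euler sweep backwards from the terminal condition (T, p and the step count are arbitrary illustration values):

```python
# Solve s_dot = s^2 with s_T = p backwards in time by explicit Euler,
# and compare with the closed form s_t = p / (1 + p*(T - t)) at t = 0.
T, p, n = 1.0, 2.0, 100_000          # illustrative horizon, weight and step count
dt = T / n

s = p                                # terminal condition s_T = p
for _ in range(n):                   # step backwards: s(t - dt) ~ s(t) - dt * s(t)^2
    s -= dt * s * s

s0_exact = p / (1 + p * T)           # closed-form value at t = 0
print(s, s0_exact)                   # the two values agree to a few decimals
```

The optimal feedback is then u_t = -s_t x_t, with s_t evaluated along the way.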
Stochastic Dynamic Programming
The Bank loan
Deterministic:

    x_{i+1} = (1 + r) x_i - u_i,    x_0 = x^0

Stochastic:

    x_{i+1} = (1 + r_i) x_i - u_i,    x_0 = x^0

[Figures: a realization of the rate of interest (%) over time (months), and two realizations of the bank balance over time (years).]
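The two balance recursions can be simulated side by side. A sketch with made-up numbers (the initial balance, the payment and the interest-rate distribution are illustrative assumptions, not the data behind the plots):

```python
import random

# Bank-loan dynamics: x_{i+1} = (1 + r_i) x_i - u_i, with fixed r (deterministic)
# or randomly varying r_i (stochastic). All numbers are illustrative.

def simulate(x0, u, rates):
    x, path = x0, [x0]
    for r in rates:
        x = (1 + r) * x - u          # one period: interest accrues, then payment u
        path.append(x)
    return path

random.seed(1)
n = 120                              # months
det = simulate(5e4, 600.0, [0.005] * n)
sto = simulate(5e4, 600.0, [random.gauss(0.005, 0.002) for _ in range(n)])
print(det[-1], sto[-1])              # final balances: deterministic vs one random path
```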
Discrete Random Variable
    X ∈ {x^1, x^2, ..., x^m},    x^k ∈ R^n

    p_k = P{X = x^k} ≥ 0,    Σ_{k=1}^m p_k = 1

[Figure: an example of a probability mass function.]

    E{X} = Σ_{k=1}^m p_k x^k

    E{g(X)} = Σ_{k=1}^m p_k g(x^k)
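The two expectation formulas are one line of code each. A sketch with an arbitrary three-point distribution:

```python
# E{X} = sum_k p_k x_k and E{g(X)} = sum_k p_k g(x_k) for a discrete random variable.
# The probabilities and outcomes below are arbitrary illustration values.

def expectation(p, x, g=lambda v: v):
    assert all(pk >= 0 for pk in p) and abs(sum(p) - 1.0) < 1e-9
    return sum(pk * g(xk) for pk, xk in zip(p, x))

p = [0.2, 0.5, 0.3]
x = [1.0, 2.0, 3.0]
print(expectation(p, x))                   # E{X}
print(expectation(p, x, lambda v: v * v))  # E{X^2}
```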
Stochastic Dynamic Programming
Consider the problem of minimizing (in some sense):
    J = φ_N(x_N, e_N) + Σ_{i=0}^{N-1} L_i(x_i, u_i, e_i)

subject to

    x_{i+1} = f_i(x_i, u_i, e_i),    x_0 = x^0

and the constraints

    (x_i, u_i, e_i) ∈ V_i,    (x_N, e_N) ∈ V_N

e_i might be vectors reflecting model errors or direct stochastic effects.
Ranking performance Indexes
When e_i (and hence J) are stochastic variables, what do we mean by saying that one strategy is better than another?

In a deterministic situation we mean that

    J_1 > J_2

(J_1 (J_2) being the objective function for strategy 1 (2)). In a stochastic situation we can choose the definition

    E{J_1} > E{J_2}

but others do exist. This choice reflects some kind of average consideration.
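Under the average criterion, two strategies can be compared by Monte Carlo estimation of E{J_1} and E{J_2}. A sketch on a made-up scalar system (the dynamics, the noise, the horizon and both strategies are illustrative assumptions):

```python
import random

# Compare two feedback strategies by their estimated expected cost
# E{J} with J = x_N^2 + sum_i (x_i^2 + u_i^2), x_{i+1} = x_i + u_i + e_i.

random.seed(0)

def cost(policy, n_steps=10):
    """One realization of J under the given feedback policy."""
    x, J = 2.0, 0.0
    for _ in range(n_steps):
        u = policy(x)
        J += x * x + u * u
        x = x + u + random.choice([-0.5, 0.5])   # illustrative noise e_i
    return J + x * x                              # add terminal cost x_N^2

def mean_cost(policy, n_mc=20_000):
    return sum(cost(policy) for _ in range(n_mc)) / n_mc

J1 = mean_cost(lambda x: 0.0)        # strategy 1: do nothing
J2 = mean_cost(lambda x: -0.5 * x)   # strategy 2: proportional feedback
print(J1, J2)                        # strategy 2 wins under the average criterion
```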
Example: Booking profiles
Normally a plane is overbooked, i.e. more tickets are sold than the number of seats, x̄_N. Let x_i be the number of sold tickets at the beginning of day i, i = 0, 1, 2, ..., N.

If x_N < x̄_N we have empty seats - money out the window.
If x_N > x̄_N we have to pay compensation - also money out the window.

So we want to find a strategy that minimizes:

    E{ φ(x_N - x̄_N) }

Let w_i be the requests for a ticket on day i (with probability P{w_i = k} = p_k), and let v_i be the number of cancellations on day i (with probability P{v_i = k} = q_k).

Dynamics:

    x_{i+1} = x_i + min(u_i, w_i) - v_i,    e_i = (w_i, v_i)

Decision information: u_i(x_i).
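The booking dynamics can be simulated for a given decision rule. A sketch where the capacity, the horizon, the request/cancellation distributions and the booking-limit policy are all illustrative assumptions:

```python
import random

# Booking dynamics x_{i+1} = x_i + min(u_i, w_i) - v_i with a simple
# booking-limit rule u_i(x_i). All distributions and numbers are made up.

random.seed(7)
seats, N = 150, 30                   # seat capacity and days on sale

def simulate(margin):
    """Sell up to capacity + overbooking margin; return tickets sold at departure."""
    x = 0
    for _ in range(N):
        w = random.randint(0, 12)                # ticket requests today
        v = random.randint(0, min(2, x))         # cancellations today (at most x)
        u = max(0, seats + margin - x)           # booking limit left today
        x = x + min(u, w) - v
    return x

print(simulate(0), simulate(10))     # no overbooking vs a 10-seat margin
```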
Stochastic Bellman Equation
Consider the problem of minimizing:
    J = E{ φ_N(x_N, e_N) + Σ_{i=0}^{N-1} L_i(x_i, u_i, e_i) }

subject to

    x_{i+1} = f_i(x_i, u_i, e_i),    x_0 = x^0

and the constraints

    (x_i, u_i, e_i) ∈ V_i,    (x_N, e_N) ∈ V_N
Theorem
The Bellman function (optimal cost to go), V_i(x_i), is given by the (backward) recursion:

    V_i(x_i) = min_{u_i} E{ L_i(x_i, u_i, e_i) + V_{i+1}(x_{i+1}) }

    x_{i+1} = f_i(x_i, u_i, e_i),    V_N(x_N) = E{ φ_N(x_N, e_N) }

where the optimization is subject to the constraints and the available information.
Discrete (SDD) case
If e_i is discrete, i.e.

    e_i ∈ {e_i^1, e_i^2, ..., e_i^m},    p_i^k = P{e_i = e_i^k},    k = 1, 2, ..., m

then the stochastic Bellman equation can be expressed as

    V_i(x_i) = min_{u_i} Σ_{k=1}^m p_i^k [ L_i(x_i, u_i, e_i^k) + V_{i+1}(f_i(x_i, u_i, e_i^k)) ]

where f_i(x_i, u_i, e_i^k) = x_{i+1} and the sum over k is W_i(x_i, u_i), with the boundary condition

    V_N(x_N) = Σ_{k=1}^m p_N^k φ_N(x_N, e_N^k)

The entries in the scheme below are now expected values (i.e. weighted sums):

    W_i          u_i
    x_i       0    1    2    3      V_i(x_i)   u_i*(x_i)
     0
     1
     2
     3
     4
Optimal stochastic stepping (SDD)
Consider the system
    x_{i+1} = x_i + u_i + e_i,    x_0 = 2

where

    e_i ∈ {-1, 0, 1},    u_i ∈ {-1, 0, 1}*,    x_i ∈ {-2, -1, 0, 1, 2}

and the state-dependent noise probabilities p_i^k are

    x_i \ e_i    -1    0    1
    -2            0   1/2  1/2
    -1            0   1/2  1/2
     0           1/2   0   1/2
     1           1/2  1/2   0
     2           1/2  1/2   0

The cost function is

    J = E{ x_4² + Σ_{i=0}^3 (x_i² + u_i²) }

Notice: the cost contains no direct stochastic components.
Optimal stochastic stepping (SDD)
Firstly, from

    J = E{ x_4² + Σ_{i=0}^3 (x_i² + u_i²) }

(no stochastics in the cost itself) we establish V_4(x_4) = x_4². We are assuming perfect state information.

    x_4   V_4
    -2     4
    -1     1
     0     0
     1     1
     2     4
Optimal stochastic stepping (SDD)
Then we establish the W_3(x_3, u_3) function (the cost to go):

    W_3(x_3, u_3) = Σ_{k=1}^m p_3^k [ L_3(x_3, u_3, e_3^k) + V_4(f_3(x_3, u_3, e_3^k)) ]

Writing out the terms, with L_3(x_3, u_3, e_3^k) = x_3² + u_3² and f_3(x_3, u_3, e_3^k) = x_3 + u_3 + e_3^k:

    W_3(x_3, u_3) = p_3^1 [ x_3² + u_3² + V_4(x_3 + u_3 + e_3^1) ]
                  + p_3^2 [ x_3² + u_3² + V_4(x_3 + u_3 + e_3^2) ]
                  + p_3^3 [ x_3² + u_3² + V_4(x_3 + u_3 + e_3^3) ]

or more compactly:

    W_3(x_3, u_3) = x_3² + u_3² + p_3^1 V_4(x_3 + u_3 + e_3^1) + p_3^2 V_4(x_3 + u_3 + e_3^2) + p_3^3 V_4(x_3 + u_3 + e_3^3)
Optimal stochastic stepping (SDD)
    W_3(x_3, u_3) = Σ_{k=1}^3 p_3^k [ x_3² + u_3² + V_4(x_3 + u_3 + e_3^k) ]

For example, for x_3 = 0 and u_3 = -1 (the noise values e_3^k = -1, 0, 1 have probabilities 1/2, 0, 1/2):

    W_3(0, -1) = 1/2 [ 0² + (-1)² + V_4(0 - 1 - 1) ]
               + 0 [ 0² + (-1)² + V_4(0 - 1 + 0) ]
               + 1/2 [ 0² + (-1)² + V_4(0 - 1 + 1) ]
               = 1/2 (1 + 4) + 0 + 1/2 (1 + 0) = 3

    W_3          u_3
    x_3      -1     0     1
    -2        ∞    6.5   5.5
    -1       4.5   1.5   2.5
     0        3     1     3
     1       2.5   1.5   4.5
     2       5.5   6.5    ∞

    x_4   V_4
    -2     4
    -1     1
     0     0
     1     1
     2     4
(just for reference)
Optimal stochastic stepping (SDD)
    W_3          u_3                  V_3(x_3)   u_3*(x_3)
    x_3      -1     0     1
    -2        ∞    6.5   5.5           5.5          1
    -1       4.5   1.5   2.5           1.5          0
     0        3     1     3             1           0
     1       2.5   1.5   4.5           1.5          0
     2       5.5   6.5    ∞            5.5         -1

    W_2          u_2                  V_2(x_2)   u_2*(x_2)
    x_2      -1     0     1
    -2        ∞    7.5   6.25          6.25         1
    -1       5.5   2.25  3.25          2.25         0
     0       4.25  1.5   4.25          1.5          0
     1       3.25  2.25  5.5           2.25         0
     2       6.25  7.5    ∞            6.25        -1
Optimal stochastic stepping (SDD)
    W_1          u_1                  V_1(x_1)   u_1*(x_1)
    x_1      -1     0     1
    -2        ∞    8.25  6.88          6.88         1
    -1       6.25  2.88  3.88          2.88         0
     0       4.88  2.25  4.88          2.25         0
     1       3.88  2.88  6.25          2.88         0
     2       6.88  8.25   ∞            6.88        -1

    W_0          u_0                  V_0(x_0)   u_0*(x_0)
    x_0      -1     0     1
     2       7.56  8.88   ∞            7.56        -1
Trace back: u_i*(x_i). A feedback solution, not a time function.

Deterministic setting (x_{i+1} = x_i + u_i, i = 0, ..., 3):

    i      0   1   2   3
    u_i*  -1   0   0   0

Stochastic setting (x_{i+1} = x_i + u_i + e_i, i = 0, ..., 3):

    x_0  u_0*   x_1  u_1*   x_2  u_2*   x_3  u_3*
                -2    1     -2    1     -2    1
                -1    0     -1    0     -1    0
                 0    0      0    0      0    0
                 1    0      1    0      1    0
     2   -1      2   -1      2   -1      2   -1
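The tables above can be reproduced mechanically. A sketch of the backward recursion for this stepping example in plain Python; decisions whose reachable next states leave the grid are excluded, which corresponds to the ∞ entries:

```python
# Stochastic backward recursion for x_{i+1} = x_i + u_i + e_i with state-dependent
# noise probabilities, cost J = E{ x_4^2 + sum_i (x_i^2 + u_i^2) }.
states = [-2, -1, 0, 1, 2]
decisions = [-1, 0, 1]
p = {                                 # p[x][e] from the probability table above
    -2: {-1: 0.0, 0: 0.5, 1: 0.5},
    -1: {-1: 0.0, 0: 0.5, 1: 0.5},
     0: {-1: 0.5, 0: 0.0, 1: 0.5},
     1: {-1: 0.5, 0: 0.5, 1: 0.0},
     2: {-1: 0.5, 0: 0.5, 1: 0.0},
}

V = {4: {x: x * x for x in states}}   # boundary condition V_4(x) = x^2
u_opt = {}
for i in range(3, -1, -1):
    V[i], u_opt[i] = {}, {}
    for x in states:
        W = {}
        for u in decisions:
            nxt = [(x + u + e, pr) for e, pr in p[x].items() if pr > 0]
            if all(xn in states for xn, _ in nxt):        # feasible decision
                W[u] = x * x + u * u + sum(pr * V[i + 1][xn] for xn, pr in nxt)
        u_best = min(W, key=W.get)
        V[i][x], u_opt[i][x] = W[u_best], u_best

print(V[3])                           # cost-to-go table V_3
print(u_opt[0][2], V[0][2])           # decision and cost at the start state x_0 = 2
```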
Concluding remarks
Discrete state and decision space.
Approximation: grid covering the state and decision space.
Curse of dimensionality - combinatoric explosion.