**Markov decision processes**

Bo Friis Nielsen^{1}

1DTU Informatics

02407 Stochastic Processes 9, November 1 2022

**Bo Friis Nielsen** **Markov decision processes**

**Markov decision processes**

Today:

- Markov decision processes

Next week:

- Brownian motion


**Renewal reward processes**

Claims $Y_i$, $i \in \mathbb{N}$, are generated according to a renewal process $\{N(t); t \geq 0\}$. The accumulated claim $Z(t)$ up to time $t$ is

$$Z(t) = \sum_{i=1}^{N(t)} Y_i$$

- $N(t)$: number of claims up to time $t$
- $Y_i$: size of claim $i$
- $Z(t)$: accumulated claim up to time $t$

In general hard to analyse, but (with the $Y_i$ i.i.d. and independent of $N(t)$)

$$\mathrm{E}\left[e^{-\theta Z(t)}\right] = \mathrm{E}\left[\mathrm{E}\left[e^{-\theta Z(t)} \,\middle|\, N(t)\right]\right] = \mathrm{E}\left[\mathrm{E}\left[e^{-\theta \sum_{i=1}^{N(t)} Y_i} \,\middle|\, N(t)\right]\right]$$

$$= \mathrm{E}\left[\mathrm{E}\left[\prod_{i=1}^{N(t)} e^{-\theta Y_i} \,\middle|\, N(t)\right]\right] = \mathrm{E}\left[\prod_{i=1}^{N(t)} \mathrm{E}\left[e^{-\theta Y_i}\right]\right] = \mathrm{E}\left[\mathrm{E}\left[e^{-\theta Y_1}\right]^{N(t)}\right] = \varphi_{N(t)}\big(\varphi_{Y_i}(\theta)\big)$$

where $\varphi_{N(t)}$ is the probability generating function of $N(t)$ and $\varphi_{Y_i}$ the Laplace-Stieltjes transform of the claim size distribution.

Explicit solution if $N(t)$ is a phase-type renewal process and the $Y_i$ are phase-type distributed.

In a general renewal reward model of pairs $(X_i, Y_i)$, the $X_i$ and $Y_i$ need not be independent.
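As a sanity check of the transform identity $\mathrm{E}[e^{-\theta Z(t)}] = \varphi_{N(t)}(\varphi_Y(\theta))$, here is a small Monte Carlo sketch for the compound Poisson special case (Poisson number of claims, exponential claim sizes); the function names and parameter values are illustrative, not from the lecture.

```python
import math
import random

def lst_closed_form(theta, mu, claim_rate):
    """phi_N(phi_Y(theta)): Poisson(mu) PGF composed with the Exp(claim_rate) LST."""
    phi_y = claim_rate / (claim_rate + theta)  # LST of an Exp(claim_rate) claim
    return math.exp(mu * (phi_y - 1.0))        # Poisson PGF evaluated at phi_y

def poisson_sample(rng, mu):
    """Knuth's method for a Poisson(mu) draw using only the stdlib."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def lst_monte_carlo(theta, mu, claim_rate, n_paths=200_000, seed=1):
    """Estimate E[exp(-theta * Z)] with Z a sum of N ~ Poisson(mu) Exp claims."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        n = poisson_sample(rng, mu)
        z = sum(rng.expovariate(claim_rate) for _ in range(n))
        total += math.exp(-theta * z)
    return total / n_paths
```

The two agree to Monte Carlo accuracy: `lst_monte_carlo(0.5, 2.0, 1.0)` lands within about 0.01 of `lst_closed_form(0.5, 2.0, 1.0)`.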

**Markov chain**

If we think of a phase type renewal process in terms of the underlying Markov jump process, then rewards are associated with transitions in a Markov process.

If we further assume that the states have some physical meaning, then we could have rewards associated with other transitions (claims) or sojourns (premiums).


**Markov chain**

A Markov chain $\{X_t; t \in \mathbb{N} \cup \{0\}\}$ in discrete time is characterised by its transition probability matrix/matrices $\mathbf{P}$, $\mathbf{P}_t$ and the initial probability vector $\mathbf{p}_0$.

We could have rewards associated with transitions in the Markov chain.

Suppose we had access to/control over the transition mechanism, such that we had $\mathbf{P}_t(a)$, where the argument $a$ is some action we can take.

See Chapter 1 of Puterman for examples.

Tamping as an example.
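To make the controlled-transition idea concrete, here is a toy sketch of simulating a chain whose transition matrix depends on the chosen action; the states, action names, and probabilities are made up for illustration.

```python
import random

# Two-state chain; the transition matrix P(a) depends on the chosen action a.
# All numbers are illustrative.
P = {
    "repair":   [[0.9, 0.1],   # rows: current state, columns: next state
                 [0.7, 0.3]],
    "continue": [[0.6, 0.4],
                 [0.1, 0.9]],
}

def step(state, action, rng):
    """Draw the next state from row `state` of P(action)."""
    u, acc = rng.random(), 0.0
    for j, pj in enumerate(P[action][state]):
        acc += pj
        if u < acc:
            return j
    return len(P[action][state]) - 1  # guard against rounding

rng = random.Random(0)
x, path = 0, [0]
for t in range(10):
    x = step(x, "repair" if x == 1 else "continue", rng)
    path.append(x)
```

The rule used in the loop ("repair when in the bad state, otherwise continue") is one fixed decision rule; the MDP question is which such rule maximises reward.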


**Markov Decision Process set up**

- Decision epochs: $T = \{1, 2, \ldots, N\}$, $T = \{1, 2, \ldots, \infty\}$, or $T = [0; \infty[$
- State space $S$, $S = \cup_t S_t$
- Action sets $A_{s,t}$, $A = \cup_{s \in S} A_{s,t}$
- Rewards $\mathbf{R}_{t, A_{s,t}}$
- Typically expected rewards $r_t(s,a) = \sum_{j \in S_t} r_t(s,a,j)\, p_t(j|s,a)$
- For finite horizon $r_N(s)$ - the scrap value
- Transition probabilities $\mathbf{P}_{t, A_{s,t}}$
- $\mathbf{P}_{A_s}$ could be a Markov transition kernel on a general space
- Even if time is discrete it could be randomised (exponential)

$\{T, S, A_s, p_t(j|s,a), r_t(s,a)\}$: a Markov decision process

**Process**

- $X_t$: state occupied at time $t$
- $Y_t$: action taken at time $t$
- $r_t(s,a)$: reward received at time $t$ when visiting state $s$ and taking action $a$
- $Z_t$: history process at time $t$

**Decision rules and policies**

- $d_t$: decision rule, the action taken at time $t$; $d_t : S_t \to A_{s,t}$
- $Z_t$: history, $Z_t = (s_1, a_1, s_2, a_2, \ldots, s_N)$
- $\Pi$: policy, the complete collection of decision rules, $\Pi = (d_1, d_2, \ldots)$


**Policy types/decision rules 2.1.4**

$K \in \{\mathrm{SD}, \mathrm{SR}, \mathrm{MD}, \mathrm{MR}, \mathrm{HD}, \mathrm{HR}\}$

- **SD**: stationary deterministic policy (sometimes *pure* policy)
- **SR**: stationary random policy
- **MD**: Markovian deterministic
- **MR**: Markovian random
- **HD**: history dependent deterministic
- **HR**: history dependent random

Relations:

$$\Pi^{SD} \subset \Pi^{SR} \subset \Pi^{MR} \subset \Pi^{HR}, \qquad \Pi^{SD} \subset \Pi^{MD} \subset \Pi^{MR} \subset \Pi^{HR}, \qquad \Pi^{SD} \subset \Pi^{MD} \subset \Pi^{HD} \subset \Pi^{HR}$$

Given a policy we have a stochastic process - typically a Markov reward process


**One step MDP**

Taking action $a'$ in state $s$ and then collecting the terminal value $v$ gives

$$r_1(s,a') + \mathrm{E}_s^{\Pi}\left[v(X_2)\right] = r_1(s,a') + \sum_{j \in S} p_1(j|s,a')\, v(j)$$

The best action $a^*$ attains

$$\max_{a' \in A_s} \left\{ r_1(s,a') + \sum_{j \in S} p_1(j|s,a')\, v(j) \right\} = r_1(s,a^*) + \sum_{j \in S} p_1(j|s,a^*)\, v(j)$$

where, with

$$\operatorname*{argmax}_{x \in X} g(x) = \left\{ x' \in X \mid \forall x \in X : g(x') \geq g(x) \right\},$$

$$a_s^* = \operatorname*{argmax}_{a' \in A_{s,t}} \left\{ r_1(s,a') + \sum_{j \in S} p_1(j|s,a')\, v(j) \right\}$$
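The one-step maximisation can be sketched in a few lines; the states, actions, rewards, and probabilities below are illustrative, not from the lecture.

```python
# r[s][a]: immediate reward, p[s][a][j]: transition probability p_1(j|s,a),
# v[j]: value of ending in state j. All numbers are made up.
r = {0: {"stay": 1.0, "go": 0.0}, 1: {"stay": 0.0, "go": 2.0}}
p = {0: {"stay": [1.0, 0.0], "go": [0.5, 0.5]},
     1: {"stay": [0.0, 1.0], "go": [0.9, 0.1]}}
v = [0.0, 10.0]

def one_step_value(s, a):
    """r_1(s,a) + sum_j p_1(j|s,a) v(j)."""
    return r[s][a] + sum(pj * vj for pj, vj in zip(p[s][a], v))

def best_action(s):
    """a*_s: an element of the argmax over the available actions."""
    return max(r[s], key=lambda a: one_step_value(s, a))
```

Here `best_action(0)` is `"go"` (one-step value 5.0 versus 1.0 for `"stay"`), while `best_action(1)` is `"stay"` (10.0 versus 3.0).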

**Expected total reward criterion 4.1.2**

**Definition**

$$v_N^{\pi}(s) = \mathrm{E}_s^{\pi}\left\{ \sum_{t=1}^{N-1} r_t(X_t, Y_t) + r_N(X_N) \right\} \qquad (1)$$

$$v_{N,\lambda}^{\pi}(s) = \mathrm{E}_s^{\pi}\left\{ \sum_{t=1}^{N-1} \lambda^{t-1} r_t(X_t, Y_t) + \lambda^{N-1} r_N(X_N) \right\} \qquad (2)$$
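Both criteria can be evaluated for a fixed Markovian deterministic policy by running the sums backwards; a minimal sketch on a made-up two-state problem (all names and numbers are illustrative):

```python
def policy_value(states, d, p, r, r_terminal, N, lam=1.0):
    """v_1(s) = E[ sum_{t=1}^{N-1} lam^(t-1) r(X_t, d(X_t)) + lam^(N-1) r_N(X_N) ]
    for a stationary decision rule d, via the recursion v_t = r + lam * P v_{t+1}."""
    v = dict(r_terminal)                       # v_N(s) = r_N(s)
    for _ in range(N - 1):
        v = {s: r[s][d[s]] + lam * sum(p[s][d[s]][j] * v[j] for j in states)
             for s in states}
    return v                                   # v_1

# Illustrative data: from state 0 action "b" moves to state 1, which absorbs.
states = [0, 1]
d = {0: "b", 1: "a"}
p = {0: {"b": {0: 0.0, 1: 1.0}}, 1: {"a": {0: 0.0, 1: 1.0}}}
r = {0: {"b": 1.0}, 1: {"a": 2.0}}
r_terminal = {0: 0.0, 1: 5.0}
```

With $N = 3$ and $\lambda = 1$ this gives $v_1(0) = 1 + 2 + 5 = 8$, and with $\lambda = 0.5$ it gives $1 + 0.5 \cdot 2 + 0.25 \cdot 5 = 3.25$, matching (2).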

**Optimal policies**

A policy $\pi^\star$ is optimal if

$$v_N^{\pi^\star}(s) \geq v_N^{\pi}(s), \quad s \in S,\ \pi \in \Pi^{HR},$$

and $\varepsilon$-optimal if

$$v_N^{\pi^\star}(s) + \varepsilon \geq v_N^{\pi}(s), \quad s \in S,\ \pi \in \Pi^{HR}.$$

The value of the problem is

$$v_N^{\star}(s) = \sup_{\pi \in \Pi^{HR}} v_N^{\pi}(s) \qquad \left( v_N^{\star}(s) = \max_{\pi \in \Pi^{HR}} v_N^{\pi}(s) \ \text{when the supremum is attained} \right)$$

Equivalently, $\pi^\star$ is optimal if $v_N^{\pi^\star}(s) = v_N^{\star}(s)$, $s \in S$, and $\varepsilon$-optimal if $v_N^{\pi^\star}(s) + \varepsilon > v_N^{\star}(s)$, $s \in S$.


**Optimal stopping**

**Decision epochs:** $T = \{1, 2, \ldots, N\}$, $N \leq \infty$

**States:** $S = S^0 \cup \{\Delta\}$

**Actions:**

$$A_s = \begin{cases} \{C, Q\} & s \in S^0 \\ \{C\} & s = \Delta \end{cases}$$

**Rewards:**

$$r_t(s,a) = \begin{cases} -f_t(s) & s \in S^0,\ a = C \\ g_t(s) & s \in S^0,\ a = Q \\ 0 & s = \Delta \end{cases} \qquad r_N(s) = h(s)$$

**Transition probabilities:**

$$p_t(j|s,a) = \begin{cases} p_t(j|s) & s \in S^0,\ j \in S^0,\ a = C \\ 1 & s \in S^0,\ j = \Delta,\ a = Q \\ 1 & s = j = \Delta,\ a = C \\ 0 & \text{otherwise} \end{cases}$$


**Secretary problem setup**

- $N$ candidates apply for a position
- The objective is to hire the best person
- A decision must be taken immediately after each interview

$$s = \begin{cases} 0 & \text{current candidate not best so far} \\ 1 & \text{current candidate best so far} \\ \Delta & \text{interview process stopped} \end{cases}$$

$$f_t(s) = 0, \qquad g_t(0) = 0, \qquad g_t(1) = \frac{t}{N}$$

**Backward induction algorithm**

**1.** Set $t = N$ and $u_N^*(s_N) = r_N(s_N)$ for all $s_N \in S$.

**2.** Set $t \leftarrow t - 1$. For each $s_t \in S$,

$$u_t^*(s_t) = \max_{a \in A_{s_t}} \left\{ r_t(s_t, a) + \sum_{j \in S} p_t(j|s_t, a)\, u_{t+1}^*(j) \right\}$$

$$A_{s_t,t}^* = \operatorname*{argmax}_{a \in A_{s_t}} \left\{ r_t(s_t, a) + \sum_{j \in S} p_t(j|s_t, a)\, u_{t+1}^*(j) \right\}$$

**3.** If $t = 1$ stop, otherwise return to step 2.
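The three steps above can be sketched directly for a finite, time-homogeneous MDP; the helper name and the dictionary layout are illustrative assumptions.

```python
def backward_induction(states, actions, p, r, r_terminal, N):
    """Steps 1-3: u[t][s] is the optimal value-to-go, d[t][s] an optimal action,
    using u_t(s) = max_a { r(s,a) + sum_j p(j|s,a) u_{t+1}(j) }."""
    u = {N: {s: r_terminal[s] for s in states}}      # step 1: terminal values
    d = {}
    for t in range(N - 1, 0, -1):                    # steps 2-3: count down to t=1
        u[t], d[t] = {}, {}
        for s in states:
            vals = {a: r[s][a] + sum(p[s][a][j] * u[t + 1][j] for j in states)
                    for a in actions[s]}
            d[t][s] = max(vals, key=vals.get)        # one element of the argmax
            u[t][s] = vals[d[t][s]]
    return u, d
```

On a toy two-state problem the recursion simply accumulates the best one-step choices stage by stage, from $t = N$ back to $t = 1$.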

**Secretary problem backward induction**

$$u_N^*(s) = \begin{cases} 0 & s = 0 \\ 1 & s = 1 \\ 0 & s = \Delta \end{cases}$$

$$u_t^*(s) = \begin{cases} \max\left\{ 0,\ \frac{1}{t+1} u_{t+1}^*(1) + \frac{t}{t+1} u_{t+1}^*(0) \right\} & s = 0 \\[4pt] \max\left\{ g_t(1) + u_{t+1}^*(\Delta),\ \frac{1}{t+1} u_{t+1}^*(1) + \frac{t}{t+1} u_{t+1}^*(0) \right\} & s = 1 \\[4pt] u_{t+1}^*(\Delta) & s = \Delta \end{cases}$$


**Secretary problem backward induction - cont**

$$u_t^*(s) = \begin{cases} \frac{1}{t+1} u_{t+1}^*(1) + \frac{t}{t+1} u_{t+1}^*(0) & s = 0 \\[4pt] \max\left\{ \frac{t}{N},\ \frac{1}{t+1} u_{t+1}^*(1) + \frac{t}{t+1} u_{t+1}^*(0) \right\} & s = 1 \\[4pt] 0 & s = \Delta \end{cases}$$

$$u_t^*(s) = \begin{cases} \frac{1}{t+1} u_{t+1}^*(1) + \frac{t}{t+1} u_{t+1}^*(0) & s = 0 \\[4pt] \max\left\{ \frac{t}{N},\ u_t^*(0) \right\} & s = 1 \\[4pt] 0 & s = \Delta \end{cases}$$

$$A_{s,t}^* = \begin{cases} C & s = 0 \\ Q \ \text{if} \ \frac{t}{N} > u_t^*(0) & s = 1 \\ C & s = \Delta \end{cases}$$


Assume $u_\tau^*(0) \geq \frac{\tau}{N}$ (continuing is optimal at time $\tau$); then $u_\tau^*(1) = u_\tau^*(0) \geq \frac{\tau}{N}$. So

$$u_{\tau-1}^*(1) = \max\left\{ \frac{\tau-1}{N},\ u_{\tau-1}^*(0) \right\}$$

with

$$u_{\tau-1}^*(0) = \frac{1}{\tau} u_\tau^*(1) + \frac{\tau-1}{\tau} u_\tau^*(0) = u_\tau^*(0) \geq \frac{\tau}{N} > \frac{\tau-1}{N}.$$

Hence, once continuing is optimal at some time, it is optimal at all earlier times: the optimal policy is a threshold rule.

**Calculation of $u_t^*(0)$**

In the region where stopping in state 1 is optimal, $u_{t+1}^*(1) = \frac{t+1}{N}$, so

$$u_t^*(0) = \frac{1}{t+1} u_{t+1}^*(1) + \frac{t}{t+1} u_{t+1}^*(0) = \frac{1}{t+1} \cdot \frac{t+1}{N} + \frac{t}{t+1} u_{t+1}^*(0) = \frac{1}{N} + \frac{t}{t+1} u_{t+1}^*(0)$$

$$u_N^*(0) = 0$$

$$u_{N-1}^*(0) = \frac{1}{N} + \frac{N-1}{N} \cdot 0 = \frac{1}{N}$$

$$u_{N-2}^*(0) = \frac{1}{N} + \frac{N-2}{N-1} \cdot \frac{1}{N} = \frac{N-2}{N} \left( \frac{1}{N-2} + \frac{1}{N-1} \right)$$

$$u_{N-3}^*(0) = \frac{1}{N} + \frac{N-3}{N-2} \cdot \frac{N-2}{N} \left( \frac{1}{N-2} + \frac{1}{N-1} \right) = \frac{N-3}{N} \left( \frac{1}{N-3} + \frac{1}{N-2} + \frac{1}{N-1} \right)$$

So

$$u_t^*(0) = \frac{t}{N} \sum_{k=t}^{N-1} \frac{1}{k} \qquad \text{and} \qquad \tau = \max\left\{ t : \sum_{k=t}^{N-1} \frac{1}{k} > 1 \right\}$$

Since $\int_1^N \frac{1}{x}\, dx = \log N$, we have $\sum_{k=t}^{N-1} \frac{1}{k} \sim \log \frac{N}{t}$, and $\log \frac{N}{\tau} \approx 1 \Rightarrow \tau \approx N e^{-1}$.
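The closed form and the asymptotic threshold can be checked numerically; a small sketch, with illustrative function names:

```python
import math

def u0(t, N):
    """u_t*(0) = (t/N) * sum_{k=t}^{N-1} 1/k."""
    return t / N * sum(1.0 / k for k in range(t, N))

def threshold(N):
    """tau = max{ t : sum_{k=t}^{N-1} 1/k > 1 }, approximately N/e."""
    return max(t for t in range(1, N) if sum(1.0 / k for k in range(t, N)) > 1)
```

For $N = 100$ the threshold comes out close to $100/e \approx 36.8$, and the success probability $u_\tau^*(0)$ is close to $e^{-1} \approx 0.368$.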