Markov Decision Processes

01 Feb 2019

Let $\mathcal{S}$ denote the set of all possible states and, for any $s\in\mathcal{S}$, let $\mathcal{A}(s)$ denote the set of all possible actions at state $s$. Let $\mathcal{R}$ denote the set of all possible rewards. We will assume that $\mathcal{S}$ and $\mathcal{R}$ are countable and that $\mathcal{A}(s)$ is finite for all $s\in\mathcal{S}$. Let $T$ denote the time step variable and let $t$ denote a particular time step.

Let $S_t$ denote the random variable that is the state at time $t$ and let $A_t$ denote the action taken by the agent at time $t$. Let $R_{t+1}$ denote the reward from taking action $A_t$ at time $t$. We will use the conventions that $s$ is a possible outcome of $S_t$, that $a$ is a possible outcome of $A_t$, that $s'$ is a possible outcome of $S_{t+1}$, and that $r$ is a possible outcome of $R_{t+1}$. Note that

$$ \condpmf{s'}{s,a}=\cp{S_{t+1}=s'}{S_{t}=s,A_{t}=a}=\sum_{r\in\mcalR}\condpmf{s',r}{s,a} $$

This conditional PMF is called the transition function. Similarly

$$ \condpmf{r}{s,a}=\cp{R_{t+1}=r}{S_{t}=s,A_{t}=a}=\sum_{s'\in\mcalS}\condpmf{s',r}{s,a} $$

Hence

$$\align{ r(s,a)\equiv\Ec{R_{t+1}}{S_{t}=s,A_{t}=a}=\sum_{r\in\mcalR}r\cdot\condpmf{r}{s,a}=\sum_{r\in\mcalR}r\sum_{s'\in\mcalS}\condpmf{s',r}{s,a} }$$

Define $\prq{\cdot}\equiv\cp{\cdot}{S_{t}=s,A_{t}=a}$. Then

$$ \cpq{\cdot}{S_{t+1}=s'}=\cp{\cdot}{S_{t}=s,A_{t}=a,S_{t+1}=s'} \tag{by Ross [1], p. 94} $$

Hence

$$\align{ \condpmf{r}{s,a,s'} &= \cp{R_{t+1}=r}{S_{t}=s,A_{t}=a,S_{t+1}=s'} \\\\ &= \cpq{R_{t+1}=r}{S_{t+1}=s'} \\\\ &= \frac{\prq{R_{t+1}=r,S_{t+1}=s'}}{\prq{S_{t+1}=s'}} \\\\ &= \frac{\cp{R_{t+1}=r,S_{t+1}=s'}{S_{t}=s,A_{t}=a}}{\cp{S_{t+1}=s'}{S_{t}=s,A_{t}=a}} \\\\ &= \frac{\condpmf{r,s'}{s,a}}{\condpmf{s'}{s,a}} }$$

Hence

$$\align{ r(s,a,s')\equiv\Ec{R_{t+1}}{S_{t}=s,A_{t}=a,S_{t+1}=s'}=\sum_{r\in\mcalR}r\cdot\condpmf{r}{s,a,s'}=\sum_{r\in\mcalR}r\cdot\frac{\condpmf{r,s'}{s,a}}{\condpmf{s'}{s,a}} }$$

This expected value is called the reward function.
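Before moving on, here is a minimal numerical sketch of these marginalizations. The small random MDP and the names (`p_joint`, `p_trans`, and so on) are stand-ins of my own, not objects defined above: we store $\condpmf{s',r}{s,a}$ as an array and recover the transition function, $\condpmf{r}{s,a}$, $r(s,a)$, and $r(s,a,s')$ from it.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, nR = 3, 2, 4                        # |S|, |A(s)|, |R|
rewards = np.array([-1.0, 0.0, 1.0, 2.0])   # the possible outcomes r

# p_joint[s, a, s2, i] = p(s2, r_i | s, a); normalize over (s', r)
p_joint = rng.random((nS, nA, nS, nR))
p_joint /= p_joint.sum(axis=(2, 3), keepdims=True)

p_trans = p_joint.sum(axis=3)       # transition function p(s' | s, a)
p_rew = p_joint.sum(axis=2)         # reward PMF p(r | s, a)
r_sa = p_rew @ rewards              # r(s, a) = sum_r r * p(r | s, a)

# r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)
r_sas = (p_joint * rewards).sum(axis=3) / p_trans

# consistency check: r(s, a) = sum_{s'} p(s' | s, a) * r(s, a, s')
assert np.allclose(r_sa, (p_trans * r_sas).sum(axis=2))
```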

Now consider the sum

$$ \sum_{k=0}^{\infty}R_{t+1+k} $$

Note that this sum can be unbounded. Also note that, intuitively, the reward $r$ at time $T=t+1$ should have more value than the same reward $r$ at time $T=t+100$. For these two reasons, we introduce the discount parameter $\gamma\in[0,1)$. When the rewards are bounded, discounting makes the sum converge, and it weights near-term rewards more heavily than distant ones. We then define the cumulative reward at time step $t$ to be

$$ G_t\equiv\sum_{k=0}^{\infty}\gamma^{k}R_{t+1+k}=R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\dotsb $$

Note that

$$\align{ G_t &\equiv \sum_{k=0}^{\infty}\gamma^{k}R_{t+1+k} \\ &= \gamma^{0}R_{t+1+0}+\sum_{k=1}^{\infty}\gamma^{k}R_{t+1+k} \\ &= R_{t+1}+\gamma\sum_{k=1}^{\infty}\gamma^{k-1}R_{t+1+k} \\ &= R_{t+1}+\gamma\sum_{k=0}^{\infty}\gamma^{k}R_{t+2+k} \\ &= R_{t+1}+\gamma\cdot G_{t+1} }$$
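This recursion is easy to check numerically. The sketch below assumes a finite reward sequence (with rewards taken to be zero afterwards, an assumption of mine for the example) and compares the direct definition of $G_t$ with the backward recursion:

```python
import numpy as np

gamma = 0.9
R = np.array([1.0, 0.0, -2.0, 3.0, 1.0])   # R_{t+1}, R_{t+2}, ...

# direct definition: G_t = sum_k gamma^k R_{t+1+k}
G_direct = sum(gamma**k * r for k, r in enumerate(R))

# recursion, computed backward: G_t = R_{t+1} + gamma * G_{t+1}
G = 0.0
for r in reversed(R):
    G = r + gamma * G

assert np.isclose(G, G_direct)   # both give the same value
```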

The Markov Property is this: if we know the state at time step $t+1$, then the previous states and actions are irrelevant in determining the probability of rewards at time steps $t+1,t+2,\dots$. Hence, if the Markov Property holds, then

$$\align{ &\condpmf{g_{t+1}}{s',a,s}=\condpmf{g_{t+1}}{s'} \\\\ &\condpmf{g_{t+1}}{s',r,a,s}=\condpmf{g_{t+1}}{s'}\tag{MDP.0} \\\\ &\condpmf{r_{t+2}}{s',a,s}=\condpmf{r_{t+2}}{s'} }$$

A Markov Decision Process is a tuple

$$ \prn{\mcalS,\set{\mcalA(s):s\in\mcalS},\set{\condpmf{s'}{s,a}:s,s'\in\mcalS\text{ and }a\in\mcalA(s)},\set{r(s,a,s'):s,s'\in\mcalS\text{ and }a\in\mcalA(s)},\gamma,\mcalH} $$

where the Markov Property holds and where $\mcalH$ is the time horizon. We will generally assume that the horizon is infinite unless otherwise stated.

Given an agent, we define a policy $\pi$ to be the conditional PMF that the agent takes a particular action given a particular state. That is, $\condpmfsym{\pi}{a}{s}$ is the probability that the agent takes action $a$ given the state $s$. More explicitly, let $\Pi$ denote the set of all possible policies on $\mcalS$ and $\mcalA$ and, for a policy $\pi$, define

$$ \condpmfsym{\pi}{a}{s}\equiv\cpwrt{\pi}{A_T=a}{S_T=s}\equiv\cp{A_T=a}{S_T=s,\Pi=\pi}\qd\text{for all time steps }T $$

Notice that the conditional PMF $\condpmfsym{\pi}{a}{s}$ is constant relative to the time step variable $T$. Time-invariant policies are referred to as stationary.

But it is often helpful to define a policy where the conditional PMF’s vary with the time variable $T$. Indeed, there is no mathematical or probabilistic reason that precludes us from defining a policy with a different PMF at each time step. And when we learn about optimal policies (below), we can easily imagine some optimal policies whose behavior differs between time steps.

Hence, more generally, we define a nonstationary policy $\pi$ to be a set of conditional PMF’s:

$$ \condpmfsym{\pi}{a}{s} \equiv \cases{\cpwrt{\pi_{0}}{A_{0}=a}{S_{0}=s}\\\cpwrt{\pi_{1}}{A_{1}=a}{S_{1}=s}\\\vdots\\\cpwrt{\pi_{t}}{A_{t}=a}{S_{t}=s}\\\cpwrt{\pi_{t+1}}{A_{t+1}=a}{S_{t+1}=s}\\\vdots} $$

such that

$$ 1=\sum_{a\in\mcalA(s)}\cpwrt{\pi_{T}}{A_T=a}{S_T=s}\qd\text{for all }s\in\mcalS\text{ and all }T=0,1,2,\dots $$

We will assume that all policies are nonstationary unless stated otherwise.
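As a concrete representation, a nonstationary policy over a finite horizon can be stored as an array indexed by time, state, and action. The layout below, and the horizon `H`, are choices of mine for illustration, not something fixed by the definitions:

```python
import numpy as np

rng = np.random.default_rng(1)
H, nS, nA = 5, 3, 2

pi = rng.random((H, nS, nA))              # pi[t, s, a] = pi_t(a | s)
pi /= pi.sum(axis=2, keepdims=True)       # enforce sum_a pi_t(a | s) = 1

assert np.allclose(pi.sum(axis=2), 1.0)   # the normalization condition above

# a stationary policy is the special case where every pi_t is the same PMF
pi_stat = np.tile(pi[0], (H, 1, 1))
```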

Define the state-value function $v_{\pi}$ by

$$ v_{\pi}(s)\equiv\Ecwrt{\pi}{G_t}{S_t=s} $$

Define the action-value function $q_{\pi}$ by

$$ q_{\pi}(s,a)\equiv\Ecwrt{\pi}{G_t}{S_t=s,A_t=a} $$
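These definitions suggest a direct, if crude, way to approximate $v_{\pi}$ and $q_{\pi}$: average truncated returns over sampled trajectories. The sketch below does this on a small random MDP of my own, with deterministic rewards $r(s,a)$ as a simplification; none of these particulars come from the development above.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma, T = 3, 2, 0.9, 60   # truncate: gamma^60 is negligible

p_trans = rng.random((nS, nA, nS)); p_trans /= p_trans.sum(2, keepdims=True)
r_sa = rng.normal(size=(nS, nA))                  # deterministic rewards r(s, a)
pi = rng.random((nS, nA)); pi /= pi.sum(1, keepdims=True)

def rollout(s, a=None):
    """Sample a truncated return G_t from state s (optionally forcing A_t = a)."""
    G, disc = 0.0, 1.0
    for _ in range(T):
        if a is None:
            a = rng.choice(nA, p=pi[s])
        G += disc * r_sa[s, a]
        disc *= gamma
        s = rng.choice(nS, p=p_trans[s, a, :])
        a = None                                   # only force the first action
    return G

v_hat = np.mean([rollout(0) for _ in range(5000)])         # estimates v_pi(0)
q_hat = np.mean([rollout(0, a=1) for _ in range(5000)])    # estimates q_pi(0, 1)

# exact v for comparison (solving the linear system from MDP.1 below)
P_pi = np.einsum('sa,sax->sx', pi, p_trans)
v_exact = np.linalg.solve(np.eye(nS) - gamma * P_pi, (pi * r_sa).sum(1))
print(v_hat, v_exact[0])   # the estimate should be close to the exact value
```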

Before proceeding, recall that for discrete random variables $X$ and $Y$, we have

$$\align{ \E{X} = \sum_xx\cdot\pmfwrt{X}{x} = \sum_xx\sum_y\pmf{x,y} = \sum_x\sum_yx\cdot\pmf{x,y} = \sum_y\sum_xx\cdot\pmf{x,y} }$$

The second equality follows from Ross [1], p. 233. Also recall that

$$\align{ \Ewrt{X,Y}{X+Y} &= \sum_x\sum_y(x+y)\cdot\pmfwrt{X,Y}{x,y}\tag{by Ross [1], p. 298, Prop 2.1} \\ &= \sum_x\sum_y(x+y)\sum_z\pmfwrt{X,Y,Z}{x,y,z}\tag{by Ross [1], p. 233} \\ &= \sum_x\sum_y\sum_z(x+y)\cdot\pmfwrt{X,Y,Z}{x,y,z} \\ &= \sum_z\sum_x\sum_y(x+y)\cdot\pmfwrt{X,Y,Z}{x,y,z} }$$

Proposition MDP.1

$$\align{ v_{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}\condpmf{s',r}{a,s}\prn{r+\gamma v_{\pi}(s')} }$$

Note This equation is called the Bellman Equation for $v_{\pi}$.

Proof We compute

$$\align{ v_{\pi}(s) &\equiv \Ecwrt{\pi}{G_t}{S_t=s} \\ &= \Ecwrt{\pi}{R_{t+1}+\gamma G_{t+1}}{S_t=s} \\ &= \sum_{r}\sum_{g_{t+1}}\condpmf{r,g_{t+1}}{s,\pi}(r+\gamma g_{t+1}) \\ &= \sum_{s'}\sum_{r}\sum_{g_{t+1}}\sum_{a}\condpmf{s',r,g_{t+1}, a}{s,\pi}(r+\gamma g_{t+1}) \\ &= \sum_{a}\sum_{s'}\sum_{r}\sum_{g_{t+1}}\condpmf{s',r,g_{t+1}, a}{s,\pi}(r+\gamma g_{t+1}) \\ &= \sum_{a}\sum_{s'}\sum_{r}\sum_{g_{t+1}}\condpmf{s',r,g_{t+1}}{a, s,\pi}\pi(a|s)(r+\gamma g_{t+1}) \\ &= \sum_{a}\pi(a|s)\sum_{s'}\sum_{r}\sum_{g_{t+1}}p(s',r,g_{t+1} |a, s,\pi)(r+\gamma g_{t+1}) \\ &= \sum_{a}\pi(a|s)\sum_{s'}\sum_{r}\sum_{g_{t+1}}\condpmf{s',r}{a, s,\pi}p(g_{t+1}|s', r, a, s,\pi)(r+\gamma g_{t+1}) \\ &= \sum_{a}\pi(a|s)\sum_{s'}\sum_{r}\condpmf{s',r}{a, s,\pi}\sum_{g_{t+1}}p(g_{t+1}|s', r, a, s,\pi)(r+\gamma g_{t+1}) \\ &= \sum_{a}\pi(a|s)\sum_{s'}\sum_{r}\condpmf{s',r}{a, s,\pi}\sum_{g_{t+1}}\condpmf{g_{t+1}}{s',\pi}(r+\gamma g_{t+1})\tag{by MDP.0} \\ &= \sum_{a}\pi(a|s)\sum_{s'}\sum_{r}\condpmf{s',r}{a, s,\pi}\Prn{r\sum_{g_{t+1}}\condpmf{g_{t+1}}{s',\pi}+\gamma\sum_{g_{t+1}}\condpmf{g_{t+1}}{s',\pi}g_{t+1}} \\ &= \sum_{a}\pi(a|s)\sum_{s'}\sum_{r}\condpmf{s',r}{a, s,\pi}\Prn{r+\gamma\sum_{g_{t+1}}\condpmf{g_{t+1}}{s',\pi}g_{t+1}} \\ &=\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}\condpmf{s',r}{a,s,\pi}\prn{r+\gamma v_{\pi}(s')} \\ &=\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}\condpmf{s',r}{a,s}\prn{r+\gamma v_{\pi}(s')} }$$

In the last equality, we drop the condition that the policy is $\pi$ because neither $s'$ nor $r$ depends on the policy $\pi$ if we’re given the state and action from the previous time step.

$\wes$
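For a stationary policy, MDP.1 can be sanity-checked numerically: $v_{\pi}$ is the unique solution of the linear system $v=r_{\pi}+\gamma P_{\pi}v$, and that solution should satisfy the Bellman Equation above. The small random MDP below is a stand-in of mine, not part of the development:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, nR, gamma = 3, 2, 4, 0.9
rewards = np.array([-1.0, 0.0, 1.0, 2.0])

p_joint = rng.random((nS, nA, nS, nR))
p_joint /= p_joint.sum(axis=(2, 3), keepdims=True)   # p(s', r | s, a)
p_trans = p_joint.sum(axis=3)                        # p(s' | s, a)
r_sa = p_joint.sum(axis=2) @ rewards                 # r(s, a)
pi = rng.random((nS, nA)); pi /= pi.sum(1, keepdims=True)

# policy-averaged dynamics P_pi[s, s'] and rewards r_pi[s], then solve
P_pi = np.einsum('sa,sax->sx', pi, p_trans)
r_pi = (pi * r_sa).sum(axis=1)
v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# Bellman RHS: sum_a pi(a|s) sum_{s'} sum_r p(s', r | s, a) (r + gamma v(s'))
rhs = np.einsum('sa,saxr,xr->s', pi, p_joint,
                rewards[None, :] + gamma * v[:, None])
assert np.allclose(v, rhs)
```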

Suppose $\sup_{\pi}v_{\pi}(s)$ exists and is finite for all $s\in\mcalS$. Define $v_{*}:\mcalS\to\wR$ by

$$ v_{*}(s)\equiv\sup_{\pi}v_{\pi}(s)\qd\text{for all }s\in\mcalS $$

If $\pi_{*}\in\argsup{\pi}v_{\pi}(s)$, then $\pi_{*}$ is said to be an optimal policy for $s$ and $\gamma$. If this holds for all $s\in\mcalS$, then $\pi_{*}$ is said to be an optimal policy for $\gamma$. When $\gamma$ is clear from context, we just say that $\pi_{*}$ is an optimal policy.

Suppose $\sup_{\pi}q_{\pi}(s,a)$ exists and is finite for all $s\in\mcalS$ and all $a\in\mcalA(s)$. Define $q_{*}:\mcalS\times\mcalA\to\wR$ by

$$ q_{*}(s,a)\equiv\sup_{\pi}q_{\pi}(s,a)\qd\text{for all }s\in\mcalS,a\in\mcalA(s) $$

Going forward, we will assume that $\sup_{\pi}v_{\pi}(s)$ and $\sup_{\pi}q_{\pi}(s,a)$ exist and are finite for all $s\in\mcalS$ and all $a\in\mcalA(s)$.

Proposition MDP.2 Let $s\in\mcalS$ and suppose there exists a policy $\pi_{*}$ such that

$$ \pi_{*}\in\argsup{\pi}v_{\pi}(s) $$

Then $v_{\pi}(s')\leq v_{\pi_{*}}(s')$ for all policies $\pi$ and for all possible states $s'$ in the next time step.

Proof Suppose, by way of contradiction, there exists a policy $\pi$ and a state $s_0'$ such that $v_{\pi}(s_0')\gt v_{\pi_{*}}(s_0')$. Define the policy $\phi_t$ by

$$ \cpwrt{\phi_t}{A_T=a}{S_T=s}=\condpmfsym{\phi_t}{a}{s}\equiv\cases{\condpmfsym{\pi}{a}{s}&\text{if }T=t+1\text{ and }s=s_0'\\\condpmfsym{\pi_{*}}{a}{s}&\text{else}} $$

Then

$$\align{ v_{\phi_t}(s) &= \sum_{a}\condpmfsym{\phi_t}{a}{s}\sum_{s'}\sum_{r}\condpmf{s',r}{a,s}\prn{r+\gamma v_{\phi_t}(s')}\tag{by MDP.1} \\ &= \sum_{a}\condpmfsym{\phi_t}{a}{s}\Prngg{\sum_{r}\condpmf{s_0',r}{a,s}\prn{r+\gamma v_{\phi_t}(s_0')}+\sum_{s'\neq s_0'}\sum_{r}\condpmf{s',r}{a,s}\prn{r+\gamma v_{\phi_t}(s')}} \\ &= \sum_{a}\condpmfsym{\pi_{*}}{a}{s}\Prngg{\sum_{r}\condpmf{s_0',r}{a,s}\prn{r+\gamma v_{\pi}(s_0')}+\sum_{s'\neq s_0'}\sum_{r}\condpmf{s',r}{a,s}\prn{r+\gamma v_{\pi_{*}}(s')}} \\ &\gt \sum_{a}\condpmfsym{\pi_{*}}{a}{s}\Prngg{\sum_{r}\condpmf{s_0',r}{a,s}\prn{r+\gamma v_{\pi_{*}}(s_0')}+\sum_{s'\neq s_0'}\sum_{r}\condpmf{s',r}{a,s}\prn{r+\gamma v_{\pi_{*}}(s')}} \\ &= \sum_{a}\condpmfsym{\pi_{*}}{a}{s}\sum_{s'}\sum_{r}\condpmf{s',r}{a,s}\prn{r+\gamma v_{\pi_{*}}(s')} \\ &= v_{\pi_{*}}(s)\tag{by MDP.1} }$$

This contradicts $v_{\pi_{*}}(s)=\sup_{\pi}v_{\pi}(s)$.

$\wes$

Proposition MDP.3 For all states $s\in\mcalS$ and all policies $\pi$, we have

$$ v_{\pi}(s) = \sum_{a\in\mcalA({s})}\plcypmf{a}{s}\cdot q_{\pi}(s,a) $$

Proof We have

$$\align{ v_{\pi}(s) &= \Ecwrt{\pi}{G_t}{S_t=s} \\ &= \sum_{g_t}g_t\cdot\condpmf{g_t}{s,\pi} \\ &= \sum_{g_t}g_t\cdot\sum_{a\in\mcalA({s})}\condpmf{g_t}{s,a,\pi}\condpmf{a}{s,\pi} \\ &= \sum_{g_t}\sum_{a\in\mcalA({s})}g_t\cdot\condpmf{g_t}{s,a,\pi}\condpmf{a}{s,\pi} \\ &= \sum_{a\in\mcalA({s})}\sum_{g_t}g_t\cdot\condpmf{g_t}{s,a,\pi}\condpmf{a}{s,\pi} \\ &= \sum_{a\in\mcalA({s})}\condpmf{a}{s,\pi}\sum_{g_t}g_t\cdot\condpmf{g_t}{s,a,\pi} \\ &= \sum_{a\in\mcalA({s})}\condpmf{a}{s,\pi}\cdot\Ecwrt{\pi}{G_t}{S_t=s,A_t=a} \\ &= \sum_{a\in\mcalA({s})}\plcypmf{a}{s}\cdot q_{\pi}(s,a) }$$

$\wes$

Proposition MDP.4 For all states $s\in\mcalS$ and all actions $a\in\mcalA(s)$, we have

$$ q_{\pi}(s,a) = \Ec{R_{t+1}+\gamma v_{\pi}(S_{t+1})}{S_t=s,A_t=a} $$

Note This equation is called the Bellman Equation for $q_{\pi}$. When we combine this equation with MDP.3, we can find a variety of forms for the Bellman Equation for $q_{\pi}$.

Proof We have

$$\align{ q_{\pi}(s,a) &= \Ecwrt{\pi}{G_t}{S_t=s,A_t=a} \\ &= \Ecwrt{\pi}{R_{t+1}+\gamma G_{t+1}}{S_t=s,A_t=a} \\ &= \Ecwrt{\pi}{R_{t+1}}{S_t=s,A_t=a}+\gamma\Ecwrt{\pi}{G_{t+1}}{S_t=s,A_t=a} \\ &= \Ecwrt{\pi}{R_{t+1}}{S_t=s,A_t=a}+\gamma\sum_{g_{t+1}}g_{t+1}\cdot\condpmf{g_{t+1}}{s,a,\pi} \\ &= \Ecwrt{\pi}{R_{t+1}}{S_t=s,A_t=a}+\gamma\sum_{g_{t+1}}g_{t+1}\cdot\sum_{s'}\condpmf{g_{t+1}}{s',s,a,\pi}\condpmf{s'}{s,a,\pi} \\ &= \Ec{R_{t+1}}{S_t=s,A_t=a}+\gamma\sum_{g_{t+1}}g_{t+1}\cdot\sum_{s'}\condpmf{g_{t+1}}{s',s,a,\pi}\condpmf{s'}{s,a}\tag{MDP.4.1} \\ &= \Ec{R_{t+1}}{S_t=s,A_t=a}+\gamma\sum_{g_{t+1}}\sum_{s'}g_{t+1}\cdot\condpmf{g_{t+1}}{s',s,a,\pi}\condpmf{s'}{s,a} \\ &= \Ec{R_{t+1}}{S_t=s,A_t=a}+\gamma\sum_{s'}\sum_{g_{t+1}}g_{t+1}\cdot\condpmf{g_{t+1}}{s',s,a,\pi}\condpmf{s'}{s,a} \\ &= \Ec{R_{t+1}}{S_t=s,A_t=a}+\gamma\sum_{s'}\condpmf{s'}{s,a}\sum_{g_{t+1}}g_{t+1}\cdot\condpmf{g_{t+1}}{s',s,a,\pi} \\ &= \Ec{R_{t+1}}{S_t=s,A_t=a}+\gamma\sum_{s'}\condpmf{s'}{s,a}\sum_{g_{t+1}}g_{t+1}\cdot\condpmf{g_{t+1}}{s',\pi}\tag{by MDP.0} \\ &= \Ec{R_{t+1}}{S_t=s,A_t=a}+\gamma\sum_{s'}\condpmf{s'}{s,a}\cdot\Ecwrt{\pi}{G_{t+1}}{S_{t+1}=s'} \\ &= \Ec{R_{t+1}}{S_t=s,A_t=a}+\gamma\sum_{s'}\condpmf{s'}{s,a}\cdot v_{\pi}(s')\tag{MDP.4.2} \\ &= \Ec{R_{t+1}}{S_t=s,A_t=a}+\gamma\Ec{v_{\pi}(S_{t+1})}{S_t=s,A_t=a} \\ &= \Ec{R_{t+1}+\gamma v_{\pi}(S_{t+1})}{S_t=s,A_t=a} }$$

In MDP.4.1, we remove the condition $\pi$ from both the first expectation and the last conditional PMF. We can do this because neither $R_{t+1}$ nor $s'$ depends on the policy if we’re given the state and action at time $t$.

Alternatively, we can compute this as follows:

$$\align{ q_{\pi}(s,a) &= \Ecwrt{\pi}{G_t}{S_t=s,A_t=a} \\ &= \Ecwrt{\pi}{R_{t+1}+\gamma G_{t+1}}{S_t=s,A_t=a} \\ &= \sum_{r}\sum_{g_{t+1}}(r+\gamma g_{t+1})\cdot\condpmf{r,g_{t+1}}{s,a,\pi} \\ &= \sum_{r}\sum_{g_{t+1}}(r+\gamma g_{t+1})\cdot\sum_{s'}\condpmf{r,g_{t+1}}{s',s,a,\pi}\cdot\condpmf{s'}{s,a,\pi} \\ &= \sum_{r}\sum_{g_{t+1}}\sum_{s'}(r+\gamma g_{t+1})\cdot\condpmf{r,g_{t+1}}{s',s,a,\pi}\cdot\condpmf{s'}{s,a,\pi} \\ &= \sum_{s'}\sum_{r}\sum_{g_{t+1}}(r+\gamma g_{t+1})\cdot\condpmf{r,g_{t+1}}{s',s,a,\pi}\cdot\condpmf{s'}{s,a,\pi} \\ &= \sum_{s'}\condpmf{s'}{s,a,\pi}\sum_{r}\sum_{g_{t+1}}(r+\gamma g_{t+1})\cdot\condpmf{r,g_{t+1}}{s',s,a,\pi} \\ &= \sum_{s'}\condpmf{s'}{s,a,\pi}\sum_{r}\sum_{g_{t+1}}\Prn{r\cdot\condpmf{r,g_{t+1}}{s',s,a,\pi}+\gamma g_{t+1}\cdot\condpmf{r,g_{t+1}}{s',s,a,\pi}} \\ &= \sum_{s'}\condpmf{s'}{s,a,\pi}\Prngg{\sum_{r}\sum_{g_{t+1}}r\cdot\condpmf{r,g_{t+1}}{s',s,a,\pi}+\sum_{r}\sum_{g_{t+1}}\gamma g_{t+1}\cdot\condpmf{r,g_{t+1}}{s',s,a,\pi}} \\ &= \sum_{s'}\condpmf{s'}{s,a,\pi}\Prngg{\sum_{r}r\cdot\sum_{g_{t+1}}\condpmf{r,g_{t+1}}{s',s,a,\pi}+\sum_{g_{t+1}}\gamma g_{t+1}\cdot\sum_{r}\condpmf{r,g_{t+1}}{s',s,a,\pi}} \\ &= \sum_{s'}\condpmf{s'}{s,a,\pi}\Prngg{\sum_{r}r\cdot\condpmf{r}{s',s,a,\pi}+\sum_{g_{t+1}}\gamma g_{t+1}\cdot\condpmf{g_{t+1}}{s',s,a,\pi}} \\ &= \sum_{s'}\condpmf{s'}{s,a,\pi}\Prngg{\sum_{r}r\cdot\condpmf{r}{s',s,a,\pi}+\sum_{g_{t+1}}\gamma g_{t+1}\cdot\condpmf{g_{t+1}}{s',\pi}}\tag{by MDP.0} \\ &= \sum_{s'}\condpmf{s'}{s,a}\Prngg{\sum_{r}r\cdot\condpmf{r}{s',s,a}+\gamma\sum_{g_{t+1}}g_{t+1}\cdot\condpmf{g_{t+1}}{s',\pi}} \\ &= \sum_{s'}\condpmf{s'}{s,a}\Prngg{\sum_{r}r\cdot\condpmf{r}{s',s,a}+\gamma v_{\pi}(s')} \\ &= \sum_{s'}\condpmf{s'}{s,a}\sum_{r}r\cdot\condpmf{r}{s',s,a}+\gamma \sum_{s'}\condpmf{s'}{s,a}\cdot v_{\pi}(s') \\ &= \sum_{s'}\sum_{r}r\cdot\condpmf{r}{s',s,a}\cdot\condpmf{s'}{s,a}+\gamma\Ec{v_{\pi}(S_{t+1})}{S_t=s,A_t=a} \\ &= \sum_{r}\sum_{s'}r\cdot\condpmf{r}{s',s,a}\cdot\condpmf{s'}{s,a}+\gamma\Ec{v_{\pi}(S_{t+1})}{S_t=s,A_t=a} \\ &= \sum_{r}r\cdot\sum_{s'}\condpmf{r}{s',s,a}\cdot\condpmf{s'}{s,a}+\gamma\Ec{v_{\pi}(S_{t+1})}{S_t=s,A_t=a} \\ &= \sum_{r}r\cdot\condpmf{r}{s,a}+\gamma\Ec{v_{\pi}(S_{t+1})}{S_t=s,A_t=a} \\ &= \Ec{R_{t+1}}{S_t=s,A_t=a}+\gamma\Ec{v_{\pi}(S_{t+1})}{S_t=s,A_t=a} \\ &= \Ec{R_{t+1}+\gamma v_{\pi}(S_{t+1})}{S_t=s,A_t=a} }$$

$\wes$
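MDP.3 and the form MDP.4.2 combine into a quick numerical check, again on a small random MDP of my own: solve for $v_{\pi}$, compute $q_{\pi}$ from MDP.4.2, and confirm $v_{\pi}(s)=\sum_{a}\pi(a|s)\,q_{\pi}(s,a)$.

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma = 4, 3, 0.95

p_trans = rng.random((nS, nA, nS)); p_trans /= p_trans.sum(2, keepdims=True)
r_sa = rng.normal(size=(nS, nA))
pi = rng.random((nS, nA)); pi /= pi.sum(1, keepdims=True)

# v_pi from the linear Bellman system (as in MDP.1)
P_pi = np.einsum('sa,sax->sx', pi, p_trans)
r_pi = (pi * r_sa).sum(axis=1)
v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# MDP.4.2: q_pi(s, a) = r(s, a) + gamma * sum_{s'} p(s' | s, a) v_pi(s')
q = r_sa + gamma * p_trans @ v

# MDP.3: v_pi(s) = sum_a pi(a | s) q_pi(s, a)
assert np.allclose(v, (pi * q).sum(axis=1))
```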

Proposition MDP.5 Let $s_0\in\mcalS$, let $a_0\in\mcalA(s_0)$, and let $\pi$ be any policy on $\mcalS$. Define the policy $\phi_t$ on $\mcalS$ by

$$ \cpwrt{\phi_t}{A_T=a}{S_T=s}=\condpmfsym{\phi_t}{a}{s}\equiv\cases{\condpmfsym{\pi}{a}{s}&\text{if }T\neq t\text{ or }s\neq s_0\\1&\text{else if }a=a_0\\0&\text{else}} $$

Then, for all $a\in\mcalA(s_0)$, we have

$$ q_{\phi_{t}}(s_0,a) = q_{\pi}(s_0,a) $$

Proof If $g_{t+1}$ is a possible outcome of $G_{t+1}$, then $\condpmf{g_{t+1}}{s',\phi_t}=\condpmf{g_{t+1}}{s',\pi}$ since $\phi_t=\pi$ for $T=t+1,t+2,\dots$. Also note that, for any $a\in\mcalA(s_0)$, we have $\condpmf{s'}{s_0,a,\phi_t}=\condpmf{s'}{s_0,a,\pi}$. That is, if we know the state and action in the current time step, then the policy is irrelevant to determining the next state. Hence, for any $a\in\mcalA(s_0)$, we have

$$\align{ q_{\phi_t}(s_0,a) &= \Ecwrt{\phi_t}{G_t}{S_t=s_0,A_t=a} \\ &= \Ecwrt{\phi_t}{R_{t+1}+\gamma G_{t+1}}{S_t=s_0,A_t=a} \\ &= \Ecwrt{\phi_t}{R_{t+1}}{S_t=s_0,A_t=a}+\Ecwrt{\phi_t}{\gamma G_{t+1}}{S_t=s_0,A_t=a} \\ &= \Ecwrt{\phi_t}{R_{t+1}}{S_t=s_0,A_t=a}+\sum_{g_{t+1}}\gamma g_{t+1}\cdot\condpmf{g_{t+1}}{s_0,a,\phi_t} \\ &= \Ecwrt{\phi_t}{R_{t+1}}{S_t=s_0,A_t=a}+\sum_{g_{t+1}}\gamma g_{t+1}\cdot\sum_{s'}\condpmf{g_{t+1}}{s',s_0,a,\phi_t}\cdot\condpmf{s'}{s_0,a,\phi_t} \\ &= \Ecwrt{\phi_t}{R_{t+1}}{S_t=s_0,A_t=a}+\sum_{g_{t+1}}\gamma g_{t+1}\cdot\sum_{s'}\condpmf{g_{t+1}}{s',\phi_t}\cdot\condpmf{s'}{s_0,a,\phi_t}\tag{by MDP.0} \\ &= \Ecwrt{\pi}{R_{t+1}}{S_t=s_0,A_t=a}+\sum_{g_{t+1}}\gamma g_{t+1}\cdot\sum_{s'}\condpmf{g_{t+1}}{s',\pi}\cdot\condpmf{s'}{s_0,a,\pi} \\ &= \Ecwrt{\pi}{R_{t+1}}{S_t=s_0,A_t=a}+\sum_{g_{t+1}}\gamma g_{t+1}\cdot\sum_{s'}\condpmf{g_{t+1}}{s',s_0,a,\pi}\cdot\condpmf{s'}{s_0,a,\pi} \\ &= \Ecwrt{\pi}{R_{t+1}}{S_t=s_0,A_t=a}+\sum_{g_{t+1}}\gamma g_{t+1}\cdot\condpmf{g_{t+1}}{s_0,a,\pi} \\ &= \Ecwrt{\pi}{R_{t+1}}{S_t=s_0,A_t=a}+\Ecwrt{\pi}{\gamma G_{t+1}}{S_t=s_0,A_t=a} \\ &= \Ecwrt{\pi}{R_{t+1}+\gamma G_{t+1}}{S_t=s_0,A_t=a} \\ &= \Ecwrt{\pi}{G_t}{S_t=s_0,A_t=a} \\ &= q_{\pi}(s_0,a) }$$

$\wes$
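A finite-horizon sketch makes MDP.5 concrete. With backward induction over a horizon $H$ (both $H$ and the random MDP are stand-ins of mine), overriding the policy only at time $t$ in state $s_0$ changes $v$ at $(t,s_0)$ but leaves $q$ at $(t,s_0)$ untouched, since that $q$ depends only on the policy from time $t+1$ onward:

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma, H, t, s0 = 3, 2, 0.9, 10, 4, 1

p_trans = rng.random((nS, nA, nS)); p_trans /= p_trans.sum(2, keepdims=True)
r_sa = rng.normal(size=(nS, nA))
pi = rng.random((H, nS, nA)); pi /= pi.sum(2, keepdims=True)

phi = pi.copy()
phi[t, s0, :] = 0.0
phi[t, s0, 0] = 1.0          # deterministic: always take action a0 = 0 at (t, s0)

def backward(policy):
    """Backward induction: return q at every time step under the given policy."""
    v_next = np.zeros(nS)
    qs = [None] * H
    for k in reversed(range(H)):
        q = r_sa + gamma * p_trans @ v_next    # q_k(s, a)
        v_next = (policy[k] * q).sum(axis=1)   # v_k(s) under policy[k]
        qs[k] = q
    return qs

q_pi, q_phi = backward(pi), backward(phi)
assert np.allclose(q_pi[t][s0], q_phi[t][s0])   # q at (t, s0) is unchanged
```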

Proposition MDP.6 Let $s_0\in\mcalS$ and define $a_{*}$ by

$$ a_{*}\in\argmax{a\in\mcalA(s_0)}\Prn{\sup_{\pi}q_{\pi}(s_0,a)}=\argmax{a\in\mcalA(s_0)}q_{*}(s_0,a) $$

Suppose there exists a policy $\pi_{**}$ such that

$$ \pi_{**}\in\argsup{\pi}q_{\pi}(s_0,a_{*}) $$

And define the policy $\phi_t$ by

$$ \cpwrt{\phi_t}{A_T=a}{S_T=s}=\condpmfsym{\phi_t}{a}{s}\equiv\cases{\condpmfsym{\pi_{**}}{a}{s}&\text{if }T\neq t\text{ or }s\neq s_0\\1&\text{else if }a=a_{*}\\0&\text{else}} $$

Then $\phi_t$ is an optimal policy:

$$ \phi_t\in\argsup{\pi}v_{\pi}(s_0) $$

and

$$ v_{\phi_t}(s_0)=\sup_{\pi}v_{\pi}(s_0)=v_{*}(s_0) $$

Proof For any policy $\pi$, we have

$$\align{ v_{\pi}(s_0) &= \sum_{a\in\mcalA(s_0)}\condpmfsym{\pi}{a}{s_0}\cdot q_{\pi}(s_0,a) \\ &\leq \sum_{a\in\mcalA(s_0)}\condpmfsym{\pi}{a}{s_0}\cdot q_{*}(s_0,a) \\ &\leq \sum_{a\in\mcalA(s_0)}\condpmfsym{\pi}{a}{s_0}\cdot q_{*}(s_0,a_{*}) \\ &= q_{*}(s_0,a_{*})\cdot\sum_{a\in\mcalA(s_0)}\condpmfsym{\pi}{a}{s_0} \\ &= q_{*}(s_0,a_{*}) \\ &= q_{\pi_{**}}(s_0,a_{*}) \\ &= q_{\phi_t}(s_0,a_{*})\tag{by MDP.5} \\ &= \sum_{a\in\mcalA(s_0)}\condpmfsym{\phi_t}{a}{s_0}\cdot q_{\phi_t}(s_0,a) \\ &= v_{\phi_t}(s_0) }$$

$\wes$

Proposition MDP.7 Let $s_0\in\mcalS$ and let $a_{*}\in\argmax{a\in\mcalA(s_0)}q_{*}(s_0,a)$. Suppose $\argsup{\pi}q_{\pi}(s_0,a_{*})$ is nonempty. Then

$$ v_{*}(s_0) = \max_{a\in\mcalA(s_0)}q_{*}(s_0,a) $$

Proof For any policy $\pi$, we have

$$\align{ v_{\pi}(s_0) &= \sum_{a\in\mcalA({s_0})}\plcypmf{a}{s_0}\cdot q_{\pi}(s_0,a) \\ &\leq \sum_{a\in\mcalA({s_0})}\plcypmf{a}{s_0}\cdot q_{*}(s_0,a) \\ &\leq \sum_{a\in\mcalA({s_0})}\plcypmf{a}{s_0}\cdot q_{*}(s_0,a_{*}) \\ &= q_{*}(s_0,a_{*})\cdot\sum_{a\in\mcalA({s_0})}\plcypmf{a}{s_0} \\ &= q_{*}(s_0,a_{*}) \\ &= \max_{a\in\mcalA(s_0)}q_{*}(s_0,a) }$$

Since this is true for all policies $\pi$, it must be that

$$ v_{*}(s_0)=\sup_{\pi}v_{\pi}(s_0)\leq\max_{a\in\mcalA(s_0)}q_{*}(s_0,a) $$

Suppose, by way of contradiction, that we have strict inequality:

$$ v_{*}(s_0)\lt\max_{a\in\mcalA(s_0)}q_{*}(s_0,a) \tag{MDP.7.1} $$

Since $\argsup{\pi}q_{\pi}(s_0,a_{*})$ is nonempty by hypothesis, we may choose a policy $\pi_{**}$ such that

$$ \pi_{**}\in\argsup{\pi}q_{\pi}(s_0,a_{*}) $$

Then

$$ q_{\pi_{**}}(s_0,a_{*}) = \sup_{\pi}q_{\pi}(s_0,a_{*}) = q_{*}(s_0,a_{*}) = \max_{a\in\mcalA(s_0)}q_{*}(s_0,a) $$

Also define the policy $\phi_t$ by

$$ \cpwrt{\phi_t}{A_T=a}{S_T=s}=\condpmfsym{\phi_t}{a}{s}\equiv\cases{\condpmfsym{\pi_{**}}{a}{s}&\text{if }T\neq t\text{ or }s\neq s_0\\1&\text{else if }a=a_{*}\\0&\text{else}} $$

Then MDP.5 gives the third equality:

$$\align{ v_{\phi_t}(s_0) = \sum_{a\in\mcalA(s_0)}\condpmfsym{\phi_t}{a}{s_0}q_{\phi_t}(s_0,a) = q_{\phi_t}(s_0,a_{*})= q_{\pi_{**}}(s_0,a_{*}) = \max_{a\in\mcalA(s_0)}q_{*}(s_0,a) \gt v_{*}(s_0) }$$

This contradicts the definition of $v_{*}(s_0)\equiv\sup_{\pi}v_{\pi}(s_0)$. Hence assumption MDP.7.1 is false and we have

$$ v_{*}(s_0)=\max_{a\in\mcalA(s_0)}q_{*}(s_0,a) $$

$\wes$
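Value iteration gives one way to approximate $v_{*}$ and $q_{*}$ and observe MDP.6 and MDP.7 numerically. This is a standard construction, sketched here on a random MDP of my own rather than anything developed above:

```python
import numpy as np

rng = np.random.default_rng(6)
nS, nA, gamma = 4, 3, 0.9

p_trans = rng.random((nS, nA, nS)); p_trans /= p_trans.sum(2, keepdims=True)
r_sa = rng.normal(size=(nS, nA))

v = np.zeros(nS)
for _ in range(2000):                       # iterate v <- max_a [r + gamma P v]
    v = (r_sa + gamma * p_trans @ v).max(axis=1)

q_star = r_sa + gamma * p_trans @ v         # q_*(s, a) at the fixed point
assert np.allclose(v, q_star.max(axis=1))   # MDP.7: v_*(s) = max_a q_*(s, a)

greedy = q_star.argmax(axis=1)              # the deterministic policy of MDP.6
print(greedy, v)
```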

Proposition MDP.8 Let $s_0\in\mcalS$ and suppose there exists a policy $\pi_{*}$ such that

$$ \pi_{*}\in\argsup{\pi}v_{\pi}(s_0) $$

Let $a_0\in\mcalA(s_0)$ and define the policy $\phi_t$ by

$$ \cpwrt{\phi_t}{A_T=a}{S_T=s}=\condpmfsym{\phi_t}{a}{s}\equiv\cases{\condpmfsym{\pi_{*}}{a}{s}&\text{if }T\neq t\text{ or }s\neq s_0\\1&\text{else if }a=a_0\\0&\text{else}} $$

Then, for all actions $a\in\mcalA(s_0)$, we have

$$ \phi_t\in\argsup{\pi}q_{\pi}(s_0,a) $$

and

$$ q_{\phi_t}(s_0,a)=\sup_{\pi}q_{\pi}(s_0,a)=q_{*}(s_0,a) $$

Additionally, define $a_{\pi*}$ by

$$ a_{\pi*}\in\argmax{a\in\mcalA(s_0)}q_{\pi_{*}}(s_0,a) $$

Then

$$ a_{\pi*}\in\argmax{a\in\mcalA(s_0)}q_{\phi_t}(s_0,a) $$

Proof For any policy $\pi$ and any action $a\in\mcalA(s_0)$, we compute

$$\align{ q_{\pi}(s_0,a) &= \Ecwrt{\pi}{R_{t+1}}{S_t=s_0,A_t=a}+\gamma\sum_{s'}\condpmf{s'}{s_0,a}v_{\pi}(s')\tag{by MDP.4.2} \\ &\leq \Ecwrt{\pi_{*}}{R_{t+1}}{S_t=s_0,A_t=a}+\gamma\sum_{s'}\condpmf{s'}{s_0,a}v_{\pi_{*}}(s')\tag{by MDP.2} \\ &= q_{\pi_{*}}(s_0,a)\tag{by MDP.4.2} \\ &= q_{\phi_t}(s_0,a)\tag{by MDP.5} }$$

And for any action $a$, we compute

$$ q_{\phi_t}(s_0,a) = q_{\pi_{*}}(s_0,a) \leq q_{\pi_{*}}(s_0,a_{\pi*}) = q_{\phi_t}(s_0,a_{\pi*}) $$

$\wes$

Proposition MDP.9 Let $s\in\mcalS$ and suppose there exists a policy $\pi_{*}$ such that

$$ \pi_{*}\in\argsup{\pi}v_{\pi}(s) $$

Then

$$ v_{*}(s) = \max_{a\in\mcalA(s)}q_{\pi_{*}}(s,a) $$

Proof Put $s_0\equiv s$, so that $\pi_{*}\in\argsup{\pi}v_{\pi}(s_0)$, as MDP.8 requires. Define $a_{\pi*}\in\argmax{a\in\mcalA(s_0)}q_{\pi_{*}}(s_0,a)$ and define the policy $\phi_t$ by

$$ \cpwrt{\phi_t}{A_T=a}{S_T=s}=\condpmfsym{\phi_t}{a}{s}\equiv\cases{\condpmfsym{\pi_{*}}{a}{s}&\text{if }T\neq t\text{ or }s\neq s_0\\1&\text{else if }a=a_{\pi*}\\0&\text{else}} $$

Then for any policy $\pi$, we have

$$\align{ v_{\pi}(s_0) &= \sum_{a\in\mcalA({s_0})}\plcypmf{a}{s_0}\cdot q_{\pi}(s_0,a) \\ &\leq \sum_{a\in\mcalA({s_0})}\plcypmf{a}{s_0}\cdot q_{\phi_t}(s_0,a)\tag{by MDP.8} \\ &\leq \sum_{a\in\mcalA({s_0})}\plcypmf{a}{s_0}\cdot q_{\phi_t}(s_0,a_{\pi*})\tag{by MDP.8} \\ &= q_{\phi_t}(s_0,a_{\pi*})\cdot\sum_{a\in\mcalA({s_0})}\plcypmf{a}{s_0} \\ &= q_{\phi_t}(s_0,a_{\pi*}) \\ &= q_{\pi_{*}}(s_0,a_{\pi*})\tag{by MDP.5} \\ &= \max_{a\in\mcalA(s_0)}q_{\pi_{*}}(s_0,a) }$$

Since this is true for all policies $\pi$, it must be that

$$ v_{*}(s_0)=\sup_{\pi}v_{\pi}(s_0)\leq\max_{a\in\mcalA(s_0)}q_{\pi_{*}}(s_0,a) $$

Suppose, by way of contradiction, that we have strict inequality:

$$ v_{*}(s_0)\lt\max_{a\in\mcalA(s_0)}q_{\pi_{*}}(s_0,a) \tag{MDP.9.1} $$

Then MDP.5 gives the third equality:

$$\align{ v_{\phi_t}(s_0) = \sum_{a\in\mcalA(s_0)}\condpmfsym{\phi_t}{a}{s_0}q_{\phi_t}(s_0,a) = q_{\phi_t}(s_0,a_{\pi*})= q_{\pi_{*}}(s_0,a_{\pi*}) = \max_{a\in\mcalA(s_0)}q_{\pi_{*}}(s_0,a) \gt v_{*}(s_0) }$$

This contradicts the definition of $v_{*}(s_0)\equiv\sup_{\pi}v_{\pi}(s_0)$. Hence assumption MDP.9.1 is false and we have

$$ v_{*}(s_0)=\max_{a\in\mcalA(s_0)}q_{\pi_{*}}(s_0,a) $$

Since $s_0\equiv s$, this is the claimed equality.

$\wes$

Proposition MDP.10 Suppose the policy $\pi$ is deterministic and define $\pi(s)$ to be the action taken in state $s$. Then

$$ v_{\pi}(s) = q_{\pi}(s,\pi(s)) $$

Proof We have

$$\align{ v_{\pi}(s) &= \Ecwrt{\pi}{G_t}{S_t=s} \\ &= \Ec{G_t}{S_t=s,A_t=\pi(S_t),A_{t+1}=\pi(S_{t+1}),\dots} \\ &= \sum_{g_t}g_t\cdot\cp{G_t=g_t}{S_t=s,A_t=\pi(S_t),A_{t+1}=\pi(S_{t+1}),\dots} \\ &= \sum_{g_t}g_t\cdot\cp{G_t=g_t}{S_t=s,A_t=\pi(s),A_{t+1}=\pi(S_{t+1}),\dots} \\ &= \Ecwrt{\pi}{G_t}{S_t=s,A_t=\pi(s)} \\ &= q_{\pi}(s,\pi(s)) }$$

$\wes$
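In code, MDP.10 says that for a deterministic policy, evaluating $q_{\pi}$ at the policy's own action recovers $v_{\pi}$. A quick check on a random MDP of my own:

```python
import numpy as np

rng = np.random.default_rng(8)
nS, nA, gamma = 4, 2, 0.9

p_trans = rng.random((nS, nA, nS)); p_trans /= p_trans.sum(2, keepdims=True)
r_sa = rng.normal(size=(nS, nA))

pi = rng.integers(nA, size=nS)                   # pi(s) is an action index
P = p_trans[np.arange(nS), pi, :]                # P[s, s'] under action pi(s)
r = r_sa[np.arange(nS), pi]
v = np.linalg.solve(np.eye(nS) - gamma * P, r)   # v_pi

q = r_sa + gamma * p_trans @ v                   # q_pi via MDP.4.2
assert np.allclose(v, q[np.arange(nS), pi])      # v_pi(s) = q_pi(s, pi(s))
```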

Proposition MDP.11 Let $\pi$ and $\phi$ be any pair of deterministic policies such that

$$ v_{\pi}(s) \leq q_{\pi}(s,\phi(s)) \qd\text{for all }s\in\mcalS $$

Then the policy $\phi$ must be at least as good as $\pi$. That is

$$ v_{\phi}(s) \geq v_{\pi}(s) \qd\text{for all }s\in\mcalS $$

Proof For all $s\in\mcalS$, we have

$$\align{ v_{\pi}(s) &\leq q_{\pi}(s,\phi(s)) \\ &= \Ec{R_{t+1}}{S_t=s,A_t=\phi(s)}+\gamma\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot v_{\pi}(s')\tag{by MDP.4.2} \\ &\leq \Ec{R_{t+1}}{S_t=s,A_t=\phi(s)}+\gamma\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot q_{\pi}(s',\phi(s')) \\ &= \Ecwrt{\phi}{R_{t+1}}{S_t=s}+\gamma\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\Prn{\Ec{R_{t+2}}{S_{t+1}=s',A_{t+1}=\phi(s')}+\gamma\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot v_{\pi}(s'')} \\ &= \Ecwrt{\phi}{R_{t+1}}{S_t=s}+\gamma\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\Prn{\Ecwrt{\phi}{R_{t+2}}{S_{t+1}=s'}+\gamma\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot v_{\pi}(s'')} \\ &= \Ecwrt{\phi}{R_{t+1}}{S_t=s}+\gamma\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\Ecwrt{\phi}{R_{t+2}}{S_{t+1}=s'} \\ &+ \gamma\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\gamma\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot v_{\pi}(s'') \\ &= \Ecwrt{\phi}{R_{t+1}}{S_t=s}+\gamma\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\Ecwrt{\phi}{R_{t+2}}{S_{t+1}=s',S_{t}=s,A_t=\phi(s)}\tag{by MDP.0} \\ &+ \gamma^2\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot v_{\pi}(s'') \\ &= \Ecwrt{\phi}{R_{t+1}}{S_t=s}+\gamma\Ecwrt{\phi}{R_{t+2}}{S_{t}=s} \\ &+ \gamma^2\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot v_{\pi}(s'') \\ &\leq \Ecwrt{\phi}{R_{t+1}+\gamma R_{t+2}}{S_{t}=s} \\ &+ \gamma^2\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot q_{\pi}(s'',\phi(s'')) \\ &= \Ecwrt{\phi}{R_{t+1}+\gamma R_{t+2}}{S_{t}=s} \\ &+ \gamma^2\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot \Prn{\Ec{R_{t+3}}{S_{t+2}=s'',A_{t+2}=\phi(s'')}+\gamma\sum_{s'''}\condpmf{s'''}{s'',\phi(s'')}\cdot v_{\pi}(s''')} \\ }$$

$$\align{ &= \Ecwrt{\phi}{R_{t+1}+\gamma R_{t+2}}{S_{t}=s} \\ &+ \gamma^2\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot \Prn{\Ecwrt{\phi}{R_{t+3}}{S_{t+2}=s''}+\gamma\sum_{s'''}\condpmf{s'''}{s'',\phi(s'')}\cdot v_{\pi}(s''')} \\ &= \Ecwrt{\phi}{R_{t+1}+\gamma R_{t+2}}{S_{t}=s} \\ &+ \gamma^2\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot\Ecwrt{\phi}{R_{t+3}}{S_{t+2}=s''} \\ &+ \gamma^2\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot \gamma\sum_{s'''}\condpmf{s'''}{s'',\phi(s'')}\cdot v_{\pi}(s''') \\ &= \Ecwrt{\phi}{R_{t+1}+\gamma R_{t+2}}{S_{t}=s} \\ &+ \gamma^2\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot\Ecwrt{\phi}{R_{t+3}}{S_{t+2}=s'',S_{t+1}=s',A_{t+1}=\phi(s')}\tag{by MDP.0} \\ &+ \gamma^3\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot \sum_{s'''}\condpmf{s'''}{s'',\phi(s'')}\cdot v_{\pi}(s''') \\ &= \Ecwrt{\phi}{R_{t+1}+\gamma R_{t+2}}{S_{t}=s} \\ &+ \gamma^2\cdot\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\Ecwrt{\phi}{R_{t+3}}{S_{t+1}=s',A_{t+1}=\phi(s')} \\ &+ \gamma^3\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot \sum_{s'''}\condpmf{s'''}{s'',\phi(s'')}\cdot v_{\pi}(s''') \\ &= \Ecwrt{\phi}{R_{t+1}+\gamma R_{t+2}}{S_{t}=s} \\ &+ \gamma^2\cdot\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\Ecwrt{\phi}{R_{t+3}}{S_{t+1}=s'} \\ &+ \gamma^3\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot \sum_{s'''}\condpmf{s'''}{s'',\phi(s'')}\cdot v_{\pi}(s''') \\ &= \Ecwrt{\phi}{R_{t+1}+\gamma R_{t+2}}{S_{t}=s} \\ &+ \gamma^2\cdot\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\Ecwrt{\phi}{R_{t+3}}{S_{t+1}=s',S_t=s,A_t=\phi(s)}\tag{by MDP.0} \\ &+ \gamma^3\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot \sum_{s'''}\condpmf{s'''}{s'',\phi(s'')}\cdot v_{\pi}(s''') \\ &= \Ecwrt{\phi}{R_{t+1}+\gamma R_{t+2}}{S_{t}=s} \\ &+ \gamma^2\cdot\Ecwrt{\phi}{R_{t+3}}{S_t=s,A_t=\phi(s)} \\ &+ \gamma^3\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot \sum_{s'''}\condpmf{s'''}{s'',\phi(s'')}\cdot v_{\pi}(s''') \\ &= \Ecwrt{\phi}{R_{t+1}+\gamma R_{t+2}}{S_{t}=s} \\ &+ \gamma^2\cdot\Ecwrt{\phi}{R_{t+3}}{S_t=s} \\ &+ \gamma^3\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot \sum_{s'''}\condpmf{s'''}{s'',\phi(s'')}\cdot v_{\pi}(s''') \\ &= \Ecwrt{\phi}{R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}}{S_{t}=s} \\ &+ \gamma^3\sum_{s'}\condpmf{s'}{s,\phi(s)}\cdot\sum_{s''}\condpmf{s''}{s',\phi(s')}\cdot \sum_{s'''}\condpmf{s'''}{s'',\phi(s'')}\cdot v_{\pi}(s''') \\ &\leq \dotsb \\ &= \Ecwrt{\phi}{R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\gamma^3R_{t+4}+\dotsb}{S_{t}=s} \\ &= v_{\phi}(s) }$$

TODO Prove this formally with an induction argument.

$\wes$
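MDP.11 is the heart of policy improvement, and it is easy to witness numerically: take a deterministic $\pi$, define $\phi$ greedily from $q_{\pi}$ so that $v_{\pi}(s)\leq q_{\pi}(s,\phi(s))$ holds by construction, and check that $v_{\phi}\geq v_{\pi}$ everywhere. The random MDP below is again a stand-in of mine:

```python
import numpy as np

rng = np.random.default_rng(7)
nS, nA, gamma = 5, 3, 0.9

p_trans = rng.random((nS, nA, nS)); p_trans /= p_trans.sum(2, keepdims=True)
r_sa = rng.normal(size=(nS, nA))

def v_of(det_policy):
    """Exact v for a deterministic policy via the linear Bellman system."""
    P = p_trans[np.arange(nS), det_policy, :]   # P[s, s'] under the policy
    r = r_sa[np.arange(nS), det_policy]
    return np.linalg.solve(np.eye(nS) - gamma * P, r)

pi = np.zeros(nS, dtype=int)                 # an arbitrary starting policy
v_pi = v_of(pi)
q_pi = r_sa + gamma * p_trans @ v_pi         # q_pi via MDP.4.2
phi = q_pi.argmax(axis=1)                    # greedy: v_pi(s) <= q_pi(s, phi(s))

assert np.all(v_of(phi) >= v_pi - 1e-10)     # MDP.11: phi is at least as good
```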

References

  1. Ross, Sheldon (2010). A First Course in Probability, 8th Edition. Pearson Prentice Hall.