Deep Reinforcement Learning Bootcamp

14 Mar 2019

I found this bootcamp on Deep Reinforcement Learning to be very helpful. In general, the labs consist of coding exercises that reinforce the content and papers presented in the videos and slides. Here are my coded solutions to the labs.

However, there are a few lab exercises that require some math computations. This blog post presents these computations.

Lab 4, part 3.3

In Lab 4, part 3.3, we are asked to compute and code $\nabla_{\theta}\log\prn{\plcypmfprms{\theta}{a}{s}}$. The exercise notes that the action follows a multivariate Gaussian distribution with identity covariance $I$ and mean $\mu=\theta\tilde{s}$, where $\tilde{s}$ is the vector obtained by appending $1$ to the state $s$. This trick allows us to avoid adding a separate bias term. The coded computation for $\log\prn{\plcypmfprms{\theta}{a}{s}}$ is given in lab4/simplepg/main.py by

def point_get_logp_action(theta, ob, action):
    """ 
    :param theta: A matrix of size |A| * (|S|+1)
    :param ob: A vector of size |S|
    :param action: A vector of size |A|
    :return: A scalar
    """
    ob_1 = include_bias(ob)
    mean = theta.dot(ob_1)
    zs = action - mean
    return -0.5 * np.log(2 * np.pi) * theta.shape[0] - 0.5 * np.sum(np.square(zs))

First, let’s verify that the action follows a multivariate Gaussian with mean $\theta\tilde{s}$ and identity covariance. For any given action $a$, we have

$$\align{ \log\prn{\plcypmfprms{\theta}{a}{s}} &= -\frac12\log(2\pi)\norm{A} - \frac12\sum_{k=1}^{\norm{A}}(a_k-(\theta\tilde{s})_k)^2 \\\\ &= -\frac12\norm{A}\log(2\pi) - \frac12(a-\theta\tilde{s})^T(a-\theta\tilde{s}) \\\\ &= -\log\Prn{(2\pi)^{\frac12\norm{A}}} - \frac12(a-\theta\tilde{s})^T(a-\theta\tilde{s}) \\\\ &= -\log\Prn{\sqrt{(2\pi)^{\norm{A}}}} + \log\Prn{\expCbr{-\frac12(a-\theta\tilde{s})^T(a-\theta\tilde{s})}} \\\\ &= \log\Prn{\expCbr{-\frac12(a-\theta\tilde{s})^T(a-\theta\tilde{s})}} - \log\Prn{\sqrt{(2\pi)^{\norm{A}}}} \\\\ &= \log\Prngg{\frac{\expCbr{-\frac12(a-\theta\tilde{s})^T(a-\theta\tilde{s})}}{\sqrt{(2\pi)^{\norm{A}}}}} \\\\ &= \log\Prngg{\frac{\expCbr{-\frac12(a-\theta\tilde{s})^TI^{-1}(a-\theta\tilde{s})}}{\sqrt{(2\pi)^{\norm{A}}\norm{I}}}} }$$

Since the $\log$ function is injective, it must be that

$$ \plcypmfprms{\theta}{a}{s} = \frac{\expCbr{-\frac12(a-\theta\tilde{s})^TI^{-1}(a-\theta\tilde{s})}}{\sqrt{(2\pi)^{\norm{A}}\norm{I}}} $$

Since $a$ was chosen arbitrarily and the right-hand side is the density of the multivariate Gaussian with mean $\theta\tilde{s}$ and covariance $I$, our verification is complete.
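As a quick numerical sanity check of this (a sketch, not part of the lab; it assumes only that include_bias appends a $1$ to the observation, as the lab code does), we can compare the value computed by the formula in point_get_logp_action against scipy.stats.multivariate_normal.logpdf:

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical small example: |S| = 3, |A| = 2
rng = np.random.RandomState(0)
theta = rng.randn(2, 4)     # |A| x (|S|+1)
ob = rng.randn(3)           # state s
action = rng.randn(2)       # action a

ob_1 = np.append(ob, 1.0)   # s-tilde: the state with a 1 appended
mean = theta.dot(ob_1)      # mu = theta * s-tilde

# Log-density as computed in point_get_logp_action
logp_lab = -0.5 * np.log(2 * np.pi) * theta.shape[0] - 0.5 * np.sum(np.square(action - mean))

# Log-density of a multivariate Gaussian with mean mu and identity covariance
logp_scipy = multivariate_normal.logpdf(action, mean=mean, cov=np.eye(2))

assert np.allclose(logp_lab, logp_scipy)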

Let’s compute the partials:

$$\align{ \wpart{\log\prn{\plcypmfprms{\theta}{a}{s}}}{\theta_{ij}} &= -\frac12\sum_{k=1}^{\norm{A}}\wpartb{(a_k-(\theta\tilde{s})_k)^2}{\theta_{ij}} \\ &= -\frac12\sum_{k=1}^{\norm{A}}2(a_k-(\theta\tilde{s})_k)\wpartb{a_k-(\theta\tilde{s})_k}{\theta_{ij}} \\ &= -\sum_{k=1}^{\norm{A}}(a_k-(\theta\tilde{s})_k)\wpartb{a_k-(\theta\tilde{s})_k}{\theta_{ij}} \\ &= -\sum_{k=1}^{\norm{A}}(a_k-(\theta\tilde{s})_k)\wpartb{-(\theta\tilde{s})_k}{\theta_{ij}} \\ &= \sum_{k=1}^{\norm{A}}(a_k-(\theta\tilde{s})_k)\wpartb{(\theta\tilde{s})_k}{\theta_{ij}} \\ &= \sum_{k=1}^{\norm{A}}(a_k-(\theta\tilde{s})_k)\indic{k=i}\tilde{s}_j \\ &= (a_i-(\theta\tilde{s})_i)\tilde{s}_j }$$

Hence

$$\align{ \nabla_{\theta}\log\prn{\plcypmfprms{\theta}{a}{s}} &= \pmtrx{ (a_1-(\theta\tilde{s})_1)\tilde{s}_1&\dotsb&(a_1-(\theta\tilde{s})_1)\tilde{s}_{\norm{S}+1}\\ \vdots&\ddots&\vdots\\ (a_\norm{A}-(\theta\tilde{s})_\norm{A})\tilde{s}_1&\dotsb&(a_\norm{A}-(\theta\tilde{s})_\norm{A})\tilde{s}_{\norm{S}+1} } \\\\ &= \pmtrx{a_1-(\theta\tilde{s})_1\\\vdots\\a_\norm{A}-(\theta\tilde{s})_\norm{A}}\pmtrx{\tilde{s}_1&\dotsb&\tilde{s}_{\norm{S}+1}} \\\\ &= \cbrfit{\pmtrx{a_1\\\vdots\\a_\norm{A}}-\pmtrx{(\theta\tilde{s})_1\\\vdots\\(\theta\tilde{s})_\norm{A}}}\pmtrx{\tilde{s}_1\\\vdots\\\tilde{s}_{\norm{S}+1}}^T \\\\ &= (a-\theta\tilde{s})\tilde{s}^T \\\\ &= (a-\theta\tilde{s})\otimes\tilde{s} }$$

where the symbol $\otimes$ denotes the outer product. Hence our coded gradient is

def point_get_grad_logp_action(theta, ob, action):
    """
    :param theta: A matrix of size |A| * (|S|+1)
    :param ob: A vector of size |S|
    :param action: A vector of size |A|
    :return: A matrix of size |A| * (|S|+1)
    """
    grad = np.zeros_like(theta)
    "*** YOUR CODE HERE ***"
    ob_1 = include_bias(ob)
    grad = np.outer(action - theta.dot(ob_1), ob_1)
    return grad
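To gain some confidence in the closed-form gradient, here is a small finite-difference check (a self-contained sketch with its own log-probability helper rather than the lab's functions):

import numpy as np

def logp(theta, ob_1, action):
    # log pi_theta(a | s) for a Gaussian policy with mean theta * s-tilde and identity covariance
    zs = action - theta.dot(ob_1)
    return -0.5 * np.log(2 * np.pi) * theta.shape[0] - 0.5 * np.sum(np.square(zs))

rng = np.random.RandomState(1)
theta = rng.randn(2, 4)               # |A| x (|S|+1)
ob_1 = np.append(rng.randn(3), 1.0)   # s-tilde
action = rng.randn(2)

# Analytic gradient: (a - theta * s-tilde) outer s-tilde
grad_analytic = np.outer(action - theta.dot(ob_1), ob_1)

# Forward-difference approximation, one entry of theta at a time
eps = 1e-6
grad_numeric = np.zeros_like(theta)
for i in range(theta.shape[0]):
    for j in range(theta.shape[1]):
        theta_plus = theta.copy()
        theta_plus[i, j] += eps
        grad_numeric[i, j] = (logp(theta_plus, ob_1, action) - logp(theta, ob_1, action)) / eps

assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)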

Lab 4, part 3.6

In Lab 4, part 3.6, we are given that the action space is discrete and that the probability of an action $a$ is given by $\text{softmax}(\theta\tilde{s})_a$. We are also given code that computes the objective, the $\log$ of the $\text{softmax}$ probability of the chosen action:

def compute_logits(theta, ob):
    """ 
    :param theta: A matrix of size |A| * (|S|+1)
    :param ob: A vector of size |S|
    :return: A vector of size |A|
    """
    ob_1 = include_bias(ob)
    logits = ob_1.dot(theta.T)
    return logits


def cartpole_get_logp_action(theta, ob, action):
    """ 
    :param theta: A matrix of size |A| * (|S|+1)
    :param ob: A vector of size |S|
    :param action: An integer
    :return: A scalar
    """
    return log_softmax(compute_logits(theta, ob))[action]

def log_softmax(logits):
    return logits - scipy.special.logsumexp(logits, axis=-1, keepdims=True)

We are then asked to compute and code the gradient for this. Let’s first write this loss as a simple function of $\theta$, denoted by $f(\theta)$:

$$\align{ f(\theta) &\equiv \text{log_softmax}(\text{logits})[a] \\\\ &= \Prngg{\text{logits} -^B \log\Prn{\sum_{k=1}^{\norm{A}}\exp\prn{\text{logits}_k}}}[a] \\\\ &= \text{logits}[a] - \log\Prn{\sum_{k=1}^{\norm{A}}\exp\prn{\text{logits}_k}} \\\\ &= (\tilde{s}^T\theta^T)[a] - \log\Prn{\sum_{k=1}^{\norm{A}}\exp\prn{(\tilde{s}^T\theta^T)_k}} \\\\ &= \tilde{s}^T(\theta_{a})^T - \log\Prn{\sum_{k=1}^{\norm{A}}\exp\prn{\tilde{s}^T(\theta_k)^T}} \\\\ &= \theta_{a}\tilde{s} - \log\Prn{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})} }$$

where $-^B$ denotes a broadcast subtraction. Now it’s easy to verify that the $\text{log_softmax}$ function defined in the given code indeed computes the $\log$ of the $\text{softmax}$ probabilities.
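For instance, a quick numeric check (a sketch; the direct computation below is the mathematically equivalent but less numerically stable form):

import numpy as np
from scipy.special import logsumexp

logits = np.array([2.0, -1.0, 0.5])

# log_softmax as defined in the given code
stable = logits - logsumexp(logits, axis=-1, keepdims=True)

# Direct computation: log of the softmax probabilities
direct = np.log(np.exp(logits) / np.sum(np.exp(logits)))

assert np.allclose(stable, direct)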

Let’s compute the partials:

$$ \wpart{f(\theta)}{\theta_{ij}} = \wpart{(\theta_{a}\tilde{s})}{\theta_{ij}} - \wpart{\log\Prn{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})}}{\theta_{ij}} $$

The first term on the right is

$$ \wpart{(\theta_{a}\tilde{s})}{\theta_{ij}} = \cases{\tilde{s}_j&\text{if }a=i\\0&\text{else}} = \indic{a=i}\tilde{s}_j $$

The second term on the right is

$$\align{ \wpart{\log\Prn{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})}}{\theta_{ij}} &= \frac{\wpartb{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})}{\theta_{ij}}}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})} \\\\ &= \frac{\sum_{k=1}^{\norm{A}}\wpartb{\exp(\theta_k\tilde{s})}{\theta_{ij}}}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})} \\\\ &= \frac{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})\wpart{(\theta_k\tilde{s})}{\theta_{ij}}}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})} \\\\ &= \frac{\exp(\theta_i\tilde{s})\tilde{s}_j}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})} \\\\ &= \frac{\exp(\theta_i\tilde{s})}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})}\tilde{s}_j }$$

Hence

$$\align{ \wpart{f(\theta)}{\theta_{ij}} &= \indic{a=i}\tilde{s}_j - \frac{\exp(\theta_i\tilde{s})}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})}\tilde{s}_j \\ &= \Prngg{\indic{a=i} - \frac{\exp(\theta_i\tilde{s})}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})}}\tilde{s}_j }$$

Hence

$$\align{ \nabla_{\theta}f &= \pmtrx{ \Prn{\indic{a=1} - \frac{\exp(\theta_1\tilde{s})}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})}}\tilde{s}_1&\dotsb& \Prn{\indic{a=1} - \frac{\exp(\theta_1\tilde{s})}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})}}\tilde{s}_{\norm{S}+1}\\ \vdots&\ddots&\vdots\\ \Prn{\indic{a=\norm{A}} - \frac{\exp(\theta_{\norm{A}}\tilde{s})}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})}}\tilde{s}_1&\dotsb& \Prn{\indic{a=\norm{A}} - \frac{\exp(\theta_{\norm{A}}\tilde{s})}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})}}\tilde{s}_{\norm{S}+1} } \\\\ &= \pmtrx{ \indic{a=1} - \frac{\exp(\theta_1\tilde{s})}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})}\\\vdots\\ \indic{a=\norm{A}} - \frac{\exp(\theta_{\norm{A}}\tilde{s})}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})} } \pmtrx{\tilde{s}_1&\dotsb&\tilde{s}_{\norm{S}+1}} \\\\ &= \cbrfit{\pmtrx{\indic{a=1}\\\vdots\\\indic{a=\norm{A}}} -\pmtrx{ \frac{\exp(\theta_1\tilde{s})}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})}\\\vdots\\ \frac{\exp(\theta_{\norm{A}}\tilde{s})}{\sum_{k=1}^{\norm{A}}\exp(\theta_k\tilde{s})} }} \pmtrx{\tilde{s}_1\\\vdots\\\tilde{s}_{\norm{S}+1}}^T \\\\ &= (e_a-\text{softmax}(\theta\tilde{s}))\tilde{s}^T \\\\ &= (e_a-\text{softmax}(\theta\tilde{s}))\otimes\tilde{s} }$$

where $e_a$ is the one-hot vector whose $a^{th}$ entry is $1$ and whose remaining entries are $0$. Hence our coded gradient is

def cartpole_get_grad_logp_action(theta, ob, action):
    """ 
    :param theta: A matrix of size |A| * (|S|+1)
    :param ob: A vector of size |S|
    :param action: An integer
    :return: A matrix of size |A| * (|S|+1)
    """
    grad = np.zeros_like(theta)
    "*** YOUR CODE HERE ***"
    ob_1 = include_bias(ob)
    e_a = np.zeros((theta.shape[0],))
    e_a[action] = 1 
    probs = softmax(theta.dot(ob_1))
    grad = np.outer(e_a - probs, ob_1)
    return grad
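As in part 3.3, we can sanity-check the formula $(e_a-\text{softmax}(\theta\tilde{s}))\otimes\tilde{s}$ with finite differences (again a self-contained sketch with its own helpers rather than the lab's):

import numpy as np
from scipy.special import logsumexp

def logp_discrete(theta, ob_1, action):
    # log softmax(theta * s-tilde)[a]
    logits = theta.dot(ob_1)
    return (logits - logsumexp(logits))[action]

rng = np.random.RandomState(2)
theta = rng.randn(2, 5)               # |A| x (|S|+1), e.g. CartPole has |S| = 4 and |A| = 2
ob_1 = np.append(rng.randn(4), 1.0)   # s-tilde
action = 1

# Analytic gradient: (e_a - softmax(theta * s-tilde)) outer s-tilde
logits = theta.dot(ob_1)
probs = np.exp(logits - logsumexp(logits))
e_a = np.zeros(theta.shape[0])
e_a[action] = 1.0
grad_analytic = np.outer(e_a - probs, ob_1)

# Forward-difference approximation
eps = 1e-6
grad_numeric = np.zeros_like(theta)
for i in range(theta.shape[0]):
    for j in range(theta.shape[1]):
        theta_plus = theta.copy()
        theta_plus[i, j] += eps
        grad_numeric[i, j] = (logp_discrete(theta_plus, ob_1, action) - logp_discrete(theta, ob_1, action)) / eps

assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)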