A blog post by Ertem Nusret Tas

In this blog post, I will summarize an application of information theory within reinforcement learning as described in the paper ‘An Information-Theoretic Analysis of Thompson Sampling’ by Daniel Russo and Benjamin Van Roy. In this context, I will first describe the setup and lay out the problem statement for reinforcement learning. Then, I will explain how Thompson Sampling works and why it is a good solution for the problem. Finally, I will talk about how information theory helps in a formal treatment of the performance of Thompson Sampling as well as for incorporating different types of knowledge into the final solution.

As the setting for our problem, consider a video game where there is a universe \mathcal{P} of planets with different laws of physics. As a player, we are placed at a planet p^* ; however, we do not apriori know which planet we are placed at. Regardless, we are allowed to have a prior distribution on which planet it could be.

At each time step t = 1,2,... of the game, we perform an action A_t from the set of available actions \mathcal{A} , and observe an outcome for our action denoted by Y_{t,A_t} from the set \mathcal{Y} . Note that if we knew what planet we were placed at, we would have known its laws, and thus to some extent, the rough outcome for any action. More formally, p^* is chosen from a given family of distributions \mathcal{P} such that the set of outcomes Y_t = \{Y_{t,a}|a \in \mathcal{A}\} has distribution p^* over \mathcal{Y}^{|\mathcal{A}|} . We further assume that given the distribution p^* , \{Y_t\}_{t \in \mathbb{N}} forms an i.i.d sequence.

Now, there exists a reward function called R , which specifies a reward R(y) for any observed output y . As a player, our goal is to maximize our accumulated reward over time. This would have been easy with the knowledge of the distribution p^* ; because then, we could have simply selected the same action A^* with the largest expected reward, i.e A^* = \argmax_{a \in \mathcal{A}} E[R(Y_{t,a})] , at each time step t . However, without this knowledge, we are doomed to perform actions that we will eventually regret (like trying to jump on a planet with a huge gravity). More formally, by time step T , we will incur a total *regret* of \mathrm{Regret}(T) = \sum_{t=1}^T R(Y_{t,A^*})-R(Y_{t,A_t}) .

Although the situation looks bleak, not all hope is lost! Bad actions can hurt us but they still teach us about the laws of physics. (After trying to jump, you would know that diving should be much more fun on a planet with high gravity.) Thus, we can hope to decrease the amount of regret we incur at future time steps by discovering new actions. In fact, even after discovering a good action, an expert player would occasionally diverge from it to explore better actions. In this context, we can come up with *policies* \pi that strike a good balance between exploration of new actions and the exploitation of the good actions already found. Our goal in this regard would be finding policies with a good balance of exploration and exploitation that minimize the expected total regret, i.e the **Bayesian regret** \mathbb{E}[\mathrm{Regret}(T)] , for the duration T of the game.

One policy that tries to balance exploration and exploitation is Thompson Sampling (TS). (To be more exact, policies *call *algorithms to receive actions and TS is one such algorithm.) To understand TS, let’s define F_t as the knowledge we have learned from our actions and their outcomes before time t . More formally, F_t is the sigma algebra of the action-outcome tuples observed until time t : F_t = \sigma((A_1,Y_{1,A_1}), (A_2,Y_{2,A_2}), ..., (A_{t_1},Y_{1,A_{t-1}})) . Having defined F_t , TS chooses the next action at time t to be A_t = a with probability \mathbb{P}(A^* = a|F_t) . In other words, given all our past observations F_t , if we believe a certain action a to be optimal for our current estimate of p^* , then, TS urges us to select it for the next time step with high probability. Hence, under TS, \mathbb{P}(A = a|F_t) = \mathbb{P}(A^* = a|F_t) . This equation gives TS the other name it commonly goes by: probability matching.

One could ask why we should sample different actions at all instead of sticking with the action a^* for which \mathbb{P}(A^* = a|F_t) is maximized. After all, a^* is the action that is *most likely* to be optimal given our past observations. However, note that such a policy would dramatically reduce exploration efforts and make it likely for us to stick to a sub-optimal action indefinitely. As mentioned before, a good policy walks on the thin line between exploration and exploitation.

Next, we quantify the performance of TS by analyzing Bayesian regret. This is exactly where information theory comes to our aid!! For this purpose, the paper defines the concept of** Information Ratio (IR)**:

\Gamma_t = \frac{\mathbb{E}_t[R(Y_{t,A^*})-R(Y_{t,A_t})]^2}{I_t(A^*;(A_t,Y_{t,A_t})}

Observe that IR associated with time step t is the ratio of the expected regret (rather its square) at that time step over the mutual information between the optimal action given p^* and the action-output tuple observed at that time step. The subscript on the expectation and mutual information signifies that all of these values are conditioned on our past observations F_t , and are calculated using the posterior distribution \mathbb{P}(.|F_t) . Notice that this makes \Gamma_t a random variable.

How does IR help us? Proposition 1 of the paper shows that it can be used to bound Bayesian regret: If \Gamma_t \leq \overline{\Gamma} for all t for some \overline{\Gamma} , then, under TS,

\mathbb{E}[\mathrm{Regret}(T)] \leq \sqrt{\overline{\Gamma}H(A^*)T}

As argued by the paper, this bound on Bayesian regret carries two types of information:

**Soft Knowledge** stands for our prior knowledge for p^* . This information enters the bound through the entropy H(A^*) term. For instance, if there were only two distributions p_1,p_2 in \mathcal{P} with a single yet distinct best action for both p_1 and p_2 , then in the case that both distributions are equally likely, H(A^*) would be the entropy of a Bernoulli-1/2 random variable, which is 1. However, if our prior knowledge told us that we are more likely to start with distribution p_1 , then H(A^*) , thus the bound on Bayesian regret, will be lower. This implies that a more *informed* player is expected to have a smaller *regret* in the game. Although this is an intuitive observation, many of the past results on Bayesian regret did not incorporate prior knowledge into the bound. This highlights the strength of information theoretic methods in analyzing Bayesian regret.

**Hard Knowledge** reflects our knowledge of the structure of the distribution family \mathcal{P} . This information enters the bound through the term \overline{\Gamma} . For instance, in a family where the outcome of an action does not tell us anything about the outcomes for other actions, \overline{\Gamma} is close to the upper bound |\mathcal{A}|/2 . On the other hand, in a family where we can simultaneously learn the outcomes Y_{t,a} for all of the actions a at any time step, it is possible to show that \overline{\Gamma} \leq 1/2 . Thus, in the case of *full information*, we would in expectation incur a smaller regret. Notice how \overline{\Gamma} depends on the distribution family.

(The fact that the bound above features a \sqrt{T} term implies the ability of a player using TS to learn about better actions over time. Note that the difference between the upper bounds for different horizons T and T+ \Delta approaches 0 as T \to \infty , implying that the amount of regret acquired over a fixed interval decays to zero as time grows.)

Bounding Bayesian regret in terms of IR reduces the problem to bounding IR itself. In this context, the paper presents three main results for bounding IR under TS:

**A general upper bound:**\overline{\Gamma} \leq |\mathcal{A}|/2 . This bound becomes order optimal for distribution families where the outcome of an action does not tell anything about the outcomes for other actions.**An upper bound for the full information case:**\overline{\Gamma} \leq 1/2 .**Linear Optimization:**Suppose \mathcal{A} \subseteq \mathbb{R}^d and for each p \in \mathcal{P} , there exists a parameter \theta_p \in \mathbb{R}^d such that for all a \in \mathcal{A} , \mathbb{E}[R(Y_{t,a})] = a^T \theta_p . (See the algorithm above which describes TS for such a parametric family of distributions.) Then, the IR is upper bounded by \overline{\Gamma} \leq d/2 . Observe that this case lies between the no-information and full information cases. Thus, the bound on IR is smaller than the general upper bound, yet larger than the full information case, reflecting the specific structure of this distribution family featuring dimension d .

Finally, we have seen how information theory can help with the analysis of Bayesian regret and bring knowledge ignored by many of the past works to bear on the regret bound. We have also learned about the basics of the model used by reinforcement learning as well as certain designs principles such as exploration vs. exploitation. In particular, we have observed how one particularly strong algorithm, Thompson Sampling, finds a good balance between exploration and exploitation. I hope you have enjoyed my summary of the paper (see references), and for more information, please check out the original work. (It is a great read!)

**References:**

Daniel Russo and Benjamin Van Roy. 2016. An information-theoretic analysis of Thompson sampling. *J. Mach. Learn. Res.* 17, 1 (January 2016), 2442–2471.