2019-02-26

Momentum, Curveball, and Nesterov Acceleration


This post explores some relationships between the recent Curveball optimization method (https://arxiv.org/abs/1805.08095, more recent version at https://openreview.net/forum?id=Sygx4305KQ) and Nesterov momentum (as popularized by http://proceedings.mlr.press/v28/sutskever13.html).

Suppose we are iteratively optimizing a scalar-valued function $f(x)$ wrt $x$ by gradient descent. We wish to take a step $z$ that minimizes $f(x + z)$, whose Taylor series is given by
$$f(x + z) = f(x) + z^\top f'(x) + \frac{1}{2} z^\top f''(x) z + \ldots$$
where $f'(x)$ denotes the gradient of $f$ evaluated at $x$, and similarly $f''(x)$ is the Hessian. Second-order methods locally model $f(x)$ by a second-order approximation $\hat{f}$:
$$\hat{f}(x + z) = f(x) + z^\top f'(x) + \frac{1}{2} z^\top f''(x) z,$$
according to which the optimal step is
$$z^\star = \operatorname{argmin}_z f(x) + z^\top f'(x) + \frac{1}{2} z^\top f''(x) z = -(f''(x))^{-1} f'(x).$$
However, multiplying by the inverse Hessian is expensive.

The authors of Curveball propose instead to solve for $z^\star$ by gradient descent, and to interleave this inner optimization with the outer optimization over $x$:
\begin{align}
z &\leftarrow \rho z - \beta \frac{\partial \hat{f}(x + z)}{\partial z} \\
x &\leftarrow x + z
\end{align}
where $\rho, \beta$ are hyperparameters.
Now $\frac{\partial \hat{f}(x + z)}{\partial z} = f'(x) + f''(x) z$; if the approximation $\hat{f}$ were first-order instead, the $f''(x) z$ term would disappear and this would be equivalent to classical momentum.

But if we're going to use gradients to optimize $z$, then why make the second-order approximation at all? Why not let $\hat{f} = f$ and optimize the step $z$ to minimize $f(x + z)$:
\begin{align}
z &\leftarrow \rho z - \beta \frac{\partial f(x + z)}{\partial z} \\
x &\leftarrow x + z
\end{align}
By a change of variables $y = x + z$, the derivative $\frac{\partial f(x + z)}{\partial z} = \frac{\partial f(y)}{\partial y} \frac{\partial y}{\partial z}$ is simply $f'(x + z)$. From this it is easy to see that this update rule is practically equivalent to that for Nesterov momentum given in Sutskever et al 2013:
\begin{align}
z &\leftarrow \rho z - \beta f'(x + \rho z) \\
x &\leftarrow x + z.
\end{align}
The only difference is the weighting by $\rho$ of the tentative step in the evaluation point $x + \rho z$ of $f'$.

What does that mean? It means that if classical momentum is first-order and Curveball is second-order, well then Nesterov momentum is the infinite-order version of momentum. It's tempting to conclude that Curveball is a worse version of Nesterov momentum, as it makes an unnecessary second-order approximation. However, the Curveball paper goes on to adopt all the usual second-order tricks (GGN, trust region, etc.). It's unclear how the final algorithm performs relative to Nesterov momentum.
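
To make the comparison concrete, here is a small numpy sketch (mine, not code from either paper) of the three update rules on a toy quadratic, where the gradient and Hessian-vector product are available in closed form; the function names and hyperparameter values are illustrative.

    import numpy as np

    # toy quadratic f(x) = 0.5 x^T A x - b^T x, so f'(x) = A x - b and f''(x) z = A z
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))
    A = A @ A.T + np.eye(5)                 # symmetric positive definite Hessian
    b = rng.standard_normal(5)
    f_grad = lambda x: A @ x - b
    f_hvp = lambda x, z: A @ z              # Hessian-vector product f''(x) z

    rho, beta = 0.9, 0.01                   # illustrative hyperparameters

    def step_classical(x, z):               # first-order model of f(x + z)
        z = rho * z - beta * f_grad(x)
        return x + z, z

    def step_curveball(x, z):               # second-order model of f(x + z)
        z = rho * z - beta * (f_grad(x) + f_hvp(x, z))
        return x + z, z

    def step_nesterov(x, z):                # no model: evaluate f' at the tentative point
        z = rho * z - beta * f_grad(x + rho * z)
        return x + z, z

On a quadratic the second-order model is exact, so step_curveball coincides with the $f'(x + z)$ variant above; the only remaining difference from step_nesterov is the $\rho$ in the evaluation point.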

2019-02-16

On the Variance of Unbiased Online Recurrent Optimization

This blog post summarizes the brand new paper On the Variance of Unbiased Online Recurrent Optimization (Cooijmans & Martens, 2019), an extensive investigation into the UORO (Tallec & Ollivier, 2017) algorithm for online training of recurrent neural networks. The work was done in close collaboration with James Martens, who supervised me during my internship at DeepMind where the work started.

I'm excited about this work; what seemed like a simple problem -- effective forward-mode differentiation -- turned out to be a rich source of deeply interesting connections and possibilities.

Differentiation through time


We'll consider a general class of differentiable recurrent models with state $h_t$ governed by:
$$
h_t = F (h_{t - 1}, x_t ; \theta_t) .
$$
Here $x_t$ is an observation made at time $t$, and $\theta_t$ are the RNN's parameters (e.g. weight matrix). As usual, the parameters are shared over time, i.e. $\theta_t = \theta$; having the subscript $t$ on the parameters $\theta$ conveniently lets us refer to the partial derivatives $\frac{\partial h_t}{\partial \theta}$ by the total derivative $\frac{\mathrm{d} h_t}{\mathrm{d} \theta_t}$. We will generally use the notation $\mathcal{J}^{y}_{x}$ for the (total) Jacobian matrix of $y$ with respect to $x$.

[Figure: State propagation through an RNN]

At each step, the RNN incurs a loss $L_t$ which is a differentiable function of the hidden state $h_t$. In order to optimize $\theta$ to minimize the total loss $L = \sum_{t = 1}^T L_t$ over a sequence of length $T$, we require an estimate of the gradient $\mathcal{J}^L_{\theta}$.
$$
\mathcal{J}^L_{\theta} = \sum_{t = 1}^T \sum_{s = 1}^T
   \mathcal{J}^{L_t}_{\theta_s} = \underbrace{\sum_{s = 1}^T \left( \sum_{t =
   s}^T \mathcal{J}^{L_t}_{h_s} \right)
   \mathcal{J}^{h_s}_{\theta_s}}_{\text{reverse accumulation}} =
   \underbrace{\sum_{t = 1}^T \mathcal{J}^{L_t}_{h_t} \left( \sum_{s = 1}^t
   \mathcal{J}^{h_t}_{\theta_s} \right)}_{\text{forward accumulation}}
$$
Each of the terms $\mathcal{J}^{L_t}_{\theta_s}$ indicates how the use of the parameter $\theta$ at time $s$ contributed to the loss at time $t$. The triangular/causal structure $\mathcal{J}^{L_t}_{\theta_s} =\mathbb{1}_{s \leqslant t} \mathcal{J}^{L_t}_{\theta_s}$ allows two useful factorizations of the double sum.

The first, labeled reverse accumulation, is used by the popular Backpropagation Through Time algorithm (BPTT). In BPTT we run the model forward to compute the activations and losses, and subsequently run backward to propagate gradient $\mathcal{J}^L_{h_t}$ back through time:
$$
\mathcal{J}^L_{h_t} =\mathcal{J}^L_{h_{t + 1}} \mathcal{J}^{h_{t +
   1}}_{h_t} +\mathcal{J}^{L_t}_{h_t} .
$$
By following this recursion, we can aggregate the terms $\mathcal{J}^L_{\theta_t} =\mathcal{J}^L_{h_t} \mathcal{J}^{h_t}_{\theta_t}$ to compute the gradient. The backpropagation $\mathcal{J}^L_{h_{t + 1}} \mathcal{J}^{h_{t + 1}}_{h_t}$ is a vector-matrix product, which has the same cost as the forward propagation of state $h_t$ in a typical RNN.

[Figure: Back-propagation of $\mathcal{J}^L_{h_t}$ by BPTT]
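
To make the reverse recursion concrete, here is a toy numpy sketch (illustrative only, not code from the paper) of BPTT on a small tanh RNN $h_t = \tanh(W a_t)$ with $a_t = [h_{t-1}; x_t; 1]$ and a hypothetical per-step loss $L_t = \frac{1}{2}\|h_t\|^2$:

    import numpy as np

    H, X, T = 4, 3, 6
    rng = np.random.default_rng(0)
    W = rng.standard_normal((H, H + X + 1)) * 0.5
    xs = rng.standard_normal((T, X))

    # forward pass: store the activations that the backward pass will revisit
    h = np.zeros(H)
    inputs, states = [], []
    for t in range(T):
        a = np.concatenate([h, xs[t], [1.0]])
        h = np.tanh(W @ a)
        inputs.append(a)
        states.append(h)

    # backward pass: J^L_{h_t} = J^L_{h_{t+1}} J^{h_{t+1}}_{h_t} + J^{L_t}_{h_t}
    dW = np.zeros_like(W)
    dh = np.zeros(H)                        # gradient flowing in from the future
    for t in reversed(range(T)):
        dh = dh + states[t]                 # J^{L_t}_{h_t} for the loss 0.5 ||h_t||^2
        dp = (1 - states[t] ** 2) * dh      # back through tanh onto the preactivations
        dW += np.outer(dp, inputs[t])       # contribution J^L_{h_t} J^{h_t}_{theta_t}
        dh = W[:, :H].T @ dp                # back through W onto h_{t-1}

Note how the forward activations are stored and revisited in reverse order; that storage is exactly the drawback discussed below.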

The second factorization (forward accumulation) is used by the Real-Time Recurrent Learning algorithm (RTRL). The recursion
$$
\mathcal{J}^{h_t}_{\theta} =\mathcal{J}^{h_t}_{h_{t - 1}} \mathcal{J}^{h_{t
   - 1}}_{\theta} +\mathcal{J}^{h_t}_{\theta_t}
$$
is chronological and can be computed alongside the RNN state. Given $\mathcal{J}^{h_t}_{\theta}$, we can compute the term $\mathcal{J}^{L_t}_{\theta} =\mathcal{J}^{L_t}_{h_t} \mathcal{J}^{h_t}_{\theta}$ and immediately update the parameter $\theta$ (with some technical caveats). The drawback is that the forward propagation $\mathcal{J}^{h_t}_{h_{t - 1}} \mathcal{J}^{h_{t - 1}}_{\theta}$ is an expensive matrix-matrix product. Whereas BPTT cheaply propagated a vector $\mathcal{J}^L_{h_t}$ of the same size as the RNN state, RTRL propagates a matrix $\mathcal{J}^{h_t}_{\theta}$ that consists of one parameter-sized vector for each hidden unit. Since the number of parameters is typically quadratic in the size of the hidden state, storing this matrix costs cubic memory, and the forward propagation costs quartic time (i.e. for a meager 100 hidden units, RTRL is 10,000 times more expensive than BPTT!).

[Figure: Forward-propagation of $\mathcal{J}^{h_t}_{\theta}$ by RTRL]
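
For contrast, here is a sketch of the forward accumulation on the same toy tanh RNN and hypothetical loss as in the BPTT sketch above (again illustrative only); the np.kron line forms the per-step Jacobian $\mathcal{J}^{h_t}_{\theta_t}$ explicitly, and the matrix-matrix product is exactly the expensive part:

    import numpy as np

    H, X, T = 4, 3, 6
    rng = np.random.default_rng(0)
    W = rng.standard_normal((H, H + X + 1)) * 0.5
    xs = rng.standard_normal((T, X))

    h = np.zeros(H)
    J = np.zeros((H, W.size))               # J^{h_t}_{theta}: one parameter-sized row per unit
    grad = np.zeros(W.size)
    for t in range(T):
        a = np.concatenate([h, xs[t], [1.0]])
        h = np.tanh(W @ a)
        D = np.diag(1 - h ** 2)             # J^{h_t}_{p_t}, the tanh derivative
        # J^{h_t}_{theta} = J^{h_t}_{h_{t-1}} J^{h_{t-1}}_{theta} + J^{h_t}_{theta_t}
        J = D @ W[:, :H] @ J + D @ np.kron(np.eye(H), a)   # expensive matrix-matrix product
        grad += h @ J                       # J^{L_t}_{h_t} J^{h_t}_{theta} for loss 0.5 ||h_t||^2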

It is important to note that the jacobians involved in these recursions depend on the activations $h_t$ and other intermediate quantities. In RTRL, these quantities naturally become available in the order in which they are needed, after which they may be forgotten. BPTT revisits these quantities in reverse order, which requires storing them in a stack.

This is the main drawback of BPTT: its storage grows with the sequence length $T$, which limits the temporal span of dependencies it can capture (as in truncated BPTT) and the rate at which parameter updates can occur.

RTRL has some things to recommend it, if only we had a way of dealing with that giant matrix $\mathcal{J}^{h_t}_{\theta}$.

Unbiased Online Recurrent Optimization


UORO (Tallec & Ollivier, 2017) approximates RTRL by maintaining a rank-one estimate of $\mathcal{J}^{h_t}_{\theta}$ using random projections. A straightforward derivation starts from the expression
$$
\mathcal{J}^{h_t}_{\theta} = \sum_{s \leqslant t} \mathcal{J}^{h_t}_{h_s}
   \mathcal{J}^{h_s}_{\theta_s} .
$$
Into each term, we insert a random rank-one matrix $\nu_s \nu_s^{\top}$ with expectation $\mathbb{E} [\nu_s \nu_s^{\top}] = I$ (e.g. random signs):
$$
\mathcal{J}^{h_t}_{\theta} \approx \sum_{s \leqslant t}
   \mathcal{J}^{h_t}_{h_s} \nu_s \nu_s^{\top} \mathcal{J}^{h_s}_{\theta_s} .
$$
The random projections onto $\nu_s$ serve to compress the matrices $\mathcal{J}^{h_t}_{h_s}$ and $\mathcal{J}^{h_s}_{\theta_s}$ into vector-sized quantities. But accumulating this sum online is still expensive: we must either accumulate the matrix-sized quantities $\mathcal{J}^{h_t}_{h_s} \nu_s \nu_s^{\top} \mathcal{J}^{h_s}_{\theta_s}$ or the sequence of pairs of vectors $\mathcal{J}^{h_t}_{h_s} \nu_s$, $\nu_s^{\top} \mathcal{J}^{h_s}_{\theta_s}$.

We can pull the same trick again and rely on noise to entangle corresponding pairs. Let $\tau$ be a random vector with expectation $\mathbb{E} [\tau \tau^{\top}] = I$. Then
$$
\mathcal{J}^{h_t}_{\theta} \approx \sum_{s \leqslant t}
   \mathcal{J}^{h_t}_{h_s} \nu_s \nu_s^{\top} \mathcal{J}^{h_s}_{\theta_s} =
   \sum_{s \leqslant t} \mathcal{J}^{h_t}_{h_s} \nu_s \tau_s \tau_s
   \nu_s^{\top} \mathcal{J}^{h_s}_{\theta_s} \approx \left( \sum_{s \leqslant
   t} \mathcal{J}^{h_t}_{h_s} \nu_s \tau_s \right) \left( \sum_{s \leqslant t}
   \tau_s \nu_s^{\top} \mathcal{J}^{h_s}_{\theta_s} \right) .
$$
For simplicity, we will replace the independent noise variables $\tau_t,\nu_t$ by a single random vector $u_t = \tau_t \nu_t$.

Now we have two vector-valued sums, which we can efficiently maintain online:
\begin{align}
  \tilde{h}_t & = \mathcal{J}^{h_t}_{h_{t - 1}} \tilde{h}_{t - 1} + u_t \\
  \tilde{w}_t^{\top} & = \tilde{w}_{t - 1}^{\top} + u_t^{\top}
  \mathcal{J}^{h_t}_{\theta_t} .
\end{align}
This joint recursion is similar to that for $\mathcal{J}^{h_t}_{\theta}$ in RTRL; the approximation $\tilde{h}_t \tilde{w}_t^{\top}$ is used as a rank-one stand-in for $\mathcal{J}^{h_t}_{\theta}$. Notice how the unwieldy matrix-matrix product in RTRL has been replaced by a cheap matrix-vector product $\mathcal{J}^{h_t}_{h_{t - 1}} \tilde{h}_{t - 1}$: UORO is as cheap as BPTT.

[Figure: Forward-propagation of noise in UORO]
[Figure: Accumulation of back-propagated noise in UORO]
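
In code, this is a small modification of the RTRL sketch above (again a toy illustration, with random signs for $u_t$ and the same hypothetical per-step loss):

    import numpy as np

    H, X, T = 4, 3, 6
    rng = np.random.default_rng(0)
    W = rng.standard_normal((H, H + X + 1)) * 0.5
    xs = rng.standard_normal((T, X))

    h = np.zeros(H)
    h_tilde = np.zeros(H)                   # forward-propagated noise
    w_tilde = np.zeros(W.size)              # accumulated projected Jacobians
    grad = np.zeros(W.size)
    for t in range(T):
        a = np.concatenate([h, xs[t], [1.0]])
        h = np.tanh(W @ a)
        u = rng.choice([-1.0, 1.0], size=H)                 # E[u u^T] = I
        # h_tilde_t = J^{h_t}_{h_{t-1}} h_tilde_{t-1} + u_t  (cheap matrix-vector product)
        h_tilde = (1 - h ** 2) * (W[:, :H] @ h_tilde) + u
        # w_tilde_t = w_tilde_{t-1} + u_t^T J^{h_t}_{theta_t}, i.e. vec of an outer product
        w_tilde = w_tilde + np.outer((1 - h ** 2) * u, a).ravel()
        grad += (h @ h_tilde) * w_tilde     # J^{L_t}_{h_t} h_tilde w_tilde^T for loss 0.5 ||h_t||^2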

So those are the basic workings of UORO: randomly project in state space and then randomly project once again in time. Both of these projections introduce errors in the approximation $\mathcal{J}^{h_t}_{\theta} \approx \tilde{h}_t \tilde{w}_t^{\top}$, due to connecting the wrong elements together in the matrix-matrix product $\mathcal{J}^{h_t}_{h_s} \mathcal{J}^{h_s}_{\theta_s}$ (*spatial* cross-terms) and due to connecting the wrong time steps $q \neq r$ together in $\mathcal{J}^{h_t}_{h_r} \nu_r \nu_q^{\top} \mathcal{J}^{h_q}_{\theta_q}$ (*temporal* cross-terms). In expectation, these errors cancel out and the approximation is unbiased.

Variance reduction through iterative rescaling


UORO came with a variance reduction technique that iteratively rescales all quantities involved:
\begin{align}
  \tilde{h}_t & = \gamma_t \mathcal{J}^{h_t}_{h_{t - 1}} \tilde{h}_{t - 1} +
  \beta_t u_t\\
  \tilde{w}_t^{\top} & = \gamma_t^{- 1} \tilde{w}_{t - 1}^{\top} +
  \beta_t^{- 1} u_t^{\top} \mathcal{J}^{h_t}_{\theta_t} .
\end{align}
The coefficients $\gamma_t, \beta_t$ serve to reduce the norms of undesired temporal cross-terms (e.g. $\gamma_t \beta_t^{- 1} \mathcal{J}^{h_t}_{h_{t - 1}} \tilde{h}_{t - 1} u_t^{\top} \mathcal{J}^{h_t}_{\theta_t}$) while keeping corresponding terms (e.g. $\gamma_t \gamma_t^{- 1} \mathcal{J}^{h_t}_{h_{t - 1}} \tilde{h}_{t - 1} \tilde{w}_{t - 1}^{\top}$) unaffected. In practice it seems like the brunt of the work is done by $\gamma_t$, which distributes, across $\tilde{h}_t$ and $\tilde{w}_t$, the contraction of $\tilde{h}_{t - 1}$ due to forward propagation through $\mathcal{J}^{h_t}_{h_{t - 1}}$ (aka gradient vanishing).
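
As a sketch, the rescaled recursions on the toy RNN might look as follows; the norm-balancing choice of $\gamma_t, \beta_t$ below is my paraphrase of the coefficients used by Tallec & Ollivier (they are chosen so that the paired factors have equal norms), so treat the exact formulas as an assumption rather than a quote:

    import numpy as np

    H, X, T = 4, 3, 6
    rng = np.random.default_rng(0)
    W = rng.standard_normal((H, H + X + 1)) * 0.5
    xs = rng.standard_normal((T, X))
    eps = 1e-12                             # guards against division by zero

    h, h_tilde, w_tilde = np.zeros(H), np.zeros(H), np.zeros(W.size)
    for t in range(T):
        a = np.concatenate([h, xs[t], [1.0]])
        h = np.tanh(W @ a)
        u = rng.choice([-1.0, 1.0], size=H)
        fwd = (1 - h ** 2) * (W[:, :H] @ h_tilde)           # J^{h_t}_{h_{t-1}} h_tilde_{t-1}
        bwd = np.outer((1 - h ** 2) * u, a).ravel()         # u_t^T J^{h_t}_{theta_t}
        # balance the norms of the paired factors
        gamma = np.sqrt((np.linalg.norm(w_tilde) + eps) / (np.linalg.norm(fwd) + eps))
        beta = np.sqrt((np.linalg.norm(bwd) + eps) / (np.linalg.norm(u) + eps))
        h_tilde = gamma * fwd + beta * u
        w_tilde = w_tilde / gamma + bwd / beta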

In our paper we argue that this variance reduction scheme, although cheap and very effective, has some room for improvement. UORO's coefficients are chosen to minimize
$$
\mathbb{E}_{\tau_t} [\| \tilde{h}_t \tilde{w}_t^{\top} -
   (\mathcal{J}^{h_t}_{h_{t - 1}} \tilde{h}_{t - 1} \tilde{w}_{t - 1}^{\top} +
   u_t u_t^{\top} \mathcal{J}^{h_t}_{\theta_t}) \|^2_F],
$$
i.e. the expected norm of the error in $\tilde{h}_t \tilde{w}_t^{\top}$ as a rank-one approximation of the rank-two matrix $\mathcal{J}^{h_t}_{h_{t - 1}} \tilde{h}_{t - 1} \tilde{w}_{t - 1}^{\top} + u_t u_t^{\top} \mathcal{J}^{h_t}_{\theta_t}$. This is a natural quantity to target, but it ignores the bigger picture: downstream, the approximate Jacobians $\tilde{h}_t \tilde{w}_t^{\top} \approx \mathcal{J}^{h_t}_{\theta}$ are used to produce a sequence of gradient estimates $\mathcal{J}^{L_t}_{h_t} \tilde{h}_t \tilde{w}_t^{\top} \approx \mathcal{J}^{L_t}_{\theta}$, which are aggregated by some optimization process into a *total gradient estimate*
$$
\sum_{t \leqslant T} \mathcal{J}^{L_t}_{h_t} \tilde{h}_t \tilde{w}_t^{\top}
   \approx \sum_{t \leqslant T} \mathcal{J}^{L_t}_{\theta}
   =\mathcal{J}^L_{\theta} .
$$
Notice in particular that each of the terms $\mathcal{J}^{L_t}_{h_t} \tilde{h}_t \tilde{w}_t^{\top}$ is based largely on the same random quantities, which produces interactions between consecutive gradient estimates. In our paper we instead seek to minimize the *total variance*
$$
\mathbb{E}_u \left[ \| \sum_{t \leqslant T} \mathcal{J}^{L_t}_{h_t}
   \tilde{h}_t \tilde{w}_t^{\top} -\mathcal{J}^L_{\theta} \|^2 \right] .
$$
Since consecutive gradient estimates are not independent, the variance of the sum is not simply the sum of the variances.

Theoretical framework


We analyze a generalization of UORO's recursions:
\begin{align}
  \tilde{h}_t & = \mathcal{J}^{h_t}_{h_{t - 1}} \tilde{h}_{t - 1}
  +\mathcal{J}^{h_t}_{z_t} Q_t u_t\\
  \tilde{w}_t^{\top} & = \tilde{w}_{t - 1}^{\top} + u_t^{\top} Q_t^{- 1}
  \mathcal{J}^{z_t}_{\theta_t} .
\end{align}
The symbolic variable $z_t$ may refer to any cut vertex along the path from $\theta_t$ to $h_t$. In vanilla UORO, $z_t \equiv h_t$, so projection occurs in state space. Other choices include projection in parameter space ($z_t \equiv \theta_t$) and projection in preactivation space (which has convenient structure).

We also replaced the scalar coefficients $\gamma_t, \beta_t$ by matrices $Q_t$ (see the paper for the details). These matrices transform the noise vectors $u_t$; the section on REINFORCE below reveals an interpretation of these matrices as modifying the covariance of exploration noise.

We define the following shorthands:
\begin{align}
& b^{(t) \top}_s =\mathcal{J}^{L_t}_{z_s} & \text{and} & & J_s  = \mathcal{J}^{z_s}_{\theta_s}.
\end{align}
Now the total gradient estimate,
\begin{align}
\sum_{t \leqslant T} \mathcal{J}^{L_t}_{h_t} \tilde{h}_t \tilde{w}_t^{\top}
&= \sum_{t \leqslant T} \mathcal{J}^{L_t}_{h_t}
\left( \sum_{s \leqslant t} \mathcal{J}^{h_t}_{z_s} Q_s u_s \right)
\left( \sum_{s \leqslant t} u_s^{\top} Q_s^{-1} \mathcal{J}^{z_s}_{\theta_s} \right)
\\&= \sum_{t \leqslant T} \left( \sum_{s \leqslant t} b^{(t) \top}_s Q_s u_s \right)
   \left( \sum_{s \leqslant t} u_s^{\top} Q_s^{-1} J_s \right)
,
\end{align}
can be expressed as
$$
\sum_{t \leqslant T} \left(\begin{array}{c}
     b^{(t)}_1\\
     \vdots\\
     b^{(t)}_T
   \end{array}\right)^{\top} \left(\begin{array}{ccc}
     Q_1 &  & \\
     & \ddots & \\
     &  & Q_T
   \end{array}\right) \left(\begin{array}{c}
     u_1\\
     \vdots\\
     u_T
   \end{array}\right) \left(\begin{array}{c}
     u_1\\
     \vdots\\
     u_T
   \end{array}\right)^{\top} \left(\begin{array}{ccc}
     Q_1 &  & \\
     & \ddots & \\
     &  & Q_T
   \end{array}\right)^{- 1} \left(\begin{array}{ccc}
     S^{(t)}_1 &  & \\
     & \ddots & \\
     &  & S^{(t)}_T
   \end{array}\right) \left(\begin{array}{c}
     J_1\\
     \vdots\\
     J_T
   \end{array}\right),
$$
where $S^{(t)}_s =\mathbb{1}_{s \leqslant t} I$ enforces causality: the estimate $\mathcal{J}^{L_t}_{h_t} \tilde{h}_t \tilde{w}_t^{\top}$ at time $t$ does not involve contributions $\mathcal{J}^{z_s}_{\theta_s}$ from future ($s > t$) parameter applications. This property already holds on the $b$ side, as $b^{(t)}_s =\mathbb{1}_{s \leqslant t} b^{(t)}_s$.

Giving names to these concatenated quantities, we may write
$$
  \sum_{t \leqslant T} \mathcal{J}^{L_t}_{h_t} \tilde{h}_t \tilde{w}_t^{\top}
 = \sum_{t \leqslant T} b^{(t) \top} Q u u^{\top} Q^{- 1} S^{(t)} J.
$$
This is a fairly simple expression, which makes it easy to analyze the behavior of the estimator. We see immediately that the estimator is unbiased, as $\mathbb{E}_u [Q u u^{\top} Q^{- 1}] = Q\mathbb{E}_u [u u^{\top}] Q^{- 1} = Q Q^{- 1} = I$ ($Q$ is assumed to be independent of the noise $u$). In the paper we also derive the variance; for this blog post it will be enough to note that it is dominated by a product of traces,
$$
V (Q) = \sum_{s \leqslant T} \sum_{t \leqslant T} \operatorname{tr} \left( \sum_{r
   \leqslant T} b^{(s)}_r b_r^{(t) \top} Q_r Q_r^{\top} \right) \operatorname{tr}
   \left( \sum_{q \leqslant T} S^{(t)}_q J_q J_q^{\top} S^{(s)}_q (Q_q
   Q_q^{\top})^{- 1} \right) .
$$

Toward improved variance reduction


We wish to minimize this wrt $Q$, but in a way that corresponds to a practical algorithm. We investigate the case where $Q$ has the form $Q_t = \alpha_t Q_0$ for scalar coefficients $\alpha_t$ and a constant $Q_0$ matrix. Intuitively, the $\alpha_t$ could be chosen by an iterative rescaling rule similar to the $\gamma_t, \beta_t$ scheme from UORO, while $Q_0$ would be based on supposedly slow-moving second-order information.

Joint optimization of these quantities turned out to be analytically intractable, and even alternately optimizing the $\alpha_t$ and $Q_0$ is difficult. Still, we made some headway on these problems; of particular interest is the quantity
$$
B = \sum_{s \leqslant T} \sum_{t \leqslant T} \left( \sum_{q = 1}^{\min (s,
   t)} \alpha_q^{- 2} \| a_q \|^2 \right) \left( \sum_{r \leqslant T}
   \alpha_r^2 b^{(s)}_r b^{(t) \top}_r \right) = \sum_{q \leqslant T} \sum_{r
   \leqslant T} \frac{\alpha_r^2}{\alpha_q^2} \| a_q \|^2 \left( \sum_{s =
   q}^T b^{(s)}_r \right) \left( \sum_{t = q}^T b^{(t)}_r \right)^{\top},
$$
which gives the optimal $Q_0$ as $Q_0 = B^{- 1 / 4}$ when projection occurs in preactivation space (see the paper). The vector $a_t$ is the input to the RNN (i.e. the previous state, the current observation and a bias), which shows up here due to the convenient structure of the backward Jacobian $J_t = \mathcal{J}^{z_t}_{\theta_t} = I \otimes a_t^\top$ in the case of preactivation-space projection.

In the optimal case $Q_0 = B^{- 1 / 4}$, the variance contribution $V (Q)$ above reduces from $\operatorname{tr} (B) \operatorname{tr} (I)$ to $\operatorname{tr} (B^{1 / 2})^2$, which by Cauchy-Schwarz is an improvement to the extent that the eigenspectrum of $B$ is lopsided rather than flat. We show empirically, in a small-scale controlled setting in which $B$ is known (by backprop), that the optimal (in our framework) $\alpha_t$ and $Q_0$ result in significant variance reduction.

Of course, it is not obvious how to implement these ideas effectively. The theoretically optimal choices for these quantities depend on information that is unknown. For instance, the matrix $B$ is sort of a weighted covariance of sums of future gradients $\sum_{s = q}^T b^{(s)}_r$; if we knew these gradients we wouldn't need to maintain the RTRL Jacobian $\mathcal{J}^{h_t}_{\theta}$ in the first place! One cool idea is to estimate it online using the same rank-one tricks we use to estimate the gradient, which results in a self-improving algorithm. We show one such estimator, but it (and others like it) fared poorly in practice, as one might have guessed.

However, the theory is meant to guide practice, not to dictate it. Unbiased estimation of $B$ is not necessary (as the overall algorithm is unbiased for any $Q_0$), nor is it particularly desirable if it comes at the cost of injecting more noise into the system. It is well-known from optimization that second-order information is hard to estimate reliably in the stochastic setting. Most likely there exist heuristic choices of $Q_0$ (guided by the theory) that enable variance reduction without introducing additional noise variables and which may be more amenable to noisy estimation in the first place.

Projecting in preactivation space


As also discovered by Mujika et al. (2018), it is usually possible to do away with random projection onto spatial noise vectors $\nu_t$ altogether. The trick is that the gradient contribution $\mathcal{J}^L_{\theta_t}$ due to application of the parameter $\theta$ at time $t$ is already rank-one if the parameter plays the role of a weight matrix (as it almost always does). E.g. if the recurrent neural network takes the form
$$
p_t = W_t a_t \\
h_t = f (p_t)
$$
such that $\theta_t = \operatorname{vec} (W_t)$ and with $a_t = \left(\begin{array}{ccc}
  h_{t - 1}^{\top} & x_t^{\top} & 1 \end{array}\right)^{\top}$ being the concatenated input to the RNN at time $t$, then the gradient with respect to $W_t$ is just the outer product of
the gradient with respect to the preactivations $p_t$ and the inputs $a_t$:
$$
\nabla_{W_t} L = (\mathcal{J}^L_{p_t})^\top a_t^\top
$$
The notation becomes a bit of a trainwreck when we switch to working with the vectorization $\theta_t = \operatorname{vec}(W_t)$ in order to speak of the Jacobians $\mathcal{J}^{h_t}_{\theta_t}$. Let's switch to Kronecker products for $\otimes$'s sake:
$$
\mathcal{J}^{h_t}_{\theta_t} = \mathcal{J}^{h_t}_{p_t} (I \otimes a_t^{\top})
$$
This is formally a matrix, but it's more naturally thought of as a third-order $H \times P \times A$ tensor with elements $\frac{\partial h_{t i}}{\partial p_{t j}} a_{t k}$. ($H$, $P$, $A$ are the dimensions of $h_t$, $p_t$ and $a_t$ respectively.)

The key observation is that this third-order tensor can be broken up into an outer product of an $H \times P$ matrix $\mathcal{J}^{h_t}_{p_t}$ and an $A$-dimensional vector $a_t$ without any random projection. Plain UORO on the other hand would stochastically break it up into an outer product of an $H$-dimensional vector $\mathcal{J}^{h_t}_{p_t} u_t$ and a vectorized $P \times A$ matrix $u_t^\top \mathcal{J}^{p_t}_{\theta_t} = u_t^\top (I \otimes a_t^\top) = \operatorname{vec}(u_t a_t^\top)$. Although the factorization without projection does not introduce extra variance, it does introduce extra computation, as we are now propagating a matrix again as in RTRL (albeit a smaller one).

Now the approximate Jacobian takes the form $\tilde{H}_t (I \otimes \tilde{a}_t^\top) \approx \mathcal{J}^{h_t}_{\theta}$, with $\tilde{H}_t$ and $\tilde{a}_t$ maintained according to
\begin{align}
  \tilde{H}_t & = \gamma_t \mathcal{J}^{h_t}_{h_{t - 1}} \tilde{H}_{t - 1} +
  \beta_t \tau_t \mathcal{J}^{h_t}_{p_t} \\
  \tilde{a}_t & = \gamma_t^{- 1} \tilde{a}_{t - 1} +
  \beta_t^{- 1} \tau_t a_t .
\end{align}
These are similar to UORO's recursions, except that the Jacobian factors $\mathcal{J}^{h_t}_{p_t}$ and $I \otimes a_t^\top$ are not projected onto noise vectors $u_t$ but rather multiplied by random signs $\tau_t$ before being accumulated into their respective sums.

Each gradient $\mathcal{J}^{L_t}_{\theta}$ is estimated by $\mathcal{J}^{L_t}_{h_t} \tilde{H}_t (I \otimes \tilde{a}_t^\top) = \operatorname{vec} ( (\mathcal{J}^{L_t}_{h_t} \tilde{H}_t)^\top \tilde{a}_t^\top)$, which can still be computed without explicitly forming $\tilde{H}_t (I \otimes \tilde{a}_t^\top)$.

[Figure: Forward-propagation of noise in preactivation space]

[Figure: Accumulation of activations]
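
A toy sketch of this variant (with $\gamma_t = \beta_t = 1$ for brevity and the same hypothetical loss as before) makes the extra cost visible: $\tilde{H}_t$ is now an $H \times H$ matrix rather than a vector, since here the preactivations have the same dimension as the state:

    import numpy as np

    H, X, T = 4, 3, 6
    A = H + X + 1
    rng = np.random.default_rng(0)
    W = rng.standard_normal((H, A)) * 0.5
    xs = rng.standard_normal((T, X))

    h = np.zeros(H)
    H_tilde = np.zeros((H, H))              # accumulates sign-weighted J^{h_t}_{p_s} terms
    a_tilde = np.zeros(A)                   # accumulates sign-weighted inputs a_s
    grad = np.zeros_like(W)
    for t in range(T):
        a = np.concatenate([h, xs[t], [1.0]])
        h = np.tanh(W @ a)
        D = np.diag(1 - h ** 2)             # J^{h_t}_{p_t}
        tau = rng.choice([-1.0, 1.0])       # one random sign per time step
        H_tilde = D @ W[:, :H] @ H_tilde + tau * D
        a_tilde = a_tilde + tau * a
        # J^{L_t}_{theta} ~ vec((J^{L_t}_{h_t} H_tilde)^T a_tilde^T), here for loss 0.5 ||h_t||^2
        grad += np.outer(h @ H_tilde, a_tilde)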

We show in the paper that this property can be exploited to reduce the variance contribution $V (Q)$ by a factor equal to the dimension of the preactivations, at the cost of increasing the computational complexity by the same factor. Typically, this means going from $\mathcal{O} (H^2)$ time (vanilla UORO) to $\mathcal{O} (H^3)$ time (preactivation-space UORO); a serious increase but still better than RTRL's $\mathcal{O} (H^4)$ time ($H$ is the size of the hidden state $h_t$).

The group that discovered this algorithm around the same time (the aforementioned Approximating Real-Time Recurrent Learning with Random Kronecker Factors (Mujika et al 2018)) released a new paper a few days ago, Optimal Kronecker-Sum Approximation of Real Time Recurrent Learning (Benzing et al 2019). I haven't had time to read it in full, but it looks like they found a way to optimize the low-rank approximation without biasing the approximation.

A link to REINFORCE


Finally, we show a near-equivalence between REINFORCE and UORO when the former is used to train a stochastic RNN with the following recurrent structure:
$$
h_t = F (\bar{h}_{t - 1}, x_t ; \theta_t) \\
\bar{h}_t = h_t + \sigma Q_t u_t
$$
Here $u_t \sim \mathcal{N} (0, I)$ is additive Gaussian noise, and $\sigma$ determines the level of noise. The invertible matrix $Q_t$ transforms the standard normal noise $u_t$ and corresponds to a covariance matrix; it is included here because it will end up playing the same role as the $Q_t$ matrices discussed in the variance reduction section above. Effectively, the stochastic hidden state $\bar{h}_t \sim \mathcal{N} (h_t, \sigma^2 Q_t Q_t^{\top})$ is sampled from a Gaussian distribution centered on the deterministic hidden state $h_t$. We assume the loss $L_t$ to be a differentiable function of $\bar{h}_t$.

[Figure: RNN with state perturbation]

To be clear, the sole purpose of injecting this noise is so that we may apply REINFORCE to estimate gradients through the system and compare these estimates to those of UORO. The stochastic transition distribution will be our policy from the REINFORCE perspective, which suggests actions $\bar{h}_t$ given states $h_t$. We compute the REINFORCE estimator by running the stochastic RNN forward, thus sampling a trajectory of states $\bar{h}_t$, and at each step computing
$$
L_t \nabla_{\theta} \log p (\bar{h}_t | \bar{h}_0, \bar{h}_1 \ldots
   \bar{h}_{t - 1} ; \theta) \approx \mathcal{J}^{L_t}_{\theta} .
$$
The estimate consists of the loss $L_t$ times the score function of the trajectory. Intuitively, higher rewards (equivalently, lower losses) "reinforce" directions in parameter space that bring them about.

The score function $\bar{w}_t^{\top} = \nabla_{\theta} \log p (\bar{h}_t | \bar{h}_0, \bar{h}_1 \ldots \bar{h}_{t - 1} ; \theta)$ is maintained online according to
$$
\bar{w}_t^{\top} = \bar{w}_{t - 1}^{\top} + \nabla_{\theta} \log p
   (\bar{h}_t | \bar{h}_{t - 1} ; \theta_t) = \bar{w}_{t - 1}^{\top} +
   \frac{1}{\sigma} u_t^{\top} Q_t^{- 1} \mathcal{J}^{h_t}_{\theta_t},
$$
which is analogous to $\tilde{w}$ in UORO. An important difference is that the backward Jacobians $\mathcal{J}^{h_t}_{\theta_t}$ are evaluated in the noisy system. In the paper we eliminate this difference by passing to the limit $\sigma \rightarrow 0$, which simulates the common practice of annealing the noise.
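
Here is a minimal sketch of this REINFORCE estimator on the toy RNN, with $Q_t = I$, a fixed $\sigma$, a hypothetical per-step loss, and no baseline (baselines come up below); it is meant only to show the running score function, not to be a good estimator:

    import numpy as np

    H, X, T, sigma = 4, 3, 6, 0.1
    rng = np.random.default_rng(0)
    W = rng.standard_normal((H, H + X + 1)) * 0.5
    xs = rng.standard_normal((T, X))

    h_bar = np.zeros(H)
    w_bar = np.zeros_like(W)                # running score function, shaped like W
    grad = np.zeros_like(W)
    for t in range(T):
        a = np.concatenate([h_bar, xs[t], [1.0]])
        h = np.tanh(W @ a)                  # deterministic part of the transition
        u = rng.standard_normal(H)
        h_bar = h + sigma * u               # \bar h_t = h_t + sigma Q_t u_t with Q_t = I
        # grad_theta log p(\bar h_t | \bar h_{t-1}) for the Gaussian policy is
        # (1/sigma) times u_t back-propagated through the transition onto W
        w_bar += np.outer((1 - h ** 2) * u, a) / sigma
        L_t = 0.5 * np.sum(h_bar ** 2)      # hypothetical per-step loss on the noisy state
        grad += L_t * w_bar                 # REINFORCE estimate of J^{L_t}_{theta}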

Besides $\tilde{w}_t$, UORO's estimate of $\mathcal{J}^{L_t}_{h_t} \tilde{h}_t \tilde{w}_t^{\top} \approx \mathcal{J}^{L_t}_{\theta}$ involves $\mathcal{J}^{L_t}_{h_t} \tilde{h}_t$. In REINFORCE, the inner product of $\bar{w}_t$ with this quantity is implicit in the multiplication by the loss. We can reveal it by taking the Taylor series of the loss around the point $u = 0$ where the noise is zero:
$$
L_t = L_t |_{u = 0} + \left( \sum_{s \leqslant t} \mathcal{J}^{L_t}_{u_s}
   |_{u = 0} u_s \right) + \frac{1}{2} \left( \sum_{r \leqslant t} \sum_{s
   \leqslant t} u_r^{\top} \mathcal{H}^{L_t}_{u_r, u_s} |_{u = 0} u_s \right)
   + \cdots
$$
Using the fact that derivatives with respect to $u_s$ are directly related to derivatives with respect to $h_s$, namely
$$
\mathcal{J}^{L_t}_{u_s} |_{u = 0} = \sigma \mathcal{J}^{L_t}_{h_s} |_{u = 0} Q_s,
$$
we may write
$$
L_t = L_t |_{u = 0} + \sigma \mathcal{J}^{L_t}_{h_t} \left( \sum_{s
   \leqslant t} \mathcal{J}^{h_t}_{h_s} |_{u = 0} Q_s u_s \right) +\mathcal{O}
   (\sigma^2) .
$$
Plugging these into the REINFORCE estimate $L_t \nabla_{\theta} \log p (\bar{h}_t | \bar{h}_0, \bar{h}_1 \ldots \bar{h}_{t - 1} ; \theta)$, we get
$$
L_t \bar{w}_t^{\top} = \frac{1}{\sigma} L_t |_{u = 0} \left( \sum_{s
   \leqslant t} u_s^{\top} Q_s^{- 1} \mathcal{J}^{h_s}_{\theta_s} \right)
   +\mathcal{J}^{L_t}_{h_t} \left( \sum_{s \leqslant t}
   \mathcal{J}^{h_t}_{h_s} |_{u = 0} Q_s u_s \right) \left( \sum_{s \leqslant
   t} u_s^{\top} Q_s^{- 1} \mathcal{J}^{h_s}_{\theta_s} \right) +\mathcal{O}
   (\sigma^2) .
$$
When we pass to the limit $\sigma \rightarrow 0$, the second term becomes identical to the UORO estimate $\mathcal{J}^{L_t}_{h_t} \tilde{h}_t \tilde{w}_t^{\top}$. Note how the $Q_t$ matrices that determine the covariance of the exploration noise in REINFORCE play the exact same role as our variance reduction matrices in UORO.

The first term, which is zero in expectation, contributes infinite variance in the limit. In effect, annealing the noise deteriorates the quality of REINFORCE's estimates. This first term is usually addressed by subtracting a "baseline" -- an estimate of $L_t |_{u = 0}$ -- from the loss $L_t$ before multiplying with the score.

Conclusions


We've delved deeply into UORO, contributing a straightforward derivation, a general theoretical framework and a thorough analysis. Our proposed variance reduction using these $Q_t$ matrices is promising, although much work remains to be done. We've shown a variant of UORO that avoids the spatial level of stochastic approximation, thereby greatly reducing the variance at the cost of equally greatly increasing the time complexity. Finally, we have established a deep link between UORO and REINFORCE, which allows the interpretation of REINFORCE as an approximation to RTRL.

2014-11-16

For the Sake of Pleasure Alone

In the past few years I have made acquaintance with several people who, despite being both smart and educated, continue to worship the false god of preference utilitarianism.  I too consider myself both smart and educated, but worship hedonistic utilitarianism instead.  Sometimes I wonder if our differing views aren't merely different representations of the same thing.

I have been wanting to explore this for a long time, but never knew where to start.  Today, less is more, and I will just take an excerpt from Not for the Sake of Happiness (Alone) and go from there:
It is an undeniable fact that we tend to do things that make us happy, but this doesn't mean we should regard the happiness as the only reason for so acting.  First, this would make it difficult to explain how we could care about anyone else's happiness - how we could treat people as ends in themselves, rather than instrumental means of obtaining a warm glow of satisfaction.

Second, just because something is a consequence of my action doesn't mean it was the sole justification.  If I'm writing a blog post, and I get a headache, I may take an ibuprofen.  One of the consequences of my action is that I experience less pain, but this doesn't mean it was the only consequence, or even the most important reason for my decision.  I do value the state of not having a headache.  But I can value something for its own sake and also value it as a means to an end.

For all value to be reducible to happiness, it's not enough to show that happiness is involved in most of our decisions - it's not even enough to show that happiness is the most important consequent in all of our decisions - it must be the only consequent.  That's a tough standard to meet.
Yes, there are multiple reasons for taking the ibuprofen, and it would be a stretch to say that the other reasons boil down to maximizing expected happiness.  But just because they are reasons for acting one way or another does not mean there is something valuable there.  This strikes me as a confusion of terminal and instrumental values.

I can spend hours and hours aimlessly clicking around on Facebook or aimlessly walking around the grocery store.  Anyone with a similar affliction can confirm that these activities yield more dolor than hedon; why then do I keep doing them?  Whatever the cause, I don't feel at all compelled to elevate it or any aspect of its outcomes to Terminal Value.

This feels like a strawman because writing a blog post has higher expected value than aimlessly clicking around on Facebook.  But if that's the crux, then it's a confusion over instrumental and terminal values.

The more substantial disagreement is that I don't see why our conception of terminal values should explain our behaviour.

---

In Terminal Values and Instrumental Values, Yudkowsky writes:
Consider the philosopher who asserts, "All of us are ultimately selfish; we care only about our own states of mind.  The mother who claims to care about her son's welfare, really wants to believe that her son is doing well - this belief is what makes the mother happy.  She helps him for the sake of her own happiness, not his."  You say, "Well, suppose the mother sacrifices her life to push her son out of the path of an oncoming truck.  That's not going to make her happy, just dead."  The philosopher stammers for a few moments, then replies, "But she still did it because she valued that choice above others - because of the feeling of importance she attached to that decision."

[...]

When we object that people sometimes do sacrifice their lives, the philosopher's reply shifts to discussing Expected Utilities over Actions:  "The feeling of importance she attached to that decision."  This is a drastic jump that should make us leap out of our chairs in indignation.
And this is basically how I see it, except that I don't try to be cool and cynical and deep by making accusations of selfishness, and that my stammering wouldn't be relevant to the argument.  The mother is an adaptation-executor just like the rest of us; endowed with reinforcement-learned genes that make no distinction between expected utility and actual utility or between instrumental and terminal values.  Worse, her adaptations have been optimized with respect to an irrelevant goal: inclusive genetic fitness.

If the terminal value function we are looking for should explain our behaviour, then there you have it: inclusive genetic fitness.

(As I am writing this, it occurs to me that I may see it differently from the paraphrased philosopher after all.  I would not have to make the jump from talking about utility to talking about expected utility because I would be talking about expected utility from the start.)

---

Years ago when I first started reading Yudkowsky's writings, I came upon
How to Convince Me That 2 + 2 = 3:
I find it quite easy to imagine a situation which would convince me that 2 + 2 = 3.  Suppose I got up one morning, and took out two earplugs, and set them down next to two other earplugs on my nighttable, and noticed that there were now three earplugs, without any earplugs having appeared or disappeared [...].
(Note: I have omitted essential parts of his argument because I'm not going to argue with it anyway.)

I remember reading over that second sentence several times.  You have two earplugs, you add two earplugs, and you end up with three earplugs.  That's impossible!  What on Earth is he not telling me?  I think it took close to a dozen fruitless re-readings of the first few paragraphs of the post before it dawned on me.

It's unimaginable, and that's the point.  The hypothesis that two plus two equals four is tested possibly more often than any other, and has never had evidence -- no matter how unexpected -- cast doubt on it.  We simply can't see things being any other way.

And so I feel about pain and pleasure being the only terminally valuable things.  I can't even fathom how something could possibly be terminally valuable if it were neither pleasurable nor painful.  But I would know it if it happened, and then I would be convinced.

---

I feel like I should taboo "terminally valuable", but I am not sure what to replace it with.  I terminally value good sensations because they are good to me.  I terminally disvalue bad sensations because they are bad to me.

I could go on for a while, rephrasing it in different ways, all conveying simply "I like what I like and I dislike what I dislike".

That statement is tautological but not useless.

There is no objective morality.  Even if there were, there would be nothing to enforce it.  If there were something to enforce it, this whole discussion would be pointless.

On the flip side, there is plenty of subjective morality.  Different people like different things.  Different people want different things.

The nice thing about subjective morality is that it's nobody's business but your own.  But it's a leaky abstraction.  We share an objective reality, and what happens there violates most people's subjective moralities.

That much is obvious.

Now suppose we want to do something about this.  Ideally, we'd like for everyone to get what they want.  Failing that, we can make them want what they get.

That last option is one that Yudkowsky seems to be heavily opposed to.  If he wants chocolate ice cream, then he doesn't want to want vanilla ice cream, because he doesn't want vanilla ice cream.  He shudders at the thought of eating vanilla ice cream.  Eww!

But I hope that it is uncontroversial to say that there is nothing valuable about chocolate or vanilla ice cream per se.  The value comes from the experience of eating it and liking it.

There is no objective morality, and so you should consider your attachment to your current wants (whatever they are) to be instrumentally irrational.  In fact, your wants are worse than useless: there is an opportunity cost to insisting on chocolate ice cream when vanilla ice cream is much more abundant and your wants are easily changed.

The extent to which you insist on keeping your wants the way they are determines your... say... preferential footprint.  If, as I assumed earlier, we want for everyone to get what they want, then we want to minimize our preferential footprints.

Doing so will leave only that which we cannot help but value: pleasures and pains.  If it turns out we can do away with those as well, then great, but just like I don't expect two plus two to start equalling three, so I don't expect pain to stop hurting.

---

So far I have assumed that we want to cooperate so that everyone gets what they want.  Given that the conception of value presented here is in some sense selfish, it is a whole 'nother minefield whether and how this cooperation can occur.  I believe I have written plenty of disagreeable things already; let's leave it at this for now.

2014-04-06

Hyperbola Quintessence for rooks along ranks

Hyperbola Quintessence is a quick and simple alternative to Magic Bitboards, but it does not work for rooks or queens moving along ranks (assuming LERF mapping).  It applies the o^(o-2r) trick to generate attacks in positive ray direction (for rooks, east or north).  The attacks in negative ray direction are generated by applying the same trick to the bit-reversed board.

There is no fast (i.e. branchless) way to reverse bits, but in many cases (attacks along a diagonal or along a file), the relevant bits are in separate bytes.  In those cases, the relevant bits can be reversed by a simple byte swap, which is sufficient.

In the case of a rook moving along a rank, the relevant bits are all in the same byte.  The same trick can't be exploited here because the byte swap will have no effect on the relative order of the bits.

I'm new to the world of computer chess, so I don't know what people have been falling back to.  I guess look-up tables are an okay solution, with 8 * 256 entries covering all combinations of rook position on the rank and presence of other pieces on the rank.

In any case, here's a way to apply Hyperbola Quintessence in the rook-along-rank case.  The crux is that we don't need to be able to reverse all the bits of the bitboard but only the relevant ones.  Since there are as many bits in the byte (representing the rank) as there are bytes in the board, we can map the bits to their own consecutive bytes and swap the bytes as usual.

Here are the steps (dots are zeros):
  1. Mask the occupancy and the piece position to keep only the relevant rank.
  2. Map the relevant rank of the occupancy and the piece position to the a1-h8 diagonal:
    occupancy         piece
    . . . . . . . .   . . . . . . . . 
    . . . . . . . .   . . . . . . . . 
    . . . . . . . .   . . . . . . . . 
    . . . . . . . .   . . . . . . . . 
    0 0 1 0 0 1 1 0   0 0 1 0 0 0 0 0 
    . . . . . . . .   . . . . . . . . 
    . . . . . . . .   . . . . . . . . 
    . . . . . . . .   . . . . . . . . 
    
    >> rank_index * 8
    
    occupancy         piece
    . . . . . . . .   . . . . . . . . 
    . . . . . . . .   . . . . . . . . 
    . . . . . . . .   . . . . . . . . 
    . . . . . . . .   . . . . . . . . 
    . . . . . . . .   . . . . . . . . 
    . . . . . . . .   . . . . . . . . 
    . . . . . . . .   . . . . . . . . 
    0 0 1 0 0 1 1 0   0 0 1 0 0 0 0 0 
    
    * 0x0101010101010101
    
    occupancy         piece
    0 0 1 0 0 1 1 0   0 0 1 0 0 0 0 0  
    0 0 1 0 0 1 1 0   0 0 1 0 0 0 0 0  
    0 0 1 0 0 1 1 0   0 0 1 0 0 0 0 0  
    0 0 1 0 0 1 1 0   0 0 1 0 0 0 0 0  
    0 0 1 0 0 1 1 0   0 0 1 0 0 0 0 0  
    0 0 1 0 0 1 1 0   0 0 1 0 0 0 0 0  
    0 0 1 0 0 1 1 0   0 0 1 0 0 0 0 0  
    0 0 1 0 0 1 1 0   0 0 1 0 0 0 0 0 
     
    & 0x8040201008040201
    
    occupancy         piece
    . . . . . . . 0    . . . . . . . 0 
    . . . . . . 1 .    . . . . . . 0 . 
    . . . . . 1 . .    . . . . . 0 . . 
    . . . . 0 . . .    . . . . 0 . . . 
    . . . 0 . . . .    . . . 0 . . . . 
    . . 1 . . . . .    . . 1 . . . . . 
    . 0 . . . . . .    . 0 . . . . . . 
    0 . . . . . . .    0 . . . . . . . 
  3. Apply Hyperbola Quintessence to get the attacks along the rank as if it were the a1-h8 diagonal.
  4. Map the attacks back onto the relevant rank:
    attacks
    . . . . . . . 0
    . . . . . . 0 .
    . . . . . 1 . .
    . . . . 1 . . .
    . . . 1 . . . .
    . . 0 . . . . .
    . 1 . . . . . .
    1 . . . . . . .
    
    * 0x0101010101010101
    
    attacks
    1 1 0 1 1 1 0 0
    1 1 0 1 1 1 0 0
    1 1 0 1 1 1 0 0
    1 1 0 1 1 0 0 0
    1 1 0 1 0 0 0 0
    1 1 0 0 0 0 0 0
    1 1 0 0 0 0 0 0
    1 0 0 0 0 0 0 0
    
    (only the top rank is meaningful here; the lower ranks hold leftover partial copies)
    
    / 0x0100000000000000
    
    attacks
    . . . . . . . .
    . . . . . . . .
    . . . . . . . .
    . . . . . . . .
    . . . . . . . .
    . . . . . . . .
    . . . . . . . .
    1 1 0 1 1 1 0 0
    
    << rank_index * 8
    
    attacks
    . . . . . . . .
    . . . . . . . .
    . . . . . . . .
    . . . . . . . .
    1 1 0 1 1 1 0 0
    . . . . . . . .
    . . . . . . . .
    . . . . . . . .
Note, this only works one rook at a time.
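
For concreteness, here is a sketch of the whole procedure in Python (Python only for readability; in an engine this would be a handful of branchless C operations).  It assumes LERF mapping with a1 = bit 0 and takes the full occupancy bitboard plus the rook's square; all the names are mine.

    FILE_A    = 0x0101010101010101
    DIAG_A1H8 = 0x8040201008040201
    M64       = 0xFFFFFFFFFFFFFFFF

    def byteswap(b):
        return int.from_bytes(b.to_bytes(8, "little"), "big")

    def diagonal_hyperbola(occ, slider):
        # o^(o-2r) in the positive direction and on the byte-swapped board for the
        # negative direction; the occupancy terms cancel when the two halves are xored
        forward = (occ - 2 * slider) & M64
        reverse = byteswap((byteswap(occ) - 2 * byteswap(slider)) & M64)
        return (forward ^ reverse) & DIAG_A1H8

    def rank_attacks(occ, sq):
        rank_shift = sq & ~7                      # 8 * rank_index
        # steps 1-2: mask the rank, drop it to rank 1, smear it over the board,
        # and keep only the a1-h8 diagonal
        occ_diag = (((occ >> rank_shift) & 0xFF) * FILE_A) & DIAG_A1H8
        rook_diag = (((1 << sq) >> rank_shift) * FILE_A) & DIAG_A1H8
        # step 3: hyperbola quintessence along the diagonal
        att_diag = diagonal_hyperbola(occ_diag, rook_diag)
        # step 4: collect the diagonal into the top rank, then shift back down
        return (((att_diag * FILE_A) & M64) >> 56) << rank_shift

    # the example from the diagrams: occupancy byte 0x64 on the 4th rank, rook on c4 (square 26)
    print(hex(rank_attacks(0x64 << 24, 26)))      # 0x3b000000: attacks a4, b4, d4, e4, f4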

2013-12-20

How to create a logical array in numpy

This took me a while to find: the numpy equivalents of Matlab's true and false constructors.

To create a 3x3 array of False: np.zeros((3, 3), dtype=bool)
To create a 3x3 array of True: np.ones((3, 3), dtype=bool)

2013-10-29

Playing HexDame, 1: Introduction

For one of my university courses, I will be implementing an automated HexDame player.  This post is the first in a series that will document the process in a way that is hopefully easy to understand.

HexDame is a variant of draughts/checkers on a six-sided board with six-sided tiles:

Because the board shape differs from that used in checkers, some of the rules of checkers need to be adapted.  In HexDame, pieces can move in six directions rather than four: north-east, north, north-west, south-west, south, south-east.  Like in checkers where men (as opposed to kings) can only move forward, in HexDame men can only move away from the player's initial corner.  For example, white's pieces can only move away from the tile labeled "a1" in the picture.  A man gets promoted to king when it reaches "the other side", which in HexDame is the two sides furthest away from the player's initial corner.

Now let's think about how to play this game as a computer.  You might start brainstorming about how to set up a trap, or how to develop a strong position, or that promoting a man should have highest priority if it can be done in one move.  You could do this, but you would end up with a bunch of isolated ideas with no guidance as to how to combine them.  Worse, the ideas you come up with are tailored to this particular game.

That's right, there are ways to automate the playing of games that are general, in that they work for many games.  The methods that I am thinking of rely on the concept of a game tree.

As an example, here's an image of part of the game tree for tic-tac-toe:


At the top ("root") of the tree, there is the empty tic-tac-toe board.  Below it are all the possibilities for how the tic-tac-toe board will look after one move.  Actually there are more possibilities, but each of the ones not shown can be rotated to match one of those shown.  The rotated variants lead to the same game results, and so do not need to be considered separately.

Below the possible boards after the first move are the possible boards after the second move.  Again, not all are shown.  The image stops there, but you can imagine how the game tree continues further down.

Game trees are a useful idea in many ways.  They model the whole game, revealing all the possible consequences of every possible move.  If you know the game tree, you know what to play.  So what's the catch?  Let's digress...

Lay conversations about the game of chess often touch on game trees.  Wild claims are routinely made about "how many moves ahead" a master looks.  It is a good rule of thumb to believe that stronger players look further ahead than weaker ones, but looking ahead is not all there is to chess.

Well, actually it is, but it takes a lot of time to consider every possible move and countermove.  In fact, every additional step looked ahead costs roughly 35 times as much time!  So if it takes you on average one minute to look two steps ahead, it will take you over half an hour to look three steps ahead, and most of a day to look four steps ahead.  The magic number 35 here is the average number of moves you have to choose from in chess.  In some situations there are fewer possible moves, in others there are more, but the average is around 35.

So what do masters do?  They don't consider each possible move, because some moves are obviously stupid.  They consider forcing moves first (moves that put the king in check or attack the queen), because forcing moves constrain the opponent, which reduces the number of moves they have to choose from, which reduces the time spent thinking ahead.  In the beginning of the game, they follow opening theory, which determines their early moves much like a phone script determines a telemarketer's sequence of questions.  Toward the end of the game, they similarly follow endgame theory.  For the rest of the game, they use their highly developed intuition for chess, and a large set of explicit principles, such as the ones listed at http://www.geocities.com/siliconvalley/lab/7378/principl.htm.

In principle there is no difference between humans and computers.  Both have to comply with the laws of physics, and both are subject to the tyranny of computational complexity, which measures the amount of thinking time and the amount of memory that is required to compute some result.  The practical difference is that out of the box, humans have the ability to recognize patterns.  This is a complicated ability that developed over eons through the process of evolution.  In the context of games, it is good for two things (no, not de-greasing engines and killing brain cells): developing intuition and applying existing knowledge.

Intuition in games is a cool thing.  A decent chess player can look at a position and instantly see in whose favor it is.  Not with perfect accuracy, but perfect accuracy would take more time than has passed since time began.  It is better to lose the occasional game due to inaccuracy than to lose every game by running out of time.

The other way in which human brains have a headstart for playing this kind of game is that they can apply existing knowledge to it.  Knowledge that has been learned for one task can sometimes be reused for another.  For instance, if you're good at playing soccer, it will be easier for you to pick up table soccer.  Wait.. let me find a better example.  How about: if you're good at playing badminton, it will be easier to learn tennis.  If you know how to play chess, you have knowledge that can be repurposed for the game of checkers.  The rules are different, but the games are pretty similar.  Both have the concepts of capturing and promoting pieces, and, less superficially, both have positional and tactical aspects.

Artificial Intelligence researchers are working on mathematical analogues to these abilities.  The general field is reinforcement learning, in which an agent (such as a robot) learns to perform a task by trying things and being told how well it is doing.  In the subfield of transfer reinforcement learning, the agent tries to reuse knowledge from previously learned tasks.  Also lots of progress has been made on neural networks recently, which could substitute for human intuition.  These topics will probably pop up again in future blog posts; I'm not going to go into them here.

For the beginnings of our automated HexDame player, let's forget about these complications and just create a program that looks ahead.  The game tree provides a nice, systematic way for doing this.  More later.
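
As a tiny preview of what "just looking ahead" means in code, here is a self-contained example that searches the game tree of tic-tac-toe (the game pictured above) rather than HexDame; the eventual HexDame player will be more elaborate, so take this only as an illustration of the idea.

    def winner(board):                      # board is a tuple of 9 cells: 'X', 'O' or ' '
        lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
                 (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
        for i, j, k in lines:
            if board[i] != ' ' and board[i] == board[j] == board[k]:
                return board[i]
        return None

    def negamax(board, player):             # value of the position for the player to move
        if winner(board):                   # the previous player just completed a line
            return -1
        if ' ' not in board:                # board full: draw
            return 0
        other = 'O' if player == 'X' else 'X'
        best = -1
        for i in range(9):
            if board[i] == ' ':
                child = board[:i] + (player,) + board[i + 1:]
                best = max(best, -negamax(child, other))
        return best

    print(negamax((' ',) * 9, 'X'))         # prints 0: perfect play from the empty board is a draw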