Daniel Worrall, Machine Learning Researcher (dworrall at qti.qualcomm.com)
https://danielewworrall.github.io/feed.xml (updated 2021-03-26)
Reversible optimisers, 2020-12-20, https://danielewworrall.github.io/blog/2020/12/reversible-optimisers
<p>This post touches on a curious property of some common optimisers used by the machine learning community: <em>reversibility</em>.</p>
<p>I tend to hate reading through lengthy introductions, so let’s just dive in with an example. Take gradient descent with momentum, which has the following form
\begin{align}
\mu_{t+1} &= \alpha \mu_t + \nabla_{x} f(x_{t}) \newline
x_{t+1} &= x_t - \lambda \mu_{t+1}.
\end{align}
Here $x_t$ denotes the optimisation variable, or <em>position</em>, $x$ at time $t$, $\mu$ is the associated <em>momentum</em>, and $0 < \alpha < 1$ and $\lambda > 0$ are metaparameters, which govern the dynamics of the descent trajectory. I use the term <em>meta</em>parameters, instead of <em>hyper</em>parameters, to stress that they are part of the optimiser and not the model, even though some would nowadays say that the optimiser is in fact part of the model, implicitly regularising it.</p>
<p>Anyway, interestingly we can reverse these equations, given the state $[x_{t+1}, \mu_{t+1}]$ as
\begin{align}
x_t &= x_{t+1} + \lambda \mu_{t+1} \newline
\mu_{t} &= \frac{1}{\alpha} \left ( \mu_{t+1} - \nabla_{x} f(x_{t}) \right).
\end{align}
This seemingly arbitrary property is useful from a practical standpoint.</p>
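<p>As a quick sanity check, here is a minimal sketch of one forward step and its reverse in plain Python. The toy objective $f(x) = \frac{1}{2}(x - 2)^2$, the function names and the numerical values are all assumptions chosen for illustration.</p>

```python
def grad_f(x):
    """Gradient of the assumed toy objective f(x) = 0.5 * (x - 2)^2."""
    return x - 2.0


def momentum_step(x, mu, alpha, lam):
    """Forward update: new momentum first, then new position."""
    mu_next = alpha * mu + grad_f(x)
    x_next = x - lam * mu_next
    return x_next, mu_next


def momentum_step_reverse(x_next, mu_next, alpha, lam):
    """Reverse update: recover the position first, then the momentum."""
    x = x_next + lam * mu_next
    mu = (mu_next - grad_f(x)) / alpha
    return x, mu


x0, mu0 = 5.0, 0.3
x1, mu1 = momentum_step(x0, mu0, alpha=0.9, lam=0.1)
x0_rec, mu0_rec = momentum_step_reverse(x1, mu1, alpha=0.9, lam=0.1)
print(x0_rec, mu0_rec)  # matches (x0, mu0) up to floating-point rounding
```

<p>Note the order in the reverse step: the gradient has to be evaluated at the recovered $x_t$, so the position must be reconstructed before the momentum.</p>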
<h3 id="memory-efficiency">Memory efficiency</h3>
<p>An oft-lauded property of reversible systems is that we do not have to store intermediate computations, since they can be reconstructed from the system’s end-state. Typically for reverse-mode differentiation to work (i.e. backpropagation), we have to store all the intermediate activations in the forward pass of a network. This has a memory complexity that scales linearly with the size of the computation graph. If we can dynamically reconstruct intermediate activations during the backward pass, then we instantly convert this linear memory complexity to a constant, which enables us to build (in theory) infinitely deep networks.</p>
<h3 id="momentum-is-additive-coupling">Momentum is additive coupling</h3>
<p>Indeed, if you look a little closer at the momentum equations, then you may spot that they resemble an <a href="https://arxiv.org/pdf/1410.8516.pdf">additive coupling layer</a>. Here we have that a state, split into two parts $x$ and $\mu$ (to mimic the momentum optimiser notation), is reversible with the following computation graph
\begin{align}
\mu_{t+1} &= \mu_t + g(x_t) \newline
x_{t+1} &= x_t + h(\mu_{t+1})
\end{align}
To make a direct comparison, $g(x) = \nabla_x f(x)$ and $h(\mu) = -\lambda \mu$. The one slight discrepancy is the factor of $\alpha$ multiplying $\mu_t$, but we can sweep that under the rug. The reverse equations for the additive coupling layer are
\begin{align}
x_{t} &= x_{t+1} - h(\mu_{t+1}) \newline
\mu_{t} &= \mu_{t+1} - g(x_t).
\end{align}</p>
<div style="text-align:center"><img src="/images/coupling.png" width="50%" /></div>
<p><em>Source: <a href="https://arxiv.org/pdf/1902.02729.pdf">Reversible GANs for Memory-efficient Image-to-Image Translation</a>. This diagramme represents the additive coupling layer in its computation graph form. LEFT: forward pass. RIGHT: reverse pass. To link up the notation $x_1 = \mu_{t}$, $x_2 = x_{t}$, $y_1 = \mu_{t+1}$, $y_2 = x_{t+1}$, $g = \texttt{NN}_1$, and $h=\texttt{NN}_2$</em></p>
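<p>The additive coupling equations can be sketched in a few lines of Python. This is only an illustration: the function names are made up, and $g = \tanh$ and $h(\mu) = \mu^2$ are arbitrary choices, picked to show that neither $g$ nor $h$ needs to be invertible for the layer as a whole to be reversible.</p>

```python
import math


def coupling_forward(x, mu, g, h):
    """Additive coupling: update mu using x, then update x using the new mu."""
    mu_next = mu + g(x)
    x_next = x + h(mu_next)
    return x_next, mu_next


def coupling_inverse(x_next, mu_next, g, h):
    """Undo the two additions in the opposite order."""
    x = x_next - h(mu_next)
    mu = mu_next - g(x)
    return x, mu


g = math.tanh            # assumed example for g
h = lambda m: m * m      # assumed example for h; squaring is not invertible

x, mu = 0.7, -1.2
x_next, mu_next = coupling_forward(x, mu, g, h)
x_rec, mu_rec = coupling_inverse(x_next, mu_next, g, h)
print(x_rec, mu_rec)  # matches (x, mu) up to floating-point rounding
```

<p>The inverse only ever evaluates $g$ and $h$ in the forward direction, which is why they can be arbitrary functions, for instance neural networks.</p>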
<h3 id="case-study">Case study</h3>
<p>Specifically in the case of optimisers, I was pointed towards this paper <a href="https://arxiv.org/pdf/1502.03492.pdf">Gradient-based Hyperparameter Optimization with Reversible Learning</a> (2015) by <a href="https://dougalmaclaurin.com/">Dougal Maclaurin</a>, <a href="http://www.cs.toronto.edu/~duvenaud/">David Duvenaud</a>, and <a href="https://www.cs.princeton.edu/~rpa/">Ryan Adams</a>. The authors exploited the reversibility property of SGD with momentum to train the optimiser metaparameters themselves. First they run the optimiser an arbitrary number of steps, say 100 iterations. This defines an optimisation trajectory $x_0, x_1, x_2, …, x_{99}$. Now the clever part is that you can view the unrolled optimisation trajectory as a computation graph in itself. They compute a loss at the end of the trajectory, then they backpropagate the loss in the reverse direction with respect to the optimiser’s metaparameters.</p>
<div style="text-align:center"><img src="/images/reversibility.png" width="50%" /></div>
<p><em>Source: <a href="https://arxiv.org/pdf/1502.03492.pdf">Gradient-based Hyperparameter Optimization with Reversible Learning</a>. The authors optimise metaparameters by backpropagating along optimisation roll outs. This is made possible with the reversibility of momentum-based SGD, to cap memory-complexity.</em></p>
<p>Could we not do this already, such as in <a href="https://arxiv.org/abs/1606.04474">Learning to learn by gradient descent by gradient descent</a> (Andrychowicz et al., 2016)? Well yes, but the crucial point is that you would usually have to store all the intermediate states $\{[x_t, \mu_t]\}_{t=0}^{99}$, which is costly memory-wise. Exploiting the reversibility property, this memory explosion falls away. Admittedly there are issues with the numerical stability of the inverse, which the paper dives into, but the principle is elegant.</p>
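<p>To make the idea concrete, here is a rough sketch of backpropagating through a momentum trajectory while rebuilding each earlier state on the fly, rather than storing it. It is not the authors’ implementation: the scalar quadratic objective $f(x) = \frac{1}{2} a x^2$, the final loss $\frac{1}{2} x_T^2$, the function names and the hand-derived adjoint updates are all assumptions made for illustration, and plain floating point is used with no attempt to deal with the numerical stability issues mentioned above.</p>

```python
def forward(x0, mu0, alpha, lam, a, steps):
    """Run momentum descent on f(x) = 0.5 * a * x^2, keeping only the final state."""
    x, mu = x0, mu0
    for _ in range(steps):
        mu = alpha * mu + a * x   # the gradient of f at x is a * x
        x = x - lam * mu
    return x, mu


def loss(x):
    """Loss at the end of the trajectory (an illustrative choice)."""
    return 0.5 * x * x


def hypergradients(x0, mu0, alpha, lam, a, steps):
    """Return (dL/dalpha, dL/dlam) without storing the trajectory."""
    x, mu = forward(x0, mu0, alpha, lam, a, steps)
    gx, gmu = x, 0.0              # dL/dx_T = x_T and dL/dmu_T = 0
    d_alpha, d_lam = 0.0, 0.0
    for _ in range(steps):
        # (x, mu) holds (x_{t+1}, mu_{t+1}); rebuild the previous state.
        x_prev = x + lam * mu
        mu_prev = (mu - a * x_prev) / alpha
        # Backpropagate through x_{t+1} = x_t - lam * mu_{t+1}.
        d_lam += -gx * mu
        gmu_total = gmu - lam * gx
        # Backpropagate through mu_{t+1} = alpha * mu_t + a * x_t.
        d_alpha += gmu_total * mu_prev
        gx = gx + a * gmu_total
        gmu = alpha * gmu_total
        x, mu = x_prev, mu_prev
    return d_alpha, d_lam


d_alpha, d_lam = hypergradients(x0=1.0, mu0=0.0, alpha=0.9, lam=0.1, a=1.0, steps=20)
print(d_alpha, d_lam)
```

<p>The backward loop only ever holds one state and one pair of adjoints, so memory stays constant in the number of steps. The division by $\alpha$ in the reconstruction amplifies rounding error at every step, which is one place the stability issues can enter.</p>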
<h3 id="adam">Adam</h3>
<p>So what other optimisers are reversible? Let’s consider <a href="https://arxiv.org/pdf/1412.6980.pdf">Adam</a>, where
\begin{align}
\mu_{t+1} &= \beta_1 \mu_t + (1-\beta_1) \nabla_{x} f(x_{t}) \newline
\nu_{t+1} &= \beta_2 \nu_t + (1-\beta_2) (\nabla_{x} f(x_{t}))^2 \newline
x_{t+1} &= x_t - \lambda \frac{\mu_{t+1}}{\sqrt{\nu_{t+1}} + \epsilon}.
\end{align}
Given $x_{t+1}$, $\mu_{t+1}$ and $\nu_{t+1}$, we can easily reconstruct $x_t$ from the last line and from there, we can compute the gradient and recover $\mu_{t}$ and $\nu_{t}$. In maths
\begin{align}
x_{t} &= x_{t+1} + \lambda \frac{\mu_{t+1}}{\sqrt{\nu_{t+1}} + \epsilon} \newline
\mu_{t} &= \frac{1}{\beta_1} \left ( \mu_{t+1} - (1-\beta_1) \nabla_{x} f(x_{t}) \right ) \newline
\nu_{t} &= \frac{1}{\beta_2} \left ( \nu_{t+1} - (1-\beta_2) (\nabla_{x} f(x_{t}))^2 \right).
\end{align}
So Adam is reversible. We actually missed out the bias correction steps
\begin{align}
\mu_{t+1} &\gets \mu_{t+1} / (1 - \beta_1^{t+1}) \newline
\nu_{t+1} &\gets \nu_{t+1} / (1 - \beta_2^{t+1}).
\end{align}
You can also verify for yourself that these are reversible too.</p>
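<p>The same round-trip check works for Adam. The sketch below follows the three update equations without the bias correction steps; the toy objective $f(x) = (x - 3)^2$, the function names and the starting state are assumptions for illustration only.</p>

```python
import math


def grad_f(x):
    """Gradient of the assumed toy objective f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


def adam_step(x, mu, nu, lam=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Forward Adam update, without bias correction."""
    g = grad_f(x)
    mu_next = beta1 * mu + (1.0 - beta1) * g
    nu_next = beta2 * nu + (1.0 - beta2) * g * g
    x_next = x - lam * mu_next / (math.sqrt(nu_next) + eps)
    return x_next, mu_next, nu_next


def adam_step_reverse(x_next, mu_next, nu_next, lam=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Reverse Adam update: recover x first, then both moment estimates."""
    x = x_next + lam * mu_next / (math.sqrt(nu_next) + eps)
    g = grad_f(x)
    mu = (mu_next - (1.0 - beta1) * g) / beta1
    nu = (nu_next - (1.0 - beta2) * g * g) / beta2
    return x, mu, nu


state = (1.0, 0.5, 0.2)                      # (x, mu, nu), arbitrary values
recovered = adam_step_reverse(*adam_step(*state))
print(recovered)  # matches state up to floating-point rounding
```

<p>As with momentum, the position is recovered first because both moment updates need the gradient at $x_t$.</p>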
<h3 id="do-we-need-reversibility-in-optimisers">Do we need reversibility in optimisers?</h3>
<p>Well, no. In fact, in some ways, we would rather do without it. Optimisers are supposed to be many-to-one mappings. Starting from an infinity of initial conditions, we should converge to the global minimum of a convex function. This means we should discard information about initialisation along the way. To put it as Maclaurin et al. do:</p>
<blockquote>
<p>[O]ptimization moves a system from a high-entropy initial state to a low-entropy (hopefully zero entropy) optimized final state.</p>
</blockquote>
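<p>A tiny example of an optimiser step that is many-to-one: plain gradient descent on the quadratic $f(x) = \frac{1}{2} a x^2$ with step size $\lambda = 1/a$. The objective, the function name and the starting points are assumptions picked to make the collapse exact.</p>

```python
def gd_step(x, lam, a=1.0):
    """One gradient-descent step on f(x) = 0.5 * a * x^2, whose gradient is a * x."""
    return x - lam * a * x


starts = [-3.0, 0.5, 10.0]
ends = [gd_step(x, lam=1.0) for x in starts]
print(ends)  # every start lands on the minimum at 0.0
```

<p>In this extreme case every initialisation is mapped to the same point, so no reverse rule could recover where we started. For other step sizes or objectives, undoing a plain gradient step would mean solving the implicit equation $x_t = x_{t+1} + \lambda \nabla_x f(x_t)$ for $x_t$, which has no closed form in general.</p>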
<p>It turns out that if you set $\alpha = 0$ for the momentum method, that is, you just run gradient descent, then it is not reversible, since the reverse update divides by $\alpha$. I think this may also be true for <a href="https://www.cs.toronto.edu/~fritz/absps/momentum.pdf">Nesterov accelerated momentum</a> and <a href="http://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf">RMSProp</a>, neither of which I could make reversible (I call this <em>proof by fatigue</em>). So I’m left wondering, is reversibility just some extra curious property that can be useful sometimes, but is completely arbitrary when it comes to doing optimisation? Or is there some deeper meaning to it? Is it just some artifact of how we think of optimisation, in terms of balls rolling down hills? Maybe more interestingly, what does the lack of reversibility for standard gradient descent and Nesterov entail? Could this be another reason why Nesterov works better than classical momentum? Could we measure the information loss somehow? And if we could, what would this mean?</p>
Summary: Reversible neural architectures have been a popular research area in the last few years, but reversibility is also built into many modern day neural optimisers, perhaps serendipitously.
On the ‘invention’ of randomness, 2019-12-15, https://danielewworrall.github.io/blog/2019/12/randomness
<p><img src="/media/jaynes-himself.jpg" alt="The legend himself" height="25%" width="25%" style="float: right;margin-left: 20px;margin-top: 7px;" /></p>
<p>Recently in AMLAB we started a Jaynes reading group. <a href="https://en.wikipedia.org/wiki/Edwin_Thompson_Jaynes">E T Jaynes’</a> posthumous book and general all-round cult classic <em>Probability Theory: The Logic of Science</em> is the focus of our study. After having lectured a Bayesian statistics course for the last two years, I felt fairly confident in my understanding of the subject matter. It seems that while I am at ease with performing computations I have far from grasped Bayesianism from a conceptual, and some might say doctrinal, standpoint. And after a couple of conversations with others in a similar situation to me, it seems I may not be alone.</p>
<p>Now this book is littered with gems and Jaynes’ colourful writing style is literary gold, but I want to focus on a small snippet, which got me thinking hard about my understanding of what it means to be <em>random</em>. <strong>Jaynes essentially claims that randomness simply does not exist</strong>. It is a human invention, as he argues in Digression 3.8.1. Let’s step through the logic.</p>
<h1 id="my-truth-versus-your-truth">My truth versus your truth</h1>
<p>In true statistical tradition, we are going to consider coin tosses. Let’s assume that randomness does exist. What do I mean by randomness? I mean that when I throw the coin in the air it will land heads or tails in a 100% unpredictable fashion. Some intrinsically indeterminate process will drive the coin to come to rest in a state that is independent of how it left my fingers. In this unpredictable world an observer would be unable to make any sure judgements about the outcome. She may assume the prior probability of the coin landing heads or tails is 50%. This is not some weird quantum-y line of reasoning; the coin will land either heads or tails, but we just cannot say which beforehand. Some would argue that maintaining a uniform probability distribution over the outcome of the coin toss is really the best she can do. It is a reflection of the truth of the world she lives in.</p>
<p>Let’s now pretend that the world is 100% deterministic. Say I flip a coin and I happen to know its mass, moment of inertia, air resistance, initial momentum, etc. In this perfect world, an observer gifted with this knowledge would be able to predict with 100% accuracy whether it lands heads or tails. Even if the computations are intractable and we are reduced to brute-forcing over possible futures, we could at least agree that the outcome is calculable in principle. No randomness here. Now let’s add a twist. Let’s say that we do not tell the observer any of this privileged information about the physics of the coin-toss. In that case, the observer would be reduced to educated guesses about the outcome of the toss. Because she’s a good Bayesian, she will continue to assume the prior probability of the coin landing heads or tails is 50%, but now her motivations are different. Or are they?</p>
<p>Maintaining a probability distribution over the outcome of the coin toss is not just a cynically non-committal statement. It is a full and honest acknowledgement of her ignorance. She has not claimed that the world itself is random in any way (in fact she knows very well that it is deterministic); she is merely asserting that its physical parameters are unknown to her. Whichever way the coin lands, the outcome will be a surprise to her, because she is unable to predict it with her imperfect knowledge.</p>
<p>This seemed a bit weird to me. In moving from the synthetically random world to the deterministic one, it would seem that modelled randomness is really just a statement of observer ignorance, rather than any innate property of the universe around us. The fascinating point for me is that from the point-of-view of the observer, <em>it does not matter which world she lives in (the inherently random or the inherently deterministic), in both cases she is forced to use the same mathematical reasoning!</em> To make a linguistic distinction between the intrinsic randomness of the outside world and the observer’s perceived randomness, we will term them <em>intrinsic randomness</em> and <em>epistemic randomness</em>. Epistemic randomness is the uncertainty I have over the outside world because I simply do not have enough information. It seems that epistemic randomness is uncontroversial. Intrinsic randomness, on the other hand, is the very much controversial concept that Nature itself is indeterminate and unpredictable.</p>
<p>Now this kind of thinking leads to a vast array of new questions. Could an observer ever determine whether the world she lives in is intrinsically random or just deterministic, given that her tools to analyse both are the same? If intrinsic randomness does not exist, is it then some kind of invention? What does this mean for the interpretation of stochastically derived quantities such as aleatoric and epistemic uncertainty? What about free will? (This last question is a little cliché, but people seem to like talking about free will). <strong>I plan to devote a follow up blog to aleatoric and epistemic uncertainties.</strong></p>
<h1 id="my-gripe-with-jaynes">My gripe with Jaynes</h1>
<p>Those acquainted with Jaynes’ writings will be all too familiar with his grandiloquent rhetorical style. He really seems to believe that intrinsic randomness does not exist, and that believing in it is a form of what he calls the <em>mind projection fallacy</em>.</p>
<blockquote>
<p>The belief that ‘randomness’ is some kind of real property existing in Nature is a form of the mind projection fallacy which says, in effect, ‘I don’t know the detailed causes – therefore – Nature does not know them.’</p>
</blockquote>
<p>Of course <em>he</em> would put (intrinsic) <em>randomness</em> in quotation marks. Mind projection fallacies, a very Jaynesian invention, are assertions wherein an observer states that how they see the world really is reality. My reality is your reality and everyone else’s too. That someone should disagree on the nature of Nature itself is simple idiocy. Now my gripe with Jaynes’ stance is that he himself appears to be stuck in a mind projection fallacy of his own. His assertion that intrinsic randomness does not exist is a projection of his reality on to the reader. Just for fun here is another, and I daresay somewhat salacious, Jaynes quote</p>
<blockquote>
<p>For some, declaring a problem to be ‘randomized’ is an incantation with the same purpose and effect as those uttered by an exorcist to drive out evil spirits; i.e. it cleanses their subsequent calculations and renders them immune to criticism. We agnostics often envy the True Believer, who thus acquires so easily that sense of security which is forever denied to us.</p>
</blockquote>
<h3 id="quantum-weirdness">Quantum weirdness</h3>
<p><strong>Disclaimer: in this next bit I talk about physics, but be under no illusions, I am far from knowledgeable on this subject.</strong></p>
<p>So is the world deterministic? As far as I know the main source of randomness in the world arises from the depths of the sub-atomic quantum world. The quantum world is very strange in that very small objects, called particles, such as electrons and neutrons, are described probabilistically and not using our everyday Newtonian world-view. Particles are seen to exhibit <em>wave-particle duality</em>, a behaviour where they act like waves, being able to be in multiple locations at once, until they are observed. At that point we observe a bizarre effect called <em>wavefunction collapse</em>, where they assume a definite location in space. Wavefunction collapse has always been an incomprehensible phenomenon to me. Why should the act of observation change the nature of the underlying physics?</p>
<p>The central apparatus for modelling particles is a <em>wavefunction</em>, a function extending over all space and time. The squared modulus of the wavefunction gives the probability density for observing a particle at a specific location and time if a measurement is taken. One of the big problems in quantum mechanics is understanding how to interpret the wavefunction. Is it a fundamental physical object? If so, it is very strange indeed, since in the classical setting where we model big objects, we never observe an object to be in two places at once. In the quantum world, this happens by necessity.</p>
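<p>For a single particle in one dimension, the textbook form of this statement (the Born rule) can be written as below. The symbols $\psi$, $a$ and $b$ are notation assumed here for illustration.
\begin{align}
P(a \leq x \leq b \text{ at time } t) &= \int_a^b |\psi(x, t)|^2 \, \mathrm{d}x \newline
\int_{-\infty}^{\infty} |\psi(x, t)|^2 \, \mathrm{d}x &= 1.
\end{align}
So $|\psi|^2$ plays the role of a probability density over position, and probabilities come from integrating it over a region.</p>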
<p>One school of thinking, in fact the one I learnt in secondary school (that’s high school for everyone else), is that the wavefunction is a fundamental object of nature. It is part of reality. This is the (in)famous <em>Copenhagen interpretation</em> of quantum mechanics, which as it turns out is neither universally accepted nor universally rejected by physicists. It is an unsettling idea, but if true it would indicate that our world is very weird indeed. Einstein famously rejected this interpretation, claiming “God does not play dice”. Some detractors do indeed subscribe to deterministic <em>hidden variable theories</em>, as I believe did Jaynes, in which random quantum effects are just disturbances caused by so-called hidden variables, unobserved by us humans (so far). But I think that one flavour of hidden variable theory, called local hidden variable theories, has already been ruled out by what are known as the Bell test experiments. A much more promising route is <em>QBism</em>, pronounced <em>cubism</em> like the artistic movement, championed by Christopher Fuchs. QBism was originally called Quantum Bayesianism, but apparently many hardcore Bayesians have pointed out it is not strictly Bayesian and QBism sounds cooler anyway. In this theory particles occupy one position at any one time, the wavefunction is just an expression of observer ignorance, and wavefunction collapse is analogous to making a measurement and updating our beliefs. Mystery resolved…if it’s true.</p>
<p>So whether intrinsic randomness does exist is an open and potentially unanswerable question. At the least, determining whether intrinsic randomness is possible would require us to step outside of the rôle of observer, a purely unscientific practice. All observers are bound to make world inferences via the methods of epistemic randomness and are thus at risk of succumbing to Jaynesian mind projection fallacies. Some might go as far as to say that if something is unverifiable by experiment it cannot <em>exist</em>. Enter die-hard scientific methodists (theological pun intended). On a side note, the question of existence seems to be an argument I get into a lot nowadays with post-structuralists from the arts and humanities, who always tell me that scientists are mistaken in believing that what they study exists.</p>
<h2 id="inevitably-after-reading-a-story-about-flipping-coins-i-seem-to-have-found-myself-in-quite-the-bind-if-i-have-come-to-understand-anything-it-is-that-you-can-never-be-safe-in-thinking-you-know-anything">Inevitably after reading a story about flipping coins, I seem to have found myself in quite the bind. If I have come to understand anything, it is that you can never be safe in thinking you know anything.</h2>