1 Introduction
In recent years there has been an increasing interest in methods for Bayesian inference which are scalable to Big Data settings. Contrary to optimisation-based or maximum-likelihood settings, where one looks for a single point estimate of the parameters, Bayesian methods attempt to obtain a characterisation of the full posterior distribution over the unknown parameters and latent variables in the model. This approach allows for a better characterisation of the uncertainties inherent in the learning process, as well as providing protection against overfitting.
One of the most widely used classes of methods for Bayesian posterior inference is Markov chain Monte Carlo (MCMC). This class of algorithms mixes slowly in complex, high-dimensional models and scales poorly to large data sets [3]. In order to deal with these issues, a lot of effort has been placed on developing MCMC methods that provide more efficient exploration of the posterior, such as Hamiltonian Monte Carlo (HMC) [6, 16] and its Riemannian manifold variant [12].
Stochastic gradient variants of such continuous-dynamics samplers have been shown to scale very well with the size of the data set, as at each iteration they use data subsamples (also called minibatches) rather than the full dataset. Stochastic gradient Langevin dynamics (SGLD) [21] was the first algorithm of this kind, showing that adding the right amount of noise to a standard stochastic gradient optimisation algorithm leads to sampling from the true posterior as the step size is decreased to zero. Since its introduction, a number of articles have extended this idea to different samplers [4, 15, 5], as well as carefully studying the behaviour of the mean square error (MSE) of SGLD for decreasing step sizes and for a fixed step size [20, 19]. The common conclusion of these papers is that the MSE is of order $\epsilon^2$ for a computational cost of order $\epsilon^{-3}$ (as opposed to the $\epsilon^{-2}$ rate of MCMC).
The basic idea of the Multilevel Monte Carlo methodology is to use a cascade of decreasing step sizes. If the different levels of the algorithm are appropriately coupled, one can reduce the computational complexity without a loss of accuracy.
In this paper, we develop a Multilevel SGLD (MLSGLD) algorithm with computational complexity of $\mathcal{O}(\epsilon^{-2})$, hence closing the gap between MCMC and stochastic gradient methods. The underlying idea is based on [18]; our extensions are:

We build an antithetic version of MLSGLD which removes the logarithmic term present in [18] and makes the algorithm competitive with MCMC.

We consider the scaling of the computational cost with the required accuracy as well as with the number of data items $N$. By using a Taylor-based stochastic gradient, we obtain sublinear growth of the cost in $N$.

By introducing additional time averages, we can speed up the algorithm further.
The underlying idea is close in spirit to [1], where expectations with respect to the invariant distribution of an infinite-dimensional Markov chain are estimated based on couplings of approximations.
This article is organised as follows. In Section 2, we provide a brief description of the SGLD algorithm and of the MLMC methodology, to the extent that allows us to sketch in Section 3 how these two ideas can be combined in an efficient way. There we describe three new variants of multilevel SGLD with favourable computational complexity properties, and we study their numerical performance in Section 4. The numerical experiments demonstrate that our algorithm is indeed competitive with MCMC methods, which is reflected in the concluding remarks in Section 5.
2 Preliminaries
2.1 Stochastic Gradient Langevin Dynamics
Let $\theta \in \mathbb{R}^d$ be a parameter vector, let $\pi_0(\theta)$ denote a prior density, and let the density $\pi(y \mid \theta)$ of a data item $y$ be parametrised by $\theta$. By Bayes' rule, the posterior distribution of a set of $N$ data items $Y = (y_i)_{i=1}^{N}$ is given by

$$\pi(\theta \mid Y) \propto \pi_0(\theta) \prod_{i=1}^{N} \pi(y_i \mid \theta).$$

The following stochastic differential equation (SDE) is ergodic with respect to the posterior:

$$d\theta_t = \tfrac{1}{2} \nabla \log \pi(\theta_t \mid Y)\, dt + dW_t, \qquad (1)$$

where $W_t$ is a $d$-dimensional standard Brownian motion. In other words, the probability distribution of $\theta_t$ converges to $\pi(\cdot \mid Y)$ as $t \to \infty$. Thus, the simulation of (1) provides an algorithm to sample from $\pi(\cdot \mid Y)$. Since an explicit solution to (1) is rarely known, we need to discretise it. An application of the Euler scheme yields

$$\theta_{k+1} = \theta_k + \tfrac{h}{2} \nabla \log \pi(\theta_k \mid Y) + \sqrt{h}\, \xi_k,$$

where $\xi_k$ is a standard Gaussian random variable on $\mathbb{R}^d$. However, this algorithm is computationally expensive, since each step involves computations on all $N$ items in the dataset. The SGLD algorithm circumvents this problem by replacing the sum of the likelihood terms by an appropriately reweighted subsampled sum, which gives the recursion

$$\theta_{k+1} = \theta_k + \tfrac{h}{2} \Big( \nabla \log \pi_0(\theta_k) + \tfrac{N}{s} \sum_{i \in \tau_k} \nabla \log \pi(y_i \mid \theta_k) \Big) + \sqrt{h}\, \xi_k, \qquad (2)$$

with $\xi_k$ a standard Gaussian random variable on $\mathbb{R}^d$ and $\tau_k$ a random subset of $\{1, \dots, N\}$ of size $s$, generated for example by sampling with or without replacement from $\{1, \dots, N\}$. Notice that this corresponds to a noisy Euler discretisation. In the original formulation of SGLD in [21], decreasing step sizes $h_k$ were used in order to obtain an asymptotically unbiased estimator. However, the RMSE is then only of order $\epsilon$ for a computational cost of order $\epsilon^{-3}$ [19].

2.2 Multilevel Monte Carlo
Consider the problem of approximating $\mathbb{E}[X]$, where $X$ is a random variable. In practically relevant situations we cannot sample from $X$, but often we can approximate it by another random variable $X_h$ with an associated cost $\mathrm{cost}(X_h)$, which goes to infinity as the discretisation parameter $h$ decreases. At the same time the bias $|\mathbb{E}[X_h] - \mathbb{E}[X]|$ vanishes as $h \to 0$, so we can obtain a better approximation, but at a higher cost. The typical biased estimator of $\mathbb{E}[X]$ then has the form

$$\widehat{X}_h = \frac{1}{M} \sum_{m=1}^{M} X_h^{(m)}. \qquad (3)$$

Consequently, the cost of evaluating the estimator is proportional to $M \cdot \mathrm{cost}(X_h)$. According to the central limit theorem, we need to set $M = \mathcal{O}(\epsilon^{-2})$ to get the standard deviation of the estimator below $\epsilon$.

Now consider just two approximations $X_{h_0}$ and $X_{h_1}$ with $h_1 < h_0$. It is clear that the cost of one sample of $X_{h_1}$ is roughly proportional to $h_1^{-1}$. Then, based on the identity

$$\mathbb{E}[X_{h_1}] = \mathbb{E}[X_{h_0}] + \mathbb{E}[X_{h_1} - X_{h_0}],$$

we can estimate the two terms on the right-hand side separately. The crucial observation is that if $X_{h_0}$ and $X_{h_1}$ are sampled in a coupled way, so that the variance of the difference $X_{h_1} - X_{h_0}$ is small, then only a few expensive fine samples are needed for the correction term, while the cheap coarse term absorbs most of the sampling effort; balancing the number of samples at the two levels against their respective variances and costs reduces the overall cost of the Monte Carlo estimator. The idea behind this method was introduced and analysed in [14]. This approach has been independently developed by Giles in the seminal work [9], where the MLMC method has been introduced in the setting of stochastic differential equations.
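The two-level argument can be illustrated numerically. The following sketch (with an Ornstein–Uhlenbeck test SDE, and all step sizes and sample counts chosen purely for illustration) couples a fine and a coarse Euler scheme through shared Brownian increments, and shows that the coupled difference has far smaller variance than the fine path itself:

```python
import numpy as np

def coupled_two_level(n_samples, n_coarse=8, T=1.0, seed=0):
    """Estimate E[X_T] for the OU process dX = -X dt + dW, X_0 = 1,
    with a coarse Euler scheme plus a coupled fine-coarse correction."""
    rng = np.random.default_rng(seed)
    h_c = T / n_coarse
    h_f = h_c / 2.0
    xc = np.ones(n_samples)     # independent coarse-only samples
    xf2 = np.ones(n_samples)    # fine path of the coupled pair
    xc2 = np.ones(n_samples)    # coarse path of the coupled pair
    for _ in range(n_coarse):
        dw1 = rng.normal(0.0, np.sqrt(h_f), n_samples)
        dw2 = rng.normal(0.0, np.sqrt(h_f), n_samples)
        # two fine half-steps
        xf2 = xf2 - xf2 * h_f + dw1
        xf2 = xf2 - xf2 * h_f + dw2
        # one coarse step reusing the summed Brownian increment (coupling)
        xc2 = xc2 - xc2 * h_c + (dw1 + dw2)
        # independent coarse-only estimator
        xc = xc - xc * h_c + rng.normal(0.0, np.sqrt(h_c), n_samples)
    correction = xf2 - xc2
    return xc.mean() + correction.mean(), correction.var(), xf2.var()
```

Because the correction term has small variance, far fewer fine samples would be needed in practice; here both levels use equally many samples only for simplicity.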
MLMC takes this idea further by using independent clouds of simulations at approximations of different resolutions. This allows the recovery of the $\mathcal{O}(\epsilon^{-2})$ complexity of an unbiased Monte Carlo estimator (i.e. MSE of order $\epsilon^2$). The idea of MLMC begins by exploiting the following telescoping identity:

$$\mathbb{E}[X_L] = \mathbb{E}[X_0] + \sum_{\ell=1}^{L} \mathbb{E}[X_\ell - X_{\ell-1}]. \qquad (4)$$

In our context $X_\ell = f(\theta^{(\ell)}_{k_\ell})$, with $\theta^{(\ell)}$ defined by the SGLD recursion (2) run with step size $h_\ell$, and $k_\ell$ being the final time index of the SGLD sample at level $\ell$. We consider the MLMC estimator

$$\widehat{X} = \frac{1}{M_0}\sum_{m=1}^{M_0} X_0^{(0,m)} + \sum_{\ell=1}^{L} \frac{1}{M_\ell}\sum_{m=1}^{M_\ell} \Big( X_\ell^{(\ell,m)} - X_{\ell-1}^{(\ell,m)} \Big),$$

where $X^{(\ell,m)}$ are independent samples at level $\ell$. The inclusion of the level $\ell$ in the superscript indicates that independent samples are used at each level and between levels. Thus, these samples can be generated in parallel.
The efficiency of MLMC lies in the coupling of $X_\ell$ and $X_{\ell-1}$ that results in a small variance $\mathrm{Var}(X_\ell - X_{\ell-1})$. In particular, for the SDE in (1), one can use the same Brownian path to simulate $X_\ell$ and $X_{\ell-1}$, which, through the strong convergence property of the scheme, yields an estimate for $\mathbb{E}|X_\ell - X_{\ell-1}|^2$. More precisely, it is shown in Giles [9] that under assumptions of the form²

$$|\mathbb{E}[X_\ell - X]| \le c_1 h_\ell^{\alpha}, \qquad \mathrm{Var}(X_\ell - X_{\ell-1}) \le c_2 h_\ell^{\beta}, \qquad (5)$$

for some $\alpha, \beta > 0$, a root mean square accuracy $\epsilon$ can be achieved at a computational cost proportional to $\epsilon^{-2}$ (up to logarithmic factors, depending on the relation between $\beta$ and the per-level cost), where the cost of the algorithm at each level is of order $h_\ell^{-1}$.

² Recall $h_\ell$ denotes the step size of the algorithm (2).
The main difficulties in extending this approach to the SGLD algorithm are: a) the fact that the simulation time grows without bound, and therefore all estimates need to hold uniformly in time; b) coupling the SGLD dynamics across the different levels in time; and c) coupling the subsampling across the different levels. All of these problems need serious consideration, as naive attempts to deal with them might leave (4) unsatisfied, hence violating the core principle of the MLMC methodology.
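For concreteness, the plain SGLD recursion (2) can be sketched in a few lines; the one-dimensional Gaussian model and all numerical values below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def sgld(y, h, n_steps, batch_size, seed=0):
    """Plain SGLD, recursion (2), for a toy model with prior
    theta ~ N(0, 1) and likelihood y_i ~ N(theta, 1), so that
    grad log pi_0(theta) = -theta and grad log pi(y_i|theta) = y_i - theta."""
    rng = np.random.default_rng(seed)
    N = len(y)
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        idx = rng.choice(N, size=batch_size, replace=False)
        # unbiased gradient estimate: prior term + (N/s) * batch sum
        grad = -theta + (N / batch_size) * np.sum(y[idx] - theta)
        theta = theta + 0.5 * h * grad + np.sqrt(h) * rng.normal()
        samples.append(theta)
    return np.array(samples)
```

For this conjugate model the posterior is N(sum(y)/(N+1), 1/(N+1)), so the long-run average of the chain should be close to sum(y)/(N+1), up to the discretisation and subsampling bias discussed above.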
3 Stochastic Gradient based on Multilevel Monte Carlo
In the following we present a strategy for combining the two main ideas discussed above in order to obtain new variants of the SGLD method. In particular, we are interested in coupling discretisations of (1) based on geometrically decreasing step sizes $h_\ell = h_0 2^{-\ell}$. Because we are interested in computing expectations with respect to the invariant measure $\pi$, we also increase the time endpoint $T_\ell$ with the level, chosen such that $T_\ell \to \infty$ as $\ell \to \infty$. Thus, the number of steps at level $\ell$ is $T_\ell / h_\ell$.
We introduce notation for a single SGLD step, in which $\xi$ denotes the Gaussian noise and $\tau$ the index set of the data batch, and we would like to exploit the telescopic sum (4) in this setting. We have the additional difficulty of different step sizes and different simulation times at the fine and coarse levels. First, the fine path initially evolves uncoupled for a number of time steps. The coupling then arises by evolving the fine and coarse paths jointly, over a time interval of length $T$, by doing two steps on the finer level, denoted by $\theta^f$ (with time step $h/2$), for every one step on the coarser level, denoted by $\theta^c$ (with time step $h$), using in the coarse step the appropriately averaged Gaussian input.
This coupling makes use of the underlying contraction (Equation (6)), as illustrated in Figure 1. The property that we use is that solutions to (1) started from two different initial conditions $\theta_0$ and $\theta_0'$, but driven by the same noise, satisfy

$$\mathbb{E}\,|\theta_t - \theta_t'|^2 \le e^{-ct}\,|\theta_0 - \theta_0'|^2 \qquad (6)$$

for some $c > 0$. In [18, 7] it is shown that (6) holds if the posterior is strongly log-concave, and that it is also satisfied by the numerical discretisation. This condition is sufficient but not necessary: numerically the contraction is observed for a much wider class of problems, and the analysis can be extended by considering more complicated couplings, such as the reflection coupling [8]. The shifting coupling used here was introduced in [13] for coupling Markov chains.
This property implies that the variance of the difference between the coupled fine and coarse paths, for suitably chosen simulation times, remains small, thus allowing an application of the MLMC methodology. In what follows we drop indices from the notation where this causes no confusion.
3.1 Multilevel SGLD
As is common in MLMC, we couple fine and coarse paths through the Brownian increments, with a Brownian increment on the coarse path given as a scaled sum of the increments on the fine path, which can be written in our notation as

$$\xi^c = \frac{\xi^{f,1} + \xi^{f,2}}{\sqrt{2}}, \qquad (7)$$

where $\xi^{f,1}$ and $\xi^{f,2}$ are the Gaussian inputs of the two fine steps spanned by one coarse step.
One question that naturally arises is if and how one should couple the subsampling of the data between the levels. In particular, in order for the telescopic sum to be respected, the law of the data subsampling must be the same at every level, namely

$$\mathcal{L}(\tau^c) = \mathcal{L}(\tau^f). \qquad (8)$$
In order for this condition to hold, we first take $s$ independent samples $\tau^{f,1}$ on the first fine step and another $s$ independent samples $\tau^{f,2}$ on the second fine step. To ensure that Equation (8) holds, we then create $\tau^c$ by drawing $s$ samples without replacement from the union $\tau^{f,1} \cup \tau^{f,2}$. Other strategies are also possible, and we refer the reader to [18].
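A sketch of this subsampling coupling (assuming the fine batches are drawn i.i.d. with replacement, which makes the marginal law of the coarse batch match (8)):

```python
import numpy as np

def coupled_batches(N, s, rng):
    """Draw the two fine-step batches independently with replacement,
    then build the coarse batch by drawing s indices without replacement
    from the pooled 2s fine indices. Since the pool entries are i.i.d.
    uniform on {0, ..., N-1}, the s retained entries are again i.i.d.
    uniform, so the coarse batch has the same law as a fine one."""
    tau_f1 = rng.choice(N, size=s, replace=True)
    tau_f2 = rng.choice(N, size=s, replace=True)
    pool = np.concatenate([tau_f1, tau_f2])
    tau_c = rng.choice(pool, size=s, replace=False)
    return tau_f1, tau_f2, tau_c
```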

The initial steps are characterised by evolving the fine path on its own for the first part of the trajectory. Then set the coarse path equal to the fine one, $\theta^c = \theta^f$, and simulate the two paths jointly according to the coupled recursion (9), using the shared Gaussian inputs (7) and the coupled subsamples described above, and finally set the level-$\ell$ difference to $f(\theta^f) - f(\theta^c)$.
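Since the displayed recursion (9) is not reproduced here, the following is only a sketch of one coupled fine/coarse step under the coupling just described: two fine steps of size h/2 share their Gaussian inputs and subsamples with one coarse step of size h. The interface `grad_est(theta, idx)` is a hypothetical placeholder for any unbiased stochastic-gradient estimator of the log-posterior gradient:

```python
import numpy as np

def coupled_step(theta_f, theta_c, h, grad_est, N, s, rng):
    """One coupled fine/coarse MLSGLD step (a sketch of recursion (9)):
    two fine steps of size h/2 share noise and subsamples with one
    coarse step of size h."""
    hf = h / 2.0
    tau1 = rng.choice(N, size=s, replace=True)
    tau2 = rng.choice(N, size=s, replace=True)
    xi1 = rng.standard_normal(theta_f.shape)
    xi2 = rng.standard_normal(theta_f.shape)
    # two fine steps
    theta_f = theta_f + 0.5 * hf * grad_est(theta_f, tau1) + np.sqrt(hf) * xi1
    theta_f = theta_f + 0.5 * hf * grad_est(theta_f, tau2) + np.sqrt(hf) * xi2
    # one coarse step with averaged noise (7) and a coupled batch (8)
    tau_c = rng.choice(np.concatenate([tau1, tau2]), size=s, replace=False)
    xi_c = (xi1 + xi2) / np.sqrt(2.0)
    theta_c = theta_c + 0.5 * h * grad_est(theta_c, tau_c) + np.sqrt(h) * xi_c
    return theta_f, theta_c
```

Note that the coarse noise sqrt(h)*(xi1 + xi2)/sqrt(2) has the same total variance as the two fine contributions sqrt(h/2)*xi1 + sqrt(h/2)*xi2, which is exactly the consistency required by (7).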
3.2 Antithetic Multilevel SGLD
Here we present the most promising variant of the coupling of the subsampling: Algorithm 2. Building on the ideas developed in [11] (see also [10]), we propose Antithetic Multilevel SGLD, which achieves an MSE of order $\epsilon^2$ at a computational cost of order $\epsilon^{-2}$ (and therefore allows for MLMC with random truncation, see [1]).

The initial steps are characterised by evolving the fine paths on their own for the first part of the trajectory. Then set the coarse path equal to the fine ones and simulate all paths jointly according to the antithetic coupled recursion (10), in which the two fine paths consume the two subsamples of each coarse step in opposite orders, and finally set the fine-level quantity to the average of the two antithetic fine paths before forming the level difference.
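A sketch of the antithetic construction, in the spirit of [11]: a second fine path consumes the two subsamples in swapped order, and the fine-level quantity is the average of the two fine paths. This is an illustrative rendering, not the exact recursion (10); `grad_est` is the same hypothetical interface as before:

```python
import numpy as np

def antithetic_coupled_step(tf1, tf2, tc, h, grad_est, N, s, rng):
    """One antithetic coupled step: two fine paths use the same noise
    but consume the subsamples (tau1, tau2) in opposite orders; the
    coarse path uses the averaged noise and a coupled batch."""
    hf = h / 2.0
    tau1 = rng.choice(N, size=s, replace=True)
    tau2 = rng.choice(N, size=s, replace=True)
    xi1 = rng.standard_normal(tf1.shape)
    xi2 = rng.standard_normal(tf1.shape)
    # first fine path: batches in order (tau1, tau2)
    tf1 = tf1 + 0.5 * hf * grad_est(tf1, tau1) + np.sqrt(hf) * xi1
    tf1 = tf1 + 0.5 * hf * grad_est(tf1, tau2) + np.sqrt(hf) * xi2
    # antithetic fine path: batches in swapped order (tau2, tau1)
    tf2 = tf2 + 0.5 * hf * grad_est(tf2, tau2) + np.sqrt(hf) * xi1
    tf2 = tf2 + 0.5 * hf * grad_est(tf2, tau1) + np.sqrt(hf) * xi2
    # coarse step with averaged noise and a batch from the pooled indices
    tau_c = rng.choice(np.concatenate([tau1, tau2]), size=s, replace=False)
    tc = tc + 0.5 * h * grad_est(tc, tau_c) + np.sqrt(h) * (xi1 + xi2) / np.sqrt(2.0)
    return tf1, tf2, tc
```

The level difference is then 0.5*(f(tf1) + f(tf2)) - f(tc); the antithetic cancellation between the two fine paths is what improves the variance decay.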
3.3 Averaging the Path
Compared to MCMC, these algorithms seem wasteful because only the last step of a long simulation is used. The numerical performance can be improved by instead averaging over part of the trajectory, i.e. replacing $f(\theta_k)$ with

$$\frac{1}{k - k_0} \sum_{j=k_0+1}^{k} f(\theta_j)$$

for a suitable burn-in index $k_0$; this also applies, with the appropriate modifications, to the antithetic version.
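The effect of averaging the tail of a trajectory instead of keeping only its last state can be seen on a toy AR(1) chain (a stand-in for an SGLD trajectory; all constants are illustrative):

```python
import numpy as np

def last_vs_averaged(n_chains=2000, n_steps=400, burn_in=200, seed=0):
    """Compare the variance of a last-iterate estimator with a
    time-averaged estimator on a toy AR(1) chain. Averaging the tail
    of each trajectory reduces the estimator variance at no extra
    simulation cost."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_chains)
    tail_sum = np.zeros(n_chains)
    for k in range(n_steps):
        x = 0.9 * x + rng.normal(0.0, 0.1, n_chains)
        if k >= burn_in:
            tail_sum += x
    last = x                                  # last-iterate, per chain
    avg = tail_sum / (n_steps - burn_in)      # time-averaged, per chain
    return last.var(), avg.var()
```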
3.4 Taylor-based Stochastic Gradient
The idea of the Taylor-based stochastic gradient is to subsample only the remainder of a Taylor approximation of the likelihood term,

$$\widehat{\nabla \log \pi}(\theta) = \nabla \log \pi_0(\theta) + \sum_{i=1}^{N} \big[ g_i(\theta_*) + H_i(\theta_*)(\theta - \theta_*) \big] + \frac{N}{s} \sum_{i \in \tau} \big[ g_i(\theta) - g_i(\theta_*) - H_i(\theta_*)(\theta - \theta_*) \big], \qquad (11)$$

where $g_i = \nabla_\theta \log \pi(y_i \mid \theta)$ and $H_i = \nabla^2_\theta \log \pi(y_i \mid \theta)$; the first sum is affine in $\theta$ and can be precomputed. We expect the Taylor-based stochastic gradient to have small variance for $\theta$ close to the expansion point $\theta_*$. The idea of subsampling the remainder was originally introduced in [2]. By interpolating between the Taylor-based stochastic gradient and the standard stochastic gradient we obtain the best of both worlds.
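A sketch of the Taylor-based gradient of the likelihood part: the per-item gradients and Hessians at the expansion point theta_star are precomputed once, their affine Taylor part is summed over the full data set, and only the remainder is subsampled. The `grad_i(theta, i)` interface is a hypothetical placeholder for the i-th per-item gradient; the prior gradient would be added separately:

```python
import numpy as np

def taylor_grad(theta, theta_star, grads_star, hessians_star, grad_i, tau, N):
    """Taylor-based stochastic gradient of the likelihood part, a
    sketch of (11). grads_star[i] and hessians_star[i] hold g_i and
    H_i evaluated at theta_star (precomputed once)."""
    s = len(tau)
    # full-data part: affine in theta, so precomputable
    g0 = grads_star.sum(axis=0)
    H0 = hessians_star.sum(axis=0)
    full = g0 + H0 @ (theta - theta_star)
    # subsampled Taylor remainder
    rem = np.zeros_like(theta)
    for i in tau:
        lin = grads_star[i] + hessians_star[i] @ (theta - theta_star)
        rem += grad_i(theta, i) - lin
    return full + (N / s) * rem
```

For a model whose per-item gradients are affine in theta (e.g. a Gaussian likelihood), the remainder vanishes and the estimator reproduces the full-data gradient exactly; in general its variance shrinks as theta approaches theta_star.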
4 Experiments
We use Bayesian logistic regression as a testbed for our newly proposed methodology and perform a simulation study. The data is modelled by

$$p(y_i = 1 \mid x_i, \theta) = \frac{1}{1 + \exp(-\theta^\top x_i)}, \qquad (12)$$

where $y_i \in \{-1, 1\}$ are the responses and $x_i$ are fixed covariates. We put a Gaussian prior on $\theta$. By Bayes' rule the posterior satisfies

$$\pi(\theta \mid Y) \propto \pi_0(\theta) \prod_{i=1}^{N} p(y_i \mid x_i, \theta).$$

We consider a range of dataset sizes $N$, with the covariates $x_i$ given by a fixed sample drawn once before the simulation.
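A minimal stochastic-gradient routine for this model (with labels in {-1, +1}, using the identity grad log p(y_i | x_i, theta) = y_i x_i sigma(-y_i theta^T x_i); the unit prior variance is an illustrative assumption):

```python
import numpy as np

def log_post_grad(theta, X, y, batch, prior_var=1.0):
    """Stochastic gradient of the log posterior of the logistic
    regression model (12), reweighting a minibatch by N/s."""
    N = X.shape[0]
    s = len(batch)
    Xb, yb = X[batch], y[batch]
    z = yb * (Xb @ theta)
    sig = 1.0 / (1.0 + np.exp(z))          # sigmoid(-z)
    grad_lik = (yb[:, None] * Xb * sig[:, None]).sum(axis=0)
    return -theta / prior_var + (N / s) * grad_lik
```

This routine can be plugged in as the `grad_est` of the coupled steps above, or used as-is inside the plain SGLD recursion (2).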
It is reasonable to start the individual SGLD trajectories at a mode of the target distribution. This means that we set the initial value to be the MAP estimator

$$\theta_* = \arg\max_\theta\, \pi(\theta \mid Y),$$

which is approximated using the Newton–Raphson method. In the following we disregard the cost of these preliminary computations, which could be reduced using state-of-the-art optimisation and by evaluating the Hessian in parallel. We then use MCMC and the newly developed MLSGLD to estimate the averaged squared distance from the MAP estimator under the posterior, i.e. we set

$$f(\theta) = \|\theta - \theta_*\|^2. \qquad (13)$$

Notice that by posterior consistency properties we expect this quantity to behave like $\mathcal{O}(N^{-1})$, which is why we will consider the relative MSE.
4.1 Illustration of the Coupling: Standard, Antithetic, and with Taylor
We choose the step sizes $h_\ell$ and simulation times $T_\ell$ as above and leave the initial level as a tuning parameter. The crucial ingredient here is that, in expectation, the coarse and fine paths initially approach each other exponentially fast and then reach an asymptote, with the asymptote decaying as the step size decays. This is illustrated in panel (a). As the performance of any MLMC algorithm is affected by the order of the variance decay, the parameters should be chosen such that the difference between the paths reaches the asymptote, but preferably does not spend too much time at it, as this increases the computational cost of sampling those paths. In panel (c) we see that Algorithm 2 provides a better coupling, with second-order variance decay, which is significantly better than the first-order variance decay given by Algorithm 1. Combining Algorithm 2 with the Taylor-based extension from Section 3.4 and path averaging from Section 3.3 gives an additional decrease in the variance without affecting the rate. The faster variance decay leads to a lower overall complexity, as the number of samples at each level is proportional to the variance at that level. The decay rates of the mean differences are of the same order across the variants, as can be seen in panel (b), but once again Algorithm 2 combined with Taylor and path averaging is preferable, as its multiplicative constant is lower than that of Algorithm 1.
The numerical evidence presented here leads to the conclusion that Antithetic MLSGLD with Taylor, along with Antithetic MLSGLD with Taylor and averaging, are the strongest competitors to MCMC, so we proceed to a comparison of these algorithms.
4.2 Comparison with MCMC
We choose the Metropolis-Adjusted Langevin Algorithm (MALA, see [17]) as a competitor because it is based on one Euler step of the Langevin SDE, but adds a Metropolis accept-reject step in order to preserve the correct invariant measure (removing the requirement to decrease the step size for better accuracy). We take as cost the number of evaluations of data items, which is typically measured in epochs; one epoch corresponds to one effective iteration through the full data set. Heuristically, for this log-concave problem we expect the convergence rate to be independent of $N$, so the only cost increase is due to evaluating the posterior density and its gradient. This agrees with the findings in panel (a), where the MCMC lines are almost on top of each other, thus yielding the same relative MSE for the same number of epochs across different dataset sizes. As $N$ increases, the cost per epoch increases proportionally to $N$. We run MALA, with a burn-in period and a step size tuned to the optimal acceptance rate, 50 times and then average. The various MLSGLD algorithms are run 50 times to achieve the prescribed relative accuracies. This is yet another advantage of the MLMC paradigm, which allows us to control numerically the mean increments and the variances at all levels, thus stopping the algorithm when it has converged numerically. The most important comparison is presented in panel (b), where we compare the growth of the complexity required to achieve a fixed relative accuracy with respect to the dataset size. We observe sublinear growth of the cost with respect to the dataset size for Antithetic MLSGLD with Taylor and for Antithetic MLSGLD with Taylor and averaging, with the latter behaving slightly better than the former.
5 Conclusion
We have developed a Multilevel SGLD algorithm with computational complexity of $\mathcal{O}(\epsilon^{-2})$, hence closing the gap between MCMC and stochastic gradient methods. Moreover, this algorithm scales sublinearly with respect to the dataset size and allows natural parallelisation, due to the typical properties of Monte Carlo sampling. The benefits of parallelisation are to be studied in future work, along with further numerical investigations of adaptive choices of the algorithm's parameters. In further studies we also plan to quantify analytically the gains given by the MLSGLD algorithm and to extend its applicability to a larger class of models.
References
 [1] S. Agapiou, G. O. Roberts, and S. J. Vollmer. Unbiased Monte Carlo: posterior estimation for intractable/infinite-dimensional models. arXiv preprint, 2014. To appear in Bernoulli.
 [2] C. Andrieu and S. Yildirim. Facilitating the penalty method for MCMC with large data. 2015.
 [3] R. Bardenet, A. Doucet, and C. Holmes. On Markov chain Monte Carlo methods for tall data. ArXiv e-prints, May 2015.

 [4] T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proc. International Conference on Machine Learning, June 2014.
 [5] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3203–3211. Curran Associates, Inc., 2014.
 [6] S. Duane, A. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
 [7] A. Durmus and E. Moulines. Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm. ArXiv e-prints, July 2015.
 [8] A. Eberle. Reflection coupling and Wasserstein contractivity without convexity. Comptes Rendus Mathematique, 2011.
 [9] M. B. Giles. Multilevel Monte Carlo path simulation. Oper. Res., 2008.
 [10] M. B. Giles. Multilevel Monte Carlo methods. Acta Numer., 24:259–328, 2015.
 [11] M. B. Giles and L. Szpruch. Antithetic multilevel Monte Carlo estimation for multidimensional SDEs without Lévy area simulation. Ann. Appl. Probab., 24(4):1585–1620, 2014.
 [12] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
 [13] P. W. Glynn and C.-H. Rhee. Exact estimation for Markov chain equilibrium expectations. J. Appl. Probab., 51A:377–389, 2014.
 [14] A. Kebaier. Statistical Romberg extrapolation: a new variance reduction method and applications to option pricing. Ann. Appl. Probab., 15(4):2681–2705, 2005.
 [15] B. Leimkuhler and X. Shang. Adaptive thermostats for noisy gradient systems. SIAM Journal on Scientific Computing, 38(2):A712–A736, 2016.
 [16] R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.
 [17] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.
 [18] L. Szpruch, S. Vollmer, K. C. Zygalakis, and M. B. Giles. Multi-level Monte Carlo methods for a class of ergodic stochastic differential equations. arXiv:1605.01384.
 [19] Y. W. Teh, A. H. Thiery, and S. J. Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(7):1–33, 2016.
 [20] Y. W. Teh, S. J. Vollmer, and K. C. Zygalakis. (Non-)asymptotic properties of stochastic gradient Langevin dynamics. ArXiv e-prints, 2015.
 [21] M. Welling and Y. W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the 28th ICML, 2011.