
Meta-learning for stochastic gradient MCMC

Transcript
Meta-learning for stochastic gradient MCMC
Yingzhen Li, University of Cambridge & Microsoft Research Cambridge
Joint work with Wenbo Gong & José Miguel Hernández-Lobato (University of Cambridge). Paper available on arXiv.

Bayesian neural networks 101

Goal: classify different types of cats from images, with $x$ the input image and $y$ the output label ("cat").
- Build a neural network (with parameters $\theta$): $\hat{y} = \mathrm{NN}_\theta(x)$.
- Find the best parameters given a dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$ by maximum a posteriori (MAP) estimation:
$$\theta^\star = \arg\max_\theta \sum_{n=1}^N \log p(y_n \mid x_n, \theta) + \log p(\theta).$$

[Figure: cat photos labelled "shorthair", "wirehair" and "Persian", plus one marked "???".]

Do you know what you don't know? How confident are you?

Bayesian inference: given some function $F(\theta)$, we want $\mathbb{E}_{p(\theta \mid \mathcal{D})}[F(\theta)]$, for example:
- the predictive mean $\hat{y}_{\text{mean}} = \mathbb{E}_{p(\theta \mid \mathcal{D})}[\mathrm{NN}_\theta(x)]$;
- the predictive distribution $p(y \mid x, \mathcal{D}) = \mathbb{E}_{p(\theta \mid \mathcal{D})}[p(y \mid x, \theta)]$;
- posterior evaluation $p(\theta \in A \mid \mathcal{D}) = \mathbb{E}_{p(\theta \mid \mathcal{D})}[\delta_A(\theta)]$.

Monte Carlo estimation uses $\mathbb{E}_{p(\theta \mid \mathcal{D})}[F(\theta)] \approx \frac{1}{K} \sum_{k=1}^K F(\theta_k)$ with $\theta_k \sim p(\theta \mid \mathcal{D})$, but exact posterior sampling is intractable. Stochastic gradient MCMC (SG-MCMC) provides efficient ways to (approximately) draw samples from $p(\theta \mid \mathcal{D})$.

From SGD to SG-MCMC

The MAP problem can be rewritten as
$$\theta^\star = \arg\min_\theta U(\theta), \qquad p(\theta \mid \mathcal{D}) \propto \exp[-U(\theta)], \qquad U(\theta) = -\sum_{n=1}^N \log p(y_n \mid x_n, \theta) - \log p(\theta).$$
In the MAP problem we find $\theta$ by gradient descent,
$$\theta_{t+1} = \theta_t - \eta \nabla_{\theta_t} U(\theta_t), \qquad -\nabla_\theta U(\theta) = \underbrace{\sum_{n=1}^N \nabla_\theta \log p(y_n \mid x_n, \theta)}_{\text{full gradient of the log-likelihood}} + \nabla_\theta \log p(\theta),$$
or, for big data, by stochastic gradient descent on a minibatch $\{(x_m, y_m)\}_{m=1}^M \subset \mathcal{D}$:
$$\theta_{t+1} = \theta_t - \eta \nabla_{\theta_t} \tilde{U}(\theta_t), \qquad -\nabla_\theta \tilde{U}(\theta) = \underbrace{\frac{N}{M} \sum_{m=1}^M \nabla_\theta \log p(y_m \mid x_m, \theta)}_{\text{stochastic gradient of the log-likelihood}} + \nabla_\theta \log p(\theta).$$

In the Bayesian inference problem we instead need to (approximately) draw $\theta \sim p(\theta \mid \mathcal{D})$. For big data: stochastic gradient Langevin dynamics (SGLD),
$$\theta_{t+1} = \theta_t - \eta \nabla_{\theta_t} \tilde{U}(\theta_t) + \sqrt{2\eta}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
SGLD = SGD + properly scaled Gaussian noise (a minimal code sketch appears at the end of this part). Other optimisation algorithms can be transformed into SG-MCMC samplers in the same way: SGD + momentum becomes SGHMC; RMSprop becomes preconditioned SGLD; Adam becomes Santa. [Welling and Teh (2011), Chen et al. (2014), Li et al. (2016), Chen et al. (2016)]

I'm bored of tuning my optimiser & sampler

Which SG-MCMC algorithm should I use? How do I tune the hyper-parameters? Learn it from data! We want a general solution for similar tasks: train on low-dimensional problems, generalise to high-dimensional ones. [Salimans et al. (2015), Song et al. (2017), Levy et al. (2018)]

Learning to learn

Meta-learning for optimisers:
- Define an optimiser with parameters $\phi$: $z_{t+1} = z_t - f_\phi(z_t, H(\cdot))$.
- Run it on some training objective functions $H(z)$, which provide the learning signal for $\phi$.
- Once learned, apply the optimiser to test objective functions.
[Andrychowicz et al. (2016), Li and Malik (2017), Wichrowska et al. (2017), Li and Turner (2018)]
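To make the SGLD update concrete before moving on, here is a minimal sketch of one step in Python/NumPy. The function name and the `grad_log_joint` callback are illustrative rather than from the talk; `grad_log_joint` is assumed to return the minibatch estimate $-\nabla_\theta \tilde{U}(\theta)$, i.e. the stochastic gradient of the log-joint defined above.

```python
import numpy as np

def sgld_step(theta, grad_log_joint, eta, rng):
    """One SGLD step: theta - eta * grad(U~)(theta) + sqrt(2*eta) * noise.

    grad_log_joint(theta) is assumed to return the minibatch estimate
    (N/M) * sum_m grad log p(y_m | x_m, theta) + grad log p(theta),
    which equals -grad(U~)(theta).
    """
    noise = rng.standard_normal(theta.shape)
    return theta + eta * grad_log_joint(theta) + np.sqrt(2.0 * eta) * noise

# Hypothetical usage on a standard Gaussian target, where
# grad log p(theta) = -theta; keep iterates after burn-in as samples.
rng = np.random.default_rng(0)
theta = rng.standard_normal(10)
samples = []
for t in range(2000):
    theta = sgld_step(theta, lambda th: -th, eta=1e-2, rng=rng)
    if t >= 1000:
        samples.append(theta)
```

The same loop with the noise line deleted is exactly SGD, which is the point of the slide: the sampler is the optimiser plus properly scaled noise.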
Meta-learning for SG-MCMC: can we just do the same naively?
- Define a sampler with parameters $\phi$: $z_{t+1} = z_t - f_\phi(z_t, H(\cdot), \epsilon)$, $\epsilon \sim \mathcal{N}(0, I)$.
- Run it on some training distributions $\pi(z) \propto \exp[-H(z)]$, which provide the learning signal for $\phi$.
- Once learned, apply the sampler to test distributions.
Not quite yet! We need to make sure it is a valid sampler!

The complete framework: Ma et al. NIPS 2015

To sample from $\pi(z) \propto \exp[-H(z)]$, let the step size $\eta \to 0$ and use exact gradients:
$$dz = -\nabla_z H(z)\,dt + \sqrt{2}\,dW(t) \qquad \text{(Langevin dynamics)},$$
where $W(t)$ is a Wiener process (think of $dW(t)$ as Gaussian noise with variance $dt$). Langevin dynamics is a special case of the Itô diffusion
$$dz = \mu(z)\,dt + \sqrt{2 D(z)}\,dW(t). \tag{1}$$
To make sure $\pi(z) \propto \exp[-H(z)]$ is a stationary distribution, set
$$\mu(z) = -[D(z) + Q(z)]\,\nabla_z H(z) + \Gamma(z), \qquad \Gamma_i(z) = \sum_{j=1}^d \frac{\partial}{\partial z_j}\,[D_{ij}(z) + Q_{ij}(z)], \tag{2}$$
where
- $D(z)$ is the diffusion matrix (positive semi-definite),
- $Q(z)$ is the curl matrix (skew-symmetric), and
- $\Gamma(z)$ is the correction vector.
Completeness result of Ma et al. (2015): under some mild conditions, any Itô diffusion that has $\pi(z)$ as its unique stationary distribution is governed by (1)+(2).

[Cartoon: nested families of dynamics, Langevin inside the Ma et al. family inside arbitrary SDEs, in increasing order of flexibility; speech bubbles ask "Any better solutions?", "Is it a valid sampler?", "I know how to pick the best one!"]

Searching for the best sampler within the complete framework is therefore
- guaranteed to be correct,
- retains the most flexibility, and
- only requires learning how to parameterise the $D(z)$ and $Q(z)$ matrices!

Our recipe: dynamics design

Goal: train an SG-MCMC sampler to sample from $p(\theta \mid \mathcal{D}) \propto \exp[-U(\theta)]$. We augment the state space with a momentum variable $p$:
$$z = (\theta, p), \qquad \pi(z) \propto \exp[-H(z)], \qquad H(z) = U(\theta) + \tfrac{1}{2} p^\top p.$$
Recall the complete recipe:
$$dz = -[D(z) + Q(z)]\,\nabla_z H(z)\,dt + \Gamma(z)\,dt + \sqrt{2 D(z)}\,dW(t).$$
Our parameterisation:
$$Q(z) = \begin{bmatrix} 0 & -Q_f(z) \\ Q_f(z) & 0 \end{bmatrix}, \qquad D(z) = \begin{bmatrix} 0 & 0 \\ 0 & D_f(z) \end{bmatrix}, \qquad \Gamma(z) = \begin{bmatrix} \Gamma_\theta(z) \\ \Gamma_p(z) \end{bmatrix},$$
$$Q_f(z) = \mathrm{diag}[f_{\phi_Q}(z)], \qquad D_f(z) = \mathrm{diag}[\alpha f_{\phi_Q}(z) \odot f_{\phi_Q}(z) + f_{\phi_D}(z) + c], \qquad \alpha, c \geq 0.$$
Rearranging terms, discretising, and plugging in stochastic gradients gives the update rules
$$\theta_{t+1} = \theta_t + \underbrace{\eta\, Q_f(z_t)\, p_t}_{\text{momentum SGD}} + \underbrace{\eta\, \Gamma_\theta(z_t)}_{\text{correction}},$$
$$p_{t+1} = p_t - \underbrace{\eta\, D_f(z_t)\, p_t}_{\text{friction}} - \eta\, Q_f(z_t)\, \nabla_{\theta_t} \tilde{U}(\theta_t) + \eta\, \Gamma_p(z_t) + \Sigma(z_t)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$
$$\Sigma(z_t) = \sqrt{2\eta D_f(z_t) - \eta^2 Q_f(z_t) B(\theta_t) Q_f(z_t)}, \qquad B(\theta_t) = \mathbb{V}[\nabla_{\theta_t} \tilde{U}(\theta_t)].$$
(These updates are spelled out in a code sketch at the end of this part.)

Designing $f_{\phi_Q}(z)$ (responsible for the drift): the $i$-th element is defined as
$$f_{\phi_Q, i}(z) = \beta + f_{\phi_Q}(\tilde{U}(\theta), p_i).$$
We want $f_{\phi_Q}(z)$ to depend on the energy landscape: fast traversal through low-density regions, better exploration in high-density regions. But we don't want $\Gamma_\theta(z)$ to be too expensive: using $\nabla_\theta U(\theta)$ as an input here would introduce an extra second-order term $\nabla^2_\theta U(\theta)$ into $\Gamma_\theta(z)$.

Designing $f_{\phi_D}(z)$ (responsible for friction): the $i$-th element is defined as
$$f_{\phi_D, i}(z) = f_{\phi_D}(\tilde{U}(\theta), p_i, \nabla_{\theta_i} \tilde{U}(\theta)).$$
$\Gamma_p(z)$ only requires computing $\nabla_p D_f(z)$, so here we can use the gradient information $\nabla_\theta U(\theta)$, for example to prevent overshooting by comparing $p$ against $\nabla_\theta U(\theta)$.

Our recipe: loss function design

Use the KL divergence $\mathrm{KL}[q(\theta)\,\|\,p(\theta \mid \mathcal{D})]$ to define the loss, with $q(\theta)$ defined implicitly: run parallel chains for several steps, then
- cross-chain loss: at time $t$, collect samples across chains;
- in-chain loss: for each chain, collect samples by thinning.
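To pin down the discretised update above, here is a minimal sketch in the same NumPy style. It assumes the diagonals of $Q_f$, $D_f$ and the correction terms $\Gamma_\theta$, $\Gamma_p$ have already been evaluated (in the method they come from the networks $f_{\phi_Q}$, $f_{\phi_D}$ and their derivatives); all names are mine. It also drops the $\eta^2 Q_f B(\theta) Q_f$ term from the noise scale, i.e. it treats the minibatch-gradient noise as negligible, since $B(\theta)$ is generally unknown.

```python
import numpy as np

def learned_sampler_step(theta, p, grad_U_tilde, q_f, d_f,
                         gamma_theta, gamma_p, eta, rng):
    """One discretised update of the learned sampler (equations above).

    q_f, d_f:             diagonals of Q_f(z_t) and D_f(z_t) (arrays shaped
                          like theta), produced by the two neural networks
    gamma_theta, gamma_p: correction terms Gamma(z_t)
    Noise scale uses 2*eta*d_f only, assuming B(theta) is negligible.
    """
    theta_new = theta + eta * q_f * p + eta * gamma_theta
    noise = np.sqrt(2.0 * eta * d_f) * rng.standard_normal(p.shape)
    p_new = (p
             - eta * d_f * p                     # friction
             - eta * q_f * grad_U_tilde(theta)   # preconditioned stoch. gradient
             + eta * gamma_p
             + noise)
    return theta_new, p_new
```

As a sanity check, setting q_f to ones, d_f to a constant and both correction terms to zero recovers a plain SGHMC update (Chen et al., 2014).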
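Running such updates over $C$ parallel chains for $T$ steps yields a trajectory, and the two losses just described correspond to two ways of slicing it into sample sets for $q(\theta)$. A sketch, assuming the trajectory is stored as an array of shape (T, C, d); the layout and function names are my own:

```python
import numpy as np

def cross_chain_samples(traj, t):
    """Cross-chain q(theta): pool the time-t sample from every chain."""
    return traj[t]             # shape (C, d)

def in_chain_samples(traj, c, thin):
    """In-chain q(theta): thin one chain's trajectory over time."""
    return traj[::thin, c]     # shape (about T/thin, d)

# Hypothetical usage for T=100 steps, C=8 chains, d=20 parameters:
traj = np.zeros((100, 8, 20))
q_cross = cross_chain_samples(traj, t=99)
q_in = in_chain_samples(traj, c=0, thin=10)
```

Estimating $\nabla_\phi \mathrm{KL}[q \,\|\, p]$ through these sample sets still requires a gradient estimator for the implicit $q$; the gradient estimators of Li and Turner (2018), cited below, are one such tool.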
A toy example

The meta sampler is trained on factorised Gaussians and tested on correlated Gaussians, with Gaussian noise manually injected into the gradients (and we assume the noise variance $B(\theta)$ is unknown).

Bayesian NN on MNIST

Goal: sample from the BNN posterior.
Training: the meta sampler is trained to sample from the posterior of a BNN (1 hidden layer, 20 hidden units, ReLU).
Three generalisation tests:
- to a bigger network architecture: a 2-hidden-layer MLP (40 units, ReLU);
- to a different activation function: a 1-hidden-layer MLP (20 units, sigmoid);
- to a different dataset: train on MNIST digits 0-4, test on digits 5-9.
We also consider generalisation over a long time horizon.

Bayesian NN on MNIST: speed improvements

[Figure: generalisation error and negative log-likelihood against epochs/iterations for the network-architecture and sigmoid-activation generalisation tests, comparing Adam, SGD with momentum (SGD-M), SGHMC, SGLD and the learned sampler (NNSGHMC).]

Bayesian NN on MNIST: long-time generalisation

[Figure: long-time-horizon generalisation results.]

Bayesian NN on MNIST: understanding the learned sampler

Recall $Q_f(z) = \mathrm{diag}[f_{\phi_Q}(z)]$ and $D_f(z) = \mathrm{diag}[\alpha f_{\phi_Q}(z) \odot f_{\phi_Q}(z) + f_{\phi_D}(z) + c]$.
- $f_{\phi_Q}$ (left panel): nearly linear with respect to the energy (fast traversal, better exploration);
- $f_{\phi_D}$ (middle panel): decreases friction around high-energy regions;
- $f_{\phi_D}$ (right panel): increases friction when gradient and momentum disagree (prevents overshoot).

Summary

MCMC and meta-learning can be friends:
- MCMC can be improved using meta-learning;
- meta-learning works better when searching within a theoretically sound framework.
Future work: adding tempering and adaptive learning rates, meta-learning the Hamiltonian, and improving samplers for discrete variables.

Thank you!

References

Welling and Teh (2011). Bayesian learning via stochastic gradient Langevin dynamics. ICML 2011.
Chen et al. (2014). Stochastic gradient Hamiltonian Monte Carlo. ICML 2014.
Li et al. (2016). Learning weight uncertainty with stochastic gradient MCMC for shape classification. CVPR 2016.
Chen et al. (2016). Bridging the gap between stochastic gradient MCMC and stochastic optimization. AISTATS 2016.
Salimans et al. (2015). Markov chain Monte Carlo and variational inference: Bridging the gap. ICML 2015.
Song et al. (2017). A-NICE-MC: Adversarial training for MCMC. NIPS 2017.
Levy et al. (2018). Generalizing Hamiltonian Monte Carlo with neural networks. ICLR 2018.
Andrychowicz et al. (2016). Learning to learn by gradient descent by gradient descent. NIPS 2016.
Li and Malik (2017). Learning to optimize. ICLR 2017.
Wichrowska et al. (2017). Learned optimizers that scale and generalize. ICML 2017.
Li and Turner (2018). Gradient estimators for implicit models. ICLR 2018.
Ma et al. (2015). A complete recipe for stochastic gradient MCMC. NIPS 2015.