## {image-full top-left darkened /black-bg /no-status /first-slide title-slide fancy-slide bot}
- - - -
- 5 points {notes}
  - who
  - worked on unsup learning for 5-6 y
  - you'll get an overview of Graphical models vs NN, for unsupervised
  - you'll get a taste of interesting recent works
  - NIPS 2016 last week

## Disclaimer {infobox image-full top-right darkened /black-bg /no-status}
- Some notations are atypical. // due to the mix of domains
- I will, almost surely, skip sections.
- Don't hesitate to ask questions.

## TITLE {#plan plan overview /with-ujm}
- Unsupervised Representation Learning {intro}
- Notations and problem formulation {setup}
- Probabilistic generative (graphical) models {probmod}
- Auto-encoders {autoenc}
- Generative Adversarial Networks {gan}
  - adversarial examples and training
  - GANs
- Focus on … {focuses}
  - optimization {focus optim}
  - space and time convolutions {focus conv}
  - *depth {focus depth}*, *breadth/width {focus width}*
  - semantics {focus semantics}
  - sequential/temporal aspects {focus temporal}
  - recent GAN[s](#recentgans) {focus recentgans}
- Wrap up {conclusion}

# @copy:#plan: %+class:highlight: .intro

## Unsupervised (Representation) Learning
- No labels available
- Learning intermediate features or representations
- Task agnostic
- Related to (data) density estimation
- Related to compression

## Example: motif mining in videos / temporal data
@svg: motif-mining/motif-mining-task.svg 750 400
- @anim: #layer1 + -#init | #layer2 | #layer3 | #layer7 | #layer4 | #layer6 | #layer5
- Key points: structure? compression? density estimation? {slide} // st {hard, related, that I know} well-enough
- {notes}
  - n. relation to compression, ...
  - n. notion of the need to have some structure/assumptions/priors
  - n. structured data probability density estimation
  - n. something that is {hard, related, that I know} well-enough

# @copy:#plan: %+class:highlight: .setup

# Notations and problem formulation {#setup}

## Notations and Problem Formulation
- Notations
  - $x$ : data (observations)
  - $y$ : value to predict (for supervised cases)
  - $z$ : unknown, unobserved latent information
  - $\theta$ (or $W$) : model parameters // will come back on the differences z vs θ
- Unsupervised learning
  - only $x$ is given
  - need to find the parameters ($\theta$, $W$)
  - may want to further infer the latent variables ($z$)

# @copy:#plan: %+class:highlight: .probmod

# Probabilistic (graphical) models {#probmod}

## Generative Model, Parameters, Latent Vars…
- Observations / Data
- Suppose we have a mixture of 3 Gaussians {slide}
- Challenge {slide}
  - Gaussians have unknown *parameters*
  - which point belongs to which component is *not observable*
- @anim: .first

## Probabilistic Modeling: principle {libyli}
- Adopting a generative approach
  - think about how the world generated the data
  - describe it in a “generative model”
- Formalize your assumptions about the observations (data)
  - choose/design a model
  - a model formulates how *some unknown variables* are “responsible” for the *observations* (data)
  - set some priors on the unknown variables
- Naming convention: different types of unknowns
  - parameters: unknown global parameters of the model
  - latent variables: unknown observation-specific variables // usually unknown
- With a mixture of Gaussians
  - parameters: means and covariances (and weights) of all Gaussians
  - latent variables: which Gaussian each data point comes from
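To make the parameters / latent variables distinction concrete, here is a minimal, purely illustrative numpy sketch (not from the slides) of the forward process of a 3-Gaussian mixture; all names and values are placeholders.

```python
# Minimal sketch: the forward/generative process of a mixture of 3 Gaussians.
# theta = global parameters (weights, means, covs); z = per-point latent variable.
import numpy as np

rng = np.random.default_rng(0)

# theta: the unknown global parameters of the model
weights = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [4.0, 4.0], [-3.0, 3.0]])
covs = np.array([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])

n = 500
# z: which component each observation comes from (latent, never observed)
z = rng.choice(3, size=n, p=weights)
# x: the observations we actually get to see
x = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])

# Learning means inverting this process: recover (weights, means, covs)
# and/or z from x alone, e.g. with EM.
```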
## Probabilistic Model Learning {libyli}
- The model is generative
  - describes how the data ($x$) gets generated
  - “forward model”
  - the probability of the observations: $p(x | \theta)$
- Finding the unknowns (parameters, latent var.) is challenging
  - reversing the generative process
  - finding (or maximizing) $p(\theta | x)$ or $p(\theta, z | x)$ or $p(z | x, \theta)$
  - high dimensional parameter/latent spaces
  - highly non-convex functions

## M1 − PCA: intuition
@svg: media/wikipedia-pca.svg 800 500
- @anim: #patch_3 | #patch_4

## M1: PCA
@svg: media/wikipedia-pca.svg 200 200 {model}
@svg: graphs/theta-x.svg 100 250 {model m11 clearright}
@svg: graphs/x-f-theta.svg 100 250 {model minv}
// @svg: media/factor.svg 40 300 {model m12}
- Principal Component Analysis (eigen-*)
  - dimensionality reduction
  - capture the maximum amount of data variance
- PCA probabilistic view {libyli}
  - observations come from a single low-dimensional Gaussian distribution
  - ... and are transformed with a linear transformation (rotation + scale),
  - ... and have added noise
- @anim: .m11
- Over-generic graphical representation {slide}
  - $\theta$ is a linear transformation
  - data points $x$ depend on $\theta$
  - no *explicit* latent variables *{ico-pencil}*
- @anim: .minv
- Inference problem: $f$ {slide}
  - dedicated algorithms (covariance matrix eigenvalues, iterative methods, …)
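A minimal numpy sketch of one such dedicated algorithm, PCA by eigendecomposition of the covariance matrix; the data and dimensions are stand-ins, not from the slides.

```python
# Minimal sketch: PCA via eigendecomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 10))           # stand-in data: 500 points in 10-D

x_centered = x - x.mean(axis=0)
cov = np.cov(x_centered, rowvar=False)   # 10x10 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix

# Keep the k directions of maximum variance (largest eigenvalues)
k = 2
theta = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # 10x2 linear map
z = x_centered @ theta                               # low-dimensional representation
x_reconstructed = z @ theta.T + x.mean(axis=0)       # approximate inverse
```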
## M2 − Topic Modeling: matrix factorization
- Probabilistic Latent Semantic Analysis (PLSA)
  - matrix decomposition {step2}
  - non-negative, normalized {step2}
  - probabilistic formulation {step2}
  - $p(w|d) = \sum_z p(w|z) \times p(z|d)$ {step2}
  - or $x^i = \theta^T \cdot z^i$ (for a document $i$) {step3}
- @anim: .svg1 | #documents | #topics
- @anim: .step2 | .step3

@svg: media/matrix-decomposition.svg 700 200 {svg1}

## M2: Topic Models {libyli}
// our notation in this pres is highly confusing w.r.t. the standards of this domain
@svg: graphs/theta-x.svg 50 200 {model m21}
@svg: graphs/thetaz-x.svg 130 200 {model m22 clearright}
@svg: graphs/x-f-thetaz.svg 130 200 {model m23}
- LDA, topic models […](file:///home/twilight/doc/PublicationsAndPresentations/2012-cpms/day-11/cpms-lecture-11-topic-models.html#slide-4)
  - Latent Dirichlet Allocation
  - mixture of discrete distributions (categorical/multinomial)
  - Bayesian formulation of // won't go into details about bayesian
  - LSA, LSI (Latent Semantic Indexing) // we don't distinguish pLSA/LDA here
  - Probabilistic formulation of
  - NMF (non-negative matrix factorization)
- @anim: .m21
- $x^i = \theta^T \cdot z^i$ (for a document $i$) *{ico-pencil}*
- @anim: .m22 | .m23
- Learning/Inference, $f$
  - Gibbs sampling
  - EM: expectation maximization
  - variational inference
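A small illustration of the factorization view: PLSA is closely related to NMF with a KL objective plus normalization. This sketch uses scikit-learn's `NMF` (my choice, not the slides'); the document-word counts are random stand-ins.

```python
# Minimal sketch: topic-like matrix factorization with NMF (scikit-learn).
# PLSA ~ NMF with a KL objective and normalization constraints; illustration only.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(100, 2000)).astype(float)  # stand-in doc-word counts

nmf = NMF(n_components=10, beta_loss='kullback-leibler', solver='mu', max_iter=300)
doc_topic = nmf.fit_transform(X)      # ~ p(z|d), up to normalization
topic_word = nmf.components_          # ~ p(w|z), up to normalization

# Normalize rows so they read as probability distributions
p_w_given_z = topic_word / topic_word.sum(axis=1, keepdims=True)
p_z_given_d = doc_topic / doc_topic.sum(axis=1, keepdims=True)
```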
# @copy:#plan: %+class:highlight: .autoenc

# Auto-encoders (with a classical NN before) {autoenc}

## Feed-forward Neural Networks (supervised) {libyli}
@svg: graphs/ffnet.svg 120 400 {model}
- Supervised learning (regression, classification, …)
  - the $x$ are given
  - the corresponding labels $y$ are given
- Building blocks of “neural nets”
  - a neuron computes a weighted sum of its inputs
  - the sum is followed by an “activation” $\sigma$
  - weights are learned ($W$)
  - $f^o(x^i, W) = \sigma\left( \sum_d W_{o,d} \times x^i_d \right) = \sigma\left( W_{o,.}^T \cdot x^i \right)$
- Define a network architecture (class of functions)
  - number and dimension of layers
  - activation functions (sigmoid, tanh, ReLU, …)
  - … actually any composition of differentiable functions
- Learning with stochastic gradient descent (SGD) and variants

## M3: Autoencoders {libyli}
@svg: graphs/autoenc.svg 120 500 {model m31}
- Idea: use a feed-forward approach
  - … for unsupervised learning (no labels)
  - to learn a compact data representation
- Principle *{ico-pencil}*
  - try to predict the input from the input
  - have a latent **bottleneck**: limited model capacity
  - **encoder** $f$: from the input $x$ to the latent $z$
  - **decoder** $g$: from the latent $z$ back to the input $x$
- @anim: .m31, .cup
- Learning principles of $f$ and $g$
  - mean square reconstruction error: $\left\| g(f(x)) - x \right\|^2$
  - SGD (like any neural net)
  - sparsifying regularization: sparse activations ($z$, $f(x)$)
  - add noise to the input (denoising autoencoders)

n. RBM ? {notes}
n. VAE ? {notes}
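A minimal PyTorch sketch of the encoder/bottleneck/decoder structure trained with the mean square reconstruction error; layer sizes and the random mini-batch are illustrative placeholders, not from the slides.

```python
# Minimal sketch: autoencoder = encoder f, bottleneck z, decoder g, MSE loss.
import torch
import torch.nn as nn

d_in, d_z = 784, 32                      # e.g. flattened 28x28 images, 32-D code

f = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, d_z))   # encoder
g = nn.Sequential(nn.Linear(d_z, 256), nn.ReLU(), nn.Linear(256, d_in))   # decoder

opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)
mse = nn.MSELoss()

x = torch.rand(64, d_in)                 # stand-in mini-batch (no labels needed)
for step in range(100):
    z = f(x)                             # latent representation
    x_hat = g(z)                         # reconstruction
    loss = mse(x_hat, x)                 # || g(f(x)) - x ||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
```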
# @copy:#plan: %+class:highlight: .gan

# Generative Adversarial Networks {#gan}

## Adversarial Examples (Goodfellow, 2014) {libyli} // ostrich
- @anim: %attr:.hasFS:height:200
- In high dimensional spaces
  - a huge part of the input space is never seen / irrelevant
  - models are easy to fool
  - models are wrongly calibrated (bad confidence estimation)
- Goal
  - build machine learning methods robust to adversarial examples
  - (relation to anomaly detection)
- Idea of adversarial training
  - generate adversarial examples automatically
  - also train using these examples

## GAN Intuition {infobox image-full top-left darkened /black-bg /no-status}
- Ongoing struggle between two players:
  - one that makes fake samples,
  - one that tries to detect them.

## M4: Generative Adversarial Networks {libyli}
@svg: graphs/ganright.svg 90 500 {model heightauto}
@svg: graphs/ganleft.svg 90 482 {model heightauto alignbottom}
- Principle: train two networks
  - $G$: to generate samples from noise
  - $D$: to discriminate between true and fake samples
  - NB: $G$ will try to fool $D$
- Elements *{ico-pencil}*
  - $x$: a training sample (real)
  - $z$: a random point in a latent space
  - $\tilde{x}{}$: a generated sample (fake)
  - $y$: a binary “fake” ($0$) or “real” ($1$) value
- GAN is a minimax game
  - $\min_G \max_D V(D, G)$
  - $V(D, G) = \; \mathbb{E}_{x} [\log( D(x) )] + \mathbb{E}_z [\log(1 - D(G(z)))]$

## GAN Target {libyli}
@svg: graphs/ganright.svg 90 500 {model heightauto}
- GAN optimization: $\min_G \max_D V(D, G)$
  - $V(D, G) = \; \mathbb{E}_{x} [\log( D(x) )] + \mathbb{E}_z [\log(1 - D(G(z)))]$
  - find a $G$ that minimizes the accuracy of the **best** $D$
- Equilibrium and best strategies
  - $D$ ideally computes $D(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{gen}(x)}$
  - thus $G$ should ideally fit $p_{data}(x)$
  - … $G$ samples from $p_{data}(x)$
- Optimization in practice
  - alternate optimization of $G$ and $D$
  - warning: $\min \max$ is not $\max \min$
  - saddle point finding (hot topic)
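A minimal PyTorch sketch of the alternating optimization of $V(D, G)$; the architectures, sizes and the random “real” mini-batch are placeholders for illustration only.

```python
# Minimal sketch: alternating GAN training on
# V(D,G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))].
import torch
import torch.nn as nn

d_x, d_z = 784, 64
G = nn.Sequential(nn.Linear(d_z, 256), nn.ReLU(), nn.Linear(256, d_x))
D = nn.Sequential(nn.Linear(d_x, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
eps = 1e-8

for step in range(1000):
    x = torch.rand(64, d_x)                      # stand-in real mini-batch
    z = torch.randn(64, d_z)
    x_fake = G(z)

    # Discriminator step: maximize V(D, G), i.e. minimize -V
    loss_d = -(torch.log(D(x) + eps).mean()
               + torch.log(1 - D(x_fake.detach()) + eps).mean())
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: minimize log(1 - D(G(z)))
    # (in practice the "non-saturating" -log D(G(z)) is often used instead)
    loss_g = torch.log(1 - D(G(z)) + eps).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```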
## Example of GAN-generated Digits
- DCGAN (Deep Convolutional GAN), Radford et al., 2015/2016

## Example of DCGAN-generated Faces
@svg: media/dcgan-faces.svg 800 500
@anim: div.hasSVG | %viewbox:#zzz | %viewbox:#zzz2

# @copy:#plan: %+class:highlight: .focuses, .focus.optim

## How is all This Optimized? {libyli}
- Deep models (composition of differentiable functions)
  - … using “back-propagation” (chain rule)
  - (S)GD / SGD with momentum
  - SGD with adaptation: RMSProp, ADAM, …
  - batch normalization trick
  - link: other tricks for [learning GANs](https://github.com/soumith/ganhacks)
  - are local minima any good?
  - link: [which optimizer?](http://sebastianruder.com/optimizing-gradient-descent/index.html#whichoptimizertouse)
- Probabilistic models *{ico-pencil}*
  - Gibbs sampling
  - Expectation Maximization
  - Variational Inference
  - Black-box variational inference (e.g., [Edward](https://github.com/blei-lab/edward))
- Probabilistic models, likelihood-free
  - empirical likelihood (Owen, 1988)
  - mean-shift estimation (Fukunaga, 1975)
  - method-of-moments (Hall, 2005)
  - Approximate Bayesian Computation, ABC (Marin et al., 2012)

## An overview of gradient descent optimization algorithms {no-print}
[which optimizer to use?](http://sebastianruder.com/optimizing-gradient-descent/index.html#whichoptimizertouse)

## An overview of gradient descent optimization algorithms {no-print}
[which optimizer to use?](http://sebastianruder.com/optimizing-gradient-descent/index.html#whichoptimizertouse)
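For reference, a plain-numpy sketch of two of the update rules listed above (SGD with momentum, and ADAM); `grad_fn` is a hypothetical stand-in for the gradient returned by back-propagation.

```python
# Minimal sketch of SGD-with-momentum and Adam parameter updates.
import numpy as np

def sgd_momentum(theta, grad_fn, lr=0.01, mu=0.9, steps=100):
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        v = mu * v - lr * g                # accumulate a velocity
        theta = theta + v
    return theta

def adam(theta, grad_fn, lr=0.001, b1=0.9, b2=0.999, eps=1e-8, steps=100):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g          # 1st moment estimate
        v = b2 * v + (1 - b2) * g ** 2     # 2nd moment estimate
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2*theta
theta = adam(np.array([3.0, -2.0]), lambda th: 2 * th)
```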
# @copy:#plan: %+class:highlight: .focuses, .focus.conv

## Convolution Models
- Extensions of topic models
  - replace topics with motifs (with temporal structure)
  - PLSM, HDLSM (Emonet et al., 2014)
- Convolutional Neural Networks
  - most of Christian's talk (ConvNets)
  - pixelRNN, …

# @copy:#plan: %+class:highlight: .focuses, .focus.depth

## Depth in Unsupervised Learning {libyli}
- Neural Network depth = Hierarchical probabilistic models
- Neural Networks
  - “deep learning”
  - adding layers
  - handling depth with ReLU
  - handling depth with “ResNets”, Residual Networks (Deep residual learning for image recognition, He et al., 2015)
- Hierarchical probabilistic models
  - Topic Models (LDA, Blei, Ng, Jordan, 2003)
  - Deep exponential families (Ranganath et al., 2015) *{ico-pencil}* // blei AISTATS
  - Deep Gaussian Processes (Damianou, Lawrence, 2013) // AISTATS

@svg: media/deepexpfam.svg 100 100 {model}

# @copy:#plan: %+class:highlight: .focuses, .focus.width

## Width in Unsupervised Learning {libyli}
- Width
  - Topic model: number of topics
  - Autoencoder: number of neurons in the hidden layer
  - GAN: size of $z$
- Non-parametric approaches, HDP, HDLSM (Emonet et al., 2014)
- Gaussian process as an infinitely wide NN layer (Damianou, Lawrence, 2013)
  - universal function approximator
- Autoencoders with group sparsity (Bascol et al., 2016)
  - allow for many hidden units
  - penalize the use of too many of them

# @copy:#plan: %+class:highlight: .focuses, .focus.semantics

## Semantics in Unsupervised Learning {libyli}
- Probabilistic models
  - inference is difficult
  - consider the “explaining away” principle
  - leads to better interpretability (meaningful $z$) // rain, sun allergy, umbrella
- Simpler feed-forward models
  - independent processing
  - inhibitory feedback is difficult
- Bascol et al., 2016
  - group-sparsity on filters
  - local activation inhibition
  - global activation entropy maximization
  - AdaReLU: activation function that zeroes low-energy points

# @copy:#plan: %+class:highlight: .focuses, .focus.temporal

## Sequential and Temporal Modeling
- cf. Christian Wolf's talk
- HMM, CRF =?= RNN
- LSTM =?= HSMM *{ico-hugepencil}*

# @copy:#plan: %+class:highlight: .focuses, .focus.recentgans

# Recent GAN Works {#recentgans}

## M5: BiGAN, ALI (2016)
@svg: graphs/biganright.svg 200 500 {model}
@svg: graphs/biganleft.svg 200 500 {model}
*{ico-hugepencil}*

## M6: InfoGAN (2016) {libyli}
@svg: graphs/infoganright.svg 250 480 {model}
// @svg: graphs/ganleft.svg 93 480 {model heightauto alignbottom}
- GAN noise ($z$)
  - is unstructured
  - can be partly ignored by $G$
- InfoGAN idea and principle
  - part of the noise is a code $c$
  - enforce high mutual information between $c$ and $\tilde{x}{}$
  - in practice, predict $c$ from $\tilde{x}{}$
  - use a coder $Q$
- @anim: .model
- Structure in the code *{ico-pencil}*
  - Cartesian product of anything
  - (categorical, continuous, ...)
- $\min_{G,Q} \max_D V_{InfoGAN}(D, G, Q) = V(D, G) - \lambda L_I(G, Q)$ {denser}
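A minimal PyTorch sketch of the “predict $c$ from $\tilde{x}$” idea: the cross-entropy of $Q$'s prediction serves as the (negated, up to a constant) variational lower bound $L_I(G, Q)$ on the mutual information. Networks and sizes are placeholders, not the paper's architecture.

```python
# Minimal sketch: the InfoGAN mutual-information term with a categorical code.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_x, d_noise, n_cat = 784, 62, 10            # categorical code with 10 values

G = nn.Sequential(nn.Linear(d_noise + n_cat, 256), nn.ReLU(), nn.Linear(256, d_x))
Q = nn.Sequential(nn.Linear(d_x, 256), nn.ReLU(), nn.Linear(256, n_cat))  # logits of c

z = torch.randn(64, d_noise)                 # unstructured noise
c = torch.randint(0, n_cat, (64,))           # structured code (here: categorical)
c_onehot = F.one_hot(c, n_cat).float()

x_fake = G(torch.cat([z, c_onehot], dim=1))  # G uses both noise and code
mi_loss = F.cross_entropy(Q(x_fake), c)      # minimizing this maximizes L_I(G, Q)
# Overall objective: min_{G,Q} max_D  V(D, G) - lambda * L_I(G, Q),
# so lambda * mi_loss is added to the usual generator loss.
```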
## InfoGAN: some results

## *GAN as a Modeling Tool {libyli}
- Conditional GANs and variants (2016)
  - the GAN process is conditioned on some data
  - e.g., image generation conditioned on a semantic mask
  - e.g., image conditioned on a text sentence
  - e.g., audio conditioned on a text sentence
  - e.g., image conditioned on class and keypoints
  - …
- Very complex (and operational) setups

## Ex: Learning What and Where to Draw

## Ex: Learning What and Where to Draw
- Scott Reed et al.
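A minimal sketch of the conditioning idea behind conditional GANs: both $G$ and $D$ receive the conditioning information (here a class label) alongside their usual input. Architectures and sizes are illustrative placeholders, not taken from the cited papers.

```python
# Minimal sketch: conditioning G and D on a label y.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_x, d_z, n_classes = 784, 64, 10

G = nn.Sequential(nn.Linear(d_z + n_classes, 256), nn.ReLU(), nn.Linear(256, d_x))
D = nn.Sequential(nn.Linear(d_x + n_classes, 256), nn.ReLU(),
                  nn.Linear(256, 1), nn.Sigmoid())

y = F.one_hot(torch.randint(0, n_classes, (64,)), n_classes).float()  # condition
z = torch.randn(64, d_z)

x_fake = G(torch.cat([z, y], dim=1))        # generate "an x of class y"
score = D(torch.cat([x_fake, y], dim=1))    # D judges the (sample, condition) pair
# Training then follows the usual alternating GAN optimization.
```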
# @copy:#plan: %+class:highlight: .conclusion

## Take-home Message {infobox takehome image-fit top-left darkened /black-bg /no-status /fancy-slide}
- Return of the generative approaches.
- Two ways of estimating densities:
  - generative models,
  - generative networks.
- The golden age of Variational Inference. // (and black-box VI)
- The golden age of SGD. // is the Gibbs sampling of continuous spaces
- Saddle points! // active and will progress a lot

## Thank You!

Questions? {infobox image-fit top-left darkened /black-bg /fancy-slide /no-status deck-status-fake-end nocurien}
- - - -
- twitter − [@remiemonet](https://twitter.com/remiemonet)
- twitter − [@DataIntelGroup](https://twitter.com/DataIntelGroup)
# Attributions

## au_ears {image-full bottom-left darkened /black-bg /no-status}
## seefit {image-full bottom-left darkened /black-bg /no-status}
## govan riverside {image-fit bottom-left darkened /black-bg /no-status}
## GorissenM {image-full bottom-left darkened /black-bg /no-status}
## someToast {image-fit bottom-left darkened /black-bg /no-status}
## ShutterRunner {image-full bottom-left darkened /black-bg /no-status}
## END {no-print}
