Please wait while our marmots are preparing the hot chocolate…
##  {image-full top-left darkened /black-bg /no-status /first-slide title-slide fancy-slide}
- 
- 
- 
## TITLE {#plan plan overview /with-ujm}
- Least Squares Linear Regression:
 a Probabilistic View
- Regularized Least Squares Regression:
 a Bayesian View
- Mixture Models and Alternatives to Model Selection
- Learning Sparse Matrices // http://www.jmlr.org/papers/volume12/griffiths11a/griffiths11a.pdf
- Conclusion
# Linear Regression: 
 a Probabilistic View {linearregtitle}
## Linear Regression: a Probabilistic Model
@svg: bayesian-least-square/linear-gaussian-noise.svg 300 300 {floatright}
- We observe $S = \left\\{ (\mathbf{x}_i, y_i )\right\\} \_{i=1}^n $ {go linearregtitle}
- We model $y_i =\; \mathbf{w}^T\mathbf{x}_i + \epsilon_i$, 
with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ {go linearregtitle next1}
  - equivalent to: $y_i \sim \mathcal{N}(\mathbf{w}^T\mathbf{x}_i, \sigma^2)$
  - NB: as the noise terms are independent,
 the $y_i$ are independent given $\mathbf{w}{}$
- The parameters of the model are $\mathbf{w}{}$ {go linearregtitle}
- @anim: %-class:linearregtitle:.go
- @anim: .floatright |#linear |#gauss1 |#gauss2 |#gradiented
- @anim: .next1 li
- The likelihood is $L(\mathbf{w}, S) = p(S | \mathbf{w})$ {slide libyli}
  - from independence (the inputs $\mathbf{x}_i$ are treated as fixed): $L(\mathbf{w}, S) = \prod\_{i=1}^n p(y_i | \mathbf{x}_i, \mathbf{w})$
  - from the Normal distribution: $L(\mathbf{w}, S) = \prod\_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} e^\frac{- (y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2 \sigma^2} {}$
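A minimal numpy sketch of this likelihood, on made-up data (the sizes, the true $\mathbf{w}{}$ and $\sigma$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y_i = w^T x_i + eps_i, with eps_i ~ N(0, sigma^2)
n, d = 50, 3
w_true, sigma = np.array([1.0, -2.0, 0.5]), 0.3
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=sigma, size=n)

def log_likelihood(w, X, y, sigma):
    """log L(w, S) = sum_i [ -log(sigma*sqrt(2*pi)) - (y_i - w^T x_i)^2 / (2 sigma^2) ]."""
    r = y - X @ w
    return np.sum(-np.log(sigma * np.sqrt(2 * np.pi)) - r**2 / (2 * sigma**2))

print(log_likelihood(w_true, X, y, sigma))       # large: the data was generated with w_true
print(log_likelihood(np.zeros(3), X, y, sigma))  # much smaller
```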
## Linear Regression: maximum likelihood
- Reminder
  - {no}
      - likelihood: $L(\mathbf{w}, S) = \prod\_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} \exp\left(\frac{- (y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2 \sigma^2}\right) {}$
- We want to maximize the likelihood over $\mathbf{w}{}$, {libyli}
  - we will instead consider the log-likelihood {libyli}
      - $\log L(\mathbf{w}, S) = \sum\_{i=1}^n \left( -\log(\sigma \sqrt{2\pi}) + \frac{- (y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2 \sigma^2} \right)$
      - $\log L(\mathbf{w}, S) = - n \log(\sigma \sqrt{2\pi}) + \frac{1}{2 \sigma^2} \sum\_{i=1}^n - (y_i - \mathbf{w}^T\mathbf{x}_i)^2 $
  - we have {libyli}
      - $\arg\max_w \; L(\mathbf{w}, S) = \arg\max_w \; \log L(\mathbf{w}, S)$
      - $\arg\max_w \; L(\mathbf{w}, S) = \arg\max_w \; \frac{1}{2 \sigma^2} \sum\_{i=1}^n - (y_i - \mathbf{w}^T\mathbf{x}_i)^2 $
      - $\arg\max_w \; L(\mathbf{w}, S) = \arg\min_w \; \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2 $
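As a sanity check of this derivation, a small sketch on toy data (illustrative values, scipy optimizer): the maximizer of $\log L$ and the minimizer of the sum of squares coincide.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=n)

def neg_log_likelihood(w):
    # -log L(w, S) = n log(sigma sqrt(2 pi)) + (1 / (2 sigma^2)) sum_i (y_i - w^T x_i)^2
    return n * np.log(sigma * np.sqrt(2 * np.pi)) + np.sum((y - X @ w)**2) / (2 * sigma**2)

def sum_of_squares(w):
    return np.sum((y - X @ w)**2)

w_mle = minimize(neg_log_likelihood, np.zeros(d)).x  # arg max_w log L(w, S)
w_ls  = minimize(sum_of_squares,     np.zeros(d)).x  # arg min_w sum_i (y_i - w^T x_i)^2

print(np.allclose(w_mle, w_ls, atol=1e-4))  # True: the two optima coincide
```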
## Linear Regression: summary {libyli}
- Supposing Gaussian noise around the linear predictor $\mathbf{w}{}$
- These are equivalent points of view for finding $\mathbf{w}{}$
  - maximizing the likelihood $L(\mathbf{w}, S) = \prod\_{i=1}^n p(y_i | \mathbf{x}_i, \mathbf{w})$
  - minimizing $\frac12 \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2 $
  - solving the least squares linear regression problem
- NB: the noise variance $\sigma^2$ does not appear, cool! {coocoocool}
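For reference, this minimizer also has a closed form via the normal equations; a minimal sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

# arg min_w sum_i (y_i - w^T x_i)^2 solves the normal equations X^T X w = X^T y
w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal_eq, w_lstsq))  # True
# As noted above, the noise variance sigma^2 plays no role in the solution.
```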
# Regularized Least Squares:{linearregtitle}
 a Bayesian{bay} View{linearregtitle}
## Should we really maximize the likelihood? {libyli}
- We have observations ($S$) and 
 we want to find the best parameters $\mathbf{w}{}$
- We actually want to find{libyli}
  - the parameters that are the most supported by the observations
  - i.e., the parameters that are most likely, knowing the observations
  - i.e., $\arg\max_w \; p(\mathbf{w} | S)$     (reminder: the likelihood is $L(\mathbf{w}, S) = p(S | \mathbf{w})$) {captain}
- How is it related? {libyli}
  - Bayes:   $ p(A|B) p(B) = p(A \wedge B) = p(B \wedge A) = p(B|A) p(A) {}$ {captain}
  - Bayes, v2:   $ p(A|B) = \frac{p(B|A) p(A)}{p(B)} {}$
  - so:   $p(\mathbf{w} | S) = \frac{p(S | \mathbf{w}) p(\mathbf{w})}{p(S)} {}$
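For intuition, a small sketch computing $p(\mathbf{w} | S)$ on a grid for a 1-D toy version of the regression model (the Gaussian prior and all numbers are illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, sigma_w = 0.3, 1.0                        # noise std, prior std (illustrative)
x = rng.normal(size=20)
y = 1.5 * x + rng.normal(scale=sigma, size=20)   # data generated with w = 1.5

w_grid = np.linspace(-3, 3, 601)
log_lik   = np.array([norm.logpdf(y, loc=w * x, scale=sigma).sum() for w in w_grid])  # log p(S|w)
log_prior = norm.logpdf(w_grid, loc=0.0, scale=sigma_w)                               # log p(w)

# p(w|S) = p(S|w) p(w) / p(S): p(S) is just the constant that normalizes the product
unnorm = np.exp(log_lik + log_prior - (log_lik + log_prior).max())
posterior = unnorm / (unnorm.sum() * (w_grid[1] - w_grid[0]))

print(w_grid[np.argmax(posterior)])  # close to 1.5, the value used to generate the data
```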
## Bayesian Posterior Optimization {libyli}
- We have observations $S$, we want the best parameters $\mathbf{w}{}$ {captain}
- We want to {libyli}
  - maximize: $p(\mathbf{w} | S) = \frac{p(S | \mathbf{w}) p(\mathbf{w})}{p(S)} {}$
  - i.e., maximize: $p(S | \mathbf{w}) p(\mathbf{w})$    (as $p(S)$ does not depend on $\mathbf{w}{}$) {captain}
  - i.e., minimize: $- \log(p(S | \mathbf{w}) p(\mathbf{w}))$    (as $\log$ is increasing) {captain}
  - i.e., minimize: $-\log(p(S | \mathbf{w})) -\log(p(\mathbf{w}))$
  - i.e., minimize: $-\log L(\mathbf{w}, S) -\log(p(\mathbf{w}))$
- One (possible) interpretation {slide}
  - we minimize the negative log-likelihood (as in MLE), 
 but **penalized** by the negative log-prior
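A small sketch of this penalized reading on toy data, with an (illustrative) isotropic Gaussian prior: the MAP estimate minimizes the negative log-likelihood plus the negative log-prior.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, sigma_w = 0.3, 0.5                      # noise std, prior std (illustrative)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=50)

def neg_log_likelihood(w):                     # -log L(w, S) = -log p(S | w)
    return -norm.logpdf(y, loc=X @ w, scale=sigma).sum()

def neg_log_prior(w):                          # -log p(w), isotropic Gaussian prior
    return -norm.logpdf(w, loc=0.0, scale=sigma_w).sum()

w_map = minimize(lambda w: neg_log_likelihood(w) + neg_log_prior(w), np.zeros(3)).x
w_mle = minimize(neg_log_likelihood, np.zeros(3)).x

print(np.linalg.norm(w_map) < np.linalg.norm(w_mle))  # True: the prior shrinks w toward 0
```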
# And Ockham? 
 (aside on the board) // based on the broadness of the prior etc
## From Prior to Regularization
- Bayesian opt., minimize:   $-\log L(\mathbf{w}, S) -\log(p(\mathbf{w}))$ {captain}
- Back to Gaussian noise of variance $\sigma^2$ around the linear predictor {libyli}
  - minimize:   $\frac{1}{2 \sigma^2} \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2  - \log(p(\mathbf{w}))$
  - i.e., minimize:   $\frac{1}{2} \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2  - \sigma^2\cdot \log(p(\mathbf{w}))$
- We can identify {slide}
  - the regularization term of regularized least squares
  - and, $- \sigma^2\cdot \log(p(\mathbf{w}))$   (the obs. noise variance times the negative log-prior) {captain}
  - NB: $-\log p(\mathbf{w})$ grows as $\mathbf{w}{}$ gets less probable under the prior, so it really acts as a penalty {captain}
## Priors and (some) $L_p$ Norms {dense libyli}
@svg: bayesian-least-square/norm-lap-1d.svg 100px 100px {floatright clearboth}
@svg: bayesian-least-square/norm-lap-2d.svg 100px 100px {floatright clearboth}
@svg: bayesian-least-square/generalized-normal.svg 100px 100px {floatright clearboth}
- Regularization is identified to $- \sigma^2\cdot \log(p(\mathbf{w}))$ {captain} // NB the isotropic normal is easier seen decomposed (TODO)
- Isotropic Normal prior, i.e.:   $ \mathbf{w} \sim \mathcal{N}\left(0, \sigma_w^2 \mathbf{I} \right)$
  - $\log(p(\mathbf{w})) = cst + \log \exp\left(- \frac{\mathbf{w}^T\mathbf{w}}{2 \sigma_w^2}\right) $
  - i.e., $- \log(p(\mathbf{w})) = \frac{\mathbf{w}^T\mathbf{w}}{2 \sigma_w^2} - cst  =  \frac{\left\| \mathbf{w} \right\|_2^2}{2 \sigma_w^2} - cst$
  - Regularizer: $\frac{\sigma^2}{2 \sigma_w^2} \left\| \mathbf{w} \right\|_2^2 $
  - can use generalized gaussian to remove the square? {comment}
- Laplace prior, i.e.:   $ w_j \sim \mathrm{Laplace}\left(0, b_w\right)$
  - $\log(p(\mathbf{w})) = cst + \sum_j \log \exp\left(- \frac{\left|w_j\right|}{b_w}\right) $
  - Regularizer: $\frac{\sigma^2}{b_w} \left\| \mathbf{w} \right\|_1 $
- Generalized Normal distr. (v1):  $p(x | \mu, \alpha, \beta) = \frac{\beta}{2\alpha\Gamma(1/\beta)} \; \exp\left(-\frac{\left|x-\mu\right|^\beta}{\alpha^\beta}\right) {}$ 
  - Regularizer: $\frac{\sigma^2}{\alpha^\beta} \left\| \mathbf{w} \right\|_\beta^\beta $
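A hedged numerical check of these three prior-to-regularizer identifications, using scipy's parametrizations (norm, laplace, gennorm) and made-up values for $\sigma$, $\sigma_w$, $b_w$, $\alpha$, $\beta$:

```python
import numpy as np
from scipy.stats import norm, laplace, gennorm

sigma, sigma_w, b_w, alpha, beta = 0.3, 1.0, 0.5, 0.8, 1.5   # illustrative values
w = np.array([0.7, -1.2, 0.1])

def penalty(dist, **kw):
    """-sigma^2 log p(w) for a prior that factorizes over the coordinates of w,
    minus its value at w = 0 (the w-independent normalization constant)."""
    return -sigma**2 * (dist.logpdf(w, **kw).sum() - dist.logpdf(np.zeros_like(w), **kw).sum())

# Isotropic Normal prior   ->  sigma^2 / (2 sigma_w^2) * ||w||_2^2
assert np.isclose(penalty(norm, scale=sigma_w), sigma**2 / (2 * sigma_w**2) * np.sum(w**2))
# Laplace prior            ->  sigma^2 / b_w * ||w||_1
assert np.isclose(penalty(laplace, scale=b_w), sigma**2 / b_w * np.sum(np.abs(w)))
# Generalized Normal prior ->  sigma^2 / alpha^beta * ||w||_beta^beta
assert np.isclose(penalty(gennorm, beta=beta, scale=alpha), sigma**2 / alpha**beta * np.sum(np.abs(w)**beta))

print("each -sigma^2 log p(w) matches the stated regularizer (up to an additive constant)")
```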
# Mixture Models and Alternatives to Model Selection (skipped)
# Learning Sparse Matrices (skipped)
## Take-home Message {image-fit top-left darkened /black-bg /no-status /fancy-slide}
# That's It!
Questions? {deck-status-fake-end}
# Attributions {no-print}
## jared {image-full bottom-left darkened /black-bg /no-status no-print}
## someToast {image-fit bottom-left darkened /black-bg /no-status no-print}
## Wikipedia {image-fit bottom-left darkened /black-bg /no-status no-print}
      /  − will be replaced by the author − will be replaced by the title