Please wait while our marmots are preparing the hot chocolate…
##  {image-full top-left darkened /black-bg /no-status /first-slide title-slide fancy-slide}
- 
- 
- 
## TITLE {#plan plan overview /with-ujm}
- Least Squares Linear Regression:
 a Probabilistic View
- Regularized Least Squares Regression:
 a Bayesian View
- Mixture Models and Alternatives to Model Selection
- Learning Sparse Matrices // http://www.jmlr.org/papers/volume12/griffiths11a/griffiths11a.pdf
- Conclusion
# Linear Regression: 
 a Probabilistic View {linearregtitle}
## Linear Regression: a Probabilistic Model
@svg: bayesian-least-square/linear-gaussian-noise.svg 300 300 {floatright}
- We observe $S = \left\\{ (\mathbf{x}_i, y_i )\right\\} \_{i=1}^n $ {go linearregtitle}
- We model $y_i =\; \mathbf{w}^T\mathbf{x}_i + \epsilon_i$, 
with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ {go linearregtitle next1}
  - equivalent to: $y_i \sim \mathcal{N}(\mathbf{w}^T\mathbf{x}_i, \sigma^2)$
  - NB: as the noise terms are independent,
 the $y_i$ are independent given $\mathbf{w}{}$
- The parameters of the model are $\mathbf{w}{}$ {go linearregtitle}
- @anim: %-class:linearregtitle:.go
- @anim: .floatright |#linear |#gauss1 |#gauss2 |#gradiented
- @anim: .next1 li
- The likelihood is $L(\mathbf{w}, S) = p(S | \mathbf{w})$ {slide libyli}
  - from independence (the inputs $\mathbf{x}_i$ are treated as fixed): $L(\mathbf{w}, S) = \prod\_{i=1}^n p(y_i | \mathbf{x}_i, \mathbf{w})$
  - from the Normal distribution: $L(\mathbf{w}, S) = \prod\_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} e^\frac{- (y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2 \sigma^2} {}$
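A minimal numpy sketch of this likelihood, on made-up data (the sizes, the true $\mathbf{w}{}$ and $\sigma$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y_i = w^T x_i + eps_i, with eps_i ~ N(0, sigma^2)
n, d = 50, 3
w_true, sigma = np.array([1.0, -2.0, 0.5]), 0.3
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=sigma, size=n)

def log_likelihood(w, X, y, sigma):
    """log L(w, S) = sum_i [ -log(sigma*sqrt(2*pi)) - (y_i - w^T x_i)^2 / (2 sigma^2) ]."""
    r = y - X @ w
    return np.sum(-np.log(sigma * np.sqrt(2 * np.pi)) - r**2 / (2 * sigma**2))

print(log_likelihood(w_true, X, y, sigma))       # large: the data was generated with w_true
print(log_likelihood(np.zeros(3), X, y, sigma))  # much smaller
```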
## Linear Regression: maximum likelihood
- Reminder
  - {no}
      - likelihood: $L(\mathbf{w}, S) = \prod\_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} \exp\left(\frac{- (y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2 \sigma^2}\right) {}$
- We want to maximize the likelihood over $\mathbf{w}{}$, {libyli}
  - we will instead consider the log-likelihood {libyli}
      - $\log L(\mathbf{w}, S) = \sum\_{i=1}^n \left( -\log(\sigma \sqrt{2\pi}) + \frac{- (y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2 \sigma^2} \right)$
      - $\log L(\mathbf{w}, S) = - n \log(\sigma \sqrt{2\pi}) + \frac{1}{2 \sigma^2} \sum\_{i=1}^n - (y_i - \mathbf{w}^T\mathbf{x}_i)^2 $
  - we have {libyli}
      - $\arg\max_w \; L(\mathbf{w}, S) = \arg\max_w \; \log L(\mathbf{w}, S)$
      - $\arg\max_w \; L(\mathbf{w}, S) = \arg\max_w \; \frac{1}{2 \sigma^2} \sum\_{i=1}^n - (y_i - \mathbf{w}^T\mathbf{x}_i)^2 $
      - $\arg\max_w \; L(\mathbf{w}, S) = \arg\min_w \; \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2 $
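As a sanity check of this derivation, a small sketch on toy data (illustrative values, scipy optimizer): the maximizer of $\log L$ and the minimizer of the sum of squares coincide.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=n)

def neg_log_likelihood(w):
    # -log L(w, S) = n log(sigma sqrt(2 pi)) + (1 / (2 sigma^2)) sum_i (y_i - w^T x_i)^2
    return n * np.log(sigma * np.sqrt(2 * np.pi)) + np.sum((y - X @ w)**2) / (2 * sigma**2)

def sum_of_squares(w):
    return np.sum((y - X @ w)**2)

w_mle = minimize(neg_log_likelihood, np.zeros(d)).x  # arg max_w log L(w, S)
w_ls  = minimize(sum_of_squares,     np.zeros(d)).x  # arg min_w sum_i (y_i - w^T x_i)^2

print(np.allclose(w_mle, w_ls, atol=1e-4))  # True: the two optima coincide
```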
## Linear Regression: summary {libyli}
- Supposing Gaussian noise around the linear predictor $\mathbf{w}{}$
- These are equivalent points of view for finding $\mathbf{w}{}$
  - maximizing the likelihood $L(\mathbf{w}, S) = \prod\_{i=1}^n p(y_i | \mathbf{x}_i, \mathbf{w})$
  - minimizing $\frac12 \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2 $
  - solving the least squares linear regression problem
- NB: the noise variance $\sigma^2$ does not appear, cool! {coocoocool}
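For reference, this minimizer also has a closed form via the normal equations; a minimal sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

# arg min_w sum_i (y_i - w^T x_i)^2 solves the normal equations X^T X w = X^T y
w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal_eq, w_lstsq))  # True
# As noted above, the noise variance sigma^2 plays no role in the solution.
```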
# Regularized Least Squares:{linearregtitle}
 a Bayesian{bay} View{linearregtitle}
## Should we really maximize the likelihood? {libyli}
- We have observations ($S$) and 
 we want to find the best parameters $\mathbf{w}{}$
- We actually want to find{libyli}
  - the parameters that are the most supported by the observations
  - i.e., the parameters that are most likely, knowing the observations
  - i.e., $\arg\max_w \; p(\mathbf{w} | S)$     (reminder: the likelihood is $L(\mathbf{w}, S) = p(S | \mathbf{w})$) {captain}
- How is it related? {libyli}
  - Bayes:   $ p(A|B) p(B) = p(A \wedge B) = p(B \wedge A) = p(B|A) p(A) {}$ {captain}
  - Bayes, v2:   $ p(A|B) = \frac{p(B|A) p(A)}{p(B)} {}$
  - so:   $p(\mathbf{w} | S) = \frac{p(S | \mathbf{w}) p(\mathbf{w})}{p(S)} {}$
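For intuition, a small sketch computing $p(\mathbf{w} | S)$ on a grid for a 1-D toy version of the regression model (the Gaussian prior and all numbers are illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, sigma_w = 0.3, 1.0                        # noise std, prior std (illustrative)
x = rng.normal(size=20)
y = 1.5 * x + rng.normal(scale=sigma, size=20)   # data generated with w = 1.5

w_grid = np.linspace(-3, 3, 601)
log_lik   = np.array([norm.logpdf(y, loc=w * x, scale=sigma).sum() for w in w_grid])  # log p(S|w)
log_prior = norm.logpdf(w_grid, loc=0.0, scale=sigma_w)                               # log p(w)

# p(w|S) = p(S|w) p(w) / p(S): p(S) is just the constant that normalizes the product
unnorm = np.exp(log_lik + log_prior - (log_lik + log_prior).max())
posterior = unnorm / (unnorm.sum() * (w_grid[1] - w_grid[0]))

print(w_grid[np.argmax(posterior)])  # close to 1.5, the value used to generate the data
```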
## Bayesian Posterior Optimization {libyli}
- We have observations $S$, we want the best parameters $\mathbf{w}{}$ {captain}
- We want to {libyli}
  - maximize: $p(\mathbf{w} | S) = \frac{p(S | \mathbf{w}) p(\mathbf{w})}{p(S)} {}$
  - i.e., maximize: $p(S | \mathbf{w}) p(\mathbf{w})$    (as $p(S)$ does not depend on $\mathbf{w}{}$) {captain}
  - i.e., minimize: $- \log(p(S | \mathbf{w}) p(\mathbf{w}))$    (as $\log$ is increasing) {captain}
  - i.e., minimize: $-\log(p(S | \mathbf{w})) -\log(p(\mathbf{w}))$
  - i.e., minimize: $-\log L(\mathbf{w}, S) -\log(p(\mathbf{w}))$
- One (possible) interpretation {slide}
  - we minimize the negative log-likelihood (as in MLE), 
 but **penalized** by the negative log-prior
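A small sketch of this penalized reading on toy data, with an (illustrative) isotropic Gaussian prior: the MAP estimate minimizes the negative log-likelihood plus the negative log-prior.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, sigma_w = 0.3, 0.5                      # noise std, prior std (illustrative)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=50)

def neg_log_likelihood(w):                     # -log L(w, S) = -log p(S | w)
    return -norm.logpdf(y, loc=X @ w, scale=sigma).sum()

def neg_log_prior(w):                          # -log p(w), isotropic Gaussian prior
    return -norm.logpdf(w, loc=0.0, scale=sigma_w).sum()

w_map = minimize(lambda w: neg_log_likelihood(w) + neg_log_prior(w), np.zeros(3)).x
w_mle = minimize(neg_log_likelihood, np.zeros(3)).x

print(np.linalg.norm(w_map) < np.linalg.norm(w_mle))  # True: the prior shrinks w toward 0
```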
# And Ockham? 
 (aside on the board) // based on the broadness of the prior etc
## From Prior to Regularization
- Bayesian opt., minimize:   $-\log L(\mathbf{w}, S) -\log(p(\mathbf{w}))$ {captain}
- Back to Gaussian noise of variance $\sigma^2$ around the linear predictor {libyli}
  - minimize:   $\frac{1}{2 \sigma^2} \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2  - \log(p(\mathbf{w}))$
  - i.e., minimize:   $\frac{1}{2} \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2  - \sigma^2\cdot \log(p(\mathbf{w}))$
- We can identify {slide}
  - the regularization term of regularized least squares
  - and, $- \sigma^2\cdot \log(p(\mathbf{w}))$   (the obs. noise variance times the negative log-prior) {captain}
  - NB: $-\log p(\mathbf{w})$ grows as $\mathbf{w}{}$ gets less probable under the prior, so it really acts as a penalty {captain}
## Priors and (some) $L_p$ Norms {dense libyli}
@svg: bayesian-least-square/norm-lap-1d.svg 100px 100px {floatright clearboth}
@svg: bayesian-least-square/norm-lap-2d.svg 100px 100px {floatright clearboth}
@svg: bayesian-least-square/generalized-normal.svg 100px 100px {floatright clearboth}
- Regularization is identified to $- \sigma^2\cdot \log(p(\mathbf{w}))$ {captain} // NB the isotropic normal is easier seen decomposed (TODO)
- Isotropic Normal prior, i.e.:   $ \mathbf{w} \sim \mathcal{N}\left(0, \sigma_w^2 \mathbf{I} \right)$
  - $\log(p(\mathbf{w})) = cst + \log \exp\left(- \frac{\mathbf{w}^T\mathbf{w}}{2 \sigma_w^2}\right) $
  - i.e., $- \log(p(\mathbf{w})) = \frac{\mathbf{w}^T\mathbf{w}}{2 \sigma_w^2} - cst  =  \frac{\left\| \mathbf{w} \right\|_2^2}{2 \sigma_w^2} - cst$
  - Regularizer: $\frac{\sigma^2}{2 \sigma_w^2} \left\| \mathbf{w} \right\|_2^2 $
  - can use generalized gaussian to remove the square? {comment}
- Laplace prior, i.e.:   $ w_j \sim \mathrm{Laplace}\left(0, b_w\right)$
  - $\log(p(\mathbf{w})) = cst + \sum_j \log \exp\left(- \frac{\left|w_j\right|}{b_w}\right) $
  - Regularizer: $\frac{\sigma^2}{b_w} \left\| \mathbf{w} \right\|_1 $
- Generalized Normal distr. (v1):  $p(x | \mu, \alpha, \beta) = \frac{\beta}{2\alpha\Gamma(1/\beta)} \; \exp\left(-\frac{\left|x-\mu\right|^\beta}{\alpha^\beta}\right) {}$ 
  - Regularizer: $\frac{\sigma^2}{\alpha^\beta} \left\| \mathbf{w} \right\|_\beta^\beta $
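A hedged numerical check of these three prior-to-regularizer identifications, using scipy's parametrizations (norm, laplace, gennorm) and made-up values for $\sigma$, $\sigma_w$, $b_w$, $\alpha$, $\beta$:

```python
import numpy as np
from scipy.stats import norm, laplace, gennorm

sigma, sigma_w, b_w, alpha, beta = 0.3, 1.0, 0.5, 0.8, 1.5   # illustrative values
w = np.array([0.7, -1.2, 0.1])

def penalty(dist, **kw):
    """-sigma^2 log p(w) for a prior that factorizes over the coordinates of w,
    minus its value at w = 0 (the w-independent normalization constant)."""
    return -sigma**2 * (dist.logpdf(w, **kw).sum() - dist.logpdf(np.zeros_like(w), **kw).sum())

# Isotropic Normal prior   ->  sigma^2 / (2 sigma_w^2) * ||w||_2^2
assert np.isclose(penalty(norm, scale=sigma_w), sigma**2 / (2 * sigma_w**2) * np.sum(w**2))
# Laplace prior            ->  sigma^2 / b_w * ||w||_1
assert np.isclose(penalty(laplace, scale=b_w), sigma**2 / b_w * np.sum(np.abs(w)))
# Generalized Normal prior ->  sigma^2 / alpha^beta * ||w||_beta^beta
assert np.isclose(penalty(gennorm, beta=beta, scale=alpha), sigma**2 / alpha**beta * np.sum(np.abs(w)**beta))

print("each -sigma^2 log p(w) matches the stated regularizer (up to an additive constant)")
```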
# Mixture Models and Alternatives to Model Selection (skipped)
# Learning Sparse Matrices (skipped)
## Take-home Message {image-fit top-left darkened /black-bg /no-status /fancy-slide}
# That's It!
Questions? {deck-status-fake-end}
# Attributions {no-print}
## jared {image-full bottom-left darkened /black-bg /no-status no-print}
## someToast {image-fit bottom-left darkened /black-bg /no-status no-print}
## Wikipedia {image-fit bottom-left darkened /black-bg /no-status no-print}
      /  − will be replaced by the author − will be replaced by the title