Please wait while our marmots are preparing the hot chocolate…
## {image-full top-left darkened /black-bg /no-status /first-slide title-slide fancy-slide}
-
-
-
## TITLE {#plan plan overview /with-ujm}
- Least Squares Linear Regression:
a Probabilistic View
- Regularized Least Squares Regression:
a Bayesian View
- Mixture Models and Alternatives to Model Selection
- Learning Sparse Matrices // http://www.jmlr.org/papers/volume12/griffiths11a/griffiths11a.pdf
- Conclusion
# Linear Regression:
a Probabilistic View {linearregtitle}
## Linear Regression: a Probabilistic Model
@svg: bayesian-least-square/linear-gaussian-noise.svg 300 300 {floatright}
- We observe $S = \left\\{ (\mathbf{x}_i, y_i )\right\\} \_{i=1}^n $ {go linearregtitle}
- We model $y_i =\; \mathbf{w}^T\mathbf{x}_i + \epsilon_i$,
with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ {go linearregtitle next1}
- equivalent to: $y_i \sim \mathcal{N}(\mathbf{w}^T\mathbf{x}_i, \sigma^2)$
- NB: as the noise terms are independent,
the $y_i$ are independent, given $\mathbf{w}{}$
- The parameters of the model are $\mathbf{w}{}$ {go linearregtitle}
- @anim: %-class:linearregtitle:.go
- @anim: .floatright |#linear |#gauss1 |#gauss2 |#gradiented
- @anim: .next1 li
- The likelihood is $L(\mathbf{w}, S) = p(S | \mathbf{w})$ {slide libyli}
- from independence: $L(\mathbf{w}, S) = \prod\_{i=1}^n p(\mathbf{x}_i, y_i | \mathbf{w})$
- from the Normal distribution: $L(\mathbf{w}, S) = \prod\_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} \exp\left(\frac{- (y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2 \sigma^2}\right) {}$
## Linear Regression: maximum likelihood
- Reminder
- {no}
- likelihood: $L(\mathbf{w}, S) = \prod\_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} \exp\left(\frac{- (y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2 \sigma^2}\right) {}$
- We want to maximize the likelihood over $\mathbf{w}{}$ {libyli}
- we will consider the log-likelihood instead (same maximizer, as $\log$ is increasing) {libyli}
- $\log L(\mathbf{w}, S) = \sum\_{i=1}^n \left( -\log(\sigma \sqrt{2\pi}) + \frac{- (y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2 \sigma^2} \right)$
- $\log L(\mathbf{w}, S) = - n \log(\sigma \sqrt{2\pi}) - \frac{1}{2 \sigma^2} \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2 $
- we have {libyli}
- $\arg\max_w \; L(\mathbf{w}, S) = \arg\max_w \; \log L(\mathbf{w}, S)$
- $\arg\max_w \; L(\mathbf{w}, S) = \arg\max_w \; \left( - \frac{1}{2 \sigma^2} \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2 \right) $
- $\arg\max_w \; L(\mathbf{w}, S) = \arg\min_w \; \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2 $
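A minimal numerical sketch of this equivalence (not part of the original deck; it assumes synthetic data and uses numpy/scipy): the maximizer of the Gaussian likelihood and the ordinary least squares solution coincide.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (assumed for illustration): y = X w_true + Gaussian noise
rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + sigma * rng.normal(size=n)

# Least squares solution: argmin_w sum_i (y_i - w^T x_i)^2
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Negative log-likelihood, keeping only the terms that depend on w
def neg_log_lik(w):
    return np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)

w_ml = minimize(neg_log_lik, np.zeros(d)).x
print(np.allclose(w_ls, w_ml, atol=1e-4))  # True: same estimator
```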
## Linear Regression: summary{libyli}
- Assuming Gaussian noise around the linear predictor $\mathbf{w}^T\mathbf{x}_i$
- These are equivalent points of view for finding $\mathbf{w}{}$
- maximizing the likelihood $L(\mathbf{w}, S) = \prod\_{i=1}^n p(\mathbf{x}_i, y_i | \mathbf{w})$
- minimizing $\frac12 \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2 $
- solving the least squares linear regression problem
- NB: the noise variance $\sigma^2$ does not appear, cool! {coocoocool}
# Regularized Least Squares:{linearregtitle}
a Bayesian{bay} View{linearregtitle}
## Should we really maximize the likelihood? {libyli}
- We have observations ($S$) and
we want to find the best parameters $\mathbf{w}{}$
- We actually want to find{libyli}
- the parameters that are the most supported by the observations
- i.e., the parameters that are most likely, knowing the observations
- i.e., $\arg\max_w \; p(\mathbf{w} | S)$ (reminder: the likelihood is $L(\mathbf{w}, S) = p(S | \mathbf{w})$) {captain}
- How is $p(\mathbf{w} | S)$ related to the likelihood? {libyli}
- Bayes: $ p(A|B) p(B) = p(A \wedge B) = p(B \wedge A) = p(B|A) p(A) $ {captain}
- Bayes, v2: $ p(A|B) = \frac{p(B|A) p(A)}{p(B)} {}$
- so: $p(\mathbf{w} | S) = \frac{p(S | \mathbf{w}) p(\mathbf{w})}{p(S)} {}$
## Bayesian Posterior Optimization {libyli}
- We have observations $S$, we want the best parameters $\mathbf{w}{}$ {captain}
- We want to {libyli}
- maximize: $p(\mathbf{w} | S) = \frac{p(S | \mathbf{w}) p(\mathbf{w})}{p(S)} {}$
- i.e., maximize: $p(S | \mathbf{w}) p(\mathbf{w})$ (as $p(S)$ does not depend on $\mathbf{w}{}$) {captain}
- i.e., minimize: $- \log(p(S | \mathbf{w}) p(\mathbf{w}))$ (as $\log$ is increasing) {captain}
- i.e., minimize: $-\log(p(S | \mathbf{w})) -\log(p(\mathbf{w}))$
- i.e., minimize: $-\log L(\mathbf{w}, S) -\log(p(\mathbf{w}))$
- One (possible) interpretation {slide}
- we minimize the negative log-likelihood (as in MLE)
but **penalized** by the negative log-prior
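A tiny 1-D sketch of this chain of equivalences (toy data and values assumed, not from the deck): on a grid over $w$, the maximizer of the normalized posterior $p(w | S)$ is also the minimizer of the penalized objective $-\log L(w, S) - \log(p(w))$.

```python
import numpy as np

# Toy 1-D regression data (assumed): y = 0.8 * x + Gaussian noise
rng = np.random.default_rng(1)
sigma, sigma_w = 0.5, 1.0            # noise std, std of a Gaussian prior on w
x = rng.normal(size=50)
y = 0.8 * x + sigma * rng.normal(size=50)

w = np.linspace(-2.0, 2.0, 2001)     # grid of candidate parameters
log_lik = -((y[None, :] - w[:, None] * x[None, :]) ** 2).sum(axis=1) / (2 * sigma**2)
log_prior = -w**2 / (2 * sigma_w**2)

# Posterior on the grid: subtracting the max avoids underflow,
# and the normalization plays the role of dividing by p(S)
post = np.exp(log_lik + log_prior - (log_lik + log_prior).max())
post /= post.sum()

penalized = -log_lik - log_prior     # the objective derived on this slide
print(w[np.argmax(post)] == w[np.argmin(penalized)])  # True: same w
```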
# And Ockham?
(aside on the board) // based on the broadness of the prior etc
## From Prior to Regularization
- Bayesian opt., minimize: $-\log L(\mathbf{w}, S) -\log(p(\mathbf{w}))$ {captain}
- Back to Gaussian noise with variance $\sigma^2$ around the linear predictor {libyli}
- minimize: $\frac{1}{2 \sigma^2} \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2 - \log(p(\mathbf{w}))$
- i.e., minimize: $\frac{1}{2} \sum\_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2 - \sigma^2\cdot \log(p(\mathbf{w}))$
- We can identify {slide}
- the regularization term of regularized least squares
- and, $- \sigma^2\cdot \log(p(\mathbf{w}))$ (the observation noise variance times the negative log-prior) {captain}
- NB: $-\log p(\mathbf{w})$ grows as $\mathbf{w}{}$ becomes less probable under the prior, so it really acts as a penalty {captain}
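A small sketch of this identification (synthetic data assumed, not from the deck), taking the isotropic Gaussian prior of the next slide as the example: the penalized objective becomes ridge regression with $\lambda = \sigma^2 / \sigma_w^2$, and its minimizer matches the closed-form ridge solution.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (assumed): y = X w_true + Gaussian noise of std sigma
rng = np.random.default_rng(2)
n, d, sigma, sigma_w = 100, 5, 0.3, 1.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + sigma * rng.normal(size=n)

lam = sigma**2 / sigma_w**2  # ridge weight implied by the Gaussian prior

def penalized(w):
    # 1/2 * sum_i (y_i - w^T x_i)^2 - sigma^2 * log p(w), constants dropped
    return 0.5 * np.sum((y - X @ w) ** 2) + 0.5 * lam * w @ w

w_map = minimize(penalized, np.zeros(d)).x
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # closed form
print(np.allclose(w_map, w_ridge, atol=1e-4))  # True
```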
## Priors and (some) $L_p$ Norms {dense libyli}
@svg: bayesian-least-square/norm-lap-1d.svg 100px 100px {floatright clearboth}
@svg: bayesian-least-square/norm-lap-2d.svg 100px 100px {floatright clearboth}
@svg: bayesian-least-square/generalized-normal.svg 100px 100px {floatright clearboth}
- Regularization is identified with $- \sigma^2\cdot \log(p(\mathbf{w}))$ {captain} // NB the isotropic normal is easier seen decomposed (TODO)
- Isotropic Normal prior, i.e.: $ \mathbf{w} \sim \mathcal{N}\left(0, \sigma_w^2 \mathbf{I} \right)$
- $\log(p(\mathbf{w})) = cst + \log \exp\left(- \frac{\mathbf{w}^T\mathbf{w}}{2 \sigma_w^2}\right) $
- i.e., $- \log(p(\mathbf{w})) = \frac{\mathbf{w}^T\mathbf{w}}{2 \sigma_w^2} - cst = \frac{\left\| \mathbf{w} \right\|_2^2}{2 \sigma_w^2} - cst$
- Regularizer: $\frac{\sigma^2}{2 \sigma_w^2} \left\| \mathbf{w} \right\|_2^2 $
- can use generalized gaussian to remove the square? {comment}
- Laplace prior, i.e.: $ \mathbf{w}_j \sim \mathrm{Laplace}\left(0, b_w\right)$
- $\log(p(\mathbf{w})) = cst + \sum_j \log \exp\left(- \frac{\left|\mathbf{w}_j\right|}{b_w}\right) $
- Regularizer: $\frac{\sigma^2}{b_w} \left\| \mathbf{w} \right\|_1 $
- Generalized Normal distr. (v1): $p(x | \mu, \alpha, \beta) = \frac{\beta}{2\alpha\Gamma(1/\beta)} \; \exp\left(-\frac{\left|x-\mu\right|^\beta}{\alpha^\beta}\right) {}$
- Regularizer: $\frac{\sigma^2}{\alpha^\beta} \left\| \mathbf{w} \right\|_\beta^\beta $
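A sketch of the Laplace case (synthetic sparse data assumed; scikit-learn is used as a convenience, not something the deck prescribes): the MAP objective $\frac12 \sum_i (y_i - \mathbf{w}^T\mathbf{x}_i)^2 + \frac{\sigma^2}{b_w} \left\| \mathbf{w} \right\|_1$ is the lasso, and the $L_1$ penalty typically drives spurious coefficients exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy sparse problem (assumed): only 3 of 10 features are active
rng = np.random.default_rng(3)
n, d, sigma, b_w = 200, 10, 0.5, 0.02
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [1.5, -2.0, 0.7]
y = X @ w_true + sigma * rng.normal(size=n)

# sklearn's Lasso minimizes 1/(2n) * ||y - Xw||^2 + alpha * ||w||_1,
# so matching 1/2 * ||y - Xw||^2 + (sigma^2 / b_w) * ||w||_1 gives:
alpha = sigma**2 / (n * b_w)
w_map = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
print(np.round(w_map, 2))  # the 3 true features survive, most others are exactly 0
```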
# Mixture Models and Alternatives to Model Selection (skipped)
# Learning Sparse Matrices (skipped)
## Take-home Message {image-fit top-left darkened /black-bg /no-status /fancy-slide}
# That's It!
Questions? {deck-status-fake-end}
# Attributions {no-print}
## jared {image-full bottom-left darkened /black-bg /no-status no-print}
## someToast {image-fit bottom-left darkened /black-bg /no-status no-print}
## Wikipedia {image-fit bottom-left darkened /black-bg /no-status no-print}
/ − will be replaced by the author − will be replaced by the title