Please wait, while our marmots are preparing the hot chocolate…
=Meta=[*no-status]
* Introduction to Fisher Kernel
* Introduction to Fisher Kernel/Vectors for Image classification
* M1 internship: Fisher Kernel for Temporal Topic Models
====[title-slide][*no-status]
*
*
==Simple Quiz!==
=How similar are these two 2d points?=
// color does not matter
=WAT?=[image-full][top-left][darkened][*no-status][*black-bg]
=Which pair of points are most similar?=
// color does not matter
@
A: narrow
B: wide
@
=Fisher Kernel:
context-aware comparison of points=[*no-status][image-full][bottom-left][darkened][*black-bg]
// This is the intuition: comparing points given a global distribution (within a context)
==Probably Just Some Reminders==
=Some Probability Rules, Measure Theory=
@svg: floatrightbr,aa proba/potato-xandy.svg 245px 135px
@svg: floatrightbr,bb proba/potato-xsumy.svg 245px 285px
* Product rule[prod]
** $p(X, Y) = p(X|Y) \, p(Y)$ = $p(Y|X) \, p(X)$
* Marginalization, Sum rule[sum]
** $p(X) = \sum_{Y \in \mathcal{Y}} p(X,Y)$
** $p(X) = \int_{Y \in \mathcal{Y}} p(X,Y)$
* Bayes rule[bayes]
** $p(Y|X) = \frac{p(X|Y) \, p(Y)}{p(X)}$
** $p(X) = \sum_{Y\in \mathcal{Y}} p(X|Y) \, p(Y)$[denom]
@anim-appear:800: .aa + .prod | .bb + .sum | .app | .bayes | .denom
=Gaussian/Normal Distribution: basics=
@svg: floatrightbr,fullabs gaussian/normal-distribution-stddev.svg 800px 500px
=Gaussian/Normal Distribution: basics=
@svg: floatrightbr gaussian/normal-distribution-stddev.svg 200px 200px
* Normal Distribution or Gaussian Distribution[bb]
** $N(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} exp(-\frac{(x-\mu)^2}{2 \sigma^2})$
** Is-a probability density[bbb]
** $\int_{-\infty}^{+\infty} N(x|\mu,\sigma^2) dx = 1$[bbb]
** $N(x|\mu,\sigma^2) > 0$[bbb]
* Parameters[cc]
** $\mu$: mean, $E[X] = \mu$
** $\sigma^2$: variance, $E[(X -E[X])^2 ] = \sigma^2$
@anim-appear:800: .bb |.bbb |.cc
=Multivariate Normal Distribution=
* D-dimensional space: $x = \{x_1, ..., x_D\}$
* Probability distribution[slide]
** $N(x|\mu,\Sigma) = \frac{1}{\sqrt{(2\pi)^D \|\Sigma\|}}\; exp(-\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{2})$
** $\Sigma$: covariance matrix
@svg: floatleft gaussian/multivariate-normal.svg 800px 250px
@anim-appear:400: .hasSVG
==Probabilistic Models
and
Statistical Learning==
=Maximum Likelihood Estimation=
* We observe data points[slide]
** e.g., $\{ X_i \}_{i=1..N} \in R^N$
* We suppose a model (with parameters)[slide]
** e.g., $\forall i, X_i \sim N(.|\mu,\sigma^2)$
* We may try to maximize the likelihood
(find the “best parameters”)[slide]
** e.g., $\operatorname{arg max}_{\mu, \sigma^2} \mathcal{L}(\mu, \sigma^2 | X) = \operatorname{arg max}_{\mu, \sigma^2} p(X | \mu, \sigma^2) $
// what about priors? could be a two-hour discussion :)
* We often maximize the log-likelihood[slide]
** e.g., $\operatorname{arg max}_{\mu, \sigma^2} \mathcal{L}(\mu, \sigma^2 | X) = \operatorname{arg max}_{\mu, \sigma^2} \log \mathcal{L}(\mu, \sigma^2 | X)$
** e.g., $\log p(X | \mu, \sigma^2) = \log \prod_{i=1..N} p(X_i | \mu, \sigma^2) = \sum_{i=1..N} \log p(X_i | \mu, \sigma^2)$[slide]
** ⇒ equivalent, more convenient[slide]
=Maximization of the Log-Likelihood=
* We need to maximize $\log \mathcal{L}(\mu, \sigma^2 | X) = \log p(X | \mu, \sigma^2)$
* We set the derivatives to $0$ and solve it[slide]
** e.g., $\frac{\partial \log p(X | \mu, \sigma^2)}{\partial \mu} = 0$ and $\frac{\partial \log p(X | \mu, \sigma^2)}{\partial \sigma^2} = 0$
* Example, with a Gaussian: $p(X_i | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp^{(-\frac{(x-\mu)^2}{2 \sigma^2})}$[slide]
** we get $\mu = \frac{1}{N}\sum_{i=1..N} X_i$ and $\sigma^2 = \frac{1}{N}\sum_{i=1..N} (X_i - \mu)^2$
* The problem can also be written using the gradient[slide]
** $\nabla_{\mu,\sigma^2} \log \mathcal{L}(\mu, \sigma^2 | X) = \vec{0}$
** the gradient is also called the Fisher Score, noted (here) $U_{X}$[slide]
// $U_X(\mu, \sigma^2)$
=Understanding the Fisher Score=
* Fisher Score: $\nabla_{\theta} \log \mathcal{L}(\theta | X) = U_X(\theta)$
* A function of $\theta$, considering $X$ fixed[slide]
* How data points $X$ want to change the parameters $\theta$ ?[slide]
// how the data wants to tune/update the model
** $X$ is known
** $\theta$ is known
** (considers a given model, with its parameters given)
* (( Maximum likelihood estimation by gradient descent ))[slide]
** we look for the best model ($\theta$ unknown)
** evaluate the score (gradient) in the current estimate $\theta_t$
** compute the next estimate $\theta_{t+1}$
=Fisher Kernel for Probabilistic Models=
* Kernel trick
** goal: generalize linear algorithms to non-linear spaces
** given some input points ${X_i}$ (in any dimension)[slide][anim-continue]
** manipulate similarities given from a kernel $K(X_i, X_j)$
(instead of the points)[slide]
* Fisher Kernel: a kernel from a prob. mod.[slide]
** compare data points
** in the context of a model with parameters $\theta$ (e.g., $\{\mu,\sigma^2\}$)
** $K(X_i, X_j) = U^T_{X_i} I^{-1} U_{X_j} = \nabla^T \log \mathcal{L}(\mu, \sigma^2 | X_i) \; I^{-1} \; \nabla \log \mathcal{L}(\mu, \sigma^2 | X_j)$[slide]
// Actually I(\theta)
** $I$ : Fisher Information Matrix (normalizer)[slide]
** $I$ is a kind of normalization factor : $I = \E_{X}[U_{X} U_{X}^T]$[slide]
** “sometimes” $I_{ij} = -\E[\frac{\partial^2 \log \mathcal{L}(\theta)}{\partial \theta_i \partial \theta_j}]$[slide]
=Which pair of points are most similar?=
// color does not matter
@
A: narrow
B: wide
@
=Example: narrow vs wide=
=Example: mixture of gaussians=
=Why $\frac{\partial \log \mathcal{L}}{\partial\theta}$ and not $\frac{\partial \log \mathcal{L}}{\partial X}$?=
@anim-appear:400: img
* Other: is $X$ continuous? map-able between instances? …[slide]
=Why $\frac{\partial \log \mathcal{L}}{\partial\theta}$ and not $\frac{\partial \mathcal{L}}{\partial \theta}$?=
* Derivative of the log: $\partial\log f = \frac{\partial f}{f}$[slide]
* ⇒ normalized/relative gradient[slide]
=Questions? Comments?=[image-fit][bottom-left][darkened][deck-status-fake-end]
@
=Coffee Time!=[*no-status][image-full][top-left][darkened][*black-bg][deck-status-fake-end]
=Fisher Vectors for Image Classification ⇒=[title-slide][*no-status][image-full][*black-bg][top-left][darkened]
=Next: Fisher Kernel
from Temporal Mixture Models=[*no-status][image-full][bottom-left][darkened][*black-bg]
=Temporal Mixture Models=[*no-status][image-full][bottom-left][darkened][*black-bg]
/* ============= TEMPORAL MIXTURE MODELS ============= */
=Task: Unsupervised Motif Mining=
@SVG:owbg motif-mining/motif-mining-task.svg 750 400
@anim-appear:400: #layer1 + -#init | #layer2 | #layer3 | #layer7 | #layer4 | #layer6 | #layer5
/* ------------- PLSM THEN ---------------- */
=Probabilistic Latent Sequential Motifs=
@svg:owbg past-temporal-approaches/plsm-process.svg 780px 500px
@anim-appear:600: div.owbg | #g4915 | #rect3279 | #rect3279-0 | #rect3279-0-3 | #g5010 | -#rect3279,#rect3279-0,#rect3279-0-3,#g5010 | #g5123 | -#g5123 | -#g4915 | #g4915-8 | -#g4915-8 | #g4915-8-4 | -#g4915-8-4
=PLSM Graphical Model=
@svg:owbg,floatright time-models/plsm-model-3.svg 300px 400px
* Observations (time instant and word)
within a document: $(w,t_a)$
* Motifs: $\forall k \; \phi_k$ ≡ $p(w,t_r|z=k)$[slide]
* Start/occurrences of motifs
within a document: $(z,t_s)$[slide]
/* ------------- VIDEO CASE ---------------- */
=Video Case: Full Process=
@SVG:owbg,space10 video-mining/process-full.svg 780px 470px
@anim-appear:400: #layer1 | #shortll | #tdocetc
@anim-appear:400: #shorttm | #layer7 | #layer5
@anim-appear:1000: #layer6 + -#layer3 + @#flow1
@anim-appear:600: @#flow2 | @#flow3 | -#layer6 + @#box1 + #layer2
@anim-appear:1000: @#box2
@anim-appear:1000: @#box3
/* ------------- VIDEO CASE: REPRESENTATION ---------------- */
=Video Case: motif representation=
@SVG:fullsvg motif-repr/repr-motif.svg 780px 350px
@anim-appear:400: #motiftable | #rt0 | #rt1 | #rt2 | #rt3 | #rt4 | #rt5 | #giffy | -#giffy | #arrow + #magic
/* --------- KUETTEL --------- */
=Extracted Motifs: Kuettel3=[img3inwidth][shy]
+
/* ------------- ACTION RECOGNITION IN VIDEOS ---------------- */
==Action Recognition in Videos==
=Motifs for Action Recognition=
@SVG:owbg action-recognition/explain.svg 780px 300px
// * Better generalization properties[title]
* Extract STIP
** spatiotemporal interest points
** like SIFT but with 2D+t cubes
* Learn temporal patterns of words (quantified STIP)
@CSS: .slide .hasSVG.dakth {margin-top: 15px;}
@CSS: .slide .hasSVG.kthpos {position: absolute; left:180px; top:58px;}
=Supervised Action Recognition=
@SVG:owbg action-recognition/overview-learning.svg 780 300
@SVG:owbg,dakth action-recognition/overview-test.svg 780 200
@SVG:wbg,kthpos action-recognition/kth_ntrain.svg 500 400
@anim-appear:700: .dakth | .kthpos
=Temporal Topic Models for Classification=
* Tested on action recognition
* To make it work for classification
** need to learn one model per class
** need to have very little learning data
/* -------- FISHER FOR PLSM --------- */
==Fisher Kernel/Vector
From Temporal Topic Models==
=Approach and Goal=
* Goal[slide]
** improve discrimination power of temporal topic models (PLSM, ...)
* Approach[slide]
** derive a Fisher Kernel from these models
** learn classifiers using the kernel
** and/or manipulate Fisher Vectors directly[slide]
* Expected outcome[slide]
** no supervision at training
** better classification accuracy
** model-aware analysis of document similarity
(more in next slides)
=Datasets for Classification of Time Series=
* Pure classification
** action recognition
*** STIP words across time
** music genre recognition
*** spectrograms
* Bird songs analysis, whales, …
** features
*** spectrograms
*** MFCC across time
=Probabilistic Latent Sequential Motifs=
// What are the parameters?
@svg:owbg past-temporal-approaches/plsm-process.svg 780px 500px
@anim-appear:300: div.owbg | #g4915 | #rect3279 | #rect3279-0 + #rect3279-0-3 + #g5010 + -#rect3279,#rect3279-0,#rect3279-0-3,#g5010 + #g5123 + -#g5123 + -#g4915 + #g4915-8 + -#g4915-8 + #g4915-8-4 + -#g4915-8-4
=Reminder: parameters in PLSM=
@svg:owbg,floatright time-models/plsm-model-3.svg 300px 400px
* Parameters[slide]
** global, motifs: $\forall k, \phi_k = p(w,t_r | k)$
** per document: $\forall d, \theta_d = p(z,t_s | d)$
* Variation of motif weights ($\theta$)
** $p(z | d) = \sum_{t_s} p(z, t_s|d)$
** Amount of motif in each document
* Variation of motif shapes ($\phi$)
==The End==
←
→
/ − will be replaced by author − will be replaced by the title