Domain Adaptation and Multi-view Learning: using subspace alignment and landmark projections

# {.var-title-br} {*no-status title-slide} // comment - - -

## `Team Data Intelligence @ LabHC`
`(at some point in the past, not exhaustive) {denser}` {image-full bottom-left darkened /black-bg /no-status} // Machine Learning Group in Saint Étienne {comment}

## Disclaimer {image-full bottom-left darkened /black-bg /no-status}

## In a nutshell {#nutshell}

- Transfer learning has multiple facets
  - multi-task
  - multi-view
  - multi-domain
- Domain adaptation by bringing distributions together
  - by aligning subspaces obtained from PCA
  - non-linearly by using projection on landmarks
- Landmarks can also be used for multi-view learning
  - random landmark selection
  - non-linear projection on the landmarks
  - fast linear model

# Transfer Learning:
Multi-* Learning // Domain Adaptation, Multi-View, Multi-Task

## Multi-Task Learning

- Covered a lot in this summer school
- At least a different output for each task, e.g.,
  - different classification tasks: dog-vs-cat and domestic-vs-wild
  - different output kinds: image segmentation and image classification
  - ...

## Multi-View Learning

- Inputs have multiple views, e.g.,
  - different viewpoints of an object
  - multi-modal perception
  - different medical tests on a patient
  - different sets of features extracted from images
  - ...
- There could be missing views for some input data *(we'll come back to this){denser}*

# Multi-domain Learning?

## Domain Adaptation: What and Why? {libyli}

- When do we need Domain Adaptation (DA)? {card}
  - The training distribution is different from the testing distribution
- Example Domain Adaptation task? {card}
  - Given: labeled images (e.g., fruit images)
  - Task: which fruit appears in these unlabeled images of trees?
- {inlineblock center no custom1}
  - Blueberry
  - Almond
  - ⇒
  - Blueberry
  - Almond
- How can we learn, from one distribution, a low-error classifier on another distribution?

## Overview {#plan .plan image-fit darkened top-right /no-status}

- The Multiple Facets of Transfer Learning {tlearn}
- Domain Adaptation by Subspace Alignment {dasa}
- Landmark-based Kernelized Subspace Alignment {landmarks no}
- Deep Multi-Domain Multi-Task Learning {mdmt}
- Random Landmark projection for Multi-View Learning {mvlsvm}

# @copy:#plan: %+class:highlight: .dasa

# Unsupervised Domain Adaptation {title-slide}

- by Subspace Alignment: B. Fernando
- and Landmark Selection: R. Aljundi

## Domain Adaptation: task and notations {libyli}

- Typical binary classification task
  - $X$ : input space, $Y = \\{-1,+1\\} {}$ : output space
- Typical supervised classification {card}
  - $\green{P_S} {}$ source domain: distribution over $X \times Y$
  // - $\green{D_S}{}$: marginal distribution over $X$
  - $\green{S}{} = \\{(x^s_i,y^s_i)\\}_{i=1}^{m_s} \sim (\green{P_S})^{m_s} {}$: a sample of labeled points
  - Goal: Find a classifier $h \in \mathcal{H}{}$ with a low source error $R\_{\green{P\_S}}(h) = \mathbf{E}\_{(x^s,y^s)\sim \green{P_S}}\;\; \mathbf{I}\big[h(x^s)\ne y^s\big] {}$
- Domain Adaptation {card}
  - $\orange{P_T} {}$ target domain: distribution over $X \times Y$, ($\orange{D_T}{}$: marginal over $X$)
  - $\orange{T}{} = \\{(x^t_j)\\}_{j=1}^{m_t} \sim (\orange{D_T})^{m_t} {}$: a sample of unlabeled target points
  - Goal: Find a classifier $h \in \mathcal{H}{}$ with a low target error $R\_{\orange{P\_T}}(h) = \mathbf{E}\_{(x^t,y^t)\sim \orange{P_T}}\;\; \mathbf{I}\big[h(x^t)\ne y^t\big] {}$

@svg:images-da/normal-vs-da.svg 375px 280px {normalvsda slide}

## Domain Adaptation − Domain Divergence {libyli}

- {card dense libyli}
  - {inlineblock no}
    - Labeled source samples $S$ drawn *i.i.d.* from $\green{P_S}{}$ {c5}
    - {c1}
    - Unlabeled target samples $T$ drawn *i.i.d.* from $\orange{D_T}{}$ {c5}
  - $h$ is learned on the source, how does it perform on the target?
  - ⇒ it depends on the closeness of the domains {no}
- {inlineblock no}
  -

- Adaptation Bound [Ben-David et al., MLJ’10, NIPS’06] {card dense libyli}
  - $\forall h\in\mathcal{H},\quad R\_{\orange{P\_T}}(h)\ \leq\ \; R\_{\green{P\_S}}(h)\ +\ \frac{1}{2} d\_{\mathcal{H}\;\Delta\;\mathcal{H}}(\green{D\_S},\orange{D\_T})\ +\ \nu $
  - Domain divergence: $d\_{\mathcal{H}\;\Delta\;\mathcal{H}}(\green{D\_S},\orange{D\_T}) \;=\; 2 \sup\_{(h,h')\in\mathcal{H}^2} \Big| R\_{\orange{D\_T}}(h,h') - R\_{\green{D\_S}}(h,h')\Big| $
  - Error of the joint optimal classifier: $\nu = \inf\_{h'\in\mathcal{H}}\big(R\_{\green{P\_S}}(h')+R\_{\orange{P\_T}}(h')\big)$
- {notes notslide}
  - with some probability (1 - delta)
  - H a symmetric hypothesis space
  - More (by M. Sebban), epat slides
  - about what is d_HH nips 07

## Unsupervised Visual Domain Adaptation Using Subspace Alignment − ICCV 2013
*Basura Fernando, Amaury Habrard, Marc Sebban, Tinne Tuytelaars* {paper libyli}

- Intuition for unsupervised domain adaptation
  - principal components of the domains may be shared
  - principal components should be re-aligned
- Principle // with K.U. Leuven (T. Tuytelaars)
  - extract a source subspace ($d$ leading eigenvectors)
  - extract a target subspace ($d$ leading eigenvectors)
  - learn a linear mapping function that aligns the source subspace with the target one {slide anim-continue}
-

{no}

## Subspace Alignment − Algorithm

- Algorithm {card custom2 libyli}
  - **Input:** Source data $\green{S}{}$, Target data $\orange{T}{}$, Source labels $\green{L\_S}{}$
  - **Input:** Subspace dimension $d$ {no}
  - **Output:** Predicted target labels $\orange{L\_T} {}$ {no}
  - $\green{X\_S} \leftarrow PCA(\green{S},d)$ *(source subspace defined by the first d eigenvectors)*
  - $\orange{X\_T} \leftarrow PCA(\orange{T},d)$ *(target subspace defined by the first d eigenvectors)*
  - $M \leftarrow \green{X\_S}' \orange{X\_T}{}$ *(closed form alignment)*
  - $X\_a \leftarrow \green{X\_S} M$ *(operator for aligning the source subspace to the target one)*
  - $\gray{S\_a} \leftarrow \green{S} X\_a$ *(new source data in the aligned space)*
  - $\gray{T\_T} \leftarrow \orange{T} \orange{X\_T}{}$ *(new target data in the aligned space)*
  - $\orange{L\_T} \leftarrow Classifier(\gray{S\_a},\green{L\_S}, \gray{T\_T})$
- A natural similarity: $Sim(\mathbf{x}\_s,\mathbf{x}\_t)=\mathbf{x}\_sX\_SMX\_T' \mathbf{x}\_t'=\mathbf{x}\_sA \mathbf{x}\_t'$ {slide}

## Subspace Alignment − Recap. {#sarecap}

- Good
  - Very simple and intuitive method
  - Totally unsupervised
  - Theoretical results for dimensionality detection
  - Good results on computer vision datasets
  - Can be combined with supervised information // as we learn in the target space in the end
- Bad {limitations}
  - Cannot be directly kernelized to deal with non-linearity
  - Actually assumes that spaces are relatively close
- Ugly {limitations}
  - Assumes that all the source and target examples are relevant
- **Idea:** *Select landmarks from both source and target domains to project the data in a common space using a kernel w.r.t. those chosen landmarks. Then the subspace alignment is performed. {dense}* {hidden}

# @copy:#plan: %+class:highlight: .landmarks

# @copy:#sarecap: %+class:highlight: .limitations + %-class:hidden: .hidden

## Principle of Landmarks {libyli}

- JMLR 2013 − *Connecting the Dots with Landmarks: Discriminatively Learning Domain-Invariant Features for Unsupervised Domain Adaptation{denser}* {no}
  - Boqing Gong, Kristen Grauman, Fei Sha
- Principle: find source points (the landmarks) such that the domains are similarly distributed “around” them {inlineblock}
  - @svg:images-da-landmarks/landmarks1.svg 280px 100px
  - @svg:images-da-landmarks/landmarks2.svg 280px 100px
- Optimization problem: $\min\_\alpha \left\\| \frac{1}{\sum\_m \alpha\_m } \sum\_m \alpha\_m \phi (x\_m) - \frac{1}{N} \sum\_n \phi(x\_n) \right\\|^2$ {dense}
- {no}
  - $\alpha$: binary landmark indicator variables
  - $\phi(.)$: nonlinear mapping, maps every $x$ to an RKHS
  - minimize the difference in sample-means
  - \+ a constraint: *labels should be balanced among the landmarks*

## Landmarks-based Kernelized Subspace Alignment for Unsupervised DA − CVPR 2015
*Rahaf Aljundi, Rémi Emonet, Damien Muselet, Marc Sebban* {paper libyli}

- Intuition for landmarks-based alignment
  - subspace alignment does not handle non-linearity
  - subspace alignment cannot “ignore” points
  - landmarks can be useful to handle locality and non-linearity
- Challenges
  - selecting landmarks in an unsupervised way
  - choosing the proper Gaussian-kernel scale

@svg:images-da-landmarks/landmarks3.svg 700px 200px

## Proposed Approach − Workflow

@svg:images-da-landmarks/workflow.svg 700px 400px

- @anim: %viewbox:#zlandpro +
- @anim: #s1 | #s1t2 | #s2 | #s3 | %viewbox:#zpca
- @anim: #s4 | %viewbox:#zalign
- @anim: #s5 | %viewbox:#zclassify
- @anim: #s6 | #s7 | %viewbox:#zall
- Overall approach
  - 2 new steps: *landmark selection*, *projection* on landmarks
  - subspace alignment

## Multiscale Landmark Selection {libyli}

- Select landmarks among all points, $\green{S} \cup \orange{T} {}$
- Greedy selection
  - consider each candidate point $c$ and a set of possible scales $s$
  - criterion to promote the candidate
    - after projection on the candidate
    - the overlap between source and target distributions is above a threshold
- Projection: a point is projected with $K(c, p)= \exp \left( \frac{-\left\|c - p\right\|^2}{2 s^2} \right)$ {dense}
- Overlap {libyli}
  - project source and target points
  - fit two Gaussians (one for each)
  - $ overlap(\green{\mu\_S, \sigma\_S} ; \orange{\mu\_T, \sigma\_T}) = \frac{\mathcal{N}(\green{\mu\_S} - \orange{\mu\_T} \mid 0, \sigma\_{sum}^2)}{\mathcal{N}(0 \mid 0, \sigma\_{sum}^2)} $
    - normalized integral of product
  - with $\sigma\_{sum}^2 = \green{\sigma\_S}^2 + \orange{\sigma\_T}^2$, and $\mathcal{N}(. \mid 0, \sigma\_{sum}^2)$ centered 1d-Gaussian

## Landmark-Based Alignment − Overall {#sadarecap libyli}

- Select landmarks among all points, $\green{S} \cup \orange{T} {}$
  - greedy selection
  - multi-scale selection
  - maximize domain overlap
- Project all points on the landmarks
  - use a Gaussian kernel
  - $\sigma \gets median\\_distance(S \cup T) $
- Subspace-align the projected points
  - PCA on source domain
  - PCA on target domain
  - compute the alignment $M$

## Landmark-Based Alignment − Results {libyli}

- Is landmark-based kernelization useful?
  - *@svg:images-da-landmarks/results.svg 730px 200px* {no}
- Is our landmark-selection any good?
  - *@svg:images-da-landmarks/results-sel-landmarks.svg 730px 200px* {no}

# “Deep” Domain Adaptation {deep-da}

## Domain Adaptation in Deep Neural Nets {libyli dense}

*Based on the same core principles: bring distributions together*

- See Elisa Fromont's talk *@SVG: media/ganin.svg 400 200 {floatright}*
- Domain-Adversarial Training... Ganin et al. (JMLR 2016) // Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Victor Lempitsky
- ADDA
- chairlifts
- avoiding negative transfer using domain distances
// - @anim: #domclass | #domclassback | #grl | #grlfeat
- Batch normalization and AdaBN
- AutoDIAL
- Multitask-multidomain semantic segmentation (Damien Fourure)
-

*@SVG: media/mdmt.svg 350 180 {inlineblock}* {no}

# @copy:#plan: %+class:highlight: .mvlsvm

# Multi-view Classification
with Landmark-based SVM {title-slide}

- by **Valentina Zantedeschi**, Rémi Emonet, Marc Sebban
- as part of the ANR LIVES project (multiview)

## MVL-SVM Principle

- Randomly select landmarks {select-landmarks}
  - $L$ points $l_1, l_2, \cdots, l_L$ from the dataset
  - with no missing views
- Project all points on these landmarks {project}
  - use an arbitrary similarity measure $\mu$
- Learn a model (classifier) {learn}
  - in the joint projected space
  - fast and linear (non-linearity already in the projection)
- *@SVG: mvlsvm/mvlsvm.svg 750 250* {no}
- @anim: #xi |#xin |#v1 |#v2 |#v3
- @anim: .select-landmarks | .project
- @anim: #mapv1 |#mapv2 |#mapv3 |#muxi |#mapjoint, #joint |.learn |#fx, #learnjoint

## Generalization Bound

- The generalization bound of MVL-SVM, derived using the Uniform Stability framework: $ R_{\mathcal{D}}(f)\\! \leq \\!\hat{R}_{S}(f) + \frac{c L V M^2}{m} + \left( 2c L V M^2 \\!+ \\!1 \\!+\\! 2c \sqrt{L V} M \\! \right)\\!\\!\sqrt{\frac{\ln \frac{1}{\delta}}{2m}} $ {dense}
  - $L$ number of landmarks
  - $V$ number of views
  - $m$ number of samples
- NB
  - stable if $L \ll \frac{m}{V}{}$ // ok, as usually $m \gg V$.
  - the lower $L$, the more stable

## MVL-SVM Results {libyli}
// svm-2k co-regularization of view to maximize agreement

- {inlineblock no} // MVML Multi-view Metric Learning (MVML) which combines vvRKHS with Metric Learning
  -

{c5} -

{c5} - {inlineblock no} -

{c10}

## Missing Views? {libyli}

- Landmark-based missing view reconstruction method
- Allows maintaining accuracy and scalability
- {inlineblock no}
  -

{c5} -

{c5}

# @copy:#nutshell

# Thank You
Questions? {deck-status-fake-end no-print}

# Supplementary Material {no-print}

# Supp. on source and target risk {sup-risks}

## Link the Target Risk to the Source?

\begin{matrix} \\\\
R\_{\orange{P\_T}}(h)&=& \mathbf{E}\_{(x^t,y^t)\sim \orange{P\_T}}\mathbf{I}\big[h(x^t)\ne y^t\big]\\\\ \\\\
&=&\mathbf{E}\_{(x^t,y^t)\sim \orange{P\_T}}\frac{\green{P\_S}(x^t,y^t)}{\green{P\_S}(x^t,y^t)}\mathbf{I}\big[h(x^t)\ne y^t\big]\\\\ \\\\
&=&\sum\_{(x^t,y^t)} \orange{P\_T}(x^t,y^t)\frac{\green{P\_S}(x^t,y^t)}{\green{P\_S}(x^t,y^t)}\mathbf{I}\big[h(x^t)\ne y^t\big]\\\\ \\\\
&=&\mathbf{E}\_{(x^t,y^t)\sim \green{P\_S}}\frac{\orange{P\_T}(x^t,y^t)}{\green{P\_S}(x^t,y^t)}\mathbf{I}\big[h(x^t)\ne y^t\big]\\\\
\end{matrix} {latex slide}

# Supp. on covariate shift

## Domain Adaptation − Covariate Shift? {libyli}
// This difference between the two domains is called covariate shift (Shimodaira, 2000).

- {card dense}
  - R\_{\orange{P\_T}}(h)\; =\;\; \mathbf{E}\_{(x^t,y^t)\sim \green{P\_S}}\frac{\orange{P\_T}(x^t,y^t)}{\green{P\_S}(x^t,y^t)}\mathbf{I}\big[h(x^t)\ne y^t\big] {latex}
  - The target risk can be rewritten as an expectation on the source
- Covariate Shift {card}
  - When $\green{P\_S}(y^t|x^t)=\orange{P\_T}(y^t|x^t)$ (covariate shift assumption)
    - Very strong assumption
    - The ratio $\orange{D\_T}(x^t)/\green{D\_S}(x^t)$ can be estimated from unlabeled data
  - {no}
  - \begin{matrix} {R\_{\orange{P\_T}}(h)}&=&\mathbf{E}\_{(x^t,y^t)\sim \green{P\_S}}\frac{\orange{D\_T}(x^t)\orange{P\_T}(y^t|x^t)}{\green{D\_S}(x^t)\green{P\_S}(y^t|x^t)}\mathbf{I}\big[h(x^t)\ne y^t\big]\\\\ \\\\ &=&\mathbf{E}\_{(x^t,y^t)\sim \green{P\_S}}\frac{\orange{D\_T}(x^t)}{\green{D\_S}(x^t)}\mathbf{I}\big[h(x^t)\ne y^t\big]\\\\ \end{matrix} {latex no slide}
  - **⇒ Approach**: density estimation and instance re-weighting {no slide} // actually, it is simpler to estimate the ratio of densities directly
- {notes notslide}
  - nice pres http://www.slideserve.com/Anita/sample-selection-bias

## Supp. on Subspace Alignment Results {.sup-saresults}

## Subspace Alignment − Experiments {libyli}

- *@svg:images-da-align/img-iccv13.svg 800px 200px* {no}
- Comparison on visual domain adaptation tasks // SURF, 800 words?
  - adaptation from Office/Caltech-10 datasets (four domains to adapt)
  - adaptation on ImageNet, LabelMe and Caltech-256 datasets: one is used as source and one as target
- Other methods
  - Baseline 1: projection on the source subspace
  - Baseline 2: projection on the target subspace
  - 2 related methods:
    - GFS [Gopalan et al., ICCV'11] // Geodesic flow subspaces
    - GFK [Gong et al., CVPR'12] // Geodesic flow kernel

## Subspace Alignment − Results {dense libyli}

- Office/Caltech-10 datasets {inlineblock}
  - @svg:images-da-align/result1-NN-iccv13.svg 330px 250px
  - @svg:images-da-align/result1-SVM-iccv13.svg 330px 250px
- ImageNet (I), LabelMe (L) and Caltech-256 (C) datasets {inlineblock}
  - @svg:images-da-align/result2-NN-iccv13.svg 330px 130px
  - @svg:images-da-align/result2-SVM-iccv13.svg 330px 130px

# Attribution

## CC by genevieveromier (Flickr) {image-full bottom-left darkened /black-bg /no-status}

## CC by anmuell (Flickr) {image-full bottom-left darkened /black-bg /no-status}

## CC by sgillies (Flickr) {image-full bottom-left darkened /black-bg /no-status}

## CC by mustetahra (Flickr) {image-full bottom-left darkened /black-bg /no-status}
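# Supp. on Subspace Alignment in code

## Subspace Alignment − NumPy Sketch

The steps of the Subspace Alignment algorithm slide can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names `pca_basis`, `subspace_alignment` and `predict_1nn` are hypothetical, a 1-nearest-neighbour rule stands in for the unspecified `Classifier` step, and the toy data (a rotated copy of the source) is an assumption made only to have something to run.

```python
import numpy as np


def pca_basis(X, d):
    """Return the d leading principal directions of X as columns (shape D x d)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the covariance eigenvectors, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T


def subspace_alignment(S, T, d):
    """Follow the slide: X_S, X_T from PCA, M = X_S' X_T, X_a = X_S M."""
    Xs = pca_basis(S, d)
    Xt = pca_basis(T, d)
    M = Xs.T @ Xt   # closed-form alignment matrix (d x d)
    Xa = Xs @ M     # source basis aligned to the target basis
    Sa = S @ Xa     # source data in the aligned space
    Tt = T @ Xt     # target data in the target subspace
    return Sa, Tt, M


def predict_1nn(Sa, Ls, Tt):
    """Stand-in for Classifier(S_a, L_S, T_T): copy the label of the nearest source point."""
    sq_dists = ((Tt[:, None, :] - Sa[None, :, :]) ** 2).sum(axis=2)
    return Ls[np.argmin(sq_dists, axis=1)]


# Toy data (assumption): roughly centered source, target = rotated source.
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
Ls = np.where(S[:, 0] > 0, 1, -1)
theta = 0.5
R = np.eye(5)
R[:2, :2] = [[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]
T = S @ R

Sa, Tt, M = subspace_alignment(S, T, d=2)
Lt = predict_1nn(Sa, Ls, Tt)
```

As on the slide, the data are projected without re-centering, so the sketch assumes inputs that are already centered or normalized. When source and target share the same subspace, $M$ reduces to the identity and the method falls back to plain PCA projection.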

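# Supp. on landmark selection in code

## Multiscale Landmark Selection − NumPy Sketch

The projection and overlap formulas of the Multiscale Landmark Selection slide can be sketched as follows. The ratio of the two Gaussians simplifies to $\exp\big(-(\mu\_S - \mu\_T)^2 / (2\sigma\_{sum}^2)\big)$, which is what `overlap` computes. The names `project_on_candidate`, `overlap` and `select_landmarks`, the scale list and the threshold value are all hypothetical; the selection loop is one reading of the slide (keep a candidate as soon as one scale passes the threshold), not necessarily the exact procedure of the paper.

```python
import numpy as np


def project_on_candidate(points, c, s):
    """Gaussian-kernel projection K(c, p) = exp(-||c - p||^2 / (2 s^2))."""
    sq_dists = ((points - c) ** 2).sum(axis=1)
    return np.exp(-sq_dists / (2.0 * s ** 2))


def overlap(mu_s, sigma_s, mu_t, sigma_t):
    """N(mu_s - mu_t | 0, var_sum) / N(0 | 0, var_sum), in simplified form."""
    var_sum = sigma_s ** 2 + sigma_t ** 2
    return np.exp(-((mu_s - mu_t) ** 2) / (2.0 * var_sum))


def select_landmarks(S, T, scales, threshold):
    """Keep a candidate if, for at least one scale, the overlap exceeds the threshold."""
    candidates = np.vstack([S, T])
    keep = []
    for i, c in enumerate(candidates):
        for s in scales:
            ps = project_on_candidate(S, c, s)   # projected source points
            pt = project_on_candidate(T, c, s)   # projected target points
            # One 1d Gaussian is fitted per domain (mean and standard deviation).
            if overlap(ps.mean(), ps.std(), pt.mean(), pt.std()) > threshold:
                keep.append(i)
                break
    return candidates[np.array(keep, dtype=int)]


# Toy data (assumption): two shifted 2d Gaussian clouds.
rng = np.random.default_rng(0)
S = rng.normal(loc=0.0, size=(30, 2))
T = rng.normal(loc=0.5, size=(30, 2))
landmarks = select_landmarks(S, T, scales=[0.5, 1.0, 2.0], threshold=0.3)
```

If both fitted standard deviations were exactly zero the overlap would be undefined; the sketch does not guard against that degenerate case.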
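# Supp. on MVL-SVM in code

## MVL-SVM Principle − NumPy Sketch

The three steps of the MVL-SVM Principle slide (random landmarks, projection, linear model) can be sketched as follows. This is only an illustration under stated assumptions: `rbf_similarity` is one possible choice for the arbitrary similarity $\mu$, `landmark_projection` is a hypothetical helper name, the two-view toy data is made up, and a ridge-regularized least-squares fit replaces the linear SVM so that the block needs nothing beyond NumPy.

```python
import numpy as np


def rbf_similarity(A, B, gamma=1.0):
    """Example choice for mu (assumption): Gaussian similarity between rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)


def landmark_projection(views, landmark_idx, mu=rbf_similarity):
    """Map each sample to its similarities to the L landmarks in each of the V views.

    views: list of V arrays of shape (n, d_v); returns an array of shape (n, L * V).
    """
    return np.hstack([mu(Xv, Xv[landmark_idx]) for Xv in views])


# Toy two-view data (assumption): both views are noisy functions of the label.
rng = np.random.default_rng(0)
n, L = 100, 10
y = rng.choice([-1.0, 1.0], size=n)
views = [y[:, None] + rng.normal(size=(n, 3)),
         y[:, None] + rng.normal(size=(n, 4))]

# Step 1: random landmark selection among the samples.
landmark_idx = rng.choice(n, size=L, replace=False)
# Step 2: non-linear projection on the landmarks, views concatenated.
Z = landmark_projection(views, landmark_idx)
# Step 3: linear model in the joint projected space
# (ridge least squares here, as a stand-in for the linear SVM).
Zb = np.hstack([Z, np.ones((n, 1))])
w = np.linalg.solve(Zb.T @ Zb + 1e-3 * np.eye(Zb.shape[1]), Zb.T @ y)
train_accuracy = np.mean(np.sign(Zb @ w) == y)
```

The model stays linear in `Z`; all the non-linearity comes from the similarity used in the projection, which is the design choice the slide emphasizes.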