Abstract
Learning shared representations is a central goal of multimodal representation learning. Current approaches rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of random walk matrices constructed independently from each unimodal representation. Empirical results in the computer vision and natural language processing domains support this claim, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations and yielding strong performance in retrieval, generation, semantic arithmetic, and zero-shot and cross-domain classification. This work is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that can be viewed as universal, i.e., independent of the specific modalities of the data.
Outstanding Performance with Minimal Pairs
Images retrieved by SUE for custom captions on MSCOCO, with the model trained on only 100 pairs and 10k unpaired samples. Despite the minimal paired data, the results align closely with the text queries.
Retrieval Performance
SUE achieves remarkable retrieval performance across multiple datasets (MSCOCO, Flickr30k, Polyvore, Edges2Shoes) using only 50-500 paired samples, outperforming contrastive learning by an average of 257% when both use the same minimal number of pairs.
SUE: Spectral Universal Embedding
SUE's three-step pipeline: (1) Spectral Embedding extracts universal features from each modality independently, (2) CCA provides linear alignment using minimal paired samples, (3) MMD-net performs non-linear fine-tuning.
Step 1: Spectral Embedding
Learn parametric spectral embeddings independently for each modality using SpectralNet. These capture the global structure and universal properties of the data manifolds.
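To make the step concrete, here is a minimal non-parametric sketch of a spectral embedding built from a Gaussian random-walk graph; the paper uses SpectralNet for a parametric, scalable version, and the bandwidth heuristic and embedding dimension below are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def spectral_embedding(X, dim=16, sigma=None):
    """Spectral embedding of one modality's features X (n_samples x n_features)."""
    D2 = cdist(X, X, "sqeuclidean")
    if sigma is None:
        sigma = np.median(np.sqrt(D2))              # heuristic kernel bandwidth
    W = np.exp(-D2 / (2 * sigma ** 2))              # Gaussian affinity
    deg = W.sum(axis=1)
    # D^{-1/2} W D^{-1/2} shares its spectrum with the random-walk matrix D^{-1} W
    # but is symmetric, so eigh applies.
    S = W / np.sqrt(np.outer(deg, deg))
    vals, vecs = eigh(S)
    top = np.argsort(vals)[::-1][1:dim + 1]         # skip the trivial leading eigenvector
    return vecs[:, top] / np.sqrt(deg)[:, None]     # recover random-walk eigenvectors

# Run independently on each modality's unimodal features, e.g.:
# Z_img = spectral_embedding(image_features)
# Z_txt = spectral_embedding(text_features)
```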
Step 2: CCA Alignment
Use Canonical Correlation Analysis on a minimal number of paired samples (as few as 50-500) to resolve rotational ambiguity and provide initial linear alignment.
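A hedged sketch of this alignment step using scikit-learn's CCA, fit only on the handful of available pairs (the index variables and dimensionality are illustrative):

```python
from sklearn.cross_decomposition import CCA

def cca_align(Z_img, Z_txt, paired_img_idx, paired_txt_idx, n_components=16):
    """Fit CCA on the few paired spectral embeddings, then project all samples."""
    cca = CCA(n_components=n_components, max_iter=1000)
    cca.fit(Z_img[paired_img_idx], Z_txt[paired_txt_idx])    # e.g. 50-500 pairs
    A_img, A_txt = cca.transform(Z_img, Z_txt)               # paired and unpaired alike
    return A_img, A_txt

# A_img, A_txt = cca_align(Z_img, Z_txt, img_ids, txt_ids)
```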
Step 3: MMD Fine-tuning
Train a residual network to minimize Maximum Mean Discrepancy between modalities, leveraging the full unpaired dataset for non-linear refinement.
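A minimal PyTorch sketch of this step, assuming an RBF-kernel MMD estimate and a small residual network; the bandwidths, layer sizes, and training schedule below are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def mmd_rbf(x, y, sigmas=(0.5, 1.0, 2.0)):
    """Multi-bandwidth RBF-kernel estimate of MMD^2 between two batches."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

class ResidualRefiner(nn.Module):
    """f(x) = x + g(x): a non-linear correction on top of the CCA alignment."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return x + self.g(x)

def finetune(A_img, A_txt, epochs=200, batch=256, lr=1e-3):
    """Match the refined image embeddings to the text embeddings on unpaired batches."""
    A_img = torch.as_tensor(A_img, dtype=torch.float32)
    A_txt = torch.as_tensor(A_txt, dtype=torch.float32)
    net = ResidualRefiner(A_img.shape[1])
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        i = torch.randint(len(A_img), (batch,))   # unpaired sampling:
        j = torch.randint(len(A_txt), (batch,))   # batches need not correspond
        loss = mmd_rbf(net(A_img[i]), A_txt[j])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net
```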
Key Insight: Universal Embeddings
(a) Distances between corresponding random walks on image and text graphs show significantly greater similarity than non-matching walks. (b) Paired points are consistently closer in aligned spectral embeddings, indicating independently learned embeddings capture analogous structure across modalities.
We demonstrate that random walks defined on different modality-specific representations exhibit remarkable similarity when those representations capture semantics well. This similarity enables us to learn universal embeddings - representation spaces where corresponding instances from different modalities naturally align, even when trained almost exclusively on unpaired data.
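The sketch below shows one way to check this claim on a dataset where the correspondence is known, by comparing t-step random-walk transition matrices built independently from each modality's features. It is a diagnostic rather than part of the training pipeline, and it assumes row i of both feature matrices describes the same instance.

```python
import numpy as np
from scipy.spatial.distance import cdist

def random_walk_matrix(X, sigma=None):
    """Row-stochastic random-walk matrix P = D^{-1} W from a Gaussian affinity."""
    D2 = cdist(X, X, "sqeuclidean")
    if sigma is None:
        sigma = np.median(np.sqrt(D2))
    W = np.exp(-D2 / (2 * sigma ** 2))
    return W / W.sum(axis=1, keepdims=True)

def walk_similarity(X_img, X_txt, steps=3, seed=0):
    """Mean distance between corresponding t-step walk distributions vs. shuffled ones."""
    P_img = np.linalg.matrix_power(random_walk_matrix(X_img), steps)
    P_txt = np.linalg.matrix_power(random_walk_matrix(X_txt), steps)
    perm = np.random.default_rng(seed).permutation(len(P_txt))
    matched = np.linalg.norm(P_img - P_txt, axis=1).mean()
    shuffled = np.linalg.norm(P_img - P_txt[perm], axis=1).mean()
    return matched, shuffled   # matched is expected to be markedly smaller
```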
Diverse Applications
(a) Cross-modal retrieval captures semantic relationships. (b) Text-to-image generation with minimal text supervision. (c) Semantic arithmetic operations in the universal space.
Cross-Modal Retrieval
Superior retrieval performance across vision-language, vision-vision, and tabular datasets with minimal paired supervision.
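Once both modalities live in the universal space, retrieval reduces to nearest-neighbour search; a minimal sketch (the embedding names are assumed from the pipeline above):

```python
import numpy as np

def retrieve(queries, gallery, k=5):
    """Indices of the k most cosine-similar gallery items for each query."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(q @ g.T), axis=1)[:, :k]

# e.g. text -> image retrieval: top_imgs = retrieve(U_txt[query_ids], U_img)
```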
Text-to-Image Generation
Generate images from text queries using a GAN trained only on images, enabled by universal embeddings.
Semantic Arithmetic
Intuitive vector operations like "man" + "with sunglasses" produce semantically meaningful results.
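As a hedged sketch, such arithmetic amounts to adding embedding vectors in the universal space and decoding by nearest neighbour; the embed() calls are placeholders for the pipeline's text encoder.

```python
import numpy as np

def arithmetic_query(base, attribute, gallery, k=3):
    """Retrieve the gallery items nearest to base + attribute (cosine similarity)."""
    q = base + attribute
    q = q / np.linalg.norm(q)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]

# e.g. arithmetic_query(embed("man"), embed("with sunglasses"), U_img)
```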
Zero-Shot Classification
88% accuracy on a few ImageNet classes when trained on Flickr30k with just 500 pairs.
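Zero-shot classification in this setting can be sketched as assigning each image to the nearest class-name embedding in the universal space (the class prompts and variable names are illustrative):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_emb):
    """Predict, for each image, the class whose text embedding is most similar."""
    x = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    c = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    return np.argmax(x @ c.T, axis=1)

# e.g. preds = zero_shot_classify(U_img_test, U_txt_classes)
```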
Cross-Domain Transfer
Effective knowledge transfer between domains with minimal correspondence (e.g., Office31 dataset).
Unpaired Data is Key
(a) Contrastive learning requires an order of magnitude more pairs to match the results of SUE trained with 100 pairs. (b) SUE improves significantly with additional unpaired data. (c) Beyond a small threshold, SUE relies only minimally on paired data.
Our experiments reveal that SUE derives its strength primarily from unpaired samples. Adding more unpaired data consistently improves performance, while adding more paired samples beyond a small threshold (∼500) provides diminishing returns. This demonstrates the untapped potential of unpaired data in multimodal learning.
Citation
@article{yacobi2025learning,
  title={Learning Shared Representations from Unpaired Data},
  author={Yacobi, Amitai and Ben-Ari, Nir and Talmon, Ronen and Shaham, Uri},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}