Learning Shared Representations from Unpaired Data

NeurIPS 2025
Amitai Yacobi*, Nir Ben-Ari*, Ronen Talmon, Uri Shaham
Bar-Ilan University | Technion | Moodify.ai
* Equal contribution, random order

Abstract

Learning shared representations is a primary area of multimodal representation learning. Current approaches rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support this claim, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations and demonstrating strong performance in retrieval, generation, semantic arithmetic, and zero-shot and cross-domain classification. This work is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data.

Outstanding Performance with Minimal Pairs

Figure 2: Retrieval Examples with 100 Pairs

Images retrieved by SUE for custom captions on MSCOCO, trained with only 100 pairs and 10k unpaired samples. Despite the minimal paired data, the results align closely with the semantics of the text queries.

Retrieval Performance

+257% improvement over contrastive learning on MSCOCO
100 paired samples needed (vs. 400M for CLIP)
10× fewer pairs than contrastive learning for the same results

SUE achieves remarkable retrieval performance across multiple datasets (MSCOCO, Flickr30k, Polyvore, Edges2Shoes) using only 50-500 paired samples, outperforming contrastive learning by an average of 257% when both use the same minimal number of pairs.

SUE: Spectral Universal Embedding

Figure 3: SUE Pipeline Overview

SUE's three-step pipeline: (1) Spectral Embedding extracts universal features from each modality independently, (2) CCA provides linear alignment using minimal paired samples, (3) MMD-net performs non-linear fine-tuning.

Step 1: Spectral Embedding

Learn parametric spectral embeddings independently for each modality using SpectralNet. These capture the global structure and universal properties of the data manifolds.
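The sketch below illustrates the idea behind this step. SUE uses SpectralNet to learn a parametric embedding; for illustration only, we compute a plain non-parametric spectral embedding of a kNN random-walk matrix with numpy/scipy. The function and parameter names (spectral_embedding, n_neighbors, sigma) are ours, not the authors'.

# Minimal, non-parametric sketch of the spectral-embedding step.
# The paper uses SpectralNet for a parametric embedding; this version only
# illustrates the underlying spectral computation on one modality.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.linalg import eigsh

def spectral_embedding(features, n_components=64, n_neighbors=15, sigma=1.0):
    """Embed one modality's features via eigenvectors of its normalized affinity."""
    # Symmetric kNN graph with a Gaussian kernel on the distances.
    dists = kneighbors_graph(features, n_neighbors, mode="distance")
    dists = dists.maximum(dists.T)                        # symmetrize
    affinity = dists.copy()
    affinity.data = np.exp(-(affinity.data ** 2) / (2 * sigma ** 2))

    # Normalized affinity D^{-1/2} W D^{-1/2}; its eigenvectors are those of
    # the random-walk matrix D^{-1} W up to a D^{1/2} rescaling.
    deg = np.asarray(affinity.sum(axis=1)).ravel()
    d_inv_sqrt = 1.0 / np.sqrt(deg + 1e-12)
    norm_aff = affinity.multiply(d_inv_sqrt[:, None]).multiply(d_inv_sqrt[None, :])

    # Leading eigenvectors, skipping the trivial constant one.
    vals, vecs = eigsh(norm_aff.tocsc(), k=n_components + 1, which="LA")
    order = np.argsort(-vals)
    return vecs[:, order[1:]]                             # (n_samples, n_components)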

Step 2: CCA Alignment

Use Canonical Correlation Analysis on a minimal number of paired samples (as few as 50-500) to resolve rotational ambiguity and provide initial linear alignment.
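A minimal sketch of this step, assuming the per-modality spectral embeddings from Step 1 are available as arrays. It fits scikit-learn's CCA on the few known pairs and then projects all samples, paired and unpaired, into the shared space; variable names and dimensions are illustrative.

# Linear alignment with CCA, fit only on the small paired subset.
from sklearn.cross_decomposition import CCA

def cca_align(Z_img, Z_txt, paired_idx_img, paired_idx_txt, n_components=32):
    cca = CCA(n_components=n_components, max_iter=1000)
    # The few pairs (e.g., 100) resolve the rotational ambiguity.
    cca.fit(Z_img[paired_idx_img], Z_txt[paired_idx_txt])
    # Project all samples (paired and unpaired) into the shared space.
    Z_img_aligned, Z_txt_aligned = cca.transform(Z_img, Z_txt)
    return Z_img_aligned, Z_txt_aligned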

Step 3: MMD Fine-tuning

Train a residual network to minimize Maximum Mean Discrepancy between modalities, leveraging the full unpaired dataset for non-linear refinement.
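One plausible reading of this step, sketched in PyTorch: a small residual MLP maps one modality's CCA-aligned embedding so that its distribution matches the other's under a multi-bandwidth RBF MMD loss, trained on unpaired batches. The architecture, bandwidths, and training loop are assumptions, not the authors' exact configuration.

# MMD fine-tuning sketch: a residual mapper trained on unpaired batches.
import torch
import torch.nn as nn

def mmd_rbf(x, y, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Multi-bandwidth RBF estimate of MMD^2 between two batches (biased V-statistic)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return sum(torch.exp(-d2 / (2 * bw ** 2)) for bw in bandwidths)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

class ResidualMapper(nn.Module):
    """x + f(x): a small residual network for non-linear refinement."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim))
    def forward(self, x):
        return x + self.f(x)

def finetune(img_loader, txt_loader, dim, epochs=10, lr=1e-3):
    """Illustrative loop: align distributions of the two aligned embeddings."""
    net = ResidualMapper(dim)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for z_img, z_txt in zip(img_loader, txt_loader):   # no pairing assumed
            loss = mmd_rbf(net(z_img), z_txt)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net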

Key Insight: Universal Embeddings

Figure 1: Empirical demonstration of universality

(a) Corresponding random walks on the image and text graphs are significantly more similar to each other than non-matching walks. (b) Paired points are consistently closer in the aligned spectral embeddings, indicating that the independently learned embeddings capture analogous structure across modalities.

We demonstrate that random walks defined on different modality-specific representations exhibit remarkable similarity when those representations capture semantics well. This similarity enables us to learn universal embeddings: representation spaces where corresponding instances from different modalities naturally align, even when trained almost exclusively on unpaired data.
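As a rough diagnostic of this claim (not the paper's exact protocol), one can build a random-walk matrix per modality on a small paired evaluation set, where rows and columns are indexed consistently across modalities, and check that matching walk distributions are far closer than mismatched ones.

# Diagnostic sketch: compare t-step walk distributions across modalities.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def random_walk_matrix(features, n_neighbors=15):
    w = kneighbors_graph(features, n_neighbors, mode="connectivity")
    w = ((w + w.T) > 0).astype(float)                 # symmetric adjacency
    deg = np.asarray(w.sum(axis=1)).ravel()
    return np.asarray(w.todense()) / deg[:, None]     # P = D^{-1} W

def walk_distance_gap(img_feats, txt_feats, steps=3, seed=0):
    """Mean distance between matching vs. mismatched t-step walk distributions."""
    P_img = np.linalg.matrix_power(random_walk_matrix(img_feats), steps)
    P_txt = np.linalg.matrix_power(random_walk_matrix(txt_feats), steps)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(P_img))                # random mismatching
    matching = np.linalg.norm(P_img - P_txt, axis=1).mean()
    mismatched = np.linalg.norm(P_img - P_txt[perm], axis=1).mean()
    return matching, mismatched                        # expect matching << mismatched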

Diverse Applications

Figure 4: Application Examples

(a) Cross-modal retrieval captures semantic relationships. (b) Text-to-image generation with minimal text supervision. (c) Semantic arithmetic operations in the universal space.

Cross-Modal Retrieval

Superior retrieval performance across vision-language, vision-vision, and tabular datasets with minimal paired supervision.
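In the shared space, retrieval reduces to nearest-neighbor search. A minimal sketch, assuming both modalities are already embedded in the universal space:

# Text-to-image retrieval by cosine similarity in the shared space.
import numpy as np

def retrieve(query_txt_emb, img_embs, top_k=5):
    q = query_txt_emb / np.linalg.norm(query_txt_emb)
    g = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    scores = g @ q                        # cosine similarity to the query
    return np.argsort(-scores)[:top_k]    # indices of the top-k images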

Text-to-Image Generation

Generate images from text queries using a GAN trained only on images, enabled by universal embeddings.

Semantic Arithmetic

Intuitive vector operations like "man" + "with sunglasses" produce semantically meaningful results.
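A minimal sketch of such a query, assuming a hypothetical embed_text function that maps captions into the shared space:

# Semantic arithmetic: compose text embeddings, then retrieve nearest images.
import numpy as np

def arithmetic_query(embed_text, img_embs, top_k=1):
    q = embed_text("man") + embed_text("with sunglasses")   # composed query
    q /= np.linalg.norm(q)
    g = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:top_k]   # nearest images to the composed query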

Zero-Shot Classification

88% accuracy on a small set of ImageNet classes when trained on Flickr30k with just 500 pairs.

Cross-Domain Transfer

Effective knowledge transfer between domains with minimal correspondence (e.g., Office31 dataset).

Unpaired Data is Key

Figure 5: Effect of Paired vs Unpaired Data

(a) Contrastive learning needs an order of magnitude more pairs to match SUE trained with only 100 pairs. (b) SUE improves significantly with additional unpaired data. (c) SUE relies minimally on paired data beyond a small threshold.

Our experiments reveal that SUE derives its strength primarily from unpaired samples. Adding more unpaired data consistently improves performance, while adding more paired samples beyond a small threshold (∼500) provides diminishing returns. This demonstrates the untapped potential of unpaired data in multimodal learning.

Citation

@inproceedings{yacobi2025sue,
  title={Learning Shared Representations from Unpaired Data},
  author={Yacobi, Amitai and Ben-Ari, Nir and Talmon, Ronen and Shaham, Uri},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}