Telling Left from Right: Identifying Geometry-Aware
Semantic Correspondence

Junyi Zhang1 Charles Herrmann2 Junhwa Hur2 Eric Chen3
Varun Jampani4 Deqing Sun2 * Ming-Hsuan Yang2,5 *
1 Shanghai Jiao Tong University 2 Google Research 3 UIUC 4 Stability AI 5 UC Merced (*: equal contribution)
CVPR 2024


(a) SD+DINO [1] struggles at “telling left from right” (red solid lines).
(b) Our method significantly improves semantic correspondence.
(c) Qualitative comparison with state-of-the-art methods in cases with extreme viewpoint variations.

In Fig. (a), we demonstrate that the state-of-the-art method, SD+DINO, fails at matching keypoints with geometric ambiguity, i.e., at “telling left from right”. In Fig. (b), we show the performance gap of state-of-the-art methods between our proposed geometry-aware subset (Geo.) and the standard set (Std.). Note that the geometry-aware subset accounts for 59.6% and 45.7% of the total keypoint pairs on SPair-71k and AP-10K, respectively. Our method significantly improves overall semantic correspondence and narrows the gap between the two sets. In Fig. (c), our method successfully establishes geometrically correct semantic correspondence even in cases of extreme viewpoint variation, while both versions of SD+DINO struggle with geometric ambiguity.


Abstract

While pre-trained large-scale vision models have shown significant promise for semantic correspondence, their features often struggle to grasp the geometry and orientation of instances. This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing. We show that incorporating this information can markedly enhance semantic correspondence performance with simple but effective solutions in both zero-shot and supervised settings. We also construct a new challenging benchmark for semantic correspondence built from an existing animal pose estimation dataset, for both pre-training and validating models. Our method achieves a PCK@0.10 score of 65.4 (zero-shot) and 85.6 (supervised) on the challenging SPair-71k dataset, outperforming the state-of-the-art by 5.5p and 11.0p absolute gains, respectively. Our code and datasets are publicly available.

Geometric Awareness of SD/DINO Features

We show how these features perform at matching keypoints with geometric ambiguity by constructing a geometry-aware semantic correspondence subset.

(a) Semantically-similar keypoint subgroups in images.
(b) Annotations of geo-aware semantic correspondence (yellow).

To construct a geometry-aware semantic correspondence subset, we first cluster keypoints into semantically-similar subgroups, e.g., the four paws and two ears of the cat, as shown in Fig. (a). Then, we define a keypoint pair as a geometry-aware semantic correspondence if there are other keypoints in the same subgroup that are visible in the target image, e.g., the different paws of the cat, as shown in Fig. (b). These cases are especially challenging for existing methods, as they require a proper understanding of the geometry to establish correct matches rather than matching to the semantically-similar keypoint(s).
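The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the keypoint ids and the `CAT_SUBGROUPS` table are hypothetical, and the sketch assumes a keypoint pair exists only when the keypoint is visible in both images.

```python
from typing import List, Set

# Hypothetical subgroups for a "cat" category: keypoint ids that share the
# same semantics (ids are illustrative, not the SPair-71k annotation ids).
CAT_SUBGROUPS: List[Set[int]] = [
    {0, 1},        # left ear, right ear
    {2, 3},        # left eye, right eye
    {4, 5, 6, 7},  # four paws
]


def geo_aware_keypoints(
    source_visible: Set[int],
    target_visible: Set[int],
    subgroups: List[Set[int]],
) -> Set[int]:
    """Return ids of matched keypoints that count as geometry-aware.

    A keypoint visible in both images is geometry-aware if the target image
    also shows a different keypoint from the same semantic subgroup.
    """
    matched = source_visible & target_visible
    geo_aware: Set[int] = set()
    for kp in matched:
        for group in subgroups:
            # Another member of the same subgroup is visible in the target.
            if kp in group and (group - {kp}) & target_visible:
                geo_aware.add(kp)
                break
    return geo_aware
```

For example, if the target shows both ears but only one eye and one paw, only the ear pair is flagged as geometry-aware, because it is the only case where a semantically similar distractor is present in the target.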

(c) Per-category performance on geo-aware set.
(d) Sensitivity to pose variation (higher value = more sensitivity).

In Fig. (c), we show the per-category evaluation of state-of-the-art methods on the SPair-71k geometry-aware subset (Geo.) and the standard set. Although the geometry-aware subset accounts for about 60% of the total matching keypoints, we observe a substantial performance gap between the two sets for all the methods. In Fig. (d), we further evaluate how sensitive the performance on the geometry-aware subset is to pose variation between the paired images. The y-axis displays the normalized difference between the best and worst performances among 5 different azimuth-variation subsets. As can be observed, the geometry-aware subset is more sensitive to pose variation than the standard set across categories and methods, indicating that pose variation particularly affects the performance on geometry-aware semantic correspondence.

(e) Rough pose prediction with feature distance.
(f) Zero-shot rough pose prediction result with instance matching distance (IMD). We manually annotated 100 cat images from SPair-71k with rough pose labels {left, right, front, back} and report the accuracy of predicting left or right (L/R), front or back (F/B), either of the two cases (L/R or F/B), and one of the four directions (L/R/F/B).

We further analyze if deep features are aware of high-level pose (or viewpoint) information of an instance in an image. In Fig. (e), we show how we explore this pose awareness by a template-matching approach in the feature space. The performance shown in Fig. (f) suggests that the deep features are indeed aware of the global pose information. Please refer to the paper for more details.
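As a rough illustration of template matching in feature space, the sketch below labels a query instance with the pose of its closest template. It is a sketch under stated assumptions, not the paper's implementation: the instance matching distance is taken here to be the mean cosine distance from each source feature to its nearest target feature, and `predict_rough_pose` and the template dictionary are hypothetical names.

```python
import numpy as np


def l2_normalize(feats: np.ndarray) -> np.ndarray:
    """Scale each feature vector (last axis) to unit length."""
    return feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)


def instance_matching_distance(src: np.ndarray, tgt: np.ndarray) -> float:
    """Mean cosine distance from each source feature to its nearest target feature.

    src: (N, C) features sampled inside the source instance mask.
    tgt: (M, C) features of the target instance or image.
    """
    src, tgt = l2_normalize(src), l2_normalize(tgt)
    cos_dist = 1.0 - src @ tgt.T  # (N, M) pairwise cosine distances
    return float(cos_dist.min(axis=1).mean())


def predict_rough_pose(query: np.ndarray, templates: dict) -> str:
    """Return the pose label of the template closest to the query features."""
    return min(
        templates,
        key=lambda label: instance_matching_distance(query, templates[label]),
    )
```

Here `templates` would map each rough pose label (left, right, front, back) to the features of an instance with that known pose.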

Improving Geometry-Aware Semantic Correspondence

We propose several techniques that improve geometric awareness during matching, in both zero-shot and supervised settings.

(a) Adaptive pose alignment with feature space distance.
(b) Qualitative results of the adaptive alignment.

We first introduce an adaptive pose alignment strategy that runs at test time and requires no training. It is motivated by the observation that pose variation strongly affects the performance of geometry-aware semantic correspondence. As illustrated in Fig. (a), the strategy uses the global pose information inherent in deep features to align the pose of the image pair. Fig. (b) shows that this simple strategy can drastically improve correspondence accuracy in an unsupervised manner.
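One way to picture test-time pose alignment is to compare a few pose-altered versions of the source image against the target in feature space and keep the closest one. The sketch below assumes a candidate set of just the original source and its horizontal flip, and a nearest-neighbor cosine distance; `align_source_pose` and `extract_features` are hypothetical names, and the actual candidate set and distance are described in the paper.

```python
import numpy as np
from typing import Callable, Dict, Tuple


def instance_matching_distance(src: np.ndarray, tgt: np.ndarray) -> float:
    """Mean cosine distance from each source feature to its nearest target feature."""
    src = src / (np.linalg.norm(src, axis=-1, keepdims=True) + 1e-8)
    tgt = tgt / (np.linalg.norm(tgt, axis=-1, keepdims=True) + 1e-8)
    return float((1.0 - src @ tgt.T).min(axis=1).mean())


def align_source_pose(
    source: np.ndarray,
    target: np.ndarray,
    extract_features: Callable[[np.ndarray], np.ndarray],
) -> Tuple[str, np.ndarray]:
    """Return the name and image of the source variant closest to the target.

    extract_features is a hypothetical callable that maps an image array to
    an (N, C) array of per-location features.
    """
    # Assumed candidate set: the original source and its horizontal flip.
    candidates: Dict[str, np.ndarray] = {
        "original": source,
        "flipped": source[:, ::-1],
    }
    tgt_feats = extract_features(target)
    distances = {
        name: instance_matching_distance(extract_features(img), tgt_feats)
        for name, img in candidates.items()
    }
    best = min(distances, key=distances.get)
    return best, candidates[best]
```

If the flipped variant is selected, query keypoints must be mapped into the flipped source (x becomes W - 1 - x) before matching against the target.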

(c) (Left) previous supervised methods [1,3] with a sparse training objective. (Right) an overview of our supervised method.

We further introduce a post-processing module with various training strategies that can improve the geometry awareness of deep features. Notably, this module adds only 0.32% extra runtime while improving the performance on standard benchmarks by 15%.

Benchmarking AP-10K Dataset for Semantic Correspondence

To facilitate validating and training, we construct a new, large-scale, and challenging semantic correspondence benchmark built from an existing animal pose estimation dataset, AP-10K.

(a) Sample image pairs from AP-10K semantic correspondence benchmark.

AP-10K is an in-the-wild animal pose estimation dataset consisting of 10,015 images across 23 families and 54 species. After manual filtering and processing, we construct a benchmark with 261k training, 17k validation, and 36k testing image pairs. The validation and testing image pairs span three settings: the main intra-species set, the cross-species set, and the cross-family set. The benchmark is 5 times larger than the largest existing benchmark, SPair-71k, and is the first to evaluate cross-class semantic correspondence. Please refer to the supplementary material for more details.

Experimental Results

We show the quantitative and qualitative results of our method on SPair-71k, AP-10K, and PF-Pascal datasets.

(a) Quantitative comparison across different datasets and PCK levels.

Both our zero-shot and supervised methods outperform all previous methods significantly. In particular, our supervised methods achieve notable gains at the stricter thresholds (e.g., PCK@0.05 and PCK@0.01), especially considering that SD+DINO uses the same raw feature maps as our method. Although the methods are trained only on the AP-10K intra-species set, their robust performance on the cross-species and cross-family test sets showcases the generalizability of our approach. It is also noteworthy that pretraining on AP-10K brings a gain of 2.7p in PCK@0.10 on SPair-71k, underscoring the untapped potential of pose datasets in this domain.
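For readers unfamiliar with the metric, PCK@α counts a predicted keypoint as correct when it lies within α times a reference size of the ground truth. The sketch below assumes the reference size is the larger side of the object bounding box, which is the common convention on SPair-71k; some benchmarks normalize by image size instead, and the function name `pck` is only illustrative.

```python
import numpy as np
from typing import Tuple


def pck(
    pred: np.ndarray,
    gt: np.ndarray,
    bbox_hw: Tuple[float, float],
    alpha: float = 0.10,
) -> float:
    """Fraction of predicted keypoints within alpha * max(h, w) of the ground truth.

    pred, gt: (K, 2) arrays of keypoint coordinates in pixels.
    bbox_hw: (height, width) of the object bounding box (assumed reference size).
    """
    threshold = alpha * max(bbox_hw)
    errors = np.linalg.norm(pred - gt, axis=1)  # per-keypoint pixel error
    return float((errors <= threshold).mean())
```

Lowering α from 0.10 to 0.01 shrinks the tolerance by a factor of ten, which is why the stricter thresholds reward precise localization rather than roughly correct matches.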

(b) Visualization of similarity map. The query and predicted points are red, and the keypoint supervision of the “chair” category is blue.

We further investigate cases where the query point lacks direct supervision and meaningful context. SD+DINO highlights the regions with similar appearance (wooden materials) but fails to locate the chair; SD+DINO (S) generates noisy similarity maps when the query point is out of supervision, due to the sparse training objective; our method locates the points correctly, both semantically and geometrically, even when the query point lacks direct supervision and meaningful context. Notably, all methods utilize the same raw feature maps, and our approach employs the same feature post-processor as SD+DINO (S). Despite this, the notable improvements in our method further underscore the efficacy of our design.
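A similarity map of this kind is typically obtained by comparing the feature at the query point with the feature at every target location. The following is a minimal sketch assuming cosine similarity over dense feature maps and an argmax prediction; the function name and array layout are illustrative and may differ from the released code.

```python
import numpy as np
from typing import Tuple


def similarity_map(
    src_feats: np.ndarray,
    tgt_feats: np.ndarray,
    query_yx: Tuple[int, int],
) -> Tuple[np.ndarray, Tuple[int, int]]:
    """Cosine similarity between one source feature and every target location.

    src_feats, tgt_feats: (H, W, C) feature maps; query_yx: (row, col) in the source.
    Returns the (H, W) similarity map and the (row, col) of its maximum.
    """
    query = src_feats[query_yx]  # (C,) feature at the query point
    query = query / (np.linalg.norm(query) + 1e-8)
    tgt = tgt_feats / (np.linalg.norm(tgt_feats, axis=-1, keepdims=True) + 1e-8)
    sim = tgt @ query  # (H, W) cosine similarities
    row, col = np.unravel_index(int(np.argmax(sim)), sim.shape)
    return sim, (int(row), int(col))
```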

Qualitative Comparison on SPair-71k

From left to right: SD+DINO, SD+DINO supervised version (S), and Ours.
Our method particularly shines in cases with large viewpoint variations.


Qualitative Comparison on AP-10K

We show the comparison on the intra-species subset below:

We show the comparison on the cross-species subset below:

We show the comparison on the cross-family subset below:

Related Work

[1] A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

[2] Emergent Correspondence from Image Diffusion

[3] Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence

[Concurrent] Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps mitigates geometric ambiguity in an unsupervised manner by introducing a spherical constraint on the feature space.

BibTeX

@article{zhang2023telling,
  title={Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence},
  author={Zhang, Junyi and Herrmann, Charles and Hur, Junhwa and Chen, Eric and Jampani, Varun and Sun, Deqing and Yang, Ming-Hsuan},
  journal={arXiv preprint arXiv:2311.17034},
  year={2023}
}

Acknowledgements: We borrow this template from SD+DINO, which is originally from DreamBooth.