GRAFT

Abstract

We introduce a method to train vision-language models for remote-sensing images without using any textual annotations. Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language. Specifically, we train an image encoder for remote sensing images to align with the image encoder of CLIP using a large amount of paired internet and satellite images. Our unsupervised approach enables the training of a first-of-its-kind large-scale vision language model (VLM) for remote sensing images at two different resolutions. We show that these VLMs enable zero-shot, open-vocabulary image classification, retrieval, segmentation and visual question answering for satellite images. On each of these tasks, our VLM trained without textual annotations outperforms existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation.
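The alignment idea above can be illustrated with a small sketch. The abstract does not spell out the training loss, so the code below assumes a generic one-to-one InfoNCE-style contrastive objective between satellite embeddings and frozen CLIP embeddings of co-located ground images. The function names (`alignment_loss`, `zero_shot_classify`), the temperature value, and the toy 2-D embeddings are all hypothetical and are not taken from the released code.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def alignment_loss(sat_embs, ground_embs, temperature=0.07):
    """Assumed InfoNCE-style loss: sat_embs[i] should match ground_embs[i].

    sat_embs come from the trainable satellite encoder; ground_embs stand in
    for outputs of the frozen CLIP image encoder on co-located ground photos.
    """
    sat = [l2_normalize(v) for v in sat_embs]
    gnd = [l2_normalize(v) for v in ground_embs]
    total = 0.0
    for i, s in enumerate(sat):
        logits = [dot(s, g) / temperature for g in gnd]
        # Numerically stable log-sum-exp over all ground images in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the paired ground image as the positive.
        total += log_denom - logits[i]
    return total / len(sat)

def zero_shot_classify(sat_emb, text_embs):
    """Pick the label whose (CLIP text) embedding is most similar to sat_emb."""
    s = l2_normalize(sat_emb)
    scores = {label: dot(s, l2_normalize(t)) for label, t in text_embs.items()}
    return max(scores, key=scores.get)

# Toy example: two satellite embeddings and their paired ground embeddings.
sat = [[1.0, 0.0], [0.0, 1.0]]
paired = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(alignment_loss(sat, paired) < alignment_loss(sat, shuffled))
print(zero_shot_classify([0.9, 0.1], {"forest": [1.0, 0.0], "water": [0.0, 1.0]}))
```

Because the satellite encoder is trained to land in the same embedding space as CLIP's image encoder, CLIP's text encoder can be reused unchanged at inference time, which is what makes the open-vocabulary step in `zero_shot_classify` possible without any text annotations for satellite images.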

Paper

paper arXiv supplementary

Utkarsh Mall, Cheng Perng Phoo, Meilin Kelsey Liu, Carl Vondrick, Bharath Hariharan, Kavita Bala. "Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment". In ICLR, 2024.

Bibtex


@inproceedings{graft-24,
  title={Remote sensing vision-language foundation models without annotations via ground remote alignment},
  author={Mall, Utkarsh and Phoo, Cheng Perng and Liu, Meilin Kelsey and Vondrick, Carl and Hariharan, Bharath and Bala, Kavita},
  booktitle={ICLR},
  year={2024}
}

Results

Retrieval using GRAFT

Open-Vocabulary Text-to-Image Segmentation

Code and Data

Our code is available in this GitHub repo.

To access the pre-trained model weights and the dataset, please sign the agreement form. After you sign up, you will receive download links for the model checkpoints and the dataset.

Also try this Colab notebook to quickly run inference on test cases and on your own areas of interest!

Acknowledgments

We thank Aditya Chetan, Yihong Sun, and Luming Tang for their useful feedback.

This research is based upon work supported in part by the Office of the Director of National Intelligence (Intelligence Advanced Research Projects Activity) via 2021-20111000006, the NSF STC for Learning the Earth with Artificial Intelligence and Physics, and the U.S. DARPA ECOLE Program No. HR00112390060. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, DARPA, or the US Government. The US Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Remote Sensing Vision-Language Foundation Models
without Annotations via Ground Remote Alignment

Utkarsh Mall*¹,², Cheng Perng Phoo*¹, Kelsey Liu¹, Carl Vondrick², Bharath Hariharan¹, Kavita Bala¹

¹Cornell University ²Columbia University

* Equal Contribution

In ICLR 2024
