Remote Sensing Vision-Language Foundation Models
without Annotations via Ground Remote Alignment

Utkarsh Mall*1,2, Cheng Perng Phoo*1, Kelsey Liu1, Carl Vondrick2, Bharath Hariharan1, Kavita Bala1

1Cornell University        2Columbia University

* Equal Contribution

In ICLR 2024


We introduce a method to train vision-language models for remote-sensing images without using any textual annotations. Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language. Specifically, we train an image encoder for remote-sensing images to align with the image encoder of CLIP using a large corpus of paired internet and satellite images. Our unsupervised approach enables the training of a first-of-its-kind large-scale vision-language model (VLM) for remote-sensing images at two different resolutions. We show that these VLMs enable zero-shot, open-vocabulary image classification, retrieval, segmentation, and visual question answering for satellite images. On each of these tasks, our VLM trained without textual annotations outperforms existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation.
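The alignment described above can be sketched as a contrastive (InfoNCE-style) objective: each satellite image embedding is pulled toward the frozen CLIP embedding of its co-located ground image and pushed away from the ground embeddings of other locations in the batch. The sketch below is a minimal, dependency-free illustration of that loss, not the released training code; the function names and the choice of temperature are our own assumptions.

```python
import math

def normalize(v):
    # L2-normalize an embedding, as CLIP-style models do before comparison.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_alignment_loss(sat_embs, ground_embs, temperature=0.07):
    """InfoNCE-style alignment loss (illustrative sketch).

    sat_embs:    embeddings from the trainable satellite-image encoder.
    ground_embs: embeddings of co-located ground images from the frozen
                 CLIP image encoder; pair i is the positive for satellite i.
    """
    sat = [normalize(v) for v in sat_embs]
    gnd = [normalize(v) for v in ground_embs]
    loss = 0.0
    for i, s in enumerate(sat):
        # Cosine similarities to every ground embedding in the batch.
        logits = [sum(a * b for a, b in zip(s, g)) / temperature for g in gnd]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy against the true (diagonal) pair.
        loss += -(logits[i] - log_sum)
    return loss / len(sat)
```

Because the ground-image embeddings come from CLIP, the satellite encoder trained this way lands in CLIP's shared image-text space, which is what makes zero-shot querying with text possible downstream.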


paper     arXiv     supplementary

Utkarsh Mall, Cheng Perng Phoo, Meilin Kelsey Liu, Carl Vondrick, Bharath Hariharan, Kavita Bala. "Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment". In ICLR, 2024.

@inproceedings{mall2024remote,
 title={Remote sensing vision-language foundation models without annotations via ground remote alignment},
 author={Mall, Utkarsh and Phoo, Cheng Perng and Liu, Meilin Kelsey and Vondrick, Carl and Hariharan, Bharath and Bala, Kavita},
 booktitle={International Conference on Learning Representations (ICLR)},
 year={2024}
}


Retrieval using GRAFT


Open-Vocabulary Text-to-Image Segmentation


Code and Data

Our code can be found in this GitHub repository.

For access to the pre-trained model weights and dataset, please sign the agreement form. Signing up will provide download links for the model checkpoints and dataset.

Also try this Colab notebook to quickly run inference on test cases and on your own areas of interest!


We thank Aditya Chetan, Yihong Sun, and Luming Tang for their useful feedback.

This research is based upon work supported in part by the Office of the Director of National Intelligence (Intelligence Advanced Research Projects Activity) via 2021-20111000006, the NSF STC for Learning the Earth with Artificial Intelligence and Physics, and the U.S. DARPA ECOLE Program No. HR00112390060. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.