Investigating Large Vision Model Training Challenges on Satellite Datasets

Nov 09, 2023


Contrastive learning methods that bridge textual descriptions and images, such as Contrastive Language-Image Pre-training (CLIP), have demonstrated remarkable advancements. These foundational models have shown exceptional performance in tasks related to zero-shot image classification, as evidenced by their substantial enhancement of zero-shot ImageNet accuracy from the prior state-of-the-art of 12\% to an impressive 76\%. However, the exposure of these models to satellite images during training has been limited, resulting in suboptimal performance when dealing with geospatial data. This limitation raises a pivotal question: Can these foundational models, which have demonstrated potential across multiple domains, be trained on geospatial imagery out-of-box? To answer this question, we perform a study on training CLIP on diverse geospatial datasets. Within our research, we delve into unique challenges in this context and discuss the strategies we employ to address these challenges effectively. We demonstrate that handling resolution is crucial when training CLIP like models on a large multi-resolution dataset.

InGARSS 2023

Contributed by

Hitesh Jain , Sagar Verma , Siddharth Gupta

Related Research

Post Wildfire Burnt-up Detection using Siamese UNet

In this article, we present an approach for detecting burnt area due to wild fire in Sentinel-2 images by leveraging the power of Siamese neural networks. By employing a Siamese network, we are able to efficiently encode the feature extraction process for pairs of images. This is achieved by utilizing two branches within the Siamese network, which capture and combine information at different resolutions to make predictions. The weights are shared between these two branches in siamese networks. This design allows to effectively analyze the changes between two remote sensing images, enabling precise identification of areas impacted by forest wildfires in the state of California as part of ChaBuD challenge thereby assisting local authorities in effectively monitoring the impacted regions and facilitating the restoration process. We experimented with various model architectures to train ChaBuD dataset and carefully evaluated the performance. Through rigorous testing and analysis, we have achieved promising results, ultimately obtaining a final private score (IoU) of 0.7495 on the hidden test dataset. The code is available at We also deploy the final model as a point solution for anyone to use at

09 November 2023

Detecting Urban Changes with Recurrent Neural Networks from Multitemporal Sentinel-2 Data

The advent of multitemporal high resolution data, like the Copernicus Sentinel-2, has enhanced significantly the potential of monitoring the earth's surface and environmental dynamics. In this paper, we present a novel deep learning framework for urban change detection which combines state-of-the-art fully convolutional networks (similar to U-Net) for feature representation and powerful recurrent networks (such as LSTMs) for temporal modeling. We report our results on the recently publicly available bi-temporal Onera Satellite Change Detection (OSCD) Sentinel-2 dataset, enhancing the temporal information with additional images of the same region on different dates. Moreover, we evaluate the performance of the recurrent networks as well as the use of the additional dates on the unseen test-set using an ensemble cross-validation strategy. All the developed models during the validation phase have scored an overall accuracy of more than 95%, while the use of LSTMs and further temporal information, boost the F1 rate of the change class by an additional 1.5%.

09 November 2023