Zero-shot Performance of Foundational Models on Satellite Imagery
Jul 14, 2023 10 min

Introduction
Recent progress in self-supervision shows that pre-training large neural networks on vast amounts of unlabeled data can improve generalization on downstream tasks. Such models, recently coined foundation models, have been transformational for computer vision and natural language processing. In this work, we analyze the zero-shot performance of these models on standard remote-sensing image classification datasets to gather empirical evidence on whether they have seen satellite imagery during pre-training.
Background
Zero-shot learning is a scenario in which, at inference time, a model is expected to accurately predict labels for samples from classes not observed during training. It measures a model's ability to comprehend visual concepts well enough to generalize to unseen data.
Language-vision models pre-trained on large datasets (also called foundation models) have achieved tremendous success in zero-shot transfer to downstream tasks. The most popular foundation models in the public domain are:
- CLIP: jointly trains an image encoder and a text encoder with a contrastive loss that maximizes the similarity between matching image and text representations (see the zero-shot sketch after this list)

- ALIGN: relies on the sheer scale of its training dataset to offset label noise, aligning image and text embeddings with contrastive losses

- BLIP: reduces noise in its web-scraped captions with a bootstrapping mechanism called CapFilt and uses the cleaned data to train its multimodal mixture of encoder-decoder architecture

- SAM: iteratively applies an efficient segmentation model in the data-collection loop to curate a huge segmentation dataset of 11M images and over 1B masks

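To make the zero-shot mechanics concrete, below is a minimal sketch of CLIP-style zero-shot classification using the openai/CLIP package. The label list, prompt template, and image path are illustrative placeholders, not the exact configuration used in our experiments.

```python
# Minimal sketch of CLIP-style zero-shot classification.
# Assumes the openai/CLIP package (pip install git+https://github.com/openai/CLIP.git).
# The labels, prompt template, and image path are illustrative placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

class_names = ["forest", "river", "highway", "industrial area"]      # placeholder labels
prompts = [f"an image of {name}" for name in class_names]

image = preprocess(Image.open("patch.png")).unsqueeze(0).to(device)  # placeholder image
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and every prompt, softmaxed into class scores.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs.squeeze(0).tolist())))
```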
In this work, we intend to analyze the zero-shot performance of state-of-the-art foundation models on remote-sensing data to establish whether these models have seen satellite imagery during their pre-training.
Experiments
We leverage the publicly available variants of two foundation models, BLIP and CLIP, and evaluate them on well-established remote-sensing benchmarks, EuroSAT and BigEarthNet-S2, to produce zero-shot results under two settings:
- Standard Setting: In this configuration, we apply generic prompts such as "an image of" and "a view of" to the classification labels to generate text representations.
- Context-based Setting: In this mode, we use domain-specific prompts, for instance "an aerial view of" and "a satellite view of", on the classification labels.
The rationale behind these two settings is to investigate whether adding geospatial context enhances the zero-shot performance of the foundation models. Any observed improvement would be a compelling sign that these models have been exposed to such imagery before.
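As a rough illustration of the two settings, the sketch below builds class-level text embeddings from either the standard or the context-based templates. The templates and label subset are examples, and averaging over templates (prompt ensembling) is an assumed step rather than our exact recipe.

```python
# Sketch of the Standard vs Context-based prompt settings.
# Templates and labels are illustrative; averaging over templates (prompt
# ensembling) is an assumption, not necessarily the exact recipe used here.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

STANDARD_TEMPLATES = ["an image of {}", "a view of {}"]
CONTEXT_TEMPLATES = ["an aerial view of {}", "a satellite view of {}"]

def class_text_features(class_names, templates):
    """Return one L2-normalised text embedding per class, averaged over templates."""
    feats = []
    with torch.no_grad():
        for name in class_names:
            tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
            emb = model.encode_text(tokens)
            emb = emb / emb.norm(dim=-1, keepdim=True)
            feats.append(emb.mean(dim=0))
    feats = torch.stack(feats)
    return feats / feats.norm(dim=-1, keepdim=True)

# Usage: build both prompt sets for the same (illustrative) EuroSAT label subset.
labels = ["forest", "river", "highway", "industrial"]
standard_feats = class_text_features(labels, STANDARD_TEMPLATES)
context_feats = class_text_features(labels, CONTEXT_TEMPLATES)
```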
Results
We tabulate the results of our experiments below:
| Model | Backbone | EuroSAT (Standard) | EuroSAT (Context) | BigEarthNet-S2 (Standard) | BigEarthNet-S2 (Context) |
| --- | --- | --- | --- | --- | --- |
| BLIP | ViT-B/16 | 36.87 | 42.35 | 86.97 | 84.88 |
| BLIP | ViT-B/16, CapFilt-L | 38.55 | 34.81 | 87.31 | 86.41 |
| BLIP | ViT-B/16, COCO-Fin | 38.87 | 41.20 | 89.69 | 84.34 |
| BLIP | ViT-B/16, Flickr-Fin | 42.67 | 46.20 | 88.47 | 82.30 |
| BLIP | ViT-L/16 | 45.78 | 45.06 | 81.05 | 83.74 |
| BLIP | ViT-L/16, COCO-Fin | 48.11 | 52.23 | 86.21 | 77.43 |
| BLIP | ViT-L/16, Flickr-Fin | 42.35 | 50.42 | 87.25 | 77.03 |
| CLIP | ResNet-50 | 25.31 | 28.03 | 6.82 | 6.80 |
| CLIP | ResNet-50x4 | 22.04 | 28.79 | 6.82 | 6.76 |
| CLIP | ResNet-50x16 | 43.13 | 41.74 | 6.78 | 6.71 |
| CLIP | ResNet-50x64 | 35.86 | 17.20 | 6.80 | 6.76 |
| CLIP | ResNet-101 | 26.74 | 23.96 | 6.81 | 6.82 |
| CLIP | ViT-B/16 | 38.86 | 41.02 | 6.82 | 6.84 |
| CLIP | ViT-B/32 | 32.67 | 33.58 | 6.85 | 6.82 |
| CLIP | ViT-L/14 | 52.43 | 50.59 | 6.82 | 6.83 |
| CLIP | ViT-L/14@336px | 51.05 | 45.40 | 6.82 | 6.82 |
Based on our results, the following key observations can be made:
- CLIP shows near-random performance on BigEarthNet-S2, owing to large image-encoder activations for (almost) all classes, which lead to many false positives (see the scoring sketch after this list).
- Fine-tuned BLIP models achieve better zero-shot performance on the EuroSAT and BigEarthNet benchmarks than the standard variants. Based on this, it can be reasonably concluded that fine-tuning on standard image-text benchmarks such as COCO and Flickr improves zero-shot transfer to remote-sensing data.
- Zero-shot performance on EuroSAT improves with the addition of remote-sensing context for smaller CLIP variants such as ResNet-50, ResNet-101, and ViT-B/32, and degrades for larger architectures such as ViT-L/14 and the EfficientNet-style scaled versions of ResNet-50. No such clear pattern is visible in CLIP's performance when context is added for the BigEarthNet labels.
- Adding geospatial priors leads to a marked improvement in zero-shot performance for most of the BLIP variants on EuroSAT.
- On BigEarthNet-S2, the addition of context leads to a degradation in performance for most of the BLIP variants.
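To make the first observation concrete, the sketch below shows one plausible multi-label scoring scheme for BigEarthNet-S2: each class is scored independently by image-text similarity against a threshold. Both the threshold value and this scheme itself are illustrative assumptions rather than our exact evaluation protocol. When the image embedding sits at a similar distance from every label prompt, most classes clear the threshold and false positives dominate, which is consistent with CLIP's near-random scores on this benchmark.

```python
# Illustrative threshold-based multi-label scoring; the threshold value is a
# placeholder and this is not necessarily the exact protocol used above.
import torch

def multilabel_predict(image_features, class_text_feats, threshold=0.25):
    """Predict every class whose cosine similarity with the image exceeds a threshold.

    image_features:   (1, d) L2-normalised image embedding
    class_text_feats: (num_classes, d) L2-normalised class text embeddings
    """
    sims = (image_features @ class_text_feats.T).squeeze(0)   # cosine similarities
    return (sims > threshold).nonzero(as_tuple=True)[0].tolist()

# If the image encoder responds strongly to almost every land-cover prompt, the
# similarities bunch together above the threshold and most classes get predicted,
# i.e. the false-positive behaviour observed for CLIP on BigEarthNet-S2.
```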
Conclusion
In summary, our study presents an empirical examination of the zero-shot performance of pre-trained foundation models on standard remote-sensing datasets like EuroSAT and BigEarthNet-S2. Our findings reveal that fine-tuned BLIP variants outperform the standard versions on these benchmarks, and that incorporating geospatial context at inference time leads to mixed outcomes depending on the model and dataset.
References
[1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[2] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. V. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in ICML, 2021.
[3] J. Li, D. Li, C. Xiong, and S. C. H. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, 2022.
[4] L. Yuan, D. Chen, Y.-L. Chen, N. C. F. Codella, X. Dai, J. Gao, H. Hu, X. Huang, B. Li, C. Li, C. Liu, M. Liu, Z. Liu, Y. Lu, Y. Shi, L. Wang, J. Wang, B. Xiao, Z. Xiao, J. Yang, M. Zeng, L. Zhou, and P. Zhang, "Florence: A new foundation model for computer vision," ArXiv, vol. abs/2111.11432, 2021.
[5] A. Kirillov, E. Mintun, N. Ravi, et al., “Segment anything,” 2023.
[6] P. Helber, B. Bischke, A. R. Dengel, and D. Borth, “EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE JSTARS, vol. 12, pp. 2217–2226, 2017.
[7] G. Sumbul, M. Charfuelan, B. Demir, and V. Markl, “BigEarthNet: A large-scale benchmark archive for remote sensing image understanding,” IGARSS, pp. 5901–5904, 2019.
Tags: Foundation models, CLIP, BLIP, Language-vision pretraining, Satellite imagery, Geospatial analysis