Foundational models and satellite imagery
Aug 17, 2023 10 min

Introduction
Recent developments in large-scale text and vision models have revolutionized how downstream tasks are solved. These large-scale models are termed foundational models, and their versatility can be attributed to vast training data and high model capacity (a large number of tunable parameters). A prominent example is GPT, a family of large language models from OpenAI that took the world by storm with its chat-based application ChatGPT. In our recently published paper at IGARSS 2023 [1], we examine how such foundational models perform on one downstream task, image captioning with satellite imagery. The findings point towards subpar, near-random performance.
Experiments and Datasets
The authors test the zero-shot performance of the CLIP [2] and BLIP [3] vision-language models, along with their image-encoder-based variants, on the remote-sensing datasets EuroSAT and BigEarthNet-S2 [4, 5]. EuroSAT is a Land Use / Land Cover dataset with 27,000 images, and BigEarthNet-S2 is a large-scale multi-label dataset with 590,326 Sentinel-2 patches. More details of the experiments can be found in [1].
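To make the zero-shot protocol concrete, here is a minimal sketch of CLIP-style zero-shot classification: an image embedding is scored against one text embedding per class prompt via scaled cosine similarity, and the highest-scoring class wins. The random vectors below merely stand in for real CLIP features (which would require downloading a model); the class names are from EuroSAT.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """CLIP-style zero-shot scoring: softmax over scaled cosine similarities
    between one image embedding and one text embedding per class prompt."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (text_embs @ image_emb)  # scaled cosine similarities
    probs = np.exp(logits - logits.max())           # numerically stable softmax
    return probs / probs.sum()

# Toy example: random vectors stand in for real CLIP embeddings.
rng = np.random.default_rng(0)
classes = ["AnnualCrop", "Forest", "River", "SeaLake"]  # EuroSAT class names
text_embs = rng.normal(size=(len(classes), 512))
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)   # close to "Forest"
probs = zero_shot_classify(image_emb, text_embs)
print(classes[int(np.argmax(probs))])
```

In the real evaluation, the text embeddings come from prompts such as "a photo of a forest" passed through CLIP's text encoder; the near-random results in [1] suggest those embeddings align poorly with satellite-image features.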
Results and Conclusion

[Table: Results]
The above table shows the zero-shot performance of CLIP and BLIP with different backbone networks. ‘Standard’ refers to the out-of-the-box model, and ‘Context’ refers to the model fine-tuned on geospatial datasets. It is worth noting that, while the standard models perform poorly, adding geospatial context does not necessarily improve performance; the effect depends on the model and the dataset.
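One common, lightweight way to add such geospatial context is to keep the foundation model's encoder frozen and train a small classifier head on its embeddings (a linear probe). The sketch below illustrates that idea with synthetic embeddings standing in for frozen CLIP image features; it is an illustration of domain adaptation in general, not the paper's exact fine-tuning procedure.

```python
import numpy as np

def train_linear_probe(embs, labels, n_classes, lr=0.5, steps=200):
    """Train a softmax classifier on frozen embeddings via gradient descent
    on the cross-entropy loss (a 'linear probe')."""
    W = np.zeros((embs.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = embs @ W + b
        logits -= logits.max(axis=1, keepdims=True)     # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(embs)             # softmax cross-entropy gradient
        W -= lr * (embs.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

# Synthetic, well-separated clusters stand in for frozen image features.
rng = np.random.default_rng(1)
centers = rng.normal(size=(3, 64))
labels = rng.integers(0, 3, size=300)
embs = centers[labels] + 0.1 * rng.normal(size=(300, 64))
W, b = train_linear_probe(embs, labels, n_classes=3)
acc = float(np.mean(np.argmax(embs @ W + b, axis=1) == labels))
print(f"train accuracy: {acc:.2f}")
```

If the frozen embeddings already separate the geospatial classes, a probe like this succeeds cheaply; the mixed ‘Context’ results in the table suggest that for satellite imagery this separation cannot be taken for granted.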
References
- A. Panigrahi, S. Verma, M. Terris, and M. Vakalopoulou, “Have foundational models seen satellite images?,” in IGARSS, Pasadena, United States, Jul 2023. hal-04112634
- A. Radford, J. W. Kim, C. Hallacy, et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
- J. Li, D. Li, C. Xiong, and S. C. H. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, 2022.
- P. Helber, B. Bischke, A. R. Dengel, and D. Borth, “EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE JSTARS, vol. 12, pp. 2217–2226, 2019.
- G. Sumbul, M. Charfuelan, B. Demir, and V. Markl, “BigEarthNet: A large-scale benchmark archive for remote sensing image understanding,” IGARSS, pp. 5901–5904, 2019.
Tags: Foundational models, GIS, CLIP, BLIP, EuroSAT, BigEarthNet-S2, Fine Tuning