Title: Investigating Large Vision Model Training Challenges on Satellite Datasets
Authors: Jain, Hitesh; Verma, Sagar; Gupta, Siddharth
Date Issued: 2023-01-01
Type: Conference Paper (Conference Proceeding)
ISBN: 9798350325591
DOI: 10.1109/InGARSS59135.2023.10490436
Scopus ID: 2-s2.0-85191016468
URI: https://d8.irins.org/handle/IITG2025/29305
Keywords: neural networks | remote sensing | robustness

Abstract: Contrastive learning methods that bridge textual descriptions and images, such as Contrastive Language-Image Pre-training (CLIP), have demonstrated remarkable advances. These foundation models have shown exceptional performance on zero-shot image classification, raising zero-shot ImageNet accuracy from the prior state of the art of 12% to 76%. However, these models have had limited exposure to satellite images during training, resulting in suboptimal performance on geospatial data. This limitation raises a pivotal question: can these foundation models, which have demonstrated potential across multiple domains, be trained on geospatial imagery out of the box? To answer this question, we study training CLIP on diverse geospatial datasets. We examine the unique challenges that arise in this setting and discuss the strategies we employ to address them. We demonstrate that handling resolution is crucial when training CLIP-like models on a large multi-resolution dataset.
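
For reference, the symmetric contrastive objective that CLIP-style models optimize can be sketched as follows. This is an illustrative PyTorch snippet based on the published CLIP formulation, not code from this paper; the batch size, embedding dimension, and temperature value are arbitrary assumptions.

```python
# Minimal sketch of the CLIP-style symmetric contrastive (InfoNCE) loss.
# Illustrative only; not the authors' implementation.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature
    # Matching image/text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Example: a batch of 8 paired 512-dimensional embeddings (random placeholders).
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```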