TinyRS-R1: Compact Vision Language Model for Remote Sensing

2025-01-01
Remote sensing applications often rely on edge hardware that cannot host today's 7B-parameter vision-language models. This paper presents TinyRS, the first 2B-parameter VLM optimized for remote sensing, and TinyRS-R1, its reasoning-augmented variant. Built on Qwen2-VL-2B, TinyRS is trained via a four-stage pipeline: pre-training on million-scale satellite imagery, instruction tuning, fine-tuning with Chain-of-Thought (CoT) annotations from a new reasoning dataset, and GRPO-based alignment. TinyRS-R1 matches or surpasses recent 7B remote sensing models in classification, VQA, grounding, and open-ended question answering, while requiring roughly one third of the memory and latency. CoT reasoning improves grounding and scene understanding, while the base TinyRS excels at concise, low-latency VQA. TinyRS-R1 is the first domain-specialized small VLM with GRPO-aligned CoT reasoning for general-purpose remote sensing. The code, models, and caption datasets will be released.
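The abstract's mention of GRPO-based alignment refers to Group Relative Policy Optimization, which scores several sampled responses per prompt and normalizes each reward against its group's statistics instead of using a learned value network. As a minimal sketch of that group-relative advantage step (the function name and toy rewards below are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO-style alignment:
    each sampled response's reward is normalized by the mean and
    standard deviation of its own group (responses to one prompt)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy example: four sampled responses to one prompt, reward in [0, 1].
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses rewarded above the group mean receive positive advantages and are reinforced; those below receive negative advantages, with no critic model required.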
IEEE Geoscience and Remote Sensing Letters
Citation Formats
A. Köksal and A. A. Alatan, “TinyRS-R1: Compact Vision Language Model for Remote Sensing,” IEEE Geoscience and Remote Sensing Letters, pp. 0–0, 2025, Accessed: 00, 2025. [Online]. Available: https://hdl.handle.net/11511/117262.