Journal of Geodesy and Geoinformation Science, 2023, Vol. 6, Issue (4): 27-39. DOI: 10.11947/j.JGGS.2023.0403


Multi-task Learning of Semantic Segmentation and Height Estimation for Multi-modal Remote Sensing Images

Mengyu WANG1,2,3,4, Zhiyuan YAN1,4, Yingchao FENG1,4, Wenhui DIAO1,4, Xian SUN1,2,3,4

    1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
    2. University of Chinese Academy of Sciences, Beijing 100190, China
    3. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China
    4. Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2023-07-26  Accepted: 2023-11-06  Online: 2023-12-20  Published: 2024-02-06
  • Contact: Zhiyuan YAN  E-mail: wangmentyu22@mails.ucas.ac.cn; ganzy@aircas.ac.cn
  • About author: Mengyu WANG  E-mail: wangmentyu22@mails.ucas.ac.cn
  • Supported by:
    National Key R&D Program of China(2022ZD0118401)

Abstract:

Deep learning based methods have been successfully applied to the semantic segmentation of optical remote sensing images. However, as ever more remote sensing data becomes available, comprehensively exploiting multi-modal data to break through the performance bottleneck of single-modal interpretation has become a new challenge. Moreover, semantic segmentation and height estimation from remote sensing data are strongly correlated tasks, yet existing methods usually study them separately, which incurs a high computational overhead. To this end, we propose a Multi-Task learning framework for Multi-Modal remote sensing images (MM_MT). Specifically, we design a Cross-Modal Feature Fusion (CMFF) method that aggregates complementary information from different modalities to improve the accuracy of both semantic segmentation and height estimation. In addition, a dual-stream multi-task learning method is introduced for Joint Semantic Segmentation and Height Estimation (JSSHE): common features are extracted in a shared network to save time and resources, and task-specific features are then learned in two task branches. Experimental results on the public multi-modal remote sensing dataset Potsdam show that, compared with training the two tasks independently, multi-task learning saves 20% of the training time and achieves competitive performance, with an mIoU of 83.02% for semantic segmentation and an accuracy of 95.26% for height estimation.
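For readers who want a concrete picture of the design the abstract describes, the following PyTorch sketch shows a shared-trunk, dual-head network with a concatenation-based cross-modal fusion step. It is a minimal illustration under stated assumptions, not the authors' MM_MT implementation: the module names, channel widths, the 1x1-convolution fusion, and the six-class output (matching the Potsdam label set) are all illustrative choices.

# Minimal sketch (not the paper's code) of a dual-stream multi-task network:
# two modality-specific encoders, a cross-modal fusion step, a shared trunk,
# and separate heads for semantic segmentation and height estimation.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Generic 3x3 conv + BN + ReLU building block (illustrative).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class CrossModalFusion(nn.Module):
    # Stand-in for a CMFF-style module: concatenates the two modality
    # features along channels and mixes them with a 1x1 convolution.
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_a, feat_b):
        return self.mix(torch.cat([feat_a, feat_b], dim=1))

class MultiTaskNet(nn.Module):
    # Shared trunk + two task branches, mirroring the joint segmentation
    # and height-estimation (JSSHE) idea sketched in the abstract.
    def __init__(self, num_classes=6, width=64):
        super().__init__()
        self.enc_optical = conv_block(3, width)   # e.g. RGB orthophoto stream
        self.enc_height = conv_block(1, width)    # e.g. nDSM / elevation stream
        self.fusion = CrossModalFusion(width)
        self.shared = conv_block(width, width)    # common features for both tasks
        self.seg_head = nn.Conv2d(width, num_classes, kernel_size=1)
        self.height_head = nn.Conv2d(width, 1, kernel_size=1)

    def forward(self, optical, height):
        fused = self.fusion(self.enc_optical(optical), self.enc_height(height))
        shared = self.shared(fused)
        return self.seg_head(shared), self.height_head(shared)

if __name__ == "__main__":
    net = MultiTaskNet()
    rgb = torch.randn(1, 3, 256, 256)
    ndsm = torch.randn(1, 1, 256, 256)
    seg_logits, height_pred = net(rgb, ndsm)
    print(seg_logits.shape, height_pred.shape)  # (1, 6, 256, 256), (1, 1, 256, 256)

The point of the sketch is only the computational-sharing argument made in the abstract: one fused representation feeds both the segmentation head and the height-regression head, so most of the forward pass is computed once for the two tasks rather than twice.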

Key words: multi-modal; multi-task; semantic segmentation; height estimation; convolutional neural network