Multi-task Learning of Semantic Segmentation and Height Estimation for Multi-modal Remote Sensing Images

doi:10.11947/j.JGGS.2023.0403

摘要/Abstract

Abstract:

Deep learning based methods have been successfully applied to semantic segmentation of optical remote sensing images. However, as more and more remote sensing data is available, it is a new challenge to comprehensively utilize multi-modal remote sensing data to break through the performance bottleneck of single-modal interpretation. In addition, semantic segmentation and height estimation in remote sensing data are two tasks with strong correlation, but existing methods usually study individual tasks separately, which leads to high computational resource overhead. To this end, we propose a Multi-Task learning framework for Multi-Modal remote sensing images (MM_MT). Specifically, we design a Cross-Modal Feature Fusion (CMFF) method, which aggregates complementary information of different modalities to improve the accuracy of semantic segmentation and height estimation. Besides, a dual-stream multi-task learning method is introduced for Joint Semantic Segmentation and Height Estimation (JSSHE), extracting common features in a shared network to save time and resources, and then learning task-specific features in two task branches. Experimental results on the public multi-modal remote sensing image dataset Potsdam show that compared to training two tasks independently, multi-task learning saves 20% of training time and achieves competitive performance with mIoU of 83.02% for semantic segmentation and accuracy of 95.26% for height estimation.

Key words: multi-modal, multi-task, semantic segmentation, height estimation, convolutional neural network

. [J]. 测绘学报（英文版）, 2023, 6(4): 27-39.

Mengyu WANG, Zhiyuan YAN, Yingchao FENG, Wenhui DIAO, Xian SUN. Multi-task Learning of Semantic Segmentation and Height Estimation for Multi-modal Remote Sensing Images[J]. Journal of Geodesy and Geoinformation Science, 2023, 6(4): 27-39.

图/表 16

参考文献 45

[1]	ZHU Xiaoxiang, TUIA Devis, MOU Lichao, et al. Deep learning in remote sensing: a comprehensive review and list of resources[J]. IEEE Geoscience and Remote Sensing Magazine, 2017, 5(4): 8-36.
[2]	MENG Xiaoliang, YANG Yuechi, WANG Libo, et al. Class-guided swin transformer for semantic segmentation of remote sensing imagery[J]. IEEE Geoscience and Remote Sensing Letters, 2022, 19: 1-56517505.
[3]	SUN Long, WU Tao, Sun Guangcai, et al. Object detection research of SAR image using improved faster region-based convolutional neural network[J]. Journal of Geodesy and Geoinformation Science, 2020, 3(3): 18-28. doi: 10.11947/j.JGGS.2020.0302
[4]	MOU Lichao, ZHU Xiaoxiang. IM2HEIGHT: height estimation from single monocular imagery via fully residual convolutional-deconvolutional network[M/OL]. [2023-06-26]. http://arXiv.org/abs/:1802.10249, 2018.
[5]	HUANG Zhongling, DATCU Mihai, PAN Zongxu, et al. Deep SAR-Net: learning objects from signals[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2020, 161: 179-193. doi: 10.1016/j.isprsjprs.2020.01.016
[6]	YAN Zhiyuan, WANG Peijin, XU Feng, et al. AIR-PV: a benchmark dataset for photovoltaic panel extraction in optical remote sensing imagery[J]. Science China Information Sciences, 2023, 66(4): 140307. doi: 10.1007/s11432-022-3663-1
[7]	LI Shutao, LI Congyu, KANG Xudong. Development status and future prospects of multi-source remote sensing image fusion[J]. National Remote Sensing Bulletin, 2021, 25(1): 148-166. doi: 10.11834/jrs.20210259
[8]	ZHANG Jiaqing, LEI Jie, XIE Weiying, et al. SuperYOLO: super resolution assisted object detection in multimodal remote sensing imagery[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-155605415.
[9]	ZHENG Aihua, HE Jinbo, WANG Ming, et al. Category-wise fusion and enhancement learning for multimodal remote sensing image semantic segmentation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-124416212.
[10]	LI Xue, ZHANG Guo, CUI Hao, et al. MCANet: a joint semantic segmentation framework of optical and SAR images for land use classification[J]. International Journal of Applied Earth Observation and Geoinformation, 2022, 106: 102638. doi: 10.1016/j.jag.2021.102638
[11]	LIU Wenjie, SUN Xian, ZHANG Wenkai, et al. Associatively segmenting semantics and estimating height from monocular remote-sensing imagery[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-175624317.
[12]	ZHU Panpan, LI Shuaipeng, ZHANG Liqiang, et al. Multitask learning-based building extraction from high-resolution remote sensing images[J]. Journal of Geo-Information Science, 2021, 23(3): 514-523.
[13]	LIU Anan, SU Yuting, NIE Weizhi, et al. Hierarchical clustering multi-task learning for joint human action grouping and recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(1): 102-114. pmid: 26955018
[14]	LIU Anan, XU Ning, NIE Weizhi, et al. Multi-domain and multi-task learning for human action recognition[J]. IEEE Transactions on Image Processing, 2019, 28(2): 853-867. doi: 10.1109/TIP.2018.2872879
[15]	RUDER S. An overview of multi-task learning in deep neural networks[M/OL]. [2023-06-26]. http://arXiv.org/abs/:1706.05098, 2017.
[16]	FENG Yingchao, SUN Xian, DIAO Wenhui, et al. Height aware understanding of remote sensing images based on cross-task interaction[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2023, 195: 233-249. doi: 10.1016/j.isprsjprs.2022.11.014
[17]	ZUO Zongcheng, ZHANG Wen, ZHANG Dongying. A remote sensing image semantic segmentation method by combining deformable convolution with conditional random fields[J]. Journal of Geodesy and Geoinformation Science, 2020, 3(3): 39-49. doi: 10.11947/j.JGGS.2020.0304
[18]	RONNEBERGER O, FISCHER P, BROX T. U-Net:convolutional networks for biomedical image segmentation[C]// Proceedings of the 18th International Conference on In Medical Image Computing and Computer-Assisted Intervention-MICCAI 2015: 18th International Conference. Munich: Springer, 2015: 234-241.
[19]	ZHAO Hengshuang, SHI Jianping, QI Xiaojuan, et al. Pyramid scene parsing network[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI: IEEE, 2017: 2881-2890.
[20]	DIAKOGIANNIS F I, WALDNER F, CACCETTA P, et al. ResUNet-a: a deep learning framework for semantic segmentation of remotely sensed data[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2020, 162: 94-114. doi: 10.1016/j.isprsjprs.2020.01.013
[21]	FENG Yingchao, DIAO Wenhui, SUN Xian, et al. NPALOSS: neighboring pixel affinity loss for semantic segmentation in high-resolution aerial imagery[C]// Proceedings of the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences. [S.l.]: ISPRS, 2020: 475-482.
[22]	EIGEN D, PUHRSCH C, FERGUS R. Depth map prediction from a single image using a multi-scale deep network[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems Advances in Neural Information Processing Systems. Montreal: MIT Press, 2014(27): 2366-2374.
[23]	AO Ying, LI Penglong, WEN Li, et al. Fully convolutional networks for street furniture identification in panorama images[J]. Journal of Geodesy and Geoinformation Science, 2022, 5(4): 59-71. doi: 10.11947/j.JGGS.2022.0406
[24]	MOU Lichao, HUA Yuansheng, ZHU Xiaoxiang. Relation matters: relational context-aware fully convolutional network for semantic segmentation of high-resolution aerial images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2020, 58(11): 7557-7569. doi: 10.1109/TGRS.36
[25]	LI Haifeng, QIU Kaijian, CHEN Li, et al. SCAttNet: semantic segmentation network with spatial and channel attention mechanism for high-resolution remote sensing images[J]. IEEE Geoscience and Remote Sensing Letters, 2021, 18(5):905-909. doi: 10.1109/LGRS.2020.2988294
[26]	LI Jiaxin, HONG Danfeng, GAO Lianru, et al. Deep learning in multimodal remote sensing data fusion: a comprehensive review[J]. International Journal of Applied Earth Observation and Geoinformation, 2022, 112: 102926. doi: 10.1016/j.jag.2022.102926
[27]	WU Xin, HONG Danfeng, CHANUSSOT J. Convolutional neural networks for multimodal remote sensing data classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-105517010.
[28]	SUN Xian, TIAN Yu, LU Wanxuan, et al. From single-to multi-modal remote sensing imagery interpretation: a survey and taxonomy[J]. Science China Information Sciences, 2023, 66(4): 140301. doi: 10.1007/s11432-022-3588-0
[29]	CHEN Kaiqiang, FU Kun, GAO Xin, et al. Effective fusion of multi-modal data with group convolutions for semantic segmentation of aerial imagery[C]//Proceedings of 2019 IEEE International Geoscience and Remote Sensing Symposium. Yokohama: IEEE, 2019: 3911-3914.
[30]	XING Siyuan, DONG Qiulei, HU Zhanyi. Gated feature aggregation for height estimation from single aerial images[J]. IEEE Geoscience and Remote Sensing Letters, 2022, 19:1-5.
[31]	SUN Xian, WANG Peijin, LU Wanxuan, et al. RingMo: a remote sensing foundation model with masked image modeling[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-225612822.
[32]	VANDENHENDE S, GEORGOULIS S, VAN GANSBEKE W, et al. Multi-task learning for dense prediction tasks: a survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(7): 3614-3633.
[33]	KOKKINOS I. UberNet: training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI: IEEE, 2017: 6129-6138.
[34]	SRIVASTAVA S, VOLPI M, TUIA D. Joint height estimation and semantic labeling of monocular aerial images with CNNS[C]//Proceedings of 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). Fort Worth, TX: IEEE, 2017: 5173-5176.
[35]	WANG Yufeng, DING Wenrui, ZHANG Ruiqian, et al. Boundary-aware multitask learning for remote sensing imagery[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 951-963. doi: 10.1109/JSTARS.4609443
[36]	SONG Weiwei, DAI Yong, GAO Zhi, et al. Hashing-based deep metric learning for the classification of hyperspectral and LiDAR data[J], IEEE Transactions on Geoscience and Remote Sensing, 2023, 61:1-13.
[37]	SUN Xian, WANG Peijin, LU Wanxuan, et al. RingMo: a remote sensing foundation model with masked image modeling[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-22.
[38]	VANDENHENDE S, GEORGOULIS S, VAN GANSBEKE W, et al. Multi-task learning for dense prediction tasks: a survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(7): 3614-3633.
[39]	KOKKINOS I. Ubernet: training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA:IEEE, 2017: 6129-6138.
[40]	SRIVASTAVA S, VOLPI M, TUIA D. Joint height estimation and semantic labeling of monocular aerial images with CNNS[C]//Proceedings of 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). Fort Worth, TX, USA: IEEE, 2017: 5173-5176.
[41]	WANG Yufeng, DING Wenrui, ZHANG Ruiqian, et al. Boundary-aware multitask learning for remote sensing imagery[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 951-963. doi: 10.1109/JSTARS.4609443
[42]	GAO Zhi, SUN Wenbo, LU Yao, et al. Joint learning of semantic segmentation and height estimation for remote sensing image leveraging contrastive learning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61:1-15.
[43]	HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016: 770-778.
[44]	CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848. doi: 10.1109/TPAMI.2017.2699184
[45]	ISPRS. 2D Semantic Labeling Contest-Potsdam. [EB/OL]. [2023-09-01]. https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx.

Layer name	Output size	Stride	Dilated rate
Conv1_1	256×256(1/2)	2	-
Conv1_2	256×256(1/2)	1	-
Conv1_3	256×256(1/2)	1	-
Layer1	128×128(1/4)	1	1
Layer2	64×64(1/8)	2	1
Layer3	64×64(1/8)	1	2
Layer4	64×64(1/8)	1	4

Method	mIoU	OA
Baseline(SM_SS)	81.84	87.80
+CMFF(MM_SS)	82.34	88.30

Method	Rel	Rmse	δ₁/(%)
Baseline(SM_HE)	0.0993	1.108	96.15
+CMFF(MM_HE)	0.1094	1.232	94.42

Method	mIoU /(%)	δ₁ /(%)	Training speed /(s/iter)	Model size /MB
SM_SS	81.84	-	0.48	267.18
SM_HE	-	96.15	0.48	267.17
+JSSHE (SM_MT)	79.09	94.33	0.72	399.26

Task/α	0.5	1	1.5
SS(mIoU)	79.23%	79.09%	78.42%
HE(Rel)	0.1168	0.1094	0.1135