CFM-UNet: A Joint CNN and Transformer Network via Cross Feature Modulation for Remote Sensing Images Segmentation

doi:10.11947/j.JGGS.2023.0404

Abstract

CNN-based semantic segmentation methods have made great progress, but they still have shortcomings when applied to remote sensing image segmentation; in particular, their small receptive field cannot effectively capture global context. To address this problem, this paper proposes CFM-UNet, a hybrid model based on ResNet50 and Swin Transformer that directly captures long-range dependencies and fuses the CNN and Transformer features through a Cross Feature Modulation Module (CFMM). On two publicly available datasets, Vaihingen and Potsdam, the model achieves mIoU of 70.27% and 76.63%, respectively. CFM-UNet thus maintains high segmentation performance compared with other competitive networks.

Key words: remote sensing images; semantic segmentation; swin transformer; feature modulation module

Min WANG, Peidong WANG. CFM-UNet: A Joint CNN and Transformer Network via Cross Feature Modulation for Remote Sensing Images Segmentation[J]. Journal of Geodesy and Geoinformation Science, 2023, 6(4): 40-47.

References

[1]	ZUO Zongcheng, ZHANG Wen, ZHANG Dongying. A remote sensing image semantic segmentation method by combining deformable convolution with conditional random fields[J]. Journal of Geodesy and Geoinformation Science, 2020, 3(3): 39-49. doi: 10.11947/j.JGGS.2020.0304
[2]	LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA: IEEE, 2015: 3431-3440.
[3]	RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich: Springer, 2015: 234-241.
[4]	CHEN L C, ZHU Yukun, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]//Proceedings of the 15th European Conference on Computer Vision. Munich: Springer, 2018: 833-851.
[5]	FU Jun, LIU Jing, TIAN Haijie, et al. Dual attention network for scene segmentation[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA: IEEE, 2019: 3146-3154.
[6]	XIAO Tete, LIU Yingcheng, ZHOU Bolei, et al. Unified perceptual parsing for scene understanding[C]//Proceedings of the 15th European Conference on Computer Vision. Munich: Springer, 2018: 432-448.
[7]	ZHAO Hengshuang, SHI Jianping, QI Xiaojuan, et al. Pyramid scene parsing network[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI: IEEE, 2017: 6230-6239.
[8]	MOU Lichao, HUA Yuansheng, ZHU Xiaoxiang. Relation matters: relational context-aware fully convolutional network for semantic segmentation of high-resolution aerial images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2020, 58(11): 7557-7569. doi: 10.1109/TGRS.36
[9]	L Wenjie, LI Yu, Z Quanhua. High-resolution remote sensing image segmentation using minimum spanning tree tessellation and RHMRF-FCM algorithm[J]. Journal of Geodesy and Geoinformation Science, 2020, 3(1): 52-63. doi: 10.11947/j.JGGS.2020.0106
[10]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[C]//Proceedings of the 9th International Conference on Learning Representations. [S.l.]: OpenReview.net, 2021: 1-5.
[11]	LIU Z, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 9992-10002. doi: 10.1109/ICCV48922.2021.00986
[12]	CAO Hu, WANG Yueyue, CHEN J, et al. Swin-Unet: unet-like pure transformer for medical image segmentation[C]//Proceedings of the European Conference on Computer Vision. Tel Aviv: Springer, 2023: 205-218.
[13]	LIN Ailiang, CHEN Bingzhi, XU Jiayu, et al. DS-TransUNet: dual swin transformer U-Net for medical image segmentation[J]. IEEE Transactions on Instrumentation and Measurement, 2022, 71: 4005615.
[14]	HE Xin, ZHOU Yong, ZHAO Jiaqi, et al. Swin transformer embedding UNet for remote sensing image semantic segmentation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 4408715.
[15]	CHEN J, LU Y, YU Q, et al. TransUNet: transformers make strong encoders for medical image segmentation[EB/OL]. [2023-09-01]. https://www.cs.jhu.edu/~alanlab/Pubs21/chen2021transunet.pdf.
[16]	JIANG Liming, ZHANG Changxu, HUANG Mingyang, et al. TSIT: a simple and versatile framework for image-to-image translation[C]//Proceedings of the 16th European Conference on Computer Vision. Glasgow: Springer, 2020: 206-222.
[17]	WANG Xintao, YU Ke, DONG Chao, et al. Recovering realistic texture in image super-resolution by deep spatial feature transform[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT: IEEE, 2018: 606-615.
[18]	ISPRS 2D semantic labeling dataset[EB/OL]. [2021-06-10]. https://www2.isprs.org/commissions/comm2/wg4/benchmark/semantic-labeling/.
[19]	MAGGIORI E, TARABALKA Y, CHARPIAT G, et al. High-resolution aerial image labeling with convolutional neural networks[J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(12): 7092-7103. doi: 10.1109/TGRS.2017.2740362
[20]	LIU Yu, NGUYEN D M, DELIGIANNIS N, et al. Hourglass-shape network based semantic segmentation for high resolution aerial imagery[J]. Remote Sensing, 2017, 9(6): 522. doi: 10.3390/rs9060522
[21]	VOLPI M, TUIA D. Dense semantic labeling of subdecimeter resolution images with convolutional neural networks[J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(2): 881-893. doi: 10.1109/TGRS.2016.2616585
[22]	MARCOS D, VOLPI M, KELLENBERGER B, et al. Land cover mapping at very high resolution with rotation equivariant CNNs: towards small yet accurate models[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2018, 145: 96-107. doi: 10.1016/j.isprsjprs.2018.01.021
[23]	LI Xiangtai, HE Hao, LI Xia, et al. PointFlow: flowing semantics through points for aerial image segmentation[C]//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN: IEEE, 2021: 4217-4226.
[24]	FIDON L, LI Wenqi, GARCIA-PERAZA-HERRERA L C, et al. Generalised Wasserstein dice score for imbalanced multi-class segmentation using holistic convolutional networks[C]//Proceedings of the 3rd International MICCAI Brainlesion Workshop. Quebec City: Springer, 2017: 64-76.
[25]	ZHU Qingtian, ZHENG Yumin, JIANG Yulai, et al. Efficient multi-class semantic segmentation of high resolution aerial imagery with dilated LinkNet[C]//Proceedings of 2019 IEEE International Geoscience and Remote Sensing Symposium. Yokohama: IEEE, 2019: 1065-1068.
[26]	PENG Zhiliang, HUANG Wei, GU Shanzhi, et al. Conformer: local features coupling global representations for visual recognition[C]//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 357-366. doi: 10.1109/ICCV48922.2021.00042

Methods	Low vegetation IoU/(%)	Tree IoU/(%)	Car IoU/(%)	Impervious surface IoU/(%)	Building IoU/(%)	mIoU/(%)
FCN^[2]	54.80	70.38	39.92	73.22	78.97	63.46
UNet^[3]	57.23	71.63	48.29	72.91	81.68	66.35
Deeplab V3+^[4]	56.09	71.54	50.30	74.85	83.01	67.16
UperNet^[6]	55.65	71.31	47.26	73.45	81.50	65.84
DANet^[5]	56.88	71.21	42.68	73.54	81.40	65.14
TransUNet^[15]	55.07	71.08	55.13	73.27	81.01	67.11
Swin-UNet^[12]	49.48	67.12	30.78	69.31	73.37	58.01
ST-UNet^[14]	57.79	72.53	61.48	76.36	82.98	70.23
CFM-UNet(ours)	58.08	72.80	60.71	76.46	83.33	70.27
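
The mIoU column above appears to be the unweighted mean of the five per-class IoU values. The following is a minimal sketch of that relationship, not the authors' evaluation code. It assumes the standard definition IoU = TP / (TP + FP + FN) computed from a confusion matrix; the helper names `iou_per_class` and `mean_iou` and the toy matrix are hypothetical, and only the two rows of per-class values are taken from the table.

```python
def iou_per_class(confusion):
    """Per-class IoU from a square confusion matrix.

    Assumed layout: rows are ground-truth classes, columns are predictions.
    """
    n = len(confusion)
    ious = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # truth k, predicted otherwise
        fp = sum(confusion[r][k] for r in range(n)) - tp   # predicted k, truth otherwise
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(ious):
    """Unweighted mean over classes."""
    return sum(ious) / len(ious)


# Hypothetical 2-class confusion matrix, used only to exercise the formula.
toy = [[8, 2],
       [1, 9]]
toy_ious = iou_per_class(toy)  # class 0: 8 / (8 + 1 + 2), class 1: 9 / (9 + 2 + 1)

# Per-class IoU values (%) copied from the table rows above.
fcn_row = [54.80, 70.38, 39.92, 73.22, 78.97]
cfm_row = [58.08, 72.80, 60.71, 76.46, 83.33]
fcn_miou = mean_iou(fcn_row)
cfm_miou = mean_iou(cfm_row)
```

Averaging the rounded per-class values gives about 63.458 for FCN and 70.276 for CFM-UNet, which agrees with the reported 63.46 and 70.27 to within 0.01. A gap of that size is what one would expect when the reported mIoU was computed from unrounded per-class values.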

Methods	Low vegetation IoU/(%)	Tree IoU/(%)	Car IoU/(%)	Impervious surface IoU/(%)	Building IoU/(%)	mIoU/(%)
FCN^[2]	66.10	63.19	74.34	77.41	83.52	72.91
UNet^[3]	64.59	65.44	76.16	77.10	82.83	73.22
Deeplab V3+^[4]	67.53	63.05	78.05	79.01	84.76	74.48
UperNet^[6]	65.65	60.40	76.57	76.95	83.93	72.70
DANet^[5]	66.46	63.47	75.28	77.35	83.45	73.20
TransUNet^[15]	67.16	64.10	79.33	78.61	85.60	74.96
Swin-UNet^[12]	59.03	50.96	71.15	71.45	75.02	65.52
ST-UNet^[14]	67.89	66.37	79.77	79.19	86.63	75.97
CFM-UNet(ours)	69.49	68.32	78.89	79.58	86.86	76.63

Methods	FLOPs(G)	Parameters(MB)	Speed(FPS)	mIoU/(%)
FCN^[2]	6.2	22.70	370	63.46
UNet^[3]	7.1	25.13	210	66.35
Deeplab V3+^[4]	14.8	38.48	69	67.16
UperNet^[6]	37.1	102.13	58	65.84
DANet^[5]	13.1	45.36	107	65.14
TransUNet^[15]	36.2	100.44	33	67.11
Swin-UNet^[12]	6.5	25.89	52	58.01
ST-UNet^[14]	52.3	160.97	6	70.23
CFM-UNet(ours)	66.1	209.71	5	70.27

Methods	mIoU/(%)
CFM-UNet without CFMM	67.32
CFM-UNet without C>>T	67.43
CFM-UNet without T>>C	68.56
CFM-UNet with CFMM replaced by the fusion strategy of [26]	69.14
CFM-UNet	70.27