Effective underwater sensing is crucial for environmental protection and the sustainable energy transition, given the growing challenges in marine ecosystem monitoring, resource management, and energy infrastructure. To support these efforts, we propose a multimodal sensing approach that improves underwater detection and distance estimation by combining affordable sonar technology with a stereo vision-based depth camera. Our method uses the Ping 360 single-beam sonar for target detection and distance measurement and the Intel RealSense D455 camera for depth refinement. A promptable segmentation model automates sonar target detection, handling acoustic noise and shadowing without requiring large labeled datasets. Depth images from the stereo camera are enhanced using a Depth-Anything model, which addresses underwater-specific issues such as noise, missing regions, and light attenuation and yields accurate depth maps at distances up to 1.2 meters underwater. This multimodal approach supports underwater robotics for navigation, manipulation, and exploration, as well as the monitoring and maintenance of energy infrastructure such as offshore wind farms and underwater pipelines. Accurate, real-time sensing of these installations enables more efficient operations, reduces environmental impact, and aids the sustainable management of ocean resources. In turn, this supports better energy production and resource utilization, which are essential for a smarter and more sustainable energy transition.
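One common way to combine a learned depth prediction with a stereo depth image that has missing regions is to fit a scale and shift on the pixels where stereo depth is valid, then use the rescaled prediction to fill the holes. The sketch below illustrates that general idea only; it is not claimed to be the refinement procedure used in this work. It assumes the model output has already been converted so that it is approximately affinely related to metric depth, and the names `fit_scale_shift` and `fill_missing` are hypothetical helpers introduced for illustration.

```python
from typing import List, Optional, Tuple


def fit_scale_shift(relative: List[float],
                    metric: List[Optional[float]]) -> Tuple[float, float]:
    """Least-squares fit of metric ~ scale * relative + shift.

    Only pixels with a valid stereo depth (not None) take part in the fit.
    Hypothetical helper; assumes an affine relation between the two signals.
    """
    pairs = [(r, m) for r, m in zip(relative, metric) if m is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid stereo depth samples")

    mean_r = sum(r for r, _ in pairs) / n
    mean_m = sum(m for _, m in pairs) / n
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    if var_r == 0.0:
        raise ValueError("relative depth is constant; scale is undetermined")

    cov = sum((r - mean_r) * (m - mean_m) for r, m in pairs)
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def fill_missing(relative: List[float],
                 metric: List[Optional[float]]) -> List[float]:
    """Keep valid stereo depths; fill holes with the rescaled prediction."""
    scale, shift = fit_scale_shift(relative, metric)
    return [m if m is not None else scale * r + shift
            for r, m in zip(relative, metric)]


# Toy flattened "images": depths in meters, None marks a missing stereo pixel.
relative_depth = [0.1, 0.2, 0.3, 0.4]
stereo_depth = [0.4, None, 0.8, 1.0]
refined_depth = fill_missing(relative_depth, stereo_depth)
```

In this toy example the valid pixels lie exactly on a line, so the missing pixel is filled with the value that line predicts. Real underwater data would be noisy, and a robust fit (for example, one that rejects outliers) would likely be preferable.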