GSoC 2022: Boat Object Avoidance with Luxonis AI Camera

Hello everyone! I am Chenghao Tan, an undergraduate student at Hangzhou Dianzi University in China. This summer, I worked on using a Luxonis AI camera for boat obstacle avoidance. First of all, many thanks to my mentors @rmackay9 and @rishabsingh3003 for providing great support to my project. Their invaluable guidance has not only benefited the progress of my GSoC project but has also helped me improve my programming ideas. I would also like to thank the ArduPilot community for funding and CubePilot for providing hardware for my project. I have had an extremely delightful GSoC experience, and I will summarize my project below.

Project Description

This project involves training and integrating a Luxonis AI camera to recognize obstacles and then send their estimated positions to ArduPilot's existing object avoidance feature so that the vehicle can stop or plan a path around them.

Why Luxonis

The Luxonis AI camera has a built-in Intel Myriad X VPU, which can run lightweight deep learning models on-device and take on much of the computational load of the obstacle detection process.

Core Idea

The depth map produced by the stereo camera includes the water surface, so to detect obstacles the water surface must first be removed from the depth map. However, the water surface is uneven and tilts as the boat pitches and rolls, which makes it difficult to separate from the obstacles. Since the Luxonis AI camera has relatively powerful on-board compute and comes with an RGB camera, we can remove the water surface with the help of deep learning.

One option is to use an object detection model to mark out the obstacles directly. The problem with this is that there are far too many kinds of obstacles for any dataset to cover.

The scheme used in this project is to use an image segmentation model to filter out the water-surface pixels from the depth map pixel by pixel, and then run subsequent processing on the depth map with only obstacle pixels left.
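The masking step itself is simple once the segmentation mask is available. Below is a minimal NumPy sketch (the function name `filter_depth` and the 0/1 mask convention are assumptions for illustration, not the project's actual code):

```python
import numpy as np

def filter_depth(depth: np.ndarray, seg_mask: np.ndarray) -> np.ndarray:
    """Zero out depth pixels classified as non-obstacle (water/sky).

    depth:    HxW depth map in millimetres (0 = invalid/ignored).
    seg_mask: HxW array where 1 marks obstacle pixels, 0 everything else.
    """
    return np.where(seg_mask == 1, depth, 0)

# Toy example: a 2x3 depth map where only the centre column is an obstacle.
depth = np.array([[500, 700, 900],
                  [500, 700, 900]], dtype=np.uint16)
mask = np.array([[0, 1, 0],
                 [0, 1, 0]], dtype=np.uint8)
filtered = filter_depth(depth, mask)
# Water and sky pixels are now 0 and are skipped by the later grid processing.
```

In the real pipeline this runs on the camera itself, fused into the model's tail, but the operation is logically the same element-wise select.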

Dataset Selection

I found an excellent dataset, MaSTr1325. It has masks of water, obstacles, and sky, with a total of 1325 images. I reclassified the pixels into obstacle and non-obstacle categories during training, filtering out the water and sky and leaving the obstacles. Since reality contains distracting factors such as rain, snow, sun flare, and shadows, and this dataset is too "clean", the data was augmented using the albumentations library, adding random color jitter and geometric distortions along with the factors above. Still, more data is always better. The training framework has been simplified, and the link is attached at the end of the blog, so feel free to add your own data to train a better model!

Model Training

I first tried UNet, which is widely used in image segmentation tasks. Although it performs well during validation, it is too heavy for the Luxonis AI camera: even after pruning, it only reaches eight frames per second at 480×270, so it has been replaced (though you can still find it in the test script repository, and the training framework still supports it). I ended up using DDRNet. Running pure segmentation, it reaches up to 28 frames per second at 640×360 with no noticeable latency, and achieves up to 0.75 mIoU on the Luxonis AI camera when tested over XLink.

Segmentation Demo

Obstacle Location Acquisition

The filtered depth map is divided into a grid. Cells containing more obstacle pixels than a threshold are marked, and a large obstacle covers multiple cells so that ArduPilot knows its actual size. For speed reasons, and because of DepthAI's limitations, this processing is implemented manually and embedded into the tail of the model.
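A host-side equivalent of this grid-marking step might look like the following. This is a simplified sketch (the function name `mark_grids`, the grid size, and the pixel threshold are assumptions), not the on-camera implementation fused into the model:

```python
import numpy as np

def mark_grids(filtered_depth, grid=(2, 2), min_pixels=3):
    """Divide the obstacle-only depth map into grid cells and mark every
    cell whose count of valid (non-zero) obstacle pixels exceeds a
    threshold. A large obstacle spanning several cells marks them all,
    which conveys its extent to ArduPilot.
    """
    rows, cols = grid
    h, w = filtered_depth.shape
    ch, cw = h // rows, w // cols
    marks = np.zeros((rows, cols), dtype=bool)
    for r in range(rows):
        for c in range(cols):
            cell = filtered_depth[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            marks[r, c] = np.count_nonzero(cell) > min_pixels
    return marks

# Toy frame: one obstacle occupying the top-left quadrant only.
depth = np.zeros((4, 8), dtype=np.uint16)
depth[0:2, 0:4] = 500
marks = mark_grids(depth)
```

The threshold filters out isolated noisy depth pixels that survive segmentation, so a single stray pixel does not register as an obstacle.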

The real-world obstacle coordinates can be computed in two alternative ways: one follows Luxonis' official host-side demo and uses only the HFOV; the other follows the point-cloud calculation method proposed in the Luxonis community and uses the camera's intrinsic matrix. This part runs on the companion computer and automatically reads the calibration information from the camera at startup.
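The HFOV-only variant amounts to a pinhole projection where the focal length is derived from the field of view. A minimal sketch (function name and the example HFOV value are assumptions; conventions are camera-frame metres, x right, y down, z forward):

```python
import math

def depth_to_xyz_hfov(u, v, depth_mm, width, height, hfov_deg):
    """Project pixel (u, v) with depth in mm into camera-frame metres,
    using only the horizontal field of view."""
    # Equivalent focal length in pixels derived from the HFOV.
    fx = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    fy = fx                       # assume square pixels
    cx, cy = width / 2, height / 2
    z = depth_mm / 1000.0         # forward distance in metres
    x = (u - cx) * z / fx         # right of optical axis
    y = (v - cy) * z / fy         # below optical axis
    return x, y, z

# A pixel at the image centre maps to a point straight ahead of the camera.
x, y, z = depth_to_xyz_hfov(320, 180, 2000, 640, 360, 72.0)
```

The intrinsic-matrix variant is the same formula, except `fx`, `fy`, `cx`, and `cy` come directly from the calibration data read from the camera instead of being derived from the HFOV, which is more accurate when the principal point is off-centre.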

Sending Obstacle Location Messages

This part differs from the original proposal: the planned OAK-D-IoT series is no longer available, so a companion computer such as a Raspberry Pi is used instead (the branch that sends messages using the onboard ESP32 is retained but buggy). Since the Luxonis AI camera is responsible for most of the computation, the companion computer does not need to be powerful; even a Raspberry Pi Zero should be able to handle this task.

As long as a frame is ready, the companion computer sends obstacle_distance_3d messages at a constant rate. If there are no obstacles in the frame, a message with a distance of the maximum effective distance + 1 is sent. If the camera fails, no obstacle location messages are sent at all.
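In the actual system these are MAVLink OBSTACLE_DISTANCE_3D messages; the sketch below only illustrates the clear-frame sentinel logic described above, using plain dicts instead of MAVLink packets. The helper name `build_messages` and the 10 m maximum range are assumptions for illustration:

```python
MAX_RANGE_M = 10.0  # camera's maximum effective distance (assumed value)

def build_messages(obstacles, now_ms):
    """Build one obstacle_distance_3d-style message per detected obstacle.

    `obstacles` is a list of (x, y, z) tuples in metres (camera frame).
    An empty frame still produces one message, placed just beyond the
    maximum effective range, so ArduPilot knows the sensor is alive and
    the path ahead is clear.
    """
    if not obstacles:
        return [{"time_boot_ms": now_ms, "obstacle_id": 0,
                 "x": MAX_RANGE_M + 1, "y": 0.0, "z": 0.0}]
    return [{"time_boot_ms": now_ms, "obstacle_id": i, "x": x, "y": y, "z": z}
            for i, (x, y, z) in enumerate(obstacles)]

clear = build_messages([], 0)                    # no obstacles in frame
detected = build_messages([(1.0, 0.5, 2.0)], 33) # one obstacle detected
```

The "max range + 1" sentinel is what distinguishes "I see nothing" from "the camera is dead": a failed camera sends nothing, and ArduPilot can time out on the absence of messages.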

What’s Next

  1. Both passive depth estimation and the RGB camera require sufficient light. The former can be alleviated by projecting IR dots (a feature of some newer Luxonis AI cameras), but the latter is trickier: even with supplemental visible light, an appropriate dataset is still needed.

  2. Reflections and multi-scale objects have always been difficult for image segmentation, especially for small models. The current model sometimes struggles to distinguish reflections from real obstacles, and segmentation accuracy drops significantly when boats or obstacles sit inside areas covered by the reflections of trees or the shoreline. Massive objects covering the entire upper half of the image have the same effect. Given the camera's limited computing resources, the best remedy at present is probably to augment the dataset to make up for these gaps in MaSTr1325.

  3. When the software implementation is used to detect obstacles, the workload is too heavy for companion computers such as the Raspberry Pi, so it is only suitable for tuning parameters. The hardware implementation, however, has parameters such as the confidence threshold baked in. Therefore, precompiled blob files with different parameters will be provided soon.



Some links:

Camera model used during development: OAK-D-IoT-75
In fact, any Luxonis camera supported by DepthAI with a VPU and depth measurement should work. DepthAI hardware documentation: here

UNet description: wiki (The UNet the project tried had undergone various kinds of pruning, mainly of the number of channels and the depth of the model.)
DDRNet description: GitHub (Since it's very new, there is no Wikipedia page for it, but its GitHub README is very clear. Note that the repository contains only model definitions; the project uses DDRNet_23_slim for segmentation.)

Depth to 3D location using HFOV: here
Depth to 3D location using intrinsic matrix: here


Chenghao, I am super excited about your project! With there being little support for the RealSense D435 moving forward, it would be amazing to have a replacement. Furthermore, given the AI backend on the camera, it should be superior to the D435 solution. So looking forward to hearing about your progress. I'm assuming you will post progress on this link? All the best and good luck with your very exciting project. Craig Flanagan PhD

This is actually an update in May: GitHub - Chenghao-Tan/DDRNet: DDRNet for marine segmentation

  1. The Segment Anything Model has been utilized for dataset generation. Users can automatically annotate the pictures they take (ideally from the perspective of the on-boat camera) on their gaming PCs (an NVIDIA card with 4 GB+ VRAM) without human intervention. Then they can retrain the DDRNet model for the Luxonis camera and get greatly improved recall and accuracy. Typically the whole process can be done in a few hours.

  2. A browser-based GUI is provided. Dataset auto-annotation, training, and model export & conversion can all be done through the GUI.

About SAM (Segment Anything Model):

SAM is a large vision model created by Meta. It’s now an optional module of this training framework/GUI. Please use it under Meta’s license.

A few prompt points are fixed at specific positions to mark sky and water, so SAM can work alone without tools like Grounding DINO, which means higher speed and lower VRAM consumption, allowing it to run on low-end gaming PCs. Tested on USVInland, the mIoU between the annotations generated by SAM and the official ones is 93.52%; on MaSTr1325 it is 78.38% (tested with vit_l; vit_h gives better results). The former has two classes (water, obstacles) and the latter three (sky, water, obstacles). The slightly lower score on the latter is mainly caused by a few individual images and does not matter much; a small filter targeting these bad masks is applied in actual usage.
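For reference, mIoU between two label maps is computed as the per-class intersection-over-union averaged over the classes present. A minimal sketch (the function name `miou` and the tiny example maps are illustrative, not the evaluation code used above):

```python
import numpy as np

def miou(pred, target, num_classes):
    """Mean intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Tiny 2x3 example with three classes (e.g. 0=sky, 1=water, 2=obstacle).
pred = np.array([[0, 0, 1], [1, 2, 2]])
target = np.array([[0, 0, 1], [1, 1, 2]])
score = miou(pred, target, 3)  # per-class IoUs: 1, 2/3, 1/2
```

This is the metric behind the 93.52% / 78.38% figures: SAM's auto-annotations are compared class by class against the datasets' official masks.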

GUI Screenshots and Demo Videos:

Dataset Auto-Annotation:


Export ONNX:

ONNX to BLOB (the model running on the Luxonis camera):

(This procedure needs an Internet connection because it uses Luxonis' online service)


(Red: obstacles, Green: water, Blue: sky.
Due to the fixed prompt points, the result is not always accurate, especially when trees occupy the position of the sky, but it is still usable.
Therefore, an option to segment only water and obstacles is provided.)

(Left: auto-annotated with the Segment Anything Model, then retrained
Right: trained only on MaSTr1325)
