The vision sensor or camera located on the drone is used in this study to identify the surrounding objects and sense the information in the environment. In short, the scene is captured as a video sequence, and is processed to extract the contextual features of the visual objects, while our proposed masking method is executed to understand the scene and provide a waypoint for guiding the navigation.
In an environment with dynamic brightness, flying a drone is a challenging task, especially for scene understanding. In our work, the masking method is applied to perform semantic segmentation to identify a safe route to navigate in the scene.
3.1. Object Detection
Every object in the scene is important for autonomous drone navigation, as the drone interprets and performs tasks based on object locations. Following current common practice, deep learning is used for object detection in this project. However, the lack of standardized datasets containing different types of trees in images acquired from drones makes it challenging to build a drone-based solution with the deep learning models described here.
The v2 Around NAS dataset used in this paper comprises images captured by a drone vision system. All the video data used here were recorded by a drone in the campus area of Petronas University of Technology and the nearby oil palm plantation. Ten videos of the investigation areas were recorded. Image sequences were extracted from the acquired videos, and all the frames were manually annotated to train the object detection model with a deep learning method. Furthermore, to boost the performance of the model, the dataset was augmented through different processing techniques such as flips, grain, and shifts. A deep learning method based on a convolutional neural network was implemented using TensorFlow running on an Nvidia Tesla K80.
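For illustration, a minimal TensorFlow sketch of this augmentation step is given below. The specific flip, shift, and grain (noise) parameters are assumptions for demonstration, and bounding-box annotations would need the same geometric adjustments in a real detection pipeline.

import tensorflow as tf

IMG_H, IMG_W = 600, 1024   # downsized frame size used as the model input

def augment(image):
    """Apply simple flip, shift, and grain-style augmentations to one image tensor."""
    image = tf.image.random_flip_left_right(image)        # horizontal flip
    # Shift: pad 20 px on every side, then randomly crop back to the original size
    padded = tf.image.pad_to_bounding_box(image, 20, 20, IMG_H + 40, IMG_W + 40)
    image = tf.image.random_crop(padded, size=[IMG_H, IMG_W, 3])
    # Grain: add mild Gaussian noise to mimic sensor grain
    noise = tf.random.normal(tf.shape(image), stddev=0.02)
    return tf.clip_by_value(image + noise, 0.0, 1.0)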
Algorithm 2 Drone Navigation Algorithm with Object Detection, Semantic Segmentation, Thresholding, and Navigation

function autonomous_navigation(video_frames)
    ▹ Object Detection Module
    detected_objects ← object_detection(video_frames)

    ▹ Semantic Segmentation and Thresholding Modules (Parallel Execution)
    safe_route ← semantic_segmentation(detected_objects)
    bright_spots ← thresholding(video_frames)

    ▹ Navigation Module
    navigate_autonomously(safe_route, bright_spots)
end function

function object_detection(video_frames)
    detected_objects ← [ ]    ▹ Placeholder; actual implementation depends on the detection model
    for frame in video_frames do
        ▹ Process each frame to detect objects and enclose them with bounding boxes
        objects ← detect_objects(frame)
        detected_objects.append(objects)
    end for
    return detected_objects
end function

function semantic_segmentation(detected_objects)
    ▹ Extract contextual features and apply image segmentation to identify a safe route
    ▹ Describe relations between objects with colored bounding boxes
    safe_route ← scene_understanding_algorithm(detected_objects)
    return safe_route
end function

function thresholding(video_frames)
    bright_spots ← [ ]    ▹ Placeholder; actual implementation depends on the thresholding algorithm
    for frame in video_frames do
        ▹ Identify bright spots using thresholding, draw contours, and label them with red dots
        bright_spots_frame ← thresholding_algorithm(frame)
        bright_spots.append(bright_spots_frame)
    end for
    return bright_spots
end function

function navigate_autonomously(safe_route, bright_spots)
    ▹ Use contextual information from semantic segmentation and thresholding
    ▹ Implement velocity control to guide the drone for precise autonomous navigation
    navigation_algorithm(safe_route, bright_spots)
end function
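To make the structure of Algorithm 2 concrete, a minimal Python skeleton is sketched below. The module functions are stubs standing in for the components described in the remainder of this section, and the thread pool is just one possible way to realize the parallel execution of the segmentation and thresholding modules.

from concurrent.futures import ThreadPoolExecutor

# Placeholder stubs for the modules detailed later in this section
def detect_objects(frame): return []                 # bounding boxes per frame
def scene_understanding_algorithm(objs): return None # safe route / waypoint
def thresholding_algorithm(frame): return []         # bright spots per frame
def navigation_algorithm(route, spots): pass         # velocity control commands

def autonomous_navigation(video_frames):
    """High-level driver mirroring Algorithm 2."""
    # Object Detection Module
    detected_objects = [detect_objects(frame) for frame in video_frames]

    # Semantic Segmentation and Thresholding Modules (parallel execution)
    with ThreadPoolExecutor(max_workers=2) as pool:
        seg = pool.submit(scene_understanding_algorithm, detected_objects)
        thr = pool.submit(lambda: [thresholding_algorithm(f) for f in video_frames])
        safe_route, bright_spots = seg.result(), thr.result()

    # Navigation Module
    navigation_algorithm(safe_route, bright_spots)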
These models, available as open-source, offer convenient access to their architectures and pre-trained weights, expediting development and experimentation. Several key features of these models are outlined below:
They are trained on large-scale datasets, providing a robust foundation for transfer learning with minimal additional training data.
They are designed to output bounding boxes alongside class predictions, facilitating object localization, a critical aspect of object detection.
They are selected for feature extraction based on their high mAP (mean average precision) scores on benchmark datasets like COCO, indicating superior performance in object detection tasks.
We evaluated the performance of these feature extraction models on datasets with varying compositions. While trained on real trees, they were also tested on 3D model trees [link to dataset], exhibiting effective performance in both scenarios.
The green region corresponds to the bounding box of the detected tree, while the red region represents the fixed field of view (FOV) of the drone.
In this project, the video data were acquired by a DJI Phantom 3 Professional quadcopter drone. The output of the proposed model was simulated for planning the drone flight missions. The drone was equipped with a single camera mounted on a three-axis gimbal. The camera records video at 30 frames per second and captures 12-megapixel photos with a field of view (FOV) of 94°. The three-axis gimbal keeps the camera level in all flight conditions, resulting in stable footage throughout every flight. In this study, an optimum resolution of 1920 × 1080 was used to record the videos. All the video frames were downsized to 1024 × 600 pixels according to the parameter settings of our model. Since our videos were recorded in high resolution, downsizing the dataset resolution did not substantially affect the image quality. In most common drone applications, the camera is mounted at a 90° top-view angle to capture the scene from a higher altitude. However, we consider a scenario where the drone is flying in a remote environment surrounded by many trees. In our experiment, the camera of the drone is fixed at the front view for object detection with a broader field of view.
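As a small illustration of this preprocessing step, the snippet below downsizes a recorded full-HD frame to the 1024 × 600 model input size with OpenCV; the file name and interpolation choice are assumptions.

import cv2

def downsize_frame(frame):
    """Resize a 1920 x 1080 frame to the 1024 x 600 input size used by the model."""
    return cv2.resize(frame, (1024, 600), interpolation=cv2.INTER_AREA)

# Example: read one frame from a recorded flight video and downsize it
cap = cv2.VideoCapture("flight_video.mp4")   # hypothetical file name
ok, frame = cap.read()
if ok:
    small = downsize_frame(frame)            # resulting shape: (600, 1024, 3)
cap.release()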
A frame is first processed by detecting and localizing the objects of interest. Once an object is detected, its location relative to other objects in the scene is determined. The detection is performed by training an object classification algorithm that is expected to learn useful features and produce highly accurate classification across different classes. In this project, a real-time object detection algorithm, Faster R-CNN with a ResNet-101 backbone [31], was used. Faster R-CNN is a unified model that uses a region proposal network (RPN) to generate regions of interest. By unifying three previously independent models, the network replaces the selective search method entirely. Faster R-CNN computes all the feature maps in a single forward pass over the entire image. The region proposals then share these feature maps, which are later used for the bounding box regressors and for learning the object classifier. The workflow of Faster R-CNN for tree object detection is shown in Figure 3.
As shown in Figure 4, the Faster R-CNN detection architecture uses ResNet-101 as the ConvNet backbone to extract features. The ResNet-101 Conv block consists of five fully convolutional layers (conv1_x, conv2_x, conv3_x, conv4_x, conv5_x) [35]. These are denoted as C1, C2, C3, C4, and C5 for conv1, conv2, conv3, conv4, and conv5. Each conv layer uses 3 × 3 convolutions with a fixed feature map dimension. The conv1 layer has 3 × 3 filters. The conv2, conv3, and conv4 layers have 3 × 3 filters with 128, 256, and 512 feature map dimensions, respectively. The final layer, conv5, is connected to the RPN to generate the region proposals and then undergoes ROI pooling.
First, the input image is fed into the ConvNet backbone of the RPN network. In this process, the ResNet-101 CNN extracts each feature vector with a corresponding fixed feature map dimension. At every point of the output feature map, the network has to learn to estimate the object size at the corresponding location and to decide whether an object is present in the image. Then, a sliding window runs spatially over the convolutional feature maps of the image at the spatial pooling stage. For each sliding window, multiple regions of various scales and ratios must be predicted simultaneously. Hence, anchor boxes are introduced in Faster R-CNN to resolve the variations in the ratios and scales of the objects. In total, a set of nine anchors is generated that all have the same center $(x_a, y_a)$, as shown in Figure 5.
All these anchors have different aspect ratios and different scales. At each location, these anchor boxes predict the probability of the corresponding object being present in the image. Each anchor calculates how much it overlaps with the ground-truth bounding boxes. In the final step, all the bounding boxes are processed by applying spatial pooling. Then, the feature maps are flattened and fed into two fully connected neural networks with 2048 dimensions. The spatial features extracted from the convolutional feature maps are computed in a smaller network for classification and regression. For this reason, bounding box regression is applied to refine the anchor boxes at each location. The output regressors determine a final predicted bounding box with $(x, y, w, h)$ coordinates. The output of the classification determines the probability of whether the predicted box contains an object or not, as shown in Figure 6.
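To illustrate the anchor scheme described above, the sketch below generates nine anchor boxes (three scales × three aspect ratios) sharing one center; the scale and ratio values are illustrative assumptions rather than the exact settings of our model.

import numpy as np

def generate_anchors(center_x, center_y,
                     scales=(128, 256, 512),     # assumed anchor sizes in pixels
                     ratios=(0.5, 1.0, 2.0)):    # assumed height/width aspect ratios
    """Return nine anchor boxes (x1, y1, x2, y2) that all share the same center."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)     # width shrinks as the aspect ratio grows
            h = s * np.sqrt(r)     # height grows with the aspect ratio
            anchors.append((center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2))
    return np.array(anchors)

# Example: the nine anchors for one sliding-window position of the feature map
print(generate_anchors(512, 300))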
3.2. Masking
The masking operation was performed on a background layer overlapping the actual frame to determine the navigation zone. A sample of the scene combined with the background masking layer is shown in Figure 7. As shown in the figure, the image is 1024 pixels wide and 600 pixels high. The center of the image is the region of interest, and we consider this center region the navigation zone. The navigation zone covers an area of 341 × 250 pixels. The field of view (FOV) of the drone is always focused on the navigation zone, and the remaining black areas are ignored. For our forest inspection scenario, the drone always flies in the forward direction using its onboard vision. Hence, the region of interest is set in the upper center area. In addition, a large empty space is left below the navigation zone to prevent the drone from crashing into the ground or colliding with low-lying obstacles.
The navigation zone is slightly higher than the center, and there is a free space below the navigation zone. All the images were captured at the same altitude, which was approximately 1.5 m to 2 m from the ground. Our model was designed to enable a drone to fly in a forest or oil palm plantation surrounded by mid-sized trees. For these conditions, when the drone flies among obstacles, especially in a forest, the drone will not collide with the ground. The drone is able to navigate swiftly around the obstacles in the forest without adjusting its altitude. Furthermore, constantly changing the altitude in a forest might cause collisions, as the drone would not be able to see obstacles above or below it. Since our drone setup is only equipped with a fixed camera, forest inspection is conducted in the forward direction. Based on the visual input, the drone is expected to navigate through the obstacles, perform object detection, and find its way out of the forest.
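The geometry of this navigation zone can be written down in a few lines. The sketch below computes the pixel bounds of a 341 × 250 zone placed slightly above the center of a 1024 × 600 frame; the exact vertical offset is an assumption, since the text only states that the zone sits above the center with free space below it.

FRAME_W, FRAME_H = 1024, 600   # frame size after downsizing
ZONE_W, ZONE_H = 341, 250      # navigation zone (region of interest)
VERTICAL_OFFSET = 40           # assumed upward shift of the zone, in pixels

# Horizontal bounds: the zone is centered across the frame width
x1 = (FRAME_W - ZONE_W) // 2
x2 = x1 + ZONE_W

# Vertical bounds: the zone is shifted above the frame center,
# leaving free space underneath to avoid ground collisions
y1 = (FRAME_H - ZONE_H) // 2 - VERTICAL_OFFSET
y2 = y1 + ZONE_H

print((x1, y1, x2, y2))        # (341, 135, 682, 385) with these assumptions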
By partitioning a scene into multiple regions, the visual objects in the scene can be represented by different colors to make the scene more meaningful and easier to analyze. In this study, the visual objects in the scene such as the trees are represented by colored regions. The primary goal is to divide the scene into multiple regions for further analysis. In this scenario, the areas between the tree objects are described with red-colored regions so that the drone knows which areas can be navigated through. Simultaneously, the relationship between the color-based regions and the green foreground regions is described so that the drone can autonomously avoid obstacles.
To illustrate this process, we take a frame as an example, as shown in
Figure 8a. The tree objects are detected with bounding boxes to enclose the objects. First, the background frame is filled with black, as shown in
Figure 8b. A layer depicting a fixed FOV navigation region is added to the frame, as shown in
Figure 8c. Since we are using only one camera attached to the drone and the drone is always flying forward, the FOV is set at the center of the frame as the initial view. This region is shown in red as the region of interest and navigation route in the scene. To be precise, the red region is the safe zone a short distance in front of the drone that the drone can fly through. The tree objects detected in the scene, with their respective bounding boxes, are shown in Figure 8d. The bounding boxes are colored green, indicating obstacles. Where the red region overlaps the green boxes, the remaining red spaces among the boxes are formed. By comparing the spatial relationships between the regions, the most significant red area is considered a potential clear path for the drone to navigate.
We must ascertain how to determine the most significant red safe area. Since our goal is to find the largest clear path in the scene that the drone can fly through, the corresponding image areas are masked based on a specific range of pixel values in the HSV color space. In this case, the HSV bounds for the obscured red areas shown in Figure 9a are lower = [0, 100, 100] and upper = [18, 255, 255]. The weighted average of the image intensities shown in Figure 9b is obtained by using masking operations. In the drone FOV, when the detected scene is filled with many obstacles, the red area can be specified through a binary mask to find a clear path for the drone to navigate. There are two bitmaps in each scene: the actual scene and an additional mask. The unused areas of the actual scene are assigned a pixel value with all bits set to ‘0’. In the additional mask, the red areas are assigned a pixel value with the bits set to ‘1’.
For instance, the image in Figure 9b is a binary image with two possible intensity values for each pixel, i.e., ‘0’ for black and ‘1’ for white. When an obstacle, i.e., a detected tree object, appears over the background, masking is performed on the overlapping scene pixels in the red area in Figure 9a. Hence, the free and open areas of the scene are preserved. The masked areas of a clear path with pixel intensities of ‘1’ are obscured by the overlapping unused ‘0’ areas, as in Figure 9b. Then, the system combines the original image pixels with the background pixels using a bitwise OR operation, so that the original image pixels are appropriately placed while the masked region is kept.
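A minimal OpenCV sketch of this masking step, using the HSV bounds given above, is shown below; the function names and the way the mask is combined with the scene are illustrative assumptions.

import cv2
import numpy as np

# HSV bounds for the red navigation area, as given in the text
LOWER_RED = np.array([0, 100, 100])
UPPER_RED = np.array([18, 255, 255])

def clear_path_mask(overlay_bgr):
    """Binary mask: 255 ('1') where the red clear-path area remains, 0 elsewhere."""
    hsv = cv2.cvtColor(overlay_bgr, cv2.COLOR_BGR2HSV)
    return cv2.inRange(hsv, LOWER_RED, UPPER_RED)

def apply_mask(scene_bgr, mask):
    """Keep the scene pixels inside the clear path and merge them with a black background."""
    masked_scene = cv2.bitwise_and(scene_bgr, scene_bgr, mask=mask)
    background = np.zeros_like(scene_bgr)            # unused areas with all bits set to '0'
    return cv2.bitwise_or(masked_scene, background)  # combine scene pixels with the background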
To be precise, the largest red area is calculated using image moments. The properties of the image determined via the image moments include the area (or total intensity), the centroid, and information about the orientation. The area of the nonzero pixels is calculated based on the image moments. For a binary image I(x, y), we consider the first moment as follows:

M_{10} = \sum_{x} \sum_{y} x \, I(x, y)

The x coordinates of all the white pixels (where I(x, y) = 1) are summed up. In addition, the sum of the y coordinates of all the white pixels is calculated as follows:

M_{01} = \sum_{x} \sum_{y} y \, I(x, y)

The area of the binary image is calculated as follows:

M_{00} = \sum_{x} \sum_{y} I(x, y)

When there is more than one area of white pixels in the image, the centroid lies somewhere in between. Hence, we need to extract each white-pixel area separately to obtain its centroid. Given that A is an array used to store all the white-pixel areas found in the image as follows:

A = \{a_1, a_2, \ldots, a_n\}

to find the largest area, we calculate the following:

a_{\max} = \max(A)

Technically, the largest white-pixel area is selected as the final clear path because it is the largest empty area between the obstacles. When we have the sums of all the x and y pixel coordinates, the average in the binary image is found by dividing these sums by the number of pixels in the respective area. Thus,

\bar{x} = \frac{M_{10}}{M_{00}}, \qquad \bar{y} = \frac{M_{01}}{M_{00}}
Therefore, the centroid obtained serves as the guided waypoint for the drone. From the drone FOV, the drone navigates to the largest red region. By combining all the layers, given that the layers do not have the same colors, the semantic color-based segmentation forms a comprehensive description of the scene. Essentially, the contextualized information is used to solve the problem of self-navigation in the scene.
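One compact way to realize this step is with OpenCV's connected components, which return exactly the moment-based area and centroid used above; the sketch below is illustrative, and the actual implementation may differ.

import cv2
import numpy as np

def waypoint_from_mask(mask):
    """Select the largest white-pixel area in a binary mask and return its centroid."""
    # Label each connected white-pixel area separately
    num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    if num_labels <= 1:                    # label 0 is the background
        return None                        # no clear path visible in this frame

    # Largest area (M00) among the non-background components
    areas = stats[1:, cv2.CC_STAT_AREA]
    largest = 1 + int(np.argmax(areas))

    # Centroid (M10 / M00, M01 / M00) of the selected component: the guided waypoint
    cx, cy = centroids[largest]
    return int(cx), int(cy)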
3.3. Drone Navigation with Simulator
The navigation of the drone is simulated by integrating the output of the algorithm with an open-source ground station application called APM Planner 2.0 [36]. This application communicates with drones through a lightweight messaging protocol, particularly among the onboard drone components, and it is used here to control the drone’s movements by manipulating the vehicle velocity vectors. The software-in-the-loop (SITL) mode of APM Planner 2.0 is a simulator that allows the user to run any copter, drone, or plane and to test the behavior of the code without any hardware. Therefore, users are able to build customized functions for a drone that suit the scenario, conditions, and environment. As shown in Figure 10, the simulator provides multiple panels, such as a map, a mission planner, vehicle data, and flight data. All these panels enable users to simulate the vehicle’s movements in a real environment.
In our scenario of vision-based drone navigation in a forest, the goal is to perform autonomous navigation without relying on any GPS devices. To minimize cost and validate the approach, all the navigation experiments are conducted with a simulator. Based on the vision sensor, all the image-based data and the coordinates of the obstacles are collected from the modules of the drone and sent to the simulator for further processing. The simulator runs in parallel with a DroneKit [33] application to enable two-way communication between the simulator and the controller application. DroneKit allows users to create custom apps that run on an onboard companion computer and communicate with the ArduPilot flight controller over a low-latency link. Thus, there are no latency issues in simulating vision-based navigation via the simulator. The simulator displays the precise navigation path, movements, and flight data on the map in real time.
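As a hedged illustration of this two-way link, the snippet below connects a DroneKit-Python script to an ArduPilot SITL instance; the connection string and port are common defaults and may differ from the setup actually used in this work.

from dronekit import connect, VehicleMode

# Assumed default SITL endpoint; adjust to the simulator's actual address and port
CONNECTION_STRING = "tcp:127.0.0.1:5760"

# Open the two-way link between the controller application and the simulator
vehicle = connect(CONNECTION_STRING, wait_ready=True)

# GUIDED mode lets the vision pipeline steer the drone with velocity commands
vehicle.mode = VehicleMode("GUIDED")
print("Connected to vehicle, firmware:", vehicle.version)

vehicle.close()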
The features obtained from the framework allow the largest red region in the scene to be selected as the navigation path, which is also considered the region of interest. A waypoint is drawn at the centroid of the region of interest. The coordinates of the guided waypoint, $\bar{x}$ and $\bar{y}$, obtained from Equation (7), are sent to the drone navigation command to control its movement without human intervention. Hence, the waypoint serves as the guided navigation point for autonomous drone flight without a predefined goal. Similarly, the centroid of each brighter region in the scene is denoted by a red point. However, these red points are only useful for finding the brighter spots between trees. In the proposed method, navigation is based on the DroneKit [33] velocity control command, which reacts to new scenarios in the scene as they occur. The path always follows the guided waypoints obtained from the front view of the camera.
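A typical way to issue such a velocity control command with DroneKit is through a MAVLink SET_POSITION_TARGET_LOCAL_NED message, as sketched below; this helper follows common DroneKit usage and is not taken verbatim from our implementation.

import time
from pymavlink import mavutil   # used to build the MAVLink message

def send_velocity(vehicle, vx, vy, vz=0.0, duration=1):
    """Send a body-frame velocity command (m/s) to the vehicle for `duration` seconds."""
    msg = vehicle.message_factory.set_position_target_local_ned_encode(
        0, 0, 0,                                    # time_boot_ms, target system, target component
        mavutil.mavlink.MAV_FRAME_BODY_OFFSET_NED,  # velocities relative to the drone's heading
        0b0000111111000111,                         # type_mask: only the velocity fields are enabled
        0, 0, 0,                                    # x, y, z positions (ignored)
        vx, vy, vz,                                 # x, y, z velocities in m/s
        0, 0, 0,                                    # accelerations (ignored)
        0, 0)                                       # yaw, yaw rate (ignored)
    for _ in range(duration):
        vehicle.send_mavlink(msg)                   # resend every second to keep the command active
        time.sleep(1)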
In this study, the camera FOV of the drone faces forward, as depicted in Figure 11. All the objects that fall within this FOV are detected for autonomous navigation control purposes. To calculate the velocity control value, the horizontal FOV is fixed at 118.2° and the vertical FOV is fixed at 69.5°. The region of interest in the visual scene lies within these two FOVs, and the velocity control of the drone is determined by the coordinates of the free path.
The x and y coordinates are used to calculate the vehicle velocity control. Due to the nature of the scenario, the z coordinate is not used in this study. Moreover, the speed and altitude of the drone are fixed. Referring to Equations (8) and (9), the x and y coordinates of the blue point are applied in the equations to obtain the velocity control output.
If the output coordinates obtained from Equations (8) and (9) fulfill the velocity threshold values shown in Table 1, the direction of the drone changes accordingly. We assume that no additional information regarding the height of the drone, including a GPS signal, is available. We obtain the output coordinates and perform waypoint guiding in the region of interest (ROI). The size of the ROI is adjusted slightly from the center position of the frame to prevent the drone from flying toward the ground or a lower area, based on the velocity control command. These FOV and ROI resolution settings were chosen to ensure that the obstacles and the navigation marker appear inside the selected ROI. Any coordinates that fall in other areas of the visual scene are ignored.
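To make the threshold check concrete, the sketch below turns a velocity control output into a coarse steering decision; the threshold values and direction labels are placeholders for the actual values listed in Table 1.

# Placeholder thresholds standing in for the values given in Table 1
VX_THRESHOLD = 0.5   # m/s, assumed forward dead band
VY_THRESHOLD = 0.5   # m/s, assumed lateral dead band

def steering_decision(vx, vy):
    """Map the velocity control output to a coarse direction change for the drone."""
    if abs(vy) > VY_THRESHOLD:
        return "turn_right" if vy > 0 else "turn_left"
    if abs(vx) > VX_THRESHOLD:
        return "move_forward"
    return "keep_heading"     # waypoint already near the center of the ROI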