Two-dimensional images are acquired by the camera and adjusted using the camera's intrinsic parameters. The image is rich in texture information and consists of pixel points, each carrying the three color channels r, g, and b. The 3D point cloud data acquired by the Velodyne HDL-64E rotating 3D laser scanner consist of the spatial coordinates x, y, z and the intensity information i. The image and point cloud representations are shown in the following equation:

$$I = \{(u, v, R, G, B)\}, \qquad P = \{(x, y, z, i)\},$$

where $I$ is the feature representation of the image and $P$ is the feature representation of the point cloud. $(u, v)$ represents the pixel coordinates; $R$, $G$, and $B$ are the color channel values; $x$, $y$, and $z$ are the spatial coordinates of a point; and $i$ is the point cloud reflection intensity information.
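As a minimal illustration of these two representations, the sketch below stores one RGB image and one LiDAR sweep as NumPy arrays; the array names and shapes are our own assumptions rather than anything fixed by the paper.

```python
import numpy as np

# Image I: H x W pixels, each holding (R, G, B) channel values.
H, W = 375, 1242                      # a typical KITTI image size (assumed)
image = np.zeros((H, W, 3), dtype=np.uint8)

# Point cloud P: N points, each holding (x, y, z, i), where i is the
# reflection intensity returned by the scanner.
N = 120_000                           # one HDL-64E sweep is on this order
points = np.zeros((N, 4), dtype=np.float32)

x, y, z, intensity = (points[:, 0], points[:, 1],
                      points[:, 2], points[:, 3])
```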
A graph convolutional network named FGCN is proposed. As shown in
Figure 2, the FGCN consists of five core modules: data alignment, data augmentation, topological graph construction, a spatial attention mechanism, and a multi-layer perceptron.
The 3D point clouds are registered with the 2D image data using the camera and LiDAR calibration parameters. A unique pixel correspondence is found for each point, and the color information is added to the point cloud to obtain $P_{rgb} = \{(x, y, z, i, R, G, B)\}$. Data augmentation is performed by Module II. The multi-channel graph convolution network of Module III constructs an image-guided topological graph structure, and the graph is built by combining the chromaticity values $(r, g, b)$. Feature learning is strengthened by adding residual connections to the multi-layer graph convolution to obtain the local feature map $F_L$. The data features are learned by the spatial attention of Module IV and then aggregated into a multi-modal topological graph. The prediction labels for each point are obtained after the Module V MLP and output layers.
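A minimal sketch of the registration step of Module I: each LiDAR point is projected into the image with hypothetical intrinsic and extrinsic matrices, and the color of its unique pixel is appended. The matrix names `K` and `T_cam_lidar` and the bounds filtering are assumptions about the setup, not the paper's exact procedure.

```python
import numpy as np

def colorize_points(points, image, K, T_cam_lidar):
    """Attach (R, G, B) from the image to each (x, y, z, i) point.

    points:      (N, 4) LiDAR points
    image:       (H, W, 3) RGB image
    K:           (3, 3) camera intrinsic matrix    (assumed known)
    T_cam_lidar: (4, 4) LiDAR-to-camera extrinsics (assumed known)
    """
    xyz1 = np.hstack([points[:, :3], np.ones((len(points), 1))])
    cam = (T_cam_lidar @ xyz1.T)[:3]           # points in the camera frame
    uvw = K @ cam
    u = (uvw[0] / uvw[2]).round().astype(int)  # pixel column for each point
    v = (uvw[1] / uvw[2]).round().astype(int)  # pixel row for each point
    H, W = image.shape[:2]
    ok = (cam[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    rgb = image[v[ok], u[ok]] / 255.0          # unique pixel per point
    return np.hstack([points[ok], rgb])        # (M, 7): x, y, z, i, R, G, B
```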
3.1. Image-Guided Multi-Channel Graph Convolution
In the field of computer vision, spatial graph convolutional networks for feature extraction can mitigate the problems of inadequate local information extraction and limited regional information merging. Therefore, this paper uses graph convolutional networks. The 3D point clouds must be converted into vectors before they enter the network. A sequence of points $\{p_1, p_2, \dots, p_n\}$ is written as a multidimensional tensor using the mapping function $\phi$, as shown in the following equation:

$$F = \phi(p_1, p_2, \dots, p_n),$$

where $\phi$ is the mapping function.
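One common realization of such a mapping function (a sketch of a shared per-point MLP, not necessarily the paper's exact layer) lifts each point's raw attributes into a feature vector with a 1 x 1 convolution applied identically to every point:

```python
import torch
import torch.nn as nn

class PointMapping(nn.Module):
    """phi: maps n points with d_in raw attributes each to an
    n x d_out feature tensor, shared across all points."""
    def __init__(self, d_in=7, d_out=64):   # 7 = (x, y, z, i, R, G, B)
        super().__init__()
        self.phi = nn.Sequential(
            nn.Conv1d(d_in, d_out, kernel_size=1),
            nn.BatchNorm1d(d_out),
            nn.ReLU(),
        )

    def forward(self, pts):                 # pts: (B, d_in, n)
        return self.phi(pts)                # (B, d_out, n)

features = PointMapping()(torch.randn(2, 7, 1024))   # -> (2, 64, 1024)
```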
Since the KNN algorithm focuses more on the distance between nodes while the graph convolution focuses more on the edge information, we use images to guide the graph construction. The image's color information is extracted as a compositional attribute, and this color information is added to the graph after the KNN aggregates the neighborhood features. The RGB feature set can be represented as $F_{rgb} = \{c_1, c_2, \dots, c_n\}$, $c_k = (R_k, G_k, B_k)$. In the multi-channel KNN module, the RGB features are processed into a chromaticity-valued feature set $C$ to represent the color features, as shown in the following equation:

$$C = (r, g, b), \qquad r = \frac{R}{R + G + B}, \quad g = \frac{G}{R + G + B}, \quad b = \frac{B}{R + G + B},$$

where $C$ is the chromaticity value of the color image, and $r$, $g$, and $b$ are the values after the color channel operation. Relying on color invariance in the form of normalized colors, this representation is insensitive to the illumination intensity. The new point cloud representation is shown in the following equation:

$$P_c = \{(x, y, z, i, r, g, b)\}.$$
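A short sketch of this normalization, under the stated assumption that the chromaticity values are the illumination-normalized rgb channels:

```python
import numpy as np

def chromaticity(rgb, eps=1e-8):
    """Normalized-rgb chromaticity: each channel divided by R + G + B.

    rgb: (N, 3) color values; returns (N, 3) rows that sum to 1,
    which makes them insensitive to the illumination intensity.
    """
    s = rgb.sum(axis=1, keepdims=True) + eps   # eps guards black pixels
    return rgb / s

# The same surface under bright and dim light maps to one chromaticity:
colors = np.array([[200.0, 100.0, 50.0], [20.0, 10.0, 5.0]])
print(chromaticity(colors))   # both rows ~ [0.571, 0.286, 0.143]
```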
The process of building the graph structure is shown in Figure 3, where the image-guided topological graph structure is built around the input sequence of central nodes. The two-channel KNN structure extracts the features of the central node $x_i$ and uses different K values to ensure that the local features are fully learned.
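A sketch of the two-channel KNN idea under our assumptions: the same point set is queried with two different K values (the values 16 and 32 here are placeholders), giving one neighborhood per channel whose index tensors can feed the graph construction.

```python
import torch

def knn_indices(xyz, k):
    """xyz: (B, N, 3). Returns (B, N, k) indices of the k nearest points
    (each point's nearest neighbor is itself; drop column 0 if unwanted)."""
    dist = torch.cdist(xyz, xyz)              # (B, N, N) pairwise distances
    return dist.topk(k, largest=False).indices

xyz = torch.randn(2, 1024, 3)
idx_a = knn_indices(xyz, k=16)   # channel 1: tight local neighborhoods
idx_b = knn_indices(xyz, k=32)   # channel 2: wider spatial context
```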
As shown in Figure 3, the topological graph structure aggregates the central node information, the edge information, the color information, and the node information aggregated in the previous layer. The spatial and color features of each point cloud are fully utilized to speed up learning. The aggregation process of the neighboring node features $x_j$ of node $x_i$ is shown in the following equation:

$$e_{ij} = h_{\Theta}\big(x_i,\; x_j - x_i,\; x_i^{c}\big),$$

where $e_{ij}$ is the edge feature between nodes $x_i$ and $x_j$, and $h_{\Theta}$ is a nonlinear function with learnable layer parameters $\Theta$. $e_{ij}$ and $x_i^{c}$ are the edge features and color features of $x_i$, respectively, where $x_i^{c}$ is the node with aggregated color features and the label $c$ represents the color. The node attribute features form the local feature graph $G$, $G = (V, E)$, and the color attribute information can enhance the graph structure building.
After constructing the graph structure, the edge features of the center point and the selected points are fused to achieve the aggregation of local and global information within the neighborhood graph. The $l$th-layer features of $x_i$ after fusion are shown in the following equation:

$$x_{ij}^{(l)} = h_{\Theta}\big(x_i^{(l-1)},\; e_{ij}^{(l-1)}\big).$$

After aggregating the features, max-pooling is used to filter them: the output at each position is replaced with the highest value found in its local neighborhood. The max-pooling operation is shown in the following equation:

$$x_i^{(l)} = \max_{j \in \mathcal{N}(i)} x_{ij}^{(l)},$$

where $\mathcal{N}(i)$ is the neighborhood of node $x_i$.
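The PyTorch sketch below puts these pieces together: edge features built from the center feature, the offset to each neighbor, and the center's color feature, passed through a learnable $h_\Theta$ and max-pooled over the neighborhood. The layer sizes and the concatenation order are our assumptions; Figure 3 fixes the actual layout.

```python
import torch
import torch.nn as nn

class ColorEdgeConv(nn.Module):
    """e_ij = h_theta(x_i, x_j - x_i, x_i^c), then max over neighbors."""
    def __init__(self, d_in, d_color, d_out):
        super().__init__()
        self.h_theta = nn.Sequential(             # h_theta: shared MLP
            nn.Conv2d(2 * d_in + d_color, d_out, kernel_size=1),
            nn.BatchNorm2d(d_out),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x, color, idx):
        # x: (B, N, d_in) node features; color: (B, N, d_color);
        # idx: (B, N, k) neighbor indices from the two-channel KNN.
        B, N, k = idx.shape
        nbrs = torch.gather(                      # x_j: (B, N, k, d_in)
            x.unsqueeze(1).expand(B, N, N, -1), 2,
            idx.unsqueeze(-1).expand(B, N, k, x.size(-1)))
        center = x.unsqueeze(2).expand_as(nbrs)   # x_i broadcast over k
        col = color.unsqueeze(2).expand(B, N, k, color.size(-1))
        e = torch.cat([center, nbrs - center, col], dim=-1)
        e = self.h_theta(e.permute(0, 3, 1, 2))   # (B, d_out, N, k)
        return e.max(dim=-1).values.permute(0, 2, 1)  # max-pool over k
```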
Inspired by skip-connection networks, this paper connects the local features with residuals after computing them. To improve graph structure perception using nearby nodes, the input and the output of the current layer together serve as the input of the following layer. The residual connectivity is shown in the following equation:

$$x_{i+1} = H(x_i) = F_i(x_i) + x_i,$$

where $x_i$ is the feature at layer $i$, $x_{i+1}$ is the feature at layer $i+1$, $F_i$ is the convolution operation at layer $i$, and $H$ is the residual function.
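A minimal sketch of this residual wiring around one graph convolution layer; the 1 x 1 projection for mismatched widths is our own addition so that the sum type-checks:

```python
import torch.nn as nn

class ResidualGraphBlock(nn.Module):
    """x_{i+1} = F_i(x_i) + x_i, projecting x_i if the widths differ."""
    def __init__(self, conv, d_in, d_out):
        super().__init__()
        self.conv = conv                       # F_i: any graph conv layer
        self.proj = (nn.Identity() if d_in == d_out
                     else nn.Linear(d_in, d_out, bias=False))

    def forward(self, x, *args):
        return self.conv(x, *args) + self.proj(x)   # residual connection
```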
The FGCN uses a three-layer graph convolution structure, and the central node $x_i$ aggregates the features of the last layer. Taking the $l$th-layer graph convolution as an example, the nodes are first embedded, and the neighborhood information of the nodes is aggregated using the two-channel KNN algorithm. All of the neighborhood node information, the edge information between nodes, and the aggregated node information from the $(l-1)$th layer are contained in the $l$th-layer features. To create a local neighborhood feature graph with color information, the layer $l-1$ feature graph is first enhanced with color information. The features from the previous layers, the color features, and the edge features are all aggregated simultaneously when the layer $l$ graph convolution is carried out. The local feature map $F_L$ is created by combining the features of all the nodes in layer 3, as shown in the following equation:

$$x_i^{(l)} = \max_{j \in \mathcal{N}(i)} h_{\Theta}\big(x_i^{(l-1)},\; x_j^{(l-1)} - x_i^{(l-1)}\big),$$

which is the aggregation mode between the layer $l-1$ features and the layer $l$ features of the graph convolution. Here $x_j^{(l-1)}$ is the feature of layer $l-1$, $\Theta$ are the learnable parameters, $x_j$ is one of the neighborhood nodes, $x_i^{(l)}$ is the feature of layer $l$, and $\mathcal{N}(i)$ is the set of neighborhood nodes of node $x_i$.
The receptive field is expanded using the multi-layer graph convolution structure. All node features are aggregated to the central node, which eventually forms the local neighborhood graph $G$.
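Under the assumptions above, the three layers can be stacked as below, reusing the ColorEdgeConv sketch, with each layer consuming the previous layer's output; concatenating the per-layer outputs into $F_L$ is a common pattern that we assume here rather than take from the paper.

```python
import torch
import torch.nn as nn

class ThreeLayerGraphConv(nn.Module):
    """Three stacked graph conv layers whose outputs are combined
    into the local feature map F_L."""
    def __init__(self, d_in=64, d_color=3, d_hidden=64):
        super().__init__()
        dims = [d_in, d_hidden, d_hidden]
        self.layers = nn.ModuleList(
            ColorEdgeConv(d, d_color, d_hidden) for d in dims)

    def forward(self, x, color, idx):
        outs = []
        for layer in self.layers:          # layer l consumes layer l-1
            x = layer(x, color, idx)       # (B, N, d_hidden)
            outs.append(x)
        return torch.cat(outs, dim=-1)     # F_L: (B, N, 3 * d_hidden)
```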
3.2. Spatial Attention Module
The multi-channel graph convolutional network learns local features, and the spatial attention mechanism concentrates on crucial areas within those local features and handles the scene's critical elements.
As shown in Figure 4, each channel of the feature $F$ is first subjected to a mean operation and a max operation, respectively. The $F_{avg}$ obtained by the mean operation contains the basic shape and overall distribution of the features in the scene, and the $F_{max}$ obtained by the max operation contains the salient features of each object. Then, $F_{avg}$ and $F_{max}$ are concatenated to get $F_{cat}$, and the attention weights are obtained through the convolution and pooling layers. Spatial attention is shown in the following equation:

$$M_s(F) = \sigma\big(f\big([\mathrm{AvgPool}(F);\; \mathrm{MaxPool}(F)]\big)\big),$$

where $\mathrm{MaxPool}(\cdot)$ is the maximum pooling function and $F_{max}$ is the maximum pooling result over the channels, $\mathrm{AvgPool}(\cdot)$ is the average pooling function and $F_{avg}$ is the average pooling result over the channels, $f$ is the convolution function, and $\sigma$ is the nonlinear activation function.
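A sketch of this module in the CBAM style that the equation suggests; the kernel size and the final multiply-then-add residual are our assumptions:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """M_s(F) = sigmoid(conv([mean_c(F); max_c(F)])), applied per point."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, F):                     # F: (B, C, N) point features
        f_avg = F.mean(dim=1, keepdim=True)   # (B, 1, N) overall distribution
        f_max = F.max(dim=1, keepdim=True).values   # (B, 1, N) salient parts
        f_cat = torch.cat([f_avg, f_max], dim=1)    # stitched: (B, 2, N)
        weights = torch.sigmoid(self.conv(f_cat))   # non-negative weights
        return F * weights + F    # residual structure speeds feature transfer
```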
Spatial attention allows the analysis of details between objects at a local scale and uses residual structures to accelerate feature transfer. This module extracts attention information and makes the weights non-negative using the sigmoid function. Information about the same object in the immediate area can be retained by the spatial attention module, while interference information is ignored. By concentrating on the essential information within the data, the mechanism helps the model acquire pertinent features more efficiently, improves its performance, and enables it to better manage noise and interference. In our ablation study, removing the spatial attention module leads to a significant drop in both the mean accuracy (MAcc) and the overall accuracy (OAcc) of the model, indicating that point cloud noise has a detrimental impact on segmentation performance. Additionally, we noted that when the point cloud becomes excessively sparse, the spatial attention mechanism's ability to handle noise diminishes.
3.3. Output Layer
After obtaining the spatial attention weights, deeper, more semantic information is fused at multiple scales with the shallow features. As shown in Figure 5, an MLP is used to help the network converge. A nonlinear factor is introduced through the activation function $\mathrm{ReLU}(x) = \max(0, x)$. The nonlinear ReLU activation used in the MLP allows the neural network to learn and represent complex nonlinear relationships. Batch normalization [31] is also used, which helps to alleviate the vanishing gradient problem and accelerates the convergence of the model by normalizing the input of each layer to zero mean and unit variance.
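A sketch of such an MLP head with batch normalization and ReLU; the layer widths and the class count are placeholders, not the paper's configuration:

```python
import torch.nn as nn

def mlp_head(d_in=192, num_classes=13):
    """Shared per-point MLP: 1x1 Conv + BatchNorm + ReLU stacks that
    end in per-class scores for every point."""
    return nn.Sequential(
        nn.Conv1d(d_in, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
        nn.Conv1d(128, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
        nn.Conv1d(64, num_classes, 1),    # raw scores; log_softmax follows
    )
```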
After the convolution operation, the labels and categories are processed into dictionary form for the statistics and output of the results, as in the following equation:

$$D = \{c_1 : l_1,\; c_2 : l_2,\; \dots,\; c_k : l_k\},$$

where $c$ is the category, $l$ is the label, and $D$ is the dictionary.
As shown in Figure 6, the Softmax function makes one prediction over all categories for each point $x_i$ of the point cloud. These predictions are given different weight values, and the prediction with the largest weight is taken as the final result. The use of log_softmax can prevent data overflow and thus improve numerical stability, and it speeds up the derivative computation during training, making the backpropagation operation faster, as shown in the following equation:

$$\mathrm{log\_softmax}(x_i) = \log\!\left(\frac{e^{x_i}}{\sum_{j} e^{x_j}}\right) = x_i - \log \sum_{j} e^{x_j}.$$
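A brief sketch of why log_softmax is more stable than composing log with softmax; PyTorch's built-in F.log_softmax applies the max-shift identity above internally:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([1000.0, 0.0, -1000.0])    # extreme per-class scores

# Composing log(softmax(...)) underflows: the small probabilities round
# to exactly 0, and log(0) = -inf.
naive = torch.log(torch.softmax(scores, dim=0))

# log_softmax evaluates x_i - log(sum_j exp(x_j)) with a max shift,
# so every value stays finite.
stable = F.log_softmax(scores, dim=0)

print(naive)    # tensor([0., -inf, -inf])
print(stable)   # tensor([0., -1000., -2000.])
```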