3.1. Object Field
The convolutional neural network can be abstracted into a mathematical model Y = F(W, X), where X is the input, Y is the output, and W denotes the convolution kernel parameters. A CNN can be seen as a directed acyclic graph from X to Y. Its basic architecture consists of an input layer, convolutional layers, pooling layers, upsampling layers, and an output layer. The network structure F should therefore be designed so that it can express Y quickly and accurately. However, the pooling layers of a CNN extract only the intensity information of the object; spatial information, such as the offset coordinates of the maximum-response neuron and the object width and height, cannot be transmitted through the pooling layer. As a result, such a convolutional network has a weak ability to express spatial information. To let the convolutional network better express spatial information, we add an object output field that regresses the probability of the objects Y appearing in the image. Because the values of the central field lie in the range [0, 1], we transform the output layer of the neural network into the final field values through the logistic activation function. Through the object output field, we can further obtain the location information.
The object output field is the probability distribution map of the object over the image: the probability is largest at the object center, and the closer a pixel is to the edge, the lower the probability. This field can be expressed by the two-dimensional normal distribution formula in Equation (1), which attains its maximum probability when the field coordinates are at the center of the ellipse. According to this definition, we use the neural network to regress this field probability as an elliptical distribution.
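As a concrete illustration, the following sketch renders one such field; the rotated elliptical Gaussian form, the semi-axes a and b, and the inclination angle theta are our assumptions about how Equation (1) is parameterized.

```python
# Hypothetical rendering of an elliptical object field: a rotated 2-D
# Gaussian with value 1 at the object center, decaying toward the edge.
import numpy as np

def elliptical_field(h, w, cx, cy, a, b, theta):
    """Return an (h, w) field with values in [0, 1].

    cx, cy : assumed object center; a, b : assumed semi-axes;
    theta  : assumed inclination angle of the ellipse (radians).
    """
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    dx, dy = xs - cx, ys - cy
    # Rotate into the ellipse's principal axes.
    u = dx * np.cos(theta) + dy * np.sin(theta)
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    # Quadratic form of the ellipse; the peak value at the center is 1.
    return np.exp(-0.5 * ((u / a) ** 2 + (v / b) ** 2))

field = elliptical_field(128, 128, cx=64.0, cy=48.0, a=20.0, b=8.0, theta=np.pi / 6)
```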
On the basis of the backbone, we add the object field to the output section. We abstract the object field into a normally distributed elliptical field containing two components, the Center Field and the Direction Field. The architecture is shown in Figure 2. We give the total loss function as

$L = L_{cf} + \lambda_{df} L_{df}$,   (2)

where $L_{cf}$ and $L_{df}$ are defined in Equations (5) and (6), respectively.
Center Field. The intensity of the normal distribution is related to the elliptic equation, so we use the elliptic equation to describe the distribution of an object on the two-dimensional image. The output value of the Center Field indicates the probability that a pixel is close to the object center, so we define the range of each output element to be [0, 1]. The output intensity of each pixel is calculated by Equation (3), where the subscript 'ccp' abbreviates 'center class pixel': $G_{ccp}$ is the ground truth at pixel $p$ of the class-$c$ feature map of the Center Field, and $d_{cpi}$ is the distance from pixel $p$ in the class-$c$ feature map to the $i$-th object.
Figure 3 shows the distribution of object intensity in image space. To build this mathematical model, we describe how close a pixel is to the center of the object by Equation (4), and we give the loss function of the Center Field, $L_{cf}$, in Equation (5).
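Equation (3) is not reproduced here, but the definitions of $G_{ccp}$ and $d_{cpi}$ suggest a per-class map assembled from every object of that class. The sketch below is a hedged reading: the Gaussian falloff with scale sigma and the rule that overlapping objects keep the maximum value are both our assumptions.

```python
# Hedged sketch of a Center Field ground truth built from the distances
# d_cpi: each pixel is assumed to take the largest Gaussian falloff over
# the objects of its class. Both the falloff and the max rule are
# assumptions, not the paper's Equation (3).
import numpy as np

def center_field_gt(num_classes, h, w, objects):
    """objects: list of (class_id, cx, cy, sigma) tuples (sigma assumed)."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    gt = np.zeros((num_classes, h, w))
    for cls, cx, cy, sigma in objects:
        d2 = (xs - cx) ** 2 + (ys - cy) ** 2   # squared distance to object i
        gt[cls] = np.maximum(gt[cls], np.exp(-d2 / (2.0 * sigma ** 2)))
    return gt

gt = center_field_gt(2, 64, 64, [(0, 20.0, 20.0, 6.0), (1, 45.0, 30.0, 4.0)])
```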
Direction Field. The Direction Field describes the direction information of the object and requires the training dataset to provide direction annotations. We add $2 \times C$ channels to output the Direction Field of the $C$ object classes; the direction vector of a neuron in the field is $(v_x, v_y)$. The loss function of the Direction Field, $L_{df}$, is given in Equation (6), where the direction indicator equals 1 if an object has direction and 0 if not. Meanwhile, if a pixel belongs to at least one object in the field, its weight is 1; otherwise it is 0. We give the ground-truth direction as Equation (7), where $v^{gt}_{xcp}$ and $v^{gt}_{ycp}$ are the ground truth of the $x$ and $y$ components of the Direction Field at the object point $p$ of class $c$, respectively.
We define the default value of the back-propagation weight of the Direction Field, $\lambda_{df}$, in Equation (2). According to the theory of constrained neural networks, we unitize $v_x$ and $v_y$ to obtain the direction vector components $\hat{v}_x$ and $\hat{v}_y$ at pixel $p$ by

$\hat{v}_x = v_x / \sqrt{v_x^2 + v_y^2}, \quad \hat{v}_y = v_y / \sqrt{v_x^2 + v_y^2}$.   (8)
Regressing the object direction as a rotation angle leads to ambiguity of direction. To solve this problem, we use the unitization constraint to obtain the unitized object direction vector $(\hat{v}_x, \hat{v}_y)$, which is regressed to the ground-truth direction of the object. As shown in Figure 4, the output values $(v_x, v_y)$ of the two direction channels of a neuron in the object area are converted to $(\hat{v}_x, \hat{v}_y)$ by unitization.
The direction of each iron atom in a magnet determines the direction of the magnetic field. By the same principle, we find the direction of each point of the Direction Field inside the region obtained by the RFA to get the direction of the object. We then calculate the object direction by

$\bar{v} = \frac{1}{n} \sum_{i=1}^{n} \hat{v}_i$,   (9)

which averages the unitized directions $\hat{v}_i$ of the $n$ points in the object area. The detailed description of the object point search can be found in Section 3.2.
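A minimal NumPy sketch of the unitization in Equation (8) and the averaging in Equation (9); the small eps guard against a zero norm is our addition.

```python
# Sketch of Equations (8) and (9): unitize per-point direction outputs,
# then average them over the sampled object points.
import numpy as np

def unitize(vx, vy, eps=1e-8):
    """Equation (8); eps (our addition) avoids division by zero."""
    norm = np.sqrt(vx ** 2 + vy ** 2) + eps
    return vx / norm, vy / norm

def object_direction(vxs, vys):
    """Equation (9): mean of the unitized directions of n object points."""
    ux, uy = unitize(np.asarray(vxs, dtype=float), np.asarray(vys, dtype=float))
    return float(ux.mean()), float(uy.mean())
```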
In the DOTA [25] dataset, an object is described by four clockwise enclosing points $(p_1, p_2, p_3, p_4)$, where $p_1$ is the left front point relative to the object itself. The front end center point $p_f$ and the back end center point $p_b$ of the object can be obtained by Equation (10); then we get the main direction of the object by Equation (11).
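A sketch of a plausible reading of Equations (10) and (11): the front and back end centers as midpoints of the corner pairs, with the clockwise order taken as left-front, right-front, right-back, left-back. That ordering, and the normalized difference as the main direction, are our assumptions.

```python
# Hedged reading of Equations (10) and (11) for a DOTA four-corner box
# p1..p4, with the clockwise order assumed to be left-front, right-front,
# right-back, left-back relative to the object itself.
import numpy as np

def main_direction(p1, p2, p3, p4):
    p1, p2, p3, p4 = (np.asarray(p, dtype=float) for p in (p1, p2, p3, p4))
    p_f = (p1 + p2) / 2.0            # front end center point (Equation (10))
    p_b = (p3 + p4) / 2.0            # back end center point
    v = p_f - p_b
    return v / np.linalg.norm(v)     # unit main direction (Equation (11))
```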
Figure 5 shows the number and composition of the feature maps of the output layer, where C is the number of classes. In order to represent the two-dimensional direction, we output two Direction Field maps per class as well as the C Center Field maps.
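A hedged sketch of such an output head; the 1 × 1 convolutions on a generic backbone feature map are our assumption, while the channel counts and the logistic activation on the Center Field follow the text.

```python
# Sketch of the output head: C Center Field maps squashed to [0, 1] by a
# logistic (sigmoid) activation, plus 2 * C raw Direction Field maps that
# are unitized afterwards (Equation (8)).
import torch
import torch.nn as nn

class FieldHead(nn.Module):
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # 1x1 convolutions on the backbone feature map (our assumption).
        self.center = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.direction = nn.Conv2d(in_channels, 2 * num_classes, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        center = torch.sigmoid(self.center(feats))  # logistic activation
        direction = self.direction(feats)           # unitized later, Eq. (8)
        return center, direction
```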
3.2. Region Fitting Algorithm
In this section, we propose a field-based object region fitting algorithm called RFA, which processes the Center Field and the Direction Field. The output feature maps of the Center Field and the Direction Field number C and 2 × C respectively, representing the C object classes and their 2 × C direction components. At inference, the class of a pixel p is obtained by choosing the Center Field channel with the largest response at p among the C classes.
Getting the object point according to the Center Field. For each pixel of the Center Field whose output value exceeds a given threshold, we search the eight-neighborhood of the pixel for the largest intensity value that has not yet been visited, move to the position of that maximum, and repeat the search until no greater value exists around the current position; the coordinates of this point are recorded as the center point $p_{center}$.
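A sketch of this greedy hill-climbing search; the starting threshold value and the visited-set bookkeeping are our assumptions.

```python
# Greedy hill-climbing sketch of the center point search; the starting
# threshold (0.7) and the visited-set bookkeeping are assumptions.
import numpy as np

def hill_climb(field, y, x, visited):
    """Follow increasing intensity over 8-neighborhoods to a local maximum."""
    h, w = field.shape
    while True:
        best, by, bx = field[y, x], y, x
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and not visited[ny, nx] \
                        and field[ny, nx] > best:
                    best, by, bx = field[ny, nx], ny, nx
        if (by, bx) == (y, x):
            return y, x              # no greater neighbor: p_center found
        visited[y, x] = True
        y, x = by, bx

def find_centers(field, threshold=0.7):
    """Hill-climb from every sufficiently strong pixel of one Center Field."""
    visited = np.zeros(field.shape, dtype=bool)
    return {hill_climb(field, y, x, visited)
            for y, x in zip(*np.nonzero(field > threshold))}
```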
Getting the object edge point sets by searching the Center Field from the center point $p_{center}$. We use the center point as the starting point and obtain the point set of the edge area of each object through breadth-first search. As shown in Figure 3b, we spiral down from top to bottom to search the entire Center Field. The whole search process is as follows:
Step 1: Initialize a queue Q and put the starting point $p_{center}$ into Q.
Step 2: Take the head element $q$ out of Q, then push the 8 neighborhood pixels of $q$ that have not yet been searched into Q in descending order of their Center Field values.
Step 3: Repeat Step 2 until Q is empty. In addition, if the average intensity of all points in Q is less than 0.5, the loop is exited. Finally, we obtain the full point set corresponding to the starting point $p_{center}$.
Step 4: Sample the point set in the Direction Field to get $(v_x, v_y)$ at each point of the object, compute the unitized vector $(\hat{v}_x, \hat{v}_y)$ of each point by Equation (8), and then obtain the overall direction vector according to Equation (9).
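The following sketch covers Steps 1-3; a max-heap stands in for pushing neighbors in descending order of intensity, and the exit test uses the running mean of the collected points, our simplification of the average-intensity rule.

```python
# Best-first sketch of Steps 1-3: a max-heap replaces "push neighbors in
# descending order", and the exit uses the running mean of collected
# points (our simplification of the average-intensity test).
import heapq
import numpy as np

def grow_region(field, p_center):
    h, w = field.shape
    y0, x0 = p_center
    visited = np.zeros((h, w), dtype=bool)
    visited[y0, x0] = True
    heap = [(-field[y0, x0], y0, x0)]       # Step 1: seed the queue with p_center
    points, total = [], 0.0
    while heap:                              # Step 3: until the queue is empty...
        neg_val, y, x = heapq.heappop(heap)  # Step 2: take out the head element
        points.append((y, x))
        total += -neg_val
        if total / len(points) < 0.5:        # ...or the mean intensity drops below 0.5
            break
        for dy in (-1, 0, 1):                # push unvisited 8-neighbors
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and not visited[ny, nx]:
                    visited[ny, nx] = True
                    heapq.heappush(heap, (-field[ny, nx], ny, nx))
    return points
```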
Figure 6 illustrates the above algorithm. It can be seen from Figure 6b that, as the iteration progresses, the search range gradually expands and the Center Field intensity of each object gradually decreases during the region growing process. When the average value is around 0.25, the Center Field intensity tends to flatten and a sufficient amount of sampled data is available; at this point, ellipse fitting can be performed, and the region growing process of a single object ends. The algorithm converges quickly and collects enough points to regress the object ellipse parameters.
Calculating the elliptic equation of the object. An ellipse can effectively describe the regional distribution of an object of arbitrary aspect ratio in image space. We substitute the edge points into Equation (4) and use the Levenberg–Marquardt (LM) algorithm to solve the equations. In addition, we add a central restraint condition, given in Equation (12).
Since the value interval of the center point intensity is [0, 1], we define a default value for it to obtain a better effect. According to Equations (3) and (4), we can derive the modeled ellipse field $\hat{I}_p$, where $I_p$ is the output of the neural network at pixel $p$ of the Center Field. Then we give the Jacobian matrix equations as shown in Equation (13), where $a$ and $b$ are the major and minor axes of the ellipse respectively, and $\theta$ is the inclination angle of the ellipse. Because the ellipse is symmetric, the exact direction of the object needs to be further determined by the Direction Field. $x_0$ and $y_0$ are the offsets of the ellipse from the search center, and $\hat{I}_p$ is the value of the ellipse field at pixel $p$.
Then we define the residual as $r_p = I_p - \hat{I}_p$, where $I_p$ is the measured intensity. We compute the parameters $(x_0, y_0, a, b, \theta)$ by minimizing the Mahalanobis distance between the network output and the modeled ellipse field.
In addition, if an object has no direction, the ellipse fitting equation is defined without the direction constraint.
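A sketch of this fit using the Levenberg–Marquardt mode of scipy.optimize.least_squares; the Gaussian field model repeats our assumed elliptical form rather than the paper's Equations (12) and (13), and a plain least-squares residual stands in for the Mahalanobis weighting.

```python
# Ellipse field fit sketch: minimize r_p = I_p - I_hat_p over the five
# parameters with SciPy's Levenberg-Marquardt solver. The Gaussian model
# is our assumed field form, and plain least squares stands in for the
# Mahalanobis weighting.
import numpy as np
from scipy.optimize import least_squares

def ellipse_model(params, ys, xs):
    x0, y0, a, b, theta = params
    dx, dy = xs - x0, ys - y0
    u = dx * np.cos(theta) + dy * np.sin(theta)
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    return np.exp(-0.5 * ((u / a) ** 2 + (v / b) ** 2))

def fit_ellipse(points, intensities, init):
    """points: (n, 2) array of (y, x); intensities: network outputs I_p.

    Needs n >= 5 samples for the five parameters (x0, y0, a, b, theta).
    """
    ys, xs = np.asarray(points, dtype=float).T
    I = np.asarray(intensities, dtype=float)
    result = least_squares(
        lambda p: ellipse_model(p, ys, xs) - I,  # residuals r_p
        x0=np.asarray(init, dtype=float),
        method="lm",                             # Levenberg-Marquardt
    )
    return result.x                              # fitted (x0, y0, a, b, theta)
```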