**2. Related Work**

#### *2.1. Image Parsing by Traditional Methods*

To segment an image using traditional methods, the first step is to calculate the correlation between adjacent pixels in the scene image, and then segment the image into fragmented regions according to a certain convergence criterion [21,22]. The superpixel algorithm, for example, converts the image from the RGB color space to the CIE-Lab color space to form a five-dimensional vector (brightness, color A, color B, position x, and position y); the vector distance between two pixels, representing their similarity, is used to generate small segment patches [14,23]. A spatial pyramid descriptor fuses the gray, color, and edge-gradient features into one feature vector for an SVM classifier to recognize a traffic sign [24]. The image can also be converted into the YCbCr color space, and the local texture features in different channels matched against an artificially designed template to locate the position of a traffic sign [25]. Therefore, converting the image from the RGB color space into another feature space yields more information channels: brightness, texture, and other feature maps besides RGB color.
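For illustration, the five-dimensional distance used by such superpixel algorithms can be sketched as follows (a minimal Python example assuming OpenCV; the compactness weight `m` and grid interval `s` are illustrative assumptions, not values from the cited works):

```python
import numpy as np
import cv2

def slic_distance(img_lab, p, q, m=10.0, s=20.0):
    """SLIC-style 5-D distance between pixels p and q (assumed parameters m, s).

    img_lab: image already converted to the CIE-Lab color space (float32).
    p, q: (row, col) pixel coordinates.
    """
    d_lab = np.linalg.norm(img_lab[p] - img_lab[q])        # color distance (L, a, b)
    d_xy = np.hypot(p[0] - q[0], p[1] - q[1])              # spatial distance (x, y)
    return np.sqrt(d_lab ** 2 + (m / s) ** 2 * d_xy ** 2)  # combined 5-D distance

img = cv2.imread("scene.png")
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB).astype(np.float32)
print(slic_distance(lab, (10, 10), (12, 15)))
```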

To achieve the final segmentation, the fragmented regions need to be combined. The internal correlations among the adjacent regions are calculated according to different rules, and the regions are combined into local areas according to their correlation values. For example, the K-means clustering rules are used in different practical engineering applications, such as object detection for the synthetic aperture radar (SAR) image and the sea scene [26–28]. The MCG algorithm is another grouping strategy using random forests to combine the multiscale regions into highly accurate object candidates.

MCG can process one image (pixel size 90 × 150) in 7 s, and its mean Intersection over Union (IU) is about 80% [18,29,30]. The clustering rule determines the combination precision, and higher precision comes at the cost of proportionally longer computation time; as a result, MCG is suitable for the initial or post-processing of a fixed scene, not for real-time processing of temporally changing scenes. Therefore, to accelerate the whole scene segmentation process, we choose to improve the traditional methods in both the generation and the combination of the fragmented regions while maintaining the segmentation precision.

#### *2.2. Image Parsing by Deep Learning Methods*

Deep learning methods, e.g., various convolutional networks, have also been widely used in image parsing recently, since they are more robust to image translation, rotation, scaling, and distortion. Deep learning methods can be divided into three types: image classification [19], object detection [31], and pixelwise prediction [20]; the complexity of their network structures increases from image-wise to pixel-wise.

For the pixelwise segmentation of a scene, convolutional networks can be combined with superpixels, random effect models, and texture segmentation to generate pixelwise labels [32]; they can also be used as classifiers for feature maps containing RGB and depth information [33–35]. FCN can even perform feature extraction, combination, segmentation, and recognition at the same time, also achieving a pixelwise prediction [20].

Depending on the details of the FCN structure, the mean IU of FCN is about 80%, the accuracy is about 90%, and the number of parameters ranges from about 57 M to 134 M. The massive number of parameters and the corresponding computation require a GPU with large memory, leading to a high cost for practical applications. Therefore, we choose to use traditional methods to get the precise boundary of each local area first, and then use a simplified CNN only to classify the local areas, without the need for a GPU. However, reducing the network size lowers the classification accuracy, so extra care has to be taken in optimizing the network structure and the training process.

#### **3. Railway Scene Segmentation**

As shown in Figure 2b, a typical railway scene consists of different areas, including the track area, sky, catenary system, green belt, and ancillary buildings. The precision of the track area boundary directly affects the reliability of the judgement about whether an intrusion occurs or not. The track area is defined as the clearance area including rails, sleepers, and subgrades or high-speed railway slabs, as shown in Figure 2a. To avoid manual labeling, a fast and precise railway scene segmentation algorithm is proposed.

Figure 3 illustrates the outline of the proposed algorithm. We first calculate the feature distribution in a small image patch (pixel size 15 × 15) centered on each pixel, then evaluate that central pixel's probability of being a boundary point, and finally use the boundary weights to segment the image according to a fast combination rule. Unlike the traditional method, we use a smaller set of adaptive Gaussian kernels to extract the pixel color (*PC*) distribution and pixel similarity (*PS*) distribution of the image in different channels *c* and at different scales *s*. The Gaussian kernels are rotated by a set of adaptive angles θ calculated from the Hough transformation. The detailed procedure of boundary weight generation is described in the remainder of this section.

**Figure 3.** The procedure to segment an image into fragmented regions and combine them into local areas.

#### *3.1. Generation of Fragmented Regions*

Firstly, we convert the image into the CIE-Lab color space, obtaining three channels: brightness, color A, and color B. The image in each channel is scaled by *s* = (0.5, 1, 2). In each channel, the image is convolved with Gaussian kernels to get the color value distribution; each kernel has a specific orientation angle θ. Define *G*(*x*, *y*, θ, *c*, *s*) as the convolution result at pixel *P*(*x*, *y*), with angle θ, in channel *c*, at scale *s*. Then *PC*, the pixel color distribution, can be obtained by

$$PC(x, y, \theta) = \sum_{s} \sum_{c} \alpha_{c,s}\, G(x, y, \theta, c, s) \tag{1}$$

where α*c*,*s* is a weighting coefficient.
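As a minimal sketch of Equation (1) (Python, assuming OpenCV and NumPy): each Lab channel is convolved with an oriented first-derivative Gaussian kernel, and the responses are accumulated over channels and scales. Realizing the scale *s* by widening the kernel's σ instead of resampling the image, and using uniform weights α, are our simplifying assumptions.

```python
import numpy as np
import cv2

def oriented_gaussian_kernel(size=15, sigma=2.0, theta=0.0):
    """First-order derivative-of-Gaussian kernel at orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float32)
    xr = x * np.cos(theta) + y * np.sin(theta)     # rotate the coordinate frame
    yr = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    return (-xr / sigma ** 2) * g                  # derivative along the rotated x-axis

def pixel_color_distribution(img_lab, theta, scales=(0.5, 1, 2), alpha=1.0):
    """PC(x, y, theta) of Equation (1); scale is emulated by scaling sigma."""
    pc = np.zeros(img_lab.shape[:2], dtype=np.float32)
    for s in scales:
        kernel = oriented_gaussian_kernel(sigma=2.0 * s, theta=theta)
        for c in range(3):                         # brightness, color A, color B
            pc += alpha * np.abs(cv2.filter2D(img_lab[:, :, c], cv2.CV_32F, kernel))
    return pc
```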

Secondly, define *Similarity*(*i*, *j*) in terms of the maximum *PC* value of all pixels on the line *li*,*j* connecting two pixels *i* and *j* in a small image patch, by Equation (2); it represents the similarity between pixels *i* and *j*.

$$Similarity(i, j) = \exp\left(-\mathrm{Max}\{PC(x, y) \mid (x, y) \in l_{i,j}\}\right) \tag{2}$$

Calculate the similarity between each pixel *ix*,*y* in the patch and the central pixel *jcenter*, and assign *Similarity*(*ix*,*y*, *jcenter*) to each element *MS*(*x*, *y*) of the Matrix of Similarity *MS*; the assembled *MS* represents the similarity between each pixel in the image patch and the central pixel.

Calculate the top *t* eigenvalues and eigenvectors of *MS*. Assign the eigenvectors to the central pixel *P*(*x*, *y*), marked as *e*(*x*, *y*, *t*), forming a feature map *E* of the image that represents the similarity of adjacent points. Again, in each dimension of the feature map, convolve *E*(*t*) with a Gaussian kernel of orientation θ to get the similarity distribution. Define *g*(*x*, *y*, θ, *t*, *s*) as the convolution result at location *E*(*x*, *y*), with angle θ, in dimension *t*, at scale *s*. Then, the pixel similarity value distribution can be obtained as

$$PS(x, y, \theta) = \sum_{s} \sum_{t} \beta_{t,s}\, g(x, y, \theta, t, s) \tag{3}$$

where β*t*,*s* is a weighting coefficient.
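The steps between Equations (2) and (3) can be sketched as follows (Python, assuming scikit-image for line rasterization); treating *MS* directly as the 15 × 15 matrix whose eigenvectors are extracted follows the description above, and the magnitude-based eigenvalue ordering is our assumption:

```python
import numpy as np
from skimage.draw import line

def similarity_matrix(pc_patch):
    """MS per Equation (2): similarity of every patch pixel to the central pixel."""
    h, w = pc_patch.shape
    cy, cx = h // 2, w // 2
    ms = np.zeros((h, w), dtype=np.float32)
    for y in range(h):
        for x in range(w):
            rr, cc = line(y, x, cy, cx)                 # pixels on the line l_{i,j}
            ms[y, x] = np.exp(-pc_patch[rr, cc].max())  # Similarity(i_{x,y}, j_center)
    return ms

def central_pixel_features(ms, t=4):
    """e(x, y, t): eigenvectors of the t largest-magnitude eigenvalues of MS."""
    vals, vecs = np.linalg.eig(ms)                      # MS need not be symmetric
    order = np.argsort(-np.abs(vals))[:t]
    return np.real(vecs[:, order])
```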

Finally, *B*(*x*, *y*), the probability of the pixel *P*(*x*, *y*) being a boundary point, can be estimated by

$$B(x, y) = \sum_{\theta} PC(x, y, \theta) + \sum_{\theta} PS(x, y, \theta) \tag{4}$$

#### *3.2. Finding the Optimal Set of Gaussian Kernels*

In the process of estimating *B*(*x*, *y*), the convolution operations (Equations (1) and (3)) using Gaussian kernels with different orientation angles θ account for most of the computation, which can be reduced if a smaller set of Gaussian kernels is used. Traditional UCM algorithms choose a fixed set θ = (θ1, θ2, θ3, ...) of 8 or 16 values uniformly distributed from 0 to π. Here we propose to utilize the characteristics of the railway scene to find a much smaller set of useful orientation angles and thus a smaller set of Gaussian kernels. Usually, a railway scene has a clear vanishing point (VP), and the boundaries of many local areas are lines passing through the VP. Therefore, if we can automatically adjust the candidate θ for each specific scene to enhance the weights of the line boundary points of the relevant areas, we will be able to use a smaller set of θ to accelerate the process.

We propose to find the candidate θ by filtering the original image with a Canny kernel [36] and then converting the obtained texture feature into the Hough coordinate system using

$$\rho = x \cos \theta' + y \sin \theta', \quad -\frac{\pi}{2} < \theta' < \frac{\pi}{2} \tag{5}$$

As shown in Figure 4a, each curve in the Hough coordinate system stands for one point in the Cartesian coordinate system. If a group of curves (the colored curves in Figure 4a) intersect at one point in the Hough coordinate system, then the corresponding points (the blue points in Figure 4a) in the Cartesian coordinate system are collinear.

**Figure 4.** Using the Hough transformation to detect the most significant lines. (**a**) The intersection point of a group of curves in the Hough coordinate system means there is a group of collinear points in the Cartesian coordinate system. (**b**) The more curves intersect at a point in the Hough coordinate system, the brighter the intersection point is, meaning that there are more collinear points along this line in the Cartesian coordinate system. (**c**) The texture feature map filtered by the Canny filter. (**d**) The top four significant lines.

Let *H*(θ, ρ) be the number of curves intersecting at point (θ, ρ), and find the point with the maximum *H*(θ, ρ), which corresponds to the line in the Cartesian coordinate system containing the largest number of collinear points. The line can be expressed as

$$y = -\frac{1}{\tan \theta'} x + \frac{\rho}{\sin \theta'} = kx + b \tag{6}$$
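The angle-selection step built on Equations (5) and (6) might look like the sketch below (Python, assuming OpenCV, whose `HoughLines` returns lines sorted by accumulator votes); the Canny thresholds, vote threshold, and duplicate-suppression margin are illustrative assumptions:

```python
import numpy as np
import cv2

def candidate_orientations(img_gray, n_angles=4):
    """Top-n orientation angles theta from the Hough accumulator H(theta, rho)."""
    edges = cv2.Canny(img_gray, 50, 150)                   # texture feature map
    lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=60)
    thetas = []
    if lines is None:
        return thetas
    for rho_theta in lines[:, 0]:                          # strongest lines first
        theta = (90.0 - np.degrees(rho_theta[1])) % 180.0  # theta = 90 deg - theta'
        if all(abs(theta - t) > 5 for t in thetas):        # suppress near-duplicates
            thetas.append(theta)
        if len(thetas) == n_angles:
            break
    return thetas
```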

To find a small set of four orientation angles, one can take the four maxima of *H*(θ', ρ), e.g., the points with the highest 'brightness' in Figure 4b: θ' = 68°, 52°, 0°, and −88°. These are converted by θ = 90° − θ' into θ = 22°, 38°, 90°, and 178°, so that the values lie in the range from 0° to 180° (0–π). Based on the selected set of orientation angles, the Gaussian kernels can be constructed by rotating the base Gaussian kernel correspondingly. As shown in Figure 5, in the Cartesian coordinate system X-O-Y, point *P'*(*x'*, *y'*) rotates around the point *o*(*W*/2, *W*/2) by an angle θ to *P*(*x*, *y*), which can be formulated as

$$
\begin{aligned}
\begin{bmatrix} x & y & 1 \end{bmatrix}
&= \begin{bmatrix} x' & y' & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ -\frac{W}{2} & \frac{W}{2} & 1 \end{bmatrix}
\begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ \frac{W}{2} & \frac{W}{2} & 1 \end{bmatrix} \\
&= \begin{bmatrix} x' & y' & 1 \end{bmatrix}
\begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ \frac{W}{2}(1-\cos\theta+\sin\theta) & \frac{W}{2}(1-\cos\theta-\sin\theta) & 1 \end{bmatrix}
\end{aligned} \tag{7}
$$

**Figure 5.** Calculating the rotation matrix of the Gaussian kernel. The rotation center is on the kernel center.

Figure 6 shows the Gaussian kernels rotated by the optimal set of θ = 22°, 38°, 90°, and 178° obtained above, together with one rotated by θ = 112.5°, one of the eight uniformly distributed values commonly used in traditional UCM algorithms. The results show that the features of the horizontal catenary bracket, the vertical catenary column, and the inclined track are strengthened obviously by the first four filters, in contrast with the undifferentiated feature extraction of the fifth filter. Using 8 or 16 uniformly distributed values of θ for generality causes redundant calculation when applied to the railway scene. Therefore, adaptively choosing a smaller number of θ to filter the feature maps can accelerate the boundary weighting that generates the fragmented regions.
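In practice, the rotation of Equation (7) can be delegated to an affine warp about the kernel center, since `cv2.getRotationMatrix2D` encodes the same translate–rotate–translate composition. A minimal sketch follows (the base derivative-of-Gaussian construction is an assumption):

```python
import numpy as np
import cv2

def rotate_kernel(kernel, theta_deg):
    """Rotate a W x W kernel about its center o(W/2, W/2), per Equation (7)."""
    w = kernel.shape[0]
    m = cv2.getRotationMatrix2D((w / 2.0, w / 2.0), theta_deg, 1.0)
    return cv2.warpAffine(kernel, m, (w, w))

# Base kernel: first-order derivative of a Gaussian along x (assumed construction).
half, sigma = 7, 2.0
y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float32)
base = (-x / sigma ** 2) * np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))

# The adaptive set found above, instead of 8 or 16 uniformly distributed angles.
kernels = [rotate_kernel(base, t) for t in (22, 38, 90, 178)]
```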

**Figure 6.** Different kernels and the convolution results on the CIE-Lab brightness (L) channel. (**a**) First-order derivative Gaussian kernels rotated by five angles. (**b**) Results of the Gaussian convolution.

#### *3.3. Combination Rule*

The fragmented regions generated by the adaptive boundary detection are shown in Figure 7a. The higher the boundary weight is, the brighter the point is shown in the gray feature map, indicating that the point is more likely to become a boundary point.

A clustering rule based on both the boundary weight and the region size is proposed to combine the fragmented regions into local areas. The number of regions is first reduced by filtering out weak boundary points. The smallest remaining region is then combined with the neighboring region with which it shares the weakest boundary. This iteration is repeated until the statistical parameters meet the requirements. The process is as follows:

1. Let *B*(*m*) be the normalized value of the boundary point's weight *B*(*xm*, *ym*), where *m* = 1, 2, 3, ..., *M*, and *M* is the total number of boundary points:

$$B(m) = \mathrm{sigmoid}(B(x_m, y_m)) = \frac{1}{1 + e^{-B(x_m, y_m)}} \tag{8}$$


Figure 7g is the original railway scene image, and Figure 7h is the result of our segmentation algorithm. The railway scene contains only five categories of areas, and the shape of an area is usually large and radial. Therefore, we set the minimum area threshold *S* to 10% of the whole image and the maximum quantity threshold *Q* to 10, which prevents the remaining regions from being too fragmented. The remaining regions are adjusted to a standard size of 64 × 64 pixels with three RGB channels; after being classified by the CNN in Section 4, the regions with the same label are combined into one local area.
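The combination rule can be summarized by the following sketch (Python); the region and boundary bookkeeping structures are illustrative assumptions about the output of the boundary-detection stage:

```python
def combine_regions(regions, boundaries, q_max=10, s_min=0.10, total_area=1.0):
    """Merge fragmented regions into local areas.

    regions: {region_id: area}; boundaries: {(id_a, id_b): mean normalized
    boundary weight B(m) shared by the two adjacent regions}, with sorted keys.
    Both input structures are assumptions for this sketch.
    """
    def neighbors(rid):
        return [(b if a == rid else a, w)
                for (a, b), w in boundaries.items() if rid in (a, b)]

    while len(regions) > q_max or min(regions.values()) < s_min * total_area:
        smallest = min(regions, key=regions.get)       # smallest remaining region
        nbrs = neighbors(smallest)
        if not nbrs:
            break
        target = min(nbrs, key=lambda nw: nw[1])[0]    # weakest shared boundary
        regions[target] += regions.pop(smallest)       # merge the two regions
        for (a, b), w in list(boundaries.items()):     # re-route shared boundaries
            if smallest in (a, b):
                other = b if a == smallest else a
                del boundaries[(a, b)]
                if other != target:
                    key = tuple(sorted((other, target)))
                    boundaries[key] = min(w, boundaries.get(key, w))
    return regions
```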


**Figure 7.** The procedure of combining the fragmented regions into local areas. According to our adjustments and experiments for the railway scene, the scene image is set to a pixel size of 90 × 150, the number of adaptive θ is reduced to 4, the number of reserved areas *Q* is set to 10, and the smallest fragmented area *S* is set to 10% of the total size of the image. (**a**) Boundary with weight. (**b**) Distribution of boundary weight and quantity. (**c**) Deletion of the weak boundaries. (**d**) Fragmented regions. (**e**) Distribution of the region size and serial number. (**f**) Local areas after the fragmented regions are combined. (**g**) The original railway scene image. (**h**) The segmentation result.

#### **4. Local Area Recognition in Railway Scene**

To automatically label the local areas in real time without the help of a GPU, we design a simplified CNN with fewer layers and kernels. To compensate for the reduced accuracy, the convolution kernels are pre-trained, and a sparsity penalty term is added to the loss function to enhance the diversity of the feature maps.

#### *4.1. Structure of Simplified CNN*

Before designing and applying a simplified CNN, we first construct a dataset of local area images for training it. As shown in Figure 8, there are five basic categories of elements in a typical railway scene: track area, sky, catenary system, green belt, and ancillary buildings. To sample the dataset, five solid-line rectangles are manually defined to cover the five different areas. A simple extraction program takes image patches using a dotted-line box and assigns them the category of the enclosing rectangle. A group of constraint parameters controls the dotted box so that it extracts patches at random positions and random scales while remaining inside each rectangle. The image patches are adjusted to a pixel size of 64 × 64 with three RGB channels to assemble our five-category dataset of railway local areas. However, for the specific application of this paper, our target is the track area, for judging intrusion behavior; therefore, besides the 'track' label, we merge the other four elements into one category labeled 'others'. There are 9000 image patches in total, of which 5000 are used for training our net, 2000 for cross-validation, and 2000 for testing.
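The extraction step might be implemented as in the sketch below (Python, assuming OpenCV); the rectangle format `(x, y, w, h)` and the sampling ranges are assumptions:

```python
import random
import cv2

def extract_patches(img, rect, n=100, min_size=32, out_size=64):
    """Sample n random patches inside one labeled rectangle (x, y, w, h)."""
    x0, y0, w, h = rect
    patches = []
    for _ in range(n):
        size = random.randint(min_size, max(min_size, min(w, h)))  # random scale
        px = random.randint(x0, max(x0, x0 + w - size))  # random position, kept
        py = random.randint(y0, max(y0, y0 + h - size))  # inside the rectangle
        patch = img[py:py + size, px:px + size]
        patches.append(cv2.resize(patch, (out_size, out_size)))   # 64 x 64 sample
    return patches
```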

**Figure 8.** Collecting samples of local areas for CNN training. (**a**) Solid-line rectangles are delineated manually with labels, including the track area (red), sky (blue), catenary system (purple), green belt (green), and ancillary buildings (yellow). The dotted-line boxes are extractor windows. (**b**) The dataset containing two categories for training the CNN.

A simplified CNN structure is designed for fast recognition, which consists of an input layer, two convolution layers *C1* and *C2*, two mean pooling layers *S1* and *S2*, and a logistic classification layer, as shown in Figure 9.

**Figure 9.** Structure of the simplified CNN. The input image has a pixel size of 64 × 64 with three RGB channels. The output is one of the two category labels.
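The structure can be sketched in Keras as follows, assuming the 70/10-kernel configuration adopted in Section 4.3, sigmoid activations, and 2 × 2 mean-pooling windows (details not stated in the text are assumptions); note that the C2 feature maps come out at 29 × 29, matching Section 4.2:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),              # 64 x 64 RGB input
    layers.Conv2D(70, 3, activation="sigmoid"),  # C1: 70 kernels, 3 x 3 -> 62 x 62
    layers.AveragePooling2D(2),                  # S1: mean pooling -> 31 x 31
    layers.Conv2D(10, 3, activation="sigmoid"),  # C2: 10 kernels, 3 x 3 -> 29 x 29
    layers.AveragePooling2D(2),                  # S2: mean pooling -> 14 x 14
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),       # logistic layer: 'track' / 'others'
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1), loss="mse")
```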

As shown in Table 1, we conducted five experiments with different kernel quantities and sizes. Increasing the kernel size and quantity may increase the accuracy, but the accuracy remains below 80%. Although the railway scene is very simple, containing only several typical area categories, the shapes, colors, and texture features of areas belonging to the same category are still complex and varied. Therefore, the training process must be optimized to increase the accuracy.


**Table 1.** Experimental results of different CNN network structures.

#### *4.2. Optimization of the Simplified CNN*

To increase the accuracy, the kernels are pre-trained to extract better low-level features. The pre-training strategy is based on an autoencoder-decoder network; the weights *W*1*i*,3×3×3 of the first layer after training are applied as the convolution kernels in the first convolution layer *C1*, as shown in Figure 10 for the case of a kernel size of 3 × 3 in three RGB channels. During the training, 3 × 3 patches in three RGB channels are randomly selected from random railway scene images, as shown in Figure 11a. The resulting pre-trained kernels are shown in Figure 11b, where the patches and the kernels are all in three RGB channels.

**Figure 10.** Structure of the autoencoder-decoder network. The hidden layer contains 70 hidden neurons; *W* denotes the weight associated with the connection between neurons; and the network is trained to produce output the same as its input.

**Figure 11.** Pre-trained convolution kernels using the autoencoder-decoder algorithm. (**a**) The image patches are extracted from the left railway scene image for the kernel training. (**b**) The pre-trained kernels used in convolution layer *C1*.
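A sketch of the pre-training (Python/Keras), assuming 3 × 3 × 3 patches flattened to 27 inputs and 70 hidden neurons as in Figure 10 (the optimizer choice and training schedule are assumptions):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# N flattened 3 x 3 RGB patches in [0, 1]; random placeholder data for the sketch.
patches = np.random.rand(10000, 27).astype("float32")

inp = keras.Input(shape=(27,))
hidden = layers.Dense(70, activation="sigmoid", name="encoder")(inp)  # weights W1
out = layers.Dense(27, activation="sigmoid")(hidden)                  # decoder
autoencoder = keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")    # output reproduces the input
autoencoder.fit(patches, patches, epochs=20, batch_size=64, verbose=0)

# Reshape each hidden neuron's input weights into a 3 x 3 x 3 kernel for C1.
w1 = autoencoder.get_layer("encoder").get_weights()[0]   # shape (27, 70)
c1_kernels = w1.T.reshape(70, 3, 3, 3)
```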

After pre-training, the input weights of each neuron in the hidden layer are used as the initial weights of the kernels in the first convolution layer *C1* in Figure 9. The rest of the CNN in Figure 9 is randomly initialized and then trained using a backpropagation algorithm (stochastic gradient descent, SGD). To enhance the diversity of the feature maps, a sparsity penalty term is added to the loss function *J* as

$$J = \left\{ \frac{1}{P} \sum_{p=1}^{P} \frac{1}{2} \left[ h(e_p) - l_p \right]^2 \right\} + \tau \sum_{f=1}^{10} \left[ \chi \lg \frac{\chi}{\eta_f} + (1 - \chi) \lg \frac{1 - \chi}{1 - \eta_f} \right] \tag{9}$$

where

$$\eta_f = \frac{1}{P} \sum_{p=1}^{P} \sum_{u=1}^{29} \sum_{v=1}^{29} O^{(2)}_{f,e_p}(u, v) \tag{10}$$

where *ep* is the *p*-th input image, *lp* is its ground truth label, *P* is the total number of images in the dataset, *h*(*ep*) is the output label, τ is the weight of the sparsity penalty term, χ is the sparsity parameter (a small value close to 0, e.g., 0.05), η*f* is the average output of the *f*-th feature map in convolution layer *C2* (averaged over the training dataset), and *O*(2)*f*,*ep*(*u*, *v*) is the value at position (*u*, *v*) in the *f*-th feature map of input *ep* in the second convolution layer *C2*; the size of the feature map is 29 × 29 pixels.

In the process of backpropagation, the sparsity penalty term suppresses the average output of all feature maps in the second convolution layer *C2* while still allowing individual feature maps to respond strongly, which enhances the diversity of the feature maps and improves the accuracy. The learning rate is set to 0.1, with a decay of 0.001 after each iteration; the final value of *J* should be less than 0.05.
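The penalty term of Equation (9) can be sketched as follows (Python/NumPy); averaging η*f* over the batch and spatial positions, so that it stays in (0, 1) for the KL divergence, and the value of τ are our assumptions:

```python
import numpy as np

def sparsity_penalty(c2_maps, chi=0.05, tau=0.1):
    """KL-divergence sparsity term of Equation (9).

    c2_maps: C2 feature-map activations of one batch, shape (P, 29, 29, F).
    """
    eta = c2_maps.mean(axis=(0, 1, 2))           # eta_f, averaged per feature map
    eta = np.clip(eta, 1e-6, 1 - 1e-6)           # numerical safety for the logs
    kl = chi * np.log10(chi / eta) + (1 - chi) * np.log10((1 - chi) / (1 - eta))
    return tau * kl.sum()                        # added to the squared-error loss
```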

#### *4.3. Performance of the Simplified CNN*

As shown in Table 2, the accuracies of the simplified CNNs with different structures are all increased by the proposed optimization method, compared with the results of the traditional training method shown in Table 1. The simplified CNN with 70 kernels (3 × 3, 3 channels) in *C1* and 10 kernels (3 × 3, 70 channels) in *C2* is used for the proposed segmentation algorithm; its number of network parameters is only 0.02912 M. After the railway scene is segmented and classified, the regions with track labels are combined together as the final track area.


**Table 2.** Experimental results of different CNN network structures after the optimization.

#### **5. Experiments and Results**

#### *5.1. Railway Scene Dataset*

We collected images from 16 PTZ cameras at straight lines, curves, and bridges on the high-speed railway from Shanghai to Hangzhou, China. For each camera, images were collected from 10 different shooting angles and lenses, under different illumination conditions from 8:00 a.m. to 5:00 p.m. Examples are shown in Figure 12a. There are 1760 scene images in total, of which 1000 are used for training, 400 for cross-validation, and 360 for testing. These datasets are used to generate the datasets for our simplified CNN (Section 4.1) and the dataset for training the FCN in the comparison experiments.

**Figure 12.** Samples in the railway dataset. (**a**) Images from PTZ cameras under different conditions. (**b**) Ground truth of the track area.

#### *5.2. Modification of the Workflow for the Case of Small Track Portion*

For cameras on line sections, the track area takes up only a small portion of the scene image, while for those at tunnel entrances and on bridges over the railway line, the track area usually takes up most of the scene. As shown in Figure 12b, the red track area takes up about 25–70% of the whole scene image for different cameras. This means the complete-processing workflow (Sections 3 and 4) would waste a lot of time calculating the boundaries between the 'others' areas (Figure 13b) rather than focusing on the potential track area, shown as the red dotted-line rectangle in Figure 13a. To find the potential track area and further reduce the segmentation calculation, we design a partial-scanning workflow that locates the potential position of the track area before the segmentation and classification by roughly scanning over the railway scene with the proposed CNN. As shown in Figure 13c, we first divide the railway scene image into 6 × 10 cells (yellow cells); each cell and its peripheral zone (red dotted-line rectangle) are resized to 64 × 64 pixels, and the classified label is taken as the representation of the central cell (red area in Figure 13c). The proposed CNN classifies these cells, and the output labels identify the potential track area roughly, as the red area shown in Figure 13d. A minimum enclosing dotted-line rectangle then adjusts the potential track area into a regular shape, as shown in Figure 13d.
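The partial-scanning step might look like the following sketch (Python, assuming OpenCV and a `classify` callable wrapping the simplified CNN; the peripheral margin size is an assumption):

```python
import cv2

def scan_track_area(img, classify, rows=6, cols=10, margin=8):
    """Locate the potential track area by coarse 6 x 10 cell classification."""
    h, w = img.shape[:2]
    ch, cw = h // rows, w // cols
    track_cells = []
    for r in range(rows):
        for c in range(cols):
            y0, x0 = max(0, r * ch - margin), max(0, c * cw - margin)
            y1 = min(h, (r + 1) * ch + margin)     # cell plus its peripheral zone
            x1 = min(w, (c + 1) * cw + margin)
            patch = cv2.resize(img[y0:y1, x0:x1], (64, 64))
            if classify(patch) == "track":         # label represents the central cell
                track_cells.append((r, c))
    if not track_cells:
        return None
    rs = [r for r, _ in track_cells]
    cs = [c for _, c in track_cells]
    # Minimum enclosing rectangle of the 'track' cells, in pixel coordinates.
    return (min(cs) * cw, min(rs) * ch, (max(cs) + 1) * cw, (max(rs) + 1) * ch)
```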

The strategy of the partial-scanning workflow reduces the segmentation area but spends extra scanning time. Thus, the overall processing time depends on the proportion of the track area in the railway scene, as shown in Table 3; the numbers on the left refer to the scene images in Figure 12, from left to right. If the track area takes up more than 88.1% of the railway scene, the partial-scanning workflow performs worse than the complete-processing workflow. The two workflows can be chosen for different cameras: for those with a short-focus lens whose near-field scene is full of track area, the complete-processing workflow should be used; for those with a long-focus lens where the track area takes up only a small part of the scene, the partial-scanning workflow should be used.


**Figure 13.** Rough scanning over the scene to find the potential track area. (**a**) Segmentation result of the whole railway scene image. (**b**) Different local areas. (**c**) Rough scanning of the railway scene image using the proposed CNN. (**d**) The area in the red dotted rectangle is the potential track area, which reduces the segmentation calculation by three-quarters.

**Table 3.** Calculation time of the comparison experiments with different workflows to segment the railway scene.

