**1. Introduction**

As public expectations for railway safety continue to rise, railway intrusion detection systems require more effective technology to detect objects intruding into the track area and to provide real-time alarm information to the command center [1]. Railway intrusion is defined as an object entering the track area and endangering the safe operation of trains. Typical intruding objects include rocks falling from a hillside beside the railway line or at a tunnel entrance, pedestrians, and vehicles and their cargo that remain in a railroad crossing area or fall from a bridge over the railway.

Depending on the detection principle, railway intrusion detection systems can be divided into two categories: the contact type and the non-contact type. A representative of the contact type is the protective metal net installed along the line to block objects from intruding into the clearance; the system sends an alarm when physical deformation of the net is measured by a dual-power sensor [2] or fiber grating sensor [3,4]. Systems based on non-contact measurement technology use an infrared sensor [5] or laser scanner [6,7] to obtain the size and location of the intruding object [8]. Video surveillance is also widely used as another kind of non-contact intrusion detection system because of its large monitoring area, convenient installation and maintenance, and readily observable results [9]. As shown in Figure 1, we established an intrusion detection system for the Shanghai–Hangzhou high-speed railway in China. The system contains data processing servers, communication networks, and 1550 cameras, including both fixed and PTZ cameras.

**Figure 1.** Structure of the railway intrusion detection system.

The threat level of an intrusion is evaluated from the category, location, and moving trajectory of the object with respect to the track area. Information about the intruding object can be extracted by image processing algorithms, e.g., density-based spatial clustering of applications with noise (DBSCAN) [10], fast background subtraction (FBS) [11], Kalman filtering (KF) [12], and principal components analysis (PCA) [13]. DBSCAN uses extremum points of the scan sequence as core objects for clustering, and the movement and distribution characteristics of each cluster are used to judge whether it is a train or another foreground object. FBS projects the scene image onto one dimension (the x or y dimension) and locates the foreground object by the change in the peak value. The KF approach classifies the objects acquired via image background subtraction with a support vector machine (SVM), and then uses a Kalman-filter tracking algorithm to analyze the behavior and moving trend of the objects. The PCA approach projects statistics of the scene image and of the successive images into a transformation space and calculates the Euclidean distance between them; regions whose distance exceeds a threshold are considered to belong to moving objects. Most of the above-mentioned algorithms focus only on the foreground object rather than on the track area in the background. Therefore, the position and boundary of the track area are still delineated manually in advance, as shown in Figure 2.
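The projection idea behind FBS can be illustrated with a minimal sketch. This is not the algorithm of [11]; it is a simplified illustration that assumes a static grayscale background image, a hand-picked threshold, and hypothetical helper names (`column_projection`, `locate_change`). Each image is collapsed onto the x dimension by summing its columns, and the columns whose projected value changes strongly relative to the background are reported as the horizontal extent of a foreground object.

```python
def column_projection(image):
    """Sum pixel intensities down each column, giving a 1-D profile along x."""
    return [sum(col) for col in zip(*image)]


def locate_change(background, frame, threshold):
    """Return (first, last) column indices whose projected value differs from
    the background projection by more than `threshold` (an assumed, hand-tuned
    value), or None if no column changes that much."""
    bg_profile = column_projection(background)
    fr_profile = column_projection(frame)
    changed = [
        i
        for i, (b, f) in enumerate(zip(bg_profile, fr_profile))
        if abs(f - b) > threshold
    ]
    if not changed:
        return None
    return changed[0], changed[-1]


# Toy 4x6 grayscale scene: uniform background, bright object in columns 2-3.
background = [[10] * 6 for _ in range(4)]
frame = [row[:] for row in background]
for r in (1, 2):
    for c in (2, 3):
        frame[r][c] = 200

span = locate_change(background, frame, threshold=100)
print(span)
```

Applying the same function to the transposed images would give the vertical extent. Note that the sketch locates only the object itself; deciding whether the object lies inside the track area still requires the track boundary.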

**Figure 2.** Railway scene and the manually labeled local areas. The image quality is susceptible to external influences, such as illumination, weather, and even dust on the lens. (**a**) The red area is the track area to be surveilled. The track area includes the rails, sleepers, subgrades, or high-speed railway slabs. (**b**) The different areas of the railway scene are labeled manually with different colors, including the track area (**red**), sky (**blue**), catenary system (**purple**), green belt (**green**), and ancillary buildings (**yellow**). The precision depends on the patience of the human operator.

The precision of the track area boundary directly affects the reliability of intrusion detection. With an increasing number of surveillance cameras along the railway line, and especially because some PTZ cameras temporarily change their focal lengths and angles for different applications, manual labeling has become time-consuming and laborious. Thus, for the efficiency of the railway intrusion detection system, a scene segmentation algorithm is needed to recognize the track area and delineate its boundary automatically. The algorithm will be applied to initialize the surveillance areas after all cameras are installed, and to relearn them when the operator adjusts the PTZ cameras. Practical engineering application also imposes several requirements: the image parsing algorithm should not only have good segmentation precision and classification accuracy, but also process temporarily changing scenes quickly. In addition, the algorithm should have a small number of parameters and be easy to deploy on data processing servers with different hardware configurations, and even on embedded surveillance equipment in the field.

Currently, there are two ways to parse a scene. The traditional way segments the scene image into superpixels, ultrametric contour maps (UCM), or other fragmented regions [14,15], and then combines them into candidate objects or local areas based on Markov random fields (MRFs), conditional random fields (CRFs), multiscale combinatorial grouping (MCG), or other rules [16–18]. These traditional methods generate fragmented regions with precise boundaries, but they require time-consuming iterative calculations to form the best candidate for an object or a local area. In addition, they cannot produce category information for the objects. The second way relies on deep neural networks, e.g., fully convolutional networks (FCNs) [19,20], which perform feature extraction, combination, segmentation, and recognition in a single process. One drawback of FCNs is that the generated boundary line is usually a smooth curve, which misses the corners of the track area. In addition, an FCN has a large memory footprint and needs a GPU to accelerate its large amount of computation.

In this paper, we propose an adaptive segmentation algorithm that takes advantage of both approaches while avoiding their shortcomings. As in the traditional methods, we extract the texture distribution of the image to generate boundary points with different weights, which are used to segment the image into small fragmented regions; the regions are then combined into local areas with precise boundaries. Finally, we apply a specially designed convolutional neural network (CNN) to classify the areas without the need for a GPU. Our main contributions include:


The rest of this paper is organized as follows. In Section 2, we review related work on image parsing algorithms. Section 3 explains the proposed fast image segmentation process. Section 4 explains the proposed simplified CNN structure and the optimization process. Section 5 presents and discusses the experimental results. The last section summarizes our conclusions.
