When video surveillance systems are deployed on urban roads, the statistics derived from the video images are simple and convenient to use, the volume of detection and statistical data is very large, and data collection does not interfere with normal vehicle travel. In probability theory and statistics, a Gaussian model can be viewed as a stochastic process, that is, a collection of random variables indexed by time or space. It offers a simple way to generate background images. When the background is maintained with a Gaussian background-update method, foreground targets can be extracted by background differencing, with shadow elimination and morphological processing to reduce the effects of illumination changes and noise [5,6]. This has yielded more accurate traffic statistics, but the accuracy is still not ideal when traffic flow is heavy. Previously, Li et al. used the ViBe algorithm, which initializes the model from the first frame of a video, sets the foreground threshold and background candidate conditions, updates the model, and extracts the background from the second frame of the video [7]. However, this method relies heavily on empirically chosen settings, and it must scan and evaluate the video image several times to classify a single pixel.
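The Gaussian background update and background difference steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies: each frame is a flat list of grayscale values, every pixel has a single Gaussian, and the class name `GaussianBackground` and the parameters `alpha`, `k`, `init_var`, and `min_var` are hypothetical choices. Shadow elimination and morphological processing are omitted.

```python
class GaussianBackground:
    """Per-pixel single-Gaussian background model (illustrative sketch)."""

    def __init__(self, first_frame, alpha=0.05, k=2.5, init_var=225.0, min_var=4.0):
        self.alpha = alpha      # learning rate of the background update (assumed value)
        self.k = k              # foreground threshold in standard deviations (assumed value)
        self.min_var = min_var  # lower bound that keeps the variance from collapsing
        self.mean = [float(p) for p in first_frame]
        self.var = [init_var] * len(first_frame)

    def apply(self, frame):
        """Return a foreground mask for `frame` and update the background."""
        mask = []
        for i, pixel in enumerate(frame):
            diff = pixel - self.mean[i]
            # Background difference: a pixel is foreground when it lies
            # more than k standard deviations from the background mean.
            is_foreground = diff * diff > (self.k ** 2) * self.var[i]
            mask.append(is_foreground)
            if not is_foreground:
                # Only background pixels refresh the model, so passing
                # vehicles are not absorbed into the background.
                self.mean[i] += self.alpha * diff
                blended = (1 - self.alpha) * self.var[i] + self.alpha * diff * diff
                self.var[i] = max(self.min_var, blended)
        return mask


background = GaussianBackground([100, 100, 100, 100])
foreground_mask = background.apply([102, 99, 200, 100])
print(foreground_mask)
```

In this toy run only the third pixel, which jumps from 100 to 200, is flagged as foreground; the other pixels nudge the background mean slightly toward their new values.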
Tan and Dong trained a vehicle classifier to detect vehicles within a preset area of a video [8]. This method is simple and easy to implement, but the study considers only a few factors, and the classifier does not generalize well. The studies in refs. [9,10] describe the virtual coil method, which manually sets candidate areas in a video and detects and counts traffic flow from changes in the coil state. It simplifies calculations and avoids damaging the road surface, but it is not very effective at detecting multi-lane traffic flows. Other studies examine vehicle-flow statistics based on target tracking [11,12,13]. Vehicles detected in adjacent frames of a video are matched and tracked by their specific characteristics to obtain each vehicle's motion trajectory. The images obtained from the trajectories are used to detect and calculate traffic flow, but most of the data are noise caused by non-motorized movement and other factors. These limitations show that current methods for gathering traffic-flow statistics remain problematic.
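As an illustration of the virtual coil idea mentioned above, the sketch below counts vehicles from changes in the state of a single coil. It is a simplified example, not the procedure of the cited studies: the input is assumed to be the per-frame fraction of foreground pixels inside one manually placed coil region, and the function name `count_vehicles` and the two hysteresis thresholds are hypothetical.

```python
def count_vehicles(occupancy, on_threshold=0.5, off_threshold=0.2):
    """Count vehicles crossing one virtual coil.

    occupancy: per-frame fraction (0.0 to 1.0) of the coil area covered
    by foreground pixels. A vehicle is counted each time the coil switches
    from free to occupied. Two thresholds (an assumption of this sketch)
    keep a flickering mask from producing double counts.
    """
    count = 0
    occupied = False
    for ratio in occupancy:
        if not occupied and ratio >= on_threshold:
            occupied = True   # coil state changes: free -> occupied
            count += 1
        elif occupied and ratio <= off_threshold:
            occupied = False  # coil state changes: occupied -> free
    return count


ratios = [0.0, 0.1, 0.6, 0.8, 0.4, 0.1, 0.0, 0.7, 0.3, 0.05]
print(count_vehicles(ratios))  # two free -> occupied transitions, so 2
```

A single coil of this kind sees only one lane-sized region, which is consistent with the multi-lane limitation noted above: two vehicles side by side in one region would be counted once.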
Today, with the aid of artificial intelligence, deep learning is used for target detection, semantic segmentation, image classification, and other identification tasks in various scenarios. The region-based approach is the most common for target detection and includes the R-CNN, SPP-net, Fast R-CNN, and Faster R-CNN algorithms [14,15,16,17,18,19,20,21]. R-CNN uses selective search to extract regions from an image; region extraction is efficient, but training is cumbersome and testing is slow [14,22]. SPP-net removes the fixed-size restriction on the network input, but the training features are stored on disk, which limits the detection speed [15]. Fast R-CNN combines the R-CNN and SPP-net concepts and improves detection speed and accuracy, but it is still relatively slow [15,16]. Faster R-CNN replaces selective search with a region proposal network (RPN), which greatly shortens the time needed to extract regions from images, but it achieves only seven frames per second for video detection, falling short of real-time detection requirements [
20,21,22]. To improve detection speed and accuracy, researchers built on the DenseNet model [23] to propose PeleeNet, a lightweight model for mobile devices, which achieved the highest target classification accuracy [24]. Their follow-up work combined PeleeNet with an optimized SSD (Single Shot MultiBox Detector) to develop a real-time target detection system for mobile devices, with low computational cost and reliable detection performance [24,25]. These ideas have helped and inspired subsequent researchers. Another study proposed a pedestrian detection method that combines the histogram of oriented gradients (HOG) with the discrete wavelet transform (DWT). The method uses motion amplitude to set the region of interest (ROI), which improves detection speed. HOG and DWT are used to extract multiple pedestrian features, and the ROI is then classified by a support vector machine (SVM) using these features [21,26]. Subsequently, the lightweight RFBNet model was proposed for target detection, improving both speed and accuracy [27]. The model adds dilated convolution to an Inception-based structure to form a Receptive Field Block (RFB) module that enlarges the receptive field, and it introduces RFB into the SSD network to strengthen the network's feature extraction [25,27]. Related work redefines target detection as a large-scale, rather than sparse, distribution problem over bounding-box probabilities and proposes a directed evaluation mechanism for the sparse sampling distribution. This mechanism can be applied to end-to-end detection models and improves their detection performance, but the detection accuracy is still not satisfactory [
21,28]. For this real-time monitoring problem, we need a model that is easy to optimize, with an algorithm that is fast and computationally light. In this paper, we present YOLO-UA, a regression-based, high-performance algorithm for real-time detection and statistics gathering of vehicle flows that uses the You Only Look Once (YOLO) algorithm as its base [29,30,31]. We employ the Intersection Over Union (IOU) and Generalized Intersection Over Union (GIOU) metrics to optimize the loss function directly, and we show that the YOLO model can be modified and improved to enhance traffic flow monitoring [31,32]. To address the poor localization of the YOLO model and the resulting low accuracy of vehicle statistics, we fine-tune the model structure and optimize the loss function with GIOU to improve target localization accuracy. After optimization, the model can be applied more reliably to real-time vehicle statistics from video in real scenes. By optimizing both the model and the algorithm, we improve traffic flow detection.
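For readers unfamiliar with the two metrics named above, the following sketch computes IOU and GIOU for a pair of axis-aligned boxes and the commonly used loss 1 − GIOU. It is only an illustration: the box format `(x1, y1, x2, y2)` and the function names are assumptions, and the way the loss is integrated into the YOLO-UA training objective is not shown here.

```python
def iou_and_giou(box_a, box_b):
    """Return (IOU, GIOU) for two boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2, so both boxes have positive area.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero area when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    # GIOU subtracts the share of the enclosing box covered by neither input.
    giou = iou - (enclose - union) / enclose
    return iou, giou


def giou_loss(predicted, target):
    """Loss in [0, 2]; it is 0 only when the two boxes coincide."""
    return 1.0 - iou_and_giou(predicted, target)[1]


print(iou_and_giou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlapping boxes
print(iou_and_giou((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes
```

For the disjoint pair, IOU is 0 regardless of how far apart the boxes are, whereas GIOU is negative and moves toward −1 as the gap grows. That distance-sensitive signal is what makes GIOU usable as a localization loss when a predicted box does not yet overlap its target.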