To demonstrate the performance of the proposed method, we conducted experiments on datasets of two different physical scenarios: FDB and FTB. As can be seen in Table 1 and Figure 7, FDB comprised only synthetic data, whereas FTB comprised both synthetic and real data. We generated the synthetic data using Unreal Engine 4 and the real data using a Sony RX100 camera at 24 fps with a shutter speed of 1/25 s. The FDB dataset was divided into FDB1 to FDB3 according to the number of objects, and the FTB dataset was divided into FTB-synthetic and FTB-real, which comprised synthetic and real data, respectively. FDB1 and FDB2 each comprised 1000 video sequences, and FDB3 comprised 2800 video sequences. FTB-synthetic comprised 4800 video sequences, 100 of which were used for testing. FTB-real comprised 94 video sequences, 30 of which were used for testing. Each video sequence in the FDB dataset consisted of 20 frames of 64 × 64 pixels, and each video sequence in the FTB dataset consisted of 10 frames of 160 × 120 pixels. The synthetic datasets had labels for the object states and shutter speed, whereas the real data had no labels. Examples of the FDB and FTB datasets can be seen in Figure 7, which shows the frames constituting a video sequence of each dataset sequentially along the time-steps. FTB-real (BS) denotes the background-subtraction result of FTB-real.
4.1. FDB
The proposed model successfully recognized the object states and predicted the future object states and frames from a motion-blurred video in experiments using the FDB dataset. We performed experiments to predict the next 18 frames and the future object states given only the first two frames of each 20-frame FDB sequence.
Table 2 and Figure 8 show the sum of squared error (SSE) and the structural similarity (SSIM) between the predicted future frames and the ground-truth frames in the experiments using FDB1, FDB2, and FDB3, as well as the position error (PE) between the predicted and ground-truth object positions. For the PE, simply comparing the raw positions is not a valid performance indicator, because the objects in FDB collide elastically with the walls in all directions and continuously drift through a limited space. Therefore, we compared unfolded positions: whenever an object collides with a wall and its velocity changes, the velocity is reflected symmetrically about the collided wall to compute a new position. In other words, the PE is calculated after correcting the position as if the object had passed through the wall and kept moving in one direction (Figure 9).
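To make the unfolding concrete, the following is a minimal NumPy sketch of this correction for one coordinate axis, assuming a box spanning [0, 1], perfectly elastic collisions, and that a collision manifests as a sign flip of the per-step velocity (the function names are ours, not the paper's):

```python
import numpy as np

def unfold_positions(pos, vel, lo=0.0, hi=1.0):
    # Running affine map T(x) = s * x + c from bounced to unfolded coordinates.
    # Each wall hit composes a reflection about the image of the collided wall,
    # so the object appears to pass through the wall and keep moving straight.
    s, c = 1.0, 0.0
    out = np.empty_like(pos)
    out[0] = pos[0]
    for t in range(1, len(pos)):
        if vel[t] * vel[t - 1] < 0:              # velocity sign flip => collision
            w = hi if vel[t - 1] > 0 else lo     # wall that was hit
            s, c = -s, 2.0 * s * w + c           # fold the reflection into T
        out[t] = s * pos[t] + c
    return out

def position_error(pred_pos, pred_vel, gt_pos, gt_vel):
    # PE: mean squared difference between the unfolded trajectories.
    # All inputs are float arrays of shape (T, 2): time-steps by axes.
    err = 0.0
    for a in range(gt_pos.shape[1]):
        p = unfold_positions(pred_pos[:, a], pred_vel[:, a])
        g = unfold_positions(gt_pos[:, a], gt_vel[:, a])
        err += np.mean((p - g) ** 2)
    return err
```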
As shown in Table 2, the error increases with the number of objects in the scene: FDB1, with a single object, yields the lowest SSE and PE and the highest SSIM, and as the number of objects increases, SSE and PE grow while SSIM decreases. This shows that the accuracy of motion prediction decreases as the number of objects increases. The graph in
Figure 8 shows the errors at each prediction time-step. SSE and PE tend to increase, and SSIM to decrease, as the prediction time-step increases, meaning that the accuracy of motion prediction drops for more distant time-steps. This is because errors accumulate through the iterative process of reusing the predicted object states as the input of the prediction network. However, as can be seen in the “Ours” row of Figure 10, our model predicted the intricate movements of the objects reasonably, even as the number of objects and the prediction time-step increased. With one object in the scene, our model predicted future frames similar to the ground-truth up to the final 18th frame; with two or more objects, although the error was larger than in the single-object case, our model still made reasonable predictions.
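The iterative rollout described above can be summarized in a few lines; a schematic sketch in which the component names are placeholders rather than the paper's actual modules:

```python
def rollout(encoder, predictor, renderer, frames, horizon=18):
    # Encode the two observed blurred frames into an object state, then
    # repeatedly feed each predicted state back into the predictor.
    state = encoder(frames[:2])
    predictions = []
    for _ in range(horizon):
        state = predictor(state)             # reused as input => errors accumulate
        predictions.append(renderer(state))  # decode the state into a frame
    return predictions
```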
Because no prior study has addressed predicting the future frames and motion of an object in a video exhibiting intense motion blur caused by fast object movement, we compared our performance with that of recent studies on predicting the future frames of a video. Using the FDB dataset, we trained the PAIG model [9], which predicts future object states and frames from a sharp video, and the Eidetic 3D LSTM model [45], which predicts only frames without predicting object states, and compared their test results with those of our model. First, we trained the PAIG model to predict future frames from four blurred frames of FDB1, because PAIG requires at least four input frames to predict future motion. We compared the results of PAIG with those of our model under three metrics: PE, SSE, and SSIM. We also compared the execution time, i.e., the time taken to predict one frame.
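For reference, the SSE and SSIM can be computed per frame as follows (a sketch assuming grayscale float frames scaled to [0, 1]; SSIM uses scikit-image):

```python
import numpy as np
from skimage.metrics import structural_similarity

def frame_metrics(pred, gt):
    # SSE: sum of squared pixel differences; SSIM: structural similarity.
    # Assumes grayscale float frames in [0, 1]; for RGB pass channel_axis=-1.
    sse = float(np.sum((pred - gt) ** 2))
    ssim = structural_similarity(pred, gt, data_range=1.0)
    return sse, ssim
```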
As can be seen in the top row of Table 3, although PAIG received more frames than our model, our results were better on all three metrics. Our model achieved a PE 33 times lower than that of the comparative model, as well as a lower SSE and a higher SSIM, which means that our model predicted more accurate frames. The PAIG model could neither correctly recognize and predict the object states nor reconstruct the frame when motion blur was present on the object.
Figure 10 shows the frames predicted by our model and the compared models. Our model predicted correctly up to the last time-step, whereas the comparative model failed to predict correctly throughout. Our execution time was longer than that of the PAIG model, but the processing time per frame was about 0.012 s, which allows real-time processing for videos of about 80 fps. Moreover, since our model has not been optimized for execution speed, there is room for further improvement. The Eidetic 3D LSTM model was trained using FDB2, and the results were measured using the previous two metrics (i.e., SSE and SSIM) and the execution time. We omitted the PE, because the Eidetic 3D LSTM model does not directly predict object states. As shown at the bottom of Table 3, our model performed better on both metrics. As can be seen in the “FDB2” row of Figure 10, our model made reasonable predictions similar to the ground-truth, although the error increased with the prediction time-step. The Eidetic 3D LSTM made relatively reasonable predictions at first, but its errors increased rapidly as the prediction time-step increased, and after several time-steps, failures such as objects disappearing or not being drawn occurred. For FDB2, the execution time of our model was longer than that of the comparative model, but the processing time per frame was about 0.023 s, which allows real-time processing for videos of about 40 fps.
Even if we film the same scene using the same camera at the same frame rate, different degrees of motion blur can occur depending on the shutter speed of the camera. Thus, a model trained on a dataset consisting only of videos with a particular shutter speed is difficult to apply to videos with other shutter speeds. Therefore, we organized FDB into videos assuming various shutter speeds so that our model could be applied well regardless of the shutter speed. To compare the accuracy of our model according to the degree of motion blur, we constructed an FDB1 test set with four different shutter speeds and compared the errors: videos with strong, moderate, and weak motion blur, and with no motion blur (sharp). Our model made reasonable predictions from strong to weak motion blur, but not on sharp video (Figure 11).
As can be seen in Table 4, our model obtained the highest accuracy when the motion blur was weak, and the accuracy decreased as the motion blur became stronger. The accuracy was lowest when there was no motion blur at all. This is because, under moderate motion blur, our model could extract additional information regarding the object states from the blur, whereas the stronger the blur, the more difficult it was to recognize the object states. When no motion blur occurs, owing to a very fast shutter speed, the error increases because no additional information about the object states can be obtained from the blur.
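One simple way to synthesize such shutter-speed-dependent blur, shown here only as an illustration of the idea rather than the exact rendering pipeline used for FDB, is to average the sub-frames rendered while the virtual shutter is open:

```python
import numpy as np

def synthesize_blur(subframes, shutter_fraction):
    # subframes: (N, H, W) sharp sub-frames rendered within one frame interval.
    # shutter_fraction: fraction of the interval the shutter stays open;
    # e.g., 24 fps with a 1/25 s exposure gives roughly 24/25 = 0.96.
    n_open = max(1, int(round(len(subframes) * shutter_fraction)))
    return subframes[:n_open].mean(axis=0)  # longer exposure => stronger blur
```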
The proposed velocity encoder was divided into a direction encoder and a speed encoder, and the final velocity was obtained by multiplying the direction and speed estimated by each sub-network. Configured in this manner, the velocity encoder performed better than one that regresses the velocity directly. To verify this, we experimented on FDB1 and compared the velocity error and the position error, where the velocity error is the mean squared error between the ground-truth and predicted velocities, and the position error is the mean squared error between the ground-truth and predicted positions. As shown in Table 5, the velocity encoder constructed using the proposed method performed much better on both metrics. This is because it could be trained more easily by dividing the velocity-recognition problem into two simpler sub-problems.
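A minimal PyTorch sketch of this factorization (the layer sizes and names are illustrative assumptions, not the paper's exact architecture):

```python
import torch.nn as nn

class VelocityEncoder(nn.Module):
    # Velocity = (unit direction vector) * (non-negative scalar speed).
    def __init__(self, feat_dim=128):
        super().__init__()
        self.direction = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))
        self.speed = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())

    def forward(self, features):
        d = self.direction(features)
        d = d / (d.norm(dim=-1, keepdim=True) + 1e-8)  # normalize to unit length
        return self.speed(features) * d                # broadcast speed over axes
```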
4.2. FTB
We conducted experiments with FTB to show that our model works well on real data. Because the FTB-real data have no labels for object states, we used the SSE and SSIM between the ground-truth and predicted frames as quantitative metrics. When trained on FTB-synthetic data, our model predicted the future frames of FTB-synthetic accurately (Table 6 top, Figure 12 top), but not those of FTB-real (Table 6 bottom, first row; Figure 12 bottom, “w/o fine tuning” row). However, after fine-tuning the model pre-trained on FTB-synthetic with FTB-real data, it was able to predict future frames accurately on FTB-real as well. As can be seen in
Table 6 bottom, fine-tuning the prediction network using only the next-reconstruction loss reduced the accuracy of the model, whereas fine-tuning it using only the next-prediction loss increased the accuracy. The accuracy was higher when the prediction network was fine-tuned using both the reconstruction and prediction losses, and the highest accuracy was obtained when both the motion encoder and the prediction network were fine-tuned (ALL). An example of the fine-tuning results is shown in Figure 12.
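The compared fine-tuning variants can be expressed schematically as follows, a sketch under our own reading of the two losses (reconstructing the current observation versus predicting the next frame); the module names are hypothetical stand-ins for the paper's components:

```python
import torch.nn.functional as F

def fine_tune_step(motion_encoder, prediction_net, decoder, frames, optimizer,
                   use_recon=True, use_pred=True):
    # frames: (B, T, C, H, W) real blurred sequence without state labels.
    assert use_recon or use_pred, "at least one loss term is required"
    state = motion_encoder(frames[:, :2])         # state from the first 2 frames
    loss = 0.0
    if use_recon:                                 # next-reconstruction loss
        loss = loss + F.mse_loss(decoder(state), frames[:, 1])
    if use_pred:                                  # next-prediction loss
        next_state = prediction_net(state)
        loss = loss + F.mse_loss(decoder(next_state), frames[:, 2])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```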