2.1. Image Sampling for Human Eyes
Nyquist pointed out that a band-limited signal can be recovered without any loss of information as long as the sampling frequency is more than twice the highest frequency contained in the signal. When this theorem is applied to image sampling, however, the sampling interval of the whole image must be chosen according to the region with the most rapid spatial variation, and the criterion has to be satisfied in all three color channels simultaneously. As a result, uniform sampling is always dense, introduces serious redundancy, and is of little value for reducing the amount of data.
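Written out (with notation introduced here only for illustration: $f_s$ the sampling frequency, $\Delta$ the uniform sampling interval, and $f_{\max}^{(c)}$ the highest spatial frequency of channel $c$), the condition reads:

```latex
% Nyquist criterion for a band-limited signal
f_s > 2 f_{\max},
% and for uniform image sampling the interval is set by the worst case
% over the three channels:
\qquad \frac{1}{\Delta} > 2 \max_{c \in \{1,2,3\}} f_{\max}^{(c)}
```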
Non-uniform sampling was therefore proposed, which focuses on assigning more sampling points to the foreground regions of interest. Adaptive mesh sampling was proposed in [6], which models the distances between pixels as an elastic system to place sampling points. Eldar [7] introduced the farthest point strategy into the field of image sampling and derived the best sampling points through an error function; farthest point sampling is also one of the most popular methods in the field of point clouds. The distortion of the image luminance is taken into account for sampling in [8]. In addition, the authors of [9,10,11,12,13] propose methods for allocating sampling points from different perspectives. Wavelets were introduced into the field of image sampling [14,15,16,17], following the idea of signal processing. From the perspective of a manifold, the authors of [18,19,20,21] sought to model and solve the problem of image sampling in a mathematical style. From a statistical point of view, an image can be regarded as a Markov random field, which was exploited in [22]. Much research has focused on sparsity and low sampling rates [23,24,25]. With the development of deep learning, learning-based methods have been developed that obtain good performance [26] but are associated with high complexity.
In industrial applications, the JPEG coding standard first converts the image into the YUV color space for sampling. The luminance component Y is retained completely, while the chrominance components U and V can be downsampled with different factors, balancing the reduction in the amount of data against the perception of human eyes. The idea of keeping the luminance and downsampling the chrominance information is a valuable reference.
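As a minimal sketch of this idea (not the exact JPEG pipeline), the following code keeps the Y plane at full resolution and decimates the U and V planes by a factor of 2 in each direction, i.e., 4:2:0-style subsampling, which halves the raw sample count; the function name is our own illustrative choice.

```python
import numpy as np

def subsample_chroma_420(y, u, v):
    """Keep the luminance plane Y at full resolution and downsample the chroma
    planes U and V by a factor of 2 in each direction (4:2:0-style subsampling).
    Plain decimation is used here for brevity; practical encoders usually
    low-pass filter or average before decimating."""
    return y, u[::2, ::2], v[::2, ::2]

# A 480x640 frame drops from 3*H*W samples to 1.5*H*W samples.
h, w = 480, 640
y, u, v = (np.random.rand(h, w) for _ in range(3))
y2, u2, v2 = subsample_chroma_420(y, u, v)
print(y2.size + u2.size + v2.size, "of", 3 * h * w)  # 460800 of 921600
```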
The above traditional sampling methods are all designed for human eyes and cannot selectively retain the information required by machine analysis, so the amount of data is inevitably redundant.
2.2. Image Preprocessing for Computer Vision
Focusing on learning-based computer vision algorithms, existing studies have confirmed that image preprocessing modules can improve the performance of machine analysis tasks, image sampling being one such method. In the image coding field, previous studies [27,28] have shown the necessity of coding aimed at computer vision.
Specifically for preprocessing methods of collected images, image enhancement approaches such as super-resolution, pretransformation, and denoising have been explored. A super-resolution network was adopted in [29] to preprocess low-resolution images and was trained jointly with the detection network, effectively improving the performance of the detection task. Gandal [30] improved the quality of images generated by GAN networks by introducing a texture loss function, which ensured that the subsequent visual tasks worked well. In [31], a dual-directed capsule network combining high-resolution image anchor loss and reconstruction loss was used to reconstruct very-low-resolution images to enable face recognition. Suzuki [32] used a deep encoder-decoder to pretransform and compress images, maintaining recognition accuracy while reducing the image bit rates. The authors of [33] used dynamic convolution for filtering to enhance images and to improve classification performance. A dual-channel model and denoising algorithm were used in [34] to improve the quality of noisy images, thereby improving recognition accuracy. RSRGAN was proposed in [35], which utilizes super-resolution to enlarge small objects in infrared images and can obtain better detection performance. The authors of [36] noted the positive effect of image enhancement on the detection of COD in eye fundus images and proposed practical methods. In [37], a novel illumination normalization method was proposed to remove illumination boundaries and to improve image quality under dark conditions, improving face detection. A multi-scale fusion of various prior features was used in [38] to enhance underwater images and to facilitate subsequent visual tasks for the capture of underwater scenes. A preprocessing method was proposed in [39] to suppress background interference for infrared pedestrian object detection.
Though offering better computer vision performance, the above-mentioned methods do not fully take computation costs into consideration. Learning-based networks are generally adopted for such preprocessing modules, so they inevitably introduce additional computation and resource costs on top of the original computer vision algorithm, and the complexity of these network models is usually significant. Moreover, the spatial resolution of the input image is not decreased; on the contrary, the image has an even larger resolution when super-resolution-based approaches are used. Given the earlier observation that processing costs are proportional to the spatial resolution, these methods have limited scope for application in lightweight scenarios.
To this end, learning-based image-resizing methods have also been examined to achieve image sampling. The authors of [40] designed an image resizer network with the target of achieving optimal visual task performance, which achieved excellent recognition performance on the ImageNet [41] and AVA datasets through joint training with the visual algorithms. ThumbNet and the related training strategies were proposed in [42], which reduce the image size before performing visual tasks and can maintain the accuracy of downstream tasks even with 16-fold smaller images. In [43], Chen et al. proposed decomposing the input image into two low-resolution sub-images carrying low-frequency and high-frequency information, respectively, thereby accelerating the processing of visual tasks.
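As a rough, generic sketch of such a low/high-frequency split (our own illustration under simple assumptions, not the exact decomposition used in [43]), a grayscale image can be divided into a half-resolution block-mean sub-image and a half-resolution residual sub-image:

```python
import numpy as np

def split_low_high(img):
    """Split a grayscale image (H, W), with H and W even, into two
    half-resolution sub-images: 2x2 block means (low-frequency content) and a
    subsampled residual after removing the local mean (high-frequency content).
    A generic illustration only, not the method of [43]."""
    h, w = img.shape
    blocks = img.reshape(h // 2, 2, w // 2, 2)
    low = blocks.mean(axis=(1, 3))                           # low-frequency sub-image
    smooth = np.repeat(np.repeat(low, 2, axis=0), 2, axis=1)  # back to (H, W)
    high = (img - smooth)[::2, ::2]                          # high-frequency sub-image
    return low, high
```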
In these studies, the spatial resolution of the input image can be reduced, and the loss of machine analysis performance is limited as far as possible through training. However, their generalization ability is limited: a learned image-resizing model generally has to be matched to the backbone network that participated in its training, and the above approaches only demonstrate strong performance on image classification tasks. Moreover, they also introduce considerable additional complexity.
In addition to image sampling, the sampling of immersive data, including point clouds, is also of great significance for the development of computer vision tasks. A growing number of tasks work directly on point clouds, and as the size of the point cloud grows, so do the computational demands of these tasks. A possible solution is to sample the point cloud first. A widely used method is farthest point sampling (FPS) [44,45], which starts from a point in the set and iteratively selects the point farthest from the points already selected.
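For concreteness, a minimal sketch of greedy FPS on an (N, 3) coordinate array is given below; the function name and the random choice of the starting point are illustrative assumptions rather than details fixed by [44,45].

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy farthest point sampling: return indices of k points from an
    (N, 3) array such that each new point is the one farthest from the set
    of points already selected."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(k, dtype=np.int64)
    selected[0] = rng.integers(n)                  # arbitrary starting point
    # distance from every point to its nearest already-selected point
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for i in range(1, k):
        selected[i] = np.argmax(dist)              # farthest from the current set
        new_dist = np.linalg.norm(points - points[selected[i]], axis=1)
        dist = np.minimum(dist, new_dist)          # update nearest distances
    return selected

# Example: keep 1024 of 100,000 points.
cloud = np.random.rand(100_000, 3)
sampled = cloud[farthest_point_sampling(cloud, 1024)]
```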
Ref. [46] introduced a novel differentiable relaxation for point cloud sampling that approximates the sampled points as a mixture of points in the primary input cloud. Ref. [47] proposed a resolution-free point cloud sampling network that directly samples the original point cloud to different resolutions by optimizing non-learning-based initial sampled points to better positions; furthermore, data distillation was introduced to assist the training process by considering the differences between the task network outputs for the original point cloud and for the sampled points. Ref. [48] proposed an objective point cloud quality index with structure-guided resampling to automatically evaluate the perceptual visual quality of 3D dense point clouds, exploiting the unique normal vectors of point clouds to perform regional preprocessing involving key point resampling and local region construction.
For applications in other fields, such as sampling for training physics-informed neural networks, ref. [49] proposed a novel sampling scheme, called dynamic mesh-based importance sampling, to speed up convergence without significantly increasing the computational cost. The scheme relies on a sampling weight estimation method, called dynamic mesh-based weight estimation, which constructs a dynamic triangular mesh to estimate the weight of each data point efficiently.