Figure 1.
The JPEG compression pipeline. The lossy step is indicated as a rounded block.
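The lossy step in the pipeline of Figure 1 is the quantization of the 8×8 DCT coefficient blocks. A minimal pure-Python sketch of this idea, using a single uniform quantization step instead of the standard JPEG quantization tables (a deliberate simplification):

```python
import math

N = 8  # JPEG operates on 8x8 blocks


def _c(k):
    # Orthonormal DCT scaling factor.
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def dct1(v):
    """Orthonormal 1D DCT-II of a length-N sequence."""
    return [_c(k) * sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                        for n in range(N)) for k in range(N)]


def idct1(V):
    """Orthonormal 1D DCT-III, the inverse of dct1."""
    return [sum(_c(k) * V[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k in range(N)) for n in range(N)]


def apply2d(f, block):
    """Apply a 1D transform to every row, then to every column."""
    rows = [f(r) for r in block]
    out = [[0.0] * N for _ in range(N)]
    for j in range(N):
        col = f([rows[i][j] for i in range(N)])
        for i in range(N):
            out[i][j] = col[i]
    return out


def jpeg_like_roundtrip(block, q):
    """DCT -> uniform quantization (the lossy step) -> dequantize -> inverse DCT."""
    coeffs = apply2d(dct1, block)
    quantized = [[round(c / q) * q for c in row] for row in coeffs]
    return apply2d(idct1, quantized)


# A smooth hypothetical block with values in 0..255, as for a luminance channel.
block = [[128 + 40 * math.sin((i + j) / 4.0) for j in range(N)] for i in range(N)]
recon = jpeg_like_roundtrip(block, q=16)
err = max(abs(block[i][j] - recon[i][j]) for i in range(N) for j in range(N))
```

Without the quantization step the DCT round trip is (numerically) exact; rounding the coefficients is what makes the pipeline irreversible.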
Figure 2.
The JPEG compression quality. An example image section compressed with the selected Q values, together with the corresponding SSIM [21]. For visibility, the image section was zoomed using nearest-neighbor interpolation. Source: http://r0k.us/graphics/kodak/kodak/kodim04.png, license: CC0, accessed on 19 December 2021.
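SSIM, used in Figure 2 and Table 2 to quantify degradation, compares two images through their local means, variances and covariance. A minimal pure-Python sketch of the SSIM formula, computed globally over one small grayscale patch rather than as the usual windowed average (a simplification; the patch values are hypothetical):

```python
from statistics import mean


def ssim_global(x, y, data_range=255.0):
    """Single-window SSIM between two equal-length grayscale pixel lists."""
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilization constants
    c2 = (0.03 * data_range) ** 2
    mx, my = mean(x), mean(y)
    n = len(x)
    # Sample (co)variances with the usual N-1 normalization.
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)
    vy = sum((b - my) ** 2 for b in y) / (n - 1)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


# Identical patches give SSIM = 1; a brightened, clipped copy gives SSIM < 1.
patch = [10, 40, 80, 120, 160, 200, 240, 250]
degraded = [p + 15 for p in patch[:-1]] + [255]
```

Library implementations (e.g., scikit-image's `structural_similarity`) average this quantity over sliding windows, which is what the SSIM values in this paper refer to.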
Figure 3.
A high-level object detection pipeline that exploits deep learning. The deep CNN-based object detectors used in this study consist of a backbone and a head. The rectangles represent the functional elements, whereas the rounded rectangles are their input and output data.
Figure 4.
The backbone of an object detector. This part of the network extracts the deep features.
Figure 5.
A two-stage detector head; such detectors are also referred to as sparse detectors.
Figure 6.
The AP score derived from a precision-recall curve. (A) Example plot of recall and monotonic precision for the k top-confidence detections. (B) Monotonic precision plotted against recall. (C) AP as the area under the precision-recall curve in the unit square. (D) AP as the average of precision sampled at 11 recall values. If a recall value is never reached, the precision becomes zero for that value.
Figure 7.
Baseline TPR, PPV and F1 as a function of the confidence threshold. Top row: the 101-layer backbone models; bottom row: the 50-layer backbone models. (A) RetinaNet achieves low values of each metric at the respective extremes and a narrow range of the best F1. (B,C) Faster R-CNN without FPN exhibits similar behavior regardless of the use of dilated convolutions; the best F1 is returned at a higher threshold. (D,E) Faster R-CNN with FPN maintains the balance between TPR and PPV over a wide range, with the ResNeXt backbone yielding better metrics and a symmetrical F1 curve. (F–I) The models with shallower backbones behave analogously to their deeper counterparts.
Figure 8.
The R101 detection examples: bicycles and motorcycles. (A) Q = 5: the single FP is a box around parts of two different bicycle objects (a chimera). (B) Q = 10: no person detected; multiple bicycle FPs; one bicycle detected correctly, the other with extra detections of its parts; the motorcycle detected both correctly and falsely as bicycle. (C) Q = 20: both person objects detected, with repeated detections of one; a similar situation for bicycle. (D) Q = 30: person and bicycle correctly detected; the bicycle FP includes part of the first bicycle and all of the visible motorcycle. Source: http://images.cocodataset.org/val2017/000000011149.jpg, license: CC-BY, accessed on 19 December 2021.
Figure 9.
The R101 detection examples: two bears. The detector localizes the objects correctly, but there are classification mistakes. (A) Q = 10: classified as teddy bear and person. (B) Q = 20: classified as teddy bear and dog (left) and as elephant, sheep and teddy bear (right). (C) Q = 25: one bear is correct, the other is classified as dog. (D) Q = 30: finally, both bear objects are correct. Source: http://images.cocodataset.org/val2017/000000020247.jpg, license: CC-BY, accessed on 19 December 2021.
Figure 10.
The X101 detection examples: under a bridge. (A) Q = 10: two person objects; errors: a mistaken baseball bat. (B) Q = 15: three person objects; errors: a hallucinated train. (C) Q = 20: three person objects, one bird, one backpack; the fourth person is annotated with an inaccurate bounding box. (D) Q = 25: all person objects are correct; the bird, a handbag and a boat are annotated with inaccurate bounding boxes; errors: one backpack is wrongly classified as suitcase, and there is one mistaken suitcase. Source: http://images.cocodataset.org/val2017/000000001268.jpg, license: CC-BY, accessed on 19 December 2021.
Figure 11.
Determining the minimal Q for a correct detection. Two examples of the minimal quality level for detecting objects, or for avoiding an FP. (A) Q = 10: a sheep’s head is misclassified as bird. (B) Q = 25: both GT sheep objects are detected (the minimal Q), but the head is still mistaken for bird, and there is a hallucinated bird. (C) Q = 65: the central sheep is correctly detected, but there is a chimera detection of the head and a pile of wool; there is also a sports ball wrongly classified as cow (this ball is detected, but never classified correctly, up to Q = 100; it can also be classified as sheep). (D) Q = 10: both person objects are detected, but the elephant is omitted. (E) Q = 15: the minimal Q for detecting the elephant. Source: http://images.cocodataset.org/val2017/000000012062.jpg and http://images.cocodataset.org/val2017/000000021903.jpg, license: CC-BY, accessed on 19 December 2021.
Figure 12.
TPR, PPV and F1 as a function of Q at a fixed confidence threshold. Top row: the 101-layer backbone models; bottom row: the 50-layer backbone models. (A,F) RetinaNet exhibits a high and constant PPV. (B,C,G,H) Faster R-CNN without FPN exhibits similar behavior regardless of the use of dilated convolutions; TPR and PPV are close over a wide range of Q, and recall is better than precision. (D,I,E) Faster R-CNN with FPN exhibits a high PPV even for low Q, while its TPR is above RetinaNet’s and below that of the non-FPN Faster R-CNNs.
Figure 13.
AP, mAP.5 and mAP.75 as a function of Q (in the plots, AP50 denotes mAP.5 and AP75 denotes mAP.75). Top row: the 101-layer backbone models; bottom row: the 50-layer backbone models. The results for all the models are approximately identical for AP and the related metrics. Note how close mAP.75 is to AP. The models: (A,F) RetinaNet, (B,C,G,H) non-FPN Faster R-CNN, (D,I,E) FPN Faster R-CNN.
Figure 14.
AP for large, medium and small objects as a function of Q. Top row: the 101-layer backbone models; bottom row: the 50-layer backbone models. The results for all the models are almost identical for the AP metric at each object size. The models: (A,F) RetinaNet, (B,C,G,H) non-FPN Faster R-CNN, (D,I,E) FPN Faster R-CNN.
Figure 15.
The derivative of AP with respect to Q, smoothed with an averaging window of size 5. Top row: the 101-layer backbone models; bottom row: the 50-layer backbone models. The derivative is very similar for all the models: the range 40–100 is noisy with a constant running average; below 40, the rate of AP degradation increases, reaching its maximum near Q = 15. The models: (A,F) RetinaNet, (B,C,G,H) non-FPN Faster R-CNN, (D,I,E) FPN Faster R-CNN.
Table 1.
Descriptive statistics of the compression ratio (as a function of Q) obtained for the COCO val2017 images.
Q | Mean | Std | Min | 25% | 50% | 75% | Max |
---|---|---|---|---|---|---|---|
5 | 100.51 | 36.36 | 1.45 | 75.61 | 97.69 | 125.08 | 229.79 |
15 | 51.34 | 21.22 | 1.44 | 37.01 | 47.77 | 61.89 | 210.04 |
25 | 36.45 | 16.08 | 1.42 | 25.97 | 33.52 | 43.63 | 194.23 |
35 | 28.96 | 13.29 | 1.41 | 20.43 | 26.51 | 34.51 | 174.91 |
45 | 24.31 | 11.38 | 1.40 | 17.13 | 22.15 | 28.77 | 154.25 |
55 | 21.46 | 10.05 | 1.39 | 15.26 | 19.60 | 25.37 | 144.08 |
65 | 17.64 | 8.29 | 1.38 | 12.54 | 16.13 | 20.67 | 121.30 |
75 | 14.78 | 6.76 | 1.36 | 10.61 | 13.57 | 17.38 | 103.31 |
85 | 9.60 | 4.04 | 1.32 | 7.05 | 8.90 | 11.18 | 53.47 |
95 | 5.56 | 2.20 | 1.19 | 4.15 | 5.13 | 6.52 | 32.92 |
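The statistics reported in Tables 1 and 2 (mean, standard deviation, min, quartiles, max) can be reproduced per Q value from the per-image measurements using the Python standard library; a minimal sketch over a hypothetical list of compression ratios:

```python
from statistics import mean, stdev, quantiles


def describe(values):
    """Mean, sample std, min, quartiles (25/50/75%) and max, as in Tables 1 and 2."""
    q1, q2, q3 = quantiles(values, n=4)  # exclusive method by default
    return {"mean": mean(values), "std": stdev(values),
            "min": min(values), "25%": q1, "50%": q2, "75%": q3,
            "max": max(values)}


# Hypothetical per-image compression ratios for one Q value.
ratios = [75.6, 97.7, 125.1, 100.5, 36.4, 229.8, 1.5, 97.0]
stats = describe(ratios)
```

Note that `statistics.quantiles` uses the exclusive method by default, so the quartiles may differ slightly from tools using another interpolation scheme (e.g., pandas `describe`).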
Table 2.
Descriptive statistics of SSIM (as a function of Q) obtained for the COCO val2017 images.
Q | Mean | Std | Min | 25% | 50% | 75% | Max |
---|---|---|---|---|---|---|---|
5 | 0.6271 | 0.0898 | 0.1533 | 0.5741 | 0.6309 | 0.6860 | 0.9657 |
15 | 0.7738 | 0.0637 | 0.3593 | 0.7360 | 0.7802 | 0.8176 | 0.9694 |
25 | 0.8262 | 0.0546 | 0.3900 | 0.7945 | 0.8320 | 0.8643 | 0.9795 |
35 | 0.8549 | 0.0484 | 0.4191 | 0.8276 | 0.8600 | 0.8883 | 0.9760 |
45 | 0.8740 | 0.0436 | 0.4440 | 0.8498 | 0.8785 | 0.9039 | 0.9910 |
55 | 0.8868 | 0.0413 | 0.4674 | 0.8645 | 0.8915 | 0.9150 | 0.9931 |
65 | 0.9047 | 0.0358 | 0.5045 | 0.8865 | 0.9090 | 0.9287 | 0.9881 |
75 | 0.9211 | 0.0331 | 0.5768 | 0.9061 | 0.9256 | 0.9424 | 0.9972 |
85 | 0.9562 | 0.0249 | 0.7819 | 0.9432 | 0.9597 | 0.9757 | 0.9985 |
95 | 0.9947 | 0.0026 | 0.9808 | 0.9931 | 0.9947 | 0.9969 | 0.9996 |
Table 3.
The deep object detection models investigated in this study.
Symbol | Description |
---|---|
R101 | RetinaNet [35] with ResNet-101 [30] + FPN [26] |
R101_C4 | Faster R-CNN [39] with ResNet-101 [30] |
R101_DC5 | Faster R-CNN [39] with ResNet-101 [30] + DC [40] |
R101_FPN | Faster R-CNN [39] with ResNet-101 [30] + FPN [26] |
R50 | RetinaNet [35] with ResNet-50 [30] + FPN [26] |
R50_C4 | Faster R-CNN [39] with ResNet-50 [30] |
R50_DC5 | Faster R-CNN [39] with ResNet-50 [30] + DC [40] |
R50_FPN | Faster R-CNN [39] with ResNet-50 [30] + FPN [26] |
X101 | Faster R-CNN [39] with ResNeXt-101 [31] + FPN [26] |
Table 4.
The baseline performance of all investigated deep models ().
Model | AP | mAP.5 | mAP.75 | APl | APm | APs | TPR | PPV | TP | FP | EX |
---|---|---|---|---|---|---|---|---|---|---|---|
R101 | 33.6 | 47.0 | 37.2 | 46.3 | 37.5 | 15.3 | 51.7 | 81.1 | 18,769 | 4360 | 820 |
R101_C4 | 38.5 | 56.3 | 41.9 | 53.6 | 42.8 | 19.1 | 67.1 | 58.6 | 24,373 | 17,246 | 4463 |
R101_DC5 | 38.3 | 56.8 | 42.0 | 52.1 | 42.8 | 19.4 | 68.0 | 58.2 | 24,701 | 17,752 | 4504 |
R101_FPN | 38.4 | 55.5 | 42.6 | 51.2 | 42.1 | 20.8 | 64.9 | 68.7 | 23,593 | 10,751 | 2605 |
R50 | 31.6 | 44.3 | 35.2 | 44.3 | 34.8 | 14.1 | 49.7 | 80.7 | 18,043 | 4320 | 821 |
R50_C4 | 35.9 | 53.6 | 39.3 | 51.0 | 39.8 | 17.6 | 65.3 | 56.8 | 23,733 | 18,075 | 4586 |
R50_DC5 | 36.8 | 55.7 | 40.5 | 50.5 | 41.4 | 18.2 | 66.7 | 57.1 | 24,244 | 18,190 | 4668 |
R50_FPN | 36.7 | 54.1 | 40.7 | 49.4 | 40.1 | 19.1 | 63.8 | 67.0 | 23,170 | 11,428 | 2763 |
X101 | 39.6 | 57.0 | 43.9 | 52.1 | 42.9 | 22.6 | 66.3 | 69.7 | 24,073 | 10,472 | 2534 |
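F1 in Table 4 (and in Figures 7 and 12) is the harmonic mean of recall (TPR) and precision (PPV). A quick sanity check against the R101 row of the table (TPR = 51.7%, PPV = 81.1%):

```python
def f1_score(tpr, ppv):
    """Harmonic mean of recall (TPR) and precision (PPV), both given as fractions."""
    return 2 * tpr * ppv / (tpr + ppv)


# R101 row of Table 4: TPR = 51.7%, PPV = 81.1%.
f1_r101 = f1_score(0.517, 0.811)  # about 0.63
```

This makes explicit why RetinaNet's high PPV alone does not yield the best F1: the harmonic mean is dominated by the smaller of the two operands, here the TPR.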
Table 5.
The performance of all investigated models at Q = 25 ().
Model | AP | mAP.5 | mAP.75 | APl | APm | APs | TPR | PPV | TP | FP | EX |
---|---|---|---|---|---|---|---|---|---|---|---|
R101 | 25.2 | 36.2 | 27.8 | 36.7 | 28.0 | 8.8 | 41.0 | 80.9 | 14,900 | 3511 | 513 |
R101_C4 | 30.3 | 45.5 | 32.8 | 45.5 | 33.8 | 12.0 | 56.8 | 57.1 | 20,634 | 15,478 | 3253 |
R101_DC5 | 30.1 | 46.6 | 32.4 | 43.5 | 33.7 | 12.6 | 57.7 | 57.3 | 20,979 | 15,626 | 3363 |
R101_FPN | 29.2 | 43.7 | 31.9 | 42.1 | 31.9 | 13.1 | 53.1 | 68.8 | 19,309 | 8765 | 1779 |
R50 | 23.2 | 33.4 | 25.8 | 35.0 | 25.8 | 7.8 | 38.7 | 80.3 | 14,067 | 3458 | 470 |
R50_C4 | 27.4 | 42.5 | 29.6 | 41.1 | 30.1 | 10.7 | 54.6 | 52.1 | 19,842 | 18,225 | 3373 |
R50_DC5 | 28.4 | 44.7 | 30.5 | 41.0 | 31.5 | 12.0 | 56.5 | 54.7 | 20,516 | 16,974 | 3390 |
R50_FPN | 27.1 | 42.1 | 29.5 | 38.5 | 30.2 | 12.1 | 51.6 | 66.2 | 18,761 | 9575 | 1784 |
X101 | 30.4 | 45.6 | 32.8 | 42.4 | 33.5 | 13.7 | 54.8 | 68.2 | 19,914 | 9269 | 1736 |