*3.1. Local Square Feature Region Construction*

Due to the desynchronization attacks caused by the screen-cam process and user operations, we need to develop an appropriate synchronization method to locate the watermark. We test the feasibility of the Harris–Laplace, SIFT, and SURF operators, which are extensively employed to construct local scale-invariant feature regions (LFRs) as message embedding areas [37–45], under the screen-cam process. To select the most suitable operators for LFR construction, the variations of feature point coordinates, feature scale, and feature orientation are quantitatively analyzed under different shooting distances. The images used here are shown in Figure 2; all host images are 1024 × 1024 pixels. Because of the blurring of image edges caused by the low-pass filtering attack and the lens distortion, it is difficult to restore the captured image to correspond exactly to the original image, so there will inevitably be a displacement between the coordinates of corresponding pixels. Therefore, feature points are considered repeated when the offsets of their coordinates are smaller than five pixels. In addition, considering the requirements of watermark synchronization, the feature scale variation must also be below 10%. Furthermore, feature points at the edge of the image are excluded. To reduce the impact of noise, we apply a Gaussian filter to both the original images and the captured images. As shown in Figure 3a, after Gaussian filtering, the middle- and high-scale feature points (i.e., those with a feature scale greater than 15) of the Harris–Laplace and SIFT operators achieve high repeatability. Repeatability here refers to the ratio of the number of feature points that are extracted after a screen-cam attack and satisfy the above pixel-offset and scale-variation criteria to the original number of feature points.
We also note that although the SIFT operator performs better at long shooting distances, it does not work well at close shooting distances, which indicates that it is more sensitive to moiré noise. Comparatively, the Harris–Laplace operator is more stable across shooting distances and is therefore more suitable for watermark synchronization. Regarding the feature orientation descriptors, the SURF orientation descriptor is more robust to the screen-cam process than the SIFT one: the orientation variations of repeated SURF feature points are predominantly less than five degrees, as shown in Figure 3b. We attribute this to the integral-image and Haar-wavelet-based SURF orientation descriptor being more robust to the blurring and luminance changes of the screen-cam process than the Gaussian-image and histogram-based SIFT orientation descriptor.
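The repeatability criteria above can be made concrete with a short check. The following sketch (with hypothetical helper names `is_repeated` and `repeatability`) applies the two thresholds stated in the text: coordinate offset below five pixels and feature-scale variation below 10%:

```python
import numpy as np

def is_repeated(orig_pt, cap_pt, orig_scale, cap_scale,
                max_offset=5.0, max_scale_var=0.10):
    """Check whether a captured feature point counts as a repeat of an
    original one: coordinate offset below 5 pixels and feature-scale
    variation below 10%, per the criteria in the text."""
    offset = np.hypot(orig_pt[0] - cap_pt[0], orig_pt[1] - cap_pt[1])
    scale_var = abs(cap_scale - orig_scale) / orig_scale
    return bool(offset < max_offset and scale_var < max_scale_var)

def repeatability(matches, n_original):
    """Ratio of repeated feature points to the original count."""
    return sum(matches) / n_original
```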

**Figure 2.** Host images: Lena, Baboon, Airplane, Peppers, Building, Pentagon, White House, and Naval Base.

**Figure 3.** The variations of feature points in screen-cam process.

Therefore, in our method, a modified Harris–Laplace detector and SURF orientation descriptor are integrated to construct the RST invariant LFRs. To increase the detection rate of feature points during the screen-cam process, we also employ a Gaussian function.

In previous LFR-based methods, circular feature regions are commonly constructed. These regions will involve zero-padding [43] or rearrangement [46] to a square region before message embedding, which will cause further distortions [47]. Therefore, we directly construct LSFRs. Figure 4 illustrates the subprocesses. The details are as follows.

**Figure 4.** The construction process of local square feature regions in 1024 × 1024 Lena image. (**a**) Luminance image of Lena after a Gaussian function. (**b**) The extracted feature points by modified Harris–Laplace. (**c**) The associated orientation. (**d**) The associated LSFRs.

#### 3.1.1. Gaussian Function Preprocess

In the embedding and extraction processes, the detection of feature points will be performed on the Gaussian filtered images. A two-dimensional Gaussian function *G*(*x*, *y*) is obtained by the product of two one-dimensional Gaussian functions and can be defined as:

$$G(x,y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}\tag{1}$$

where σ is the standard deviation. The Gaussian kernel *HG*, whose sigma is set to 2 and window size to 7, is employed here. The image convolution process is defined as:

$$I'(x, y) = H_G * I(x, y) \tag{2}$$

where *I* is the input image, *I*′ is the convolution result, and ∗ denotes the convolution operator.
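Equations (1) and (2) can be sketched directly in NumPy. This is a minimal illustration with σ = 2 and a 7 × 7 window as stated above, not the authors' implementation:

```python
import numpy as np

def gaussian_kernel(sigma=2.0, size=7):
    """2-D Gaussian kernel G(x, y) of Equation (1), normalized to sum to 1."""
    half = size // 2
    x, y = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return g / g.sum()

def gaussian_filter(image, sigma=2.0, size=7):
    """I' = H_G * I of Equation (2), via direct 2-D convolution with
    symmetric border padding (the kernel is symmetric, so convolution
    and correlation coincide)."""
    h = gaussian_kernel(sigma, size)
    half = size // 2
    padded = np.pad(np.asarray(image, dtype=float), half, mode='symmetric')
    out = np.zeros(image.shape, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += h[dy, dx] * padded[dy:dy + image.shape[0],
                                      dx:dx + image.shape[1]]
    return out
```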

#### 3.1.2. Modified Harris–Laplace Detector

The Harris–Laplace detector proposed by Mikolajczyk and Schmid [48,49] has been extensively employed in different watermarking schemes. Therefore, here we give a brief description of it, and the modified part will be explained in detail.

First, Harris points are detected in the scale space. To obtain invariance to scale variation, we build a scale-space representation with the Harris function for preselected scales. The Harris detector is based on a specific image descriptor, referred to as the second moment matrix, which reflects the local distribution of gradient directions in the image [50]. To make the matrix independent of the image resolution, the scale-adapted second moment matrix is defined by:

$$M(x, y, \sigma_I, \sigma_D) = \sigma_D^2\, G(\sigma_I) * \begin{bmatrix} L_x^2(x, y, \sigma_D) & L_x L_y(x, y, \sigma_D) \\ L_x L_y(x, y, \sigma_D) & L_y^2(x, y, \sigma_D) \end{bmatrix} \tag{3}$$

where σ*I* and σ*D* are the integration scale and the local scale, respectively, and *L* is the derivative computed in the associated direction by a Gaussian. Given σ*D*, the uniform Gaussian multiscale space representation *L* is defined by:

$$L(x, y, \sigma_D) = G(x, y, \sigma_D) * I' \tag{4}$$

where *G* is the associated uniform Gaussian kernel with a standard deviation of σ*D* and a mean of zero.

Given σ*I* and σ*D*, the scale-adapted Harris corner strength *cornerness*, which quantitatively describes the stability under variations in imaging conditions, can be computed. The original *cornerness* measure function needs an empirical parameter, which may vary between images. Therefore, in this paper, we adopt another *cornerness* measure function, the Alison Noble measure [51]:

$$\text{cornerness}(x, y, \sigma_I, \sigma_D) = \frac{\det\left(M(x, y, \sigma_I, \sigma_D)\right)}{\operatorname{trace}\left(M(x, y, \sigma_I, \sigma_D)\right) + \mathit{eps}} \tag{5}$$

where det(·) and trace(·) denote the determinant and the trace of the matrix, respectively, and *eps* is the smallest positive floating-point number, used to ensure that the denominator is nonzero. The feature points obtained by this measure are more robust under variations in imaging conditions [51].
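The following sketch illustrates Equations (3) and (5) on a grayscale array. It is a simplified stand-in, not the paper's code: the derivatives come from `np.gradient`, and the Gaussian windows are separable truncated kernels:

```python
import numpy as np

def smooth(a, sigma):
    """Separable truncated-Gaussian smoothing, standing in for G(sigma)."""
    half = int(3 * sigma)
    x = np.arange(-half, half + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    a = np.apply_along_axis(lambda r: np.convolve(r, g, mode='same'), 0, a)
    return np.apply_along_axis(lambda r: np.convolve(r, g, mode='same'), 1, a)

def noble_cornerness(img, sigma_I=2.0, sigma_D=1.4):
    """det(M) / (trace(M) + eps) of Equation (5), with the scale-adapted
    second moment matrix M of Equation (3)."""
    # derivatives of the image smoothed at the local scale sigma_D
    Ly, Lx = np.gradient(smooth(np.asarray(img, float), sigma_D))
    # Gaussian-windowed products, scaled by sigma_D^2 (Equation (3))
    A = sigma_D**2 * smooth(Lx * Lx, sigma_I)
    B = sigma_D**2 * smooth(Ly * Ly, sigma_I)
    C = sigma_D**2 * smooth(Lx * Ly, sigma_I)
    det = A * B - C * C
    trace = A + B
    return det / (trace + np.finfo(float).eps)
```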

At each level of the scale space, the candidate points are extracted as follows:

$$\begin{cases} \text{cornerness}(x, y, \sigma_I, \sigma_D) > \text{cornerness}(\hat{x}, \hat{y}, \sigma_I, \sigma_D) & \forall (\hat{x}, \hat{y}) \in A \\ \text{cornerness}(x, y, \sigma_I, \sigma_D) > t_h \end{cases} \tag{6}$$

where *A* represents the points within a 3σ*I*-radius neighborhood, and *th* is the threshold, set to 0.1 · max(*cornerness*).

Next, automatic scale selection is performed for the feature points. To select the characteristic scale of the local structure, a scale-normalized LoG operator is defined as:

$$\text{LoG}(x, y, \sigma_I) = \sigma_I^2 \left| L_{xx}(x, y, \sigma_I) + L_{yy}(x, y, \sigma_I) \right| \tag{7}$$

where *Lxx* and *Lyy* are second partial derivatives with respect to *x* and *y*, respectively.

For each candidate point, we apply an iterative method to determine the location and scale of the feature points. Given the initial point *p* with the scale σ*I*, the iteration steps are presented as follows.
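The scale-selection core of this procedure can be sketched as follows. This is a simplified, non-iterative illustration of Equation (7): it only picks the candidate σ*I* that maximizes the scale-normalized LoG response at a fixed location, whereas the full procedure of [48,49] alternates scale selection with spatial re-detection until both converge:

```python
import numpy as np

def _smooth1d(a, g, axis):
    """Convolve each 1-D slice of `a` along `axis` with kernel `g`."""
    return np.apply_along_axis(lambda r: np.convolve(r, g, mode='same'), axis, a)

def log_response(img, sigma):
    """Scale-normalized |LoG| of Equation (7): sigma^2 * |Lxx + Lyy|,
    with second derivatives taken on a Gaussian-smoothed image."""
    half = int(3 * sigma)
    x = np.arange(-half, half + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    s = _smooth1d(_smooth1d(np.asarray(img, float), g, 0), g, 1)
    d2 = np.array([1.0, -2.0, 1.0])  # finite-difference second-derivative stencil
    Lxx = _smooth1d(s, d2, 1)
    Lyy = _smooth1d(s, d2, 0)
    return sigma**2 * np.abs(Lxx + Lyy)

def characteristic_scale(img, x, y, scales):
    """Select the characteristic scale at (x, y): the sigma_I maximizing
    the scale-normalized LoG response over the candidate scales."""
    responses = [log_response(img, s)[y, x] for s in scales]
    return scales[int(np.argmax(responses))]
```

For a Gaussian blob of standard deviation *t*, the response peaks near σ*I* = *t*, which is what makes the selected scale characteristic of the local structure.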


#### 3.1.3. SURF Orientation Descriptor

To obtain rotation invariance, each feature point is assigned a direction based on the SURF orientation descriptor. We calculate the Haar wavelet responses on a circular region of the integral image centered at the feature point, with a radius of six times the feature scale. A Gaussian weighting function, with σ equal to two times the feature scale, is used to weight the Haar wavelet responses.

To obtain the dominant orientation, we calculate the sum of all responses within a sliding orientation window of size π/3. By summing the horizontal and vertical responses within the window, the vector (*mw*, θ*w*) can be obtained, which is defined as:

$$m_w = \sum_w \mathrm{d}x + \sum_w \mathrm{d}y \tag{8}$$

$$\theta_w = \arctan\left(\sum_w \mathrm{d}x \Big/ \sum_w \mathrm{d}y\right) \tag{9}$$

where *mw* is the summed response and θ*w* is the associated orientation. The dominant orientation θ is then defined as:

$$\theta = \theta_w \big|_{\max\{m_w\}} \tag{10}$$
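A sketch of the sliding-window search of Equations (8)–(10). The function and variable names are illustrative only, and, as in the original SURF formulation, the length of the summed response vector is used to rank the windows:

```python
import numpy as np

def dominant_orientation(dx, dy, angles, window=np.pi / 3):
    """Slide a pi/3 orientation window over Haar responses (dx, dy)
    located at polar angles `angles`; sum the responses inside each
    window and keep the orientation of the window whose summed vector
    is longest (Equations (8)-(10))."""
    best_m, best_theta = -np.inf, 0.0
    for start in np.linspace(0, 2 * np.pi, 72, endpoint=False):
        # angular membership test, wrapping around 2*pi
        d = (angles - start) % (2 * np.pi)
        in_win = d < window
        if not np.any(in_win):
            continue
        sx, sy = dx[in_win].sum(), dy[in_win].sum()
        m = np.hypot(sx, sy)  # length of the summed response vector
        if m > best_m:
            best_m, best_theta = m, np.arctan2(sy, sx)
    return best_theta
```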

#### 3.1.4. LSFRs for Watermarking

Considering the severe distortion during the screen-cam process, the constructed LSFRs should have a sufficient range to ensure that information can survive. Thus, the feature points with appropriate scale and location are selected, and the side length *L*<sup>0</sup> of LSFR is designed as:

$$L_0 = 2 \cdot \text{floor}(k_1 \cdot s) + 1 \tag{11}$$

where *k*<sup>1</sup> is a constant coefficient, and *s* is the feature scale value.

Figure 5 shows the selected LSFRs for the eight-image test set. Because the watermark information will be embedded in the DFT coefficients, the following two situations are also feasible according to the characteristics of those coefficients: when a small part of a candidate LSFR lies outside the image, as shown in Figure 5f, or when small parts of two LSFRs overlap, as shown in Figure 5g, these LSFRs can still be utilized as embedding areas.

**Figure 5.** The selected local square feature regions (LSFRs) for watermark embedding of eight host images.

#### *3.2. Watermark Embedding*

#### 3.2.1. Selection of Embedding Operations

As discussed in Section 3.1, there is an inevitable shift between corresponding positions of the corrected image and the original image. Fortunately, owing to the translation property of DFT coefficients, the coefficients of a captured image can be corrected to correspond to the original image, provided the four corners are carefully selected during perspective rectification to avoid excessive rotation and scaling distortion. Therefore, the DFT domain is advantageous for embedding the watermark message.

To use DFT coefficients as a watermark carrier, we first need to analyze how they vary during the screen-cam process. The variations of the DFT magnitude coefficients under different shooting conditions were analyzed in detail. As mid-frequency coefficients are commonly employed as watermark carriers, we take the mid-frequency spectrum of the 512 × 512 Lena image and its variation after screen-cam capture as an example, as shown in Figure 6. The axis scale values are coordinates in the spectrum of the original image.

We find that most of the magnitude coefficients with high values are well preserved, for example, the points (301, 299), (301, 300), and (302, 304), shown in deep warm colors in Figure 6. Furthermore, the magnitude coefficients with low values commonly shift to higher values; in general, the more blurred the image, the more the low-valued magnitude coefficients increase. Examples are the points (297, 305), (300, 296), and (302, 303) in Figure 6.

The changes can be summarized as follows. In the mid-frequency bands, the magnitude coefficients with high values are well preserved during the screen-cam process, while those with low values commonly increase to approximate their adjacent magnitude values. Therefore, we choose mid-frequency bands and embed the message by modifying the selected magnitude coefficients to higher values.
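The spectra compared in Figure 6 can be reproduced, in sketch form, by differencing centered log-magnitude spectra of the original image and the registered captured image:

```python
import numpy as np

def log_magnitude(img):
    """Centred log-magnitude DFT spectrum, as visualized in Figure 6a."""
    return np.log(np.abs(np.fft.fftshift(np.fft.fft2(img))) + 1)

def magnitude_change(original, captured):
    """Per-coefficient difference of log magnitudes between the original
    image and a registered screen-cam capture, as in Figure 6b,c."""
    return log_magnitude(captured) - log_magnitude(original)
```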

**Figure 6.** Detailed variations of discrete Fourier transform (DFT) coefficient magnitudes during screen-cam process. (**a**): Original image spectrum in the log domain. (**b**,**c**): Difference in log DFT magnitudes of original image and captured image at a vertical distance of 30 and 50 cm.

#### 3.2.2. Message Embedding

Figure 7 illustrates the embedding process. Each selected LSFR is treated as an independent communication channel, and the same watermark message is embedded in every LSFR. Compared with the DCT-based method in [33], which embeds the message in sub-blocks of the feature regions, the proposed DFT-based method treats each LSFR as a whole and thus has better robustness against cropping attacks.

To avoid the LSFR being further distorted during the rotation of orientation normalization, we design a non-rotating embedding method based on the properties of the DFT coefficients. Furthermore, to improve extraction accuracy, a preprocessing method for the DFT magnitude coefficients is proposed. The specific steps are as follows.

**Figure 7.** Framework of watermark embedding process.

First, the minimum square regions that contain the LSFRs to be embedded are extracted in order. The luminance band of this area is converted to the DFT domain.

Second, the watermark information, a pseudorandom sequence *W* = {*w*(*i*) | *w*(*i*) ∈ {−1, 1}, *i* = 0, . . . , *l* − 1}, is generated by the secret key, where *l* is the length of the sequence. To achieve blind detection that can cope with the situation where the original size is unknown, the embedding radius *R* of *W* is set to a fixed value. Correspondingly, the embedding radius *R*1 of *W*RS is defined as:

$$R_1 = \text{round}\left(\frac{L_1}{L_0} \cdot R\right) \tag{12}$$

where *R*1 is the embedding radius in the square region and *L*1 is the side length of the square region. Because the DFT magnitude spectrum is centrosymmetric, a 180-degree semicircle suffices as the embedding region. The coordinates *W*RS(*xi*, *yi*) of the message embedding positions in the square region are defined as:

$$\begin{cases} x_i = \dfrac{L_1 + 1}{2} + \text{floor}\left[R_1 \cdot \cos\left(\dfrac{i}{l} \cdot \pi + \theta_d\right)\right] \\[2mm] y_i = \dfrac{L_1 + 1}{2} + \text{floor}\left[R_1 \cdot \sin\left(\dfrac{i}{l} \cdot \pi + \theta_d\right)\right] \end{cases} \tag{13}$$

where *i* indexes the elements of *W*; hence, the elements of the message are equally spaced around the center of the embedding region. θ*d* denotes the angle between the feature orientation of the LSFR and the normalized orientation.
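Equations (12) and (13) can be sketched together; the function below is illustrative, assuming the center of the square region is at (*L*1 + 1)/2:

```python
import numpy as np

def embedding_positions(L0, L1, R, l, theta_d):
    """Message coordinates of Equation (13): l positions equally spaced
    over a 180-degree arc of radius R1 (Equation (12)), offset by the
    feature orientation angle theta_d."""
    R1 = round(L1 / L0 * R)      # Equation (12)
    c = (L1 + 1) / 2             # assumed centre of the square region
    i = np.arange(l)
    xs = c + np.floor(R1 * np.cos(i / l * np.pi + theta_d))
    ys = c + np.floor(R1 * np.sin(i / l * np.pi + theta_d))
    return np.stack([xs, ys], axis=1).astype(int)
```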

Third, to obtain a better detection rate, the magnitudes *M* need to be preprocessed before signal embedding. In theory, the more obvious the difference between the magnitudes representing watermark information "1" and "−1", the better the message extraction results. Considering the variation rules of the magnitudes during the screen-cam process, we need to avoid high magnitude values at the positions that represent the watermark information "−1". Therefore, some extremely high magnitude values at these positions and in their neighborhoods need to be reduced. For a normal distribution, nearly 84% of the values are less than the mean plus one standard deviation. Hence, the preprocessing is defined as follows:

$$m_p(x, y) = \begin{cases} \overline{m}_p + \sigma_p & \text{if } m_p(x, y) > \overline{m}_p + \sigma_p \\ \text{no change} & \text{otherwise} \end{cases} \tag{14}$$

where *mp*(*x*, *y*) denotes the magnitudes at the positions that represent the watermark information "−1" and their eight neighboring magnitudes, and *m̄p* and σ*p* denote the mean value and standard deviation of these magnitudes, respectively.
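A sketch of the preprocessing of Equation (14), clipping the "−1" positions and their 8-neighborhoods at the mean plus one standard deviation of those magnitudes (the function name is illustrative):

```python
import numpy as np

def preprocess_magnitudes(M, neg_positions):
    """Equation (14): clip the magnitudes at '-1' positions and their
    8-neighbourhoods to mean + one standard deviation of those values."""
    M = M.copy()
    mask = np.zeros(M.shape, dtype=bool)
    for (x, y) in neg_positions:
        mask[max(0, x - 1):x + 2, max(0, y - 1):y + 2] = True
    vals = M[mask]
    cap = vals.mean() + vals.std()
    M[mask & (M > cap)] = cap   # values at or below the cap are unchanged
    return M
```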

The watermark signal is embedded in preprocessed magnitudes *M*<sup>P</sup> using the following equation:

$$M_w(x, y) = \begin{cases} \overline{m}_p + \beta \cdot \sigma_p & \text{if } W_{RS}(x, y) = 1 \\ \text{no change} & \text{if } W_{RS}(x, y) = -1 \end{cases} \tag{15}$$

where *Mw*(*x*, *y*) denotes the watermarked magnitudes and β the embedding strength. We set an initial value of β = 0.1*R* based on experience and adjust it according to the computed peak signal-to-noise ratio (PSNR). If the PSNR is less than 42 dB, β is reduced by 0.2; this process is iterated until the PSNR exceeds 42 dB.
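Equation (15) together with the PSNR-controlled strength adjustment can be sketched as below. This is an assumption-laden illustration, not the authors' code: the mean and standard deviation over all message positions stand in for *m̄p* and σ*p*, and the centrosymmetric twin of each modified coefficient is set as well so that the inverse transform stays real:

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2)
    return np.inf if mse == 0 else 10 * np.log10(peak**2 / mse)

def embed_block(block, positions, bits, beta0, min_psnr=42.0):
    """Equation (15) with strength adjustment: set '1' positions to
    mean + beta*std in the shifted magnitude spectrum, lower beta by 0.2
    while the PSNR stays below min_psnr (with a floor at beta = 0.2)."""
    N = block.shape[0]
    F = np.fft.fftshift(np.fft.fft2(block))
    mag, phase = np.abs(F), np.angle(F)
    idx = tuple(zip(*positions))
    m_bar, sigma = mag[idx].mean(), mag[idx].std()   # stand-ins for m_p stats
    ones = [p for p, b in zip(positions, bits) if b == 1]
    beta = beta0
    while True:
        M_w = mag.copy()
        for (x, y) in ones:
            M_w[x, y] = m_bar + beta * sigma
            # centrosymmetric twin keeps the spectrum conjugate-symmetric
            M_w[-x % N, -y % N] = m_bar + beta * sigma
        out = np.real(np.fft.ifft2(np.fft.ifftshift(M_w * np.exp(1j * phase))))
        if psnr(block, out) >= min_psnr or beta <= 0.2:
            return out, beta
        beta -= 0.2
```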

Last, *M*W is combined with the phase matrix ϕ, converted back to the watermarked luminance band of the square region, and transformed to the spatial domain. Only the pixel values within the LSFR are replaced, yielding a watermarked LSFR.

After all selected LSFRs are embedded, the embedding process is completed.

#### *3.3. Watermark Detection*

Figure 8 illustrates the watermark detection process, which can be divided into three steps: perspective correction, candidate region locating, and message extraction.

**Figure 8.** Framework of watermark detection process.
