Data Descriptor
Peer-Review Record

Identifying GNSS Signals Based on Their Radio Frequency (RF) Features—A Dataset with GNSS Raw Signals Based on Roof Antennas and Spectracom Generator

by Ruben Morales-Ferre *, Wenbo Wang, Alejandro Sanz-Abia and Elena-Simona Lohan
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 6 January 2020 / Revised: 7 February 2020 / Accepted: 13 February 2020 / Published: 17 February 2020
(This article belongs to the Special Issue Data from Smartphones and Wearables)

Round 1

Reviewer 1 Report

The paper describes a large dataset provided to the public for testing fingerprinting algorithms for the identification of spoofed signals.
While the paper and the data associated with it could be a valid contribution for the scientific community, several aspects of the paper need to be clarified. Without proper clarification, it is not possible to judge the content of the paper and of the data. Without such clarifications, the paper is not mature enough for publication.

1) Please properly define the problem considered: from Figure 5, it emerges that you are performing multi-class classification, considering 6 classes. Two of them use data from real antennas whereas 4 are generated using the Spectracom signal generator. If this is the case, the comparison does not seem fair: in the real data case the noise will prevail, and this is why the number of satellites does not matter. In the other case, no noise is considered. Are you really considering the problem of distinguishing between noisy and noiseless signals?

2) Why is noise not considered for the simulated case? Noise will always be present and will hide the useful signal contribution even in a spoofing scenario. Why should people be interested in distinguishing datasets with 1, 5 or 10 satellites without noise? The fingerprinting application does not seem convincing. Are you fingerprinting the (transmitting/receiving) device or the content of a specific dataset? Datasets without noise or without any other special feature have little utility. Please provide a strong justification for the data you are providing.

3) The authors have to provide a clear interpretation of Figures 3 and 4: what are we observing? In the real data case (Novatel and Antcom), the input signal will look like wideband noise (not far from being white). When applying the DWT, something close to the DWT of white noise should appear. In reality, the paper does not specify what is provided in Figures 3 and 4. The DWT should provide the coefficients associated with the different wavelets: a time-scale or time-frequency representation of the input signal should be provided. This is not the case in Figures 3 and 4: no scales or axes are provided. In Figure 3, the plots seem to be the constellation diagrams of the raw data: (a), (b) and (d) seem to be Gaussian noise, (e) the diagram of a BPSK signal, and (c) and (f) could be the results of combining different BPSK modulations. This point has to be clarified.

4) The classifier seems to have an easy life since the data are without noise, and the distinction between the different classes is probably made on the simple received power and constellation shape.

5) The authors have to provide a description of the different datasets provided: size of each dataset (Mbytes and duration in seconds), noise characteristics (on/off), synchronization of TX and RX, …

6) Page 3: why connect the simulator and the USRP? This would perfectly align the transmitter/receiver clocks. This would lead to a zero clock bias/drift and would justify the perfect BPSK observed in Figure 3.

7) Please check the gain: 30 dB vs 30 dBm

8) Please check the paper for typos: Novatel with only one ‘l’

9) Logistic Regression: clarify the labels associated with the {-1, 1} output. Which approach are you using to proceed from binary to multi-class classification? One vs all?
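To make the question in point 9 concrete, a one-vs-all scheme could look like the minimal sketch below. It is not taken from the manuscript: the function names, the toy 2-D features, the class names, the learning rate and the number of epochs are all hypothetical. Each class gets its own binary logistic regression in which the target class is labelled +1 and every other class -1; the predicted class is the one whose binary model returns the highest score.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary(X, y_pm, lr=0.1, epochs=500):
    """Gradient descent on the logistic loss with labels in {-1, +1}."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(X, y_pm):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Derivative of log(1 + exp(-y*z)) with respect to z.
            coeff = -y * sigmoid(-margin)
            for j, xj in enumerate(x):
                grad_w[j] += coeff * xj
            grad_b += coeff
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def train_one_vs_rest(X, labels):
    """One binary model per class: that class is +1, all others are -1."""
    models = {}
    for c in sorted(set(labels)):
        y_pm = [1 if label == c else -1 for label in labels]
        models[c] = train_binary(X, y_pm)
    return models


def predict(models, x):
    """Return the class whose binary model gives the largest score."""
    def score(c):
        w, b = models[c]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)


# Hypothetical toy features: three well-separated clusters.
X = [(0.0, 3.0), (0.5, 3.5), (-0.5, 2.5),
     (-3.0, -3.0), (-3.5, -2.5), (-2.5, -3.5),
     (3.0, -3.0), (3.5, -2.5), (2.5, -3.5)]
labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
models = train_one_vs_rest(X, labels)
print(predict(models, (0.0, 4.0)))
```

Stating explicitly which class maps to +1 in each binary problem, as done in `train_one_vs_rest`, is the kind of clarification the comment asks for.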

Author Response

Please see attached file.

Author Response File: Author Response.pdf

Reviewer 2 Report

The dataset is of great interest and it can for sure be used for many different types of analysis. Thank you for making it available in open access!

My main remarks are summarized below. The attached pdf file also contains more detailed comments.

1. It would help to provide more details on the setup:
* Antenna cable length and type (useful in RF FP to assess, e.g., standing wave effects)
* Location and time of each of the log files. It is said that it is possible to use a GNSS planner to know which satellites are visible, but this requires knowledge of the location and time.
* LO frequency of the USRP. "IF=0" is ambiguous as the different GNSS signals are not on the same carrier frequency. IF=0 can only be true for a given carrier (e.g. GPS L1 vs BeiDou B1).
* What is the Doppler for the single-satellite scenarios? (Fig 3c and 3e seem to indicate that the Doppler was zero.)
* Is there any frequency sync between the Spectracom and the USRP, or are they both running on independent frequency references? This is important to correctly interpret the Spectracom results.


2. The discussion of the results in Fig 3 and Fig 4 could be expanded. I miss a short explanation to correctly interpret the different shapes (e.g. are the two scatters in 3e linked to the BPSK modulation?).

3. As explained in the introduction, RF FP is about I/Q imbalance, amplifier non-linearities, phase noise, pass-band filter imperfections, etc. It would be good to show at least a result related to those aspects. The results shown in the paper actually tell more about the receiving antenna/cable (3a-b) or about the selected modulation (3c-3f). They actually don't tell a lot about the RF fingerprint of the transmitters. This is just a suggestion. The dataset is of interest independently of the example shown in the paper.
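The "IF=0" remark in point 1 can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that the USRP local oscillator is tuned to the GPS L1 carrier; the actual LO setting is not given in this record. The carrier values used are the nominal GPS L1 (1575.42 MHz) and BeiDou B1I (1561.098 MHz) frequencies, and the function name is hypothetical.

```python
# Nominal carrier frequencies in Hz.
CARRIERS_HZ = {
    "GPS L1": 1575.42e6,
    "BeiDou B1I": 1561.098e6,
}


def residual_if_hz(lo_hz):
    """Offset of each carrier from the LO after down-conversion."""
    return {name: fc - lo_hz for name, fc in CARRIERS_HZ.items()}


# Assumed LO setting (hypothetical): tuned exactly to GPS L1.
offsets = residual_if_hz(1575.42e6)
for name, offset in offsets.items():
    print(f"{name}: {offset / 1e6:+.3f} MHz")
```

With this assumed LO, GPS L1 sits at zero IF while BeiDou B1I ends up about 14.3 MHz below it, which is why the LO frequency needs to be stated alongside "IF=0".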

 

Comments for author File: Comments.pdf

Author Response

Please see attached file.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

While the authors have made some effort to improve the paper and respond to the problems identified during the previous round of review, most of the concerns previously indicated are still open.

First of all, the authors failed to justify the utility of the data provided. As already said, the data provided could be useful, but the machine learning/fingerprinting motivation is very weak and several problems with the approach described are still present in the paper.

The reviewer agrees with the authors when saying that the important contribution here is the availability of the data that could be used by the GNSS community. Thus, the paper should highlight it. This is still not the case for this manuscript. The machine learning approach is a possible application example and should be given less emphasis. This has to be reflected in a clear and substantial way in the introduction.

For example, it could be stated:

"Over the last decades, GNSS receiver technologies have evolved significantly, also thanks to the availability of Software Defined Radio (SDR) approaches, where baseband digital representations of raw GNSS data are used to test and tune new algorithms. SDR data require a large amount of memory and only a few datasets are openly available.

We provide a set of SDR data with different scenarios …

In particular, we focus on two distinct applications: real data collected under open-sky conditions from two different antennas and simulated data collected from a simulator in the absence of noise.

These data can be used for several purposes. For example, real data can be used to test acquisition and tracking algorithms.

[if the data have been collected in a synchronous way, then you can even indicate that people can use them for antenna array experiments (then it would be good to provide the baseline between the two antennas)]

Simulated data are provided without noise and can be used to define more advanced scenarios. For example, the interested user can add different noise levels and/or jamming signal components to the recorded data. In the datasets provided, we considered different signal and constellation combinations which can be used for defining complex scenarios including spoofing attacks. This can be done using dedicated software.….”
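The suggestion that users add their own noise to the noiseless recordings could be realized along the following lines. This is a minimal sketch under stated assumptions: the samples are already loaded as a list of Python complex numbers, the noise is complex white Gaussian, and the function name, the default seed and the -20 dB example value are hypothetical. It does not describe any tool shipped with the dataset.

```python
import math
import random


def add_awgn(samples, snr_db, seed=0):
    """Add complex white Gaussian noise to reach the requested SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(abs(s) ** 2 for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    # The noise power is split equally between the I and Q components.
    sigma = math.sqrt(noise_power / 2)
    return [s + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for s in samples]


# Hypothetical noiseless BPSK-like samples.
clean = [complex(1.0 if i % 2 else -1.0, 0.0) for i in range(1000)]
# Example target SNR: signal 20 dB below the noise.
noisy = add_awgn(clean, snr_db=-20.0)
```

Sweeping `snr_db` over a range of values is one way to build the "more advanced scenarios" mentioned in the suggested text.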

Then, you can introduce the fingerprinting application.

“In this paper, we focus on fingerprinting applications…”

After this, you should really specify that you are targeting two applications, which are quite separate:

1) Distinguishing real data from different antennas [Figure 6]

2) Distinguishing clean data with different modulations and different number of satellites [Figure 7]

Sorry, but you are not demonstrating at all the ability to distinguish different transmitters or spoofed versus real signals. THIS IS NOT A PROBLEM: just describe the contribution you are giving (it is sufficient!). Do not state unjustified claims!

While you can leave the machine learning application you do have to clarify your approach:

a) State the classification problems you are trying to solve:

Distinguish data from two antennas

Distinguish simulated data with different constellation/signal combination

Describe the classes explicitly!

b) The DWT image and the interpretation of Figures 4 and 5 are still unclear; see the comment below. Before a machine understands the data, you do have to understand them.

Additional comments are provided below. This time please try to address the comments provided. Most of the previous comments were practically ignored.

Typos: check for typos – Now Novatel is ‘Novetel’ (check also figures)

Page 3: the newly added sentence:

“This synchronization is specially important to guarantee that two signals contain the same data, but received (or generated) with a different antenna.”

is unclear and should be corrected. The sentence suggests that you are asking the Spectracom to generate the same constellation/data provided by the rooftop antenna.

Suggestion: “This setup allows one to collect data at the same time, with the same hardware and clock features minimizing the differences introduced by the reception platform”.

Also:

“The satellites in view might change during the time one recording channel is switched to the other.”

Not sure this is relevant.

“Recording with both channels at the same time (using the same clock) we guarantee that the satellites in view are exactly the same, and also that both recordings are sampled at the same time.”

Again: how could you guarantee this condition? Are you asking the Spectracom to generate the same constellation seen through the real antenna? This new paragraph is very confusing. If you are referring to the case where you are using two real antennas, please explicitly say so.

Spectracom clock synchronization: the “perfect” clock synchronization is not the worst-case scenario. It is actually one of the best-case scenarios: when the receiver tries to compute its position, it will find a flat clock bias, i.e. an unrealistically stable clock. Clock monitoring is one of the most used approaches for spoofing detection.

Page 5: problems with caption of Table 2: It appears that a note is present in the caption.

Section 4,
“Next a CERTAIN time-frequency transform of the data is carried out, in order to produce a CERTAIN image that will be post-processed. To the different images a feature extraction is performed in order to train a CERTAIN Machine Learning algorithm”

Please avoid using “CERTAIN”. It makes the sentence very unspecific.

Page 6, “outputs of the filtering” --> “outputs of filtering” - “with the filtering” --> “with filtering”

“cA and cD matrices” these matrices and the corresponding vectors are not defined with respect to equations (1), (2) and (3). Please provide a clear definition. The description provided in the updated version of the paper is still not useful. Figure 3 is clearer, but then cA and cD are two time-domain signals defining a 2x(N/2) image. N is the original size of the signal.

The results part is still very unclear. The original signals collected are complex valued since no IF was introduced and IQ sampling was adopted. If cA and cD are the low/high pass components of the original samples, then they are also complex valued. Then how could the 6 plots in Figure 4/5 be the plots of the cA vs cD components? No change was made to the figures. No effort to improve the readability of the images and to explain their content was made. The criticalities identified in the previous version of the paper have not been addressed. Please check again the comments provided in the previous round of reviews.
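The point about cA and cD can be checked with a single-level DWT. The sketch below uses the Haar wavelet only because it is the simplest case; the wavelet actually used in the manuscript is not stated in this record, and the function name and test samples are hypothetical. For N complex input samples, both outputs have N/2 samples and both remain complex valued.

```python
import math


def haar_dwt(x):
    """Single-level Haar DWT: returns (cA, cD), each of length len(x) // 2."""
    if len(x) % 2:
        raise ValueError("input length must be even")
    scale = 1.0 / math.sqrt(2.0)
    # Low-pass branch: scaled sums of neighbouring samples.
    cA = [(x[i] + x[i + 1]) * scale for i in range(0, len(x), 2)]
    # High-pass branch: scaled differences of neighbouring samples.
    cD = [(x[i] - x[i + 1]) * scale for i in range(0, len(x), 2)]
    return cA, cD


# Hypothetical complex I/Q samples.
iq = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
cA, cD = haar_dwt(iq)
print(len(cA), len(cD), cA[0], cD[0])
```

Since `cA` and `cD` are complex here, plotting one against the other requires an extra step (for example taking magnitudes or real parts), and that step would need to be described for the figures to be interpretable.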

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 1 Report

The authors have significantly improved the quality of their paper by addressing the comments of the reviewer.

In particular, the paper now focuses more on the datasets themselves and highlights the fact that RFF is only one potential application.

Also, the lack of clarity of the previous images is now solved (by removing the DWT plots). The TKEO operator provides a real signal component that can be plotted as a function of time, as done in Figure 3. As done during the previous reviews, I recommend adding boxes around each figure and adding scales and axes. While it is clear that the machine learning approach does not use such information, it will be useful for human interpretation (also to gain some insight about what the machine learning approach is doing).
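Why the TKEO output is real valued can be seen from one common complex-valued form of the Teager-Kaiser energy operator, Ψ[x(n)] = |x(n)|² − Re{x(n−1) x*(n+1)}. The sketch below assumes this form; the exact definition used in the manuscript is not given in this record, and the function name and test tone are hypothetical.

```python
import cmath
import math


def tkeo(x):
    """Teager-Kaiser energy of a complex sequence (interior samples only)."""
    return [abs(x[n]) ** 2 - (x[n - 1] * x[n + 1].conjugate()).real
            for n in range(1, len(x) - 1)]


# Hypothetical test tone: unit-amplitude complex exponential, pi/4 rad/sample.
tone = [cmath.exp(1j * math.pi / 4 * n) for n in range(16)]
energy = tkeo(tone)
print(energy[:3])
```

For a complex exponential of amplitude A and angular frequency ω, this form gives the constant 2A² sin²(ω), so the output depends on both amplitude and frequency and can be plotted directly against time with labelled axes.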

Please proofread the paper carefully:

Example:

“This paper is a data descriptor paper for a dataset of raw GNSS signals”-->“This is a data descriptor paper for a set of raw GNSS signals”

“such as the testing GNSS acquisition” --> “such as testing GNSS acquisition”

“The use cases of such data spread in multiple directions are left to the choice of the research community” --> “The use cases of such data spread in multiple directions, which are left to the choice of the research community”

“sub-set of rff problem” --> “sub-set of RFF problem”

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
