Article
Peer-Review Record

DEFS—Data Exchange with Free Sample Protocol

Electronics 2021, 10(12), 1455; https://doi.org/10.3390/electronics10121455
by Rafael Genés-Durán 1,*, Juan Hernández-Serrano 1, Oscar Esparza 1, Marta Bellés-Muñoz 2 and José Luis Muñoz-Tapia 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 16 May 2021 / Revised: 5 June 2021 / Accepted: 15 June 2021 / Published: 17 June 2021
(This article belongs to the Special Issue Security and Privacy for Data Decentralized Marketplaces)

Round 1

Reviewer 1 Report

The paper proposes a setting and protocols to enable consumers to see a free random sample of data from providers. The general machinery developed seems reasonable. It fundamentally relies upon the notion of using Merkle hash trees (MHTs) and their proofs (MPs) for authentication, utilising a claim that the root MRK can be used as proof of correctness of the whole tree. I did not have any major issues with the protocol.

However, questions/concerns about the overall setting would seem to need to be answered/addressed to ensure the method is fit for purpose and that the benefits and limitations are clearly indicated:  

  • There needs to be off-blockchain traffic exchanged in a secure manner (L357, pg 9) and the consumer needs to undertake some formal verification of data formatting (L427, pg 12) – could these requirements reduce the number of possible consumers available?
  • There is no means to consider the dataset quality (L624, pg 17), so if a malicious provider selected good data to send (pg 18, L658) then could it be the case that a sufficient number of consumers may receive good data and proceed, even if some proportion received bad data and hence do not proceed? (somewhat similar to large scale phishing). How would this be mitigated against?
  • In terms of attacks, there are some assumptions made on MHTs (pg 18, L651, 663) – how reasonable are these, and is there evidence for this? In general, since the MHTs are crucial to the setup, it would be good to include formal statements/proofs relating to these or clear specific reference to any results used.
  • Why would it be unfeasible for a malicious consumer to create more than 50 identities (pg 20, L722)? There seems to be a high likelihood of obtaining a large proportion of the dataset (pg21, L733), which seems a highly significant issue? Are there examples of domains where this is not a concern? It seems that identifying any instances where the complete data set is required might help here, even if it limits the widespread applicability.
  • Mitigating identity replication by “strongly authenticating consumers through the off-chain channel” (pg 21, L735) seems rather a strong requirement - if there is a mechanism for this, would it in any way limit the need for the protocols proposed here?

Minor corrections:

Pg 1. complete neediness of ->  need for

Pg 2: critic -> critical

Pg 7: check uses of notation throughout the paper, such as that in L287 – eg use of subscript i, and decide whether one actually needs the natural number symbol here.

Pg 8: check consistency of notation throughout the paper, such as that in L312 and 315 – with i appearing in italic or bold font in various places.

Pg 9: L350, assure to the customer -> assure the customer

Pg 11, L413, should MHT(K) be MHT(K_i)?  Also: some piece of example-> a partial example

Pg 14, fig 7: is the top instance of “>timeout 3” supposed to be “<timeout 3”?

Pg 14, L512: that has -> that the provider has

Pg 14, L535: is unable -> being unable

Pg 16, L588: delete unmatched “)”

Pg 19, L701 (and other places): delete “a” from “a 10% of samples”. Improve the Table 2 contents to clarify meanings (eg there is no indication of r in the related paragraph yet it appears as a column header).

Pg19, L718: the permutations -> the number of permutations  (several times)

Author Response

We would like to thank the reviewer for the valuable comments, and we hope that we have addressed and clarified all the concerns and improved the quality of the final version of the paper.

There needs to be off-blockchain traffic exchanged in a secure manner (L357, pg 9) and the consumer needs to undertake some formal verification of data formatting (L427, pg 12) – could these requirements reduce the number of possible consumers available?

In our humble opinion, both requirements can easily be met by any consumer application, since they are the same requirements most Internet APIs hold.
The first requirement, a secure channel, can be met using the widely supported TLS protocol -- e.g. with HTTPS. TLS only requires the server to hold a valid certificate (and its complementary private key) in order to create the secure channel. That is to say that consumers just need a valid TLS client, which is implemented by default in most programming languages, application frameworks, and web browsers.
The second requirement, related to the data format, is also relatively standard. The process that verifies that a data portion meets a predefined format is usually called a validator, which many technologies use to check received responses before processing them. In object-oriented programming this is usually done by trying to parse the response as a given type of object, which throws an error if it does not match. There are also specific standards with well-known implementations, such as JSON-LD, that help define and validate specific data schemas.
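As a minimal sketch of the validator idea, assuming a hypothetical JSON portion format with "index" and "data" fields (field names and the hex encoding are illustrative choices, not the paper's specification):

```python
from dataclasses import dataclass


@dataclass
class Portion:
    """Hypothetical data portion: an integer index plus a payload."""
    index: int
    data: bytes


def validate_portion(raw: dict) -> Portion:
    """Parse a raw response into a Portion, raising on any format violation.

    This mirrors the object-oriented validation pattern mentioned above:
    trying to build a typed object fails loudly on malformed input.
    """
    if not isinstance(raw.get("index"), int):
        raise ValueError("index must be an integer")
    try:
        data = bytes.fromhex(raw["data"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError("data must be a hex string") from exc
    return Portion(index=raw["index"], data=data)
```

A consumer would run such a check on every received sample before processing it, exactly as it would for any ordinary Internet API response.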

As the reviewer points out, we had not properly explained all this crucial information in the paper, and for this reason we have added a new requirements section (Section 4.3.1) clearly stating the previous reasoning.

There is no means to consider the dataset quality (L624, pg 17), so if a malicious provider selected good data to send (pg 18, L658) then could it be the case that a sufficient number of consumers may receive good data and proceed, even if some proportion received bad data and hence do not proceed? (somewhat similar to large scale phishing). How would this be mitigated against?

As the reviewer properly states, this attack is very important, and we may not have addressed it clearly enough in the previous version of the article. The DEFS protocol is designed so that the provider commits the shuffled encrypted data at the very beginning of the protocol.
Since the consumer requests some random samples, the provider has no control over the samples that will be revealed later on. This way, providers cannot really prepare the data, because they cannot predict in advance which samples are going to be sent to the consumer.
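The commit-before-sampling argument can be sketched as follows. For brevity the commitment is simplified to a single SHA-256 hash over the shuffled encrypted portions (the actual protocol commits a Merkle root), and all names are illustrative:

```python
import hashlib
import random


def commit(encrypted_portions):
    """Provider's up-front commitment to the shuffled encrypted portions.

    Simplified to one hash over the concatenation; DEFS itself commits a
    Merkle root, but the binding property illustrated here is the same.
    """
    h = hashlib.sha256()
    for c in encrypted_portions:
        h.update(c)
    return h.hexdigest()


# Provider shuffles and commits BEFORE knowing which samples will be asked.
portions = [f"portion-{i}".encode() for i in range(100)]
random.Random(42).shuffle(portions)  # fixed seed only to make the demo repeatable
commitment = commit(portions)

# Only now does the consumer pick random sample indices. The provider cannot
# substitute "good" data for just those indices without invalidating the
# already-published commitment.
sample_indices = random.sample(range(100), 5)
revealed = [portions[i] for i in sample_indices]
assert commit(portions) == commitment  # any tampering would break this check
```

Because the commitment is fixed before the sample indices exist, cherry-picking good samples after the fact is equivalent to finding a hash collision.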

Following the reviewer's suggestion, we have added a clarification of this matter in the explanation of the protocol execution (Step 2 in Section 4.1).

In terms of attacks, there are some assumptions made on MHTs (pg 18, L651, 663) – how reasonable are these, and is there evidence for this? It would be good to include formal statements/proofs relating to these or clear specific reference to any results used.

The reviewer is very right. Many of our assumptions rely on the security of Merkle proofs and cryptographic hash functions, and we did not address these issues properly in the previous version of the paper. According to the reviewer's suggestion, in this version we include literature about MHT and hash robustness in Section 2.3.

In the paper we have included the following text to clarify the security of Merkle proofs:
  
  "The security of an MP reduces to the collision resistance of the underlying hash function [8]. For this reason, we assume the hash function used to build MHTs is cryptographically secure; that is, the probability of finding a preimage or a hash collision is negligible [9]."

In addition, we have added the following bibliographic references:

  [8] Dahlberg, R.; Pulls, T.; Peeters, R. Efficient Sparse Merkle Trees: Caching Strategies and Secure (Non-)Membership Proofs. Cryptology ePrint Archive, Report 2016/683, 2016. https://eprint.iacr.org/2016/683.

  [9] Al-Kuwari, S.; Davenport, J.H.; Bradford, R.J. Cryptographic Hash Functions: Recent Design Trends and Security Notions. Cryptology ePrint Archive, Report 2011/565, 2011. https://eprint.iacr.org/2011/565.
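The reduction can be seen directly in code: accepting a forged Merkle proof requires producing a hash collision somewhere along the authentication path. The following sketch (SHA-256, power-of-two leaf count, left-to-right concatenation; all simplifying assumptions, not the paper's exact encoding) builds an MHT and verifies an MP:

```python
import hashlib


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Root of a Merkle tree over the hashed leaves (len must be a power of two)."""
    level = [H(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def merkle_proof(leaves, index):
    """Sibling hashes from leaf `index` up to (but excluding) the root."""
    level = [H(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        proof.append(level[index ^ 1])  # sibling at this level
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof


def verify(leaf, index, proof, root) -> bool:
    """Recompute the path; forging an accepting proof implies a collision in H."""
    acc = H(leaf)
    for sibling in proof:
        acc = H(sibling + acc) if index & 1 else H(acc + sibling)
        index //= 2
    return acc == root
```

A proof is only log2(n) hashes long, which is why committing the root alone suffices to later authenticate any individual leaf.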

Why would it be unfeasible for a malicious consumer to create more than 50 identities (pg 20, L722)? There seems to be a high likelihood of obtaining a large proportion of the dataset (pg21, L733), which seems a highly significant issue? Are there examples of domains where this is not a concern? It seems that identifying any instances where the complete data set is required might help here, even if it limits the widespread applicability. Mitigating identity replication by “strongly authenticating consumers through the off-chain channel” (pg 21, L735) seems rather a strong requirement - if there is a mechanism for this, would it in any way limit the need for the protocols proposed here? 

Definitely, we agree with the reviewer that identification is needed even if it limits the widespread applicability.
When we said "strongly authenticating consumers through the off-chain channel", we meant that there are off-chain ways of limiting the number of identities an attacker could create. A known example is binding the identity to an e-mail account or a mobile phone.
Depending on the price of the traded data and the type of consumers a provider could have, the provider should decide the most suitable authentication method.
For example, in some cases, authenticating with an e-mail address can be enough. In other cases, providers might require authenticating with a mobile phone, or even with both e-mail and mobile phone. In some specific cases, authentication could involve more factors, such as physical key generators, smart cards, etc.
To address the reviewer's concern, we have added the previous reasoning to the paper so that it now becomes clear for the reader. Once again, all this clarification has been included in the new Requirements section (Section 4.3.1).
  
Regarding the ability of an attacker to generate more than 50 identities, in Section 5.3 we provide tools that guide providers to select the number of samples to disclose for free while minimizing the risk of disclosing a representative part of the dataset when an attacker can hold k identities. k=50 was just an example to illustrate the potential of such an attack. In any case, providers should decide their own value for k based on the identification method in use and the risk they are willing to assume.

Once again, the previous version of the paper was not clear about this reasoning. For this reason, we have added a paragraph in the protocol preparation (Section 4.3.3, line 318), where the provider decides the number of samples v to release. We have also added a sentence in the analysis (Section 5.3, line 736) to recall that the identification system in use should minimize the likelihood of an attacker obtaining the maximum number of identities k used in the analysis (50 in the example provided).
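The trade-off between v (free samples per identity) and k (identities an attacker can hold) can be illustrated with a back-of-the-envelope estimate. Assuming, as a simplification, that each identity independently receives v uniformly random samples out of n portions (the paper's Section 5.3 analysis is the authoritative model):

```python
def expected_disclosed_fraction(n: int, v: int, k: int) -> float:
    """Expected fraction of an n-portion dataset learned by an attacker
    holding k identities, each independently receiving v uniformly random
    free samples. Simplified model: a given portion stays hidden only if
    all k identities miss it, i.e. with probability (1 - v/n)**k.
    """
    return 1.0 - (1.0 - v / n) ** k


# With n=1000 portions, v=100 free samples per identity, and k=50
# identities, almost the entire dataset leaks:
# expected_disclosed_fraction(1000, 100, 50) ≈ 0.995
```

This is why both limiting v and raising the cost of acquiring identities matter: the leaked fraction grows exponentially in k for any fixed sampling rate v/n.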


According to the reviewer's suggestion, we have fixed the following minor corrections: 

  • Pg 1. complete neediness of ->  need for
  • Pg 2: critic -> critical
  • Pg 7: check uses of notation throughout the paper, such as that in L287 – eg use of subscript i, and decide whether one actually needs the natural number symbol here. (Natural number symbols removed).
  • Pg 8: check consistency of notation throughout the paper, such as that in L312 and 315 – with i appearing in italic or bold font in various places.
  • Pg 9: L350, assure to the customer -> assure the customer.
  • Pg 11, L413, should MHT(K) be MHT(K_i)? MHT(K) refers to the whole tree of K. Also: some piece of example-> a partial example.
  • Pg 14, fig 7: is the top instance of “>timeout 3” supposed to be “<timeout 3”?
  • Pg 14, L512: that has -> that the provider has
  • Pg 14, L535: is unable -> being unable
  • Pg 16, L588: delete unmatched “)” -> It refers to the enumerate '7)'.
  • Pg 19, L701 (and other places): delete “a” from “a 10% of samples”. Improve the Table 2 contents to clarify meanings (eg there is no indication of r in the related paragraph yet it appears as a column header).
  • Pg19, L718: the permutations -> the number of permutations (several times)

 

Author Response File: Author Response.pdf

Reviewer 2 Report

In the paper the authors propose a new protocol for fair data exchange between data providers and consumers. The paper is well written and easily understandable.

Some comments:

Section 4.3.2 item 6: For me it seems that the data is generated in this step. But shouldn't it exist even before the protocol starts to be executed?

Figure 6: There is a big frame called alt. But I think items 9 to 12 should not be included into this frame.

Formula on top of page 12: There are some hash evaluations missing. (Or are they done by concat?) Same for the formula on page 16.

Formula at the bottom of page 12: It should say for all i \in 0, ..., n-1

Section 4.3.4 1.: How can it happen that the consumer detects that a specific key is wrong? As far as I see, the consumer has only the set K of keys as well as the root MRK of the tree. By observing that building the MHT on top of K doesn't lead to the root MRK, the consumer can detect that K is wrong but not which specific key is wrong.

Table 2: What is r? (Shouldn't it be v?) What is r in the caption of figure 11?

Section 5.3: What happens if several consumers who are interested in the same data collaborate?

--

Typos:

line 280: producer --> provider

line 287: symmetrical --> symmetric

line 331: same than --> same as

line 339: complete key set --> complete set

line 363: then number --> the number

---

Overall, the paper is in my opinion a solid piece of work which deserves, after working on the above mentioned points, to be published.

 

Author Response

We would like to thank the reviewer for the valuable comments, and we hope that we have addressed and clarified all the concerns and improved the quality of the final version of the paper.

Section 4.3.2 item 6: For me it seems that the data is generated in this step. But shouldn't it exist even before the protocol starts to be executed? 

We agree with the reviewer that the previous version of the paper led to some misunderstanding about data preparation. To clarify, we would like to mention that the *data* to be exchanged exists before the execution of the DEFS protocol. When DEFS starts, what it does is generate the array of portions D from the pre-existing data to be traded.

In Section 4.3.2 item 6, we have clarified this as follows: 
  
  "Now, using the pre-existing data to be exchanged, the provider has to build
  an array ($D=[d_0 ... d_{n-1}]$) with the portions, where each $d_i$ has the corresponding data and the index as a header: $d_i=concat(i,data_i)$."
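The clarified construction of D can be sketched in a few lines. The 4-byte big-endian index header is an illustrative encoding choice; the paper does not fix a particular one:

```python
import struct


def build_portion_array(data_chunks):
    """Build D = [d_0 ... d_{n-1}] where d_i = concat(i, data_i).

    Each portion is prefixed with its index so that, after shuffling and
    decryption, a consumer can tell which position a portion came from.
    The ">I" (4-byte big-endian) header format is an assumption made
    here for concreteness.
    """
    return [struct.pack(">I", i) + chunk for i, chunk in enumerate(data_chunks)]
```

For example, `build_portion_array([b"aa", b"bb"])[1]` is the bytes of index 1 followed by `b"bb"`.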

Figure 6: There is a big frame called alt. But I think items 9 to 12 should not be included into this frame. 

In fact, items 9 to 12 have to be included in the Alternative 2 frame. The misunderstanding probably comes from the explanation of Figure 6, which was not detailed enough. In more detail, the frame in Figure 6 shows the two alternatives that are possible in the protocol execution phase once the consumer has made the payment:

  • Alternative 1 (seed not released): the consumer has paid for disclosing the seed that will be used to decrypt the data, but the provider does not publish the seed in the SC before Timeout1 expires. In this case, the consumer can ask for reimbursement (Step 8).
  • Alternative 2 (seed released): the provider releases the seed to the SC (Step 9). This should be the normal operation. After that, the consumer is informed about that fact (Step 10) and the provider gets paid for the data (Step 12). Notice that Steps 9, 10, and 12 only happen if the seed is released, so items 9 and 12 need to be included in this alternative.
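The two alternatives can be sketched as a toy state machine. This is an in-memory stand-in for the smart contract; the class name, fields, and timing logic are all illustrative, not the contract's actual interface:

```python
import time


class SeedEscrow:
    """Toy model of the contract logic after payment: either the provider
    publishes the seed before Timeout1 (Alternative 2, the happy path) or
    the consumer can claim a refund (Alternative 1)."""

    def __init__(self, timeout1: float):
        self.deadline = time.time() + timeout1
        self.seed = None
        self.paid_out = False
        self.refunded = False

    def release_seed(self, seed: bytes):
        # Step 9: provider publishes the seed in time ...
        if time.time() <= self.deadline and not self.refunded:
            self.seed = seed
            # Step 12: ... and consequently gets paid for the data.
            self.paid_out = True

    def request_refund(self):
        # Step 8: no seed before Timeout1, so the consumer is reimbursed.
        if self.seed is None and time.time() > self.deadline:
            self.refunded = True
```

The mutual exclusion of the two branches is the point: once the seed is on-chain the refund path closes, and once the timeout passes without a seed the payment path closes.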

In the current version of the paper we clarify that Alternative 2 is the happy path of the protocol (including steps 9 and 12) while Alternative 1 is an unhappy path.

Formula on top of page 12: There are some hash evaluations missing. (Or are they done by concat?) same for the formula on page 16. 

The reviewer is right and we apologize for that. We have updated the formulas on pages 12 and 16 to include the hash functions that were missing:

  • hash(concat(h01,hash(concat(h2,hash(k3))))) == MRK
  • hash(concat(h01,hash(concat(h2,hash(c3))))) == MRC
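The corrected expressions can be checked numerically. The sketch below instantiates a 4-leaf tree with SHA-256 and byte concatenation (placeholder key values; `hash` and `concat` are named to mirror the formulas, deliberately shadowing Python's builtin `hash`):

```python
import hashlib


def hash(b: bytes) -> bytes:  # noqa: A001 - mirrors the paper's notation
    return hashlib.sha256(b).digest()


def concat(a: bytes, b: bytes) -> bytes:
    return a + b


# Illustrative 4-leaf tree over keys k0..k3 (values are placeholders).
k0, k1, k2, k3 = b"k0", b"k1", b"k2", b"k3"
h0, h1, h2 = hash(k0), hash(k1), hash(k2)
h01 = hash(concat(h0, h1))
h23 = hash(concat(h2, hash(k3)))
MRK = hash(concat(h01, h23))  # the committed Merkle root

# Verifying k3 against the Merkle proof [h2, h01] reproduces MRK,
# exactly as in the corrected formula:
assert hash(concat(h01, hash(concat(h2, hash(k3))))) == MRK
```

The MRC check works identically, with the ciphertext c3 in place of the key k3.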

Formula at the bottom of page 12: It should say for all I \in 0, ..., n-1 

We totally agree with the reviewer and have updated the formula accordingly.

Section 4.3.4 1.: How can it happen that the consumer detects that a specific key is wrong? As far as I see, the consumer has only the set K of keys as well as the root MRK of the tree. By observing that building the MHT on top of K doesn't lead to the root MRK, the consumer can detect that K is wrong but not which specific key is wrong.

Indeed, this case was not clearly presented in the previous version of the paper. In general, as the reviewer properly notices, there is no way a consumer can detect that a specific key is wrong from a Merkle root alone. What happens is that conflictK manages an edge case in which the wrong key is one of the samples or a sibling of a sample. For these keys, the consumer has a Merkle proof that matches the committed MRK. Then, if one of these keys does not follow the agreed format (hash(i+s)), the consumer can send the hash of the wrong key, its Merkle proof, and the index in conflict to prove to the smart contract that the provider committed an incorrect MRK.
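The consumer-side detection behind conflictK can be sketched as follows, assuming the agreed key format hash(i + s) with a 4-byte big-endian index encoding (an illustrative choice; the paper fixes only the hash(i+s) structure):

```python
import hashlib
import struct


def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()


def expected_key(i: int, seed: bytes) -> bytes:
    """Key the provider should have committed for index i: hash(i + s)."""
    return H(struct.pack(">I", i) + seed)


def detect_conflict_k(i: int, revealed_key: bytes, seed: bytes) -> bool:
    """True if a key that is already authenticated by its Merkle proof
    against the committed MRK does not follow the agreed format.

    In that case the consumer holds everything needed to prove conflictK
    to the smart contract: the hash of the wrong key, its Merkle proof,
    and the conflicting index.
    """
    return revealed_key != expected_key(i, seed)
```

Note the check only applies to sampled keys and their siblings, because only for those does the consumer hold an MP matching the committed MRK.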

We would like to especially thank the reviewer for raising this issue and making us realize that the explanation in Section 4.3.4 needed to be rebuilt to explain the case more accurately. While rewriting this case, we realized that the provider's interaction with the smart contract was not needed, since the consumer already has all the necessary data to prove conflictK to the smart contract. We have rebuilt the figures accordingly.

Table 2: What is r? (Shouldn't it be v?) What is r in the caption of figure 11? 

This is absolutely correct; following the reviewer's suggestion, we have updated the variable.

Section 5.3: What happens if several consumers who are interested in the same data collaborate? 

The reviewer is right: we did not mention this attack in the previous version of the paper. The fact is that several consumers agreeing to collaborate to collect more samples for free is equivalent to a single consumer creating several identities, which is already analyzed in Section 5.3. As we state there, the provider can control the number of free samples she wishes to disclose to mitigate the impact of an attack of this type. Since the two attacks may not seem equivalent at first sight, we have included a clarification about this in Section 5.3.

According to the reviewer's suggestion, we have fixed the following typos:

  • L280: producer --> provider
  • L287: symmetrical --> symmetric
  • L331: same than --> same as
  • L339: complete key set --> complete set
  • L363: then number --> the number

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The response and update addressed earlier concerns and improved the quality of the paper. 
