Article
Peer-Review Record

Geographic Knowledge Graph Attribute Normalization: Improving the Accuracy by Fusing Optimal Granularity Clustering and Co-Occurrence Analysis

ISPRS Int. J. Geo-Inf. 2022, 11(7), 360; https://doi.org/10.3390/ijgi11070360
by Chuan Yin 1,2, Binyu Zhang 1, Wanzeng Liu 3,*, Mingyi Du 1, Nana Luo 1, Xi Zhai 3 and Tu Ba 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 29 March 2022 / Revised: 24 May 2022 / Accepted: 20 June 2022 / Published: 23 June 2022

Round 1

Reviewer 1 Report

This study proposes a classification system for geographic features, i.e. using a community discovery algorithm to classify feature names. More specifically the authors assess a process that combines clustering and co-occurrence analysis for feature normalization, through the development of a semantic information model that has been trained to provide the optimal correspondence to geographic data. The topic of the article is interesting and the research is timely and worthwhile, as the authors provide a new insight into the field of geographic knowledge systems. Therefore, the article proposes an attractive topic for the academic community; however, I believe it requires some adjustments for greater comprehension and quality improvement.

Authors should improve the quality of the abstract and follow a structured style, which is based on the IMRAD structure of a paper. The abstract should state briefly the purpose of the research, the principal results, and major conclusions, as it must be able to stand alone.

The structure of the paper is clear, logical and academically challenging, and the authors provide a short description of the issue and an extensive state of the art with references on the specific topic. However, in my opinion the Introduction should be separated from the state of the art which, for its part, should be integrated into the related works section. Subsequently, the authors should take into account more current and also broader work that has appeared in the published literature.

In this context, the authors are encouraged to further discuss the outcomes and how they can be interpreted from the perspective of currently available studies.

The diagrammatic presentation in this study may well be the strongest aspect of this work. I suggest adding a visual presentation of the results (where applicable) in section 4.3 EXPERIMENTAL RESULTS AND ANALYSIS, in order to further improve the presentability of the paper.

Overall, I consider the work adequate, but it can be improved by addressing the aforementioned issues. Also, the paper has to be proofread.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

There are especially the following issues:

  1. The paper is not well written; it is rather hard to read and to understand the novelty of the proposed method.
  2. The experiments do not include any comparison with other state-of-the-art methods.
  3. There are some missing references.
  4. There are a lot of grammar mistakes, especially missing articles.

More precisely:

  1. The paper is not well written. Some examples:
    • We therefore normalize the attribute of ...
      • Which attribute do you consider? the attribute means that you refer to a previous term 'an attribute'.
    • p2: Influenced by the semantics of geography, researchers use different vocabulary when describing ...
      • It is unclear, rewrite the sentence.
    • p2: relation vs relationship?
    • p2: In this paper we focus on geo-semantic attribute alignment and classify semantic description vocabulary of spatial relationships to facilitate spatial computation and knowledge inference.
      • There are a lot of new terms, it is necessary to introduce them.
    • p3: ignoring the features on the structure of entity attributes
      • Where is the structure introduced?
    • p3: Word vector is not properly introduced.
      • where 0.543 represents an explanatory feature of the word vector.
        • It is not clear.
    • p5, Equation 2: There is no reference or explanation to this Equation.
    • p5: According to the structure of the web encyclopedia data
      • Where is the data introduced?
    • p6: in the high-dimensional vector space, which has a greater
      • Where is the vector space introduced?
  2. The experiments do not include any comparison with other state-of-the-art methods.
  3. There are some missing references. Some examples:
    • p2: Geographic knowledge graph is one major representation. KG (Knowledge Graph) ....
    • p2: In order to improve the quality and enhance the ...
    • p2: Existing methods for attribute normalization include ...
  4. There are a lot of grammar mistakes, especially missing articles. Some examples:
    • of attributes and rule inference
      • of attribute and rule inferences?
      • of the attribute and rule inference?
    • by using ? manual discrimination
    • to achieve ? geographic knowledge service system
    • have ? flexible form
    • by calculating ? text similarity,
    • for ? attribute normalization
    • to compute ? string similarity
    • transformed ? attribute alignment

 

Other notices:

  • p2: "[19]using overlap" -> "[19] using overlap"
  • p2: "Liu[22] et al." -> "Liu [22] et al."
  • "knowledge base .And" -> "knowledge base. And"
  • "methods .But it relies" -> "methods. But it relies"
  • Sometimes long phrases are used:
    • geo-graphic attribute name representation
    • the target attribute name text  classification
    • geographic attribute name normalization
    • a data-driven fine-grained alignment method
    • the contextual relevance-based synonym mining method
  • p8: n and m are defined again after Equation 4.
  • And many others.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The article presents an approach to building a knowledge graph by geographical entities. Methods for searching for similar entities, clustering and combining entities are proposed.

According to the introduction and title, it is expected that the result of the work will be a knowledge graph obtained by applying the approach. However, the authors focused more on describing approaches for searching for entities, clustering entities and combining them. At the same time, there are a number of questions about the algorithms. For example, how is the center of the cluster manually selected? How is the normalized attribute value chosen? How are relationships between entities and clusters built?

The clustering method itself in section 3.2 is not entirely clear. It may be worth reworking the example to explain the idea more clearly.

It is also not entirely clear why existing ontologies and knowledge graphs containing geographical knowledge are not used. Why are they worse than the model obtained as a result of the proposed approach? Why shouldn't they be used as a ground truth or a skeleton for clustering and searching for similar entities?

Only 9 of the 32 references are from the last 5 years, and only 2 of them are from 2020 or later. It would be good to include more recent research on similarity in geographic knowledge.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

In the new revision, the authors did very good work editing the manuscript to make it clearer. Also, all responses clearly explain the logic behind several debatable points.

The revised version could be accepted for publishing.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

The paper describes the setup and conduct of an experiment in the area of Geographic Knowledge Graphs and focuses on the normalization of attributes. The approach is based on various standard technologies and illustrates how these can be employed to produce improved results in terms of accuracy.

 

In general, the paper must be improved in the following areas, before it can be approved for publication:

 

  1. Structure

 

The paper does not have a clear structure in terms of introduction, state of the art, modeling, implementation, and experiment. The authors have to make clear what the novelty of their approach is. The current structure mixes the approach with a survey of existing algorithms, their advantages and disadvantages, and the introduction of formulae, which are not clearly distinguished as existing or newly modeled. Furthermore, in many sections that (presumably) describe the novel approach of the authors, past tense is used, which leads to the conclusion that this approach has been described somewhere else and the current paper just summarizes other work. Therefore, a revised document structure is recommended, which clearly outlines

  • The current state of the art (introduction of all the employed algorithms and approaches) including the already given advantages and disadvantages in form of a discussion
  • The modeling and novelty of the presented approach (what is the unique research contribution of the presented solution).
  • How is the presented approach implemented (currently missing completely)
  • And the corresponding experiments

 

Independent from these general comments, the current structure is not consistent and does not fit to the content of the corresponding sections. Examples:

  • Section 2.1. “Problem Definition” does not define any problem. It merely gives some general definitions.
  • Sections 3.2.1 and 3.2.2. seem to be independent from section 3.2
  • All sections are missing an introduction and an explanation of what the section is about and how it is structured (e.g., section 4, which immediately starts with 4.1. without further explanations)
  • Several Figures and Tables are not referenced in the text (e.g. Figure 1, Figure 2, Table 1). Table 3 is missing completely.
  • Each section should have a short summary, which outlines the most important points or discusses them.

 

  2. Language

 

Most of the paper is written in past tense. However, if there is a unique and novel approach, which should be published in this paper, present tense should be used. In this case, the authors should not say “we proposed” (e.g., as in line 95) – it would be better to use “here, we propose”, which clearly indicates the scope of the paper. This should be reworked throughout the whole paper.

 

Furthermore, a spell check is highly recommended, e.g.

  • Line 13 “en cyclipedic”
  • Line 39 “service” -> “services”
  • Line 45 “service” -> “services”
  • Line 46 “graph” -> “graphs”
  • Line 59 “LOD” -> please provide description and source of this abbreviation
  • Lines 70, 72, 73, 79: blanks after the citation. “[17]et al” -> “[17] et al”

 

In some sections, the used language could be improved, as well:

  • Lines 49-55: this single sentence is very complicated and does not make sense. Splitting it, would make it easier to understand and to outline the point of the authors
  • Lines 124/125: too many sub-sentences. Remove “respectively, in order”
  • Line 161: “to find out…” is not a scientific approach or language style
  • Line 165: why is the problem “urgent”?
  • Line 192: “the rules”. Which ones?
  • Line 233: the sentence “Initially, each node is independently a class” does not fit here. First, explain the two parts mentioned before, then check, if this definition fits somewhere.
  • Line 248: does this paper present an idea or an algorithm? Please provide this section in a more algorithmic or formal way.
  • Line 271: “In this paper, we only need to analyze…”. Does that mean, the rest of the algorithm has been presented somewhere else (please provide reference).
  • Line 283: “only the frequency of … is the most important”: improve the language here
  • Line 284: what does the term “both” refer to?
  • Line 358: “we learned…”: how, why, what conclusion?
  • Line 421: “it was compared…”: what was compared? The “effectiveness”, the “superiority”, the “experiment”?
  • Line 431/432: “which is associated with a high detection rate but a low detection rate” – makes no sense
  • Line 435: “it can be found” -> “it can be assumed” / “it can be stated” / “it can be calculated”?
  • Line 438: “The low accuracy rate”: which one? Not mentioned before.
  • Line 452: “researchers at home”: what should that mean?

 

 

  3. Presentation

 

The presentation (printing, drawing, etc.) has to be improved, as well in the following points:

  • the used formulae are sometimes printed top aligned (e.g., lines 134, 385)
  • Figure 2 is not self-explaining enough. A detailed description of the colour coding, its semantics and the difference between 2 (a), (b), and (c) has to be given in the text. Furthermore, Figure 2 is nowhere referenced.
  • Figure 3 needs further explanation.
  • Lines 237-241: why is this formula in here? It does not fit to the text above, nor explains the following sections or is used somewhere else. I also think, that there are mathematical flaws in it, but to confirm this, further explanation of the goal of this formula should be given. E.g., the definition (and language) in line 241 seems to be overcomplicated to represent 0 and 1.
  • Line 306, Figure 4: the authors say, “a concrete example is shown in Figure 4”. However, in Figure 4, I cannot see a concrete example for the above mentioned section. Further explanation or a rework of the section is required.
  • Line 317: “for the following four points”: itemize or number them and have them starting in new lines
  • Line 351: “the experiments show”… how and where?
  • Line 404: where does this “Note” belong to? Add a reference either in the table or in the explaining text
  • Line 425: implementation details of the experiment would be required to understand the numbers in the table.

 

 

 

 

  4. Content / Formal specification

 

In terms of the formal modeling of this paper, the following points have to be reworked:

  • Line 127: the triple <e, p, l> is introduced, but not explained or further used. The variable “e” is introduced, but what is “p” and “l”?
  • Lines 128-132: Sets E, C, and G are introduced without further information. Why is it called “E”, “C”, “G”? What is it used for? Only the experiment refers to these terms, but no further modeling or algorithmic explanation is given.
  • Lines 128-132: the formalization is not clearly explained. Example: why is G the set of P1…Pm, while E and C go from 1…n. It is nowhere stated, that m < n or m<=n.
  • Line 133: what is dataset A? It has not yet been introduced. It would be better to introduce every needed prerequisite in a state-of-the-art section and then focus on the modeling of the novel algorithm in this section
  • Line 166: “three main methods”…. But the following subsection only describes advantages and disadvantages of two methods. Where is the third one? Maybe subsections or items for each method or an advantages / disadvantages table might be helpful
  • Lines 249-257: please introduce all variables and their meanings. What is N? What is V1…Vn, etc. before explaining the overall algorithm. Some of these values are never introduced in the paper, therefore a review of the algorithm cannot be given.
  • Line 275: provide a complete formal definition. Do not mix mathematical formulae with languages, where symbols (for and, or conclusions) exist. Avoid misunderstandings by clearly using math. Or remove the formula and provide pseudo code or algorithmic diagrams instead
  • Line 288: why is a sub_i function introduced, what is f_i for? Why are they written in brackets? What is “n” on top of the sum?
  • Line 289: the “average word frequency” should be introduced, maybe by referring to the TFIDF algorithm
  • What should the formula in Line 291 mean? That’s a quite untypical notation
  • Line 293 requires further explanations. What is i, why is it 1,2?
  • Line 297: where does the score come from? Why is it score1 + score2?
  • Table 1: the rules have to be reworked completely. That’s neither math, nor algorithm, nor language. Decide for one thing. But these formulae are wrong. For example the formula “if type(p_i) != type(p_j) then p_i != p_j” could be formalized as “type(p_i) != type(p_j) => p_i != p_j”
  • Line 367: “a threshold value needs to be set”? Why is this? Is this a requirement of the similarity calculation? Is this a requirement for your algorithm? Is this a finding during the experiments?
  • Line 378: “The optimal threshold value = 0.75”. Why is this? Did you measure this? Is there a calculation? Or by experiment? Why not 0.76 or 0.74? Is this really “optimal” or just “a good selected threshold value according to our evaluation”?
  • Lines 385-390: it would be recommended to use the typical variable names for true positives, false negatives, etc. when introducing Precision and Recall experiments. Also, this should be done in a state-of-the-art section
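
The last two points above (the choice of threshold and the Precision/Recall notation) can be illustrated with a short sketch. It is not taken from the manuscript: the function names, the toy similarity scores and labels, and the use of F1 as the selection criterion are all assumptions made for illustration. It computes Precision and Recall from the conventional counts TP, FP and FN, and sweeps candidate thresholds so that a value such as 0.75 could be reported as "the best among the values we evaluated" rather than as "optimal".

```python
from typing import List, Sequence, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_at_threshold(scores: Sequence[float],
                    labels: Sequence[bool],
                    threshold: float) -> float:
    """F1 when pairs with similarity >= threshold are predicted as matches."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    p, r = precision_recall(tp, fp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def sweep(scores: Sequence[float],
          labels: Sequence[bool],
          thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (threshold, F1) for every candidate threshold."""
    return [(t, f1_at_threshold(scores, labels, t)) for t in thresholds]


if __name__ == "__main__":
    # Hypothetical similarity scores and manually annotated labels.
    toy_scores = [0.9, 0.8, 0.7, 0.6]
    toy_labels = [True, True, False, True]
    for t, f1 in sweep(toy_scores, toy_labels, [0.5, 0.74, 0.75, 0.76]):
        print(f"threshold={t:.2f}  F1={f1:.3f}")
```

Reporting the full sweep, rather than a single value, would answer the question of why neighbouring thresholds were not chosen.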

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper discusses the normalization of Geographic Knowledge Graph Attributes. The paper deals with a problem that has been commonly dealt with before (attribute matching in Knowledge Graphs (KGs)) but not specifically in Geographic KGs. In this work, there is also no particular demonstration of how the geographic part is taken into account. It is of course present since these are the data used for the experimental evaluation, but it is not clear if the method is specifically using any geographic information. Additionally, it is not clear who defines which entity in the KG should be part of this investigation.

Additionally, the paper is lacking any comparison with other works, so we cannot easily conclude that this method works better than other existing methods.

So, overall the paper is lacking a clear research contribution and the proper placement along similar efforts in the literature and thus it is difficult to recommend it for publishing in its current form.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The addressed topic is of utter importance.

Unfortunately, it is very difficult to assess the paper's contribution because the experiment does not present the attribute harmonization but rather, if I understood correctly, the exploitation of the harmonization. Besides, it is not highlighted enough how the work is specific to geographic information (suggestion: move lines 214 to 223 to section 2). Lastly, the writing is in general not very legible and not informative enough.

As a general comment, please revise the writing of the references, as many of them are not compliant with the MDPI standard. Another general comment: the titles are very generic, and using more informative titles could help the reader.

The 1st paragraph of the introduction does not state clearly why KGs are so important, nor the distinction between a KG and a DB. What does “the intelligent transformation from “data-information-knowledge-wisdom”” mean? The core scientific issue of geography scholars is not to express, organize and store geographic knowledge scientifically but rather to produce that knowledge. “knowledge service represented by geographic knowledge graphs”: what does it mean? The references to geographical knowledge graphs (6, 7 and 8) are from the same authors, and it would be nice to be more representative of the field.

The 2nd paragraph of the introduction: what is the Wikipedia knowledge base, do you mean DBpedia or Wikidata? What do you mean by “structured information of attributes”? Also, in the paper you refer to “community” and “modularity”, and it would be nice to introduce these notions somewhere.
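
For reference, one commonly used definition of modularity for a partition of an undirected graph into communities is given below. This is the standard textbook form and is an assumption here; the manuscript's own definition and symbols may differ.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

Here A_{ij} is the adjacency matrix entry for nodes i and j, k_i is the degree of node i, m is the total number of edges, c_i is the community of node i, and \delta(c_i, c_j) equals 1 when the two nodes are in the same community and 0 otherwise.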

3rd paragraph: The presentation of existing works to harmonize attributes should be more detailed. What do you mean by geographic attribute here? Did you also consider the literature on the harmonization of geographical data schemas? This section could use a diagram to illustrate, maybe something close to figure 4.

Last paragraph of the introduction: the first two sentences are not very clear. Maybe write shorter sentences.

Section 2 :

In the definition 2, can word pairs have same meaning and not a high similarity? If not I suggest to remove “high similarity and” from the definition (2). “Same meaning” : is it a transitive property?

In the problem discussion it is difficult to connect the text with the figure; can you make explicit reference to items on figure 1 within the text? This would illustrate the text and would be very helpful. <e,p,l>: is it “a” triple or “the” triple? The notions of “class” and “word vector” could also be introduced here as they will be used further in the paper. Word vectors are introduced very briefly later on in the paper and with no example.

In Question 1 : you write that two properties are clustered into 1 new class, do you rather mean their domains?

The distinction between sections 2 and 3 is not always understandable. Some sentences from section 3 seem rather to be definitions and discussions, for example lines 214 to 223 or lines 265 to 275.

Figure 3 is too small and it does not need colour.

Line 251 : please specify how the adjustment is performed. What does “p” mean in rule 1 and 2?

Figures in Table 1 are difficult to read.

Line 318, I did not understand the points and what is meant by a single entity, do you mean single member?

4 Experiment : The whole paper was about attributes, yet in the presented dataset you do not mention attributes; are they in the metadata? What do you mean by ternary data points, and by ternary data? It would be appropriate to show some data sample there.

4.3.1: what are positive and negative samples here?

Line 445 : can you tell more about the location attribute? Do you mean the geometry, and if so, which similarity measure did you use?

Line 452 : what do you mean by “at home and abroad”?

In the conclusion you may also discuss the interpretation of the Near-synonyms. It is a valuable asset to make connection between statements which do not use exactly the same attributes.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Dear authors,

thank you very much for your revised submission and the detailed comments on my previous review. I really appreciate your effort, and the provided answers and corrections improved the paper a lot.

From my perspective, only a minor revision of the following points is required to approve your submission: 

1. Introduction:
- I have the feeling that lines 47-106 belong to the State of the Art and could be placed into a separate section. You could also name section 1 "Introduction and State of the Art".
- Lines 122-140: use present tense.... "We propose a novel normalization process. The process combines the methods of..."
- Lines 126-128: I think "solves" is the wrong term here. Do you mean "separates" or "detaches"?
- Lines 129-130: how can "semantic information" be trained? You can train a model to work with semantic information, but the information itself is only information. Also the phrase "for the geographical data to best fit it" is a bit strange. Better would be something like "trained to provide the best match for geographical data".
- Line 133: "found" > "discovered"
- Line 134: "corresponding algorithm process" > either "corresponding algorithmic process", or "corresponding algorithm"

2. Related Work:
- please use present tense, as well. You can't say "in this section, we introduced" at the beginning of a section. 
- Line 152: "... as follows." > "... as follows:" and itemize the (1), (2), and (3) 
- Line 163: "Community[26]"  > "Community: a community is a subgraph.... [26]". Place citations at the end of the sentence
- Line 167 (see Line 163)
- Line 168: the phrase "a relatively good result" needs some more scientific explanation, for example by saying, "This means, that the result has a high similarity of nodes...."
- Line 205 (see Line 163)
- Line 207, 209: seems to be a copy & paste error. "The whole algorithm is divided into two parts". Rephrase this paragraph
- Line 213: "clustering, where we need to find a way" > "clustering to find a way"
- Line 216: as this is the end of the section, using past tense would be appropriate "this section provided..."

3. Methods and Models:
- Line 221: "then, sections 3.1 and 3.2 introduce..."
- Line 242 "t result" > "result"
- Line 250: "be a triple corresponding to entity..." > "be such a triple. Then, 'e' is the corresponding entity, 'p' represents the attribute, and 'l' stands for the attribute value"
- Line 261: the label of Figure 2 is not under the Figure, but in Line 262. Please revise the formatting
- Line 288: the term "later" can be removed
- Line 288: "choice of" > "choice for"
- Line 292: "too small(it ..." > "too small ( it ..."
- Lines 285-299: could be rephrased to make the point clearer
- Lines 209 and 311 both start with "The idea of the algorithm is roughly as follows" > please revise this paragraph
- Line 321: "In summary the algorithm formula is summarized as" > use only one "summary" phrase
- Line 334: use present tense
- Lines 382ff: the paragraph is printed at 100% page width. Please correct the format
- Line 460: "... weak relevance, they have no value in the research,so" > "... weak relevance and thus no value in the research. Therefore, a threshold..."
- Line 465: use present tense
- Line 472: "figure5" > "Figure 5"

Some general notes: 
- in many cases, the English language and sentence flow could be improved.
- the use of present tense / past tense should be harmonized through the document
- sections, subsections, bold, and italic use can be harmonized, as well

I'm looking forward to a revised version.

Thanks in advance.

Reviewer 2 Report

The paper introduces a method for (Knowledge) Graph Attribute normalization, based on semantics and co-occurrences of words.

The paper still suffers from the problem mentioned in the first round of the review: the method (while it has its own merit) is not particularly tailored to, or in any other way taking into account, the geographic information part. In my view, there is a strong difference between integrating the geo-information in the solution and applying the solution to solve a problem that could appear in geodata, as well as in other data.

In that sense, my recommendation remains the same: to reject the paper, mainly on the grounds of its suitability for the journal. The merit and contributions of the method can also be under discussion, since I consider them rather marginal overall, but this is not to say that there is nothing new in the presented work. My suggestion would be that the authors look for another journal to submit their method or clearly demonstrate the difference between applying the method on geographic KGs and "regular" KGs. Does it really work well only on geographic KGs because it takes into account the underlying geoinformation?

Based on this, I cannot recommend the acceptance of the paper.

Reviewer 3 Report

Thank you for considering my first comments; I noticed a real improvement in the paper. Unfortunately, the quality of the form is still too low for publication: some sentences are lacking verbs or words, some expressions (already highlighted in my review) still do not make sense, and some figures are still illegible, as I already pointed out. I recommend you take more time to prepare a paper. If it was the first submission I would go for a major revision, but as it is a resubmission after a major revision I really suggest you take more time to work on the form.

Some comments here below : 

Line 69 : these are not only crowdsourced; if you see Geonames, for example, it is an aggregate of different sources but not much from the crowd

Height and altitude are not synonyms : altitude is related to the sea level and height to the ground level. 

You qualify the other approach as empirical but this is not the main distinction with your approach as you annotate positive and negative samples manually. Maybe you may highlight the scope of data used to run the Word2Vec algorithm (or similar) in other approaches. 

It would be too long to highlight all the syntactic issues. "How to express, organize, xxx" is not a quotation. When you introduce community and modularity you need to write a sentence to explain why you give these definitions. Also "evaluation index" is not introduced.

When I suggested to have more informative titles it was meant for the title level 2 and 3. In your paper these titles are missing information and there are long sections without title.  
