Peer-Review Record

A Cluster-Based Machine Learning Ensemble Approach for Geospatial Data: Estimation of Health Insurance Status in Missouri

ISPRS Int. J. Geo-Inf. 2019, 8(1), 13; https://doi.org/10.3390/ijgi8010013

by Erik Mueller¹,*, J. S. Onésimo Sandoval², Srikanth Mudigonda³ and Michael Elliott⁴

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Reviewer 4: Anonymous

Reviewer 5: Anonymous

Submission received: 12 October 2018 / Revised: 19 December 2018 / Accepted: 20 December 2018 / Published: 28 December 2018

Round 1

Reviewer 1 Report

This work presents a classification ensemble approach to predict/estimate health insurance status. While the presented techniques seem sound, this paper lacks structure and is hard to follow. Consequently, the intellectual merit of this manuscript is difficult to judge. Furthermore, the experimental evaluation is very unclear, leaving many open questions that need to be clarified.

Major Weaknesses that need to be addressed:

- The paper needs more structure. A clear problem definition is missing. The paper elaborates how it solves a problem, but it remains largely unclear what that problem is.

- The experiments require a more detailed description. In the current manuscript, details such as the evaluation metrics are undefined, leaving this section inconclusive. See the detailed comments below.

Detailed comments:

The introduction requires a clear problem definition. It explains, in detail, how “it” is being done, without elaborating what “it” is. The abstract does not help to understand the problem either, as health and health insurance are not mentioned in the introduction.

The concentration Equation is not understandable. What are the x_i? What is the range of the sum? What is the range of the max function?

The authors need to formally define the problem that they are solving. This requires establishing a formal notation for the aspects of the problem being solved, including a formal definition of the dependent and independent variables. The same applies to the Dispersion equation.

In the experiments, the authors use Mean Squared Error (MSE) to evaluate their algorithms. However, it is not clear what MSE is/does in this scenario. MSE should be properly defined.
Also, as this is a classification problem, it appears more appropriate to use precision, recall, and F-measure as evaluation metrics. The authors need to revise this manuscript to clarify what they are learning/modeling/predicting, in order to give the reader an idea of the quality of an MSE of “0.499”. Without any formal definition of the problem, of the algorithms, and of the experimental evaluation, it is impossible to assess the merit of this work. If I had to guess what the authors are doing in the evaluation, then I’d say that they are looking at a binary classification problem (Insurance YES/NO). I’m guessing that the authors dummy code this dependent variable as 1/0.

If this is the case, then an MSE of 0.499 is terrible and only marginally better than random guessing (which would yield an error of 1 half of the time; with a squared error of 1^2 = 1, the expected squared error of random guessing is 0.5).
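The random-guessing baseline in the preceding paragraph can be checked with a short calculation. The following is a minimal sketch, assuming (as the reviewer guesses) that insurance status is dummy coded as 0/1 and that each guess is a fair coin flip. The function names and the `p_positive` parameter are hypothetical and do not come from the manuscript.

```python
def expected_mse_random_guess(p_positive: float) -> float:
    """Expected squared error when a fair coin guesses a 0/1 label.

    p_positive is the (hypothetical) share of observations labeled 1.
    """
    expected = 0.0
    for label, p_label in ((1, p_positive), (0, 1.0 - p_positive)):
        for guess in (0, 1):
            # Each guess has probability 1/2, independent of the label.
            expected += p_label * 0.5 * (label - guess) ** 2
    return expected


def mse(labels, predictions):
    """Mean squared error between two equal-length sequences."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)
```

Under this reading, the baseline of 0.5 does not depend on the share of insured observations, because a fair coin is wrong half of the time for either label. If the dependent variable is instead a continuous rate, this baseline does not apply.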

The experiments in Table 4 seem to imply that the proposed ensemble approach performs (much) worse than simply using SVMs instead of the ensemble. This appears to be intuitive, since all the proposed classifiers appear to yield nearly no information, as they have a classification accuracy only marginally larger than random guessing. To me, it appears that the experiments conclude that the proposed approach is not working.
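For reference, the precision, recall, and F-measure suggested earlier in this report can be computed from binary labels as sketched below. This is an illustration only, assuming 0/1 labels and hard 0/1 predictions; the function name is hypothetical and the manuscript's own evaluation code is not part of this record.

```python
def precision_recall_f1(labels, predictions):
    """Return (precision, recall, F1) for 0/1 labels and 0/1 predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Unlike MSE on dummy-coded labels, these metrics separate the two kinds of classification error, which makes a result easier to compare against a guessing baseline.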

The purpose of Table 7 is not clear. It appears to evaluate different independent variables, rather than different learning algorithms.

In Table 8, it is not clear what the two Cluster Ensembles are. Why do they use different independent variables? Shouldn’t all ensembles use all variables?

In Table 10, the authors interpret the highest model coefficients as the most important independent variables. This may not be true. The purpose of Table 10 is unclear.

Minor comments:

- In the Introduction: Assumption 3 is not clear, what is the meaning of “Outside the development environment”?

- “ACS” should be defined as an abbreviation.

- Line 152: Rephrase: “Moving to THE selection OF base learners”

- Line 163: “by each learner category”

- Line 206: The comment “which really isn’t aggregation at all” should be removed. The min-function is an aggregation function, too. Thus, taking the minimum is absolutely an aggregation.

At the beginning of Section 3, the authors refer to “the above table”. But there is no table above. The authors should employ proper referencing and labeling.

This is very trivial to do in LaTeX.

Author Response

Please see attached letter.

Author Response File: Author Response.docx

Reviewer 2 Report

First of all, the paper's presentation is not well organized: there are too many excessive paragraphs, and the core content or focus is lacking. More importantly, the issues this research tries to address and the contributions of this paper are not clearly illustrated or presented.

For readability, the authors are strongly recommended to incorporate more toy examples or flow charts to illustrate their approach/architecture/framework and to use figures/charts to clearly present the results and findings. Since this research is a geographical study, plotting the results geographically is essential.

The novelty of this research is barely acknowledged. All the approaches, ML models, and statistical models used in this paper are general methods. Besides, comparing the results of the regression models with clustering models or even decision tree models does not make sense; in particular, the mixed usage of aggregated variables, such as counts, and spatial variables, such as distances or locations, might cause confusion.

Aside from the above concerns, the design of the experiment also needs to be improved.

First, the evaluation metrics are somewhat confusing: sometimes MSE is used and sometimes the residual is used to evaluate the models.

These two evaluation measures should not be mixed, since one measures numerical error and the other measures geographical distance.

As a result, the conclusions the authors draw need more explanation of how the proposed method is better than the others.

Thus, a rejection is suggested based on above concerns.

Author Response

Please see attached letter.

Author Response File: Author Response.docx

Reviewer 3 Report

The paper is technically correct and with content relevant to the journal readers.

Generally, this paper is very well written and clearly states the problem. However, the related work section covers only part of the existing literature. A significant portion of the presented research relates only to the work of Trivedi et al.

Also, there are a lot of similar references with no significant importance for this paper (for example, references 10-12 and 13-16).

The second part of the paper presents the authors' research results. Obviously, a lot of effort has been invested in the research and realization of this study.

However, in the first part of the paper the authors note that this paper “introduces an alternative approach to ensemble modeling…”, while the second part of the paper does not show such details. The authors also state “The resulting workflow not only outperforms independent base learners and traditional ensemble models, but also preserves inferential capability by manipulating the cluster parameters …”. There are no details about such a workflow. The second part of the paper does not contain enough information to confirm these claims.

Author Response

Please see attached letter.

Author Response File: Author Response.docx

Reviewer 4 Report

This paper presents an ensemble-based machine learning approach for estimating health insurance status. The study evaluates three different ensemble approaches with various base learners, such as SVM. In general, the manuscript is well organized and the study has some merit. However, the remarks below should be clearly addressed.

1. Line 101, “uses more recent machine learning regression techniques such as support vector machines, random forest modeling, lasso and ridge regression, and dimension reduction methods.” Note that these methods are not recent machine learning algorithms; they were developed many years ago.

2. A map showing the study area would be helpful for readers.

3. The proposed framework is based on an ensemble of machine learning algorithms; a diagram showing the proposed concept would be more illustrative.

4. The authors used seven base learners; how were they ensembled? The mathematical aggregation for the globally weighted average approach and the minimum residual approach should be described in mathematical detail.

5. How were the parameters in seven base learners set? Such as the number of components in PCA and the number of trees used in the Decision Tree algorithm? What is the strategy to obtain the optimal parameters?
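To make remark 4 concrete, the two aggregation rules named there could look like the following sketch. This is one plausible reading only: the manuscript's actual formulas are not part of this record, and the function names, the choice of weights, and the use of the absolute residual are all assumptions.

```python
def weighted_average(predictions, weights):
    """Combine one prediction per learner using global weights.

    predictions: per-learner predictions for a single observation.
    weights: non-negative per-learner weights (hypothetical choice,
             for example the inverse of each learner's validation MSE).
    """
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


def minimum_residual(predictions, observed):
    """Return the learner prediction closest to the observed value."""
    return min(predictions, key=lambda p: abs(observed - p))
```

As written here, the minimum residual rule needs the observed value, so it can only be applied where the outcome is known; how it would generalize to unseen data is exactly the kind of detail the remark asks the authors to spell out.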

Minor comments:

1. Line 489, the “GAM” should be abbreviated in here? The same for “PCR” in line 308.

2. Line 504, “wass” should be corrected as “was”

3. Line 416, the word “aspatial” is incorrect.

Author Response

Please see attached letter.

Author Response File: Author Response.docx

Reviewer 5 Report

This is a review for the manuscript ijgi-379491 titled “A cluster-based machine learning ensemble approach for geospatial data: Estimation of health insurance status in Missouri” submitted to the “ISPRS International Journal of Geoinformation”. The manuscript describes a method to improve variable predictions using an ensemble of machine learning models. In the proposed approach, the authors use an unsupervised clustering algorithm to subdivide a dataset into several subsets and then train multiple learners to minimize the overall error.
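The cluster-then-learn idea summarized above can be sketched as follows. This is a simplified illustration, assuming that cluster labels and per-learner predictions are already available and that the learner with the lowest mean squared error is chosen within each cluster; the function name and this selection rule are assumptions, not the authors' documented procedure.

```python
from collections import defaultdict


def best_learner_per_cluster(clusters, observed, learner_predictions):
    """Pick, for each cluster, the learner with the lowest mean squared error.

    clusters: cluster label per observation (from any unsupervised method).
    observed: observed outcome per observation.
    learner_predictions: dict mapping a learner name to its predictions.
    """
    # squared_errors[cluster][learner] -> list of squared errors
    squared_errors = defaultdict(lambda: defaultdict(list))
    for name, preds in learner_predictions.items():
        for c, y, p in zip(clusters, observed, preds):
            squared_errors[c][name].append((y - p) ** 2)

    best = {}
    for c, per_learner in squared_errors.items():
        best[c] = min(
            per_learner,
            key=lambda n: sum(per_learner[n]) / len(per_learner[n]),
        )
    return best
```

A call such as `best_learner_per_cluster(labels, y, {"svm": svm_preds, "ridge": ridge_preds})` (hypothetical names) returns a mapping from each cluster label to the name of its selected learner.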

The proposed approach is novel and interesting for the purpose of improving the performance and computational efficiency of machine learning. However, the manuscript in its current form cannot be recommended for publication due to multiple methodological and presentation concerns, outlined below:

The authors do not provide enough information to understand the input data. Only variable categories are listed (Table 1), with minimal details about the actual variables used. It is very hard to judge the validity of the analysis without this information. Variable names with concise descriptions and exact references to the data sources are needed, along with descriptive statistics for the most important variables.

Same problem with the geographic background of the study: provide a map of the regions showing the geographic units that you are working with.

Line 131: describe the algorithm or formula behind “manually calculated” variables.

Table 2: Each learner has to be provided with a reference to its detailed description in literature and implementation that was used in the study. Also it would be beneficial to provide a short description for some of the learners as they may not be familiar to most of the readers (if space permits).

Show the clusters resulting from the unsupervised classification on the map.

K-Fold cross-validation needs more detailed information on how it was performed and why it would prevent overfitting. In addition, a general discussion of overfitting for ensembles and for this regional case is missing but necessary to understand the limitations of the approach.
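As a point of reference for this comment, a generic K-fold split can be sketched as below. It illustrates the standard technique only and assumes a simple shuffled split with a fixed seed; the function name and parameters are hypothetical, and how the authors actually partitioned their data is not stated in this record.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training set: every fold except the held-out one.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each observation is held out exactly once, so the averaged held-out error is an estimate of out-of-sample error. That helps to detect overfitting; whether it prevents it depends on how the folds are used during model selection.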

For the metrics and evaluation: explain why you did not use 80%/20% approach for splitting the data into training/testing.
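The 80%/20% hold-out approach mentioned here is the simpler alternative to cross-validation. A minimal sketch follows, assuming a single random shuffle; the function name, default fraction, and seed are illustrative choices rather than anything taken from the manuscript.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) for one random hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    # The first n_test shuffled indices form the test set.
    return indices[n_test:], indices[:n_test]
```

A single split is cheaper than K-fold cross-validation, but its error estimate rests on one partition of the data, which is one reason an author might prefer one over the other.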

Concentration and dispersion (lines 268 and 269) need much more explanation with the details on what they do and why they are expected to work for what you are using them for. Provide references to the literature.

If possible make the dataset and the code used in the study available for the reviewers to reproduce the results and test some other hypotheses.

The style of the manuscript needs lots of editorial work: the language of the paper is hard to comprehend because of the overly long sentences. The paper needs significant style editing. Any sentence that is close to or longer than 20 words needs attention.

There are a number of minor corrections that will be needed in the final version: not all abbreviations are spelled out, and there are several instances of improper capitalization in citations (search for et. al.).

Author Response

Please see attached letter.

Author Response File: Author Response.docx

Round 2

Reviewer 3 Report

In my opinion, my suggestions and comments are well-addressed.

Author Response

Please see attached .pdf for responses for all reviewers.

Author Response File: Author Response.pdf

Reviewer 4 Report

My comments are well addressed. The manuscript can be accepted.

Author Response

Please see attached .pdf for responses for all reviewers.

Author Response File: Author Response.pdf

Reviewer 5 Report

This is the second review for the manuscript ijgi-379491 titled “A cluster-based machine learning ensemble approach for geospatial data: Estimation of health insurance status in Missouri” (revised) submitted to the “ISPRS International Journal of Geoinformation”. The manuscript was significantly improved and the authors have addressed all issues brought up in the review. Some new material has been added to the paper, but several more fixes are needed in the final version of the paper:

Table 1: most of the values in the column “Min” are zeroes. My understanding is that these zeroes mostly come from block groups with no population. How many block groups like that do you have? Did you include them in the analysis? How does the presence of such block groups affect the results?

Figure 2: the addition of the study area map is very helpful, but the map itself needs some improvements:

The legend is too small to be readable when printed on paper

Place 1-3 major cities on the map for readers to easily grasp the location

Most readers of this journal are from outside of the U.S. To help them locate your study area add a very small inset showing location of Missouri in the U.S. like this: https://en.wikipedia.org/wiki/Missouri#/media/File:Missouri_in_United_States.svg (If space permits).

Table 7: In cluster 3 the mean hospital distance is negative (-0.592). Is it a typo or is it a more serious problem?

Minor corrections:

Line 234: missing citation reference

Line 430: you do not need “as” or rephrase

Author Response

Please see attached .pdf for responses for all reviewers.

Author Response File: Author Response.pdf
