Open Access Article

Peer-Review Record

Mutual Information Input Selector and Probabilistic Machine Learning Utilisation for Air Pollution Proxies

Appl. Sci. 2019, 9(20), 4475; https://doi.org/10.3390/app9204475

by Martha A. Zaidan^1,*, Lubna Dada^1, Mansour A. Alghamdi^2, Hisham Al-Jeelani^2, Heikki Lihavainen^3,4, Antti Hyvärinen^3 and Tareq Hussein^1,5,*

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 4 September 2019 / Revised: 10 October 2019 / Accepted: 16 October 2019 / Published: 22 October 2019

(This article belongs to the Special Issue Air Quality Prediction Based on Machine Learning Algorithms)

Round 1

Reviewer 1 Report

General Comments:

This paper presents a method for developing a proxy air pollutant model, whereby the concentration of an unmeasured pollutant can be estimated from other variables measured at the same time. Two significant contributions are the use of the mutual information metric to identify informative inputs to the model, and the use of a Bayesian Neural Network scheme both to prevent overfitting and to provide probabilistic rather than point estimates (i.e. to provide confidence intervals associated with model estimates). Overall the paper provides a good overview and demonstration of the approach. I believe a few additional improvements can and should be made.

First, while the use of mutual information to prioritize model inputs is a well-motivated approach whose utility is clearly demonstrated in the paper, I believe it could be used even more effectively in the input selection process. In particular, the text (Section 4.2) indicates that inputs are included in descending order of their mutual information with the Ozone concentration. However, to minimize information redundancy, inputs should be chosen which have high mutual information with the ozone but low mutual information with the inputs already included in the model. This could be left as a topic for future work, but I believe it is a worthwhile improvement to the method and should be discussed in the text.
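The redundancy-aware selection suggested above can be sketched as a greedy procedure: at each step, pick the candidate whose mutual information with the target, minus its mean mutual information with the inputs already chosen, is largest. This is only an illustration under stated assumptions, not the authors' method: the histogram estimator, the bin count, the scoring rule, the function names `mutual_information` and `select_inputs`, and the synthetic variables `a`, `a_copy`, `b` are all hypothetical.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based estimate of mutual information (in nats) between two 1-D arrays."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def select_inputs(X, y, names, k):
    """Greedy selection: relevance to the target minus mean redundancy with chosen inputs."""
    relevance = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    chosen = [int(np.argmax(relevance))]  # start with the most informative input
    while len(chosen) < k:
        best_j, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            redundancy = np.mean([mutual_information(X[:, j], X[:, c]) for c in chosen])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        chosen.append(best_j)
    return [names[j] for j in chosen]

# Synthetic data (assumption): a_copy nearly duplicates a, b is independent of both.
rng = np.random.default_rng(0)
n = 5000
a = rng.normal(size=n)
a_copy = a + 0.05 * rng.normal(size=n)
b = rng.normal(size=n)
target = 2.0 * a + b
X = np.column_stack([a, a_copy, b])
picked = select_inputs(X, target, ["a", "a_copy", "b"], k=2)
print(picked)
```

In this toy setting, ranking by relevance alone would place `a` and `a_copy` first, even though the second adds almost no new information; the redundancy penalty favours the independent input `b` instead.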

Next, more quantitative analysis should be performed to determine whether the uncertainty estimates provided by the BNN are accurate. For example, this can be done by computing the “Z-score” of the errors (error divided by the standard deviation of the prediction uncertainty) and determining whether these follow a standard normal distribution, as would be expected. In my opinion, the provision of confidence intervals associated with model estimates is a major strength of the Bayesian approach being used, and this capability has not been utilized and emphasized in the paper to the extent which I think it merits.
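A minimal sketch of the Z-score check described above, assuming the model returns a predictive mean and standard deviation for each observation. The function name `zscore_calibration`, the reported summary statistics, and the synthetic data are illustrative assumptions, not results from the paper.

```python
import numpy as np

def zscore_calibration(y_true, y_mean, y_std):
    """Summarise standardised errors; well-calibrated uncertainties give z ~ N(0, 1)."""
    z = (y_true - y_mean) / y_std
    return {
        "z_mean": float(np.mean(z)),                        # expected near 0
        "z_std": float(np.std(z, ddof=1)),                  # expected near 1
        "coverage_68": float(np.mean(np.abs(z) <= 1.0)),    # expected near 0.68
        "coverage_95": float(np.mean(np.abs(z) <= 1.96)),   # expected near 0.95
    }

# Synthetic example (assumption): true values scatter around the predicted mean
# with a standard deviation of 5 units.
rng = np.random.default_rng(1)
n = 20000
y_mean = rng.uniform(20.0, 80.0, size=n)
y_std = np.full(n, 5.0)
y_true = y_mean + 5.0 * rng.normal(size=n)

calibrated = zscore_calibration(y_true, y_mean, y_std)
overconfident = zscore_calibration(y_true, y_mean, y_std / 2.0)  # reported std too small
print(calibrated)
print(overconfident)
```

A standard deviation of the Z-scores well above 1, or interval coverage well below the nominal level, would indicate overconfident uncertainty estimates; the opposite pattern would indicate overly wide intervals.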

The usefulness of the model for forecasting is another area which is not discussed extensively but which is a very promising possibility. For example, mutual information between variables at different times can be analyzed to determine the dynamic relationships between air pollution variables. This is mentioned briefly as a topic for future work in the conclusion, but could be elaborated in more detail.
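The time-lagged analysis mentioned above could look roughly like the following sketch, which estimates the mutual information between one variable at time t and another at time t + lag. The histogram estimator, the function names, and the synthetic series with a built-in delay of three steps are assumptions made for illustration only.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based estimate of mutual information (in nats) between two 1-D arrays."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def lagged_mi(x, y, max_lag):
    """Mutual information between x[t] and y[t + lag] for lag = 0..max_lag."""
    values = []
    for lag in range(max_lag + 1):
        xs = x[: len(x) - lag]            # x at time t
        ys = y[lag:]                      # y at time t + lag
        values.append(mutual_information(xs, ys))
    return np.array(values)

# Synthetic series (assumption): y responds to x with a delay of 3 time steps.
rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
y = np.roll(x, 3) + 0.3 * rng.normal(size=n)

mi_by_lag = lagged_mi(x, y, max_lag=6)
best_lag = int(np.argmax(mi_by_lag))
print(mi_by_lag.round(3), best_lag)
```

The lag at which the estimate peaks suggests how far ahead one variable carries information about the other, which is the kind of relationship a forecasting model would need to exploit.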

In terms of practical applications, I am concerned that this approach will learn a site-specific relationship between the inputs and the modeled pollutant which may not generalize to other locations. For example, use of wind speed as an input may be very sensitive to particular site conditions. This is briefly alluded to in lines 325-328, but I believe a more complete discussion of the issue is warranted in the conclusion, or even a complete analysis of the model’s generalizability if there is sufficient time/data to do so. A possible practical solution I would propose is that site-specific proxy models be trained during a short-term full-scale deployment of multiple air quality sensors to a site, and then be used with a small suite of sensors (possibly low-cost sensors) left running at the site for long periods. The proxy model should also be periodically updated, as changes in pollutant sources and climate may alter the basic atmospheric conditions from those under which the proxy was first trained. This problem of generalization is common to all data-driven approaches.

Finally, I would suggest some further English grammar editing before final publication. I have noted down some changes which should be made in my specific comments below, but a more extensive final check should also be made.

Specific Comments:

Lines 16-23: I would suggest adding a brief description of Ozone pollution here, as it is what you are estimating in this paper.

Lines 33-40: I would also consider mentioning low-cost air quality sensor networks, chemical transport simulation models, and satellite data as additional possible solutions to fill in missing ground-based measurements and expand the coverage of the traditional monitoring networks you have described. Each of these alternatives also has pros and cons associated with it, and your proposed method could be used to complement these as well.

Line 46: “well-attention” should be replaced with “much attention” or a similar phrase.

Lines 70-71: “using on an” should be “using an”.

Lines 73-74: It is unclear what “The finding through automatic input selector” refers to.

Line 91: “Linear” should be “linearly”.

Lines 118-137: It is not clear from this section how the BNN allows for probabilistic predictions, i.e., how the confidence intervals in predictions can be derived. This should be specifically described in this section.

Line 186: “to every measured variables” should be “on every measured variable”.

Figure 8: It is not clear how the performance is being evaluated here, i.e., is this the performance on the training data or on a separate testing data set.

Line 238: “very less” should be “much fewer”.

Line 239: “is” should be “are”.

Line 251: “for now onward” should be “from now on”.

Lines 278-279: It is not that you are unable to plot the data, but that you have chosen not to do so to improve the visual clarity.

Lines 308-324: Low-cost sensors are discussed extensively here without being introduced earlier. Referring back to my previous comment, it may be a good idea to move all or part of this discussion to the introduction section. The use of machine learning methods to calibrate low-cost sensors is also an extensive research topic to which this work can be related.

Line 334: “the proxy based a” should be “the proxy based on a”.

Line 349: Here it is stated that four input variables including TNX are used, while earlier (line 273) it was stated that results would be presented for models using only three of these inputs (excluding TNX).

Line 353: While in principle the method is applicable to any pollutant, in practice, could you comment on any potential difficulties when applying the method to pollutants which do not have the same strong diurnal patterns as Ozone?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The manuscript proposes a protocol to build a proxy of Ozone concentration by combining a mutual information approach to select the best predictors with a machine-learning algorithm to make predictions. This protocol is applied to the case study of Jeddah, Saudi Arabia. Overall the topic is interesting and fits the scope of Applied Sciences. The purpose of the study is clearly stated. The methodology is robust, though some further detail on its implementation is required. In my opinion, a more detailed specification of the Bayesian Network, beyond the reference to the specific literature, would make the paper easier to follow and draw more readers to this interesting topic. Consequently, several issues that need to be addressed are highlighted below.

Introduction: the authors should declare here the software used to perform the BNN analysis and give credit to it in the bibliography. Moreover, the paper outline should be added at the end of the section.

Par 2.2: the authors should describe the Bayesian configuration of the model and define the link between matrix W and the posterior distribution. Moreover, they should illustrate how model uncertainty is obtained.

Case study sections: I suggest the authors add a table with the summary statistics of variables involved in the analysis as well as some details on data preparation. For example, is the dataset centered and normalized?

Case study sections: the authors should give more details about the model set-up such as the dropout probability to initialize the model, the number of iterations, the burn-in, etc.

Case study sections: the time dynamics of the O3 proxy are well reproduced. There is an unavoidable underestimation of the highest concentration values, which are nevertheless well covered by the results of the uncertainty modeling. I therefore strongly suggest including a predictive performance analysis of these highest values only.
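One way such a peak-only evaluation might be set up is to restrict the error metrics to observations at or above a chosen quantile of the measured concentrations. The sketch below is an assumption-laden illustration: the 90th-percentile threshold, the function name `peak_metrics`, and the toy data are hypothetical and not taken from the manuscript.

```python
import numpy as np

def peak_metrics(y_true, y_pred, quantile=0.9):
    """Bias, MAE and RMSE computed only where the observation is at or above a quantile."""
    threshold = np.quantile(y_true, quantile)
    mask = y_true >= threshold
    err = y_pred[mask] - y_true[mask]     # negative values mean underestimation
    return {
        "threshold": float(threshold),
        "n": int(mask.sum()),
        "bias": float(err.mean()),
        "mae": float(np.abs(err).mean()),
        "rmse": float(np.sqrt((err ** 2).mean())),
    }

# Toy data (assumption): a proxy that is exact except for a constant
# underestimation of 10 units on the ten highest observations.
y_true = np.arange(100, dtype=float)
y_pred = y_true.copy()
y_pred[y_true >= 90] -= 10.0

peaks = peak_metrics(y_true, y_pred, quantile=0.9)
print(peaks)
```

Reporting the bias alongside MAE and RMSE makes a systematic underestimation of the peaks visible, which metrics averaged over the whole record tend to hide.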

Conclusion: since the most dangerous phenomena are linked to the highest values of O3, an evaluation of the potential of the proposed proxy to capture these types of events could be included in the discussion.

Specific comments.

Introduction, lines 41-43: a brief description of the “physics-based approaches” would be helpful to have a complete framework.

Equation 6: define L in the text.

Par 4.2: consider simplifying the text by referring to the MAE, RMSE, and R2 rather than writing out their formulas.

Figure 10: what do the light blue lines represent?

Lines 294-307: this part is not clear to me. I do not understand how to extract the link between the O3 proxy and NOx from Figure 11, as stated in the text.

Figure 11: how do the authors explain the difference in the decay of concentration values between observed and proxy in the early afternoon? Moreover, I suggest reducing the size of the circles in the plot to facilitate the analysis of the results.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
