Article
Peer-Review Record

Towards Piston Fine Tuning of Segmented Mirrors through Reinforcement Learning

Appl. Sci. 2020, 10(9), 3207; https://doi.org/10.3390/app10093207
by Dailos Guerra-Ramos 1,*, Juan Trujillo-Sevilla 2 and Jose Manuel Rodríguez-Ramos 2,3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 7 April 2020 / Revised: 30 April 2020 / Accepted: 30 April 2020 / Published: 4 May 2020
(This article belongs to the Section Optics and Lasers)

Round 1

Reviewer 1 Report

I guess this is a clearly written paper that can be published in its present form.

Author Response

Author response to reviewer - applsci-782080

 

Manuscript “Towards piston fine tuning of segmented mirrors through reinforcement learning”

We would like to thank the reviewer for the comments that helped us to improve the paper.

The English language and style have been checked.

Reviewer 2 Report

I would say that I miss some information related to large-scale metrology approaches that allow characterizing the position and orientation of each mirror segment. Different techniques, such as photogrammetry or laser tracker technology, are used for the active alignment systems of large telescopes. I suggest that the authors have a look at:

- Mutilba, U.; Kortaberria, G.; Egaña, F.; Yagüe-Fabra, J.A. Relative pointing error verification of the Telescope Mount Assembly subsystem for the Large Synoptic Survey Telescope. In Proceedings of the IEEE International Workshop on Metrology for AeroSpace, Rome, Italy, 20–22 June 2018.

- Rakich, A. Using a laser tracker for active alignment on the Large Binocular Telescope. Proc. SPIE 2012, 8444, 844454.

- Rakich, A.; Dettmann, L.; Leveque, S.; Guisard, S. A 3D metrology system for the GMT. Proc. SPIE 2016, 9906, 990614.


- Gressler, W.J.; Sandwith, S. Active Alignment System for the LSST. In Proceedings of the Coordinate Metrology Systems Conference (CMSC), Orlando, FL, USA, 2006.

 

English shall also be reviewed.

Author Response

Author response to reviewer - applsci-782080

 

Manuscript “Towards piston fine tuning of segmented mirrors through reinforcement learning”

We would like to thank the reviewer for the comments that helped us to improve the paper.

 

We added a paragraph about large-scale metrology approaches, together with references, to the first part of the manuscript. These approaches can be used jointly with the one described in this paper.

The English language has been checked as well.

Reviewer 3 Report

Please find attached my comments to the Authors.

Comments for author File: Comments.pdf

Author Response

Author response to reviewer - applsci-782080

Manuscript “Towards piston fine tuning of segmented mirrors through reinforcement learning”

We would like to thank the reviewer for the comments that helped us to improve the paper.

Please also find attached a redlined version of the changes to the manuscript.

  1. The Authors mention the issue of atmospheric instabilities, which represents a problem for other approaches based e.g. on supervised machine learning. How do atmospheric conditions affect the proposed approach with RL? Can this be quantified numerically? Also, it is my understanding (although I did not find it in the manuscript) that the proposed approach can be more resilient since it is trained in real time – i.e. with the actual atmospheric conditions – and not with simulated data: is this one of the reasons? Can the Authors please expand and clarify this whole aspect in the first part of the manuscript?

 

A new paragraph about this issue was added to the first part of the manuscript. A Fried parameter of 0.2 m was chosen, which is a standard value in atmospheric simulations. Worse atmospheric conditions would presumably degrade the learning process. How the seeing can interfere numerically with the training of the RL agent will be analysed in future work; this has now been added to the future work section.
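For orientation, an illustrative back-of-the-envelope conversion of that Fried parameter into seeing, using the standard relation rather than any figure from the paper (the 500 nm reference wavelength is our assumption):

    \theta_{\mathrm{FWHM}} \approx 0.98\,\frac{\lambda}{r_0} \approx 0.98 \times \frac{500 \times 10^{-9}\,\mathrm{m}}{0.2\,\mathrm{m}} \approx 2.5\,\mu\mathrm{rad} \approx 0.5''

A smaller r_0 (worse seeing) would make the simulated wavefronts vary more strongly between exposures, which is the mechanism by which worse conditions could slow the learning.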



  2. Lines 44 and 131: Can the Authors expand and clarify what the output of the CNN is? Is it a single action (sampled from a probability distribution P) or is it a probability distribution P? In the latter case, if the CNN outputs P, is the action sampled from P directly or is there any other process in between?

 

The output of the policy CNN is two scalar values. These two values represent the mean of a bivariate Gaussian distribution; the variance is a small value fixed in the algorithm. An action is sampled from that distribution and the agent performs it on the environment. Each action contains two movement orders, one for segment A and one for segment B. The agent then gets a reward for that action and adjusts the stochastic policy distribution accordingly.
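As a minimal sketch of this sampling step (assuming NumPy; the function name, the fixed standard deviation and the example means are illustrative, not values from the paper):

    import numpy as np

    def sample_action(policy_means, sigma=0.05):
        # Sample a two-component piston action (one movement order for
        # segment A, one for segment B) from a Gaussian policy centred on
        # the CNN output. `sigma` is the small, fixed standard deviation;
        # its value here is purely illustrative.
        mu = np.asarray(policy_means, dtype=float)   # shape (2,)
        return np.random.normal(loc=mu, scale=sigma)

    # Example: the CNN predicts means of +0.10 and -0.03 (arbitrary units)
    u = sample_action([0.10, -0.03])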



  3. Lines 137-146: I find this paragraph rather unclear, and I strongly encourage the Authors to improve it. If the Authors want to go technical here (contrary to the general flavor of the manuscript), they should be more precise and clear in the definitions and descriptions they provide. If, instead, the Authors want to provide only some introductory information, the paragraph should be substantially simplified. I find the current version something in between, which is obscure for the non-expert and somewhat rushed for experts. The Authors could either make it much shorter or much longer, or move it to a separate section in the Appendix, as they prefer. I think both would work well, just not in between.

 

The paragraph has been replaced with a shorter, clearer one. A reference has been added as well.




  4. Figure 4:
  • Can the Authors describe in the caption what the two lines are (mentioned in lines 160-162)?

Done.

 

  • How long did it take for the agent to learn in real time (not in steps)?

An exposure time of 2 s sets a lower bound of 11 hours for the agent to reach an rms of 50 nm in the ±λ/2 capture range, and of 1 hour in the ±λ/4 case. The training, however, can be carried out in parallel over several intersections; there are 10 of them to be trained on in a 36-segment mirror telescope. This decreases the previous training times to 1.1 hours and 0.1 hours, respectively.
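As an illustrative back-of-the-envelope check of these figures (the step counts below are inferred from the quoted hours and the 2 s exposure, not taken from the paper):

    # Back-of-the-envelope check of the quoted wall-clock training times.
    EXPOSURE_S = 2.0
    steps_half = 11 * 3600 / EXPOSURE_S     # ~19,800 steps for the +/- lambda/2 range
    steps_quarter = 1 * 3600 / EXPOSURE_S   # ~1,800 steps for the +/- lambda/4 range

    N_INTERSECTIONS = 10                    # trainable intersections on a 36-segment mirror
    hours_half = steps_half * EXPOSURE_S / 3600 / N_INTERSECTIONS        # 1.1 h in parallel
    hours_quarter = steps_quarter * EXPOSURE_S / 3600 / N_INTERSECTIONS  # 0.1 h in parallel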

 

  • Are these curves an average over many agents (how many?) or is it just one agent?

The curves represent the training of a single agent.

 

  • If it is an average, what is the standard deviation around the curve?

It is not an average over many agents; the graphs have been smoothed with a moving average over the last five steps. This information has been added to the manuscript.
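A minimal sketch of that kind of smoothing (a trailing moving average over the last five steps; the function name and the handling of the first few points are our assumptions):

    import numpy as np

    def trailing_moving_average(values, window=5):
        # Each point becomes the mean of itself and the previous
        # (window - 1) points; early points average over fewer samples.
        values = np.asarray(values, dtype=float)
        smoothed = np.empty_like(values)
        for i in range(len(values)):
            smoothed[i] = values[max(0, i - window + 1):i + 1].mean()
        return smoothed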

 

  5. Can the Authors describe what the “capture range” is, and why it has that specific interval?

 

The capture range defines the interval that contains the piston step values that the algorithm is being trained to detect and act upon. The capture ranges considered here are suitable for fine tuning the piston positions after a previous coarse piston alignment stage has been carried out. This has been added to the text.
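To give a sense of scale (an illustrative wavelength only; the specific wavelengths used are listed in the paper rather than here), for \lambda = 650\,\mathrm{nm} the two capture ranges correspond to

    \pm\lambda/2 = \pm 325\,\mathrm{nm} \qquad \text{and} \qquad \pm\lambda/4 \approx \pm 163\,\mathrm{nm}

of piston step between neighbouring segments.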






Additional relevant comments and suggestions

 

  1. Line 41: Can the Authors expand this sentence “It is then suitable for scenarios where ground truth labeled data are scarce or difficult to obtain such as in the optical phasing problem…”, giving an intuition on why this is true? This will strengthen the work. 

Done.

 

  2. Line 43: Please specify “convolutional”.

Done.

 

  3. Lines 46-48: Isn’t this true also for supervised approaches? Can the Authors clarify the difference?

The differences between supervised and reinforcement approaches lie in the training process and the kind of data needed for training. This has been explained more clearly in the text.

 

  4. Figure 1 seems to be never referenced in the main text.

Done.

 

  5. Lines 56-59: Please add here a reference to Figure 2.

Done.

 

  6. Line 64: Can the Authors clarify why “Four different wavelengths are considered”? The reason is already briefly mentioned in the text (e.g. “to disambiguate”), but it would be useful to expand it a few lines. For instance, what would happen with only one?

Done.

 

  7. Line 81: I believe that this is not precise. Please change it to “…that acts to maximize the long-term expected reward from the environment”, if you agree.

Done.

 

  8. This is only a suggestion that I believe could make the manuscript clearer and more accessible: since the RL literature uses “a” and “A” for actions, isn’t it better to use them instead of “u” and “U”? Is there a reason for the choice of “u” and “U”?

“u” is commonly used for actions in policy-gradient-based methods in the literature, whereas “a” is used in value-based methods. Ours is policy-based (although a value function is used as a critic), so we decided to use the former notation. Uppercase “U” denotes the “utility” of the policy, i.e. the long-term expected reward from initial states.
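For reference, in standard policy-gradient notation (the manuscript’s exact symbols may differ slightly), the utility of a policy \pi_\theta and its REINFORCE-style gradient are

    U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} r(s_t, u_t) \right],
    \qquad
    \nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid s_t)\, R(\tau) \right],

where \tau denotes a trajectory and R(\tau) its total reward.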

 

  9. Current Figure 3 should come before Figure 2.

Done.

 

  10. Please expand the caption of Figure 3.

Done.

 

  11. Please expand the caption of Figure 2. How can we see the “piston error” in the image? Please mention again why A and B are labeled but there is no C (it is already explained in the text).

Done.

 

  12. Lines 97-98: This seems to me ambiguous and should be clarified. Is the transition probability P the “probability of sampling one action given one state” (u_t given s_t), or is it the probability that also takes into account the unknown stochastic behaviour of the environment? The two are not the same, as the Authors know, but I cannot understand it from the text.

It is the probability of ending in the next state conditioned on both the current state and the action taken, p(s_{t+1} | s_t, u_t). It has now been rewritten to be more precise.
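To make the distinction explicit in standard MDP notation (the manuscript’s own symbols may differ in detail), the policy and the environment transition are two separate distributions, and a trajectory \tau = (s_0, u_0, s_1, \dots) factorizes as

    u_t \sim \pi_\theta(u_t \mid s_t), \qquad s_{t+1} \sim p(s_{t+1} \mid s_t, u_t),

    p(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(u_t \mid s_t)\, p(s_{t+1} \mid s_t, u_t).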

 

  13. Line 100: (related to my previous comment) Is the reward issued deterministically or stochastically? If I perform the same action in the same state, do I get the same reward? In Line 104 it is written that “The reward is considered to be in general stochastic”, but still I am not sure I understand.

The reward signal is deterministic in the simulations. However, in the more general RL setup, that signal can be considered stochastic. This means that performing the same action in the same state can produce slightly different reward values, as long as the long-term expected reward still defines the task the agent has to learn. These minor variations can be seen as noise sources added to the reward signal.

 

  14. Line 118: Can the Authors briefly comment on the parameters of the CNN (e.g. weights and/or biases)?

Done.

 

  15. Line 120: Please add a reference on, or very briefly expand, how “bivariate normal probability distributions” come into play.

Done.

 

  16. Line 129: Why is it true that it is Gaussian?

A Gaussian model is used to describe a stochastic policy over the continuous action space. The mean of the Gaussian is where the agent currently believes the action lies that most probably gives the highest long-term expected reward from the current state. The variance of the Gaussian quantifies the uncertainty of that prediction; it also helps the agent explore better actions and improve the policy. This explanation was added to the text.
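In symbols, a standard Gaussian policy over a two-dimensional continuous action space (the manuscript’s own parametrization may differ in detail) is

    \pi_\theta(u \mid s) = \mathcal{N}\!\left(u;\ \mu_\theta(s),\ \sigma^2 I_2\right),

where \mu_\theta(s) is the pair of means output by the CNN and \sigma is the small, fixed standard deviation used for exploration.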

 

  17. Algorithm 1, line 4: I think it is “sample”, not “compute”. In the same line, the left arrow seems to suggest that “u” is replaced by the policy “pi”, which of course is not true. Maybe it would be useful to clarify that “u” is sampled through the use of the policy.

Done. The left arrow has been replaced by the sampling operator “~”.

 

  18. Algorithm 1, line 7: Can the Authors clarify what happens here, and how it is done?

The line caption has been changed, and the step-by-step description of the algorithm in the text has been expanded.

 

  19. Lines 163-164: The RL agent does not predict the MSE; rather, it learns to minimize it. If the Authors agree, this is a key conceptual point to clarify.

Done.

 

  20. Figure 5: Please modify the caption according to the above comment, if applicable. What is the continuous horizontal line?

Done.

 

  21. A very simple suggestion: in the abbreviations, RL and CNN could be added for convenience.

Done.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

I believe the Authors have performed an effective revision of the manuscript, addressing all my concerns. The overall presentation is now much stronger and clearer. I am thus happy to recommend it for publication.
