Peer-Review Record

Automated Hyperparameter Tuning in Reinforcement Learning for Quadrupedal Robot Locomotion

Electronics 2024, 13(1), 116; https://doi.org/10.3390/electronics13010116
by MyeongSeop Kim 1, Jung-Su Kim 2 and Jae-Han Park 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 28 November 2023 / Revised: 18 December 2023 / Accepted: 21 December 2023 / Published: 27 December 2023

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper is interesting, although at some points the described method is presented rather briefly and in too little detail.

The ideal situation, of course, is one in which the article is self-explanatory. Here, the authors touch on a large number of topics, from robotics and control to machine learning. It is clear that not everything is (and not everything should be) explained from scratch, but some information is missing. For example, there is no information on the neural networks modeling the value function and the policy (their type, structure, etc.).

There should be a little more information about the robots used in the research. Was the research conducted only by simulation? Did you manage to perform tests on real robots?

Minor remarks:

Were other reward functions than the one described by formula (1) tried?

A discount factor of 0.98 (line 98) was used. Why was such a value used?

Are the numbers in parentheses in Figure 1 the numbers of state variables?

Comments on the Quality of English Language

There are some minor errors in the paper, such as rewardwe (line 119), controllerdc (caption under Figure 1) and others. The language needs slight improvement.

Author Response

First of all, thank you for reviewing the paper in detail and for your kind words.
We have revised the paper based on the points you raised and addressed many of its deficiencies.

Major remarks:

*** there is no information on neural networks modeling value and policy (what was the type, structure of these networks, etc.).
There should be a little more information about the robots used in the research. Was the research conducted only by simulation? Did you manage to perform tests on real robots?

First, we acknowledge that not describing the size and structure of the neural networks was a clear oversight on our part.
This part has been strengthened with Figure 3 and a brief explanation, and the modified text is marked in red.

Also, we completely agree that the description of the robots was lacking.
More information about the robots has been added to Table 1.
We also added an explanation of the parameter values that change with robot size.

The algorithm in the paper was tested only in a simulation environment.
Since its practicality has already been demonstrated in papers such as [1] and [2], on which this work builds, this paper focuses on reward shaping according to changes in the robot model.
However, guaranteeing performance in the real world is important, so we added a note on this to the future work section.
The large robot used in the paper has currently been developed as a prototype.

Minor remarks:

*** Were other reward functions than the one described by formula (1) tried?

We tried several other reward functions together with the automatic reward shaping algorithm, but they had less impact on gait performance than the reward functions mentioned in the paper.
In addition, including them made the search space too large and the relationship between reward weights and gait too complicated to express, so they were not added. However, it is not difficult to extend the search to other reward weights.
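For illustration, the kind of weighted multi-term locomotion reward we refer to can be sketched as follows. The term names, state keys, and weight values below are illustrative placeholders, not the exact terms or values of formula (1):

def locomotion_reward(state, action, w_vel=1.0, w_energy=0.005, w_smooth=0.1):
    # Track the commanded forward speed.
    r_vel = -abs(state["forward_velocity"] - state["target_velocity"])
    # Penalize mechanical power (torque times joint velocity).
    r_energy = -sum(abs(t * v) for t, v in zip(state["joint_torques"], state["joint_velocities"]))
    # Penalize abrupt changes between consecutive actions.
    r_smooth = -sum((a - p) ** 2 for a, p in zip(action, state["previous_action"]))
    return w_vel * r_vel + w_energy * r_energy + w_smooth * r_smooth

# Toy call with made-up values:
toy_state = {"forward_velocity": 0.9, "target_velocity": 1.0,
             "joint_torques": [2.0, -1.5], "joint_velocities": [0.3, 0.4],
             "previous_action": [0.1, -0.2]}
print(locomotion_reward(toy_state, action=[0.15, -0.25]))

The automatic reward shaping algorithm searches over weight vectors such as (w_vel, w_energy, w_smooth) rather than over the functional form of the terms.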


*** A discount factor of 0.98 (line 98) was used. Why was such a value used?

As you know, it is common in most cases to use a value such as 0.99.
In this paper, we used a value of 0.98, which is the same value used in [1].
The discount factor clearly affects the agent's training: when the value is lowered, the agent focuses more on the current reward than on future rewards.
However, our tests showed that the performance difference between 0.99 and 0.98 was not significant, so the value from [1] was used as is.
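As a simple standalone illustration of this trade-off (not code from the paper), the two discount factors can be compared through the discounted return and the effective horizon 1/(1 - gamma):

# Compare how strongly gamma = 0.98 and gamma = 0.99 weight future rewards.
def discounted_return(rewards, gamma):
    g, total = 1.0, 0.0
    for r in rewards:
        total += g * r
        g *= gamma
    return total

rewards = [1.0] * 500  # a constant reward of 1 at every step
for gamma in (0.98, 0.99):
    print(f"gamma={gamma}: return={discounted_return(rewards, gamma):.1f}, "
          f"effective horizon ~ {1 / (1 - gamma):.0f} steps")

With gamma = 0.98, credit is concentrated on roughly the next 50 steps; with gamma = 0.99, on roughly the next 100, which is the effect described above.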

*** Are the numbers in parentheses in Figure 1, the numbers of state variables?

You're right. The values in the figure indicate the size of the state vector.
We added this information to Figure 3 to help the reader relate it to the neural network sizes.

Once again, we really appreciate your detailed review and your specific comments, which helped us improve the areas that were lacking.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper presents a novel automated reward shaping approach to reinforcement learning (RL) of locomotion control of quadrupedal robots. The approach defines a gait score function evaluated automatically for every combination of reward function weights in the learning loop of the PPO algorithm. For this, a number of models (policy and value function) equal to the number of all considered combinations of reward weights are trained simultaneously and the model with the highest gait score is selected for further learning or evaluation. The experimental results show that the approach is able to find an optimal reward weight configuration for two simulated quadrupedal robots on rough and stair terrains.
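For clarity, my reading of this selection loop can be illustrated with the following self-contained toy sketch (all names and quantities are stand-ins of my own, not the authors' code):

import random

def toy_train_step(learner):
    # Stand-in for one PPO update on the learner's own on-policy rollout.
    learner["skill"] += random.uniform(0.0, learner["weights"][0])

def toy_gait_score(learner):
    # Stand-in for the Eq. 4-style gait score evaluated on the rollout.
    return learner["skill"]

# One learner (policy and value function) per reward-weight combination.
weight_combinations = [(0.5, 1.0), (1.0, 0.5), (1.0, 1.0)]
learners = [{"weights": w, "skill": 0.0} for w in weight_combinations]

for iteration in range(20):
    for learner in learners:
        toy_train_step(learner)               # all combinations are trained in parallel
    best = max(learners, key=toy_gait_score)  # the highest gait score is selected

print("selected reward weights:", best["weights"])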

The work is interesting, novel, and technically sound. The paper is fairly written with a few presentation issues (see minor points below). However, a number of questions remain unanswered and should be addressed. I see the following major points:

1. The paper lacks relevant literature on reward shaping beyond auto-RL. In particular, reward shaping with intrinsic motivation based on force measurements [1] and behavior-matching [2] as well as variance aware reward smoothing [3] and Lyapunov function based reward shaping [4] have shown to improve RL efficiency for robotic control tasks. A discussion on these approaches would address the gap in the literature and provide richer context to the paper.

2. How is the gait score evaluation performed? Eq. 4 shows that the score is computed over a trajectory of length N. However, there is no information on how this trajectory is collected. Is it the same trajectory used to train the policy, or is it a trajectory collected after the policy parameters are updated? The "N" is used in the paper sometimes as the number of training epochs and sometimes as the length of the evaluation trajectory. Avoid overloading notations.

Minor points:

- Sec. 3: Add a distance between the caption of Figure 1 and the main text.

- Sec. 3: "design the rewardwe functions" -> "design the reward functions"

- Sec. 3: The text mentions "PPO is an off-policy". This is wrong, PPO is on-policy. The actions are always sampled from the stochastic policy.


[1] Improved Learning of Robot Manipulation Tasks via Tactile Intrinsic Motivation
[2] Behavior Self-Organization Supports Task Inference for Continual Robot Learning
[3] Variance Aware Reward Smoothing for Deep Reinforcement Learning
[4] Principled Reward Shaping for Reinforcement Learning via Lyapunov Stability Theory

Author Response

Thank you for reviewing the paper in detail and for your kind words. We have revised the paper based on the points you raised and addressed many of the deficiencies.

Major remarks:

1. The paper lacks relevant literature on reward shaping beyond auto-RL. In particular, reward shaping with intrinsic motivation based on force measurements [1] and behavior-matching [2] as well as variance aware reward smoothing [3] and Lyapunov function based reward shaping [4] have shown to improve RL efficiency for robotic control tasks. A discussion on these approaches would address the gap in the literature and provide richer context to the paper.

- Answer : First of all, we would like to thank you for examining our paper in detail from the perspective of this field and for interpreting it favorably.
We completely agree that the discussion of prior research on reward shaping beyond auto-RL was lacking, and we thank you very much for this detailed suggestion.
Based on the papers you mentioned, we have strengthened the related work section. The added parts are marked in red.

2. How is the gait score evaluation performed? Eq. 4 shows that the score is computed over a trajectory of length N. However, there is no information on how this trajectory is collected. Is it the same trajectory used to train the policy, or is it a trajectory collected after the policy parameters are updated? The "N" is used in the paper sometimes as the number of training epochs and sometimes as the length of the evaluation trajectory. Avoid overloading notations.

- Answer : Thank you so much for pointing out something we missed.
First of all, we fully acknowledge that reusing the same notation caused confusion and that it was unclear what N specifies when calculating the gait score. The length of the entire epoch is now defined as L, and N, the number of steps over which the gait score is averaged, is specified separately.
The algorithm and its explanation have been modified accordingly and marked in red.
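To make the distinction concrete, the relationship between the epoch length L and the averaging window N can be pictured with the toy sketch below. The names and the use of a trailing window are illustrative assumptions; the exact definition is given by the revised Eq. (4) in the paper:

# An epoch of L environment steps yields L per-step gait measurements; the reported
# gait score averages N of them (a trailing window is assumed here for illustration).
def epoch_gait_score(per_step_gait_quality, N):
    window = per_step_gait_quality[-N:]
    return sum(window) / len(window)

L, N = 1000, 200
per_step_gait_quality = [0.5 + 0.0005 * t for t in range(L)]  # toy values improving over the epoch
print(epoch_gait_score(per_step_gait_quality, N))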

Minor points:

*** - Sec. 3: Add a distance between the caption of Figure 1 and the main text.
- Answer : Thank you for pointing this out. The spacing between the figure caption and the main text was adjusted.

*** - Sec. 3: "design the rewardwe functions" -> "design the reward functions"
- Answer : Thank you. We corrected the parts you mentioned and re-examined the entire paper for other typos.

*** - Sec. 3: The text mentions "PPO is an off-policy". This is wrong, PPO is on-policy. The actions are always sampled from the stochastic policy.
- Answer : What you pointed out is absolutely correct. PPO is on-policy, and our algorithm also operates in an on-policy structure. We are somewhat embarrassed to have made this mistake.
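For completeness, the standard PPO clipped surrogate objective makes the on-policy nature explicit, since the probability ratio is formed with respect to the policy that collected the current batch of trajectories:

L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},

where \pi_{\theta_{\mathrm{old}}} is the policy that generated the trajectories of the current iteration, so no data from older policies is reused.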

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript is much better than its earlier version and is conceptually convincing. My concerns were addressed. 
