Peer-Review Record

Efficient Hyperbolic Perceptron for Image Classification

Electronics 2023, 12(19), 4027; https://doi.org/10.3390/electronics12194027
by Ahmad Omar Ahsan 1, Susanna Tang 2 and Wei Peng 3,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 30 July 2023 / Revised: 20 September 2023 / Accepted: 21 September 2023 / Published: 25 September 2023
(This article belongs to the Collection Graph Machine Learning)

Round 1

Reviewer 1 Report

This is an interesting study. However, I think it is necessary to supplement the experiments, as the authors only use three methods.

Minor editing of English language required

Author Response

1. This is an interesting study. However, I think it is necessary to supplement the experiments, as the authors only use three methods.

We extend our sincere appreciation to the reviewer for recognizing the significance of our research endeavor. In accordance with this valuable suggestion, we have incorporated additional comparative methodologies into our study. The analysis now includes prominent state-of-the-art models, namely the Vision Transformer (ViT) and the highly efficient EfficientViT, alongside our previously established methods. Under a fair comparison setting, our method shows superiority on all of the datasets considered. This augmentation not only enriches the depth of our analysis but also enables us to offer a more holistic and robust assessment of our research findings.

 

2.Minor editing of English language required

We greatly appreciate your observation and feedback. In response, we have conducted a thorough review of the entire paper, specifically targeting language-related issues. The paper is now presented with improved clarity and readability, making it more accessible and comprehensible to our readers.

Reviewer 2 Report

1. The abstract is lengthy; the authors should succinctly convey the main research context and provide a concise overview of the proposed method.

2. The Contribution needs revision to effectively highlight the primary innovations of the proposed method.

3. The HR-Block network architecture in Figure 4 is too small; the authors should enhance readability by adjusting its size. Similarly, the problem arises in Figure 7.

4. The manuscript lacks recent comparative methods from the past two years; the authors should contrast their approach with state-of-the-art methods to demonstrate its superiority.

 

5. Ablation experiments are missing to demonstrate the superiority of the individual components (in Figure 4) in the proposed method.

Minor editing of English language required

Author Response

  1. The abstract is lengthy; the authors should succinctly convey the main research context and provide a concise overview of the proposed method.

We appreciate your feedback regarding the length of the abstract. In response to your suggestion, we have revised the abstract to ensure a more concise presentation of the main research context and a succinct overview of our proposed methodology. This modification aims to enhance the clarity and accessibility of our paper, making it easier for readers to grasp the core contributions and objectives of our research. The concise version reads as follows:

Deep neural networks, often equipped with powerful auto-optimization tools, find widespread use in diverse domains like NLP and computer vision. However, traditional neural architectures come with specific inductive biases, designed to reduce the parameter search space, cut computational costs, or introduce domain expertise into network design. In contrast, Multilayer Perceptrons (MLPs) offer greater freedom and lower inductive bias than convolutional neural networks (CNNs), making them versatile for learning complex patterns. Despite their flexibility, most neural architectures operate in a flat Euclidean space, which may not be optimal for various data types, particularly those with hierarchical correlations. In this paper, we move one step further to introduce the Hyperbolic Res-MLP (HR-MLP), an architecture extending the attention-free MLP to non-Euclidean space. HR-MLP leverages fully hyperbolic layers for feature embeddings and end-to-end image classification. Our novel Lorentz cross-patch and cross-channel layers enable direct hyperbolic operations with fewer parameters, facilitating faster training and superior performance compared to Euclidean counterparts. Experimental results on CIFAR10, CIFAR100, and MiniImageNet confirm HR-MLP's competitive, and in several cases superior, performance.

 

  2. The Contribution needs revision to effectively highlight the primary innovations of the proposed method.

We thank you for your feedback regarding the contribution section. We recognize the importance of effectively highlighting the primary innovations of our proposed method. In light of your suggestion, we have revised the contribution section to ensure it provides a clear and comprehensive overview of the key innovations that our research brings to the field. The modified version reads as follows:

  • A fully hyperbolic deep neural architecture for image tasks, called hyperbolic ResMLP (HR-MLP), is presented to explore the potential of the hyperbolic perceptron for high-dimensional data.
  • The proposed HR-MLP has Lorentz cross-patch and cross-channel layers, which are manifold-preserving neural operators.
  • Results on CIFAR10, CIFAR100, and MiniImageNet demonstrate comparable or superior performance relative to the Euclidean counterpart, together with much better interpretability.

 

  3. The HR-Block network architecture in Figure 4 is too small; the authors should enhance readability by adjusting its size.

Thank you very much for this suggestion; we fully agree with the reviewer. We have therefore enlarged the HR-Block diagram, now shown in Figure 6, where we provide architecture details about the Lorentz cross-channel layer and the Lorentz cross-patch layer, as well as the transformation of the tensors in between. Once again, we extend our gratitude for this valuable suggestion, which has further refined the quality of our research.

  4. The manuscript lacks recent comparative methods from the past two years; the authors should contrast their approach with state-of-the-art methods to demonstrate its superiority.

We greatly appreciate your suggestion. We have expanded the comparative analysis within our study. Our enhancements include a thorough evaluation of prominent state-of-the-art models, specifically the Vision Transformer (ViT) and the highly efficient EfficientViT, in addition to our previously established methods, such as ResMLP, an attention-free method gaining recognition in the field.

As shown in Table 1, under a fair comparison setting we compare models with similar parameter counts. We have observed the superior performance of our method across all the datasets examined. Furthermore, it is noteworthy that our method outperforms Vision Transformer models, such as ViT and EfficientViT, while utilizing only half the parameters (although the FLOPs are higher, which is caused by the non-linear operations). This also enriches the depth of our analysis, allowing us to provide a more comprehensive and robust assessment of our research findings. We believe that these additions strengthen the validity and relevance of our work, further underscoring its significance in the domain.
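For completeness, the "similar parameter count" setting can be reproduced with a generic utility such as the following PyTorch sketch (our own illustrative helper, not code from the paper):

    import torch.nn as nn

    def count_trainable_params(model: nn.Module) -> int:
        # Sum the element counts of all trainable tensors in the model;
        # this is the figure used to match model sizes across baselines.
        return sum(p.numel() for p in model.parameters() if p.requires_grad)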

 

  5. Ablation experiments are missing to demonstrate the superiority of the individual components (in Figure 4) in the proposed method.

Thanks for this suggestion. One thing we want to highlight is that our method builds upon the foundation of ResMLP, which serves as our natural baseline and forms the basis for our ablation studies. As indicated in Table 1, our approach demonstrates a remarkable performance improvement, outperforming its Euclidean counterpart by 13.67% on CIFAR10 and 13.26% on CIFAR100. These results constitute compelling evidence of the effectiveness of the proposed module, highlighting its substantial contributions to our research.

Reviewer 3 Report


Comments for author File: Comments.pdf

  • There are some minor typos throughout, like extra spaces or misspellings. Carefully proofread to catch these.
  • Be consistent in verb tense - generally stick to present tense when describing methods and past tense for results.

Author Response

Introduction

The introduction could provide more background and motivation by elaborating on the limitations of current computer vision techniques operating purely in Euclidean space. What specific issues arise from not correctly capturing hierarchical relationships? How could hyperbolic geometry help address these? More comparisons between hyperbolic and Euclidean geometry would help readers better understand the key differences and benefits of the non-Euclidean approach. Clarify earlier why images can possess inherent hierarchical structure, and provide some examples or citations to support this claim.

Response:

We appreciate the reviewer's valuable feedback, which underscores the importance of a thorough introduction to our work. In response to these suggestions, we aim to provide a clearer context and motivation for our study.

Current computer vision techniques predominantly operate in Euclidean space, which assumes that data points are embedded in a flat, geometrically Euclidean environment. While this approach has been successful in many cases, it has limitations, especially when dealing with complex data types like images. One notable limitation is the inability to effectively capture hierarchical relationships within image data.

Images, by their nature, often possess inherent hierarchical structures. For example, in an image of a person, there are hierarchies ranging from pixels to body parts (e.g., limbs, face), and finally to the whole person. These hierarchical relationships are not easily modeled in a flat Euclidean space, which treats all distances equally. When using Euclidean geometry, important contextual information and relative hierarchical importance may be lost, making it challenging to recognize and understand complex image features. Prior research, such as Khrulkov et al. (2020), has compellingly demonstrated the presence of distinct hierarchical relationships within image features learned by widely adopted neural networks like VGG. Simultaneously, it is worth noting that hierarchy and tree structures constitute common paradigms for human cognition in comprehending and recognizing the world.

Hyperbolic geometry, on the other hand, offers a solution to this limitation. It provides a curved, non-Euclidean space that naturally accommodates hierarchical data relationships. In hyperbolic space, the distance between points varies depending on their position in the hierarchy. This curvature enables more faithful representation and modeling of hierarchical correlations present in data like images. By leveraging hyperbolic geometry, we can enhance our ability to capture, understand, and manipulate hierarchical structures within images.
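For reference, on the Lorentz model with curvature -1 the relevant inner product and geodesic distance take the standard form (textbook formulas, not notation specific to our paper):

    \langle x, y \rangle_{\mathcal{L}} = -x_0 y_0 + \sum_{i=1}^{n} x_i y_i,
    \qquad
    d_{\mathcal{L}}(x, y) = \operatorname{arccosh}\bigl( -\langle x, y \rangle_{\mathcal{L}} \bigr),

so points deep in a hierarchy can be embedded far apart even when their coordinates look close in the Euclidean sense.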

Thank you once again for your insightful feedback, which has guided us in refining the introduction to provide a more comprehensive and motivated context for our research. We have added these into the introduction part.

 

Methodology

Explain more intuitively what the Lorentz model represents and why it was chosen for this application. Provide more details on the Lorentz linear layers: how do they work? What are the mathematical operations involved? Possibly include equations or algorithms. Similarly, explain the Lorentz cross-patch and cross-channel layers more clearly: what do they computationally do in the model?

Response:

Thanks. The method aims to provide an efficient neural operator that is manifold-preserving. That is, we want to learn features efficiently while ensuring the learned features still lie on the manifold. The most challenging part is that most neural operations from Euclidean space are not manifold-preserving (in Euclidean space we do not need to take this into consideration). Therefore, we provide Lorentz cross-patch and cross-channel layers that decouple the operation into two steps: first, mapping the features as in Euclidean space; second, boosting the features so that they lie on the manifold again.
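To make the two-step construction concrete, below is a minimal PyTorch sketch of such a manifold-preserving linear layer on the Lorentz model with curvature -1 (an illustrative simplification of the idea, not the exact layer implemented in the paper):

    import torch
    import torch.nn as nn

    class LorentzLinear(nn.Module):
        # Step 1: map the space-like coordinates with an ordinary linear layer.
        # Step 2: recompute the time-like coordinate so the output satisfies the
        # hyperboloid constraint <x, x>_L = -1, i.e. stays on the manifold.
        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x has shape (..., in_dim + 1); x[..., 0] is the time-like coordinate
            space = self.linear(x[..., 1:])                               # step 1
            time = torch.sqrt(1.0 + space.pow(2).sum(-1, keepdim=True))  # step 2
            return torch.cat([time, space], dim=-1)

The same recipe carries over to the cross-patch and cross-channel layers: the Euclidean-style map acts across patches or channels, and the recomputed time component "boosts" the result back onto the hyperboloid.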

 

Results

Include quantitative results beyond accuracy, such as model size, computational complexity, convergence rate, etc. This will better highlight the advantages over Euclidean networks. Perform ablation studies to validate design choices such as the Lorentz model and the layers used. Visualizations, such as embeddings projected into 2D, could help give more insight into what the model has learned.

Response:

Thanks for this suggestion. We have added more information about model size and computational complexity (measured in FLOPs), as well as two more comparison models, i.e., the Vision Transformer model and the EfficientViT model. Our method builds upon the foundation of ResMLP, which serves as our natural baseline and forms the basis for our ablation studies. As indicated in Table 1, our approach demonstrates a remarkable performance improvement, outperforming its Euclidean counterpart by 13.67% on CIFAR10 and 13.26% on CIFAR100. These results constitute compelling evidence of the effectiveness of the proposed module, highlighting its substantial contributions to our research.

From Table 1 we can see that our model can even surpass the current ViT model in this compact setting, with only half of the parameters. One drawback is that its non-linear property forces it to incur more computational cost. We also provide a feature visualization in Figure 7: even with only 1M parameters, our method produces relatively clear clusters for the different classes, while this is very challenging for its Euclidean counterparts.

Discussion

Provide more analysis into why the proposed model performs better than Euclidean counterparts. What specifically about the hyperbolic geometry helps? Discuss any limitations or downsides observed compared to Euclidean networks. Suggest future work to further improve hyperbolic networks for computer vision. Overall, try to motivate the hyperbolic approach more strongly, intuitively explain the methodology, and thoroughly analyze the results to bring out the benefits. This will strengthen the evidence and importance of the proposed technique.

Response:

We sincerely appreciate your insightful question. The strength of hyperbolic geometry lies in its inherent non-linearity, which endows neural networks with the potential to capture richer information when compared to their Euclidean counterparts of similar architecture. This non-linearity results in a distinctly different distance metric within hyperbolic space compared to the flat Euclidean space. Consequently, neural networks operating in hyperbolic space are compelled to learn more intricate and fine-grained feature representations, as what may be considered 'near' features in Euclidean space can be considerably distant in the hyperbolic space.

However, it's essential to note that this non-linearity, while advantageous, also brings about certain limitations. For instance, it increases the computational demands of the model. Unlike in Euclidean space, where networks can often be decoupled into linear layers and non-linear activations, hyperbolic networks require a more intertwined and challenging learning process. Additionally, there is a lack of definitive evidence supporting the superiority of hyperbolic neural networks when there are no parameter constraints. In other words, it remains unproven whether larger hyperbolic models consistently outperform their Euclidean counterparts, raising questions about the scalability and practicality of hyperbolic models in certain scenarios.

We have added this to the discussion section. Thanks again.

 

There are some minor typos throughout, like extra spaces or misspellings. Carefully proofread to catch these.

Response:

We wish to express our gratitude for your observation and constructive feedback. In light of your meticulous assessment, we have undertaken a comprehensive review of the entirety of our paper, with a specific focus on rectifying any linguistic intricacies, thereby enhancing its overall coherence and facilitating a better reading experience for our audience.

 

Be consistent in verb tense - generally stick to present tense when describing methods and past tense for results.

Response:

Thanks for this suggestion; we have modified the manuscript accordingly.

Round 2

Reviewer 2 Report

Compared with published papers, the comparative testing in this paper is still insufficient; it is suggested that the authors strengthen the comparison with the recently proposed ViT.

Minor editing of English language required.

Author Response

Thank you very much for the suggestions. We have already compared against two ViT models, and the results provide clear evidence that our method is better in this compact setting. As reported in the literature, the advantage of ViT comes from large models trained on huge datasets. However, multi-head attention does not show any advantage when the dataset is relatively small and the model is extremely compact. We believe that is why it performs worse on the benchmark tasks here.
