CuFP: An HLS Library for Customized Floating-Point Operators
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
See attachment file.
Comments for author File: Comments.pdf
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The paper introduces CuFP, an HLS-based library for customized floating-point operators. Overall, the paper is well written and easy to follow, with several evaluation and validation results. The main contribution is an open-source tool that allows users to customize floating-point types for their calculations, giving them enhanced control over precision. In addition, the authors provide implementations of basic operations (vector sum, dot product, and matrix-vector multiplication) which are crucial for numerous applications, and they design a template-based recursive function to ensure the operations are consistently optimized. On the downside, although the authors provide an open-source library with a GitHub link, they recommend that users use the Vivado HLS 2019.1 suite, which is an outdated version of the Xilinx tools. Which modifications are required in order to use more recent versions of the Xilinx tools (e.g., Vitis 2024.1)? In addition, the GitHub page describes how to configure the tool in order to produce the IP. I think you should add some instructions on how to use the produced IP. Moreover, most FPGA-based applications have memory and compute I/O limitations. For this reason, pack/unpack mechanisms over 512-bit interfaces can be used to minimize memory accesses and improve performance. Does your tool support this feature? In Section 3.4.3 you present your approach to mvm. However, a wide range of applications (especially neural networks) use matrix multiplication of two two-dimensional arrays. Is this supported efficiently by your tool (without multiple accesses to the input arrays)? Throughout the paper you refer to the “Vendor IP”. However, you do not state which vendor IP is used for the experimental results (link/citation), so the reader cannot reproduce the reported numbers.
In addition, the presented work is not compared with other HLS libraries/tools that offer customized operations, such as TrueFloat, [1], and [2] (at least for the basic operations). The only comparisons are with the Vendor IP (single precision) and FloPoCo, which is a VHDL-based tool. I think a comparison with other template-based libraries is required in order to demonstrate the performance difference of your work, or at least to demonstrate its novelty relative to those tools. Finally, the performance novelty lies in the vsum, dp, and mvm operations. However, the Experimental Section uses very small arrays (32x32), which are stored in BRAM. Are there any experiments with larger arrays that do not fit in BRAM? Minor comment: in Tables 2-5, what do you mean by “# of Stages”? The number of cycles? If so, I think it would be better to write “# of Cycles”.
[1] https://doi.org/10.3390/jlpea12040056
[2] https://doi.org/10.1109/FCCM.2019.00038
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Thank you for addressing all of my comments and suggestions. The improvements have significantly enhanced the clarity and quality of the manuscript. I have no further comments and recommend acceptance of the paper.