Chip Design of Multithreaded and Pipelined RISC-V Microcontroller Unit †

by Mao-Hsu Yen ¹, Yih-Hsia Lin ², Tzu-Feng Lin ¹, Yu-Hui Chen ¹,*, Yuan-Fu Ku ³ and Chien-Ting Kao ⁴

¹ Department of Computer Science and Engineering, National Taiwan Ocean University, Keelung 202301, Taiwan
² Department of Electronic Engineering, Ming Chuan University, Taoyuan 333321, Taiwan
³ Taiwan Testing and Certification Center, Taoyuan 333011, Taiwan
⁴ Bureau of Standards, Metrology and Inspection (BSMI), Taipei 100026, Taiwan
* Author to whom correspondence should be addressed.
† Presented at the 2024 IEEE 7th International Conference on Knowledge Innovation and Invention, Nagoya, Japan, 16–18 August 2024.

Eng. Proc. 2025, 89(1), 31; https://doi.org/10.3390/engproc2025089031

Published: 28 February 2025

(This article belongs to the Proceedings of 2024 IEEE 7th International Conference on Knowledge Innovation and Invention)


Abstract

Multithreading is widely used in microcontroller unit (MCU) chips. Multithreaded hardware is composed of multiple identical single threads and issues instructions from different threads. Using thread-level parallelism (TLP), the pauses that occur during single-thread operation are filled by other threads, which increases the throughput per unit of time. Pipelining uses instruction-level parallelism (ILP) by splitting the MCU into multiple stages: while one instruction occupies a certain stage, other instructions operate in the otherwise idle stages, which improves execution efficiency. Based on a four-thread, pipelined RISC-V MCU architecture, we analyzed the instruction types of three benchmarks, i.e., Coremark, SHA, and Dijkstra. A total of 94% of the instructions use the arithmetic logic unit (ALU). Based on the four-thread architecture, we developed RISC-V architectures with two to four ALUs and a matching dispatch algorithm. These architectures dispatch multiple instructions simultaneously, which enables parallel processing of instructions and increases efficiency. Compared to the traditional RISC-V architecture with only one ALU, the test results showed that the instructions per clock (IPC) of the RISC-V architectures with two, three, and four ALUs increased by 76, 128.9, and 154.3%, while the area increased by 12, 22.3, and 32.6% and the static power consumption increased by 5.1, 9.2, and 13.3%, respectively. The results showed a significant improvement in performance with only a slight increase in area. Because the available chip area was limited, a two-thread microcontroller architecture was used for the IC design and tape-out. The chip was implemented in TSMC's 180 nm process with a chip area of 1190 μm × 1190 μm and an operating frequency of 133 MHz.

Keywords:

multithreading; microcontroller unit (MCU); thread-level parallelism (TLP); arithmetic logic unit (ALU)

1. Introduction

A microcontroller unit includes a core, memory, input/output interfaces, and external devices that together provide multiple control and coordination capabilities. The instruction set architecture adopted in this study is the open-source RISC-V [1], which originated at the University of California, Berkeley. It follows the principle of the reduced instruction set computer (RISC): the instruction set is simple, the instruction formats are regular, and a pipelined architecture is easy to implement, all of which contribute to concise code, fast computation, and a small hardware footprint. The RISC-V instruction set consists of a base integer instruction set and standard extensions. We implemented the 32-bit base integer instruction set with the standard extension for integer multiplication and division (RV32IM), with 32 registers, 48 instructions, and 6 instruction formats. The instructions were divided into immediate addressing, register addressing, jump instructions, and load/store instructions [2,3].

To improve the speed and efficiency of the microcontroller unit (MCU), a common design approach is to pipeline the core so that, through instruction-level parallelism (ILP), one instruction is completed in each clock cycle [4]. Although this improves execution efficiency significantly, it cannot keep pace with current demands. Hardware multithreading offers a way forward. A single thread can be seen as a production line that independently runs instructions, from fetch and decoding to execution. When the workload becomes heavy, a single production line cannot keep up, and efficiency drops. Multithreading combines multiple identical single threads and distributes tasks among them, which increases the throughput achieved in the same amount of time. When applied to an MCU, multithreading uses thread-level parallelism (TLP) to fill the pauses that a single thread incurs, for example while waiting for memory access. This keeps the execution unit busy instead of leaving it idle. Finally, a dispatch algorithm determines which thread may send its instruction to the execution unit [5]; this choice governs how each thread progresses and, ultimately, the overall running speed.

Eni et al. proposed a four-thread RISC-V MCU containing one arithmetic logic unit (ALU), together with the efficient hint-based event (EHE) issue scheduling algorithm for the multithreaded RISC-V pipeline [6]. The EHE algorithm combines round-robin and event-driven scheduling. In general, round-robin is used for dispatch, but when the next instruction to be executed is a jump instruction, the scheduler switches to event-driven mode. This method is more efficient than the pure round-robin and pure event-driven algorithms. Nojiri and Yamasaki proposed a multithreaded RISC-V processor for real-time systems [7]. Their processor utilizes ILP and TLP, and if vector operations are added, data-level parallelism (DLP) is also obtained, which covers all forms of parallelization. Cheikh et al. proposed a family of multithreaded cores that is compatible with the Pulpino processor and supports internet of things (IoT) applications with interleaved multithreading [8]. Collange put forward a massively multithreaded RISC-V processor core called Simty and proposed an open platform of the same name for studying GPU microarchitectures and hybrid CPU-GPU microarchitectures [9].

We tested the architecture proposed in Ref. [6] using three benchmarks: Coremark, SHA, and Dijkstra. After compiling them into assembly and analyzing the instruction types, we found that more than 94% of the instructions used the ALU (Table 1), although the ALU accounted for only 9.2% of the total area of the RISC-V architecture. In a multithreaded processor, having only one ALU limits the execution efficiency of every thread, so increasing the number of ALUs should improve performance. We therefore created RISC-V multithreaded pipeline architectures with two to four ALUs and analyzed the area, power, and performance of each using the benchmarks. The EHE algorithm for a single ALU was modified to obtain a dispatch algorithm for two to four ALUs, so that multiple instructions are dispatched in parallel in the same cycle, thereby increasing efficiency. According to the test results, the architectures with two to four ALUs showed significant improvements in IPC, with only slight increases in area and static power consumption.

In this article, Section 2 elaborates on the multithreaded hardware architecture and pipelined RISC-V design; Section 3 introduces the dispatch algorithm; Section 4 presents the implementation of the architecture, which is validated using an FPGA and implemented using very-large-scale integration (VLSI); Section 5 concludes this study and presents future research directions.

2. Architecture Design

Although more than 90% of the instructions use the ALU, the ALU occupies less than 10% of the area; thus, increasing the number of ALUs is expected to improve the IPC effectively. Therefore, we developed a four-thread RISC-V MCU with N ALUs and a corresponding dispatch algorithm. This architecture consists of a three-stage pipeline, four hardware threads, and two to four ALUs. The RISC-V architecture with four hardware threads and one ALU is used as a control group for comparing performance, power consumption, and area. The four-thread RISC-V MCU with N ALUs is shown in Figure 1. The design of each of the three pipeline stages is described below.

The instruction fetch stage is the first stage of the three-stage pipeline in the RISC-V MCU. In most cases, it fetches instructions from the program memory based on the address stored in the program counter (PC). The instruction fetch stage is usually one of the key bottlenecks in processor performance, as it involves memory access and is relatively slow. Therefore, current processors use techniques such as branch prediction and instruction caching to speed up the instruction fetch stage, reduce pipeline stalls, and improve processor throughput [7]. In this architecture, the PC holds the address of the instruction to be executed; that instruction is fetched from the program memory and stored in the instruction register (IR), where it waits to be decoded in the next stage.

The instruction decoding stage is an important stage in the RISC-V architecture. Because of the streamlined design of the instruction set [10], the instruction formats are regular, so the decoding stage needs few hardware resources and executes quickly. At this stage, the processor decodes the instruction stored in the IR to extract the operation code (OPCode) and the related fields of the instruction, generates control signals, and reads values from the register file (Reg_file) [11] for the subsequent execution stage. The instruction decode/execution (ID/EX) register stores the decoded instruction by category, such as the register values that were read and the immediate values, until it enters the instruction execution stage.
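As an illustration of why the regular instruction formats keep decoding cheap, the following Python sketch splits a 32-bit R-type instruction word into its fields using the fixed bit positions defined by the RISC-V base instruction format. It is a software illustration only, not the decoder of this MCU; the function name `decode_r_type` is hypothetical.

```python
def decode_r_type(word: int) -> dict:
    """Split a 32-bit RISC-V R-type instruction word into its fields.

    Every field sits at a fixed bit position, so decoding is just
    shifting and masking (in hardware: wiring).
    """
    return {
        "opcode": word & 0x7F,          # bits [6:0]
        "rd": (word >> 7) & 0x1F,       # bits [11:7], destination register
        "funct3": (word >> 12) & 0x7,   # bits [14:12]
        "rs1": (word >> 15) & 0x1F,     # bits [19:15], first source register
        "rs2": (word >> 20) & 0x1F,     # bits [24:20], second source register
        "funct7": (word >> 25) & 0x7F,  # bits [31:25]
    }


# 0x002081B3 is the encoding of "add x3, x1, x2".
fields = decode_r_type(0x002081B3)
# 0x022081B3 is the encoding of "mul x3, x1, x2" (M extension: funct7 = 1).
mul_fields = decode_r_type(0x022081B3)
```

In RV32IM, register-register ALU instructions and the multiply/divide instructions share the opcode 0x33 and differ in `funct7`, which is the kind of information a dispatcher needs in order to route an instruction to an ALU, MUL, or DIV unit.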

The instruction execution stage is the final stage of the pipeline. It contains the computing units that operate on the data produced by the instruction decoding stage according to the corresponding OPCodes. The operations include addition, subtraction, multiplication, and division; data access; and other actions. The number of ALUs is increased by modifying the instruction execution stage, combined with a dispatch algorithm for multiple execution units. By combining round-robin and event-driven scheduling, the appropriate instructions, and the appropriate number of them, are allocated to the corresponding execution units, which enables the MCU to execute correctly and faster.

3. Dispatch Algorithm

The four-thread RISC-V MCU with one ALU adopts a configuration of multiple threads and a single execution unit. Each execution requires selecting one of the threads, and the algorithm used is the EHE algorithm. This is a combination of round-robin and event-driven triggering. In general, round-robin is used for dispatch, but when there are jump or branch instructions in the thread, it is executed in event-driven mode. This method was verified in Ref. [6] and was proven to be more efficient than using only round-robin algorithms or event-driven algorithms.

The architecture in this study is the four-thread RISC-V MCU with N ALUs, in which N = 1, 2, 3, or 4. A dispatch algorithm that suits the new architecture was developed, and a maximum of N instructions is dispatched simultaneously. First, the architecture fetches instructions from the program memory into the ID/EX register in advance, a process called prefetch, which requires two to four clock cycles before an instruction can be dispatched. The dispatch algorithm uses the round-robin method to initialize the weights of the four threads. The higher the weight value is, the higher the priority is.

3.1. Algorithm Steps

Step 1. The round-robin method is used to assign initial weights of 4, 2, 1, and 0 to the four threads; the higher the weight, the higher the priority. In the first round, the weights of threads 1 to 4 are 4, 2, 1, and 0; in the second round, 0, 4, 2, and 1; in the third round, 1, 0, 4, and 2; and in the fourth round, 2, 1, 0, and 4. The weights are re-initialized in this rotating order so that every thread receives a fair chance of execution.
Step 2. The instructions waiting in the four threads are checked for a DIV instruction or for a jump or branch instruction that causes a control hazard. For a thread holding such an instruction, the event-driven mode is used: 4 is added to the weight of the thread to raise its priority. For a DIV instruction, this increases the utilization of an idle DIV unit and brings forward the completion time of the instruction. For a jump or branch instruction, it allows more time to fetch the next instruction: the prefetch is started first, and a countdown counter records the number of clock cycles that the thread needs to prefetch the next instruction.
Step 3. Up to N instructions are dispatched at the same time. There is only one MUL unit, one DIV unit, and one CSR unit, so at most one instruction of each of these three types is dispatched per cycle, and the rest must be ALU-type instructions. In addition, the DIV unit takes three clock cycles to complete an execution, so a countdown counter records the remaining clock cycles, whereas the other execution units need only one clock cycle.
Step 4. At this point, the dispatch is completed, and step 1 is repeated to update the initial weights. However, if the prefetch countdown counter from step 2 has not reached 0, the instruction prefetch of that thread is not yet complete. The weight of that thread is therefore set to 0, priority is given to threads whose instructions have been fetched but not yet executed, and steps 2, 3, and 4 are repeated to improve hardware resource utilization.
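The rotation of the initial weights described in the first step can be sketched in a few lines of Python. This is a minimal illustration, assuming that the weights are simply rotated by one thread position per round; the function name `initial_weights` is hypothetical and does not come from the MCU implementation.

```python
from collections import deque


def initial_weights(round_index: int, base=(4, 2, 1, 0)) -> list:
    """Return the initial weights of threads 1..4 for a given round.

    Round 0 uses the base weights; each later round rotates them
    one position to the right, wrapping around after four rounds.
    """
    weights = deque(base)
    weights.rotate(round_index % len(base))
    return list(weights)


# Initial weights for five consecutive rounds (the fifth repeats the first).
rounds = [initial_weights(r) for r in range(5)]
```

Over any four consecutive rounds, each thread holds each of the weights 4, 2, 1, and 0 exactly once, which is the fairness property the round-robin part of the algorithm relies on.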

3.2. Pseudo Code of Dispatch Algorithm

The pseudo code in Algorithm 1 corresponds to the above algorithm steps and demonstrates how the dispatch algorithm operates.

Algorithm 1. Dispatcher for 4-thread RISC-V MCU with N ALUs

Input: N = 1, 2, 3, or 4
       prefetch = 2, 3, or 4 // number of clock cycles needed for prefetch
// Initializing weights
thread_weight[1] ← 4
thread_weight[2] ← 2
thread_weight[3] ← 1
thread_weight[4] ← 0
DIV_used_cnter ← 0
// Dispatching
while !four_threads_all_empty do
    // Free the single-cycle execution units
    for i = 1 to N do
        ALU_used[i] ← false
    end for
    MUL_used ← false
    CSR_used ← false
    if DIV_used_cnter = 0 then // Executing division needs 3 clocks
        DIV_used ← false
    else
        DIV_used_cnter ← DIV_used_cnter − 1
    end if
    // Count down the prefetch of every waiting thread
    for num = 1 to 4 do
        if thread_wait_cnter[num] ≠ 0 then
            thread_wait_cnter[num] ← thread_wait_cnter[num] − 1
        end if
    end for
    for num = 1 to 4 do
        // Event-driven
        if thread_instruction[num] = Branch_instruction then
            thread_weight[num] ← thread_weight[num] + 4
            thread_wait_cnter[num] ← prefetch
        end if
    end for
    for i = 1 to N do
        // Select the thread with the highest weight
        num ← argmax(thread_weight[1‥4])
        thread_weight[num] ← 0
        if thread_instruction[num] = MUL_instruction and MUL_used = false then
            MUL ← thread_instruction[num]
            MUL_used ← true
        else if thread_instruction[num] = DIV_instruction and DIV_used = false then
            DIV ← thread_instruction[num]
            DIV_used ← true
            // Executing division needs 3 clocks
            DIV_used_cnter ← 3
        else if thread_instruction[num] = CSR_instruction and CSR_used = false then
            CSR ← thread_instruction[num]
            CSR_used ← true
        else if thread_instruction[num] = ALU_instruction then
            // Use the first free ALU only
            for j = 1 to N do
                if ALU_used[j] = false then
                    ALU[j] ← thread_instruction[num]
                    ALU_used[j] ← true
                    break
                end if
            end for
        end if
    end for
    // Round-robin: rotate the weights by one thread
    tmp ← thread_weight[4]
    thread_weight[4] ← thread_weight[3]
    thread_weight[3] ← thread_weight[2]
    thread_weight[2] ← thread_weight[1]
    thread_weight[1] ← tmp
    // Checking whether prefetch is done
    for num = 1 to 4 do
        if thread_wait_cnter[num] ≠ 0 then
            thread_weight[num] ← 0
        end if
    end for
end while

3.3. Illustrations

For the four-thread RISC-V MCU with two ALUs, a maximum of two instructions are dispatched simultaneously. In the following examples, A represents general instructions, B represents jump and branch instructions, M represents multiplication instructions, D represents division instructions, and C represents CSR (control and status register) instructions.

As shown in Figure 2a, during the first dispatch, the initial weight of thread_1 is 4, thread_2 is 2, thread_3 is 1, and thread_4 is 0. No B instruction is detected, and the two highest priority instructions are both A instructions. Therefore, both thread_1 and thread_2 can be dispatched simultaneously.

As shown in Figure 2b, in the second dispatch, the initial weight of thread_1 is 0, thread_2 is 4, thread_3 is 2, and thread_4 is 1. However, the instruction in thread_1 is a B instruction, so the weight of that thread is increased by 4, from 0 to 4. No M, D, or C instructions are involved. Therefore, the two highest-weighted threads, thread_1 and thread_2, are dispatched. In the third dispatch, the initial weight of thread_1 is 1, thread_2 is 0, thread_3 is 4, and thread_4 is 2 (Figure 2c). No B instruction is detected, but the two highest-weighted threads, thread_3 and thread_4, both hold M instructions. Because there is only one MUL unit, only the thread with the highest weight is selected, so thread_3 is dispatched.
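The three dispatches above can be reproduced with a small Python sketch of a single dispatch cycle. It is a simplified model under stated assumptions, not the hardware implementation: the function name `dispatch` is hypothetical, the instruction classes of threads that the text does not describe are assumed to be A, the weight boost is applied to both B and D instructions as described in the algorithm steps, and the multi-cycle DIV unit and the prefetch countdown are not modeled.

```python
# Instruction classes that need a unit of which only one exists.
SINGLE_UNITS = {"M": "MUL", "D": "DIV", "C": "CSR"}


def dispatch(weights, instrs, n_alus):
    """Return the 1-based thread numbers dispatched in one cycle.

    weights: initial round-robin weights of the threads
    instrs:  instruction class per thread ('A', 'B', 'M', 'D', or 'C')
    n_alus:  N, the maximum number of instructions dispatched per cycle
    """
    w = list(weights)
    # Event-driven part: B and D instructions raise the thread weight by 4.
    for t, kind in enumerate(instrs):
        if kind in ("B", "D"):
            w[t] += 4
    # Consider the N highest-weighted threads; ties go to the lower thread.
    order = sorted(range(len(w)), key=lambda t: (-w[t], t))
    busy = set()
    issued = []
    for t in order[:n_alus]:
        unit = SINGLE_UNITS.get(instrs[t])
        if unit is None:
            issued.append(t + 1)      # A and B instructions use an ALU
        elif unit not in busy:
            busy.add(unit)            # MUL, DIV, CSR: at most one per cycle
            issued.append(t + 1)
    return issued


first = dispatch([4, 2, 1, 0], "AAAA", 2)   # Figure 2a
second = dispatch([0, 4, 2, 1], "BAAA", 2)  # Figure 2b
third = dispatch([1, 0, 4, 2], "AAMM", 2)   # Figure 2c
```

Under these assumptions, `first` and `second` both select thread_1 and thread_2, and `third` selects only thread_3, because the second M instruction finds the single MUL unit already taken.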

4. Architecture Validation and Analysis

4.1. FPGA Verification

After the algorithm and architecture design were completed, the design was implemented in the SystemVerilog hardware description language. The Segger Embedded Studio integrated development environment (IDE) V8.10a was used to generate RISC-V machine code for testing whether the core architecture of the MCU performs the designed functions. After the register-transfer-level (RTL) simulation using ModelSim 10.5b, the design was programmed into an Altera DE10 FPGA, and further validation was performed using the SignalTap logic analyzer. Figure 3 shows the simulated Coremark waveform, and Figure 4 shows the Coremark test waveform.

4.2. Architecture Analysis and Comparison

We implemented the four-thread RISC-V MCU with N ALUs using the 180 nm process of Taiwan Semiconductor Manufacturing Company (TSMC). Three key points of chip design were used as the benchmark for the analysis results, namely power consumption, performance, and area.

In the static power consumption analysis, the one-ALU RISC-V architecture was used as the baseline for calculating the relative power consumption of the architectures with more ALUs. As shown in Table 2, the static power consumption of the two-ALU RISC-V architecture was only 5.1% higher than that of the one-ALU RISC-V architecture, that of the three-ALU architecture was 9.2% higher, and that of the four-ALU architecture was 13.3% higher, an increase of less than 31 mW in every case.

In the performance analysis, separate analyses were conducted for prefetch values of two, three, and four, where the prefetch value is the number of clock cycles required to prefetch an instruction into the register. The resulting IPC values are shown in Table 3. The architecture in Ref. [6] was set as the baseline for calculating the performance multiples of the different ALU architectures, as shown in Table 4. In the best case, with a prefetch of two, the two-ALU RISC-V architecture improved the IPC over the one-ALU RISC-V architecture by 94.8%, the three-ALU architecture improved it by 162.3%, and the four-ALU architecture improved it by 212.8%. In the worst case, with a prefetch of four, the two-ALU RISC-V architecture improved the IPC by 76%, the three-ALU architecture improved it by 128.9%, and the four-ALU architecture improved it by 154.3%. Increasing the number of ALUs effectively improved the efficiency.
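The performance multiples in Table 4 follow from dividing each IPC value in Table 3 by the one-ALU IPC at the same prefetch value. The short Python sketch below repeats this arithmetic on the rounded values printed in Table 3; the variable and function names are illustrative. Because the inputs are rounded, the results agree with Table 4 only to within about 0.001.

```python
# IPC values from Table 3: prefetch -> [1 ALU, 2 ALUs, 3 ALUs, 4 ALUs]
ipc = {
    2: [0.991, 1.931, 2.6, 3.1],
    3: [0.989, 1.826, 2.41, 2.7],
    4: [0.983, 1.731, 2.25, 2.5],
}


def speedups(row):
    """Normalize a row of IPC values to the one-ALU baseline."""
    baseline = row[0]
    return [value / baseline for value in row]


def improvement_percent(row):
    """IPC improvement over the one-ALU baseline, in percent."""
    return [(s - 1.0) * 100.0 for s in speedups(row)]


table4 = {prefetch: speedups(row) for prefetch, row in ipc.items()}
worst_case = improvement_percent(ipc[4])  # prefetch of four clock cycles
```

For example, at a prefetch of four, the two-ALU entry is 1.731 / 0.983, which is approximately 1.76 and corresponds to the 76% improvement quoted for the worst case.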

Table 5 lists the cell area of each architecture, again with the architecture of Ref. [6] as the baseline for comparison. The area of the two-ALU RISC-V architecture increased by only 12%, that of the three-ALU architecture increased by 22.3%, and that of the four-ALU architecture increased by 32.6%.

Finally, the IPC increase per unit of added area was calculated for prefetch values of two, three, and four (Table 6). The architecture proposed in this study effectively improved the IPC, while the increase in area was not significant. When efficiency and area are considered together, the architecture containing two ALUs was the best regardless of the prefetch value; when only performance is considered, the architecture containing four ALUs was the best.
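The text does not state the formula behind Table 6. The following Python sketch assumes that each entry is the IPC gain over the one-ALU architecture (Table 3) divided by the added cell area in μm² (Table 5), scaled by 10¹². This assumption reproduces the tabulated values, but it remains an inference from the published numbers; all names in the code are illustrative.

```python
# Cell areas from Table 5 in square micrometres, for 1, 2, 3, and 4 ALUs.
area = [926_442, 1_037_361, 1_133_061, 1_228_539]

# IPC values from Table 3: prefetch -> [1 ALU, 2 ALUs, 3 ALUs, 4 ALUs]
ipc = {
    2: [0.991, 1.931, 2.6, 3.1],
    3: [0.989, 1.826, 2.41, 2.7],
    4: [0.983, 1.731, 2.25, 2.5],
}


def area_increase_percent(n_alus):
    """Area overhead of the n-ALU design relative to the one-ALU design."""
    return (area[n_alus - 1] / area[0] - 1.0) * 100.0


def gain_per_area(prefetch, n_alus):
    """Assumed Table 6 metric: IPC gain per added square micrometre."""
    row = ipc[prefetch]
    ipc_gain = row[n_alus - 1] - row[0]
    added_area = area[n_alus - 1] - area[0]
    return ipc_gain / added_area


# Scale by 1e12 to compare with the integers printed in Table 6.
table6 = {p: [gain_per_area(p, n) * 1e12 for n in (2, 3, 4)] for p in ipc}
best = {p: max((2, 3, 4), key=lambda n: gain_per_area(p, n)) for p in ipc}
```

Under this reading, the two-ALU design gives the largest IPC gain per added μm² at every prefetch value, which matches the conclusion drawn from Table 6.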

4.3. Chip Implementation

We adopted the educational chip program provided by the Taiwan Semiconductor Research Institute (TSRI) to implement this architecture, using TSMC's T18 process. Because it is an educational chip, the chip area is limited to 1190 μm × 1190 μm, which is not enough for an MCU with four threads. Therefore, the design was reduced to a version with two threads and two ALUs, and the cell-based design flow was performed using the EDA cloud provided by the TSRI. The chip layout is shown in Figure 5, and the chip specifications are shown in Table 7.

5. Conclusions

We developed an architecture to improve the performance of a four-thread RISC-V MCU. By increasing the number of ALUs and using the dispatch algorithm, the performance was improved within a limited area using the concepts of ILP and TLP. Compared to the original architecture, the IPCs of the two-, three-, and four-ALU architectures increased by 76, 128.9, and 154.3%, the area increased by 12, 22.3, and 32.6%, and the static power consumption increased by 5.1, 9.2, and 13.3%, respectively, achieving a significant improvement in performance with only a slight increase in area. In verification, the four-thread RISC-V MCU with N ALUs that was designed in this study operated correctly at a speed of 133 MHz through FPGA and cell-based design flow testing.

Further modifications to the algorithm are needed to find more appropriate dispatch methods for multiple threads. In addition, the RISC-V instruction set is continuously being extended and reserves space for custom instructions, so a vector extension or custom instructions can be added to increase the vector operation capability. Vector operations, convolutions, and other operations that apply the same instruction to a large amount of data increase data-level parallelism (DLP) [7], so that ILP, TLP, and DLP are all achieved. MCUs can then be upgraded for a broader range of applications and optimized for IoT and artificial intelligence.

Author Contributions

Conceptualization, M.-H.Y., Y.-H.L. and T.-F.L.; methodology, M.-H.Y., Y.-H.L. and T.-F.L.; software, T.-F.L.; validation, T.-F.L.; formal analysis, M.-H.Y., Y.-H.L. and T.-F.L.; resources, M.-H.Y. and Y.-H.L.; data curation, T.-F.L.; writing—original draft preparation, T.-F.L. and Y.-H.C.; writing—review and editing, M.-H.Y., Y.-H.L. and Y.-H.C.; visualization, Y.-H.C.; supervision, M.-H.Y., Y.-H.L., Y.-F.K. and C.-T.K.; project administration, T.-F.L.; funding acquisition, Y.-F.K. and C.-T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Bureau of Standards, Metrology, and Inspection (BSMI), Taiwan, under Grant Number 1D161121218-117.

Data Availability Statement

The data from this study can be obtained upon request from the corresponding author. They are not publicly accessible due to privacy concerns.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Waterman, A.; Lee, Y.; Avizienis, R.; Patterson, D.A.; Asanovic, K. The RISC-V Instruction Set Manual Volume II: Privileged Architecture Version 1.9; University of California: Berkeley, CA, USA, 2016.
2. Lim, D.X.; Smitha, K.G. Pipelined MIPS Simulation: A plug-in to MARS simulator for supporting pipeline simulation and branch prediction. In Proceedings of the 2019 IEEE International Conference on Engineering, Technology and Education, Yogyakarta, Indonesia, 10–13 December 2019; pp. 1–7.
3. Curran, B.W.; Jacobi, C.; Bonanno, J.J.; Schroter, D.A.; Alexander, K.J.; Puranik, A.; Helms, M.M. The IBM z13 multithreaded microprocessor. IBM J. Res. Dev. 2015, 59, 1–13.
4. Mendoza Escobar, J. A Multithreading RISC-V Implementation for Lagarto Architecture; Universitat Politècnica de Catalunya: Barcelona, Spain, 2020.
5. Das, A.; Jose, J.; Mishra, P. Data criticality in multithreaded applications: An insight for many-core systems. IEEE Trans. Very Large Scale Integr. Syst. 2021, 29, 1675–1679.
6. Eni, Y.; Greenberg, S.; Ben-Shimol, Y. Efficient Hint-Based Event (EHE) Issue Scheduling for Hardware Multithreaded RISC-V Pipeline. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 69, 735–745.
7. Nojiri, Y.; Yamasaki, N. A Design of Multithreaded RISC-V Processor for Real-Time System. In Proceedings of the 2023 Eleventh International Symposium on Computing and Networking Workshops (CANDARW), Matsue, Japan, 27–30 November 2023.
8. Cheikh, A.; Cerutti, G.; Mastrandrea, A.; Menichelli, F.; Olivieri, M. The Microarchitecture of a Multi-threaded RISC-V Compliant Processing Core Family for IoT End-Nodes. In ApplePies 2017. Lecture Notes in Electrical Engineering; Springer: Berlin/Heidelberg, Germany, 2019; Volume 512.
9. Collange, S. Simty: Generalized SIMT execution on RISC-V. In Proceedings of the CARRV 2017-1st Workshop on Computer Architecture Research with RISC-V, Boston, MA, USA, 14 October 2017.
10. Hennessy, J.L.; Patterson, D.A. Computer Architecture: A Quantitative Approach; Elsevier: Amsterdam, The Netherlands, 2011.
11. Lai, J.-Y.; Chen, C.-A.; Chen, S.-L.; Su, C.-Y. Implement 32-bit RISC-V Architecture Processor using Verilog HDL. In Proceedings of the 2021 International Symposium on Intelligent Signal Processing and Communication Systems, Hualien City, Taiwan, 16–19 November 2021.

Figure 1. Architecture diagram of 4-thread RISC-V MCU with N ALUs, N = 1, 2, 3, or 4.

Figure 2. Dispatch algorithm examples, where (a) is the first dispatch, (b) is the second dispatch, and (c) is the third dispatch. The instruction marked with a cross has been executed, while the instruction enclosed in a circle is about to be executed.

Figure 3. Coremark simulated waveform.

Figure 4. Coremark test waveform.

Figure 5. Layout diagram.

Table 1. Analysis of instruction types for benchmarks.

	Coremark	SHA	Dijkstra
ALU	94.3%	96.3%	95.7%
CSR	0.2%	0.5%	0.2%
MUL/DIV	1.7%	0%	0.5%
Others	3.8%	3.2%	3.6%

Table 2. Static power consumption analysis of 4-thread RISC-V with N ALU architecture: N = 1, 2, 3, or 4.

	1 ALU [6]	2 ALUs	3 ALUs	4 ALUs
Static power (mW)	230.37	242.27	251.55	261.06
Power consumption comparison	100%	105.1%	109.2%	113.3%

Table 3. Performance analysis of 4-thread RISC-V with N ALU architecture: N = 1, 2, 3, or 4.

Prefetch	1 ALU [6]	2 ALUs	3 ALUs	4 ALUs
2	0.991	1.931	2.6	3.1
3	0.989	1.826	2.41	2.7
4	0.983	1.731	2.25	2.5

Table 4. Performance comparison of 4-thread RISC-V with N ALU architecture: N = 1, 2, 3, or 4.

Prefetch	1 ALU [6]	2 ALUs	3 ALUs	4 ALUs
2	1	1.948	2.623	3.128
3	1	1.846	2.437	2.73
4	1	1.76	2.289	2.543

Table 5. Comparison of cell areas for 4-thread RISC-V with N ALU architecture: N = 1, 2, 3, or 4.

	1 ALU [6]	2 ALUs	3 ALUs	4 ALUs
Area (μm²)	926,442	1,037,361	1,133,061	1,228,539

Table 6. IPC improvement per unit of added area relative to the one-ALU architecture (×10⁻¹² IPC/μm²).

Prefetch	2 ALUs	3 ALUs	4 ALUs
2	8,474,653	7,787,280	6,981,201
3	7,546,047	6,877,392	5,663,743
4	6,743,660	6,132,059	5,021,566

Table 7. Chip Specification.

Item	Specification
Process	TSMC 180 nm
Area	1190 μm × 1190 μm
Pin	40 pin
Core Power Pads	4 sets
Pad Power Pads	4 sets
Core voltage	1.8 V
Pad voltage	3.3 V
Power consumption	13.3 mW
Frequency	133 MHz


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
