1. Introduction
In modern power systems, the ability to assess and monitor power quality in real time has become a vital requirement as networks evolve into complex, digital smart grids. The IEC 61850 standard [
1] plays a key role in enabling high-speed, standardized communication between intelligent electronic devices (IEDs) in substations. In particular, the IEC 61850-9-2 Sampled Measured Values (SMV) protocol allows fast and accurate transmission of analogue measurements in digital form, providing the foundation for real-time monitoring of voltage and current waveforms. Because the standard offers concrete solutions for data transmission, encoding, and decoding, smart grids are becoming increasingly efficient at monitoring and controlling energy quality in line with consumer-driven requirements. Within these networks, intelligent electronic devices (IEDs) located in electrical substations are the first to receive digitalized measurement data, encapsulated in SMV packets sent by merging units or sensors. These IEDs act as SMV subscribers and serve as the entry point for quality monitoring in the broader smart grid hierarchy. To enable real-time control and protection of electrical equipment, critical delay thresholds must be respected throughout the measured data processing and organization process, making low-latency solutions essential for reliable operation.
Most existing SMV subscriber solutions focus primarily on the timely decoding of the incoming sampled data, but are realized in software running on CPU platforms. This software-centric approach offers advantages such as ease of development, flexibility for algorithm updates, and simplified debugging. However, there are significant drawbacks in time-critical and high-throughput scenarios: software processing introduces additional latency, incurs high memory and bus bandwidth consumption, and often results in reduced timing determinism. These issues are exacerbated as the number of subscribed data streams grows, since a general-purpose processor may struggle to sustain hundreds of Ethernet frames per cycle with consistent low latency. The lack of hardware acceleration for SMV handling and PQ computation in many IEDs thus represents a notable research gap. Ensuring real-time performance with increasing data rates and stream counts remains challenging, prompting an investigation into dedicated hardware-based approaches that can offload processing from CPUs.
This paper builds on the foundation established in prior work, where an FPGA-based SMV subscriber subsystem was developed to achieve low-latency, deterministic decoding of sampled value streams. In that prototype (termed HS3), the entire SMV parsing and distribution pipeline was implemented in programmable logic on a Zynq-7020 system-on-chip, yielding decoding latencies under 3 μs for configurations up to 512 subscribed SV streams while using under 8% of the available FPGA resources [
2]. This demonstrated that parallel hardware processing can deliver orders-of-magnitude faster and more predictable performance than traditional software parsing [
3]. Despite such progress, the pool of fully integrated IEDs leveraging FPGA-based acceleration remains very limited. For example, National Instruments’ CompactRIO platform with the NI-Industrial Communications for IEC 61850 Toolkit [
4] supports SMV, GOOSE, and MMS message handling within a LabVIEW environment. Still, its power quality calculations are executed on the CPU, outside the FPGA subscriber pipeline. This separation adds processing delay and reduces determinism for real-time PQ monitoring. One notable exception is the SoC-e RELY SV-PCIe card [
5], an FPGA-based network interface that, when paired with SoC-e’s dedicated SMV IP core, functions as a high-performance SMV subscriber. This platform supports up to 256 concurrent SV streams and includes built-in fundamental frequency and RMS measurements for each stream, integrating specific power quality analytics directly in hardware. Its reported sub-7 μs SMV decoding latency (including the basic PQ computations) represents one of the most competitive benchmarks currently available and serves as a helpful reference point for our work. However, direct comparison between such commercial solutions and research prototypes is difficult, as we are comparing production-hardened devices with a purpose-built experimental subsystem.
Table 1 summarizes the main features of these existing solutions versus the proposed system, highlighting the key performance metrics and architectural differences.
The limited adoption of in-hardware SMV processing and analysis in deployed IEDs underscores several essential gaps. Firstly, there is a lack of open, flexible architectures that integrate power quality metric computations into the SMV subscriber at the FPGA level. As noted, most available toolkits either leave PQ computations to software [
6] or provide only fixed-function firmware solutions [
7]. This leaves system designers with suboptimal choices between flexibility and real-time performance. Secondly, supporting a high number of simultaneous SV streams with low latency remains challenging; few works have demonstrated scaling beyond a handful of streams while maintaining microsecond-level processing times. The need to handle many streams in parallel (for multi-bay or substation-wide monitoring) drives up resource usage and demands careful hardware design to avoid bottlenecks. Thirdly, ensuring the accuracy of the computed PQ parameters under varying signal conditions is a non-trivial challenge when implementing algorithms in fixed-point FPGA logic. Power system signals can be distorted (harmonics, noise, etc.), so the hardware algorithms must be robust to non-sinusoidal conditions and adhere to accepted measurement standards. Prior research has shown that it is feasible to meet strict accuracy requirements in FPGA implementations—for instance, ref. [
6] achieved full compliance with IEC 61000-4-30 Class A using parallel custom processors in an FPGA-based PQ analyzer—but this often requires significant design effort and optimization. Lastly, there is a need for real-time integration of these metrics into the control loop. Even if raw measurements are delivered quickly, any delay in producing actionable metrics (frequency, RMS, etc.) could hinder fast control responses. The challenge is to compute these metrics continuously in streaming fashion without adding more than a few microseconds to the overall data pipeline latency.
Several recent research efforts have begun to address aspects of these challenges by developing FPGA-based systems for SMV subscribing and power quality analysis. In [
5], an analyzer for power quality applications that directly uses IEC 61850-9-2 SV frames as input is depicted. Their system can characterize instrument transformer behavior and PQ parameters from the received SV data, demonstrating the viability of on-site PQ assessment using digital substation measurements. Similarly, ref. [
6] investigates the feasibility of power quality meters based on digital inputs, i.e., using only sampled value streams from non-conventional instrument transformers. They present a prototype SV-based PQ meter focused on voltage dip detection, with a combined hardware/software architecture that was validated against a commercial PQ analyzer. The results in [
7] highlight the potential of SV-driven PQ monitoring and also underscore the importance of SV stream integrity (sampling rates, packet losses, etc.) on measurement accuracy. Beyond the substation context, FPGA technology has been applied to general power quality monitoring with an emphasis on real-time performance. In [
8], an FPGA-based online PQ monitoring system for distribution networks that computes standard PQ indices (e.g., THD, voltage magnitude, etc.) on-chip and transmits the results over the network in real time is introduced. Their design embeds signal processing algorithms as hardware functions in the FPGA, achieving continuous sample-by-sample evaluation of PQ parameters in compliance with IEC standards [
3]. The measured data are sent via a UDP/IP stack implemented in the same FPGA, enabling wide-area monitoring with minimal latency added by external processing. In another notable work, ref. [
5], a comprehensive PQ calculation engine using multiple soft-core processors inside an FPGA is presented. Each processor core was tailored to compute specific PQ metrics (voltage, current, frequency, harmonics, etc.) following the IEC 61000-4-30 Class A methodology, and ran in parallel to cover all required parameters. The prototype by Luiz et al. was shown to calculate all Class A PQ parameters within the FPGA and met the accuracy and response time requirements of the standard, while optimizing logic and memory utilization through customizable hardware processors [
4]. These studies [
5,
6,
7,
8] collectively indicate a clear trend towards integrating power quality analysis with digital substation data streams and exploiting FPGA acceleration for speed and determinism. However, each also has limitations: for instance, refs. [
6,
7] focus on specific metrics or use cases (like instrument diagnostics or voltage dips) rather than a broad set of PQ indices, ref. [
8] implements PQ analysis on FPGA but does not directly interface with the IEC 61850 process bus (instead using local ADC measurements), and [
8] demonstrates full PQ compliance in hardware but not in the context of subscribing to external SV streams. This leaves room for further innovation in combining these aspects into one system.
In light of the above challenges and gaps, the current research extends the earlier HS3 subscriber architecture to incorporate real-time power quality calculations as an integral part of the SMV processing pipeline. The goal is to enable an FPGA-based SMV subscriber not only to decode incoming measurement streams with low latency, but also to immediately evaluate key power quality indicators from those streams in hardware before handing off data to any higher-level applications. By integrating these functions directly into the FPGA logic, the need for separate processing units or software-based post-processing is significantly reduced, resulting in faster and more efficient system operation. Concretely, this work implements dedicated RTL modules for calculating fundamental frequency, true RMS, and active power for each subscribed channel in parallel with the decoding process. The frequency estimation leverages a single-bin DFT technique with a configurable reference window, providing a continuous measure of system frequency deviation and stability. True RMS values are computed through the accumulation of the squared samples over a moving window, which allows constant monitoring of signal magnitude while maintaining accuracy under non-sinusoidal conditions. In parallel, intermediate values such as instantaneous active power and average rectified voltage/current are calculated and time-synchronized per cycle. These intermediate hardware results are organized and made accessible to a lightweight software co-processor (running on the MPSoC’s ARM CPU), which can perform any remaining scalar operations with negligible overhead. The division of labor ensures that all heavy repetitive computations (requiring per-sample processing or data accumulation) are handled in the FPGA fabric. In contrast, the processor only needs to combine the already-aggregated results. This approach drastically lowers the CPU load compared to a traditional solution and minimizes the latency between measurement reception and availability of higher-level PQ metrics.
A modular design approach is adopted to extend the original SMV decoder with the PQ calculation blocks, carefully managing hardware resources to meet performance targets. Each analytical function (Frequency Estimator, RMS calculator, etc.) is implemented as an independent module that interfaces with the SMV stream decoder via well-defined data conduits. This modularity preserves the throughput of the decoding pipeline, as new functions can be enabled or disabled per configuration without altering the critical path. FPGA resource utilization and timing have been optimized so that the enhancements do not compromise the 5 μs end-to-end latency goal or the ability to handle a large number of streams simultaneously. The complete subsystem is implemented and tested on a Xilinx Zynq 7000 MPSoC platform, leveraging the parallelism and deterministic execution of FPGA logic alongside embedded processing. Experiments demonstrate that the hardware-accelerated subscriber can reliably decode SV packets and compute real-time frequency and RMS values for each channel within a few microseconds, even under high throughput conditions. Preliminary evaluations also indicate that the hardware results closely match offline calculations (e.g., Octave/MATLAB) for a variety of test waveforms, confirming the accuracy of the approach.
This paper is structured as follows:
Section 2 reviews related work in more detail and recaps the previous SMV decoding architecture that our design builds upon.
Section 3 describes the principles of the power quality algorithms (frequency estimation, RMS, active power) and outlines their straightforward implementation in hardware.
Section 4 presents the detailed design of the integrated subscriber, highlighting the interaction between components, data flow, and resource utilization. This section also includes a discussion of processing latency for the augmented pipeline and an analysis of how the added functions impact system performance.
Section 5 reports evaluation results, including the accuracy of the hardware computations, error analysis against reference computations, and a comparison between the proposed system’s outputs and those of a conventional approach. Finally,
Section 6 offers conclusions and discusses directions for future work, emphasizing the potential of embedded power quality monitoring in FPGA-based smart grid IEDs.
3. Basic Energy Quality Parameters
In electrical engineering, especially in the context of power systems and energy quality monitoring, basic energy quality parameters are the fundamental metrics that describe the condition and efficiency of voltage and current waveforms. These parameters are crucial for identifying issues such as inefficiencies, equipment stress, and power delivery problems. Particularly within the domain of power systems and substation automation, basic energy quality parameters refer to a set of core electrical measurements that describe the stability and efficiency of voltage and current waveforms over time. These parameters include RMS values, frequency, active and reactive power, power factor, and waveform shape descriptors such as crest and form factors. They are essential for assessing whether power delivery meets operational and regulatory standards, and are critical for real-time monitoring, protection, and control functions in smart grids. Monitoring these parameters enables early detection of abnormalities such as voltage sags, frequency deviations, harmonic distortion, and phase unbalance, all of which can lead to equipment malfunctions, reduced energy efficiency, or even system-wide disturbances. Accurate and timely calculation of these metrics is therefore vital for ensuring power system reliability and optimizing asset performance in increasingly dynamic electrical networks. According to the IEEE Standard 1159-2019 [
9], these parameters form the cornerstone of effective power quality assessment, providing both utilities and industrial users with actionable insights into the performance and integrity of their electrical infrastructure.
Table 2 presents the core energy quality metrics that have been incorporated into the system architecture and shows the distinction between the metrics handled in hardware and those that are suitable for software processing. Parameters which are not bolded are computed entirely within a dedicated FPGA module named Energy Quality Accumulator (EQA) Engine, illustrated in
Figure 1. This hardware block is tightly integrated into the SMV decoding pipeline, with direct access to the filtered and validated data stream. By performing these calculations at the hardware level, the system achieves real-time response with minimal latency and avoids the performance penalties typically associated with repeated memory access in CPU-based architectures.
The parameters marked with bold symbols represent values that are either derived or refined in software, based on intermediate results provided by the hardware. This hybrid processing approach balances the computational load between the programmable logic and software, optimizing resource utilization while maintaining flexibility for future expansion or algorithm refinement.
A common requirement shared by most energy quality parameters and intermediate values—such as RMS, average, and active power—is the need to process a defined number of consecutive voltage and current samples. These calculations often involve simple mathematical operations such as accumulation, squaring, or averaging over a fixed window. In conventional CPU-based systems, such operations typically rely on repeated memory transactions, introducing latency and consuming valuable bandwidth, especially under high-throughput conditions. The in-development SMV Subscriber prototype offers a strategic advantage by enabling the direct integration of these computations into the data processing pipeline. By calculating key energy quality metrics in hardware immediately after decoding the last sample from SMV data, the system can deliver pre-processed values without additional memory access or CPU intervention. This significantly reduces data movement overhead, minimizes processing delays, and ensures that performance remains deterministic, which is an essential requirement for time-sensitive substation automation and protection applications.
Figure 2 illustrates a simplified version of the EQA Engine block and its contents. The main component of this major block is represented by the SMV Accumulator, which has the role of organizing the intermediate data required for parameter calculation and sending it to the DMA module. Each data stream to which the prototype subscribes has a dedicated memory space in this system’s dynamic memory, used to store only the parameters and intermediate values obtained using data accumulation. We will refer to this memory space as stream-accumulated metadata (SAM). The SAM contents will be presented at the end of this section, after the data storage requirement for each algorithm is showcased.
An initial implementation of the frequency estimation algorithm based on zero-crossing detection was analyzed as a potential solution due to its conceptual simplicity and minimal hardware requirements. However, the resulting frequency precision proved to be insufficient for the accuracy standards required in substation automation and power quality monitoring applications. While various enhancement techniques—such as signal interpolation or digital filtering—could improve measurement accuracy, these approaches introduce additional computational overhead and latency. Given the time-critical nature of protection functions, the increased delay rendered this method unattractive for real-time deployment. Consequently, a more robust and low-latency alternative was sought.
Zero-crossing detection is widely utilized for frequency measurement due to its simplicity and ease of implementation. Fundamentally, frequency can be estimated by identifying points where a signal crosses the zero-voltage threshold and calculating the time interval between consecutive crossings. In hardware implementations, this is commonly achieved using analogue voltage comparators synchronized with a reference threshold. In digital systems, the same function is typically realized by examining the sign changes between consecutive samples. By counting the number of samples between zero-crossings and referencing the known sampling rate, the signal period—and thus the frequency—can be determined [
9,
10].
However, a major limitation of this approach arises from the fixed number of samples per period, typically 80 or 256, as defined in IEC 61850-9-2. This discretization imposes a quantization limit on the frequency resolution, since each measurement can only resolve frequency changes in steps determined by the sampling interval. While interpolation techniques can be applied to estimate sub-sample zero-crossing points and enhance accuracy, these methods introduce additional computational latency, which may not be acceptable in time-critical protection systems. As presented in
Table 3, the minimum frequency variation per sample (Δf) is dependent on the Least Significant Bit (LSB) of the sample counter, which in turn is directly dependent on the maximum number of samples used to quantize one period of the signal (N).
Another significant drawback is the susceptibility to false zero-crossings caused by noise, harmonics, or transient signal distortions. In digital systems, such spurious crossings may lead to incorrect frequency estimates. Although digital filtering can mitigate this issue by smoothing the input waveform, achieving sufficient attenuation of high-frequency components requires a filter with a relatively high number of coefficients. This, in turn, increases the processing delay and can easily undermine the responsiveness of the system in fast protection applications. Therefore, while zero-crossing detection is efficient, its limitations in terms of resolution and noise resilience make it less ideal for meeting low-latency measurement requirements.
The DFT is a fundamental mathematical tool used to analyze the frequency content of discrete signals. By transforming a signal from the time domain to the frequency domain, the DFT enables us to identify the individual frequency components that comprise a complex signal. The DFT converts a finite sequence of equally spaced samples into a set of complex numbers, each representing a specific frequency component [
11,
12]. Its computationally efficient implementation, known as the FFT, makes frequency analysis practical for real-time and large-scale data processing. The general formula of the DFT can be written as in the equation below:
Parameter k corresponds to the target frequency bin, and N is the number of samples in the analysis window. A conventional DFT or FFT calculation is performed for the entire frequency spectrum, yielding several frequency bins equal to the number of samples. The frequency resolution can be obtained using the following formula:
The fs parameter corresponds to the sampling frequency value, and N corresponds to the number of samples used to define the signal window.
Table 4 presents the frequency resolution calculated for the standardized sampling frequency values with respect to the number of signal samples used to define the calculation window.
As can be observed from
Table 4, the DFT algorithm needs at least 10 periods of the signal to obtain a frequency resolution of 5 Hz. For certain applications, it may be sufficient to calculate just the fundamental frequency bin, and setting the signal window to a single signal period would not impose any disadvantage. However, for calculating different harmonic content, the signal window needs to be extended to obtain a finer frequency resolution. One important aspect to be noted is that increasing the calculation window will implicitly increase the calculation effort required and the total latency of the result. It is implied that if we set the signal window to 4 signal periods, the result will be available after 4 signal periods have been fed to the algorithm.
Table 5 presents the variation between the calculation effort and the number of samples required to obtain the results for the different popular DFT methods. The number of samples is inherited from
Table 4. Each value below the different DFT columns represents the number of complex multiplications required.
The calculation effort for a conventional DFT algorithm can be easily estimated as N2, representing the number of complex multiplications required. Other options to calculate the DFT are using the Radix-2 (R2) or Radix-4 (R4) simplifications of the algorithm, which reduce the calculation effort to N * log2 (N), respectively N * log4 (N). However, the difference can be easily observed in comparison with conventional DFT; the most notable disadvantage is the limitation of the sample window. Both Fast Fourier Transform (FFT) algorithms can be calculated for all the sample numbers using mathematical tricks, such as zero-padding; however, the resulting frequency resolution will differ significantly for the values presented in
Table 4. Another disadvantage shared by the fast methods is the requirement to have the entire window of data samples available at once; thus, the algorithm cannot progress until this requirement is fulfilled. The most notable advantages of the single-bin DFT (SbDFT) algorithms are the ability to store a single intermediate value throughout the calculation steps and the capability to calculate the next intermediate value as soon as a new sample becomes available. Even if several frequency bins are calculated simultaneously, this method offers significantly better latency, as it can progress with each new available signal sample from the target window. The SbDFT algorithm used by the prototype offers significantly higher accuracy than zero-crossing detection, providing more precise data about the signal’s phase evolution at a specific frequency, which makes it less sensitive to noise and waveform distortion. Unlike zero-crossing methods, SbDFT provides sub-sample resolution without requiring heavy filtering or interpolation of the signal samples. An additional advantage of this method is its compatibility with accumulated sample data, aligning seamlessly with the design of the EQA Engine. This stands in contrast to conventional full-spectrum DFT implementations, which typically require access to the entire raw sample buffer. A simplified conceptual overview of the algorithm is presented in
Figure 3, which illustrates the main steps and the required data flow.
The data samples and sample rate inputs are transferred from the Filtered Data Bus to the Accumulated Data Bus (see
Figure 1). The Accumulated Data Bus serves as a shared interface for all modules within the EQA Engine, including the Frequency Estimator, which contains multiple single-bin Estimator instances configured for specific frequency bins. For each received sample, the Counter 2 Angle sub-block represents the translation logic between the current sample index and the input angle for the CODRIC module. This module plays a crucial role in the presented algorithm, as it can be used to generate sine and cosine values without requiring any additional data, except for the input angle. There are several variants available for implementing this module, but in our case, a pipelined approach is the best solution, as it can also be used for future configuration and arrangement of the SMV protocol.
As mentioned earlier, the algorithm requires accumulated data, so the Current Sample Index and the Old Accumulated Data will be added to the contents of the SAM elements. On each iteration, the incoming data samples will be multiplied by the corresponding complex coefficient, then added to the Old Accumulated Data received. The SAM elements have a dedicated field for the Current Sample Index and Old Accumulated Data, as well as for the final result, which will be written when the algorithm completion is flagged. Compared to a conventional full-spectrum DFT approach, the data storage requirements for this option are minimal, and the required twiddle factors for calculations are generated on the spot after the stream data are available at the Frequency Estimator’s inputs. In terms of memory space in the SAM elements, this module will require 4x 40-bit fields for the accumulated data and the final result, and an 8-bit field to keep track of the Current Sample Index. These values are set to facilitate the algorithm’s compatibility with sample rates of 80, respectively 256 samples per second, and the final result is always reported after a full signal period has been accumulated. The Multiply–Accumulate (MACC) logic is a common component for all of the algorithms encapsulated into the EQA Engine, and its main purpose is to calculate the sample summation while not having access to all the required samples at once. The intermediate result for a data stream can be stored for the current sequence and accessed when a new sample of the same stream is received again by the IED. For each algorithm, the MACC has different configuration for the input/output size and the type of the mathematical operations—e.g., the SbDFT requires a MACC that operates with complex numbers, while the other MACC components operate only with integers.
For the RMS estimation algorithm, a similar approach is presented in
Figure 4, where the incoming data samples are synchronized with the accumulated data and a sample tracker.
The RMS Estimator module hosts multiple instances of the RMS calculation algorithm, enabling simultaneous processing of data streams containing measurements for all three phases. Unlike fixed-size implementations, the maximum number of samples used in the estimation is user-configurable and must be a power of 2. This requirement simplifies and accelerates the final division step by enabling the use of a right-shift operation through a dedicated RShift component, which efficiently replaces division when the sample count is a perfect power of 2.
Although RMS estimation is computationally simpler than complex-domain algorithms—since it involves only real values—the primary source of latency in this module is the SQRT component, which consumes a significant number of clock cycles to compute the square root of the accumulated average value. The maximum number of samples allowed primarily influences the memory space requirements of the RMS Estimator. For each incoming measurement, the initial requirement for storing a single squared sample is 64 bits, in line with the IEC 61850-9-2 standard’s maximum data representation. However, this is only sufficient for storing one squared value.
To manage the accumulation over a potentially large number of samples, the Current Sample Index is defined as a 32-bit value, allowing up to 232 samples to be included in the RMS calculation. To accommodate the resulting growth in bit width during accumulation, the accumulated data memory space is expanded to 96 bits, ensuring that no overflow occurs even at the maximum configured sample count. For the final result quantization, the number of required bits is significantly less than the number of bits required by the accumulation metadata. This happens because the division by the number of samples reduces the quotient size to the same number of bits as the representation of the squared samples, which is 64, representing the input to the SQRT component. The final result of the RMS value is represented on 32 bits, as the SQRT component’s output size is half of its input size.
Active power (P) estimation is slightly simpler than the previous algorithms, primarily because it does not require a square root calculation. This also improves the overall algorithm latency, as the square root algorithm is more time consuming compared to other mathematical operations.
Figure 5 presents the concept of the algorithm.
In comparison with previous algorithms, a key difference for active power estimation is that two data streams are required to initiate the calculation. As the standard stipulates, one SMV data packet contains data streams for currents and voltages. The algorithm is not calculating a result for each data stream, as in the previous cases presented, but is calculating a result corresponding to two distinct data streams. The maximum number of samples used to compute the active power is shared with the RMS Estimator, so both modules’ results will be available almost simultaneously. As explained in the previous case, having several samples equal to a perfect power of 2 simplifies the division step. In terms of memory space requirement, the intermediary accumulated data will require a total of 96 bits for proper representation: 64 bits resulting from multiplying one voltage value by one current value, and an extra 32 bits to accommodate the summation for the maximum possible number of samples. The final result will be reduced to 64 bits, as the added 32 bits are eliminated with the division by the number of samples, which is performed by right-shifting the accumulated data. The apparent power (S) and the power factor (PF) metrics can be easily calculated on the software side after all the required parameters are available to the processor. In the case of S, even though a simple multiplication between RMS voltage and RMS current is required, the parameter is not accommodated in the EQA Engine as it requires additional memory space, which must be kept to a minimum size for each stream. As for the case of PF, calculating the division between P and S requires a minimum level of accuracy to obtain a relevant result. The division by a non-constant value can be a challenge in the FPGA, even when using dedicated digital signal processors. The easiest way to achieve a high level of accuracy for this division without overcomplicating the FPGA design is to send the required parameters to the CPU and perform the division at the software level.
The last three parameters are calculated over a fixed number of samples, more specifically, a full signal period. The RMS value over a signal period (U
RMST) could not be covered by reusing the RMS approach discussed earlier. Even though we can fix the sample counter threshold to one of the standard values, the main incompatibility problem resides in the fact that the number 80 is not a power of 2, and the division by the number of samples cannot be performed with binary shifting.
Figure 6 presents the RMS estimator concept adapted to cover this obstacle.
Apart from a fixed number of samples received from the Filtered Data Bus, the accumulated result is sent to the RShift or MultShift component, depending on the data stream sample rate. Although counterintuitive, a better alternative to performing division by 80 is to multiply the accumulated value by 1/80. From a mathematical point of view, it is a completely equivalent operation, but from a digital design perspective, it is equivalent to multiplication by a constant, rather than division. From the perspective of memory space requirements in the SAM elements, the accumulated data are represented on 72 bits to cover the maximum sample rate value. For streams with 256 samples per period, the division by the number of samples can be performed as an arithmetic shift. For streams with 80 samples per period, a 16-bit constant is defined to represent the quotient for the 1/80 result. Since this multiplication will always result in an output value that is lower than the input value, the resulting 16 LSBs can be discarded from the start. By doing this, we can guarantee that the first 8 MSBs of the result will always be zero and can be discarded, resulting in a final value of 64 bits required to quantize the input value of the SQRT component.
For the final two energy quality parameters—average rectified value (U
ARVT) and peak absolute value (U
peakT)—the conceptual implementation approach is presented in
Figure 7.
Since both algorithms require the absolute value of the signal samples, they have been incorporated into the same module. The peak detection is performed by saving the absolute value of the last received sample and comparing it with the absolute value of the current sample. The accumulation of the absolute value is performed on a full signal period, and the average value calculation requires division by the number of accumulated samples. Therefore, the same approach is adopted to handle the division, as shown in
Figure 6. For the ARV Estimator, the latency value is significantly smaller than that of any of the RMS Estimators, as it does not require a square root operation. In terms of memory space required for the SAM elements, first, the absolute value of the last identified peak value must be stored, resulting in 32 bits being required for this part. The accumulation can be easily stored in a 40-bit memory space to cover the intermediate value for the maximum sample rate, which is 256 samples per period. After the division is performed, one way or another, the final result of the ARV is quantized on 32 bits.
Table 6 provides a detailed overview of the memory requirements for each independent data stream. Each SAM element is responsible for storing intermediate values needed to derive various energy quality parameters, as well as the final results. To optimize memory transaction efficiency, it is essential to understand the memory footprint of each parameter. This table summarizes the data fields stored per parameter, the bit widths allocated to each field, and the total memory consumed per stream, providing a clear reference for evaluating and comparing the resource demands of individual application data compatible with IEC 61850-9-2-LE.
The RMS and active power parameters share a single index tracking configurable counter. The SbDFT sample counter can be set to 4096 samples, allowing the calculation of frequency bins for up to 10 signal periods at the highest standard sampling frequency presented in
Table 3. The remaining three parameters utilize a shared 8-bit counter to perform calculations on a signal period of up to 256 samples. Frequency and coarse harmonic content can be calculated for the measured data streams using the seven user-defined frequency bins. If the ASDU contains a full 3-phase measurement, there are seven frequency bins available for calculation for each phase. This feature enables the calculation of the fundamental frequency component, along with the other two harmonics [
13], allowing the user to split the target harmonic components between the measured data streams. Although not the most accurate or efficient method, the approach focuses on obtaining the lowest possible latency value for calculating certain critical parameters, which is much quicker than conventional software implementations that depend on receiving most or all of the data samples before starting the calculation steps.
As presented in
Table 6, the total number of bits required for a single data stream is the sum of the accumulated data size for each parameter, the final results size, and 52 bits for the index counters. Considering that the active power memory requirements are split between two streams, the final requirement to store a stream’s metadata and results amounts to 1732 bits, or 216.5 bytes, without accounting for the index counters. The maximum number of streams contained in an application-specific data unit (ASDU) is 8, representing four voltage measurement data streams (3 + N) and four current measurement data streams. For all eight streams, a 32-bit configuration register is stored in each SAM element to set the sample threshold for RMS and active power calculations. Each SAM element encapsulates the entire metadata required for all the data streams of a single ASDU, totaling 13,856 bits, or 1732 bytes. The absolute maximum values presented are calculated for the scenario where the ASDUs are built in the most inefficient mode, containing a single sample per stream and eight distinct streams, and the SbDFT calculation is enabled for the maximum number of bins for all of the voltage streams. Memory dimensions like the one presented are not suitable to map into the programmable logic’s block memory. For our target of 512 ASDUs actively decoded, the block memory requirement reaches ~139.5 blocks to process the data, and the target FPGA has a total of 140 blocks. Even if we reduce the number of pre-processed ASDUs, the memory resource requirement will still be incompatible with the hardware platform, resulting in very high routing congestion and affecting the overall system timing. This can make design synthesis and implementation very difficult, and a significant amount of FPGA resources will be used only for routing, which is undesirable.
All of the SAM elements are defined in the system’s dynamic memory, which is shared with the embedded ARM cores. Using the dedicated DMA module, the FPGA logic can have access to the system’s dynamic memory without any interaction from the system’s microprocessor. The SMV Scatter–Gather (SSG) module, as presented in
Figure 2, has the role of scattering the received accumulated data from the DMA module to the rest of the EQA Engine’s components. Every component used to calculate an energy quality parameter has the output ports connected to the SSG module. As the DMA has the read channel logic working independently of the write channel logic, the SSG module is built similarly, as the scatter logic is synchronized with the DMA’s read channel, and the gather logic is synchronized with the DMA’s write channel.
4. Implementation Results
This section presents the components of the EQA Engine, as introduced by
Figure 2, and highlights the data formats of the algorithms, along with their data flow and dynamic memory interaction. A high-level diagram of the implemented design is also presented to illustrate the configuration in relation to the SoC components. The latency of each presented hardware component is also discussed, from the point of feeding the raw data samples to the algorithms, to the points of obtaining the intermediate and final results. In the last subsection of this chapter, a resource report is presented to showcase the most intensive configuration of the subsystem, with the decoding part set for the maximum count of svIDs.
4.1. Data Flow Analysis of the Implemented Modules
This subsection presents each significant module encapsulated by the EQA Engine and its role in the data ordering and processing, as it can be found in
Figure 8. Some of the modules, such as the register access and DMA components, are not detailed in this section because they represent semi-generic components with no role in the actual data processing steps.
The SSG module represents the first key component of the EQA Engine. The primary function of this module is to synchronize the incoming data from the Filtered Data Bus with the data received from dynamic memory, then distribute it across the other components of the EQA Engine. Note that the diagram is not covering all the connections between the components, as many synchronization and monitoring flags are not yet in their final stage.
All of the arithmetic modules use accumulation components to calculate the target parameters. The RMS Estimator and Active Power Estimators can be configured on a large number of points (up to 2
31), and they use similar accumulation components that are inferring DSP units into the RTL design. The right-shift component covers the division of the accumulated value by the number of samples. These components also add a rounding constant before performing the arithmetic right shift and obtain a rounding to the nearest integer value [
14]. For these two modules, the right-shift parameters are chosen by using the internal SMV ID, which provides the user-configured window length. In the current implementation, these modules are only working with sample windows which are perfect powers of 2, as the integration of other window dimensions requires more development and validation work. The SQRT components represent one important architecture difference between these two modules. The best solution for this component’s functionality is represented by a non-restorative and fully pipelined algorithm [
15], which uses a fixed number of subtractions and additions to calculate the result. The sample counters for these modules are handled in the SSG module, which uses two synchronization signals to indicate the beginning and the end of a sample window.
The Frequency Estimator ramps up the development complexity as it requires several intermediate steps to prepare the data for the accumulation step. The implementation version of the SbDFT stands out as it does not use any pre-coded sine and cosine values. An Angle Translator component is used to calculate all the required angles with the steps defined by the user for the desired frequency bins, as presented in Equation (3).
F represents the target frequency bin, and N represents the DFT sample window. Parameter n from the DFT generalized formula is equal to 1 for calculating the angle step. The Step value is multiplied by the current index of each incoming measurement data sample to obtain the final angle required to process the current element. In the current state of development, the user can set up to seven distinct Step values, corresponding to seven target frequency bins. A fully pipelined CORDIC module [
16] is used to calculate the values of the trigonometric functions using a fixed number of additions, subtractions, and arithmetic right shifts. The number of iterations is determined by the desired output quantization, which in this scenario is 18 bits for sine and cosine. If the module performs more than 18 iterations, the error correction is saturated, and the final result will not get any additional accuracy. After the cosine (for real part) and sine (for imaginary part) are generated by the Angle Translator for the current measured sample, a complex multiplication is performed and the result is normalized (because the sine and cosine are sub unitary), then accumulated. The complex multiplication is also built using DSP units, using guidelines from Xilinx datasheets [
17], and the same guidelines are used to build the MACC units from the RMS, RMS
T, and Active Power Estimator modules.
Although very similar to the RMS Estimator module, the main difference of the RMST Estimator module resides in the ability to operate with sample windows which are not a perfect power of 2. This feature is used to perform a division using multiplication and a bit shift instead. Implementation for the RMS and Active Power Estimator modules has its technical challenges, as the algorithm requires considerable FPGA resources for the extended data formats of 96 bits.
The last module presented is the ARV Estimator, which is essentially a version of the RMST Estimator with disabled features. Instead of squaring the incoming measured samples, the module will just calculate the absolute value, and the final result is not passed through a SQRT block. Because the absolute value is already calculated in this module, it is also stored and compared to the next incoming sample, to keep track of a peak value over a signal period.
4.2. Hardware Implementation Overview
The HS3 prototype, originally designed for high-speed, deterministic decoding of Sampled Measured Values (SMV), has been architecturally expanded to accommodate real-time energy quality analytics via the Energy Quality Accumulator (EQA) Engine. This enhancement required substantial adjustments to the system’s hardware infrastructure, particularly in memory management, synchronization logic, and inter-module communication. At the architectural level, the addition of a second DMA interface to the dynamic memory controller ensures that the EQA Engine and the SMV decoding pipeline can operate independently, reducing memory contention and ensuring deterministic throughput. This separation is crucial when scaling to higher SMV stream densities, especially when each ASDU may include up to eight distinct data streams. The updated infrastructure, shown in
Figure 9, accommodates this by enabling dual-port access and higher bandwidth utilization without increasing AXI interconnect congestion.
As in the previous version of the project, the native data width of the Processing System AXI port is 64 bits. The AXI Interconnect 0 block has two 32-bit input ports. In earlier configurations, it was not possible to saturate the full data bandwidth due to the underutilized port width. However, with the latest enhancements to HS3, bandwidth saturation may occur if memory transactions are not properly synchronized. The newly added features of HS3 operate safely on a 200 MHz clock domain and can be adapted to higher clock speeds when deployed on more advanced FPGA architectures.
Until extensive field testing is conducted, the prototype continues to rely on a dedicated Data Generator module, which stores raw test data in local block memory. This method is particularly useful as it allows data to be delivered at high rates—often exceeding those observed in real-world scenarios—allowing future module testing under worst-case timing conditions.
Integration with the existing HS3 decoder is seamless. The energy quality results are handled along with the original decoded SMV data blocks via a dedicated DMA controller, ensuring that both decoded measurements and analytics results are correctly synchronized. The separation of the DMA ports for decoding and processing can enable features such as parallel memory access, at the cost of an additional high-performance memory port. This co-processing approach allows higher-level applications running on the embedded ARM cores to access pre-processed quality metrics alongside raw decoded data, simplifying post-processing or fault classification tasks. In terms of platform compatibility, the system is mapped to an MPSoC device, leveraging the parallelism of the FPGA fabric alongside ARM cores for supervisory and configuration functions. The prototype’s RTL architecture is portable to other FPGA or SoC platforms supporting AXI and DMA interfaces, making the architecture a strong candidate for future deployment on embedded substation devices. To achieve optimal performance and minimize LUT/FF requirements, the target architecture must also have DSP components available. Even though the multipliers, which are dependent on these embedded components, are optimized for the DSP48E2 [
18], adaptation to other DSP models can be performed directly in the multiplier modules without affecting the rest of the design.
In terms of modularity, the subscriber prototype can be deployed only as a stream decoder, and the EQA Engine can be partially deployed to support the specific features required by the application. This feature enables the conservation of FPGA resources. It reduces memory transactions, but it also requires future work to develop predefined configuration options for faster deployment and ensure that each feature can operate independently of the others. Some of the modules also require future refinement, specifically to extract the shared components into independent modules, thereby removing the interdependency of the parameter estimators.
As presented in the resource reports, it can be observed that multiple instances of the modules can be used on the same hardware, provided the physical infrastructure allows it (e.g., an additional Ethernet port for a second process bus connection). The prototype also supports native data output, enabling the integration of powerful features that the hardware platform can offer in specific scenarios, depending on the physical components accommodated within the SoC’s printed circuit board.
4.3. Latency Analysis
In contrast to the decoding section of the prototype, the main sources of latency in the EQA Engine are less dependent on the maximum number of subscribed streams, and more influenced by ASDU formatting, the number of samples per ASDU, the number of frequency bins configured for calculation, and the sample rate of the measured data.
Table 7 breaks down the latency components associated with each parameter calculation module in the EQA Engine.
Each module follows three iteration modes:
Initial iteration, where no accumulated data are expected and processing begins immediately;
Nominal iteration, which depends on accumulated data from the previous iteration;
Final iteration, which also depends on prior accumulations but includes additional computations to produce the final result.
In all modules, the first computational stage has zero latency relative to input during the initial iteration, followed by SSG read operations in both nominal and final iterations.
For the RMST and ARV Estimators, the double blue line separates the final iteration latency for standardized sample rates of 80 and 256 samples/period. As mentioned in the previous section, non-power-of-two sampling rates result in slower performance due to the higher delay required to perform the division steps.
As observed, the Frequency Estimator module is the only case where the latency of a nominal iteration is equal to the latency of the final one. This occurs because no additional mathematical operations are performed when the final iteration is reached. Another aspect to be observed is that latency is independent of the sample window, as the latency values are calculated based on the last SMV raw sample received by the EQA Engine.
Another important component in determining the maximum latency of a specific configuration is the SSG memory transactions and data alignment. The main role of this module is to deliver the required data to the modules presented in
Table 7. Data transactions performed by this module are presented as follows:
In the formulas presented above, SSG
Read represents the number of clock cycles required to perform a memory read operation in order to retrieve the complete temporary accumulated data associated with each ASDU. These values are derived from
Table 6, which details the memory space allocation for both intermediate accumulation and final result storage, collectively referred to as a single stream-accumulated metadata (SAM) structure. The added delay components—D
RD, D
WR, and D
Resp—represent the average memory transaction initialization delays and write confirmations. Depending on the AXI infrastructure utilization, their range can vary between 4 and 15 clock cycles, and the highest value will be considered for calculation. Because the SSG
Read operation starts at the same time as the data alignment, we can consider only this value for latency analysis.
Similarly, the number of cycles required to complete a memory write operation for storing the final results of each ASDU is denoted as SSGWrite2. This value constitutes the final component in the latency analysis, representing the cost of outputting the calculated energy quality parameters back into the dynamic memory.
To obtain the maximum latency of the energy quality parameters, we can consider the following configuration:
Sample Size = 32 bits
Subscribed ASDUs = 512
S = 8 streams/ASDU
F = 7 frequency bins
N = 8 samples/ASDU
The T
SMV Filter component represents the total decoding delay required by the SMV Filter module to deliver data to the dedicated DMA from the moment it receives it at the physical interface. To obtain this value, we can refer to the previous work [
4] and calculate it by subtracting the T
prep and T
write from the Total CLK Core Cycles value, resulting in a total of 2.266 μs. This value represents the delay between the data arrival at the physical interface and the data arrival at the EQA Engine’s input. The maximum Delay to Result Ready from the energy quality estimator modules is represented by the RMST Estimator with a value of 112 clock cycles for the streams sampled at 80 samples/period. For streams with a sample rate of 256 samples per period, the greatest value is represented by the Frequency Estimator—89 clock cycles. Taking into consideration that the clock period is 5 ns, we can substitute the values in (8) and calculate the worst-case latency as follows:
After performing the calculation, we obtain a total worst-case scenario latency value of 5.171 μs. Although very close to dropping below the proposed 5 μs threshold, the main latency components are represented by the SQRT components and the Frequency Estimator module. There are good chances of reaching the proposed initial goal after several optimizations are performed. Until the optimization step, extensive validation must be performed, as a considerable number of behavioral problems may appear when simulating a high volume of process bus data traffic and corner cases. The mitigation of these problems might require additional changes in the design data ordering and the insertion of additional data buffers to align the data flow properly. All of these actions will have an impact on the final system latency and resource requirement.
4.4. Resource Reports
The integration of energy quality monitoring functions within the HS3 prototype adds significant complexity to the overall hardware resource footprint, particularly due to the introduction of the Energy Quality Accumulator (EQA) Engine and its associated computation modules. Unlike the baseline SMV decoding pipeline, which primarily relies on basic sequential logic, finite state machines, and straightforward packet filtering, the EQA Engine introduces multiple mathematically intensive blocks that require careful balancing between logic, memory, and dedicated arithmetic resources such as DSP slices.
Table 8 presents the reported post-implementation resource requirements for the EQA Engine, the dedicated DMA module, and the maximum resource count of the decoding pipeline, highlighting the overall usage of the FPGA when all subscriber features are deployed.
The current development state of the SMV Decoder supports a maximum svID density of 512 IDs, which can correspond to up to 4096 distinct data streams, depending on the user configuration. Practical implementations of the SMV protocol often use fewer than 8 streams per ASDU (in contrast to the standard LE specification). In these cases, the data processing blocks are less constrained by latency, as the number of streams per ASDU is reduced for data transfer efficiency. Scalability to 1024 svIDs and beyond is considered for future development stages, but will require substantial synchronization work between the Data Decoding section and the Data Processing section, as presented in Figure 1.
Among the most resource-demanding components are the RMS Estimators. Their reliance on pipelined square root (SQRT) units, wide accumulation logic, and wide input multipliers directly impacts the FPGA's available LUTs, flip-flops, and DSP blocks.
The Active Power Estimator adds further DSP utilization, as it performs continuous point-wise multiplication of synchronized voltage and current streams. While its architecture is slightly simpler than that of the RMS Estimators (there is no SQRT stage), it still consumes an equivalent number of DSP blocks and comparable logic resources for its Multiply–Accumulate operations. Its tight synchronization with the RMS Estimator allows the design to reuse certain configuration and control resources, reducing redundant logic at the cost of tighter coupling between the EQA Engine's modules.
The Frequency Estimator stands out due to its extensive use of the pipelined CORDIC engine for generating sine and cosine functions. This approach was chosen to minimize the need for storing large lookup tables for trigonometric coefficients. The CORDIC implementation completely removes the need for BRAM usage at the cost of a considerable number of logic elements.
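To illustrate this trade-off, a rotation-mode CORDIC needs only shifts, adds, and a small arctangent constant table instead of a stored sine/cosine lookup table. The following is a generic floating-point sketch of the algorithm (the iteration count of 16 is illustrative), not the fixed-point RTL:

    import math

    def cordic_sin_cos(theta, iters=16):
        """Rotation-mode CORDIC: computes (cos, sin) with shifts, adds, and a
        small arctangent table, instead of a stored trigonometric lookup table."""
        angles = [math.atan(2.0 ** -i) for i in range(iters)]
        gain = 1.0
        for i in range(iters):
            gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))   # pre-computed in hardware
        x, y, z = gain, 0.0, theta
        for i, a in enumerate(angles):
            d = 1.0 if z >= 0 else -1.0                # rotate toward z = 0
            x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
            z -= d * a
        return x, y                                    # valid for |theta| < ~1.74 rad

    print(cordic_sin_cos(math.pi / 6))                 # ~ (0.8660, 0.5000)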
A significant factor influencing resource usage is the extensive reliance on dynamic memory for storing intermediate results and stream-accumulated metadata. For each ASDU processed, the SAM holds accumulation registers, current sample counters, final result buffers, and configuration values. Depending on the enabled algorithms, the metadata footprint can reach up to 1.732 kB per ASDU. With the maximum configuration of 512 concurrently decoded ASDUs, this quickly scales to almost 900 kB of dynamic memory when fully populated. Due to this large footprint, it is infeasible to rely solely on on-chip BRAM for metadata storage. Instead, the design utilizes the MPSoC's shared DDR memory, accessed through a dedicated DMA channel. This keeps the programmable logic fabric flexible and scalable, while the ARM cores can perform supervisory tasks without blocking the memory bus. However, this design places extra emphasis on ensuring that the AXI interconnect can handle high-bandwidth, low-latency transactions for both decoding and energy quality operations in parallel.
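A quick back-of-envelope check of this footprint against the device's on-chip capacity (the ~4.9 Mb BRAM figure for the Zynq-7020 is quoted from the device family datasheet):

    # SAM footprint: up to 1.732 kB of metadata per ASDU, 512 concurrent ASDUs.
    SAM_BYTES_PER_ASDU = 1732
    MAX_ASDUS = 512

    total_kb = SAM_BYTES_PER_ASDU * MAX_ASDUS / 1000
    print(f"total SAM footprint: {total_kb:.1f} kB")   # ~886.8 kB

    # The Zynq-7020 provides ~4.9 Mb (~612 kB) of BRAM in total, so the
    # metadata alone exceeds on-chip storage, motivating the DDR + DMA design.
    BRAM_KB = 4.9e6 / 8 / 1000
    print(f"fits in BRAM: {total_kb < BRAM_KB}")       # False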
The design was implemented with no major routing problems and no failing timing paths, using a clock period constraint of 5 ns, corresponding to a frequency of 200 MHz. Higher frequencies are not achievable on the target hardware platform due to the limitations of its FPGA fabric; however, newer technologies should be able to run the prototype at considerably higher clock rates. While the prototype's post-implementation results show manageable resource usage for a moderate-density device like the Zynq-7020, future deployments on higher-capacity MPSoCs (such as Xilinx Zynq UltraScale+ or Intel Agilex SoCs) will unlock additional performance. These modern devices offer significantly more LUTs, higher clock speeds, abundant BRAM, and more DSP slices with advanced features such as floating-point support, which could allow certain SQRT and division modules to be improved by increasing their output precision.
4.5. Performance Evaluation
To summarize the obtained results and provide a quick performance evaluation of the current state of the presented HS3 subsystem, Table 9 presents the total latency of the subsystem and the FPGA resource requirements for different configured values of the maximum number of svIDs that can be processed in hardware. More details about the data decoding latency and resource requirements can be found in our previous work [4].
The presented values are obtained when the data samples are defined on 32 bits. The latency values represent the total delay required to decode, store, and process data packets containing a single ASDU, which is equivalent to the initial latency obtained when the packets contain multiple ASDUs. Additional memory operation cycles have been included in the calculations to cover the worst-case scenarios, i.e., those in which the memory infrastructure lacks optimization. When multiple ASDUs are received in the same data packet, the initial latency is the one given in Table 9, and the latency for each additional ASDU is determined only by the data processing latency, which does not exceed 40% of the worst-case value. The resource variation is driven mainly by the data decoding section; the data processing section is accounted for with all of its features enabled.
5. Error Analysis
The accuracy of the energy quality parameters calculated within the HS3 prototype depends primarily on the design and implementation of its hardware arithmetic units, the word lengths used in fixed-point representations, and the quantization or rounding steps embedded in each calculation stage. Unlike purely software-based implementations that can rely on double-precision floating-point operations for intermediate results, FPGA-based arithmetic must balance accuracy against available resources, propagation delay, and area constraints. Therefore, understanding the potential sources of computational error is essential for validating the system’s suitability for real-world deployment in critical substation environments.
The first significant source of error arises from the fixed-point representation of signal samples and intermediate values. For example, the RMS Estimator accumulates the squares of voltage or current samples with a 96-bit accumulator. Although this wide bit-width ensures that overflow is virtually eliminated even for high sample counts, the final output must be right-shifted to normalize the average value, then quantized to 32 bits when passed to the SQRT block. The SQRT operation itself introduces a second quantization stage. Since the implemented non-restoring SQRT algorithm produces a result with finite precision, any residual error from the intermediate division or right-shift stage can be amplified or reduced by the root operation. In typical scenarios, this error remains bounded by the quantization step.
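These two quantization stages can be modeled with a small fixed-point sketch; Python's arbitrary-precision integers stand in for the 96-bit accumulator, and the sample values are illustrative:

    import math

    def rms_fixed(samples, shift):
        """Fixed-point RMS sketch: exact wide accumulation (96-bit in hardware),
        right-shift normalization, then an integer square root; the shift and
        the root are the two quantization stages described above."""
        acc = 0
        for s in samples:
            acc += s * s                 # squares accumulate without rounding
        mean_sq = acc >> shift           # stage 1: normalization quantizes
        return math.isqrt(mean_sq)       # stage 2: finite-precision root

    N = 128                              # power-of-two window: shift = log2(N) = 7
    samples = [round(30000 * math.sin(2 * math.pi * n / N)) for n in range(N)]

    hw = rms_fixed(samples, shift=7)
    sw = math.sqrt(sum(s * s for s in samples) / N)
    print(hw, f"{sw:.3f}", f"rel. error = {abs(hw - sw) / sw:.2e}")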
The RMS and Active Power Estimators share the same Multiply–Accumulate (MAC) logic for summing the squared values (for RMS) and the product of voltage and current pairs (for active power). Each multiplication uses the FPGA's embedded DSP slices, which deliver near-exact results for operand widths up to 25–35 bits. The primary numerical limitation, therefore, comes not from the multiplication itself but from the final division step. When the divisor is a perfect power of two, the hardware implements the division as a simple right shift with rounding to the nearest integer, which induces an error of at most ½ LSB of the resulting value. However, when computing parameters such as RMS_T or ARV over a fixed sample window whose length is not a power of two (e.g., 80 samples/period), the division must be implemented as a multiplication by a pre-computed constant factor (the reciprocal of the sample count). This multiplication by a fractional constant is also carried out in fixed-point arithmetic and therefore introduces additional rounding noise, typically on the order of ½ LSB.
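The reciprocal-constant division can be sketched in a few lines; the 32-bit fractional scaling below is an illustrative choice, not the exact hardware word length:

    # Division by a non-power-of-two window (N = 80) via a fixed-point
    # reciprocal constant; FRAC_BITS = 32 is an illustrative word length.
    N = 80
    FRAC_BITS = 32
    RECIP = round((1 << FRAC_BITS) / N)          # pre-computed ~ 2^32 / 80

    def divide_by_window(acc):
        """Approximate acc / N with a constant multiply, round-to-nearest
        (add half an LSB), and a right shift."""
        return (acc * RECIP + (1 << (FRAC_BITS - 1))) >> FRAC_BITS

    acc = 123_456_789
    print(divide_by_window(acc), acc / N)        # fixed-point vs exact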
Within the Frequency Estimator, each multiplication of two quantized values adds a new rounding error. As each SbDFT bin accumulation progresses, these small rounding errors accumulate linearly with the number of samples. However, since the CORDIC implementation is pipelined and the same quantization model is used consistently across iterations, the dominant source of error is typically the static quantization rather than a dynamic noise build-up. In practical configurations with typical window lengths (80 to 512 samples), the cumulative effect is expected not to exceed 0.5% relative error for the target frequency bin.
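The per-bin accumulation can be modeled as a single-bin DFT with quantized coefficients; the 16-bit coefficient quantization below is an illustrative stand-in for the CORDIC-generated sine/cosine values:

    import cmath
    import math

    def single_bin_dft(samples, k, coeff_bits=16):
        """Accumulate one DFT bin with quantized sine/cosine coefficients,
        mimicking the per-sample rounding that builds up over the window."""
        N = len(samples)
        scale = 1 << coeff_bits
        re = im = 0
        for n, x in enumerate(samples):
            c = round(math.cos(2 * math.pi * k * n / N) * scale)  # quantized
            s = round(math.sin(2 * math.pi * k * n / N) * scale)  # twiddles
            re += x * c
            im -= x * s
        return complex(re, im) / scale           # undo coefficient scaling

    N = 80
    x = [round(1000 * math.sin(2 * math.pi * n / N)) for n in range(N)]
    ref = sum(xi * cmath.exp(-2j * math.pi * n / N) for n, xi in enumerate(x))
    print(abs(single_bin_dft(x, k=1) - ref) / abs(ref))  # well below 0.5%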
To properly illustrate the calculation errors between the double-precision results of the Octave IDE and the hardware-calculated values, the prototype has been fed with real signal captures obtained from a public GitHub project (commit ID 0d6760c) [19].
Figure 10 presents a single period of the signal for voltage and current extracted from the recorded samples for phase A.
The recorded samples have a scaling factor of 10³ for both voltage and current measurements, as this represents one of the rules defined by IEC 61850. As can be observed, the recorded signal has a nominal frequency of 60 Hz and a sample rate of 80 samples per period. Because of this, the RMS Estimator is skipped in the error analysis, as it does not yet support non-power-of-two signal periods. The RMS_T Estimator, on the other hand, can process these sample rates, and we can safely presume that its results would not differ significantly, as the two estimators have very similar architectures.
To properly highlight the difference between the calculations performed by the hardware modules and Octave’s double-precision format, the same signal samples fed into the prototype have been processed in Octave’s IDE.
Figure 11 presents the RMS_T calculations for 30 signal periods (or 0.5 s), highlighting the difference between the Octave output (blue) and the hardware output (orange).
It can be observed that the absolute error is almost proportional to the computed value and is generated entirely by the division step combined with the SQRT error. For the hardware values, the scaling is performed in Octave, as the hardware calculations never remove the scaling factor. As observed from the hardware results, the maximum relative error does not exceed 1.115 × 10⁻² [%], with a variation of 1 × 10⁻⁵ [%]. For most applications this value is not a concern, but improving it would require extensive optimization and re-design of the division and square root steps of the RMS Estimators.
Figure 12 presents the active power (P) calculations and the differences between the two calculated sets (streams).
As can be observed, the relative error between the hardware values and the software values is on the order of 7 × 10⁻¹² [%], which can safely be considered negligible for substation automation applications. The plotted hardware values almost perfectly overlap the software values. This result is obtained using sample windows of 128 samples, a perfect power of 2. We can conclude that the best precision is obtained for sample counts that are a perfect power of 2, as the division step can then be performed with minimal error.
Figure 13 presents the average rectified values for 30 periods of the extracted signal, for both voltage and current.
The maximum relative error does not exceed 2.4 × 10⁻² [%], with a variation of 1 × 10⁻⁵ [%]. Both module families share the division error, but in the case of the ARV Estimators, the division of the accumulated values by the number of samples directly produces the final result. In the case of the RMS_T Estimator, the final result is the square root of the ratio of the accumulated values to the number of samples; since, for small perturbations, the relative error of a square root is half the relative error of its argument, the maximum relative error is almost half that of the ARV Estimators.
To compare the Frequency Estimator values with their software counterparts, a full DFT over 10 periods of the signal is performed, as presented in Figure 14.
Figure 14 was obtained in Octave by using the "fftshift(fft(x))" functions and normalizing the results before plotting their absolute value. The frequency bin axis has been calculated by considering a sampling frequency of 4800 samples/s (corresponding to 80 samples per signal period), resulting in a total spectral span of 4800 Hz. Because we used 10 signal periods, according to Table 4, the resulting frequency resolution is 6 Hz, i.e., 800 frequency bins. According to the function definition, the fundamental should be found at indices 391 and 411, corresponding to the frequency bins of −60 and 60 Hz. We observed a plot-rendering artifact in Octave: if we examine the closest points around the fundamental value, the magnitude values are sub-unitary. The theoretical error analysis for the hardware module has several variables generated by the CORDIC estimation and the normalization of the intermediate results. The direct error analysis results in a maximum relative difference of 0.47% between one point of the Octave-calculated spectrum (the fundamental value) and the targeted frequency bin from the hardware results.
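The index bookkeeping can be reproduced with a short script (NumPy's fftshift/fft mirror the quoted Octave calls; the synthetic 60 Hz sine is an illustrative stand-in for the recorded phase-A capture):

    import numpy as np

    FS = 4800                  # samples/s (80 samples per 60 Hz period)
    N = 800                    # 10 signal periods -> 6 Hz bin resolution
    t = np.arange(N) / FS
    x = np.sin(2 * np.pi * 60 * t)            # stand-in for the recorded capture

    X = np.fft.fftshift(np.fft.fft(x)) / N    # shifted, normalized spectrum
    freqs = np.fft.fftshift(np.fft.fftfreq(N, d=1 / FS))

    for i in sorted(np.argsort(np.abs(X))[-2:]):
        # 0-based indices 390 and 410 = Octave's 1-based indices 391 and 411
        print(i, freqs[i], abs(X[i]))         # +/-60 Hz, magnitude 0.5 each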