Test for Reliability for Mission Critical Applications

Pipponzi, Mauro; Sangiovanni-Vincentelli, Alberto

doi:10.3390/electronics10161985

by Mauro Pipponzi ¹ and Alberto Sangiovanni-Vincentelli ²,*

¹ ELES Semiconductor Equipment, 06059 Todi, Italy
² Department of EECS, University of California, Berkeley, CA 94720, USA
* Author to whom correspondence should be addressed.

Electronics 2021, 10(16), 1985; https://doi.org/10.3390/electronics10161985

Submission received: 25 June 2021 / Revised: 6 August 2021 / Accepted: 12 August 2021 / Published: 17 August 2021

(This article belongs to the Special Issue Test and Monitoring of Aging Effects in Electronics)

Abstract

Test for Reliability is a test flow where an Integrated Circuit (IC) device is continuously stressed under several corner conditions that can be dynamically adapted based on the real-time observation of the critical signals of the device during the evolution of the test. We present our approach for a successful Test-for-Reliability flow, going beyond the objectives of the traditional reliability approach, and covering the entire process from design to failure analysis.

Keywords: test; reliability; on-chip monitoring

1. Introduction

Reliability analysis has traditionally been seen as one of a sequence of steps in the IC device lifecycle, which is not integrated with the other steps in the design or test flow. It is mostly performed at the end of the process (Figure 1), after manufacturing, by exposing the device to stress conditions of power, voltage, and temperature [1].

This flow alternates between short (a few seconds) test trials on an ATE machine and long periods of Burn-in/Life test (Figure 2). Aside from the delay and related costs of continuously switching equipment, this approach yields test results only at a discrete number of specified moments, i.e., when the device is run through the ATE machine. Moreover, the result of this process is a simple PASS/FAIL message that, in case of failure, carries no specific information on why the failure occurred.

Testing, despite great strides in methodology, tools and equipment, has been focused on addressing design, process and manufacturing defects, with almost no attention paid to what might happen under stress conditions, when considering the effect of aging and addressing systematic failures that might show up after the device is placed in operation.

These separate approaches have worked well in the past, in relation to the objective to screen out the defective products due to infant mortality, but today’s requirements of mission critical applications make reliability an increasingly important issue.

Reliability affects functional safety: Safety standards require that the causes of systematic failures be removed [2].

Reliability affects security: The exploitation of HW vulnerabilities that might be exposed by failures has become easier to achieve in recent years due to widespread availability and cost reductions in the appropriate equipment.

Focusing on the elimination of units presenting early-life defects [3] is no longer sufficient. We should investigate the reliability of the product over the entirety of its useful life. Information is needed to determine why a failure took place and how to improve the process to avoid it. Collecting this information implies knowledge of the behavior of the device during its operational lifetime, how it will degrade, and the residual sources of systematic failures. Design, test, and manufacturing processes must work together to ensure that products present as low a failure rate as possible (Figure 3), not just for quality (and related economic) reasons, but also to meet security and safety expectations.

Difficult-to-identify situations, such as intermittent faults [4], arise at unpredictable moments in the device’s lifetime and can only be triggered and detected by a combination of stress conditions and continuous monitoring of the involved signals.

In this paper, we present a methodology to address reliability in the development of a semiconductor product, where reliability is considered early on in, and throughout, the entire process, from design to production (Figure 4).

We call this methodology Test for Reliability (TfR), where test trials under nominal and stress conditions take place at the same time, making use of a massively parallel platform [5]. This setup allows for the continuous monitoring of the key parameters of a large number of devices, with a higher level of granularity than is allowed by the traditional ATE + burn-in approach (Figure 5). The test conditions can be dynamically adapted based on an observation of the results to maximize the extracted information.
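
As an illustration of what adapting the test conditions to the observed results could look like, the following minimal Python sketch raises the stress temperature while a monitored parameter remains far from its limit, and holds the condition once the drift approaches that limit. The `StressCondition` type, the `adapt` function, the drift metric, the 0.8 hold threshold, and the step sizes are all hypothetical and are not taken from the platform described in [5].

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StressCondition:
    temperature_c: float  # applied stress temperature (assumed unit: Celsius)
    voltage_v: float      # applied supply voltage (assumed unit: volts)


def adapt(condition, drift_fraction, temp_step_c=5.0, max_temp_c=150.0):
    """Return the stress condition for the next test interval.

    drift_fraction is a hypothetical metric: the observed drift of a
    monitored parameter divided by its allowed drift.
    """
    if drift_fraction >= 0.8:
        # Close to the limit: hold the condition so that the evolution
        # towards a possible failure can be observed in detail.
        return condition
    # Far from the limit: increase the stress, capped at max_temp_c.
    new_temp = min(condition.temperature_c + temp_step_c, max_temp_c)
    return replace(condition, temperature_c=new_temp)
```

Other policies (for example, stepping the voltage instead of the temperature) fit the same structure; the point is only that the next condition is a function of what was observed.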

Our research contribution is the definition of the methodology and codification of its steps. The methodology is the most important aspect of design. We witnessed and participated in the development of the digital methodology and tool flow in integrated circuit design. Without this methodology, the digital design would not be automated and would not be as powerful as it is today. We believe that TfR will play a fundamental role in the design of reliable systems.

2. Test for Reliability

Test for Reliability aims to provide detailed information on how devices behave throughout their life in a certain environment by shifting reliability analysis from an activity that takes place later on in the process (Figure 1) to a pervasive concern in the product lifecycle (Figure 4).

In this approach, both test and stress steps take place at the same time (Figure 6) and on the same platform, on a high number of devices, during different stages of the process: device validation, qualification, and production. During this process, key parameters, environment data, functionality, and diagnostics are continuously monitored. Hence, reliability analysis is not confined to the end of the production flow. It starts very early, during qualification and pre-qualification, providing data to design and making it possible to remove the causes of systematic failures when it is still possible and cost-effective to do so.

Test for Reliability is centered on the collection of test and reliability data (Figure 7): it spans the entire process, from design to manufacturing, test trials and failure analysis (Figure 8), and involves a larger number of devices than is typically used in a burn-in process.

During the device-design phase, observers, which monitor the evolution of the critical signals, and controllers, which force states that make it easier to trigger failures, must be inserted into the design. These observers and controllers can be traditional DfT features (such as BIST or scan chains), sensors, safety mechanisms, or status and/or diagnostic registers. These features are often already present in the device because they were inserted to satisfy other requirements. Sometimes, they must be deliberately added to satisfy the TfR requirements.

The development of the test program considers the specific features of the device, the functionalities and the technology or technologies used by the device, to maximize stress coverage and observability.

Running the test on a massively parallel stress-and-test platform (Figure 9) is essential to extract the data necessary to make the flow successful and gather the information needed to trigger continuous improvements in the design and process. A massively parallel platform makes it possible to collect data from a large population of devices (in principle the entire population) with a high granularity of measures (continuous monitoring of the relevant signals for tens of minutes, as opposed to a discrete number of snapshots every N minutes).
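
The difference in granularity can be made concrete with a small Python sketch. It checks whether a fault that is only active during short intervals is seen by a readout taken every `period` seconds. The function name, the time units, and the example numbers are illustrative assumptions, not measurements from our trials.

```python
def is_observed(fault_intervals, period, duration):
    """Return True if a readout instant falls inside any fault interval.

    fault_intervals: list of (start, end) times in seconds, end exclusive.
    period: time between two consecutive readouts, in seconds.
    duration: total length of the trial, in seconds.
    """
    t = 0
    while t <= duration:
        if any(start <= t < end for start, end in fault_intervals):
            return True
        t += period
    return False


# Hypothetical intermittent fault, active for 3 s during a 24 h trial.
fault = [(1000, 1003)]
missed_by_hourly_snapshots = not is_observed(fault, 3600, 86400)
seen_by_one_second_sampling = is_observed(fault, 1, 86400)
```

With these made-up numbers, the hourly snapshots never coincide with the fault, whereas sampling once per second does.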

The collection of an extensive database of reliability and test data is the principal objective of the TfR flow. These data are much more informative than the traditional binary PASS/FAIL result for each test. Information is provided about where the failure takes place, under which conditions, and how the signals have evolved to lead to the failure.
The data collected allow for an analysis of the behavior of the device over time as stress conditions are applied, an evaluation of the onset of anomalies, advanced characterization, the highlighting of discrepancies among production lots, and feedback to design and process for future developments.
The data collected also provide Failure Analysis with enough information to determine why the device has failed, and what the possible root causes of the failures might be, eventually leading to improvements in the design and manufacturing process.
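
To show how such a record differs from a PASS/FAIL flag, the following Python sketch stores a per-device history and, on the first out-of-limit sample, reports when the excursion happened, under which conditions, and how the signal evolved just before it. The `Sample` and `DeviceHistory` types and their field names are hypothetical; they only illustrate the kind of information described above.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    time_s: float         # time since the start of the trial
    temperature_c: float  # applied stress temperature
    voltage_v: float      # applied supply voltage
    value: float          # value of the monitored signal


@dataclass
class DeviceHistory:
    die_id: str
    samples: list = field(default_factory=list)

    def record(self, sample):
        self.samples.append(sample)

    def failure_report(self, low, high, context=3):
        """Describe the first out-of-limit sample, or return None."""
        for i, s in enumerate(self.samples):
            if not (low <= s.value <= high):
                start = max(0, i - context)
                return {
                    "die_id": self.die_id,
                    "time_s": s.time_s,
                    "conditions": (s.temperature_c, s.voltage_v),
                    # Signal values leading up to and including the failure.
                    "evolution": [p.value for p in self.samples[start:i + 1]],
                }
        return None
```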

In summary, running the stress and test together presents many advantages:

It takes into account the behavior of the device under stress conditions, as early as pre-qualification and qualification;
It provides early feedback to the designer, starting from early silicon, making it possible to improve the process when it is still possible and cost-effective;
It provides a continuous picture of how the most significant signals evolve over time rather than just signaling failure;
It allows for the continuous monitoring of key parameters that provide a better description of the product behavior, allows for the generation of technology models and supports the investigation of failures;
It addresses difficult-to-detect faults, such as intermittent faults, the role of which is becoming increasingly important in mission critical applications.

Ultimately, TfR affects all the stages of the process, collecting data from a large population of devices with high granularity and thereby providing the necessary information to make the process more robust and move toward the elimination of all systematic failures (zero defects).

3. Design for Reliability Test

TfR allows for movement from a sequence of PASS/FAIL answers to a Learn-from-Fail concept, where the failure data captured during the test are used to improve the design and manufacturing process. To make these data available, TfR relies on embedded circuits within the device to provide continuous control and monitoring of key device parameters that impact reliability, such as junction temperatures and node voltages. In addition, these embedded features are used to optimize test coverage and stress levels, and also to reduce test time.

In many cases, little or no additional chip area is required for the extra functions needed for reliability testing, especially for automotive devices. Indeed, standards such as AEC-Q100 and ISO 26262, as well as security considerations [6], require that the chip include an extensive built-in self-test, as well as online safety mechanisms.

To ensure that the device is adequately covered, our methodology prescribes a number of steps during the design phase (Design for Reliability Test (DfRT)):

Analysis of the device structure;
Identification of which IPs and Technologies are used in the device. Indeed, the embedded circuitry requirements may be different depending on the specific technology used in the device, whether digital, analog, or non-volatile memories, and the targets that need to be reached in terms of coverage and failure rate;
Depending on the application and target product, identification of the test and reliability targets that must be achieved;
Definition of the types of reliability test trials necessary to achieve these targets (e.g., HTOL and ELFR);
Definition of the signals and parameters to be monitored during the test trials;
Identification of the features necessary in the IC to:
○ Trigger faults and observe signals;
○ Provide access to the controlled and observed signals.
Finalization of the DfRT Plan:
○ List of the test trials to be performed per IP and technology block;
○ List of the necessary triggering/observability features;
○ List of the interfaces required;
○ Requirements for triggering patterns.
Design and implementation of the required features;
Planning of special test functionalities outside the DUT (i.e., on the test board) in case not all the requirements can be met inside the IC. These features sit physically on the test board and logically between the board and the DUT and provide testing abilities such as temperature regulation, precision analog measurements, current sensors and regulators, and peak detectors;
Finalization of the DfRT plan coverage matrix, providing an upfront view of what can and cannot be achieved due to the lack of necessary features.
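
The last step of this list, the coverage matrix, can be sketched in a few lines of Python. The sketch assumes that the DfRT plan is available as a mapping from (block, trial) pairs to the features each trial requires; the function name, the block names, and the feature names are hypothetical placeholders.

```python
def coverage_matrix(plan, available_features):
    """Return, for each (block, trial) pair, the required features that are missing.

    plan: dict mapping (block, trial) to the set of required features.
    available_features: set of features present in the IC or on the test board.
    An empty list means that the trial can be run as planned.
    """
    return {
        key: sorted(required - available_features)
        for key, required in plan.items()
    }


# Hypothetical plan for two blocks of a device.
plan = {
    ("logic", "HTOL"): {"logic_bist", "thermal_sensor"},
    ("nvm", "HTOL"): {"memory_bist", "diagnostic_register"},
}
available = {"logic_bist", "thermal_sensor", "diagnostic_register"}
matrix = coverage_matrix(plan, available)
```

In this made-up example, the logic block is fully covered, while the non-volatile memory trial lacks a memory BIST, which is the kind of gap the matrix is meant to expose upfront.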

Once we have analyzed the different blocks and technologies, and the targets that need to be achieved, we can determine the list of tests necessary to satisfy those targets for each IP block. Once we have the list of test trials, they, in turn, determine:

Which parameters and signals we need to monitor and control;
Which features we need to be embedded in the device to be able to monitor and control these signals;
The communication interfaces needed to put the device in the desired state for each trial and to access the embedded features, whether for reading or for writing.

In a typical digital SoC device, the list of signals that need to be monitored is rather straightforward. In many cases, however, special attention must be paid to non-digital technologies such as power devices, memories, and MEMS, where the list of signals that need to be observed and controlled may not be easy to identify.

There are several types of signal that must be monitored, depending on the type of technology and the requirements of the test. Among them are:

Output of scan chains;
BIST signatures;
PVT sensors;
Diagnostic registers;
Periphery of analog blocks.

Table 1 provides an overview of the different embedded functionalities used to address signals and device parameters, and their relationships with different types of analyses and technologies.
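
As a usage sketch, the ratings of Table 1 can be encoded as data and used to check a device against the features that are mandatory for a given trial. The Python snippet below copies a subset of the rows of Table 1; the dictionary layout, the shortened sensor names, and the `missing_mandatory` function are illustrative assumptions.

```python
TRIALS = ("life", "characterization", "static", "manufacturing")

# Subset of Table 1: feature -> rating per trial, in TRIALS order.
# "++" is mandatory, "+" is nice to have, "o" is not required.
TABLE_1 = {
    "Test Mode":           ("++", "++", "++", "++"),
    "Low Power Mode":      ("o", "o", "++", "o"),
    "Unique DIE ID":       ("++", "++", "++", "++"),
    "Wafer coordinates":   ("+", "+", "+", "+"),
    "Diagnostic Register": ("++", "++", "o", "++"),
    "Thermal sensor":      ("++", "++", "o", "++"),
    "Voltage sensor":      ("+", "+", "o", "+"),
    "Logic BIST":          ("++", "++", "o", "++"),
}


def missing_mandatory(trial, device_features):
    """Return the mandatory features for `trial` that the device lacks."""
    column = TRIALS.index(trial)
    return sorted(
        name
        for name, ratings in TABLE_1.items()
        if ratings[column] == "++" and name not in device_features
    )
```

For example, a device offering only a test mode and a unique die ID would be reported as lacking the low power mode for the static trial.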

4. Results

Our TfR flow was applied to several devices from different application domains, such as ADAS, high power and storage, making use of a variety of technologies. The results have shown significant improvements in terms of their reliability, cost of test, and time-to-market reduction.

We analyzed the results for an Automotive MEMS device. In this example, we continuously monitored key parameters during tests such as High Temperature Operating Life (HTOL) and Temperature Humidity Bias (THB).

Observers were provided for thermal acquisition, sensor outputs, diagnostic registers, analog block measurements, internal alarms, and logic and memory blocks.

The test was conducted in parallel with the traditional approach, with a sample size of 3200 devices. The continuous monitoring of the test parameters during stress conditions has allowed for a better understanding of the behavior of the device and the stability of the process in relation to Temperature, Voltage and Power.

Besides this feedback on the process, three issues were detected that had not been identified by the traditional flow. The first was an anomalous value detected in an over-voltage/under-voltage (OV/UV) test. The second was the destruction of a device, and the third was an intermittent fault detected under particular stress conditions.

In the case of the anomalous value, the analysis eventually determined that the cause was a test setup problem. This conclusion was reached easily thanks to the data collected during the test. The additional observers made it possible to quickly determine which device reported the anomaly and to reconstruct the history of the device, including the point in the test at which the anomaly appeared and the conditions under which it appeared.

Figure 10 shows the outlier that suggested an anomaly in one of the devices and triggered the investigation. This anomaly was not detected using the traditional test flow.
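
One common way to flag such an outlier in a population of readings is a modified z-score based on the median and the median absolute deviation (MAD). The Python sketch below uses this technique purely as an illustration; it is not necessarily the analysis that produced Figure 10, and the die IDs, readings, and the 3.5 threshold are made-up values.

```python
from statistics import median


def find_outliers(readings, threshold=3.5):
    """Return the sorted die IDs whose modified z-score exceeds `threshold`.

    readings: dict mapping die ID to the measured value of one parameter.
    """
    values = list(readings.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate population: any value different from the median stands out.
        return sorted(d for d, v in readings.items() if v != med)
    return sorted(
        d for d, v in readings.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical OV/UV threshold readings (volts) for six devices.
readings = {
    "D001": 5.01, "D002": 4.99, "D003": 5.00,
    "D004": 5.02, "D005": 4.98, "D006": 5.60,
}
suspects = find_outliers(readings)
```

A median-based score is used here because a single extreme device does not distort the reference it is compared against, unlike a mean and standard deviation computed over the same population.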

The collected data provided a detailed history of the evolution of the device. Figure 11 shows exactly when, during the evolution of the test, the anomaly took place. Figure 12 shows the result of further analysis, where it was also possible to relate the anomaly to other measures, highlighting the issues at the same time and providing useful information regarding the identification of the source problem.
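
The two analysis steps described above, locating the anomaly in time and relating it to other measures, can be sketched as follows. The Python functions are a minimal illustration under assumed names and limits; the Pearson coefficient is one possible way to quantify the relation between two monitored parameters and is not necessarily the method behind Figure 12.

```python
from math import sqrt


def first_excursion(times, values, low, high):
    """Return the time of the first sample outside [low, high], or None."""
    for t, v in zip(times, values):
        if not (low <= v <= high):
            return t
    return None


def pearson(xs, ys):
    """Pearson correlation of two equally long series.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

Applied to the history of a single device, `first_excursion` gives the moment at which a parameter left its limits, and `pearson` indicates whether another parameter moved together with it over the same interval.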

The collected test data were equally valuable when analyzing the other two defects, making it possible to identify the source of the failures and their potential causes. These issues were also identified in the characterization trials, hence early in the flow, which allowed feedback to be provided for manufacturing improvement.

The TfR flow also allowed for a test cost reduction (up to 50%) by avoiding the need to switch between ATE and burn-in equipment. It also allowed for a faster time to market (a 2-month saving from qualification to production) by providing data to improve design and manufacturing early on in the process. Furthermore, Early Life Failure Rate (ELFR) trials demonstrated that the expected failure rate was assessed well before the traditional burn-in phase.

Similar results were observed in other cases, such as:

ADAS devices, with early detection of failures, and significant ppm reduction after a design fix;
Power devices, with process improvement and layout issues detected in pre-qualification;
Micro e-memory devices.

In all these cases, lower costs and a faster TTM were also achieved.

5. Evolution of the Methodology

In the traditional approach, reliability analysis is performed relatively separately from the rest of the design and production process, and focuses on the screening of early life failures with a PASS/FAIL approach. In this paper, we proposed a Test for Reliability approach, where reliability affects each stage of the process and where the information extracted about failures is much more detailed and can be used to trigger a continuous process of improvements in the design and manufacturing process. The two pillars of this solution are as follows:

Embedding of in-chip monitors and controllers to trigger and detect all the possible failures;
Running the test on a massively parallel test and stress platform. This allows the test to be run on a large population of devices with higher granularity.

A natural evolution of Test for Reliability is Online Reliability, i.e., the use of the on-chip monitors not just during the test, but also online during the operational life of the device. An example of this is reusing the same features for offline testing and online safety mechanisms. We posit that this online scenario can also affect reliability.

We need data on the behavior of the device during its useful life. The continuous observation of critical parameters would allow for expertise to be gained regarding the degradation, detection of the cause of failure and the ability to derive the exact fault that led to failure. This detailed knowledge, gained from online observations of the devices, will lead to the development of analytic models of device behavior with respect to environmental conditions such as voltage, temperature and power, especially with respect to misbehaviors that only present themselves in particular conditions and corner cases.

We will develop a specific architecture to manage and optimize the data coming from the DUT. We envision a companion reliability monitor, implemented in HW (Figure 13A) or SW (Figure 13B,C), that manages the flow of reliability and test data, and decides where and when to store information that will be used for offline analysis.
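
One possible storage policy for such a monitor is sketched below in Python: the monitor keeps only a short window of recent samples and stores that window when a sample leaves its limits, so that the history leading to the event is available for offline analysis. This is an assumption about how a monitor could decide when to store data, not the architecture of Figure 13; the class name, window size, and limits are hypothetical.

```python
from collections import deque


class ReliabilityMonitor:
    def __init__(self, low, high, window=3):
        self.low = low
        self.high = high
        # Ring buffer: old samples are dropped automatically.
        self.recent = deque(maxlen=window)
        # Windows saved for offline analysis.
        self.stored = []

    def observe(self, time_s, value):
        """Record one sample and store the recent window on an excursion."""
        self.recent.append((time_s, value))
        if not (self.low <= value <= self.high):
            self.stored.append(list(self.recent))


monitor = ReliabilityMonitor(low=4.5, high=5.5, window=3)
for t, v in [(0, 5.0), (1, 5.1), (2, 5.0), (3, 5.7), (4, 5.0)]:
    monitor.observe(t, v)
```

After this run, a single window ending at the out-of-limit sample has been stored, while the samples taken under normal behavior were discarded once they left the ring buffer.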

The goal of generating further data from online observations of the device leads us to two other areas where further work is necessary: how to effectively analyze this large quantity of data, and how to make sure that all and only the signals that are necessary to the analysis are observed.

The central point of TfR is its ability to provide a collection of testing and reliability data that make it possible to trigger a virtuous cycle of improvement (Figure 14). Once we have this collection of data, it is important to know what to do with it. The development of analysis techniques that relate the observed data to the causes of failure becomes particularly important when these causes are due to weak circuit structures, weak layout, or random causes. Here, machine learning may play a role, especially considering the type and amount of data coming from online reliability.

The ability to collect all and only the information needed to learn from failure can be seen as another face of the same coin. Given the amount of data generated, it is important to be confident that the information is important in driving the failure analysis, and not redundant. There is, therefore, room for improvement of the concepts introduced in Section 3 to understand which parameters should be monitored, as well as how to put the devices in a state that can trigger failure.

Particularly where non-digital technologies [7] are involved, we still need to structure the knowledge of how we identify the critical configurations that we want to monitor and how we write rules to recognize these situations, as well as patterns that trigger these weaknesses [8]. The ability to define these rules at the transistor level is particularly important if we want to target analog blocks and look inside digital cells and macroblocks.

We realize that both directions are part of the same process. What we learn from the analysis of the failure data also helps us to identify the part of a design that we want to monitor in future devices. In turn, a more precise identification of the parameters that will be observed, as well as how we monitor them, provides us with a better quality of the reliability data.

6. Conclusions

We presented a new methodology for the design, verification, and manufacturing of reliable semiconductor circuits, with the ultimate goal of identifying and possibly removing all the systematic causes of failure. The two pillars of the methodology are: (1) the instrumentation of the device, allowing it to monitor and control all the signals of interest, and (2) its ability to run test and stress on the same platform, on a large population of devices, starting from the early stages of the design process. The union of these two techniques is a novel approach to address the reliability of IC devices, which provides us with a detailed picture of the evolution of critical signals under different stress conditions. This allows us to identify situations that might lead to failures, including situations such as intermittent faults that are difficult to detect. Using this information early on in the device development process, e.g., pre-qualification for early silicon, makes it possible to address the necessary changes to the design and manufacturing processes, to avoid the causes of failure when it is still possible and cost-effective to do so. As the next steps, we aim to create the reliability features that can be used to instrument the device, and play an online role, allowing one to observe and learn from the device during its useful lifetime.

Author Contributions

Writing—original draft, M.P. and A.S.-V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. It was supported by ELES.

Acknowledgments

The authors gratefully acknowledge the contributions of Antonio Zaffarami, Alessandro Maseri and Luca Moriconi to the development of the methodology and its implementation.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. JEDEC Solid State Technology Association. Stress-Test-Driven Qualification of Integrated Circuits; JEDEC: Arlington, VA, USA, August 2018.
2. Zorian, Y.; Mariani, R. Automotive Reliability & Test Strategies. In Proceedings of the IEEE International Test Conference, Fort Worth, TX, USA, 31 October 2017.
3. Kuper, F.; van der Pol, J.; Ooms, E.; Johnson, T.; Wijburg, R.; Koster, W.; Johnston, D. Relation between yield and reliability of integrated circuits: Experimental results and application to continuous early failure rate reduction programs. In Proceedings of the International Reliability Physics Symposium, Dallas, TX, USA, 30 April–2 May 1996.
4. Constantinescu, C. Intermittent Faults and effects on reliability of integrated circuits. In Proceedings of the Reliability and Maintainability Symposium (RAMS), Anaheim, CA, USA, 24–27 January 1994.
5. Kim, H.; Lee, Y.; Kang, S. A Novel Massively Parallel Testing Method Using Multi-Root for High Reliability. IEEE Trans. Reliab. 2015, 64, 486–496.
6. Possamai Bastos, R.; Torres, F.S. On-Chip Current Sensors for Reliable, Secure, and Low Power Integrated Systems; Springer: Berlin/Heidelberg, Germany, 2010.
7. Milor, L.; Sangiovanni-Vincentelli, A. Minimizing Production Test Time to Detect Faults in Analog Circuits. IEEE Trans. Comput. Aided Des. 1994, 13, 796–813.
8. Milor, L.; Sangiovanni-Vincentelli, A. Optimal Test Set Design for Analog Circuit. In Proceedings of the IEEE International Conference on Computer-Aided Design, Santa Clara, CA, USA, 11 November 1990.

Figure 1. Traditional Reliability Flow.

Figure 2. Traditional Approach based on Burn-In.

Figure 3. Pushing down the bathtub diagram.

Figure 4. How Test for Reliability affects the flow.

Figure 5. Comparison between the traditional approach and the Test for Reliability methodology.

Figure 6. Test for Reliability Flow.

Figure 7. The collection of test and reliability data is at the center of the process.

Figure 8. Test for Reliability Building Blocks.

Figure 9. Massive Parallel Test (MPT) architecture.

Figure 10. Anomalies in the OV/UV test.

Figure 11. Observation of the anomalous behavior in time.

Figure 12. Correlating the anomalies to other parameters.

Figure 13. Ideas for online reliability monitor architectures. (A) reliability monitor implemented in HW (B) reliability monitor implemented in SW running on the same component it monitors (C) SW reliability monitor running on a companion component.

Figure 14. Learning how to improve the design process from data analysis.

Table 1. Embedded circuitry and their relevance to different types of test.

Category	Feature	Life Trial (Qualification)	Characterization Trial (Qualification)	Static Trial (Qualification)	Manufacturing Trial
Mode selections	Test Mode	++	++	++	++
Mode selections	Low Power Mode	o	o	++	o
Comm. Interfaces	SPI, I2C, JTAG	++	++	++	++
Traceability	Unique DIE ID	++	++	++	++
Traceability	Wafer coordinates	+	+	+	+
Diagnostic	Diagnostic Register	++	++	o	++
Diagnostic	Status Register	++	++	o	++
Sensors	Current	+	+	o	+
Sensors	Voltage	+	+	o	+
Sensors	Thermal	++	++	o	++
MUX	Internal analog nodes selection	+	+	o	+
MUX	External analog nodes selection	+	+	o	+
A2D	Analog nodes measurement	+	+	o	+
DFT functions	Logic BIST	++	++	o	++
DFT functions	Analog BIST	++	++	o	++
DFT functions	Memory BIST	++	++	o	++
DFT functions	Unique Scan Chains	++	++	o	++

++: mandatory. +: nice to have. o: not required.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
