# First PhD Report

PhD Student: Liviu Agnola

PhD Advisor: Prof. Dr. Ing. Mircea Vladutiu

# **Table of Contents**

| 1 |   | Intro  | duction                                                            | . 4 |
|---|---|--------|--------------------------------------------------------------------|-----|
| 2 |   | Memo   | ory Faults and Testing                                             | . 5 |
|   | 2 | .1     | Basic notions and concepts on faults and dependability             | . 5 |
|   |   | 2.1.1  | Failures, Errors and Faults                                        | . 5 |
|   |   | 2.1.2  | 2 Dependability and Security                                       | . 6 |
|   |   | 2.1.3  | Means to Achieve Dependability and Security                        | . 9 |
|   | 2 | .2     | Cache memories                                                     | 10  |
|   |   | 2.2.1  | Cache memory organization                                          | 13  |
|   |   | 2.2.2  | 2 Set associative caches                                           | 15  |
|   | 2 | .3     | Memory testing                                                     | 18  |
|   |   | 2.3.1  | Functional RAM chip models and faults                              | 19  |
|   |   | 2.3.2  | Reduced functional faults                                          | 22  |
|   |   | 2.3.3  | 3 Traditional and March Tests                                      | 26  |
| 3 |   | Built- | -In Self-Testing and Graceful Degradation                          | 32  |
|   | 3 | .1     | Memory Built-In Self-Test                                          | 32  |
|   |   | 3.1.1  | Introduction to BIST                                               | 32  |
|   |   | 3.1.2  | 2 Memory Built-In Self-Test                                        | 33  |
| 4 |   | State  | e of the art in graceful degradation techniques for cache memories | 35  |
|   | 4 | .1     | Description of a Process-Tolerant Cache Architecture Method        | 35  |
|   | 4 | .2     | Results of the Process-Tolerant Cache Architecture Method          | 4C  |
|   | 4 | .3     | Conclusions and discussion                                         | 43  |
| 5 |   | Self-A | Adaptive cache Memories                                            | 44  |
|   | 5 | .1     | Introduction                                                       | 44  |
|   |   | 5.1.1  | L-Zone                                                             | 44  |
|   |   | 5.1.2  | 2 "More Than One" column                                           | 45  |
|   | 5 | .2     | Modifications of the Set Associativity                             | 46  |
|   |   | 5.2.1  | Maintaining the set associativity                                  | 46  |
|   |   | 5.2.2  | Reducing the set associativity                                     | 47  |
|   |   | 5.2.3  | Reorganizing the memory                                            | 48  |
|   | 5 | .3     | Overhead                                                           | 48  |
|   | _ | 4      | Conducions                                                         | EC  |

| 5 | .5     | Future work | 1 |
|---|--------|-------------|---|
| 6 | Biblic | ography 5   | 3 |

## 1 Introduction

Ever since digital systems were created there were problems in making sure that the systems are working correctly, i.e. the results offered by the machines are accurate and correct.

With ever growing computing power and memory size this issue has become of great importance. Given the fact that in the last years memory size and speed were increased considerably and also the memory in a computing system accounts for somewhere about 50% of the power that the system uses, and taking into account Moore's law (the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years) it is imperative that the memory works correctly and without faults.

The doctoral program addresses the domain of Computer Science, with emphasis on Computer Hardware Design and Built-In Self-Test/Repair. In the last few decades the main focus in computer systems has shifted from performance towards reliability, yield and robustness. As memory systems continue to decrease in size an increase in capacity, the probability of hard, permanent faults increases, especially in SRAM cells [1]. Due to this fact the usual method: using spare rows or columns, for preventing hard faults can become obsolete [1] [2]. The hard or permanent errors can appear due to process variation [1] [3] and aging [4].

My work focuses on improving the reliability and yield of set associative cache memories. In order to address this issue first we will need to present the basics of cache memories, memory testing, built-in self-test solutions and graceful degradation solutions.

We propose a new method that can be implemented on any set associative cache memory and that provides an increase in reliability, yield and functioning time of the memory chip. All of this benefits will be at only a small cost in performance, due to the fact that it is a case of graceful degradation [5] [6] [7]. The increase in reliability, yield and functioning time is achieved by removing from use any faulty cell that has been diagnosed as an incurring hard error [8]. The small cost in performance is achieved from the reorganization of the memory cell array, this is done both for maintaining a high reliability, yield and functioning time of the memory chip. Also it is done for maintaining a relatively high performance of the memory, by reducing the number of misses and increasing the number of cache hits. To this end, we will assume that the cache memory is equipped with a concurrent built in self-test mechanism capable of detecting the hard error that may appear during the use of the chip and also during the production stage [8].

The thesis is structured in 7 chapters. We will provide in the followings a short description of each chapter.

# 2 Memory Faults and Testing

# 2.1 Basic notions and concepts on faults and dependability

In this section we will start by giving some general definitions on faults, errors, failures; also the basic means for fault detection, correction and fault tolerance.

#### 2.1.1 Failures, Errors and Faults

A system is an entity that is interacting with other entities, the other entities may be: humans, other entities, software, hardware, and the external world or physical world [9]. The function of a system is described by the functional specification, and it is what the system is intended to do in terms of functionality and performance [9]. The service that the system is delivering is its behavior as it is perceived by the user, where a user is another system, which receives the service provided by the first system.

In order to be able to define faults, errors and failures we must first state what a correct service of a system is. A system is said to deliver a correct service when the service implements the system function. A failure or a system failure is an event that happens when the delivered service deviates from correct service [9]. A system fails in one of two cases: either the specification did not adequately describe the system function; or because it doesn't comply with the functional specification [9]. A service failure is a transition from correct service to incorrect service [9]. A service outage is the period of delivering an incorrect service, a service restoration is the transition from incorrect service to a correct service [9].

When a system deviates from the correct service state the deviation is called an error. The hypothesized or adjudged cause of an error is called a fault [9]. A fault can be either external or internal of the system. An error is the part of the total state of the system that can lead to its subsequent service failure [9]. A fault is active when it causes an error; otherwise it is called dormant. Many errors don't reach the system's external state and cause a failure [9].

A degraded mode that still offers a subset of needed services to the user is when the functional specification of a system includes a set of several functions and the failure of one or more of the services implementing the functions may leave the system degraded [9]. The specification may identify several such modes, for example: limited service, slow service, emergency service, and others [9].

The manifestation and creation mechanism of faults, errors, and failures are depicted in Figure 2.1, these mechanism presented in Figure 2.1 enable the "chains of threads" to be completed, as illustrated in Figure 2.2.



Figure 2.1: Error propagation, from [9].



Figure 2.2: The fundamental chain of dependability and security threads, from [9].

Eight basic viewpoints classify all faults that may affect a system during its life, leading to elementary fault classes, as depicted in Figure 2.3.

For a simpler representation we can group the combined fault classes, presented in Figure 2.4, into three groups [9]:

- Interaction faults, that include all external faults
- Physical faults that include all fault classes that affect hardware
- Development faults that include all fault classes occurring during development

## 2.1.2 Dependability and Security

As presented in [9] there are two valid definitions of dependability, the first, and original definition of dependability is the ability of a system to deliver service that can justifiably be trusted. The other definition for dependability is the ability to avoid service failures that are more frequent and severe than is accepted. The latter definition is providing a criterion for making a decision if a system is dependable or not, while the first definition is stressing the importance of justification.

According to [9] the dependability of a system is an integrating concept that includes the following attributes:

- Availability
- Reliability
- Safety
- Integrity
- Maintainability



Figure 2.3: The elementary fault classes, from [9].

In the followings we will present the definition of security as illustrated in [9]. Security is a composite of the attributes of confidentiality, integrity, and availability, requiring the concurrent existence of: availability for authorized actions only; confidentiality; and integrity. In Figure 2.5 is summarized the relationship between security and dependability.



Figure 2.4: The classes of combined faults, from [9].



Figure 2.5: Dependability and security attributes, from [9].

The means to attain dependability and security are: fault prevention, i.e. a way to avoid the beginning or happening of faults; fault tolerance, i.e. a way to avoid, in presence of faults, the service's failures; fault removal, i.e. a way to reduce the severity and number of faults; and fault forecasting, i.e. a way to approximate the current number, the future occurrence, and the likely consequences of faults.

Before passing on to the next subsection we will present two more definitions of dependability as they appear in the ISO standards. The first one appears in [10]: the collective term used to describe the availability performance

and its influence factors: reliability performance, maintainability performance and maintenance support performance. The second definition is from [11]: the extent to which the system can be relied upon to perform exclusively and correctly the system task or tasks under defined operational and environmental conditions over a defined period of time, or at a given instance of time. The ISO definition, i.e. the first one, is focused mainly on availability [9]. Due to the unavoidable presence of faults, no system is totally available, safe, secure, or reliable [9].

# 2.1.3 Means to Achieve Dependability and Security

From the means to achieve dependability and security listed in the previous subsection, in this section we will focus mainly only on fault tolerance and fault removal, the other two methods will be given only a short description.

Fault prevention, as a way to avoid the beginning or happening of faults, is a part of general engineering [9], so it is mainly utilized by the manufactures in order to increase yield and causes of faults. The faults occurring in a system can be recorded by that system and used by the producer to eliminate the fault causes via process modification [12] [13].

Fault tolerance, which purpose is to avoid failures of the system, is implemented via error detection or correction and through system recovery [9] [14]. The techniques involved in fault tolerance are presented in Figure 2.6.

The focus of this thesis will be on isolation of the faults and reconfiguration of the system afterward. Also for this we will need an error detection mechanism and to be more specific, a mechanism for concurrent fault detection, capable of detecting errors and even correcting some of them as they appear. We will also provide an option for diagnosis to be sent back to the manufacturer for future improvements to their products.

Many approaches and schemes have been proposed over the decades for fault tolerance and for the many parts of fault tolerance. There exist a large number of synonymous for fault tolerance: self-repairing and self-healing are just two of them. Also in [15] the term recovery-orienting computing has been presented, this term defines a fault tolerant method for the goal of overall system dependability.

The fault removal technique aims at reducing the number of faults and their severity. Hardware testing is mainly aimed at removing production faults [9]. An important part of fault removal is the fault removal during use. The fault removal during use aims at removing the faults without stopping the system for maintenance. Also this technique increases a system dependability and functioning time. This technique, along with fault tolerance is very useful when a proper maintenance of a system cannot be done, for example a deep space probe cannot be returned back to earth each time an error occurs, and so that system needs to have very efficient fault removal and fault prevention techniques in order to be able to function in an inaccessible, for maintenance, environment.



Figure 2.6: Fault tolerance techniques, from [9].

As a conclusion to this section Figure 2.7 shows a refined dependability and security tree, from the definitions and techniques presented in this section [9].

#### 2.2 Cache memories

Since our thesis describes a self-repair method for set associative cache memories, in this section we will provide a brief introduction that will contain the basics on cache memories.

First of all we will start by presenting the memory hierarchy that is used in modern computers, Figure 2.8. In this hierarchy from top to bottom the storage devices get slower in speed, larger in capacity and cheaper in cost per byte. When computer system first started to develop only three levels of memory existed: CPU registers, DRAM or main memory, and the local hard disk [16]. Since the 1980's when the speed of the CPU registers and the speed of the main memory were almost equal, the gap between these two elements of a computer system has increased constantly, see Figure 2.9.



Figure 2.7: A refined dependability and security tree, from [9].

Because of this gap in performance and speed between the main memory and CPU registers, in order to increase the performance of the whole computer system, producers had to introduce a new level in the memory hierarchy, an SRAM memory type, called cache level 1. This level 1 cache was able to increase performance but not for too long, because the gap, in speed, between this level and the main memory also started to increase. A new cache level was needed, the level 2 cache. In the last few years producers needed again to introduce the so called level 3 cache memory, and probably in another three or four years we will see the level 4 and so on.



Figure 2.8: The memory hierarchy, from [16]



Figure 2.9: The gap in performance between memory and CPU, from [17]

So in order to conclude, a definition for cache memory: is a SRAM type memory placed between CPU registers and main memory (DRAM), it is superior in speed, compared to the main memory, but has a lower capacity. The cache memory contains copies of the locations in the main memory in order for the system to gain in speed and performance. So every byte that is processed by the CPU is passed

through the cache system, for this reason the dependability of the cache system becomes crucial.

# 2.2.1 Cache memory organization

Usually the level 1 cache is located on the same chip as the CPU, and can be accessed in one or two clock cycles. The cache level 2 is usually placed outside the CPU chip, and so it has greater access times, to the order of 10 clock cycles [16]. Figure 2.10 shows a typical structure for a computing system with a two level cache system.



Figure 2.10: Typical structure for two level cache, from [16].

Now we will take a closer look at what is inside a cache memory. Before we start we must state the number of bits m that uniquely identifies every line of memory in that computer system. This m bits permit access to  $M=2^m$  address lines or memory locations in the system. A cache memory for this system will have  $S=2^s$  cache sets, within each of these sets there will be a number of E cache lines, each line will have a data block of  $B=2^b$  bytes, t=m-(b+s) tag bits, that are used to uniquely identify the block stored in the cache line, and one valid bit that is used to indicate if the cache line either has or hasn't significant information [16]. An example of such a cache memory is illustrated in Figure 2.11.



Cache size:  $C = B \times E \times S$  data bytes

Figure 2.11: General organization of a cache memory, from [16].

Usually a cache memory's organization and size can be characterized by these four parameters: S, E, B, and M. Figure 2.12 illustrates the organization of the address of such a cache memory with the parameters discussed above.



Figure 2.12: Address organization of a cache memory, from [16].

A summary of the most usual cache memory parameters is presented in Figure 2.13.

| Fundamental parameters |                                               |  |  |  |  |
|------------------------|-----------------------------------------------|--|--|--|--|
| Parameter              | Description                                   |  |  |  |  |
| $S=2^s$                | Number of sets                                |  |  |  |  |
| E                      | Number of lines per set                       |  |  |  |  |
| $B=2^b$                | Block size (bytes)                            |  |  |  |  |
| $m = \log_2(M)$        | Number of physical (main memory) address bits |  |  |  |  |

| Derived quantities                                  |                                                                          |  |  |  |  |
|-----------------------------------------------------|--------------------------------------------------------------------------|--|--|--|--|
| Parameter Description                               |                                                                          |  |  |  |  |
| $M = 2^m$ Maximum number of unique memory addresses |                                                                          |  |  |  |  |
|                                                     | Number of set index bits                                                 |  |  |  |  |
| $b = \log_2(B)$                                     | Number of block offset bits                                              |  |  |  |  |
| t = m - (s+b)                                       | Number of tag bits                                                       |  |  |  |  |
| $C = B \times E \times S$                           | Cache size (bytes) not including overhead such as the valid and tag bits |  |  |  |  |

Figure 2.13: Cache parameters, from [16].

This concludes the present subsection of our thesis; we will not go any further in detail, in presenting the organization of cache memories, for this we will refer the reader to [16] [17].

#### 2.2.2 Set associative caches

The most usual method to group cache memories is after E, the number of lines in each set of the cache memory. After this classification the cache memories are split into three major groups: direct mapped cache memories, where E=1; set associative cache memories, where E>1, and also S>1; and in the last group are fully associative cache memories where S=1, i.e. there is only one set and a location from the main memory can be mapped in any line without restriction. An example of the differences in mapping between the three groups of cache memories is depicted in Figure 2.14.

We will start by providing the reader with a short description of direct mapped cache memories; our focus will mainly be on set associative cache memories, them being the object of this thesis. For a more detailed approach to direct mapped and fully associative cache memories the reader is referred to [16] [17].



Figure 2.14: Mapping differences between groups of caches, from [17].

As stated before a direct mapped cache memory is a cache memory that only has one line per set, i.e. E=1. Such a memory is depicted in Figure 2.15. This type of cache memory is the simplest and easiest to understand [16].



Figure 2.15: Direct mapped cache, from [16].

Set associative cache memories are those caches for which E>1, and also S>1, i.e. there is more than one line in each set of the memory. This provides an advantage from the direct mapped caches because a location from the main memory can be mapped in more than one place in the cache. This is being

particularly useful when working with array that have two or more dimensions. In Figure 2.16 is presented a 2-way set associative cache memory.



Figure 2.16: Organization of a 2-way set associative cache memory, from [16].

The access in a set associative cache memory is similar as in any other type of memory. First the set is selected as shown in Figure 2.17. After the set is selected the second task is to see if any line in that set matches the tag of the address requested by the CPU. If we have a line matching, which is also known as a cache hit, we proceed to the extraction of the word from the cache block. This is shown as an example in Figure 2.18.



Figure 2.17: Set selection in a set associative cache memory, from [16].

We will conclude this subsection with an example of a set associative cache memory from the microprocessor Alpha 21264. This is a 2-way set associative cache that contain 64KB of data, with the block size of 64 bytes. The organization of this memory is presented in Figure 2.19.



Figure 2.18: Line matching and set selection in a set associative cache memory, from [16].



Figure 2.19: The organization of the cache in Alpha 21264 microprocessor, from [17].

# 2.3 Memory testing

In this section we will provide our reader with the basics on functional models of memory chips, the errors that can appear in accordance with these models, and also some test methods that are used for memory testing.

# 2.3.1 Functional RAM chip models and faults

We will first present the functional model for a RAM memory with all of the main components, Figure 2.20 illustrates this.



Figure 2.20: DRAM memory model, from [18].

Since we will be working with cache memories that are SRAM memory types, from the DRAM memory model we will exclude the refresh logic, since the SRAM is non-volatile. Figure 2.21 shows a memory model for a SRAM type of memory.

Some of the functional faults that can appear in a RAM memory are illustrated in Table 2.1, the list is not complete. Note that we refer to a cell as an entity that stores data, and to a line as an entity that is used to transmit data from one entity to another.

As can be seen from Table 2.1, the list not being complete, the number of functional faults is very large. Given the large number of functional faults and the fact in order to test for each individual group of faults can be very expensive and very time consuming we can start grouping some of the elements of the memory as shown in Figure 2.22. As can be seen in Figure 2.22 the address latch, column decoder, row decoder and the connections between them are grouped in the address decoder, the memory cell array remains unchanged and the read/write logic has the following elements: write driver, sense amplifiers, data register and the connections between them.

The reduced functional model from Figure 2.22 generated the following types of errors: stuck-at faults, transition faults, coupling faults and neighborhood

pattern sensitive faults. Table 2.2 presents the reduced functional faults. As can be seen in this table the number of potential types of faults is reduced considerably, leaving only four categories of faults, that include all the other types of faults. This is a clear advantage, because with a smaller number of functional faults it is easier, cheaper and faster to test the memory chips.



Figure 2.21: SRAM memory model, from [18].

Table 2.1: RAM functional faults, from [18].

|   | Functional Fault                                  |
|---|---------------------------------------------------|
| а | Cell stuck                                        |
| b | Driver stuck                                      |
| С | Read/write line stuck                             |
| d | Chip-select line stuck                            |
| е | Data line stuck                                   |
| f | Open circuit in data line                         |
| g | Short circuit between data lines                  |
| h | Crosstalk between data lines                      |
| i | Address line stuck                                |
| j | Open in address line                              |
| k | Shorts between address lines                      |
| I | Open decoder                                      |
| m | Wrong address access                              |
| n | Multiple simultaneous address access              |
| 0 | Cell can be set to 0 but not to 1 (or vice versa) |
| р | Pattern sensitive cell interaction                |



Figure 2.22: Reduced functional model, from [18].

Table 2.2: Reduced functional faults, from [18].

| 1. SAF   | Stuck-At Fault                        |
|----------|---------------------------------------|
| 2. TF    | Transition Fault                      |
| 3a. CF   | Coupling Fault                        |
| 3b. NPSF | Neighborhood Pattern Sensitive Faults |

We can furthermore group the type of faults from Table 2.2 into three categories: faults involving one cell, faults involving two cells, and faults involving n cells. The classification is as follows [18]:

- Faults involving one cell:
  - o Stuck-At Faults (SAF)
  - o Transition Faults (TF)
- Faults involving two cells:
  - o Coupling Faults (CF)
- Faults involving n cells:
  - o The n cells are allowed to be located anywhere in the memory. These are the n-coupling, bridging and the state coupling faults
  - o The *n* cells are clustered together in a physical neighborhood. These are the Neighborhood Pattern Sensitive Faults (NPSF)

Table 2.3 describes the standard notations used when describing faults and types of faults as presented in [18].

This concludes this subsection of our thesis. In the following subsection we will provide the reader with a short description of each category of the reduced functional faults.

Table 2.3: Standard fault notations, from [18].

| 0                                                                                         | denotes that the cell is in a logical state 0                        |  |  |  |  |
|-------------------------------------------------------------------------------------------|----------------------------------------------------------------------|--|--|--|--|
| 1                                                                                         | denotes that the cell is in a logical state 1                        |  |  |  |  |
| X                                                                                         | denotes that the cell is in a logical state $x$ , where              |  |  |  |  |
|                                                                                           | $x \in \{0,1\}$                                                      |  |  |  |  |
| 1                                                                                         | denotes a write 0 operation to a cell containing 1                   |  |  |  |  |
| $\downarrow$                                                                              | denotes a write 0 operation to a cell containing 1                   |  |  |  |  |
| <b>\$</b>                                                                                 | denotes a write $\bar{x}$ operation to a cell containing an $x$      |  |  |  |  |
| $\rightarrow$                                                                             | denotes a write 0 operation to a cell containing an 0                |  |  |  |  |
| $\rightarrow$                                                                             | denotes a write 1 operation to a cell containing an 1                |  |  |  |  |
| $\Rightarrow$                                                                             | denotes a write x operation to a cell containing an x                |  |  |  |  |
| A                                                                                         | denotes any operation; ∀∈ {↑,↓,↑,→,⇒}                                |  |  |  |  |
| <>                                                                                        | denotes a particular fault; "" describes the fault                   |  |  |  |  |
| <i f=""></i>                                                                              | denotes a fault in a single cell                                     |  |  |  |  |
|                                                                                           | I describes the condition for sensitizing the fault:                 |  |  |  |  |
|                                                                                           | $I \in \{\uparrow, \downarrow, \uparrow, \rightarrow, \Rightarrow\}$ |  |  |  |  |
|                                                                                           | F describes the value of the faulty cell: F∈ {0,1,↑,↓                |  |  |  |  |
|                                                                                           | ,\$}                                                                 |  |  |  |  |
| <i1, i2,,="" in-1;<="" th=""><th colspan="4">denotes a fault involving n cells</th></i1,> | denotes a fault involving n cells                                    |  |  |  |  |
| In/F>                                                                                     | I1,, In-1 describes condition on the <i>n</i> -1 cells to            |  |  |  |  |
|                                                                                           | sensitize the fault in cell n                                        |  |  |  |  |
|                                                                                           | In describes the condition for the fault to be                       |  |  |  |  |
|                                                                                           | sensitized in cell $n$ . It may be empty (In=[]) in which            |  |  |  |  |
|                                                                                           | case In/F=[]/F can be written as F                                   |  |  |  |  |

#### 2.3.2 Reduced functional faults

## Stuck-At Faults

The most common definition of a stuck-at fault is: the logic value of a stuck-at line or cell has always the same logic value, either 0 (SA0 faults) or 1 (SA1 faults) [18]. The notation for a SA0 fault is  $< \forall/0>$ ; and for a SA1 fault  $< \forall/1>$ . A test that can detect and locate all stuck-at faults in a memory chip has to read a 0 and a 1 from each memory cell [18].

Figure 2.23a shows a state diagram for a healthy memory cell. In Figure 2.23b and Figure 2.23c are shown the state diagram for SAO and SA1, respectively. A cell has the logic value 0 in state 0 ( $S_0$ ), and the value 1 in the state  $S_1$ .



Figure 2.23: State diagram for SAF, from [18].

#### **Transition Faults**

The definition of transition faults is: A cell or line which fails to undergo a  $0 \to 1$  transition when it is written is said to contain an up transition fault; similarly, a down transition fault is the impossibility of making a  $1 \to 0$  transition [18]. The notation for the up TF, as shown in [18] is <1/0>, and for the down TF <1/1>.

The transition faults are a special case of stuck-at faults, in order for a better understanding of this we will provide the reader with a short example [18].

#### Example

Figure 2.24 shows a Set/Reset (S/R) flip-flop with the Reset stuck-at 0. In this situation the fault may be classified as a <1/1 > fault because the S/R flip-flop will fail to make a  $1 \rightarrow 0$  transition.



Figure 2.24: A flip-flop as a model for a transition fault, adapted from [18].

Transition faults cannot be treated as SAx faults because other faults, such as coupling faults, may bring the cell back into  $\operatorname{state} \overline{x}$ . So in order to test for transition faults we have to use a slightly more complex algorithm. A test that has to detect and locate al TFs, should satisfy the following requirements, according with [18]: Each cell must undergo a \(^1\) transition (state of the cell goes form 0 to 1), and a \(^1\) transition (state of the cell goes from 1 to 0), and be read after each transition before undergoing any further transitions.

The state diagram of a memory with a <1/0> transition fault is illustrated in Figure 2.25. The notations are the same as for the stuck-at faults.



Figure 2.25: State diagram for TF, from [18].

# **Coupling Faults**

Coupling faults are grouped according to these assumptions:

- 1. A read operation will not cause an error.
- 2. A non-transition write operation will not cause a fault.
- 3. A transition write operation may cause a fault.

The coupling faults that involve two cells, and that is used in [18] [19] [20] [21], has a definition as follows: a write operation which generates a  $\uparrow$  or a  $\downarrow$  transition in one cell changes the contents of a second cell.

The coupling fault that involves two cells is a special case of the more general case k-coupling fault that involves k cells and is defined as follows: is the same as the two coupling fault, but in addition the transition is only performed when the other k-2 cells are in a certain state [20]. If there is no restriction on the placement of the k cells the k-coupling fault is very complicated to test for [22].

The two coupling faults can be grouped in two types: inversion coupling faults and idempotent coupling faults, which will be briefly discussed. Special cases of coupling faults are state coupling faults and bridging faults, for detailed perspective these types of coupling faults we refer our reader to [18].

The inversion coupling faults (CFin) has the following definition: a  $\downarrow$  (or  $\uparrow$ ) transition in one cell inverts the contents of a second cell [18].

A test that detects all CFins must satisfy the following: "for all cells which are coupled cells, each cell should be read after a series of possible CFins may have occurred (by writing into the coupling cells), with the condition that the number of transitions in the coupled cell is odd (i.e. the CFins do not mask each other)" [18].

The idempotent coupling faults (CFid) has the following definition: A  $\downarrow$  (or  $\uparrow$ ) transition in one cell forces the contents of a second cell to a certain value, 0 or 1 [18].

A test that detects all CFids must satisfy the following: "for all cells which are coupled cells, each cell should be read after a series of possible CFids may have occurred (by writing into the coupling cells), in such, a way that the sensitized CFids do not mask each other" [18].

As a conclusion to the state coupling faults Figure 2.26 illustrates the state diagram of two good cells (a), the state diagram of a  $<\uparrow;\downarrow>$  CFin (b); and the state diagram of a  $<\uparrow;\downarrow>$  CFid (c).

# **Neighborhood Pattern Sensitive Faults**

The neighborhood pattern sensitive fault is a special case of the k-coupling fault, in the sense that the k-1 cells, beside the base cell are in the immediate vicinity of the base cell. In Figure 2.27 the NPSF terminology, as presented and used in [18], is depicted. There are three cases of NPSF: ANPSF (Active Neighborhood Pattern Sensitive Faults), PNPSF (Passive Neighborhood Pattern Sensitive Faults), and SNPSF (Static Neighborhood Pattern Sensitive Faults). In the following we will present a short description of each of these types of NPSF along with a testing requirement for each one, again for a more ample description we refer our readers to [18].



Figure 2.26: State diagrams involving two cells, from [18].

#### Memory array

|   | d |   |  |
|---|---|---|--|
| d | b | d |  |
|   | d |   |  |

b : base cell

d: deleted neighborhood cell

b+d: neighborhood

Figure 2.27: NPSF terminology, from [18].

In ANPSF due to a change in the deleted neighborhood pattern the base cell changes its contents. The change in the deleted neighborhood is a transition while the rest of the deleted neighborhood cells and the base cells have a certain pattern. In order to detect and locate ANPSFs a test must satisfy the following requirement: "each base cell must be read in state 0 and in state 1, for all possible changes in the deleted neighborhood pattern" [18].

In PNPSF due to a certain neighborhood pattern the content of the base cell cannot be changed. In order to detect and locate PNPSFs a test must satisfy the following requirement: "each base cell must be written and read in state 0 and in state 1, for all permutations of the deleted neighborhood pattern" [18].

In SNPSF a state of the deleted neighborhood pattern forces the content of the base cell to a certain value. In order to detect and locate SNPSFs a test must satisfy the following requirement: "each base cell must be read in state 0 and in state 1, for all permutations of the deleted neighborhood pattern" [18].

With this we conclude the present subsection dedicated to describing the most important possible types of faults. The next and last subsection of this chapter is dedicated to describe some traditional tests and some march tests along with their test times.

#### 2.3.3 Traditional and March Tests

In this subsection of our thesis we provide our reader with a brief description of the traditional test: zero-one, checkerboard, GALPAT and Walking 1/0, sliding diagonal, and butterfly. Also we will provide a short description of the march test MATS and MATS+, concluding this subsection with a comparison between the traditional tests and a couple of march tests. Table 2.4 summarizes the notation used throughout this subsection.

Table 2.4: Notation and abbreviations used in memory testing

| В | The number of bits (cells) in a memory word, thus the width of the memory     |
|---|-------------------------------------------------------------------------------|
| Ν | The number of address bits; the number of addresses will thus be $2^N$        |
| n | The total number of bits (cells) in the memory, which equals B·2 <sup>N</sup> |
| k | The size of the neighborhood                                                  |
| Α | An address                                                                    |
| С | A cell                                                                        |
| M | A set of cells, words or addresses                                            |
| r | A read (operation)                                                            |
| W | A write (operation)                                                           |

#### Zero-One

This is the simplest test for a memory chip. It consists of writing 1s and 0s in the memory cell array. The algorithm consists of four steps, see Figure 2.28. This algorithm is also known as MSCAN (Memory Scan) [18] [23].

Step 1: write 0 in all cells

Step 2: read all cells

Step 3: write 1 in all cells

Step 4: read all cells

Figure 2.28: Zero-One test algorithm, from [18].

This test detects all SAF, and also it detects some TF, and some CF. The test has a length of  $4 \cdot 2^N$ , and it is of order O(n) [18].

#### Checkerboard

For this test we first need to split the memory in two groups: group 1 and group 2, in a checkerboard pattern, as shown in Figure 2.29. Figure 2.30 presents the algorithm of the checkerboard test. The fault coverage is similar with the zero-one test, and also the number of operations is the same as the zero-one test, giving the checkerboard test an order of O(n) [18].

| 1 | 2 | 1 | 2 |
|---|---|---|---|
| 2 | 1 | 2 | 1 |
| 1 | 2 | 1 | 2 |
| 2 | 1 | 2 | 1 |

Figure 2.29: Cell numbering for checkerboard algorithm, from [18].

Step 1: write 1 in all cells-1 and 0 in all cells-2

Step 2: read all cells (words)

Step 3: write 0 in all cells-1 and 1 in all cells-2

Step 4: read all cells (words)

Figure 2.30: Checkerboard algorithm, from [18].

# GALPAT and Walking 1/0

These two tests are similar, that is why we present them together. First the memory is filled with 1s (or 0s), except for one cell, called the base cell that has the opposite value. For both these tests the base cell covers the whole memory. The difference between these two tests appears when the base cell is read: in GALPAT the base cell is read after each cell is read, while in Walking 1/0 the base cell is read only once after all the other cells have been read. This is depicted in Figure 2.31. The fault coverage for both these test, according with [18] is: all SAF, TF, CF are detected and located. Note that the tests are performed twice once with a 0 background and the second time with a 1 background. The order of both these test is  $O(n^2)$  [18] [24].



Figure 2.31: Read actions for GALPAT and Walking 1/0, from [18].

# **Sliding Diagonal**

The sliding diagonal has been developed as a shorter alternative to GALPAT, so instead of a single base cell as in GALPAT the sliding diagonal test uses an entire diagonal of base cells, making it faster but less efficient. Figure 2.32 shows the read actions for the sliding diagonal test. As stated before the fault coverage is smaller

than the GALPAT: some CF are detected and located, but not all of them; also this test detects and locates all SAF and TF. Due to the fact that sliding diagonal uses an entire diagonal instead of a single base cell the time order of this test is reduced to  $O(n^{3/2})$  [18].



Figure 2.32: Read actions for sliding diagonal, from [18].

## **Butterfly**

The butterfly test has been designed in order to reduce even more the test time of the GALPAT test, but with the purpose to only find SAF [18]. We will not go in detail with this algorithm, providing only a very short description of the reading of the cells. From GALPAT, only the reading of the cells differs, in that only the neighboring cells with the base cell are read. So the algorithm can detect and locate all SAF. The test order of the butterfly is O(nlogn) [18].

Before moving on to MATS and MATS+ test we will make a short observation regarding all of the march type tests. These tests are called march test because they "march" throughout the memory. A march element as described in [18] is "a finite sequence of the operations applied to every cell in the memory before proceeding to the next cell". The order of the addresses can either be increasing ( $\Uparrow$ ), decreasing ( $\Downarrow$ ), or unimportant ( $\Uparrow$ ). An example of a march element can be  $\Downarrow$ ( $\upmu(1,r1)$ ), that means that in every cell of the memory starting with the highest address and deceasing is first written a 1 and immediately is read a 1.

#### **MATS**

The MATS test or Modified Algorithm Test Sequence is the shortest march test [18], it detects all SAF. This test requires a number of 4n operations, having the test time order O(n). The basic scheme of the MATS test is illustrated in Figure 2.33.

$$\{ \updownarrow (w0); \updownarrow (r0,w1); \updownarrow (r1) \}$$

Figure 2.33: MATS test scheme, from [18].

Looking at the MATS in comparison with Zero-One or Checkerboard, which have the exact same number of operations performed on the memory cell array we can see a net superiority of the MATS test in the fault coverage [18].

#### MATS+

MATS+ is a special version of the MATS test, used when the technology of the memory chip is unknown [18] [25]. This test uses 5n operations, so has an order of O(n). The fault coverage is the same as the MATS test. The scheme of the algorithm is depicted in Figure 2.34.

$$\{ \updownarrow (w0); \, \pitchfork (r0,w1); \, \Downarrow (r1,w0) \}$$

Figure 2.34: MATS+ test scheme, from [18].

We will conclude this section with a summary of the tests described in this section alongside with some other march tests described in [18]. This summary is presented in Table 2.5. As can be observed from Table 2.5 the test times for even a small memory chip can be very high. Also in order to be able to apply these tests there are necessary special equipment outside the memory chip, these test equipment are very expensive because they are usually used only in one generation of chips, needing change after each technological improvement. Also in the last years the size of the memory has increased considerably without a corresponding increase in speed, this making the tests lengthy and sometimes even obsolete. Due to these facts and many other disadvantages the producers have started exploring alternatives to the old testing methods, these alternatives have developed in a general method called Built-In Self-Test that is integrated on the memory chip and permits the test of the chip, only by adding some extra pins, without the special equipment, or with some equipment that permit the production cost to be reduced. The Built-In Self-Test methods along with others of similar type will be presented in the next chapter.

Table 2.5: Comparison of memory test algorithms, from [18].

| Algorithm     |    | Fault Coverage |    |    |                    | Test Time      |        |
|---------------|----|----------------|----|----|--------------------|----------------|--------|
| Algorithm     | AF | SAF            | TF | CF | Others             | Order          | 1Mb    |
| Zero-One      | _  | L              | _  | _  |                    | n              | 0.42s  |
| Checkerboard  | -  | L              | -  | -  | Refresh            | n              | 0.52s  |
| Walking 1/0   | L  | L              | L  | L  | Sense amplif. rec. | n <sup>2</sup> | 2.5day |
| GALPAT        | L  | L              | L  | L  | Write recovery     | n <sup>3</sup> | 5.1day |
| GLAROW        | LS | L              | L  | L  | Write recovery     | $n\sqrt{n}$    | 7.2day |
| GLACOL        | LS | L              | L  | L  | Write recovery     | $n\sqrt{n}$    | 7.2day |
| Sliding Diag. | LS | L              | L  | _  |                    | $n\sqrt{n}$    | 10s    |
| Butterfly     | _  | L              | _  | _  |                    | 2n             | 3.6m   |
| MATS          | DS | D              |    |    |                    | n              | 0.42s  |
| MATS+         | D  | D              | -  | _  |                    | n              | 0.52s  |
| Marching1/0   | D  | D              | D  | _  |                    | n              | 1.5s   |
| MATS++        | D  | D              | D  | _  |                    | n              | 0.63s  |
| March X       | D  | D              | D  | D  | Unlinked CFins     | n              | 0.63s  |
| March C-      | D  | D              | D  | D  | Unlinked CFins     | n              | 1.0s   |
| March A       | D  | D              | D  | D  | Unlinked CFs       | n              | 1.6s   |
| March Y       | D  | D              | D  | D  | Linked TFs         | n              | 0.85s  |
| March B       | D  | D              | D  | D  | Linked CFs         | n              | 1.8s   |

L=Locate LS=Locate Some D=Detect DS=Detect Some

# 3 Built-In Self-Testing and Graceful Degradation

Throughout this chapter we will discuss the various methods used for Built-In Self-Test (BIST) for memory testing. Also we will provide a description of a method called graceful degradation, which, as its name suggests, allows the memory to continue functioning even after faults appear.

# 3.1 Memory Built-In Self-Test

We will start this section with a basic description of what BIST means and implies, and we will continue with a more detailed presentation of BIST methods used for memory testing.

## 3.1.1 Introduction to BIST

In the digital world everything eventually breaks down and stops functioning correctly. The most important thing to know is when to trust the result that a digital device provides to be correct and when not. The methods described in the previous section, though useful, are not practical because they need special equipment in order to be able to test a device. In order to eliminate this inconvenient the industry has provided a solution called Built-In Self-Test, which adds the extra logic needed for the test sequence on the chip of the circuit under test (CUT). The first digital systems to have a BIST were two Hewlett-Packard digital voltmeters, as described in [26]. The development cost and time increased by 1%, also there was a 1% increase in part costs, but the total costs dropped by 5% because the modularity of the system was no longer needed. Frohwerk describes in [27] a method for determining the correctness of a circuit by analyzing a *signature*. A *signature* is a statistical property of a circuit. In order to built BISTs for integrated circuits he applied the work of Peterson and Weldon [28] and Golomb [29] on error correcting codes and shift registers [30].

A digital system is diagnosed and tested during its lifetime on countless occasions. The tests and diagnosis must be quick and they need to have a very high fault coverage [30]. A way to ensure these restrictions is to specify a test as one of the system functions, so it becomes a self-test [30]. Many of the digital systems designed at AT&T around 1987 had self-tests, usually implemented in the software [30] [31]. Although this approach provided flexibility and its fault coverage and diagnosis weren't as high as expected [30]. This led to the building of the self-test function into the hardware [32] [33]. The earlier in the design stage the testing is considered the more efficient it is and the more the cost is reduced, this is because of the reduced number of prototypes and re-fabrications that are needed.

In the last few years due to the large integration the need for testing is greater than ever, that is why the great majority of the manufacturers, if not all of them, use BIST methods on a very large scale. The BIST solutions for testing can be applied to any digital system, but due to the fact that in our thesis we only discuss memory testing we will stop with this general introduction of BIST here, refereeing the reader for a more detailed description to [18] [30] [34].

# 3.1.2 Memory Built-In Self-Test

Random Access Memories (RAM) memories are perhaps the hardest elements in digital systems to test; this is because memory testing requires delivery of a huge amount of pattern stimuli to the memory. Also it requires the readout of an enormous amount of information [30]. With the memory Design for Testability (DFT) the most time consuming part is implemented on-chip, and it reduces the order of test time by a magnitude order [18]. The area overhead for memory DFT for a 4Mb DRAM is 1% [34]. The area overhead for memory BIST can be expected at around 2% [30].

The most important difference between memory BIST and memory DFT is that the memory BIST is completely self-contained, which means that all the functions required for the BIST are contained in the chip such that the test can be performed autonomously [18]. For DFT parts of a test are implemented on chip, these are the ones that provide with the largest reduction in test time. So this way the inner loops of a test algorithm can be executed by the DFT on the chip, while the other parts of the algorithm are executed, by externally providing certain control and test data and/or observing certain response data [18]. This is why the test times are in favor of the BIST when compared to DFT [34].

The most important advantages of BIST are: the test time, which is minimized (i.e. it is from 2 to 3 orders of magnitude faster than the conventional tests [18]); and the test is completely contained on the memory chip. The disadvantages of the BIST are: the area overhead is larger than DFT, usually with a factor of 2 [18]; it is only capable of implementing the tests for which it was designed.

The types of memory BIST are:

- Concurrent BIST
- Non-Concurrent BIST
- Transparent Testing

The concurrent BIST is a memory test mechanism where the memory can be tested concurrently with the normal system operation. This means that faults occurring during normal use of the memory can be detected, and depending on the complexity of the test even be corrected. For this type of BIST usually a form of information redundancy is used in the form of a parity bit or an error correcting code (ECC), which also increase the area overhead due to the extra information that has to be stored. The advantages for the concurrent BIST are that all faults, within the restrictions of the method used, are detected and/or corrected. This means that all permanent and transient faults are detected and/or corrected when they appear.

The disadvantages for the concurrent BIST are: the large area overhead needed, the performance penalty because of the constant need of checking the ECC, also the number and type of faults that can be corrected is limited, and so even if we have a complex concurrent BIST we cannot guarantee that the memory will be completely fault free. Note that the 100% certainty that the memory is fault free cannot be achieved by any kind of test.

The non-concurrent BIST is a memory test mechanism that requires interruption of the normal system function in order to perform the testing. The original memory contents are lost. The advantages of this kind of BIST are: maximum freedom in the data pattern used, more complex fault models can be detected. The disadvantages of the non-concurrent BIST are: the faults not covered by the BIST algorithm are not detected; the transient faults that occur between BIST periods are not detected, so only the permanent faults can be detected by this kind of BIST.

Transparent testing is a memory test mechanism that requires interruption of the normal system function for testing. The original memory contents are preserved in the memory after the testing is finished. Due to the fact that this is a particular method of non-concurrent BIST the advantages and disadvantages of the non-concurrent BIST also apply.

# 4 State of the art in graceful degradation techniques for cache memories

In the field of graceful degradation techniques a few ideas made themselves notable. Among this, the most important regarding SRAM memories and cache memories was presented in 2005 in an article called: "A Process-Tolerant Cache Architecture for Improved Yield in Nanoscale Technologies" [1]. In this article the authors describe a new graceful degradation method that is applied to cache memories, in order to improve their yield. In the following we will provide an ample description of the method presented in their article alongside with our comments, observations and remarks, providing our reader with a full overview of both the advantages and downfalls of the discussed method.

# 4.1 Description of a Process-Tolerant Cache Architecture Method

The method described in "A Process-Tolerant Cache Architecture for Improved Yield in Nanoscale Technologies", will be referred from hereon as PTCA (Process-Tolerant Cache Architecture). The PTCA method is used to improve the cache memories yield; the results are quite remarkable from a basic 33% yield, by applying the PTCA the yield will increase up to 94% [1]. But this can be only by having a direct intervention in the cache's memory architecture structure, in order to provide replacements for faulty cache cells.

The fault distribution for a cache memory is depicted in Figure 4.1. This is used by the authors in order to design their method and also for showing their results after the implementation of their methods.



Figure 4.1: Fault statistics of a 64-K cache, from [1].

The basic idea behind PTCA is to replace a faulty cache cell with a healthy neighbor cell, e.g. if there are 8 cells in a row and in one of them becomes faulty it will be replaced by one of the remaining seven cells in that row. When all cells in a

row become faulty the entire memory becomes faulty. The results and architecture presented for PTCA relate to a 64 kB direct mapped cache memory, as in Figure 4.2:



Figure 4.2: Block diagram of a 64-K cache macro, from [1].

The proposed architecture is depicted in Figure 4.3. In Figure the BIST used can be any BIST that can recognize a given number of faulty bits in a cell [36] [37]. The Config Storage is used in order to maintain the configuration of the faulty cache cells on the hard drive, even when the processor is turned off, and then reload that configuration on the next booting of the system. The Configurator configures the way that the faulty cells will be mapped in their row and gives information to the tester if the chip is faulty or not.

The convention is to store multiple cache blocks in the same row and access them simultaneously by the use of a word line [38]. So when a cache cell that is healthy is accessed nothing happens besides the normal cache access, but if a faulty cell is accessed then the Column Multiplexer forces the cell that will be accessed to be the healthy cell that replaces the faulty one. An example of this is presented in Figure 4.4.

With this type of remapping there can appear a few problems: if for example two cache cells in the same row have the same tag, one of them is faulty (cell "one"), and is replaced by the other one (cell "two"); then a problem appears if we are looking at this instructions:

STORE D "one"

LOAD "two" Register

The problem is that since both cells have the same tag, then cell "one" will be stored at the location of cell "two", then the processor will deal with a cache hit instead of a cache miss, due to the fact that they have the same tag, see Figure 4.5

(a). In order to deal with this problem the authors have proposed to include the column address bits into the tag bits, which will increase the size of the tag, see Figure 4.5 (b). After applying these modifications to the mapping, then the problem discussed, will be resolved as depicted in Figure 4.5 (c).



Figure 4.3: Architecture of a 64-K process-tolerant cache, from [1].



Figure 4.4: Resizing the cache, from [1].



Figure 4.5: Resizing of cache based on the fault location. (a) Mapping problem. (b) Extending tag bits. (c) Resolving the mapping problem.

The fault tolerance of the PTCA method is proportional with the number of columns in a row, so the more columns the higher the fault tolerance will become. In order to implement the Config Storage the authors have proposed two methods:

- 1) content addressable memory (CAM) implementation
- 2) one-bit implementation (OBI)

These two implementations are presented in Figure 4.6.

For the CAM implementation the fault locations (index bits) will be stored into a CAM [39]. The size of the CAM will depend on the total number of faults that need to be tolerated. A 100 entry CAM is depicted in Figure 4.6 (a).

The OBI adds one bit per cache block which tells the Controller if that block is faulty or not. An example of the OBI is depicted in Figure 4.6 (b).

Table 4.1 shows a comparison in terms of energy and performance between the CAM implementation and the OBI. While in Table 4.2 is presented how a four block per row cache is evolving when encountering four errors.



Figure 4.6: Config Storage. (a) CAM, an example to store 100 faults, (b) OBI, from [1].

In Figure 4.7 there are presented some probabilistic results that show the improvements resulted from the use of the PCTA method. Figure 4.7 (a) compares the PCTA with ECC (error correcting codes) and redundancy, Figure 4.7 (b) depicts the behavior when the OBI is used, Figure 4.7 (c) presents the PCTA with a redundancy used in the cache memory, while Figure 4.7 (d) depicts the ECC method with a redundancy option.

Table 4.1: Comparison of Energy and Performance between different Config Storage, from [1].

| Energy and Performance |          | 64KB<br>Cache | CAM<br>100 entry | CAM<br>200 entry | OBI<br>2k bit |  |
|------------------------|----------|---------------|------------------|------------------|---------------|--|
| Del                    | ay (ns)  | 3.86          | 0.88             | 1.11             | 1.81          |  |
| Energy (nJ)            | Match    | 1.00          | 0.036            | 0.074            | 0.034         |  |
|                        | No-Match | 1.89          | 0.031            | 0.063            |               |  |
| Г                      | Match    |               | 1.9%             | 3.9%             |               |  |
| Energy<br>overhead     | No-      | NA            | 1.6%             | 3.3%             | 1.8%          |  |
|                        | Match    |               | 1.070            | 3.370            |               |  |

Table 4.2: Column address selection based on fault location, from [1].

|                                        | P. L. I. C. I           | Accessed Column Address |              |              |          |
|----------------------------------------|-------------------------|-------------------------|--------------|--------------|----------|
| Faulty Blocks in                       | Fault Info by<br>Config | 00                      | 01           | 10           | 11       |
| Accessed Row                           | Storage                 | Forced Column Address   |              |              |          |
|                                        |                         | $\overline{}$           | $\downarrow$ | $\downarrow$ | <b>1</b> |
| None                                   | 0000                    | 00                      | 01           | 10           | 11       |
| 3rd Block                              | 0100                    | 00                      | 01           | 00           | 11       |
| 2 <sup>nd</sup> &3 <sup>rd</sup> Block | 0110                    | 00                      | 00           | 11           | 11       |
| 1st, 2nd & 3rd Block                   | 0111                    | 11                      | 11           | 11           | 11       |
| All four Blocks                        | 1111                    | NA                      | NA           | NA           | NA       |

# 4.2 Results of the Process-Tolerant Cache Architecture Method

In Figure 4.7 there are presented some probabilistic results that show the improvements resulted from the use of the PCTA method. Figure 4.7 (a) compares the PCTA with ECC (error correcting codes) and redundancy, Figure 4.7 (b) depicts the behavior when the OBI is used, Figure 4.7 (c) presents the PCTA with a redundancy used in the cache memory, while Figure 4.7 (d) depicts the ECC method with a redundancy option.



Figure 4.7: Probability of salvaging a chip versus fault probability for a 64-K cache. (a) Proposed architecture, ECC, and redundancy. (b) Proposed architecture with redundancy in OBI. (c) Proposed architecture with redundancy in cache. (d) ECC with redundancy, from [1].

In Figure 4.8 (a) the results in terms of yield are presented when OBI is used, while in Figure 4.8 (b) a comparison between the PCTA architecture implemented with OBI and redundancy, compared to the ECC method and the simple redundancy method in terms of yield.



Figure 4.8: Effective yield improvement. (a) Using proposed architecture along with redundancy to OBI. (b) Using different schemes with redundancy to cache. Plot 1: Redundancy to cache. Plot 2: ECC along with redundancy to cache. Plot 3: Proposed architecture along with redundancy to cache and OBI (r = 3), from [1].

Figure 4.9 shows the number of chips that can be saved after implementing the PCTA method, compared to the initial statistics of the faults of a cache memory.



Figure 4.9: Number of chips saved by proposed architecture and ECC with optimum redundancy versus number of faulty cells for a 64-K cache, from [1].

## 4.3 Conclusions and discussion

As a conclusion to the presented article we can say that even though it presents a number of improvements to the yield of cache memory chips, it also presents a set of disadvantages. The first and most important one is that in order to implement the PTCA method the manufacturer has to modify the reading and writing process of the cache memory in order to accommodate the changes that come with PTCA. A second disadvantage is that the yield improvement is dependent on the number of cache blocks that are accommodated within a row. But the most important downfall, in our opinion, is that the PTCA method does not address a very important issue, that is, usually errors appear in patterns (e.g. NPSF), and so it is more probable that a neighboring cell of a faulty cell to become faulty than any other. So the method by which a faulty cell is replaced with a neighboring healthy cell might, in time, not to be very efficient.

In order to address with these downfalls of PTCA and not only we will propose a method called Self Adaptive cache Memories, or SAM, this will be presented in the following chapter.

# 5 Self-Adaptive cache Memories

In this chapter we discuss an original graceful degradation method applied to k-way set associative cache memories. The method is called "Self Adaptive cache Memories" (SAM); it works by removing the faulty locations from use, while reorganizing the memory to maintain a high performance. For the proposed contribution, the analysis provided herein reveals a significant reliability increase for the cache memory, while the entailed overhead remains small in comparison with the attained goals.

#### 5.1 Introduction

As memory systems continue to decrease in size, the probability of hard, permanent faults increases especially in SRAM cells [1]. Due to this fact the usual method, using spare rows/columns, for preventing hard faults can become obsolete [1] [2]. The hard errors can appear due to process variation [1] [3] and aging [4].

We propose a new method called SAM (Self Adaptive cache Memories), which is used to disable from use the faulty cells that have been diagnosed as incurring hard errors. To this end, we will assume that the cache memory has a concurrent built in self-test (BIST) capable of detecting the errors may occur. Being a case of graceful degradation, this method will have a loss in performance because the size of the cache memory is decreasing [5] [6] [7]. The research presented herein aims at reducing that loss to a minimum by remapping some memory locations, and by the fact that the memory will be continuously adapting to new fault locations.

#### 5.1.1 L-Zone

First we need an extra bit for each memory cell; we will call this bit an 'L' bit. This bit allows us to separate the faulty cells from the non-faulty cells: if the L-bit of a cell is '1' it means that the cell is faulty and if the L-bit is '0' it means that the cell is working correctly.

For a simpler representation of the memory, we will separately present the L-Zone from the memory cell array. Taking a k-way set associative cache memory with n locations in each set; we consider having 5 faulty cells – represented by shaded cells in Figure 5.1 (a). Figure 5.1 (b) represents the corresponding L-Zone of the memory cell array.

The L-Zone is filled with zeros when the entire memory works correctly. When a hard error that cannot be corrected appears in a memory cell, the cell's corresponding L-bit becomes 1. An error is dealt with in the following way: if the concurrent BIST detects an error which it cannot correct, then the error type (hard or soft) will subsequently be determined; this can be done as simple as another read/write from/to the same cell. If it was a soft error, then at the next access of

the cell it has a very high probability of disappearing. If it disappears, it means that we don't have a hard error, so the memory can resume its normal functions. If at the next access the error persists, this means that we have to deal with a hard error and that particular memory cell can't be used any longer without possible data corruption. This is the point where the actual SAM method is taking over. If the BIST logic of the memory chip has a non-concurrent testing option and if the system in which the cache memory is working in allows it to be shut down for a period of time, than the non-concurrent BIST can be used as the next two (optional) steps of the algorithm: they consist of shutting down the memory for some time, so a non-concurrent test can determine the type of error that has been found and can generate a report to be processed by the CPU; this feature can be used by the manufacturer to make future improvements of the product. The final step is to make the cell's corresponding L-bit '1'. This step triggers the following operations:

- Checking if more than one location in a "line" is faulty. We refer to a line as all of the cache locations in which a main memory location can be mapped (see 5.1.2).
- Taking the preemptive measures in order to assure that no more than one location per line will be faulty (5.2.1).
- If there is no way that we can avoid more than one faulty location per line, then we have to decrease the set associativity of the cache. See 5.2.2 for details.
- In 5.2.3 we'll propose a method of reorganizing the memory in order to eliminate from use the faulty locations



Figure 5.1: SAM description. (a) Memory cell array; (b) L-Zone; (c) MTO column and counter, from [8].

## 5.1.2 "More Than One" column

If we have only a faulty cell on a line that means that the set associativity of that line is reduced by one, and we will encounter a problem if in some lines all the locations work correctly while in other lines we face two or more faulty locations. A

method for avoiding this situation will be discussed in section 5.2. In this section we will provide MTO (More Than One) column, as an instrument for preventing the problems listed above, which consists of one extra bit for each line of the cache memory, the so called MTO-bit.

Besides this column we need a counter to keep track of the numbers of 1s in the MTO column with the maxvalue=n, where n is the number of locations per set,  $n=(number\ of\ locations\ in\ cache)/k$  with the cache being k-way set associative. This counter will hold the number of encountered faults. We could use another method: a cascade of AND gates from the MTO column which will indicate if all MTO-bits are '1'; this can reduce the logic of the circuit, but it has a downfall: the exact number of faults that had occurred will be unknown, see Figure 5.1 (c).

The MTO-bit of a line becomes '1' when an error is found on that line, and it stays '1' until all of the MTO-bits are '1' and an error is discovered, then the whole MTO column will be reset to '0'. The MTO column along with the hard error signal will generate the following behavior: if the MTO-bit is '0' then it becomes '1'; else if the MTO-bit is '1' and we don't have an overflow from the counter, the MTO will generate a signal called L-Zone\_output which will indicate that we have more than one error in a line.

## 5.2 Modifications of the Set Associativity

### 5.2.1 Maintaining the set associativity

Maintaining the set associativity in a continuously degrading memory is a difficult task even if we can eliminate the faulty cells from use, because if – for instance – an entire line is eliminated the memory, it will work slowly or it won't work at all.

If we take the example described above, depending on the write policies we can have a very slow working system in case of a look-aside policy, and a faulty system in case of a look-through policy. In order to avoid such a case we implemented a replacement policy; see the algorithm in Figure 5.2.

We will focus on the "modify\_address\_to\_first\_not\_0\_in MTO\_column" instruction for this we will use an example. Considering the situation from Figure 5.3 (a) and we have a new uncorrectable error in line two set two. The memory contents will look like in Figure 5.3 (b), which will decrease the set associativity of line two with two while we still have lines with an intact set associativity; this is unacceptable. Therefore we search for the first line with the MTO-bit '0', in this case this is line one, and we'll need to "switch" the faulty cell with a healthy cell from the same line, in which case the transformation of the memory will look like Figure 5.3 (c).

```
if (hard_error)
    if (MTO[line]==0)
    {
        MTO[line]=1;
        L-bit[address]=1;
        counter=counter+1;
    }
    else if (overflow==0)
    {
        modify_address_to_first_not_0_in_MTO_column counter=counter+1;
    }
    else
    {
        counter=counter+1; //reset counter for (i=0; i<n; l++)
        MTO[i]=0;
    }
}</pre>
```

Figure 5.2: SAM algorithm, from [8].

The "modify\_address\_to\_first\_not\_0\_in\_MTO\_column" instruction does the followings: it searches for the first '0' in the MTO column (it is found because the counter hasn't reached an overflow), and it makes a "switch" between the last available memory location in that line with the faulty location. Note: the actual memory doesn't switch the locations physically, so the memory still looks like Figure 5.3 (b) for the considered example; it is a virtual switch because the faulty location cannot actually be replaced with the healthy location, but instead all of the operations on the faulty cell will be performed on the healthy cell. Section C explains the way to implement the switch.

### 5.2.2 Reducing the set associativity

If we encounter a number of m faulty locations, where m is a multiple of the number of locations per set, n (i.e.  $m=n\cdot l$ ,  $l\in\{1,\ldots,k\}$ , where k is the number of sets), in order to maintain a stable performance we are obliged to reduce the set associativity of the cache memory. This varies from cache memory to cache memory, mainly depending on the replacement policy that is being used. In this paper we will only discuss the reducing of set associativity for cache memories that use LRU (Last Recently Used) as replacement policy. A similar method can be used for FIFO (First In First Out) replacement policy, due to their similar implementation.

One of the implementations of the LRU algorithm is depicted in Figure 5.4 (a). The main idea is to maintain a list of cache set indices sorted from LRU to MRU (Most Recently Used) [8]. When a cache set is accessed its set index *s* is presented to the list, and that index is rotated to the MRU position at the end.



Figure 5.3: SAM remapping. (a) initial memory; (b) unacceptable error distribution;

(c) acceptable error distribution, from [8].

For using the SAM method it is more convenient to reduce the set associativity of each line, instead of just waiting until we encounter n errors. The reduction will be performed by moving the faulty cell address in the LRU index and, after that, the LRU-1 will become the LRU column, as in Figure 5.4 (b).

### 5.2.3 Reorganizing the memory

One final step that we have to discuss is the "switching" of the locations. The proposed method is somehow similar to the TLB (Translation Lookaside Buffer), meaning that we have a table with two columns: within the first we have the address of the faulty location, whereas within the second one we have an address of a healthy location which is taken from the first line in the memory cell array with the MTO-bit equal to '0'. See Figure 5.5, which is a simple example of cache memory with faulty locations.

The actual switching doesn't occur until the memory location (2,4) is accessed; then its L-bit being '1' and the address being found in the table, the location (4,4) is used instead.



Figure 5.4: (a) LRU algorithm (b) reducing the set associativity, from [8].

#### 5.3 Overhead

Giving the fact that the SAM method eliminates faulty memory cells from use, it is no reason to worry about encountering any other faulty locations besides the ones already eliminated. Our main concern is to find the most efficient size for

the switching table. To this end, we need to take into consideration the overhead that the table is generating and the number of faults that need to be tolerated.



Figure 5.5: Switching table, from [8].

In order to find the most efficient size of the switching table we resort to some probabilistic calculations. It is necessary to find the most probable distribution of the errors in the memory, after a number of *I* errors already occurred. We will consider that a new faulty location can appear anywhere in the memory with the same probability.

If we have a memory like the one in Figure 5.6, after l errors the possible locations in the faulty lines becomes:  $possibleF = x \cdot k - l$  while the one in the healthy locations:  $possibleH = (n-x) \cdot k$ , in order to be in the most probable case scenario, after n errors, the two would have to be equal: possibleH = possibleF which implies that:



Figure 5.6: Faulty/healthy cells memory organization, from [8].

**Example.** Consider a L2 cache, 2MB 8-way associative, with 256B block size, as described in [5]. We will calculate the overhead for this memory, for the case of the most probable scenario, as discussed above. The number of bits in the switching table will be  $\log_2(1024.8)=13$ , thus making the size of the switching table

equal to  $2\cdot 13=26$  bits. This number will be multiplied by the number of locations necessary in the switching table. We will calculate the overhead necessary in order to reduce the cache from an 8-way to a direct mapping. There are 448+439+427+410+384+342+256=2706 locations necessary in the switching table, thus making its size  $2706\cdot 26=70356$  bits, see Table 1. These bits are added to the ones from the MTO and L-Zone: n(k+1)=9216 bits, resulting in 79572 overhead bits. This will result in an overhead of 0.474% without taking into consideration the valid bit and the tag bits, see Figure 5.7.

Table 5.1: Numbers of locations required in the switching table, from [8].

| k   | 8   | 7   | 6   | 5   | 4   | 3   | 2   |
|-----|-----|-----|-----|-----|-----|-----|-----|
| Х   | 576 | 585 | 597 | 614 | 640 | 682 | 768 |
| n-x | 448 | 439 | 427 | 410 | 384 | 342 | 256 |

Compared to the method described in [8], where if a whole row becomes faulty, the yield will be decreased, SAM can maintain a cache memory working even if a whole line becomes faulty; this is done by the use of the switching table.



Figure 5.7: Overhead for each reduction of the set associativity, from [8].

#### 5.4 Conclusions

The main goal of this paper is to establish the theoretical foundations of the SAM method. By applying the SAM BIST method to a cache memory we increase its reliability by eliminating the faulty cells from the memory.

In order to give a rough estimate of the reliability of the memory we make a set of assumptions: the faults appear independent within the memory, without correlation, we take into consideration only the memory cell array; the SAM method is applied in order to sustain the reduction of the set associativity until direct mapping. We can say that the memory will stop working correctly after a number of  $(k-1)\cdot n+1$  faults. A fault in the memory appears with a p probability, so instead of a reliability R=1-p [40], we obtain a reliability  $R=1-p^{(k-1)\cdot n+1}$ , which means that we obtain a much more reliable memory system. Considering that a fault appears every 10 hours of continuous memory functioning, after introducing a concurrent BIST to the memory we increase that period to 100 hours. This can suffice to an application in which the reliability isn't as important as the performance but, for an application where the importance of reliability is paramount, this doesn't suffice. After introducing the SAM method to that memory system we can keep the memory functional not for 100 hours but for  $100 \cdot ((k-1) \cdot n+1)$  hours (e.g. k=8, n=1024 $\Rightarrow$ 100·((k-1)·n+1)=716900 hours, which means an improvement of 7169 times. This improvement is created at the cost of reducing the capacity of the memory. It is necessary to find a critical point at which the performance will decrease too much and the memory chip will need to be replaced. This critical point will differ from application to application.

By introducing a BIST which detects and corrects more errors, we can avoid eliminating some of the healthy cells in the memory; this can happen if another soft fault appears in the re-reading of the memory cell. Another way we can reintroduce some cells in the normal use is by a non-concurrent BIST test which determines if the cells in the L-Zone are truly faulty or have been eliminated by mistake. If any cells like this exist they can be taken out of the L-Zone and re-posses their place in the memory, hence increasing the reliability and the performance of the system.

The overhead introduced by the SAM method can be considered as small given the reliability which it provides, as it is presented in section 5.3. Because we seldom need to reduce the performance of the cache memory to a direct mapping, the overhead can be approximated by the one obtained at k/2 set associativity for which the overhead in the example proposed is 0.32%.

In short, the advantages brought by the SAM method greatly exceed the disadvantages and the shortcomings that were also identified in this paper. Another perspective on the contribution is that by creating a few extra misses in the memory cache we obtain a huge increase of the reliability of the memory.

### 5.5 Future work

Our future work will be focused mainly towards finding acceptable limits in performance loss; this will consist mainly of finding a capacity of the memory for which the performance loss is small to negligible. In order to make this work, we will try to start the memory at over 100% capacity and not decreasing it below 50% (e.g. we can start at 125% of the memory capacity needed and stop at 75%). Also for the increase of performance we will analyze the advantages and disadvantages

of adding a non-concurrent BIST which from time to time will test the memory to see if there are any locations that can be taken out of the L-Zone and given back to normal use. Another way will be that of a more sophisticated concurrent BIST to be used for the detection and correction of errors.

We will focus also on adding some extra logic in order to reduce the number of searches in the switching table for switched locations; this will have as effect an increase in performance.

The SAM method can also be applied in order to improve the yield of the memory.

We are also starting to test the performance decrease in comparison with previous methods described in other papers.

Another research direction will be the reduction of the overhead. For this, we can try to find a more exact distribution of the errors that appear in the memory cell array, by including the new faults induced by interactions between the already faulty cells and other, previously healthy cells.

Our last and most ambitious research objective is the application of the SAM method to any other memory types: from main memories to hard disks and flash.

# 6 Bibliography

- [1] A. Agarwal, B.C. Paul, H. Mahmoodi, A. Datta, and K. Roy, "A process-tolerant cache architecture for improved yield in nanoscale technologies," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 13, no. 1, pp. 27 38, January 2005.
- [2] A. Sasan, H. Homayoun, A. Eltawil, and F. Kurdahi, "Process Variation Aware SRAM/Cache for Aggressive Voltage-Frequency Scaling," in *Design, Automation & Test in Europe Conference & Exhibition*, Nice, 2009, pp. 911 916.
- [3] S. Ramaswamy and S. Yalamanchili, "Customizable Fault Tolerant Caches for Embedded Processors," in *International Conference on Computer Design*, San Jose, CA, 2006, pp. 108 113.
- [4] J. Srinivasan, S.V. Adve, P. Bose, and J.A. Rivers, "The impact of technology scaling on lifetime reliability," in *International Conference on Dependable Systems and Networks*, Florence, 2004, pp. 177-186.
- [5] Lee Hyunjin, Cho Sangyeun, and Bruce R. Childers, "Performance of Graceful Degradation for Cache Faults," in *IEEE Computer Society Annual Symposium on VLSI*, Porto Alegre, 2007, pp. 409 415.
- [6] Lee Hyunjin, Cho Sangyeun, and Bruce R. Childers, "Exploring the interplay of yield, area, and performance in processor caches," in *25th International Conference on Computer Design*, Lake Tahoe, CA, 2007, pp. 216 223.
- [7] Premkishore Shivakumar, S.W. Keckler, C.R. Moore, and D. Burger, "Exploiting microarchitectural redundancy for defect tolerance," in *21st International Conference on Computer Design*, San Jose, CA, 2003, pp. 481 488.
- [8] Liviu Agnola, Mircea Vladutiu, and Mihai Udrescu, "Self-Adaptive mechanism for cache memory reliability improvement," in *IEEE 13th International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS)*, Vienna, 2010, pp. 117 118.
- [9] Algirdas Avizienis, Jean-Claude Laprie, Brian Randel, and Carl Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing," *IEEE Transactions on Dependable and Secure Computing*, vol. 1, no. 1, pp. 11-33, January-March 2004.
- [10] Qaulity Concepts and Terminology, part 1: Generic Terms and Definitions, 1992, Document ISO/TC 176/SC 1 N93, February 1992.
- [11] Industrial-Process Measurements and Control Evaluation of System Properties for the Purpose of System Assessment, Part 5: Assessment of System Dependability, 1992, Draft, Publication 1069-5, International Electronical Commission (IEC) Secretariat, February 1992.
- [12] M.C. Paulk, B. Curtis, M.B. Chrissis, and C.V. Weber, "Capability maturity model, version 1.1," Carnegie Mellon University, Pittsburgh, PA, Technical

- CMU/SEI-93-TR-24, ESC-TR-93-177, 1993.
- [13] R. Chillarege et al., "Orthogonal defect classification-a concept for in-process measurements," *IEEE Transactions on Software Engineering*, vol. 18, no. 11, pp. 943 956, November 1992.
- [14] Algirdas Avižienis, "Design of fault-tolerant computers," in *Proceedings of the November 14-16, 1967, fall joint computer conference AFIPS Joint Computer Conferences*, Anaheim, CA, 1967, pp. 733-743.
- [15] A. Fox and D. Patterson, "Self-Repairing Computers," *Scientific American*, vol. 288, no. 6, pp. 54-61, June 2003.
- [16] Randal E. Bryant and David R. O'Hallaron, *Computer Systems: A Programmer's Perspective*, 2nd ed. Upper Saddle River, NJ, USA: Prentice Hall, 2001.
- [17] John L. Hennessy and David A. Patterson, *Computer Architecture: A Quantitative Approach*, 4th ed. San Francisco, CA, USA: Elsevier, Morgan Kaufamann, 2007.
- [18] A. J. van de Goor, *Testing semiconductor memories: theory and practice*. New York, USA: John Wiley & Sons, Inc., 1991.
- [19] M. Marinescu, "Simple and Efficient Algorithms for Functional RAM Testing," in *IEEE Test Conference*, Philadelphia, 1982, pp. 236-239.
- [20] R. Nair, S. M. Thatte, and J. A. Abraham, "Efficient Algorithms for Testing Semiconductor Random-Access Memories," *IEEE Transactions on Computers*, vol. 27, no. 6, pp. 572-576, June 1978.
- [21] D.S. Suk and S.M. Reddy, "A March Test for Functional Faults in Semiconductor Random Access Memories," *IEEE Transactions on Computers*, vol. 30, no. 12, pp. 982 - 985, December 1981.
- [22] C. A. Papachristou and N. B. Sahgal, "An Improved Method for Detecting Functional Faults in Semiconductor Random Access Memories," *IEEE Transactions on Computers*, vol. 34, no. 2, pp. 110-116, February 1985.
- [23] Magdy S. Abadir and Hassan K. Reghbati, "Functional Testing of Semiconductor Random Access Memories," *ACM Computing Surveys*, vol. 15, no. 3, pp. 175 198, September 1983.
- [24] M.A. Breuer and A.D. Friedman, *Diagnosis and Reliable Design of Digital Systems*, 1st ed. Maryland, USA: Computer Science Press, 1976.
- [25] R. Nair, "Comments on "An Optimal Algorithm for Testing Stuck-at Faults in Random Access Memories"," *IEEE Transactions on Computers*, vol. 28, no. 3, pp. 258-261, March 1979.
- [26] G. Gordon and H. Nadig, "Hexadecimal Signatures Identify Troublespots in Microprocessor Systems," *Electronics*, vol. 50, no. 5, pp. 89-96, March 1977.
- [27] R. A. Frohwerk, "Signature Analysis: A New Digital Field Service Method," Hewlett-Packard Journal, vol. 28, no. 9, pp. 2-8, May 1977.
- [28] W. W. Peterson and E. J. Weldon, Error-Correcting Codes. New York, USA: John

- Wiley & Sons, 1972.
- [29] S. W. Golomb, *Shift Register Sequences*. Laguna Hills, CA, USA: Aegean Park Press, 1982.
- [30] Michael L. Bushnell and Vishwani D. Agrawal, *Essentials of Electronic Testing for Digital, Memory & Mixed-Signal VLSI Circuits*. New York, NY, USA: Springer, 2000.
- [31] V. D. Agrawal et al., BIST at Your Fingertips Handbook, June, 1987, AT&T.
- [32] V. D. Agrawal, C.R. Kime, and K. K. Saluja, "A Tutorial on Built-In Self-Test, Part 1: Principles," *IEEE Design & Test of Computers*, vol. 10, no. 1, pp. 73-82, March 1993.
- [33] V.D. Agrawal, C. R. Kime, and K. K. Saluja, "A Tutorial on Built-In Self-Test, Part 2: Applications," *IEEE Design & Test of Computers*, vol. 10, no. 2, pp. 69-77, June 1993.
- [34] Daniel P. Siewiorek and Robert S. Swarz, *Reliable Computer Systems Design and Evaluation*, 3rd ed. Natick, MA, United States of America: A K Peters, 1998.
- [35] R. Kraus, O. Kowarik, K. Hoffmann, and D. Oberle, "Design for Test of Mbit DRAMs," in *Proceeding of the International Test Conference*, Washington DC, USA, August, 1989, pp. 316-321.
- [36] S. Nakahara, K. Higeta, M. Kohno, T. Kawamura, and K. Kakitani, "Built-in self-test for GHz embedded SRAMs using flexible pattern generator and new repair algorithm," in *Proceedings of the International Test Conference*, Atlantic City, NJ, USA, 1999, pp. 301-310.
- [37] M.H. Tehranipour, Z. Navabi, and S.M. Fakhraie, "An efficient BIST method for testing of embedded SRAMs," in *The 2001 IEEE International Symposium on Circuits and Systems*, Sydney, NSW, Australia, 2001, pp. 73-76.
- [38] D. Weiss, J.J. Wuu, and V. Chin, "The on-chip 3-MB subarray-based third-level cache on an Itanium microprocessor," *IEEE Journal of Solid-State Circuits*, vol. 37, no. 11, pp. 1523-1529, November 2002.
- [39] H. Miyatake, M. Tanaka, and Y. Mori, "A design for high-speed low-power CMOS fully parallel content-addressable memory macros," *IEEE Journal of Solid-State Circuits*, vol. 36, no. 6, pp. 956-968, June 2001.
- [40] Martin L. Shooman, *Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design*. New York, Unites States of America: John Wiley & Sons, 2002.
- [41] Shuai Wang, Jie Hu, and Sotirios G. Ziavras, "On the Characterization and Optimization of On-Chip Cache Reliability against Soft Errors," *IEEE Transactions on Computers*, vol. 58, no. 9, pp. 1171-1184, September 2009.
- [42] Lucian Prodan, Mihai Udrescu, and Mircea Vladutiu, "Self-Repairing Embryonic Memory Arrays," in *2004 NASA/DoD Conference on Evolution Hardware*, Seattle, Washington, USA, 2004, p. 130.

- [43] P. H Bardell, W. H. McAnney, and J. Savir, *Built-In Test for VLSI:* Pseudorandom Techniques. New York, NY, USA: John Wiley & Sons, 1987.
- [44] M. Nicolaidis, "An Efficient Built-In Self-Test Scheme for Functional Test of Embedded RAMs," in *Proceeding of the IEEE Fault Tolerant Computer Systems Conference*, Ann Arbor, MI, USA, 1985, pp. 118-123.