https://iecscience.org/journals/JIEC ISSN Online: 2643-8240

# A Novel Scheme for Fault Tolerant Computing

## Pramode Ranjan Bhattacharjee

Retired Principal, Kabi Nazrul Mahavidyalaya, Sonamura, Tripura 799131, India Email: drpramode@rediffmail.com

**How to cite this paper:** Pramode Ranjan Bhattacharjee (2021). A Novel Scheme for Fault Tolerant Computing. Journal of the Institute of Electronics and Computer, 3, 17-23.

https://doi.org/10.33969/JIEC.2021.31002.

Received: July 2, 2021 Accepted: July 22, 2021 Published: August 9, 2021

Copyright © 2021 by author(s) and Institute of Electronics and Computer. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).

http://creativecommons.org/licenses/by/4.0/





#### **Abstract**

A novel scheme for ensuring reliability in the operation of a combinational digital network has been offered in this paper. This has been achieved by making use of three copies of the same digital network along with two additional sub-networks, one of which consists of three additional control inputs, which can also be used as additional observable outputs. If both the said two sub-networks are fault free, then the primary output of the network in the present scheme will always give fault-free responses even if a fault (single or multiple) occurs in one of the three copies of the digital network under consideration. Unlike the Triple Modular Redundancy (TMR) scheme, the present scheme does not require any majority voter circuit. Furthermore, unlike the TMR scheme, the additional sub-networks in the present scheme can be tested off-line by predefined test input patterns.

#### **Keywords**

Combinational digital network, Stuck-at fault, S-a-0 and S-a-1 faults, Reliability of a digital network, Fault masking

#### 1. Introduction

It is well known that a fault-tolerant system continues to work properly even in the presence of faults. The Triple Modular Redundancy (TMR) scheme [8] is one such fault-tolerant scheme. It uses three independent copies of the same digital IC and a voter (also known as the majority voter circuit), which is a simple AND-OR circuit. The TMR scheme masks a fault occurring in one hardware unit and can be regarded as a hardware redundancy scheme for ensuring the reliability of the system.
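For reference, the masking action of the classical TMR voter can be sketched in a few lines of Python. This is only an illustration of the baseline and not part of the proposed scheme; the function names `majority` and `tmr_masks_single_fault` are hypothetical, and each module output is modelled as a single bit.

```python
def majority(a: int, b: int, c: int) -> int:
    """Two-level AND-OR majority voter: ab + bc + ca."""
    return (a & b) | (b & c) | (a & c)


def tmr_masks_single_fault() -> bool:
    """Check that one wrong module output never changes the voted result."""
    for correct in (0, 1):
        for faulty_module in range(3):
            outputs = [correct, correct, correct]
            outputs[faulty_module] = 1 - correct  # this module gives the wrong bit
            if majority(*outputs) != correct:
                return False
    return True


if __name__ == "__main__":
    print(tmr_masks_single_fault())
```

The check covers only a wrong output from a single module; a fault in the voter itself is outside this sketch.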

Testing of digital circuits has two broad contexts. The major context (and the one normally implied) is the 'acceptance testing' of ICs after fabrication. Here test vectors (usually derived from the stuck-at fault model) are used [1]-[7] to distinguish good devices from faulty ones. The second context is the testing of circuit boards built from ICs that have already passed acceptance testing, together with periodic maintenance checking [8]-[11].

With a view to facilitating periodic maintenance checking, a scheme for designing a reliable digital network is offered in this paper. Like the Triple Modular Redundancy (TMR) scheme, the present scheme uses three copies of the same digital network (considered in this paper to be a combinational network), each of which has already passed acceptance testing. It adds two sub-networks with an additional observable output and three additional control inputs, which can also be used as additional observable outputs. Unlike the TMR scheme, however, the present scheme does not require a majority voter circuit, and its additional sub-networks can be tested by predefined test input patterns. A fault-free response at the primary output can be ensured by observing the responses at the three additional outputs. Furthermore, if a fault develops in an IC in the course of operation, the faulty IC can be identified from the responses of these three additional observable outputs. It can then be removed and replaced by another copy of the digital network that has already passed acceptance testing, before another IC develops a fault. Thus the reliability of the digital network formed by combining three copies of the combinational network under consideration can be enhanced through periodic maintenance testing.

#### 2. The Novel Scheme for Ensuring Reliability of a Digital Network

The novel design for ensuring reliability of a digital network is shown in Fig. 1. The design consists of three copies of the digital network (assumed in this paper to be a combinational network) whose reliability has to be ensured in the course of operation. Each of these three digital networks has $n$ inputs $x_1, x_2, \dots, x_n$ and one output. The three output lines of the three copies drive an OR gate with a single output. There are two sub-networks $N_1$ and $N_2$ in the proposed design. The sub-network $N_1$ consists of an AND gate driven by the primary input lines, with an additional observable output $O_1$. The sub-network $N_2$, as shown in Fig. 1, has three additional control input lines, which can also be used as additional output lines, together with three AND gates, six NOT gates, two OR gates, and an EXCLUSIVE-OR gate whose output is the primary output $O$.

It is worth mentioning that, prior to the start of operation, the two sub-networks N<sub>1</sub> and N<sub>2</sub> should be tested off-line for single stuck-at faults to ensure that they are fault free. N<sub>1</sub> is tested by applying the primary inputs and observing the additional output O<sub>1</sub>, while N<sub>2</sub> is tested by applying the three additional control inputs and observing the primary output O. If both sub-networks are fault free, then in the course of operation the primary output O will always give fault-free responses to the primary input patterns, even if a fault (single or multiple) occurs in any one of the three copies of the digital network.

Furthermore, to ensure the reliability of the proposed scheme, if a fault develops in any one of the three copies of the digital network, the faulty module must be identified and replaced by another copy of the same digital network that has already passed acceptance testing, before a second or third module in the design develops a fault. This can be done by observing the simultaneous responses of the three additional output lines (which may also be used as additional control inputs) of the sub-network N<sub>2</sub>. If, in the course of operation, the response of the three additional outputs of N<sub>2</sub> is either 000 or 111, the primary output O gives the fault-free response. If the response is 011 or 100, a fault exists in the first copy of the digital network; if it is 101 or 010, a fault exists in the second copy; and if it is 110 or 001, a fault exists in the third copy. In each of these cases the primary output O still gives the correct response. In this way the faulty copy can be identified and replaced by a fault-free copy before either of the other two copies develops a fault, thereby ensuring the reliability of the digital network in the course of operation.

#### 3. Theorems

Theorem 1: The complete test set $T_1$ for detection of a single fault in the additional sub-network $(x_1, x_2, ..., x_n, O_1)$ is given by $T_1 = \{011 ... 1, 101 ... 1, ..., 11 ... 10, 111 ... 1\}$.

*Proof:* It is well known that an irredundant combinational network can be tested for a single fault (stuck-at 0 or stuck-at 1) occurring anywhere in the network if all its primary inputs are tested for both stuck-at 0 and stuck-at 1 single faults. Now, it can be readily seen that the additional sub-network $(x_1, x_2, ..., x_n, O_1)$ in the proposed scheme is an irredundant combinational network. Hence detection of single faults (both stuck-at 0 and stuck-at 1) on every primary input line of this additional sub-network, by applying the primary input patterns belonging to the set $T_1 = \{011 ... 1, 101 ... 1, ..., 11 ... 10, 111 ... 1\}$, is sufficient to detect a single fault (stuck-at 0 or stuck-at 1) occurring anywhere in the said additional sub-network.
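The claim of Theorem 1 can be checked by brute force on a small model. The sketch below assumes, following Section 2, that the sub-network is a single $n$-input AND gate, and it takes the fault list to be stuck-at 0 and stuck-at 1 on each input line and on the output line. The function names (`build_t1`, `faulty_and`, `undetected_faults`) are hypothetical and introduced here only for illustration.

```python
def and_gate(bits):
    """Fault-free n-input AND gate."""
    return int(all(bits))


def build_t1(n):
    """n patterns with a single 0, followed by the all-ones pattern."""
    single_zero = [tuple(0 if j == i else 1 for j in range(n)) for i in range(n)]
    return single_zero + [(1,) * n]


def faulty_and(bits, line, stuck):
    """AND gate with one line stuck; line n denotes the output line."""
    if line == len(bits):
        return stuck
    forced = list(bits)
    forced[line] = stuck
    return and_gate(forced)


def undetected_faults(n, tests):
    """List the (line, stuck_value) faults that no test pattern exposes."""
    return [
        (line, stuck)
        for line in range(n + 1)
        for stuck in (0, 1)
        if all(faulty_and(t, line, stuck) == and_gate(t) for t in tests)
    ]


if __name__ == "__main__":
    n = 4
    print(build_t1(n))
    print(undetected_faults(n, build_t1(n)))  # empty list: every fault is detected
```

In this model the all-ones pattern exposes every stuck-at 0 fault, and the pattern with a 0 in position $i$ exposes the stuck-at 1 fault on input $x_i$, so dropping any one pattern from $T_1$ leaves at least one fault undetected.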

Theorem 2: All detectable single faults (stuck-at 0 and stuck-at 1) in the additional sub-network $(y_1, y_2, y_3, O)$ can be detected if all the additional control inputs $y_1, y_2, y_3$ of this sub-network are tested for stuck-at 0 and stuck-at 1 single faults by applying, at most, all the possible additional control input patterns belonging to the set $T_2 = \{000, 001, 010, 011, 100, 101, 110, 111\}$.

*Proof:* The proof of this theorem is obvious, since $T_2$ contains all $2^3 = 8$ possible patterns of the three additional control inputs.



**Figure 1.** Diagram showing the novel TMR scheme for ensuring reliability of operation of a digital chip.

#### 4. Operation of the Novel Design

To discuss the operation of the novel design shown in Fig. 1, it should be noted that each of the three digital ICs realizes the same combinational network and that each has passed acceptance testing after fabrication. The reliability of operation of the proposed design rests on the assumption that, in the course of operation, only one of the three digital ICs develops a fault. Prior to online operation, the two additional sub-networks $(x_1, x_2, ..., x_n, O_1)$ and $(y_1, y_2, y_3, O)$ should be tested off-line to ensure that both are free from any detectable single fault (stuck-at 0 or stuck-at 1). The primary output O will then always give fault-free responses to the primary input patterns, regardless of whether all three digital ICs are fault free or a fault (single or multiple) exists in any one of them. This can be explained as follows.

It can be easily seen from Fig. 1 that, under the application of each of the primary input patterns belonging to the set of minterms realized by the digital IC under consideration, one of the inputs of the EXCLUSIVE-OR gate will be led to logical 0, while the other input of the EXCLUSIVE-OR gate will be led to logical 1 regardless of whether all the three copies of the digital ICs are fault free or any one of the three copies of the digital ICs develops a fault (single or multiple).

Thus for each such primary input pattern belonging to the set of minterms realized by the digital IC under consideration, the primary output O will always give the correct response, logical 1.

Similarly, under the application of each of the primary input patterns belonging to the set of non-minterms realized by the digital IC under consideration, both inputs of the EXCLUSIVE-OR gate will be led to logical 0 if all the three copies of the digital IC are fault free, as a consequence of which the primary output O will be led to logical 0. Now, if in the course of operation one of the three copies of the digital IC develops a fault, that fault may lead both inputs of the EXCLUSIVE-OR gate to logical 1, thereby again leading the primary output to logical 0 for each such primary input pattern. Thus for each primary input pattern belonging to the set of non-minterms realized by the digital IC under consideration, the primary output O will always give the correct response, logical 0.
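The case analysis above can be summarised compactly. Here $a$ and $b$ denote the two inputs of the EXCLUSIVE-OR gate; these labels are introduced only for illustration and do not appear in Fig. 1. The summary assumes fault-free sub-networks and a fault in at most one copy of the digital IC.

$$
O = a \oplus b =
\begin{cases}
1, & (a, b) \in \{(0,1), (1,0)\} \quad \text{(minterm input, with or without a fault in one copy)} \\
0, & (a, b) = (0,0) \quad \text{(non-minterm input, all three copies fault free)} \\
0, & (a, b) = (1,1) \quad \text{(non-minterm input, a fault in one copy leads both inputs to 1)}
\end{cases}
$$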

Finally, it is worth mentioning that if, in the course of operation, one of the three copies of the digital IC develops a fault, that copy can be identified by observing the responses of the additional outputs $y_1, y_2, y_3$. For example, if under the application of any primary input pattern the response of these additional outputs is 100, then the first copy of the digital IC has developed a fault. If the response at any time is 010, the second copy has developed a fault, and if it is 001, the third copy has. Similarly, the first, second or third copy has developed a fault according to whether the response of the additional outputs $y_1, y_2, y_3$ is 011, 101 or 110, respectively. Once a faulty copy has been identified, it has to be replaced by another copy of the same digital IC that has already passed acceptance testing, before the second or third copy develops a fault, to ensure the reliability of the proposed scheme.
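The identification rule described above amounts to finding the one bit of $(y_1, y_2, y_3)$ that differs from the other two. A minimal Python sketch follows. The function name `locate_faulty_copy` is hypothetical, and returning `None` for the responses 000 and 111 is an assumption made here, since the text does not attribute those responses to any particular copy.

```python
def locate_faulty_copy(y1: int, y2: int, y3: int):
    """Return 1, 2 or 3 for the copy flagged as faulty, or None for 000/111."""
    bits = (y1, y2, y3)
    if bits[0] == bits[1] == bits[2]:
        return None  # assumption: 000 and 111 point to no particular copy
    for i in range(3):
        others = [bits[j] for j in range(3) if j != i]
        if others[0] == others[1]:
            return i + 1  # bit i is the odd one out; copies are numbered from 1
    return None  # not reached for binary inputs


if __name__ == "__main__":
    for response in ("100", "010", "001", "011", "101", "110", "000", "111"):
        y = [int(ch) for ch in response]
        print(response, locate_faulty_copy(*y))
```

The odd-one-out formulation reproduces all six cases listed in the text: 100 and 011 map to the first copy, 010 and 101 to the second, and 001 and 110 to the third.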

#### 5. Conclusion

A novel design for ensuring reliability in the operation of a combinational digital network has been presented in this paper. The novelty and benefits of the present design discipline must be sought in the following.

- (i) Unlike the Triple Modular Redundancy (TMR) scheme, the present scheme does not require the majority voter circuit.
- (ii) Unlike the TMR scheme, the additional sub-networks in the present scheme can be tested off-line under the application of pre-defined test input patterns and hence no test generation technique [1]-[7] is needed in such regard.
- (iii) If, in the course of operation, any one of the three copies of the digital network becomes faulty, the faulty copy can be identified by observing the responses at the three additional outputs (each of which can also be used as an additional control input). It may then be replaced by another copy of the digital network that has already passed acceptance testing, before a second copy in the proposed design develops a fault, so as to ensure reliability of operation of the scheme offered.
- (iv) The present scheme will work if in course of operation, a fault (single or multiple) occurs in any one copy of the three digital ICs.
- (v) The proposed design discipline considered in this paper is simple, novel and interesting from the view point of academic interest.

#### **Conflicts of Interest**

The author declares that there are no Conflicts of Interest for the present work. The work has been carried out exclusively by the author himself and no fund is associated with this work.

#### References

- [1] Yau, S. S. and Tang, Y. S. (1971) An efficient algorithm for generating complete test sets for combinational logic circuits. *IEEE Trans. Comput.*, C-20, 1245-1251.
- [2] Ku, C. T. and Masson, G. M. (1975) The Boolean difference and Multiple fault Analysis. *IEEE Trans. Comput.*, C-24, 62-71.
- [3] Wang, D. T. (1975) An algorithm for the generation of test sets for combinational logic networks. *IEEE Trans. Comput.*, C-24, 742-746.
- [4] Papaioannou, S. G. (1977) Optimal test generation in combinational networks by pseudo-Boolean programming. *IEEE Trans. Comput.*, C-26, 553-560.
- [5] Ahmed, Z. (2012) A method for test pattern generation of combinational circuits using ordinary algebra. *International Journal of Computer and Information Technology*, 1, 1-8.
- [6] Roth, J. P. (1966) Diagnosis of automata failure: A calculus and a Method. *IBM J. Res. Develop.*, 10, 278-291.
- [7] Fujiwara, H. and Shimono, T. (1983) On the Acceleration of Test Generation Algorithms. *IEEE Trans. Comput.*, C-32, 1137-1144.
- [8] Von Neumann, J. (1956) Probabilistic Logics and Synthesis of Reliable Organisms from Unreliable Components. In: Shannon, C. E. and McCarthy, J., Eds., *Automata Studies*, Annals of Mathematics Studies, No. 34, Princeton University Press, Princeton, 43-98.
- [9] Isermann, R. (2006) Fault-Diagnosis Systems: An Introduction from Fault Detection to Fault Tolerance. Springer Science & Business Media, Berlin/Heidelberg, Germany.
- [10] Kshirsagar, R. V. and Patrikar, R. M. (2009) Design of a novel fault-tolerant voter circuit for TMR implementation to improve reliability in digital circuits. *Microelectron. Reliab.*, 49, 1573–1577.
- [11] Kohavi, Z. (1978) Switching and Finite Automata Theory, 2<sup>nd</sup> edition, Tata McGraw-Hill Publishing Company Limited, New Delhi, India.