FAULT-TOLERANT COMPUTING
“Fault-tolerant computing is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults.” A fault-tolerant system may be able to tolerate one or more types of fault, including: i) transient, intermittent, or permanent hardware faults; ii) software and hardware design errors; iii) operator errors; or iv) externally induced upsets or physical damage. An extensive methodology has been developed in this field over recent years, and a number of fault-tolerant machines have been built - most dealing with random hardware faults, while a smaller number deal with software, design, and operator faults to varying degrees. A large body of supporting research has been reported.
Fault tolerance and dependable systems research covers a wide spectrum of applications, such as embedded real-time systems, commercial transaction systems, transportation systems, and military/space systems. The supporting research includes system architecture, design techniques, coding theory, testing, validation, proof of correctness, modeling, software reliability, operating systems, parallel processing, and real-time processing.
The primary forum for presenting research in this field has been the annual IEEE International Symposium on Fault-Tolerant Computing (FTCS), and the papers in its Digests provide an essential reference source.
Essential Concepts:
Hardware Fault-Tolerance -
The reliability of a system can be improved by introducing redundancy into the system. Some examples of redundancy in operation are given below:
A goal of redundant topologies is to eliminate network downtime caused by a single point of failure. All networks need redundancy for enhanced reliability. Network reliability is achieved through reliable equipment and network designs that are tolerant of failures and faults. Networks should be designed to reconverge rapidly so that the fault is bypassed.
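As a rough illustration of redundancy bypassing a single point of failure, the Python sketch below tries a primary path first and fails over to a backup path; the endpoint addresses and the send_with_failover() helper are hypothetical, introduced only for this example.

    import socket

    # Hypothetical redundant endpoints; in a real network these would be
    # independently routed paths to the same service.
    PRIMARY = ("10.0.0.1", 9000)
    BACKUP = ("10.0.1.1", 9000)

    def send_with_failover(payload: bytes, timeout: float = 2.0) -> bytes:
        """Try the primary path first; fall back to the backup on any failure."""
        last_error = None
        for host, port in (PRIMARY, BACKUP):
            try:
                with socket.create_connection((host, port), timeout=timeout) as conn:
                    conn.sendall(payload)
                    return conn.recv(4096)
            except OSError as exc:   # connection refused, timeout, link down...
                last_error = exc     # remember the fault and try the next path
        raise ConnectionError("all redundant paths failed") from last_error

Because the failover is handled inside the sending routine, a failure of the primary link is masked from the caller rather than becoming downtime.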
This second part of the article review focuses mainly on the controllable part, the Risk Management Framework (RMF). The controllable part contains six phases that operate within the system development life cycle (SDLC). These phases give system developers a way to apply security controls and to measure the risk level of the data and of the system. Combining both parts yields a security framework that allows system developers to follow a step-by-step process to gather useful system data,
Faults are a precise interaction of hardware and software, and they can be fixed given enough time.
This paper presents one of the project problems mentioned in the book “The Mythical Man Month” by Frederick P. Brooks Jr. In addition, I present my answers to the questions about the intangibility of software and the increasing cost connected with higher reliability requirements. The last part presents my views on which dependability attributes could be most crucial in four real-life systems.
Fault tolerance was proposed as a technique to allow software to cope with its own faults in a manner reminiscent of the techniques employed in hardware fault tolerance [4]. It is an essential element in the creation of the next generation of reliable computer systems. Unreliable software is a major factor that can have a severe effect on the software’s quality and cost, and it also delays software delivery. When the test results of a system differ from the expected results, the software has a defect. A defect is any significant, unplanned event that occurs during a software test.
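As one illustration of software coping with its own faults, the sketch below follows the recovery-block pattern; the routines and the acceptance test are hypothetical and not taken from the text. A primary routine runs first, its result is checked by an acceptance test, and an alternate routine is tried when the check fails.

    def acceptance_test(result: float) -> bool:
        # Hypothetical check: only finite, non-negative results are accepted.
        return result >= 0.0 and result == result   # "result == result" rejects NaN

    def primary_sqrt(x: float) -> float:
        # Primary variant: assumed to harbour a design fault for x == 0.
        return x ** 0.5 if x > 0 else float("nan")

    def alternate_sqrt(x: float) -> float:
        # Independently written alternate: simple Newton iteration.
        guess = x / 2.0 or 1.0
        for _ in range(50):
            guess = 0.5 * (guess + x / guess) if guess else 0.0
        return guess

    def recovery_block(x: float) -> float:
        """Run the primary; if its output fails the acceptance test, try the alternate."""
        for variant in (primary_sqrt, alternate_sqrt):
            result = variant(x)
            if acceptance_test(result):
                return result
        raise RuntimeError("no variant produced an acceptable result")

Calling recovery_block(0.0) detects the primary’s faulty NaN result and returns the alternate’s answer instead, so the fault does not propagate to the caller.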
Automated processes and network links for sharing data will change from time to time, and hardware components will fail. This means there is an ongoing need to monitor and manage physical sharing arrangements and to resolve failures as they arise. Once sharing arrangements are established, it may be assumed that everything will continue to operate without failure - and often it will, for years - but when a failure does occur it can take some time to identify and resolve the issue. This takes us into the need to manage data sharing.
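A minimal monitoring sketch in Python, offered only as an illustration (the arrangement names and the check_link() probe are hypothetical): each sharing arrangement is polled periodically so that failures are flagged as they arise rather than discovered long after data stops flowing.

    import time

    # Hypothetical sharing arrangements to be monitored; names are illustrative only.
    LINKS = ["partner_a_sftp", "partner_b_api", "warehouse_feed"]

    def check_link(name: str) -> bool:
        """Placeholder probe; a real check would attempt a lightweight transfer."""
        return True

    def monitor(poll_seconds: int = 60, rounds: int = 3) -> None:
        # Poll every arrangement at a fixed interval and flag failures immediately,
        # so faults are identified when they occur rather than months later.
        for _ in range(rounds):
            for link in LINKS:
                if not check_link(link):
                    print(f"ALERT: sharing arrangement '{link}' is failing")
            time.sleep(poll_seconds)

    # monitor(poll_seconds=60)   # example invocation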
Nowadays distributed systems are large, and finding faults in such large systems is hard. The distributed nature of those faults makes them complex to identify. Moreover, those faults are often partial and irregular, and may result in abnormal behavior rather than outright system failure. Diagnosing a problem in such systems therefore requires collecting relevant information from many different nodes and correlating it with the problem. One of the main sources of information that can point to the root cause of problems in a distributed system is the communication between nodes. Looking at the communication between different devices in a distributed system, it can be seen that they operate using various protocols. Among the protocols, Border
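A minimal sketch of the kind of cross-node correlation described above, assuming hypothetical per-node event records: events from every node within a time window around the failure are gathered into one ordered timeline so they can be examined together.

    from datetime import datetime, timedelta

    # Hypothetical per-node communication logs: (timestamp, node, message).
    events = [
        (datetime(2024, 1, 1, 12, 0, 1), "node-a", "peer session to node-c reset"),
        (datetime(2024, 1, 1, 12, 0, 2), "node-b", "route withdrawn"),
        (datetime(2024, 1, 1, 12, 5, 0), "node-c", "unrelated housekeeping"),
    ]

    def correlate(failure_time: datetime, window: timedelta = timedelta(seconds=30)):
        """Collect events from all nodes within a window around the failure time."""
        related = [e for e in events if abs(e[0] - failure_time) <= window]
        return sorted(related)   # one ordered timeline across nodes

    for ts, node, msg in correlate(datetime(2024, 1, 1, 12, 0, 0)):
        print(ts, node, msg)     # the two related events; the later one is excluded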
Abstract--A number of software fault tolerance methods have been proposed to achieve high reliability. However, these methods do not consider the possibility of correlated failures, in which the failure of multiple components leads to system failure. Furthermore, previous methods for assessing the impact of correlated failures on software fault tolerance require extensive test data and are therefore less suitable for reliability analysis in the early design stages. Hence, these types of failures must be explicitly incorporated into reliability analysis, and the influence of correlated failures on application reliability must be analyzed within the context of the application architecture. This paper provides a survey of some of the studies conducted so far in the field of software reliability and summarizes the models proposed to study the impact of correlated failures on software reliability. The paper concludes by highlighting the major contributions of some of these studies and provides a direction for future work in the field.
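To make the point concrete, here is a small numeric sketch with purely illustrative figures: for a two-version system that fails only when both versions fail on the same input, assuming independence can understate the true failure probability by a large factor when failures are correlated.

    # Illustrative probabilities (not measured data).
    p1 = 0.01            # probability that version 1 fails on a given input
    p2 = 0.01            # probability that version 2 fails on the same input
    p_both_corr = 0.005  # assumed joint failure probability with correlated faults

    p_both_indep = p1 * p2   # 1e-4 under the independence assumption

    print(f"independent model : {p_both_indep:.4%}")   # 0.0100%
    print(f"correlated model  : {p_both_corr:.4%}")    # 0.5000%, 50x worse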
The Cambridge Distributed Computing System has played an important role in improving the quality of communications and opened up new ways in which computing systems could be used to obtain better results. Research in this area started around 1975, and much work has been done since then.
All of the basic techniques for coping with failures involve some kind of replication. Typically, the state of a system’s computation is replicated onto independently failing nodes, and the system then coordinates among the replicas to recover accurately from failures. Fault-tolerant techniques are usually designed to tolerate up to a pre-defined number, say k, of simultaneous failures; if such methods are used, the system is said to be k-fault tolerant.
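A minimal sketch of this replication idea, assuming hypothetical crash-only replicas: an update is considered durable only once at least k + 1 replicas acknowledge it, so the computation survives up to k simultaneous replica failures.

    import random

    class Replica:
        """Hypothetical replica that may fail independently when updated."""
        def __init__(self, name: str, fail_prob: float = 0.2):
            self.name, self.fail_prob, self.state = name, fail_prob, None

        def apply(self, update) -> bool:
            if random.random() < self.fail_prob:   # simulated independent failure
                return False
            self.state = update
            return True

    def replicate(update, replicas, k: int) -> bool:
        """The update is durable only if at least k + 1 replicas acknowledge it,
        so it survives even if k of those replicas subsequently fail."""
        acks = sum(r.apply(update) for r in replicas)
        return acks >= k + 1

    nodes = [Replica(f"node-{i}") for i in range(3)]   # three independent replicas
    print(replicate({"balance": 42}, nodes, k=1))      # True if >= 2 acks received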
As a consequence, it is difficult for human operators to anticipate faults within the system and to prevent and manage the risks incurred by an operational accident, making such systems “incomprehensible”. Organisational accidents in complex systems are therefore inevitable: despite defensive measures implemented to mitigate their risk, such as the training of operators and regular maintenance, the fragile design of the systems is the core reason why accidents occur.
The N-version software concept attempts to parallel the traditional hardware fault tolerance concept of N-way redundant hardware. In an N-version software system, each module is built with up to N different implementations. Each variant accomplishes the same task, but hopefully in a different way. Each version then submits its answer to a voter or decider, which determines the correct answer (hopefully all versions agree and are correct) and returns it as the result of the module. Such a system can hopefully overcome the design faults present in most software by relying on the design diversity concept. An important distinction of N-version software is that the system may include multiple types of hardware running multiple versions of software; the goal is to increase diversity in order to avoid common-mode failures. With N-version software, each version should be implemented in as diverse a manner as possible, including different tool sets, different programming languages, and possibly different environments. The various development groups must have as little programming-related interaction between them as possible. N-version software can only succeed in tolerating faults if the required design diversity is achieved.
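A minimal sketch of the N-version idea with N = 3 (the versions are trivially small and purely illustrative): each version computes the result independently, and a majority voter masks a design fault in any single version.

    from collections import Counter

    # Three independently written versions of the same function (illustrative only).
    def version_1(x: int) -> int:
        return x * x

    def version_2(x: int) -> int:
        return x ** 2

    def version_3(x: int) -> int:
        return x * x if x != 3 else 8      # deliberate design fault for x == 3

    def n_version(x: int) -> int:
        """Run every version and return the answer agreed on by a majority."""
        answers = [v(x) for v in (version_1, version_2, version_3)]
        result, votes = Counter(answers).most_common(1)[0]
        if votes <= len(answers) // 2:
            raise RuntimeError("no majority: the versions disagree")
        return result

    print(n_version(3))    # 9 -- the faulty version_3 is outvoted

Real deployments differ mainly in scale: the variants are developed by separate teams and the voter typically compares richer outputs, but the masking principle is the same.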
Fault tolerance is the characteristic of a system that allows it to tolerate a given class of failures. It enables a system to continue its intended operation, possibly at a reduced level of performance, rather than failing completely when some part of the system fails. The system as a whole is not stopped by problems in either the hardware or the software [1][2]. Fault tolerance is the ability of software to detect and recover from a fault that is happening, or has already happened, in either the software or the hardware on which the software runs, in order to provide service in accordance with the specification [3]. Fault tolerance is not a solution unto itself, however, and it is important to realize that software fault tolerance is just one piece necessary to create the next generation of systems. A highly fault-tolerant system might continue at the same level of performance even though one or more components have failed. For example, a building with a backup electrical generator will provide the same voltage to wall outlets even if the grid power fails [4].
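A minimal sketch of continuing at a reduced level rather than failing completely (the rendering functions are hypothetical): the failure of the primary component is detected, and the system falls back to a degraded but working mode.

    def full_quality_render(page: str) -> str:
        # Hypothetical primary component; assume it can fail at run time.
        raise RuntimeError("rendering engine crashed")

    def degraded_render(page: str) -> str:
        # Simpler fallback that provides reduced but acceptable service.
        return f"[plain-text view of {page}]"

    def render(page: str) -> str:
        """Detect a failure in the primary path and continue in degraded mode."""
        try:
            return full_quality_render(page)
        except Exception:
            return degraded_render(page)   # reduced level, not a total failure

    print(render("home"))    # falls back: "[plain-text view of home]"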
For some applications, software fault tolerance is more a safety issue than a reliability issue. There can be legal or regulatory requirements for fault tolerance. For many B2B and B2C transactions, fault tolerance is a business decision. There are other organizations whose charter is driven by the implementation of fault-tolerant systems. For example, the aviation industry is required to meet strict specifications for hardware and software. Airlines board over 500 million passengers per year; transporting such large numbers of people demands fault-tolerant environments with robust safety and reliability. In the nuclear industry, the U.S. Nuclear Regulatory Commission developed a tool to assess software reliability by modeling redundant systems with analysis of failure data and calculations of availability. The International Society for Pharmaceutical Engineering provides risk-based guidance to encourage cost-effective error detection and failure prevention. Companies that strive to maintain accurate and comprehensive environmental data are regulated with