INDUSTRY TRENDS

Fault Tolerant Control Systems - (Part 1)

Components of control systems fail. This may be a minor annoyance or a major disaster, depending on the circumstances and an organisation’s level of preparedness. For most organisations, a system failure jeopardises personnel safety, increases production downtime, increases raw material waste, impacts on customer service and can result in production data loss. As control systems become more dependent on computers and communication networks, there is a greater likelihood that some parts will fail, and thus a greater need to design systems to increase their reliability and integrity.

Reliability is a function of the mean time between failures (MTBF) of all system components. It reflects the quality of the equipment used and also depends on the software (in PLCs and PCs) used as the platform to develop the control application. High reliability can be achieved by selecting well-proven, high quality components from reputable suppliers. Integrity reflects the performance and behaviour of the system when a component fails, that is, its fault tolerance. It depends mainly on design issues such as the structure of the system, the hardware configuration, the communication links between system elements and the quality of the developed application software.
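As a rough illustration of how component MTBF figures combine, the sketch below assumes that every component must work for the system to work and that failure rates are constant and independent, so the rates (1/MTBF) simply add. Neither assumption comes from this article, and the function name and figures are hypothetical.

```python
def system_mtbf(component_mtbfs_hours):
    """Combined MTBF (hours) of components that must all work.

    Assumption: constant, independent failure rates, so the
    individual failure rates (1 / MTBF) add together.
    """
    if not component_mtbfs_hours:
        raise ValueError("at least one component is required")
    total_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbfs_hours)
    return 1.0 / total_failure_rate


# Hypothetical figures for a power supply, a CPU and an I/O module
print(round(system_mtbf([100_000, 200_000, 200_000])))
```

Under these assumptions the combined figure is always lower than that of the weakest component, which is one reason the quality of every component matters.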

Fault tolerant systems employ a range of technologies that improve system integrity and reduce the likelihood of control systems failures. The techniques fall into two general classes:

1. Hardware
This includes making hardware more reliable and rugged, for example by making it capable of withstanding abuse or extreme environmental conditions. Eliminating single points of failure is also critical in designing a fault tolerant system. This can be done by building redundant components into a controller or computer, or by locating completely redundant systems in different physical locations. Building in reliability is not limited to special controllers or computers. Many control systems also include a UPS (Uninterruptible Power Supply) to minimise disruptions due to power failure and to reduce stress on components caused by fluctuations in the power supply (thus reducing the chance of failure in the first place).
2. Software
This includes making software more reliable by designing out and testing for software bugs, or allowing the software to survive even if the hardware fails. This is often more difficult than improving hardware reliability because it is very difficult to prove conclusively that all errors in developed software have been detected and rectified. In both PLCs and PCs, structured design of the software plays a major role in creating an environment where the software can be developed and tested in an orderly manner. Well-designed software can also be maintained more effectively, reducing the number of errors introduced when modifications are implemented.

The purpose of this article is to review some of the techniques that are used to improve the reliability and integrity of control systems. Solutions being used for PLCs (Programmable Logic Controllers), communication networks and supervisory and management computers will be reviewed. Part of this article will also look at simple design concepts and inexpensive alternatives to improve integrity or allow fast recovery from a failure in less critical systems.

Programmable Logic Controllers

A common technique used to improve the integrity of PLCs and other controllers is the "hot-standby" dual system, where the standby unit takes over if the "hot" unit fails. A related technique in critical systems is the triplicated system, typically called TMR (Triple Modular Redundant), composed of triplicated controllers, Input/Output (I/O) processors and interconnecting hardware. In TMR systems, voting schemes between components determine the correct result and allow the system to tolerate the failure of one resource.
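The voting idea can be sketched as a simple two-out-of-three vote. This is a minimal illustration, not any vendor's voting scheme; the function name and the choice to raise an error when no two channels agree are assumptions.

```python
from collections import Counter


def majority_vote(a, b, c):
    """Two-out-of-three vote over the values reported by three channels.

    Returns (voted_value, faulty_channels), where faulty_channels lists
    the indexes (0, 1, 2) of channels that disagree with the majority.
    Raises RuntimeError if no two channels agree.
    """
    readings = [a, b, c]
    value, count = Counter(readings).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one channel has failed")
    faulty = [i for i, reading in enumerate(readings) if reading != value]
    return value, faulty


# Channel 2 disagrees, so it is outvoted and flagged for maintenance
print(majority_vote(True, True, False))
```

Reporting the dissenting channel matters as much as the voted value: a failed channel that is never repaired leaves the system unable to tolerate the next failure.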

These systems handle the failure of a single resource well, but in most dual or triple fault tolerant systems the entire system may fail if two or more resources fail. The usual cause is a common cause failure, where two or more resources fail due to the same stress event. Stress events include mechanical shock, vibration, Electromagnetic Interference/Radio Frequency Interference (EMI/RFI), temperature, humidity, maintenance errors and operational errors.

GE Fanuc provides options for variable redundancy with up to three PLC CPU processors. Redundant systems can be configured using standard Series 90-70 PLCs and GE Fanuc’s Genius I/O components, which have the capability to diagnose system faults and take corrective action automatically. To ensure correct control decisions, system inputs, CPU and output demands are majority voted. All Genius I/O modules are configured with "safe state" default parameters, which apply in the event of a system failure.

The Modicon TSX Quantum Hot Standby Option system provides similar support for critical process applications. Central to the system is the standby controller itself, which is continually updated with the system’s current status. Linked to the primary controller via a secure, high-speed fibre optic connection, the Hot Standby receives register and I/O state tables at the beginning of each scan. In the event of a primary controller failure, the standby option processor can take immediate control of the system. This provides a seamless, instantaneous control transfer, which is completely transparent to system operation.
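The hot-standby pattern described above (state transferred to the standby at the beginning of each scan, with the standby taking over on failure) can be sketched as follows. This is a toy model under stated assumptions, not the Quantum implementation: the class names, the `healthy` flag and the trivial control logic are all hypothetical.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Controller:
    name: str
    healthy: bool = True
    registers: dict = field(default_factory=dict)

    def scan(self, inputs):
        # Toy control logic: count scans and latch the latest inputs
        self.registers["scan_count"] = self.registers.get("scan_count", 0) + 1
        self.registers["inputs"] = dict(inputs)


class HotStandbyPair:
    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def run_scan(self, inputs):
        """Run one scan and return the name of the controller in charge."""
        if not self.primary.healthy:
            # Switchover: the standby resumes from the last state it received
            self.primary, self.standby = self.standby, self.primary
        if self.standby.healthy:
            # State transfer at the beginning of the scan
            self.standby.registers = copy.deepcopy(self.primary.registers)
        self.primary.scan(inputs)
        return self.primary.name


pair = HotStandbyPair(Controller("A"), Controller("B"))
pair.run_scan({"level": 10})
pair.primary.healthy = False          # simulate a primary failure
print(pair.run_scan({"level": 11}))   # the standby is now in charge
```

In this sketch the standby is at most one scan behind the primary at the moment of switchover, so it repeats the interrupted scan rather than restarting from an empty state. A failure of both controllers is not handled.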

In the Allen-Bradley ControlNet PLC-5 backup system, two identical ControlNet PLC-5 processors are used to create a primary and a secondary system. Both primary and secondary consume the same input information, and both connect to the same outputs, though only the primary controls those outputs. Both processors are linked to the same ControlNet network. This allows them to maintain synchronous network communication and program scanning. When the primary is no longer capable of control due to an internal fault or an external power/communication loss, the secondary takes over. The secondary runs either a program identical to that on the primary or a unique ladder program defined by the user for safe shutdown and/or limited production. Redundant ControlNet networking is also available as a part of this system.

Without resorting to redundant hardware, less critical areas of a control system can be made more tolerant of failures during system design. When allocating the I/O points in a PLC system to specific I/O hardware modules, some thought should be given to minimising the effect of the failure of a single I/O module.

For example, if a number of devices, such as drives or pumps, are to be controlled, the I/O for each device should be allocated to modules that are not shared with the other devices. If the devices can operate independently of each other, failure of a single I/O module would then affect only one device, allowing the others to continue normally.
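One way to check an I/O allocation against this rule at design time is to list, for every module, the devices that depend on it. The sketch below is a hypothetical design aid; the point names, module names and data layout are assumptions made for illustration.

```python
from collections import defaultdict


def devices_per_module(allocation):
    """Map each I/O module to the set of devices that depend on it.

    allocation maps an I/O point name to a (device, module) pair.
    """
    affected = defaultdict(set)
    for _point, (device, module) in allocation.items():
        affected[module].add(device)
    return dict(affected)


def shared_modules(allocation):
    """Return the modules whose failure would affect more than one device."""
    return {
        module: sorted(devices)
        for module, devices in devices_per_module(allocation).items()
        if len(devices) > 1
    }


# Grouping points by signal type puts both pumps on the same modules
by_signal_type = {
    "pump1_run": ("pump1", "DO-1"),
    "pump2_run": ("pump2", "DO-1"),
    "pump1_running": ("pump1", "DI-1"),
    "pump2_running": ("pump2", "DI-1"),
}
print(shared_modules(by_signal_type))
```

An allocation that gives each pump its own input and output modules makes `shared_modules` return an empty dictionary, which is the property the paragraph above asks for.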

Consideration also needs to be given to repair of the system in the event of a failure. For example, in an application that can’t tolerate a system shutdown during production, there is little point having hot standby PLCs and a fault tolerant design if you need to shut down the PLC system to replace a faulty I/O card. In this case, use of a PLC that supports hot swappable I/O cards would also be an important design issue.

In many cases, PLCs interact with supervisory computers to obtain schedule and production information and to feed back data and statistics gathered from the process. Supervisory computers are often not as reliable as the process control PLCs, and therefore the PLC system should be designed to operate independently for a period of time. For example, if the PLC requires production information at the beginning of each batch, the PLC could be designed to accept information for sufficient batches to last for one day’s production. The supervisory computer could update the information when necessary. However, in the event of a supervisory computer failure, the plant would be able to continue production.
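The idea of holding a day's worth of batch information locally can be sketched as a bounded queue that the supervisory computer tops up whenever it is available. This is written in Python purely for illustration (a real PLC would use its own data tables), and the class name, method names and capacity are assumptions.

```python
from collections import deque


class BatchQueue:
    """Local store of upcoming batch data, topped up by the supervisor."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._batches = deque()

    def download(self, batches):
        """Accept batches from the supervisor until full; return how many fitted."""
        accepted = 0
        for batch in batches:
            if len(self._batches) >= self.capacity:
                break
            self._batches.append(batch)
            accepted += 1
        return accepted

    def next_batch(self):
        """Return the next batch to run, or None if the queue is empty."""
        return self._batches.popleft() if self._batches else None

    def __len__(self):
        return len(self._batches)


queue = BatchQueue(capacity=3)
queue.download(["batch-1", "batch-2", "batch-3", "batch-4"])
# The supervisor can now fail; production continues from the local queue
print(queue.next_batch(), len(queue))
```

The capacity would be sized to cover the longest supervisory outage the plant is expected to ride through, one day's production in the example above.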

In the same way, provision should be made to buffer production feedback information in the PLC. If a supervisory computer is unavailable, even for a short time while it is being restarted, important information can be buffered and transferred when the computer re-establishes communication.
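A store-and-forward buffer for production feedback might look like the following sketch. The names are hypothetical, and the choice to discard the oldest record when the buffer is full is an assumption; some applications would prefer to keep the oldest records instead.

```python
from collections import deque


class FeedbackBuffer:
    """Holds feedback records until the supervisory computer accepts them."""

    def __init__(self, max_records):
        # Assumption: when the buffer is full, the oldest record is discarded
        self._records = deque(maxlen=max_records)

    def record(self, item):
        self._records.append(item)

    def flush(self, send):
        """Send buffered records in order; send(record) returns True on success.

        Stops at the first failure so unsent records stay buffered.
        Returns the number of records sent.
        """
        sent = 0
        while self._records:
            if not send(self._records[0]):
                break
            self._records.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._records)


buffer = FeedbackBuffer(max_records=100)
buffer.record({"batch": "batch-1", "weight_kg": 250})
buffer.flush(lambda record: False)   # supervisor offline: nothing is lost
print(len(buffer))
```

A record is removed only after `send` reports success, so a communication failure part-way through a transfer does not lose data.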

Communication Networks

In most cases, fault tolerance in communication networks is achieved by implementing redundant networks with multiple, independent paths between devices on the network. Good redundant network design features communication cables running along different routes, minimising the risk of network failure if one cable is accidentally damaged. In some cases, all available network bandwidth is used, with a failure of part of the network increasing the traffic in the remaining parts of the network. In this case, the system needs to be designed so that some services are discontinued or slowed down to reduce traffic to match the available network bandwidth.
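Matching traffic to the remaining bandwidth can be sketched as a priority-based selection of services. The service names, priorities and bandwidth figures below are hypothetical, and the greedy selection shown is only one possible policy.

```python
def select_services(services, available_bandwidth):
    """Choose which services to keep running within the available bandwidth.

    services is a list of (name, priority, bandwidth) tuples, where a lower
    priority number means a more critical service. Services are considered
    in priority order and kept if they still fit.
    """
    kept = []
    used = 0
    for name, _priority, bandwidth in sorted(services, key=lambda s: s[1]):
        if used + bandwidth <= available_bandwidth:
            kept.append(name)
            used += bandwidth
    return kept


services = [
    ("interlocks", 0, 2),
    ("alarms", 1, 3),
    ("trend logging", 2, 4),
    ("reports", 3, 2),
]
# Half of a dual network has failed, leaving 5 units of bandwidth
print(select_services(services, available_bandwidth=5))
```

Note that this policy lets a small low-priority service run if it fits after a larger, higher-priority one has been dropped. A stricter design would stop at the first service that does not fit.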

For example, Toshiba Corporation’s ADMAP network is used for high-speed, reliable communication between Distributed Control System (DCS) and PLC components. ADMAP uses a dual redundant network to connect all devices in the system. In the event of a failure of one network, the second network takes over transparently, recovering from the fault. Toshiba’s network interface adaptors in each device manage all aspects of the failure recovery transparently.

Open networks based on standards such as Ethernet typically use a combination of redundant managed hubs, routers and network cables. In many installations, a single connection is made between a computer and a hub, with redundant paths provided from the hub to other parts of the network. In critical applications, multiple network interface adaptors can be installed in the computer and multiple connections established to multiple hubs. A number of vendors, for example, provide Microsoft Windows NT software drivers that allow multiple network interfaces to be seen by NT as a single network interface. In the event of a failure of one interface card or its connection to the hub, the driver will switch to the backup interface automatically.
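The behaviour of such a driver, presenting several interfaces as one and switching to the backup on failure, can be modelled roughly as below. This is a conceptual sketch only, not the API of any real driver; each interface is represented by a hypothetical callable that raises `ConnectionError` when its link is down.

```python
class TeamedInterface:
    """Presents several network interfaces as one, with automatic failover."""

    def __init__(self, interfaces):
        # Each interface is a callable taking a payload; it raises
        # ConnectionError if the card or its link to the hub has failed.
        self.interfaces = list(interfaces)
        self.active = 0

    def send(self, payload):
        count = len(self.interfaces)
        for attempt in range(count):
            index = (self.active + attempt) % count
            try:
                result = self.interfaces[index](payload)
            except ConnectionError:
                continue  # try the next interface
            self.active = index  # stay on the interface that worked
            return result
        raise ConnectionError("all interfaces failed")


def failed_link(payload):
    raise ConnectionError("link down")


def working_link(payload):
    return "sent: " + payload


team = TeamedInterface([failed_link, working_link])
print(team.send("status request"))
```

The caller sees a single `send` operation throughout, which mirrors the way the operating system sees a single network interface.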

Higher performance requirements can be addressed using technologies such as FDDI and ATM, which have built redundancy and high network availability into their original design philosophy.

The next part of this article will continue to look at the technologies being used to make supervisory computers and servers more fault tolerant.
