INDUSTRY TRENDS

Fault Tolerant Control Systems - (Part 2)

The first part of this article introduced the technologies for improving fault tolerance of control systems at the PLC and networking levels. This part of the article follows by looking at the technologies available to make supervisory computers and servers more fault tolerant.

Communication to Redundant PLCs

Because computers in control system environments are typically connected to PLCs and other equipment on the factory floor, there are issues that need to be considered where the supervisory computer is communicating to a redundant PLC system. For example, how will the computer communicate with the PLCs and how will it switch to the stand-by PLC in the event of a failure?

The most common method of implementing this type of system is to connect the computer and PLCs via a common network, such as Ethernet. Communication software running in the computer then communicates with the primary PLC and switches to the stand-by PLC in the event of a communication failure.

For example, a standard feature of OPC Drivers used by HMI and SCADA systems is to be able to define a main communication channel and a backup communication channel. The main channel is configured to communicate to the primary PLC, and the backup channel is configured to communicate to the stand-by PLC. In the event of a communication failure on the main channel, the OPC Driver automatically switches to communicate to the stand-by PLC via the backup channel.

A possible concern is accurately determining which PLC has control and ensuring the supervisory computer system is communicating to that PLC. For example, if the control function of the primary PLC failed but its communications continued to work normally, the computer system may not be able to determine that there was a problem and that the stand-by PLC had taken over. This can be avoided by ensuring the application software in the supervisory computer also monitors data registers in the PLCs that indicate which PLC has control. The supervisory computer may then force the communication software to switch to the stand-by PLC if it has not already done so.
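The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration only, not the behaviour of any particular OPC Driver: the PlcStatus class, its field names and the select_active_plc function are hypothetical, and the sketch assumes each PLC exposes a register that is true when that PLC has control.

```python
from dataclasses import dataclass


@dataclass
class PlcStatus:
    """Snapshot of one PLC as seen by the supervisory computer (hypothetical)."""
    name: str
    comms_ok: bool      # did the last poll on this channel succeed?
    has_control: bool   # value of the PLC's "I have control" register


def select_active_plc(primary: PlcStatus, standby: PlcStatus, current: str) -> str:
    """Return the name of the PLC the supervisory computer should talk to."""
    # Prefer a PLC that is reachable and reports that it has control.
    for plc in (primary, standby):
        if plc.comms_ok and plc.has_control:
            return plc.name
    # Neither PLC reports control: fall back to plain channel failover.
    for plc in (primary, standby):
        if plc.comms_ok:
            return plc.name
    # Nothing is reachable: keep the current selection so the caller can alarm.
    return current


# The primary still answers polls, but the stand-by has taken control.
primary = PlcStatus("primary", comms_ok=True, has_control=False)
standby = PlcStatus("standby", comms_ok=True, has_control=True)
active = select_active_plc(primary, standby, current="primary")  # "standby"
```

Checking the control register before the communication status is what catches the case where the primary PLC has failed but still answers polls. If both PLCs claim control, this sketch simply prefers the primary; a real system would probably raise an alarm as well.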

Redundant Components

As with control systems, a common technique for improving the fault tolerance of computer systems is through redundancy. The two main causes of computer failure are the malfunction of a hard drive (about 50% of failures) and the malfunction of a power supply (about 25% of failures). Therefore, up to 75% of computer failures can be prevented by providing redundant power supplies and hard drives.

Power Supply Redundancy

Power supply redundancy is realised with the use of dual, load sharing power supplies. Each supply can provide the full power requirement for the computer. The outputs of the supplies are connected together and the supplies are adjusted so that each provides approximately half the load. The mean time between failures (MTBF) of each supply will be increased because each supply runs at about half power and is therefore under less stress. In the event of a power supply failure, the "good" supply will assume the full load and the computer will continue to operate normally.

If the power supplies are connected to an Uninterruptible Power Supply (UPS), then it is also important that the UPS has a dual power supply to prevent a failure in the UPS from disabling both computer power supplies.

Hard Drive Redundancy

Redundant hard drives are arranged in an array usually referred to as a RAID array (Redundant Array of Independent Disks). There are different RAID levels for different applications. Certain array configurations improve read or write performance while others are primarily intended to maintain data integrity. For industrial control applications, a RAID 1 configuration, also referred to as disk mirroring, is often used. This is the simplest array, consisting of two hard drives and a disk controller. The controller writes to both primary and secondary drives simultaneously, but reads from the primary drive. Since all data is completely redundant, data integrity in the event of a hard drive failure is assured. In the event of a primary hard drive failure, the RAID controller will automatically switch to the secondary drive, thus preventing a computer failure.

RAID 5 is also popular and is less expensive than RAID 1 where large disk capacities are involved. In this RAID configuration, data and parity information are striped across multiple hard drives to achieve high performance through parallel disk I/O. If a disk in a RAID 5 system fails, the system can continue to operate with the remaining disks. The faulty disk may be replaced and the data restored on the new disk using the data and parity information on the remaining disks.
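A RAID 5 array can rebuild a lost disk because each stripe normally carries a parity block, calculated as the exclusive OR (XOR) of the data blocks in that stripe. The short Python sketch below illustrates the principle for a single stripe. The block contents and the xor_blocks helper are invented for illustration; a real controller works on fixed-size disk blocks and rotates the parity block from disk to disk.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across a four-disk array: three data blocks plus one parity block.
data_blocks = [b"PLC1", b"PLC2", b"TANK"]
parity = xor_blocks(data_blocks)

# Simulate losing the disk that held the second data block.
surviving = [data_blocks[0], data_blocks[2], parity]

# XOR of everything that survived reproduces the missing block.
rebuilt = xor_blocks(surviving)
```

Because XOR is its own inverse, combining the surviving data blocks with the parity block yields the missing block, whichever single disk was lost. This is also why such an array survives one disk failure but not two at once.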

Hot Swap or Cold Swap

If a failure occurs, there are two options for replacement of either power supply or hard drive. The least costly approach is called "cold swap". If a power supply fails, the surviving supply will continue to provide adequate power to operate the computer but the computer must be powered down to replace the defective supply. In the event of hard drive failure, the remaining drives will continue to provide normal operation but, again, the computer must be powered down to replace the defective hard drive. A cold swap strategy always results in downtime.

The "hot swap" replacement approach means that power supplies and hard drives are mounted in removable modules. A defective module can be removed and a replacement module installed without powering the computer down.

Server Clusters

A server cluster is a group of independent servers managed as a single system for higher availability, easier manageability, and greater scalability. The minimum requirements for a server cluster are:

  • Two servers connected by a network.
  • A method for each server to access the other's disk data (e.g. via a standard such as SCSI).
  • Special cluster software to provide services such as failure detection, recovery and the ability to manage the servers as a single system. There are a number of systems available for Unix based computers, as well as Microsoft Cluster Server (MSCS) for Windows NT.

Using Microsoft’s MSCS as an example, we can see how clustering solutions improve the fault tolerance of a server system. MSCS uses software "heartbeats" to detect failed applications or servers. In the event of a server failure, it automatically transfers ownership of resources (such as disk drives and IP addresses) from a failed server to a surviving server. It then restarts the failed server's workload on the surviving server. All of this - from detection to restart - typically takes under a minute. If an individual application fails (but the server does not), MSCS will typically try to restart the application on the same server; if that fails, it moves the application's resources and restarts it on the other server. Various recovery policies can be set, such as dependencies between applications, whether or not to restart an application on the same server and whether or not to automatically "failback" and rebalance workloads when a failed server comes back online.
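The heartbeat and resource transfer idea can be illustrated with a small Python sketch. It is not based on the MSCS programming interface: the Node class, the check_failover function, the five second timeout and the resource names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    last_heartbeat: float                      # time the last heartbeat arrived
    resources: set[str] = field(default_factory=set)


def check_failover(nodes: list[Node], now: float, timeout: float = 5.0) -> list[str]:
    """Move resources from nodes with stale heartbeats to a surviving node.

    Returns the names of the nodes that were declared failed.
    """
    alive = [n for n in nodes if now - n.last_heartbeat <= timeout]
    failed = [n for n in nodes if now - n.last_heartbeat > timeout]
    if alive:
        survivor = alive[0]
        for node in failed:
            survivor.resources |= node.resources   # transfer ownership
            node.resources = set()
    return [n.name for n in failed]


# server-a last reported 10 seconds ago, server-b 1 second ago.
a = Node("server-a", last_heartbeat=100.0, resources={"disk-1", "10.0.0.5"})
b = Node("server-b", last_heartbeat=109.0, resources={"disk-2"})
failed = check_failover([a, b], now=110.0)
```

After the call, server-a is reported as failed and its disk and IP address are owned by server-b. A real cluster would go on to restart the failed server's workload on the survivor and apply the configured recovery policies, which this sketch leaves out.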

Clustering solutions need to be used in conjunction with the other technologies mentioned previously to ensure maximum fault tolerance.

Redundant Computers

Where even a short period of downtime cannot be tolerated, duplicate redundant computers are necessary, with each running a copy of the same application software. The computers are typically installed in different locations to avoid physical damage in one area (for example, fire or water damage) affecting both computers.

Where the PLCs on the factory floor communicate with a system of redundant computers, it is common to connect the computers and PLCs via a common network such as Ethernet. Each computer may then communicate to the PLCs as required. If the redundant computers are always on-line (i.e. hot stand-by), both computers will communicate to the PLCs. This will double the communication to each PLC, but one computer will always be immediately available in the event of a failure in the other. If the PLCs are redundant, the techniques described above may also be used.

In some cases, the redundant stand-by computer may only take action if the main computer fails. In that case, there may be a delay while the stand-by computer starts up and updates its database with the current information from the field. However, the communication traffic to the PLCs is half of that of a hot stand-by system.

A number of control system devices only support RS-232 communication, which requires a point to point connection between a single computer and the device. It is very difficult to implement a redundant computer system in this case because the RS-232 cable would need to be physically switched from the primary computer to the stand-by in the event of a failure. Techniques to work around the problem include using RS-232 to RS-485 converters to create a multi-dropped RS-485 network between the computers and the device. The two computers would then need to coordinate between themselves to ensure that only one communicated to the device.
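One simple way for the two computers to coordinate is for the stand-by to stay silent on the RS-485 network until it has not heard from the primary for a set time. The Python sketch below shows that rule only. The function name, the three second takeover delay and the idea that the computers exchange "alive" messages over a separate link are assumptions for the example, not features of any particular product.

```python
def may_poll_device(is_primary: bool, peer_last_seen: float, now: float,
                    takeover_after: float = 3.0) -> bool:
    """Decide whether this computer may transmit on the shared RS-485 network."""
    if is_primary:
        # The primary owns the network whenever it is running.
        return True
    # The stand-by transmits only once the primary has been quiet for too long.
    return now - peer_last_seen > takeover_after


# Stand-by computer: the primary was last heard from 1 second ago, then 10 seconds ago.
standby_quiet = may_poll_device(False, peer_last_seen=99.0, now=100.0)    # False
standby_active = may_poll_device(False, peer_last_seen=90.0, now=100.0)   # True
```

With this rule the stand-by falls silent again as soon as it hears from the primary, so only one computer talks to the device in normal operation. A brief overlap is still possible while the primary restarts, so the device protocol would need to tolerate an occasional collision or retry.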

Simplified Systems

Some of the technologies discussed so far involve purchase of additional hardware and software to achieve a high degree of fault tolerance. In a number of cases, however, simple measures can be implemented at a modest cost to achieve a reasonable level of fault tolerance.

For example, hot swap redundant hard drives and power supplies are only necessary where the system cannot tolerate an unexpected shutdown and loss of production data. In some cases, a temporary failure of a computer lasting, say, 10 minutes is inconvenient, but may not cause a major problem. As mentioned, about 50% of failures are caused by hard disk failures. Therefore, a simple precaution is to set up a spare hard disk with an image of the existing hard disk in a removable chassis. A chassis can be purchased for less than $50 and a standard hard disk can be inserted to allow it to be removed from the computer system in seconds. Once an image of the existing hard disk is made, a hard disk failure can be rectified within minutes by removing the existing hard disk / chassis, inserting the new disk / chassis and rebooting the computer. While this is quick and inexpensive, any production data accumulated on the failed hard disk will be lost.

Another option adopted by some organisations is to implement a simple, semi-automatic system to back up their production systems. For example, during commissioning of a warehouse management system, a simple spreadsheet was developed to verify the data produced by the automatic system. The spreadsheet obtained data from the PLCs on the factory floor via a separate communication driver, so it could be run on any PC on the corporate network. Following commissioning, this spreadsheet was used as a backup system in the event of a failure of the main computer system. While not providing any of the features of the production system, it provided the bare minimum information necessary to run the factory in an emergency.

The technologies introduced in this article are fundamental to developing fault tolerant control systems. They provide a range of options that can be implemented to develop a control system that is consistent with the mission of the organisation and the cost of downtime.
