Computing platforms inevitably suffer from corruptions that creep into the operating system, or elsewhere into the software platform. It is possible to reinstall individual applications or the operating system, but such remedies are slow and expensive both in terms of money and downtime. For this reason, system administrators typically rely upon system recovery applications, several of which are currently available on the market.
Unfortunately, most of these system recovery solutions are inadequate in one or more ways. For automated industrial systems, the clearest shortcoming is their lack of automation: currently, none of the most widely known software recovery systems offers any means of configuring automated, smart reactions to system events. Achieving a reliably automated system recovery solution, however, requires BIOS-level engineering and careful planning for the customer’s end needs. In the white paper that follows, we describe the key scenarios that a suitably smart, automated software platform recovery tool must address, and the technology needed to produce it.
Expertise, efficiency and speed
Automation advances in industrial processes increasingly involve using industrial PCs for the control and monitoring of every sort of machinery and process. Yet as industrial PCs penetrate ever further into the automation stack, data loss and system corruption become critical considerations, both for the security issues they raise and for overall system availability. As an operating system ages, small corruptions work their way into the system software and the platform slows down. Eventually – and sometimes quickly and suddenly – small corruptions become big corruptions, bringing down the entire OS.
The most vexing problems are, of course, those which remote troubleshooting and reconfiguration cannot fix: if a systems administrator cannot reliably and quickly bring a hung system back online, then the most effective solution is to replace the system (either hardware or software) as quickly as possible. Unfortunately, the expertise of the staff who maintain and operate these automated facilities – whether at electrical substations or wind farms, in hazardous environments such as oil and gas fields, or in transport environments such as trains and ships – does not include computer administration, troubleshooting, and repair.
The problem is compounded by the fact that, in recent years, computers used in automated processes are increasingly installed as embedded devices with no ready means of local manipulation or input. Keyboards, mice and even monitors are often absent, leaving many controller devices accessible only over networked connections, via the SCADA system or a remote control station. Even for people who are comfortable troubleshooting a desktop system, and can make their way around an office computer or an enterprise server, administering multiple embedded computers from a remote SCADA system is often a new and knotty challenge.
The industry is already knocking about
These problems are increasingly worrisome for administrators tasked with keeping remote systems available 24 hours a day, 7 days a week. Preventive maintenance has become a critical consideration both for technology at the edge and for industrial solutions built on established technologies. Examples are easy to find. One leading multinational integrator of wind systems has been looking for a fully automated, fail-safe means of rescuing a computing platform’s software system: when devices are mounted high on an industrial-grade wind turbine, sending maintenance and repair personnel up to reinstall an operating system becomes a costly distraction. An East Asian systems integrator for on-board train solutions required an automated network solution to re-flash the computing hardware for any platform along the route. A Chinese high-speed rail company required a means of restoring computer systems at controller stations where the devices lacked any means of input or local systems monitoring. Finally, a European manufacturer of railway vehicles required an automated solution that could be called immediately by system users on board the vehicle, with no need for formal training or consultation with remote system administrators.
In each of these systems, the requirements were beyond what most other rescue software provides, running the gamut from regular maintenance updates as a preventive measure to rescues of crashed systems that are no longer able to boot up. These are industrial challenges, and they are too much for currently available rescue/rewrite solutions. None of the currently available solutions is a fully automated, standalone system capable of rescuing the entire platform from a permanent system crash. Instead, the market’s current offerings all suffer from a fatal design flaw: they operate in user-space, which means that if an operating system crashes from software corruption, remote human interaction is required to reset the system. While it is theoretically possible to configure a software solution that allows remote administration, doing so on a case-by-case basis requires a lot of detailed (and costly) coding and reliability testing. Even then, bugs are likely to creep in, and the result is a networked solution that still depends on human management and oversight.
Giving the customer what the customer wants
In contrast, the most effective platform rescue is one that can both rewrite the operating system once it has slowed down from corruption and also rescue it from a full crash where the machine can no longer even boot up. For this, an automated mechanism integrated with the hardware platform at the BIOS level is required, something that can re-write the entire suite of installed software – operating system, all applications, and the full system configuration – at the block level, from a cached copy, and then reboot into normal operations. A solution of this sort would be capable of resetting the entire platform to its earliest configuration state, effectively returning the entire software system to a pristine operating condition.
Moreover, as the four examples cited above indicate, a wide variety of solutions is needed to serve the full industrial horizon. For wind farm solutions, for instance, the recovery system will need to be local, standalone, and capable of automatically initialising in response to pre-set conditions with no human input whatsoever. In the case of a smart grid solution, the system must be capable of all that, and also able to respond to commands sent to it from a control centre (either on-site or remote). Finally, for the last train system, it must also be possible for a user to initiate a rescue at the physical device itself, without any need for specialised computer knowledge or administrative expertise.
Four fundamental needs
This paper presents four automated re-write procedures which, taken together, will allow any automated re-write mechanism to maximise predictive maintenance efficacy and convenience while automatically resolving all possible system failures attributable to software corruption. These procedures may be broken down thus:
1. Automated re-writes at scheduled intervals.
2. Fully automated recoveries on system crashes or slowdowns.
3. Remotely initiated automated recoveries.
4. Manual recoveries initiated at the device’s physical location.
BIOS-initiated block level recoveries
Simple file-level rewrites managed from within the OS are not enough. When software becomes so corrupted that the OS refuses to boot, an automated recovery can only be initiated before the operating system kicks in. Effective platform recovery automation must, therefore, be integrated with both hardware and software and start ahead of the OS. The device must support an alternate storage mechanism for use as a dedicated cache – from which the recovery image and operating environment are read – while the BIOS itself must be enhanced with a watchdog that measures boot times, registers when the platform is crashing or performing poorly, and then automatically calls up the recovery environment while monitoring the entire process for success or failure.
Perhaps most importantly, engineers should be able to simply and conveniently configure these automated system re-writes from within the OS, using only tags and a watchdog timer. After noting the precise time it takes for the system to boot up, administrators may open a dialogue to set a timeout. Whenever the platform’s boot process slows beyond the configured timeout, the BIOS will automatically call the recovery procedure, switching the system’s boot procedure over to an alternate recovery platform stored on a separate storage drive.
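The timeout logic described above can be sketched in a few lines. This is only an illustration in Python of the decision a BIOS-level watchdog would make; the real mechanism lives in firmware, and the names `BootWatchdog`, `suggest_timeout` and the safety `margin` are assumptions introduced here, not part of any particular product.

```python
from dataclasses import dataclass


def suggest_timeout(measured_boot_times_s, margin=1.5):
    """Derive a timeout from observed boot times.

    Takes the slowest measured boot and multiplies it by a safety
    margin (an assumed value; tune it for the platform in question).
    """
    return max(measured_boot_times_s) * margin


@dataclass
class BootWatchdog:
    timeout_s: float  # boot timeout configured by the administrator

    def should_recover(self, elapsed_s: float, boot_complete: bool) -> bool:
        """Return True when the boot has not finished within the timeout."""
        return (not boot_complete) and elapsed_s >= self.timeout_s
```

An administrator who has measured boots of 40, 42 and 44 seconds might call `suggest_timeout([40.0, 42.0, 44.0])` and pass the result to `BootWatchdog`; the watchdog then signals a recovery only if the boot is still incomplete once that time has elapsed.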
However, to guarantee accurate and uncorrupted re-writes, one more design feature must be carefully attended to: re-writes should take place at the block level, recopying the system bit by bit rather than file by file. When the system is copied at the block level, corruptions are far less likely to creep in than when software is copied over at the file level. File-level rewrites cannot compensate for corruption of the physical storage device, but bit-level rewrites can. Only by guaranteeing that every bit is successfully re-written to the platform’s physical storage medium – whether disk or solid state – can the system recovery mechanism guarantee that a successful recovery procedure has been completed, and that every fragment of data has been successfully returned to its initial post-install state.
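As a rough illustration of a block-level rewrite followed by verification, the sketch below copies a recovery image onto a target in fixed-size blocks and then reads the target back to confirm that every byte matches. It is written in Python against generic file-like objects; the block size, the use of SHA-256 and the function names are assumptions made for the example, not a description of any vendor's implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the example


def rewrite_blocks(image, target, block_size=BLOCK_SIZE):
    """Copy `image` onto `target` block by block.

    Both arguments are seekable binary file-like objects. Returns the
    number of bytes written and the SHA-256 digest of the image data.
    """
    digest = hashlib.sha256()
    written = 0
    image.seek(0)
    target.seek(0)
    while True:
        block = image.read(block_size)
        if not block:
            break
        target.write(block)
        digest.update(block)
        written += len(block)
    return written, digest.hexdigest()


def verify_blocks(target, length, expected_digest, block_size=BLOCK_SIZE):
    """Read back `length` bytes from `target` and compare digests."""
    digest = hashlib.sha256()
    target.seek(0)
    remaining = length
    while remaining > 0:
        block = target.read(min(block_size, remaining))
        if not block:
            return False  # target ended before the full image was read
        digest.update(block)
        remaining -= len(block)
    return digest.hexdigest() == expected_digest
```

Reading the target back after the write is what lets the recovery report success or failure for the whole image rather than assuming that each write landed intact.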
Automated recoveries at scheduled intervals
As systems deteriorate over time, the computing platform slows down. This can be a troublesome, debilitating problem for finely tuned automation networks that demand extreme precision at the process level. Preventive maintenance procedures are, for these situations, a critical tool in an administrator’s maintenance arsenal. A system recovery option that allows system administrators to configure a scheduled software recovery eradicates this worry.
To combat the persistent problem of system slowdowns, administrators must be able to configure a system to re-write itself at scheduled intervals. After estimating when the system begins to suffer and slow down from routine use, an administrator may set it to perform a scheduled recovery to its initial post-install state, allowing engineers to push the efficiency of their predictive maintenance routine further than it has ever gone before. By resetting a software platform at set times, administrators can guarantee that a healthy hardware setup will always function at the benchmarks for which it was initially configured. By periodically refreshing the entire computing platform, every computer in the network will consistently perform at its freshest post-install configuration.
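The scheduling decision itself is simple arithmetic. The following Python sketch, with hypothetical function names and an interval chosen purely for illustration, shows how a recovery agent might decide whether a scheduled re-write is due and when the next one falls.

```python
from datetime import datetime, timedelta


def recovery_due(last_recovery: datetime, interval: timedelta, now: datetime) -> bool:
    """Return True once at least one full interval has passed."""
    return now - last_recovery >= interval


def next_scheduled_recovery(last_recovery: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the first scheduled slot strictly after `now`.

    Slots fall at whole multiples of `interval` after `last_recovery`.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if now < last_recovery:
        return last_recovery + interval
    elapsed_intervals = (now - last_recovery) // interval
    return last_recovery + (elapsed_intervals + 1) * interval
```

For instance, with a last recovery on 1 January 2024 and a 30-day interval, a check made on 15 February 2024 reports that a recovery is due, and the next slot after that date falls on 1 March 2024.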
Fully automated recoveries on system crashes or slowdowns
An intelligent recovery mechanism will need to be configured to automatically re-write the system whenever it fails to boot within a specified period of time.
As mentioned above, administrators should be able to simply set a platform to return itself to a fresh post-install state whenever the boot process crashes. With access to a BIOS timeout counter, engineers can ensure that whenever the system fails to boot up by an appointed time, a truly smart recovery system will automatically re-boot it into a secure recovery environment, from which it will then re-write the entire platform, bit by bit. Once this is completed, the system will again re-boot into the original platform. If the new initialisation fails, then our smart recovery bot will continue attempting the re-writes until one of two basic conditions is met:
1. The system successfully recovers.
2. The system consistently fails, whereupon the recovery mechanism concludes its work and takes the system off-line. The number of times the system will attempt the recovery process will be configured by the system administrator.
Of course, if preferred, the system may also be allowed to continue its recovery attempts. Depending on administrator preferences, repeated boot failures may actually serve as a notification mechanism for critical maintenance.
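The retry behaviour described above can be sketched as a small loop. This is a minimal Python illustration, assuming two hypothetical callbacks: `rewrite`, which performs one full re-write, and `boots_ok`, which reports whether the platform booted afterwards. Passing `None` as `max_attempts` models the option of letting recovery attempts continue indefinitely.

```python
from typing import Callable, Optional, Tuple


def run_recovery(
    rewrite: Callable[[], None],
    boots_ok: Callable[[], bool],
    max_attempts: Optional[int],
) -> Tuple[str, int]:
    """Repeat re-writes until the system boots or the limit is reached.

    Returns ("recovered", n) if the n-th attempt succeeded, or
    ("offline", n) if all n permitted attempts failed. With
    max_attempts=None the loop only ends on a successful boot.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        rewrite()          # re-write the platform from the cached image
        if boots_ok():     # re-boot into the original platform and check
            return ("recovered", attempts)
    return ("offline", attempts)
```

Returning the attempt count alongside the outcome gives an administrator a simple signal: a device that needed several attempts, or that ended up offline, is a candidate for hardware maintenance.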
Remotely initiated automated recoveries
While full automation is useful, certain situations will demand user-initiated recoveries as well. User-initiated recoveries may be broken down into two basic types: recoveries initiated remotely, via a SCADA system or control room, and recoveries initiated by a user present at the physical device.
‘Remote’ recoveries include not only procedures that are initiated from a far-distant control room, but also those called from a local control station located on-site. The mechanism, in either case, is straightforward: when a system administrator perceives that a computer is crashing, or suffering from problematic slowdown, they may send a call to the device that begins an automated rewrite. With little more than a click of a mouse, the administrator will take the device offline and return the platform to its earliest, most pristine configuration. Because the rewrite is at the bit level, administrators can confidently recover the system whenever they feel the need, for whatever reasons they feel are justified.
Remote rewrites give administrators the power of tuning up or rescuing a system remotely, whenever the need arises. These recoveries can become a powerful tool beyond system rescues, allowing administrators to evaluate the condition of remote sites whenever they need.
Manual recoveries initiated at the device’s physical location
Manual recoveries, on the other hand, take place at the device itself. Using an automated system recovery, administrators may build a USB key that automatically triggers a full system rewrite. Manual recovery keys are perfect for user-initiated recoveries where authorised or trained engineers are not available, as is often the case at remote sites where computers are used as HMIs for heavy industrial machinery: for instance, ships, oil platforms or solar farms.
Manual recoveries are simplicity itself: after determining that the system requires software maintenance, users only need to insert the USB key into the device and then restart the computer. The recovery then proceeds automatically and securely with no further interaction required. Once the process is completed, the platform will either be returned to its earliest post-install state, or its permanent failure confirmed.
An effective smart recovery solution is a secure, fully automated, intelligent BIOS-level platform recovery that copies all software at the block level. It offers convenient usage modes and configuration options that will suit the needs of any industrial computing or automation system. Configuring the recovery solution should be the last process automation engineers complete before deployment, after every other element of the platform has been set up and configured. In today’s automation environment, every remote embedded computer should be equipped with the hardware and BIOS enhancements that deliver this powerful improvement on traditional maintenance and rescue procedures.