November 8, 2019
Cosmic Mishaps Part III: How Cosmic Rays Can Disrupt Your Design
Our last two blog posts discussed how cosmic radiation can damage a microcontroller. In space, cosmic radiation is a top-of-mind concern. At ground level, it is usually a 6σ event, among the last failure modes a designer would need to consider. But if you design products for high-reliability applications, cosmic radiation must still be considered.
Fighting Back Against Physics
Fortunately, many electronic devices have radiation-hardened counterparts: versions that have been engineered to be less susceptible to failure when a high-energy photon or particle strikes them. Unfortunately, radiation-hardened devices usually come out months or years later than their non-hardened counterparts, if they come out at all.
So what can you do with commercial off-the-shelf (COTS) electronic components already available at your favorite distributor? You can usually improve your design.
Go through each part in your schematic and ask yourself “What happens if this particular component fails at the worst possible time? What happens if two electronic components fail at once?” Does it mean an airplane will drop thousands of feet or perhaps crash? Does it mean a person can be electrocuted? Does it mean your firmware can be unlocked? Does it mean your ATM will spit out all its money? If so, you need to weigh the catastrophe against the cost of mitigating the failure.
Lower-voltage Components are More Susceptible to Single-Event Upsets
Cosmic rays don’t arrive from the heavens with a single energy level. There is a spectrum of energies, and high-energy cosmic rays are less common than low-energy cosmic rays. As electronics continue to shrink, chip designers are able to reduce the operating voltage of devices, which makes them more energy efficient but also more susceptible to the lower-energy cosmic rays.
As a charged particle travels through an insulator or semiconductor, it can liberate electrons and leave a trail of positively charged ions in its wake. These positive charges can shift the logic thresholds for a device.
That means, from a reliability standpoint, an integrated circuit that operates at 5.0 V logic will be less susceptible to a cosmic ray impact than one that operates at 1.8 V logic. The 5.0 V logic device might have logic thresholds of 3.5 V for logic-high and 1.5 V for logic-low, a potential difference of 2.0 V, while the 1.8 V device might have thresholds defined at 1.2 V for logic-high and 0.7 V for logic-low, a potential difference of only 0.5 V. Cosmic rays capable of generating that 0.5 V of electric potential are simply more plentiful than ones capable of inducing 2.0 V in a design. The threshold differences are even smaller for <1 V electronic circuit designs.
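The arithmetic behind that comparison can be sketched in a few lines. The threshold values are the illustrative ones from the paragraph above, with the higher value of each pair taken as the logic-high threshold; they are not datasheet figures, and the function name is made up for this example.

```python
def threshold_gap(v_high_min, v_low_max):
    """Voltage separating a guaranteed logic-low from a guaranteed logic-high."""
    return v_high_min - v_low_max

# Illustrative thresholds from the text (assumed, not taken from any datasheet).
gap_5v0 = threshold_gap(3.5, 1.5)
gap_1v8 = threshold_gap(1.2, 0.7)

print(f"5.0 V logic gap: {gap_5v0:.1f} V")
print(f"1.8 V logic gap: {gap_1v8:.1f} V")
print(f"The 5.0 V part has about {gap_5v0 / gap_1v8:.0f}x the margin")
```

A disturbance has to move a node across that gap to turn one logic level into the other, so the smaller gap can be crossed by a smaller deposited charge.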
Note, though, that many memory modules regulate their voltage internally, so they operate at the same fixed voltage regardless of the input voltage. In that case, the supply voltage does not affect the soft-error rate due to cosmic rays.
Sensor and Microcontroller Options
Triple Modular Redundancy
Critical sensors and microcontrollers can be installed in triplicate to provide redundancy. Things break, wires rub, rats chew cabling. Your sensor system is likely to be damaged by something far less exotic than a cosmic ray, but a redundant design philosophy will protect against a wide assortment of failures. As a recent example, redundant sensor design and programming might have allowed the flawed Boeing 737 MAX aircraft to recover after one of its angle-of-attack sensors failed.
A cosmic ray is statistically likely to strike only one of the three devices at a time, so two devices should produce matching results and the third device’s results can be ignored. A power-management IC can then independently cycle the power of the malfunctioning sensor or microcontroller to attempt to return it to service. If the third device continues to return flawed results after power cycling, it can be shut down and ignored until it is replaced. In fact, the design can continue to function until the last two sensors fail to agree. At that point, you presumably wouldn’t know which sensor to trust, and the electronic circuit would require repair or replacement.
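The voting step described above might look something like the following sketch. It uses a median (mid-value) voter, which is one common choice and not necessarily what any particular product does. The function name, the tolerance parameter, and the example readings are all assumptions made for illustration.

```python
def vote(readings, tolerance):
    """Vote on three redundant readings.

    Returns (value, suspect):
      value   -- the median reading, which at least one other reading supports
      suspect -- index of the reading that disagrees, or None if all agree
    Raises ValueError when no two readings agree within `tolerance`.
    """
    if len(readings) != 3:
        raise ValueError("expected exactly three readings")
    median = sorted(readings)[1]
    suspects = [i for i, r in enumerate(readings) if abs(r - median) > tolerance]
    if len(suspects) > 1:
        # Both neighbours disagree with the median, so no pair agrees at all.
        raise ValueError("no two readings agree")
    return median, (suspects[0] if suspects else None)

print(vote([20.1, 20.0, 87.3], tolerance=0.5))   # (20.1, 2)
print(vote([20.0, 20.1, 20.2], tolerance=0.5))   # (20.1, None)
```

The returned suspect index is what a supervisor would use to decide which device to power-cycle. Once one device has been retired, the same idea reduces to comparing the two survivors and stopping when they disagree.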
Intelligent Firmware
If your design is less critical, and you cannot afford to include redundant sensors, can you run your data through a computational sanity check? A digital filter might identify and reject results from a sensor that are physically impossible or statistically unlikely. For example, a temperature sensor that steadily increases by 2 °C/min over the course of 10 minutes is unlikely to suddenly increase by 4098 °C/min at the next measurement interval. Does your microcontroller take the input value, no matter what it is, and act on it? Or can you implement programming safeguards to ensure that only reasonable values affect the state of your device?
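One simple form of such a safeguard is a rate-of-change check. The sketch below is only an illustration: the class name, the 10 °C/min limit, and the one-minute sample interval are assumptions, and a real design would pick limits from the physics of the thing being measured.

```python
class PlausibilityFilter:
    """Reject readings that change faster than the process physically can."""

    def __init__(self, max_rate):
        self.max_rate = max_rate   # largest believable change, in degC per minute
        self.last_good = None      # most recent accepted reading
        self.elapsed = 0.0         # minutes since that reading was accepted

    def update(self, value, dt=1.0):
        """Return (value_to_use, accepted). `dt` is minutes since the last call, > 0."""
        self.elapsed += dt
        if self.last_good is not None:
            rate = abs(value - self.last_good) / self.elapsed
            if rate > self.max_rate:
                return self.last_good, False   # hold the last believable value
        self.last_good = value
        self.elapsed = 0.0
        return value, True

f = PlausibilityFilter(max_rate=10.0)
for reading in (20.0, 22.0, 4120.0, 26.0):
    print(reading, f.update(reading))
```

In this run the 4120.0 reading, a jump of 4098 °C in one minute, is rejected and the filter keeps reporting 22.0. The following reading of 26.0 is accepted because it is judged against the last good value over the two minutes that have passed.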
Error Correction
Electrical engineers have invented a number of clever schemes that detect data transmission errors and, in many cases, correct them.
Manchester encoding transmits each bit value as two (or more) logic levels separated by a transition rather than as a single logic level. A bit period that lacks the expected transition signals an error, and the receiver can then request retransmission of the message.
Manchester Encoding can transmit clock and data on a single line. The top line shows the clock signal, the second line the data, and when the two values are combined via XOR, they become the encoded signal shown on the bottom line.
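The XOR relationship described in the caption can be sketched in code. This is a minimal illustration, assuming a clock that is high for the first half of each bit period; the opposite clock phase gives the inverted convention. The function names are made up for this example.

```python
def manchester_encode(bits):
    """Return two half-bit levels per data bit: the bit XOR a clock of 1 then 0."""
    levels = []
    for bit in bits:
        for clock in (1, 0):
            levels.append(bit ^ clock)
    return levels

def manchester_decode(levels):
    """Recover the data bits; raise ValueError if a bit cell has no transition."""
    if len(levels) % 2:
        raise ValueError("incomplete bit cell")
    bits = []
    for i in range(0, len(levels), 2):
        first, second = levels[i], levels[i + 1]
        if first == second:
            raise ValueError(f"no mid-bit transition in bit cell {i // 2}")
        bits.append(second)   # second half is bit XOR 0, i.e. the bit itself
    return bits

encoded = manchester_encode([1, 0, 1, 1])
print(encoded)                      # [0, 1, 1, 0, 0, 1, 0, 1]
print(manchester_decode(encoded))   # [1, 0, 1, 1]
```

If an upset flips a single half-bit level, that bit cell no longer contains a transition and the decoder raises an error, which is the point at which a receiver would ask for the message again. Note that this check catches a single flipped level per cell; flipping both halves of one cell produces a valid but wrong bit.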
Forward Error Correction
One form of forward error correction uses a convolutional encoder that alters the original data in a reversible way. This technique creates a data sequence that interlinks sequences of bits, so each transmitted bit depends on several input bits. The receiver of this data can correct one incorrect bit and detect two incorrect bits before requiring retransmission.
This animation of a convolutional encoder is from the article “What is Bluetooth 5? Learn about the Bit Paths Behind the New BLE Standard” at the website AllAboutCircuits.com
Since the data output depends on data previously entered into the state-machine, only certain states are accessible from the previous one. This is what allows the “Forward Error Correction” to occur.
Each state can transition to only two other states. By keeping track of the data, and the allowable states, the receiver can detect when invalid data has entered the data stream. This state diagram is courtesy AllAboutCircuits.com
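To make the state-machine idea concrete, here is a small sketch of a convolutional encoder and a hard-decision Viterbi decoder. It uses the common textbook rate-1/2 code with a two-bit state and generator taps 111 and 101, which is an assumption chosen for brevity and is not necessarily the encoder shown in the figures above. All function names are hypothetical.

```python
G = (0b111, 0b101)   # generator taps of the textbook rate-1/2, K=3 code (assumed)

def parity(x):
    return bin(x).count("1") & 1

def step(state, bit):
    """Return (next_state, output_pair) for a 2-bit state and one input bit."""
    reg = (bit << 2) | state              # newest bit sits to the left of the state
    return reg >> 1, tuple(parity(reg & g) for g in G)

def encode(bits):
    state, out = 0, []
    for bit in list(bits) + [0, 0]:       # two tail bits return the encoder to state 0
        state, pair = step(state, bit)
        out.extend(pair)
    return out

def decode(received):
    """Viterbi search for the allowable state path closest to `received`."""
    inf = float("inf")
    metrics = [0, inf, inf, inf]          # the encoder is known to start in state 0
    paths = [[], [], [], []]
    for i in range(0, len(received), 2):
        pair = tuple(received[i:i + 2])
        new_metrics, new_paths = [inf] * 4, [None] * 4
        for state in range(4):
            if metrics[state] == inf:
                continue
            for bit in (0, 1):            # each state has only two successors
                nxt, out = step(state, bit)
                cost = metrics[state] + sum(a != b for a, b in zip(out, pair))
                if cost < new_metrics[nxt]:
                    new_metrics[nxt], new_paths[nxt] = cost, paths[state] + [bit]
        metrics, paths = new_metrics, new_paths
    return paths[0][:-2]                  # best path ending in state 0, tail removed

message = [1, 0, 1, 1, 0]
sent = encode(message)
damaged = list(sent)
damaged[3] ^= 1                           # simulate a single upset in transit
print(decode(damaged) == message)         # True
```

The decoder never sees the original bits. It only knows which state transitions are allowed, and it picks the allowed sequence that disagrees with the received data in the fewest positions.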
There are other error detection and correction schemes as well, such as parity bits, cyclic redundancy checks (CRCs), and Hamming codes.
Summary
Cosmic rays are unlikely to disrupt your system, but an upset is always possible. As you get your design ready for market, be sure to consider the possible failure modes and what they might mean for the safety and longevity of your product.