The Coldboot Attack


John Little is an R&D Systems Engineer working on flying balloons in new ways to maximize performance and data collection. His current project is figuring out how to fly a balloon into a hurricane. Outside of work, he's a folk dance teacher, musician, and bike aficionado.

At WindBorne, the extreme environments we operate in can pose a challenge for our onboard flight software. Most commercial electronics are rated for temperatures down to -40 °C, yet we consistently collect critical weather data in the upper atmosphere, where temperatures reach as low as -80 °C. At these temperatures, batteries have difficulty providing enough current to power our onboard electronics. We’ve got plenty of tricks up our sleeve to achieve reliable power management, but even so, in the coldest conditions our microcontroller can occasionally reboot due to limited available current.

Here’s the problem with rebooting: when a computer reboots, its dynamic working memory is not saved. The capacitors and transistors in the memory module can discharge at different rates, so when the device reboots, it might start up with entirely different values stored. Because of this, you have to reload memory from a known initial state, stored in the non-volatile memory.

In practice, this means that we might lose some of our onboard state when we reboot. There’s a small set of state variables that the balloon cannot compute on its own after rebooting, yet they are critical to continuing a safe flight. They include the following:

The most recent estimate of the safe altitude lower bound. Without this info, the controller might descend into a hazardous storm system.
The most recent estimate of the balloon’s weight. Without this info, the controller might approach the altitude ceiling too quickly, where the balloon might burst.
How much charge has been drawn so far from the batteries. Without this info, the power management system might spend too much energy when the batteries are low, instead of waiting for sunlight to come up and recharge them through the solar panels.

These variables keep changing as the balloon flies, cannot be measured by the balloon’s own sensors on a single boot, and are critical to correct autonomous decision-making by the onboard controller.

So how do we deal with this? To everyone’s surprise, we’ve fixed it with software.

Our onboard microprocessor has two main types of memory storage: Flash and RAM. RAM can be read and written many times throughout a flight. Think of it like a big scratchpad that stores the balloon’s state as it flies. However, as mentioned, when the processor reboots, bits may or may not flip, and the only way to guarantee what is in RAM is to simply rewrite it.

Flash, by comparison, is slower to write, but can maintain its state when it powers off and back on. When you turn a computer on, you expect that the flash maintains the same memory that you loaded onto it. Therefore, we program our flight code to flash memory once before launching our balloons, and then perform an initialization procedure each time the microcontroller powers on. On many computers, this initialization is typically the first procedure before any other computation is run. Here’s ours:


The first loop copies a section of flash to RAM, which initializes any global state variables that start with non-zero values. The second loop zeroes an additional section of memory, which holds the state variables that start with the value 0. Even that region needs to be initialized, as it might be full of non-zero values that persisted from before the last reboot!
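The two loops described above follow the standard embedded startup pattern, and can be sketched as a host-side C simulation. This is not the flight code: the array names, sizes, and initial values below are hypothetical stand-ins for the linker-defined flash and RAM regions on a real microcontroller.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical region sizes; on real hardware these come from the linker script. */
#define DATA_WORDS 4
#define BSS_WORDS  4

/* Stand-in for the flash image holding initial values of non-zero globals. */
const uint32_t flash_data_image[DATA_WORDS] = {0xDEADBEEFu, 42u, 7u, 1u};

/* Stand-ins for the two RAM regions that must be initialized at boot. */
uint32_t ram_data[DATA_WORDS]; /* globals with non-zero initial values */
uint32_t ram_bss[BSS_WORDS];   /* globals that must start at zero */

/* Simulate RAM holding leftover or decayed values from before a reboot. */
void simulate_dirty_ram(void) {
    memset(ram_data, 0xA5, sizeof ram_data);
    memset(ram_bss, 0x5A, sizeof ram_bss);
}

/* Sketch of the startup initialization. */
void startup_init(void) {
    const uint32_t *src = flash_data_image;
    uint32_t *dst = ram_data;

    /* Loop 1: copy initial values from flash into RAM. */
    while (dst < ram_data + DATA_WORDS) {
        *dst++ = *src++;
    }

    /* Loop 2: zero the region for globals that start at 0. */
    dst = ram_bss;
    while (dst < ram_bss + BSS_WORDS) {
        *dst++ = 0u;
    }
}
```

Calling `simulate_dirty_ram()` followed by `startup_init()` leaves `ram_data` matching the flash image and `ram_bss` all zeros, whatever the RAM held beforehand.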

If flash memory is only programmed once before launch, and RAM doesn’t persist across reboot cycles, we still need a way to preserve these state variables.

Here’s where the magic happens.

Remember that bit about how volatile memory doesn’t persist across reboots? There’s an exception to this that we can exploit: when a computer is functioning at extremely low temperatures, as the device shuts off, its transistors and capacitors discharge much more slowly. That means that there are precious seconds to minutes of time before bits start flipping.

This effect can be exploited by hackers in a method known as a “coldboot attack,” and was demonstrated as a practical attack methodology by Princeton researchers in 2008. Usually the coldboot attack is performed by cooling a target computer with liquid nitrogen. However, in the upper atmosphere, where it’s already cold enough, we can perform this hack on ourselves.

To execute this attack on our own device, before writing the flash memory to RAM, we look at what has persisted in the RAM from the last boot, and copy it elsewhere in static memory. We then copy in the rest of the initial data, overwriting the original section, and verify the rescued copy using checksums.

In the reset handler, before writing the flash memory to RAM, we look at what’s already in RAM and copy it to another location. This is the *very first code* that runs on boot.
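The rescue-then-verify sequence can be sketched as a host-side C simulation. Everything named here is an assumption for illustration: the struct layout, the field names, the choice of CRC-32 as the checksum, and the step that restores a verified copy into the live state. On real hardware, the rescue buffer would also have to live in a region that the normal startup initialization does not touch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of the state worth rescuing across a reboot. */
typedef struct {
    float    min_safe_altitude_m; /* safe altitude lower bound */
    float    balloon_mass_kg;     /* latest weight estimate */
    float    charge_drawn_mah;    /* charge drawn from the batteries */
    uint32_t checksum;            /* CRC-32 over the fields above */
} persistent_state_t;

/* Stand-in for the RAM the flight code uses while running. */
persistent_state_t live_state;
/* Stand-in for a buffer that startup initialization leaves alone. */
persistent_state_t rescued_state;

/* Standard bitwise CRC-32 (reflected polynomial 0xEDB88320). */
uint32_t state_crc32(const uint8_t *p, size_t n) {
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Called during normal operation whenever the state is updated. */
void seal_state(persistent_state_t *s) {
    s->checksum = state_crc32((const uint8_t *)s,
                              offsetof(persistent_state_t, checksum));
}

int state_is_valid(const persistent_state_t *s) {
    return s->checksum == state_crc32((const uint8_t *)s,
                                      offsetof(persistent_state_t, checksum));
}

/* Runs first on boot. Returns 1 if the old state survived, 0 otherwise. */
int rescue_state_on_boot(void) {
    /* 1. Copy whatever persisted in RAM before anything overwrites it. */
    memcpy(&rescued_state, &live_state, sizeof rescued_state);

    /* 2. Stand-in for normal initialization overwriting the live region. */
    memset(&live_state, 0, sizeof live_state);

    /* 3. Trust the rescued copy only if its checksum still matches. */
    if (!state_is_valid(&rescued_state)) {
        return 0; /* bits decayed; fall back to defaults */
    }
    live_state = rescued_state;
    return 1;
}
```

The checksum is what makes the trick safe to rely on: if the reboot lasted long enough for even a single bit to flip, verification fails and the controller starts from its default values rather than acting on corrupted state.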

That’s all well and good, but does it work? Yes, and quite well, in fact! Balloons with this firmware end up retaining their recently calculated and uplinked state variables, and can continue to fly autonomously while avoiding unnecessarily expending limited ballast or lift gas. This is just one of the many off-the-wall hacks that allow us to reliably fly our electronics within extreme system-level weight, temperature, power, and precision constraints.

We're always looking for engineers who thrive on turning 'that'll never work' into 'how did you make that work?' So if you're the kind of person who gets excited about solving impossible constraints with seemingly ‘inadvisable’ solutions, we’d enjoy working with you. Drop us a line!
