The CPUs used in space applications typically lag ~20 years behind the consumer market, due to the need to be very sure the kinks have been ironed out and the need to produce variants that can withstand the environmental conditions outside Earth's atmosphere.
EDIT: Also worth pointing out - the price of one of these specialized chips is in the six figures, due to the tighter tolerances and low production volumes.
Also worth mentioning that chips from 20 years ago are built on old fabrication technology, with ginormous transistor sizes (compared to modern CPUs). The RAD750 has a minimum feature size of 150nm to 250nm; a Ryzen Zen 3 chip has a feature size of 7nm, two orders of magnitude smaller!
Feature size is super important in radiation-exposed environments because susceptibility to bit flips from a stray proton (or other charged particles flying around in space) goes up as transistor size goes down. So small transistors are more likely to be activated by stray charged particles, causing all kinds of interesting and exciting problems (which you don't want to be debugging from one end of a 12-light-minute comms channel).
> The RAD750 has a minimum feature size of 150nm to 250nm; a Ryzen Zen 3 chip has a feature size of 7nm,
Yup. And while the transistor sizes haven't really shrunk by the 35x that this suggests, they've still shrunk by a lot. And, of course, the area has shrunk by this amount squared.
> causing all kinds of interesting and exciting problems
Sometimes super exciting. Latchup can turn the whole package into a white-hot parasitic thyristor connecting Vcore to ground at low impedance. Rad-hard variants can largely eliminate this possibility.
This does make one wonder about an architecture that has 4 or more identical processors hooked up to the same inputs and two watchdogs watching them (and each other). If any processor starts disagreeing with the rest it is instantly reset. If any processor starts to draw too much or too little current it is reset. If either watchdog stops responding to the other it is reset.
Keeping the working data synchronized would be the real trick. One could imagine all of these CPUs are hooked to a single bank of redundant and ECC stabilized memory, and all access goes through the watchdog processes which will only let through the ones that are in agreement.
The end result could be a system that is lighter and faster than the traditional RAD hardened system simply because it's built on a smaller process. The downside is the enormous complexity of the watchdog systems; they would be very expensive to get right. Also, synchronization is one of the hardest problems in computer science. It's basically cache invalidation on steroids.
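To make the voting idea concrete, here's a minimal sketch in generic C (three units instead of four for brevity, with made-up hooks like reset_cpu; this is not any real flight system, just an illustration of the principle):

    /* Toy sketch of majority voting plus current policing.
     * Assume each redundant CPU publishes a result word and a measured
     * supply current; the (hypothetical) watchdog picks the majority
     * value and resets any unit that disagrees or draws an
     * out-of-range current. */
    #include <stdint.h>
    #include <stdbool.h>

    #define N_CPUS    3
    #define I_MIN_mA  50     /* invented supply-current window */
    #define I_MAX_mA  400

    struct cpu_report {
        uint32_t result;      /* output of the lockstepped computation */
        uint32_t current_mA;  /* measured core current */
    };

    /* Hypothetical hook provided by the power/reset controller. */
    void reset_cpu(int id);

    /* Bitwise 2-of-3 majority: a bit is set in the output iff it is
     * set in at least two of the inputs. */
    static uint32_t majority3(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) | (a & c) | (b & c);
    }

    uint32_t vote_and_police(const struct cpu_report r[N_CPUS])
    {
        uint32_t voted = majority3(r[0].result, r[1].result, r[2].result);

        for (int i = 0; i < N_CPUS; i++) {
            bool disagrees   = (r[i].result != voted);
            bool bad_current = (r[i].current_mA < I_MIN_mA) ||
                               (r[i].current_mA > I_MAX_mA);
            if (disagrees || bad_current)
                reset_cpu(i);   /* kick the outlier, keep flying on the rest */
        }
        return voted;           /* only the agreed value leaves the watchdog */
    }

The hard part, as noted above, isn't this voting loop; it's resynchronizing a freshly reset unit with the survivors without stalling the others.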
> This does make one wonder about an architecture that has 4 or more identical processors hooked up to the same inputs and two watchdogs watching them (and each other). If any processor starts disagreeing with the rest it is instantly reset. If any processor starts to draw too much or too little current it is reset. If either watchdog stops responding to the other it is reset.
This is a common set of techniques in critical systems. The smallsat built by students that I mentored didn't have voting, but it had processors with moderately sophisticated watchdogs performing mutual-power-monitoring and simple hardware failsafes. Aerospace control systems often run in lockstep and have voting, etc.
> Keeping the working data synchronized would be the real trick. One could imagine all of these CPUs are hooked to a single bank of redundant and ECC stabilized memory, and all access goes through the watchdog processes which will only let through the ones that are in agreement.
Typically you make the memory redundant too, and just ensure that input and outputs are common and all code is deterministic. e.g. There's Tandem / HP NonStop which use these techniques.
> The end result could be a system that is lighter and faster than the traditional RAD hardened system simply because it's built on a smaller process.
Handling upsets by voting makes a lot of sense. But radiation can cause permanent damage to small-geometry circuits. And even with fast mechanisms to crowbar power in the event of a latchup, you're not really sure that you'll save the day.
It's a great way to handle things on the cheap for cutting edge payloads and research projects, though.
>>Typically you make the memory redundant too, and just ensure that input and outputs are common and all code is deterministic. e.g. There's Tandem / HP NonStop which use these techniques.
Tandem was software fault tolerant - the Stratus machines were fully hardware fault tolerant and ran code in parallel on separate logical CPUs to test for faults. Truly neat machines. Also, it was a PL/I based machine - even the OS.
Yes, you're right. Turns out this doesn't apply to all of NonStop. I was thinking of Tandem NonStop CLX which my brother worked on and was lockstepped.
This is rather like how the old Stratus superminis worked.
12(?) Motorola 68K chips were wired into multiple 'logical' CPUs, and programs executed in parallel on each. If a discrepancy arose, majority won and 'wrong' logical CPUs were taken out of service. It would also automatically call Stratus's remote service, report the failure and order a replacement board :-)
If we compare the RAD750 against the 5nm Apple M1, the newer chip has about 1660 times as many transistors per mm^2. Minimum feature size (i.e. line widths) may not have kept pace with the nanometer-based fab node names, but transistor sizes definitely have.
Eh. Don't compare the RAD750. It's better to compare something that's not domain-specific and rad-hard.
Compare the PPC 740, which was ~300,000 transistors/mm^2 at 260nm. Apple M1 is ~130,000,000 transistors/mm^2 at 5nm. (260/5)^2 =~ 2700x, but the actual difference is ~430x.
So it's not as bad as line widths, but it still isn't quite keeping pace.
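For anyone who wants to redo that back-of-the-envelope math, here's a trivial sketch using the rough densities quoted above (illustrative inputs only, not official specs):

    #include <stdio.h>

    int main(void)
    {
        double ppc740_per_mm2 = 300e3;   /* ~300k transistors/mm^2 at 260nm (quoted above) */
        double m1_per_mm2     = 130e6;   /* ~130M transistors/mm^2 at 5nm (quoted above)   */

        double by_node_name  = (260.0 / 5.0) * (260.0 / 5.0);   /* ~2700x */
        double by_line_width = 260.0 / 5.0;                     /* ~52x   */
        double actual        = m1_per_mm2 / ppc740_per_mm2;     /* ~430x  */

        printf("node-name scaling: %.0fx, line widths: %.0fx, actual density gain: %.0fx\n",
               by_node_name, by_line_width, actual);
        return 0;
    }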
The node sizes used in marketing today haven't actually had any relation to feature size for at least 10 years. You can read some of the actual feature details for 7nm on Wikipedia.
If "feature size" is the width of a fin, a 7nm feature will typically be ~15nm across and spaced ~15nm apart (ie the center of one transistor is 30nm away from the center of the transistor next to it).
It's worth noting that the actual thickness of a transistor isn't strictly the most important factor. Which is why that wiki page shows fin pitch (the width of the transistor plus the space between transistors), because that's the metric that you get transistor density from.
Fun fact, a friend of mine bought an old Mac, installed Linux, and used it to do profiling for some code he wanted to fly. Great way to quickly find the problem areas!
There are memories available in shielded packages. SRAM is used wherever possible. Some of the rad hard micros have on board ECC, others don't and you'd have to spin your own with an ASIC or FPGA or have another mitigation strategy.
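For the parts without onboard ECC, one common roll-your-own mitigation (just an illustrative sketch, not any particular flight code) is to keep critical state in three copies, vote on every read, and scrub periodically so a flipped bit doesn't linger:

    #include <stdint.h>

    typedef struct {
        volatile uint32_t copy[3];   /* three copies of one critical word */
    } tmr_word;

    /* Bitwise 2-of-3 majority vote. */
    static uint32_t vote(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) | (a & c) | (b & c);
    }

    uint32_t tmr_read(tmr_word *w)
    {
        return vote(w->copy[0], w->copy[1], w->copy[2]);
    }

    void tmr_write(tmr_word *w, uint32_t value)
    {
        w->copy[0] = value;
        w->copy[1] = value;
        w->copy[2] = value;
    }

    /* Call regularly from a housekeeping task to repair any copy that
     * has drifted from the majority. */
    void tmr_scrub(tmr_word *w)
    {
        tmr_write(w, tmr_read(w));
    }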
Aren't these Motorola-lineage architectures specifically good at doing lockstep computation? I.e. there are a bunch of processors executing identical instructions and comparing the outputs?
We're talking launching stuff into space, every kg of extra weight is critical, particularly for scientific missions that are trying to do as much as possible. Also no amount of radiation shielding is going to make you perfectly shielded from cosmic rays.
It's actually a chronic problem with Voyagers 1/2. The teams that maintain them regularly experience weird, unplanned behavior because a cosmic ray bit-flipped some register or other, and then they have to manually figure out what erroneous command got executed and how to recover from it, and those are extremely simple systems compared to what's running on Perseverance.
So yeah, you need radiation hardened parts in space or you're just begging for a dead probe.
There are three computer systems on each Voyager. They are dedicated to specific tasks (the Computer Command System, the Flight Data System, and the Attitude and Articulation Control Subsystem), each with two instances for redundancy.
Right, but that redundancy to my understanding is in case of component/system failure. It's a backup, not a judge-actor or error-correcting architecture.
"Engineers successfully reset a computer onboard Voyager 2 that caused an unexpected data pattern shift, and the spacecraft resumed sending properly formatted science data back to Earth on Sunday, May 23. Mission managers at NASA's Jet Propulsion Laboratory in Pasadena, Calif., had been operating the spacecraft in engineering mode since May 6. They took this action as they traced the source of the pattern shift to the flip of a single bit in the flight data system computer that packages data to transmit back to Earth. In the next week, engineers will be checking the science data with Voyager team scientists to make sure instruments onboard the spacecraft are processing data correctly. "
The single biggest factor in all space design is cost per kg lifted. The only thing more costly is lifting it successfully and then having it break.
For a rover it's not just lifting it, you have to put it back down once it gets to Mars. That's a 3-stage process with the aeroshell, parachute, and powered descent.
Heavier rover? You'll need more fuel for the skycrane's engines. And that fuel adds weight, so don't forget that you'll need more fuel to lift the extra fuel.
Now your rover and skycrane are bigger and heavier. Do you need a bigger parachute? Does everything still fit in the aeroshell? Is the fuel capacity for orienting and stabilizing the aeroshell still sufficient with the increased mass?
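That "fuel to lift the fuel" spiral is just the rocket equation. A back-of-the-envelope sketch, with every number invented purely for illustration (these are not Mars 2020 values):

    #include <math.h>
    #include <stdio.h>

    /* Tsiolkovsky: m_propellant = m_dry * (exp(dv / v_e) - 1).
     * The exponential is what makes extra dry mass cost more than its
     * own weight in propellant. */
    int main(void)
    {
        double v_e     = 2300.0;  /* effective exhaust velocity, m/s (assumed) */
        double dv      = 700.0;   /* delta-v for powered descent, m/s (assumed) */
        double m_dry_a = 900.0;   /* dry mass with the lighter rover, kg */
        double m_dry_b = 1000.0;  /* dry mass with 100 kg of extra payload, kg */

        double prop_a = m_dry_a * (exp(dv / v_e) - 1.0);
        double prop_b = m_dry_b * (exp(dv / v_e) - 1.0);

        printf("propellant: %.0f kg vs %.0f kg\n", prop_a, prop_b);
        printf("100 kg of extra rover costs about %.0f kg of extra propellant\n",
               prop_b - prop_a);
        return 0;
    }

And that's only the last stage; the heavier entry mass ripples back through the parachute and aeroshell sizing too.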
It's not just the weight, there's also the complexity of handling 10 (or whatever) systems. It's not always easy to know when a system fails. How can you be sure that the output is invalid?
Detecting complete blackouts is fairly easy (a watchdog tickle absent for too long), but bad output due to flipped memory is much harder. Consensus-style systems can be used, but they add complexity of their own, and the point here was to replace a hardened single point of failure with redundancy.
There is also a time-domain problem with regard to total integrated dose. All the parts will degrade over time. Radiation hardened parts can have lives of 15 years or more in a GEO environment. A commercial part might be lucky to make it 1 year. Radiation shielding is not as easy as you may think. Heavy ions in space can pass through anything and will impart a charge in whatever they hit as well as creating secondary sources of radiation. Neutrons are not as common in space but can be found near reactors; they are also very penetrating.
There's a dumb instinct to say "that's obsolete" because everything we experience says that we would personally lose the ability to function with a computer from 20 years ago. But it's more than powerful enough to wander around Mars for a while with specialized software. It's not like it's running electron apps.
If I can get a Linux distro to boot into a shell on it, I can do work on it. It'll probably still be nicer than vim over ssh.
The first version of the Raspberry Pi was pretty close in specs to computers from that time and it works fine - though the G3 Mac shipped with way less RAM by default. The first-gen Pi on my desk even boots into a usable desktop.
There's a lot of bloat in modern software, such as browsers, but most is in the form of code/features that are rarely used.
Maybe someday we'll see electron versions that can be stripped to stuff you actually need.
It's not obsolete if it's running the apps that you need.
I've done embedded work on MCUs with less horsepower than you probably have in your headphones.
Sure, it's not going to run Windows 10, but it'll reliably control that device until the mechanical parts wear down. Same applies here, where the advantage of huge transistors is much greater resistance to cosmic rays, and a 20-year-old design means all the bugs and quirks are well understood.
The latest and greatest is nice, but not always the right choice for the problem at hand when you have the option of custom firmware and you aren't connected to a hostile internet.
> I've done embedded work on MCUs with less horsepower than you probably have in your headphones.
MCUs are awesome. Given a 100MHz single-core M3 and half a meg of SRAM, some serious fun can be had.
(honestly 512K of SRAM would be insane, but doable nowadays)
People don't realize that, with a purpose-built runtime and well-crafted code, 100MHz is a LOT of power.
Modern OSes, complicated software stacks, preemptive multi-tasking, and the need for flexible general purpose usage, eat up a huge percent (I'd say well over ~70%) of the usable computing capacity in the devices we deal with every day.
Now I write web services that spend more CPU and RAM decoding JSON than entire runtimes I helped design. Go figure.
Of course productivity is a bit higher, having to design and code the most optimal binary serialization formats for every single task takes a lot of time and effort, but wow is the end result fast and memory efficient!
> Modern OSes, complicated software stacks, preemptive multi-tasking, and the need for flexible general purpose usage, eat up a huge percent (I'd say well over ~70%) of the usable computing capacity in the devices we deal with every day.
I think a big part of it is security. Having to isolate address spaces is expensive.
(Sometimes I like to imagine being able to run a unikernel OS where applications are written in something like safe Rust, everything runs in the same address space, and the OS delegates enforcing memory separation to a trusted compiler. So, things like writing to disk or the network are just regular function calls with no context switch overhead. In this model, unsafe code is treated sort of the way we treat kernel code or setuid binaries in Linux: regular users aren't allowed to run arbitrary unsafe code. With Spectre, I don't know if this sort of approach is still feasible on modern processors, or if it's possible for a compiler to guarantee that the compiled code is not susceptible to any known side-channel attacks.)
> Sometimes I like to imagine being able to run a unikernel OS where applications are written in something like safe Rust, everything runs in the same address space,
Better than that: static allocation. Embedded typically doesn't allow malloc anyway, so give me a platform that has the world's simplest MMU and static bounds checks that are put into write-once-read-many memory at boot. :-D
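For flavor, here's roughly what that looks like in generic C (no particular RTOS, purely illustrative): everything is sized at compile time, so worst-case memory use is known before boot, and a linker script (not shown) could lock the tables down afterwards.

    #include <stdint.h>
    #include <stddef.h>

    /* No heap at all: every buffer and stack is statically sized. */
    #define MAX_SAMPLES  256
    #define MAX_TASKS    8

    static uint16_t sample_buf[MAX_SAMPLES];                        /* fixed sensor buffer */
    static uint32_t task_stacks[MAX_TASKS][512 / sizeof(uint32_t)]; /* fixed task stacks   */

    /* Bounds are checked against compile-time constants, not against
     * whatever a runtime allocator happened to hand out. */
    int store_sample(size_t idx, uint16_t value)
    {
        if (idx >= MAX_SAMPLES)
            return -1;            /* reject instead of corrupting memory */
        sample_buf[idx] = value;
        return 0;
    }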
I was thinking in terms of general-purpose computing, but yeah, static-only allocation makes a lot of sense in some contexts (embedded, some HPC workloads) and would sidestep the usual overhead of the allocator.
It's only obsolete if it doesn't have the features you need.
Which, in this case, include radiation resistance, which is much harder, if not impossible, to achieve with modern chips with very tiny transistors. Large transistors can much more easily shrug it off.
The old chip is not just "good enough". It's the only thing that works.
I am also an old computer hobbyist, with an apparent emphasis on Macs between '90 and '05. I used to try to get them up to the highest spec and newest software they could support, but found that they feel so much more usable when running either their original release OS or something in the middle. I have used Mac OS 9 and OSX 10.4 on my '99 iMac DV (PPC 750/G3, 400MHz), and they feel worlds apart to use. Same with my Performa 6116CD (PPC 601, 60MHz) on System 7 vs. OS 8.
(Of course, the rules are totally different with BSD and Linux, where you generally want the newest software that supports the architecture and it works great as long as you choose your packages wisely.)
Also worth mentioning that the Perseverance rover is mostly based on the Curiosity rover, launched 2011, for which development started in 2004 (https://en.wikipedia.org/wiki/Mars_Science_Laboratory#Histor...). If you consider that, the CPU doesn't sound that outdated anymore...
This. I worked on satellite control systems in that time frame on these exact boards (OSE and VxWorks); at the time they were considered incredibly powerful for space applications. It was quite the step up from other embedded systems.
By comparison, I've read in an article that SpaceX explicitly decided to use redundant off-the-shelf computer hardware in their spacecraft and rockets, citing huge unit and dev-board costs and lead times measured in years.
So far it has worked fine for their near-Earth applications, and it will be interesting to see how they handle long-term deep space in this regard with Starship. I guess they could still fly a rack encased in lead, given the scale and performance of Starship. ;-)
Radiation hardening. Space science missions are analyzed from a perspective of risk when designing the hardware specifications. The CPU specifically is called a RAD750. It is essentially the PowerPC 750 architecture, with radiation hardening.
The key risk vector for semiconductors in a space environment is things like bit flips. So, when a high-energy particle collides with a transistor in DRAM and a 0 becomes a 1, that is a problem. There are deep architecture changes made to the RAD750 specific to deep-space missions; for example, SRAM is used instead of DRAM, as SRAM is less susceptible to bit flips from high-energy particles.
There are other rad-hard CPU alternatives, such as the LEON. But ultimately, a space vehicle does not need a lot of compute to have a successful mission. It runs VxWorks, a real-time operating system. And a far, far higher risk of failure comes from bit flips than from not having a bunch of compute it doesn't need.
Ingenuity is also basically disposable. If it has a complete failure, there are no critical mission tasks that are blocked or interrupted. When the device is basically a bonus-time test with a fixed weight limit, you can take a lot more risks to pack more into that weight, if you accept a reasonably high probability of failure.
They probably also expect to learn something from a failure. If this is a more modern chip whose radiation tolerance is being tested, this is a good place to give it a trial run. If it fails, they get new information; if it succeeds, they now have a successful flight with it to Mars and some data.
It's a great 2 for 1 thing that NASA typically does. They'll win even when they fail because they learned something and they won't fail that way again.
Is this a good thing, or bad? I've always heard the rumor that NASA computer systems intended for space are grossly out of date. However, this is due to certifying that the processor, etc. can work, and that process takes a long time. Just curious if anyone knows why NASA doesn't use more modern processors; is it weight?
It's not weight, it's because they're proven and reliable. You don't want to risk the success of the mission on unproven hardware.
For secondary purposes they have no problem running more modern silicon. E.g. the Ingenuity helicopter, as a last-minute addition (and secondary mission) uses much newer hardware and software: https://spectrum.ieee.org/automaton/aerospace/robotic-explor...
> There are some avionics components that are very tough and radiation resistant, but much of the technology is commercial grade. The processor board that we used, for instance, is a Snapdragon 801, which is manufactured by Qualcomm. It’s essentially a cell phone class processor, and the board is very small. But ironically, because it’s relatively modern technology, it’s vastly more powerful than the processors that are flying on the rover.
Neither. The proper question to ask: "is it the right tool for the job?" The answer here is yes. Power is a vetted architecture and has been used in critical processes for quite some time.
Just because something is old does not mean it is bad or some sort of burden.
Space qualified processors are very expensive to develop and produce. They are optimized for use in space and manufactured on an entirely different process than commercial chips. All of this is to ensure reliability in the harsh environment of space. Due to these development and fabrication costs, it's worth it to keep using the same processor as long as it is capable of doing the needed computations. The current RAD750[0] has been used since 2005 and will likely continue to be used for some time.
The Ingenuity drone also on Mars now uses a Snapdragon processor running Linux. This is not radiation hardened IIRC, but it reboots so fast that, if the computer borks due to radiation flipping bits in flight, it is back in control before it hits the ground.
Would love to see the considerations that went into that design!
Does it try to keep state across such a reboot?
Is the first course of action to try to deduce whether it's about to crash?
Is every sensor reading done multiple times to reduce the risk of reading it wrong (typically, such sensors are basic and don't employ error checks or the like)?
Does it try to understand why the OS crashed and do any graceful degradation, eg "fine, don't use that sensor then"?
You can filter the sensor data to get rid of any "sparkles". You can also add in power controls for the sensor so that you can reboot them when they either stop responding or send too much erroneous data.
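A cheap way to do that filtering (a sketch only, certainly not Ingenuity's actual code) is a small median filter over the last few readings: a single radiation "sparkle" can shift which legitimate sample is returned, but the outlier itself never reaches the control loop.

    #include <stdint.h>
    #include <string.h>

    #define WIN 5   /* odd window size: median of the last 5 samples */

    /* Returns the median of the most recent WIN samples.
     * (Startup handling omitted: the history starts zero-filled.) */
    int32_t median_filter(int32_t new_sample)
    {
        static int32_t history[WIN];
        static int n = 0;

        history[n % WIN] = new_sample;
        n++;

        int32_t sorted[WIN];
        memcpy(sorted, history, sizeof(sorted));

        /* insertion sort: fine for a 5-element window */
        for (int i = 1; i < WIN; i++) {
            int32_t key = sorted[i];
            int j = i - 1;
            while (j >= 0 && sorted[j] > key) {
                sorted[j + 1] = sorted[j];
                j--;
            }
            sorted[j + 1] = key;
        }
        return sorted[WIN / 2];
    }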
This assumes a bit flip would result in an immediate reboot, which is not even that likely, I think. Detecting these failures probably requires redundant systems and luck, I guess.
You could simply assume that any significant deviation from the planned flight profile is due to bit flips and reboot.
That might miss bit flips that don't affect the flight profile, but those do not need a real time response. Bit flips in collected data for example could probably be found afterwards when the data is analyzed.
Heck, it might even make sense to figure out what the minimum time, T, is such that something can go wrong that a reboot will fix fast enough to save the mission but only if the reboot happens within T seconds of the onset of the problem. Then just reboot every T seconds regardless of whether you know of something actually wrong.
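That "reboot every T seconds regardless" policy is easy to sketch with an ordinary hardware watchdog: just stop servicing it once the interval is up. Toy code, with made-up hook names (watchdog_kick, millis) standing in for whatever the platform provides:

    #include <stdint.h>

    /* Hypothetical hooks: a hardware watchdog that resets the system if
     * not kicked within its timeout, and a millisecond tick counter. */
    void     watchdog_kick(void);
    uint32_t millis(void);

    #define T_REBOOT_MS  30000u  /* chosen so a post-fault reboot still comes in time */

    void control_loop(void)
    {
        uint32_t boot_time = millis();

        for (;;) {
            /* ... read sensors, run control, command actuators ... */

            if (millis() - boot_time < T_REBOOT_MS)
                watchdog_kick();
            /* else: deliberately stop kicking; the hardware watchdog
             * forces a clean reboot within one watchdog period,
             * independent of whatever state a bit flip left behind. */
        }
    }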
The older stuff uses physically bigger transistors, which are inherently more robust against a cosmic ray causing a bit flip, and thus it's easier to make radiation-hardened versions.
> Is this a good thing, or bad? I've always heard the rumor that NASA computer systems intended for space are grossly out of date. However, this is due to certifying that the processor, etc. can work, and that process takes a long time. Just curious if anyone knows why NASA doesn't use more modern processors; is it weight?
Think of it this way: technology is not a straight-line tech tree from "less advanced" to "more advanced." It's more about tradeoffs.
So these processors might have less processing speed compared to modern consumer ones, but they're more reliable and resistant to radiation (and that's partially a consequence of the reduced speed).
They need to use radiation-hardened processors. That means the chip will be in special packaging (I think ceramic), use bigger process geometries, and use redundancy in the logic design within the chip. Also, running at lower clock rates will reduce the chance of failure. These are not mainstream chips, so they will be expensive and slow to develop. And lastly, certification takes a lot of time, and there are risks to the mission in switching from a proven technology to something new. A single glitch can end a mission that took a decade to accomplish.
It's not a good or bad thing. It's a process. Remote instruments (which is essentially what these missions are) don't need all that much compute to turn and face a target and acquire an image. It'd be nice to have more, but it's not strictly necessary.
Also, spacecraft are little distributed systems. There are _lots more_ processors on these things. The instruments and mechanisms all have their own processors, from what I can see ... which is limited.
Don't think too hard about how Percy is self-driving on its limited compute; trying to make that work will make your head hurt.
It's the right tool for the job, so I would say it's a good thing. It's a radiation-hardened processor which has proven its worth in previous missions and is well known to the developers, while also providing enough computational resources to do what they need it to do.
I loved that processor [line]. When I first started learning Objective-C I was building a particle system playground with my friend. Then I learned about the vDSP / Accelerate framework, which harnessed the parallel capabilities of the RISC architecture, and suddenly I could make WAY cooler visuals. That’s a technique that has come back many times since for me, and in a way is what made me open-minded towards more parallel approaches to things on GPU, etc. It’s a different way of thinking about problems, and can be very interesting.
While weight would definitely be a problem, I wonder if they could use less hardened but more modern processors in a redundant setup, similar to how the Space Shuttle worked? If one or more disagreed due to bit flips, it would be voted out by the others. I also wonder what benefits they'd get out of the rover by having more advanced processing capability?
On 1-Aug-2017, Jinnah Hosein, who was the head of Software at SpaceX at the time, spoke to my company, Orbital Insight, in Mountain View, and I took some notes. I've never posted them anywhere, but below I'll post some unpolished bits.
WARNING: this was a lot longer and 'scattered' than I recall, apologies in advance.
DISCLAIMER: This was a casual talk and I casually took some notes, and it happened over three years ago as of this writing. In short, don't assume anything below is an accurate representation of what Jinnah said. As a long-time SpaceX fan, and as a much longer-time software engineer, I was super pumped during the whole talk, and was definitely not focused on accurate recording.
PS: I know most of this veers way off topic, but for some reason I decided to take this opportunity to share this material.
PPS: The comment was too long to post, so you'll need to click the above link. Sorry about that. If someone would like to distill the on-topic parts into another comment, that would be fine.
> There is no defined systems integrator role at SpaceX. Everyone is responsible for carrying their system all the way through. They resisted the idea of handoffs.
> They saw that at NASA. Because of the handoff, it caused them to push reliability ratings way higher than needed, which added complexity and customization.
Very interesting!
also
> it took Elon six weeks to go from "OK let's land on a boat" to actually having a boat. The software VP guy didn't think it 'was a 2014 problem...definitely 2015'...people who had been there longer knew Elon would be able to get a boat quick. Definitely a 2014 problem.
then this
> So then they'd fly balloons off of the boat so they'd get low level wind data, said data would go to the vehicle and be used for control for the last few km.
You're welcome! I'd forgotten most of this and so it was really fascinating reading over it again just now, especially in light of all that SpaceX has been doing over the past 3.5 years.
I have no damn clue. I did a quick Google of "DO189B/C" (and related..typos are definitely possible here) and nothing jumped out.
I typed this up 3.5 years ago, and didn't give a crap about Facebook one way or another at that time, so whatever was said there didn't stick with me at all.
Hopefully somebody else here can somewhat decode the mystery string (which might have typos) based on context.
Yes, a better question would be: what do Astrobotic, Blue Origin, Firefly Aerospace, Ceres Robotics, Draper, Masten, Sierra Nevada, Northrop Grumman, and Lockheed Martin use? Those providers will be on / near the Moon shortly. So will SpaceX, but for rovers, landers, stations, and orbiters you'll get a better answer. (Longer duration, more strict payload budgets, extreme thermal concerns.)