Airbus A320 sun effects

Hi,
With the latest news about an A320 bug, reports say the sun is causing bit flips, and those bit flips cause the control surfaces to move when they should not.

Could this be affecting our FW laptops?
Particularly, could it be causing the “sync flood” FTR and FTH problems?

It would be helpful to see what they changed in the software to fix it, and why some need hardware changes.

We could then do the same to the FW laptop.

It could explain why the FTH/FTR problems happen fairly randomly.

My guess is the FW gets a bit flip, but as ECC is not supported end to end across the bus, that bit flip goes undetected. We therefore need hardware that can detect and recover from bit flips on all data paths.

There are multiple possible causes of bit flips. Some are listed here:

Typically ECC RAM is installed in servers, where it is crucial for services to run correctly. ECC RAM tends to be slower and needs a CPU capable of handling it. It almost never appears in consumer hardware, as nobody really cares if one pixel of a Netflix show is different for a split second.

When a bit-flip happens, it can silently affect all sorts of things. It is not just one pixel of a Netflix show.
The main problem is that the bit-flip happens silently, so it can go unnoticed.
For example, you can have a JPG picture stored on a disk, and a bit can silently flip. The CRC checks for that sector on the disk will still pass, so the disk does not detect any errors. I had a BTRFS file system, which also puts a checksum on the entire file. Only the btrfs checksum of the whole file found the bit-flip and told me my JPG picture was corrupted.
Nothing warned me while saving the file; the corruption was only detected when I tried to load the JPG picture again later.
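The idea behind that file-level checksum can be sketched in a few lines of Python. This is only an illustration of the principle using `zlib.crc32` (btrfs itself uses stronger checksums such as crc32c); the `store`/`load` names are made up for the example:

```python
import zlib

def store(data: bytes) -> tuple[bytes, int]:
    # On write, record a checksum of the whole file alongside the data.
    return data, zlib.crc32(data)

def load(data: bytes, checksum: int) -> bytes:
    # On read, recompute and compare; a silent bit flip changes the CRC.
    if zlib.crc32(data) != checksum:
        raise IOError("checksum mismatch: file is corrupted")
    return data

data, crc = store(b"JPEG picture bytes")
corrupted = bytes([data[0] ^ 0x01]) + data[1:]   # simulate one flipped bit
load(data, crc)          # passes
# load(corrupted, crc)   # would raise IOError
```

The disk's per-sector CRC cannot catch this case because the flip happened before the data reached the disk; only a checksum computed over the file contents at write time can.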
If the bit-flip happens to be a critical piece of data, it can result in application crashes, unexpected reboots, all sorts of odd things.
Whereas, on a system that has full-path ECC checking, the bit might still flip, but you will be told immediately that it happened. Sometimes ECC will auto-correct it; even when it cannot, you are told immediately and can recover, e.g. restore from backup, redo the calculation, etc.
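The single-bit auto-correction that ECC performs can be sketched with a classic Hamming(7,4) code. This is an illustration of the principle only, not what any particular DRAM controller implements; the function names are made up for the example:

```python
def hamming74_encode(d):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Bit positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Locate and fix a single flipped bit; return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3  # 0 = no error, else 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                              # a stray particle flips one bit
assert hamming74_correct(word) == [1, 0, 1, 1]
```

The syndrome bits pinpoint which of the seven positions flipped, which is why the hardware can both detect and repair a single-bit error on the fly; real server ECC uses wider SECDED codes on 64-bit words, but the mechanism is the same.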
Many people complain that, now that one can put 128GB of RAM in a laptop, silent bit-flips become much more likely, so AMD and Intel should really provide full-path ECC on both servers and laptops. They don’t, for reasons unknown, and everybody suffers.

1 Like

I really don’t think this will be a reality soon, even if you look at something critical such as financial applications. How much of that 128GB is really that critical? The banking app you’re running might have something like 100KB of critical portions; the rest is UI, models (I mean, what else do you do with 128GB?), etc.

The chance that, out of 128GB, exactly that one critical part gets hit is so small that I doubt it’s worthwhile to make ECC a standard across all computers. Compare this to servers, which only run mission-critical software…

If something is truly a critical portion, I’d say handle that one part (business logic / financial software etc.) using software checks.

As for your issue with your JPEG, bit flips can happen on SSDs too, so you never really know if an end-to-end RAM check would have saved your picture…

Edit: Seems I’m wrong to a certain degree, because DDR5 has some checking built in.

1 Like

I buy ECC for my computers where I can because I don’t want data corruption; everyone has different thresholds for what counts as ‘mission critical’. Stuff like ZFS and BTRFS use checksumming to protect data too, and that helps, as does the DDR5 error-correction you cite. That DDR5 error correction is to protect data as it moves between the chips and the CPU, not at rest in the DRAM cells.

We don’t agree about the value of this, and I want both the option to buy better levels of system integrity and the economy of scale for these ways of retaining data integrity, so that they become the norm.

1 Like

That DDR5 error correction is to protect data as it moves between the chips and the CPU, not at rest in the DRAM cells.

Isn’t it the other way around? On-die ECC basically means the memory is checked as it is written (a checksum or parity is written alongside the data) and read (the data is compared to the checksum/parity), but the transport to the CPU/GPU is something “real” ECC will handle.

1 Like

It could well affect any laptop or desktop.

Back in the early 1980s a computer product range that I helped service used first generation 16k x1 triple voltage dynamic RAM chips, which were mounted on modules giving 32k x9, so it had parity checking. Periodically I would get a module come back to the workshop with a ‘permanent’ parity error. If the module was powered up each day the parity error would stay, always the same single bit in one chip. If the module was left on the shelf unpowered for a week the error would go away by itself, and one could test the module all you wanted without the error recurring.

The theory we had was that an alpha, beta or gamma particle (I can never remember which) would hit the memory cell and leave a charge on the floating insulated gate that formed part of the capacitor storing the state of the memory cell. Each cell required a change of around 16,000 electrons between the 0 state and the 1 state. This was also about the charge deposited by an atomic particle strike, which would leave a charge on the insulated gate, permanently changing the voltage threshold of the cell while the chip was powered, so that it didn’t matter whether you wrote a 1 or a 0; only one state was detected. If the chip was left unpowered for about a week the embedded charge would slowly leak away and the chip would operate correctly again.

There is a reason they don’t fit ECC to consumer items. The likelihood of a similar thing happening today is vastly reduced because chip geometries are so much smaller. Once you get under a certain feature size (IIRC it is 2um) chips become radiation hardened by design. I can speak with some authority on this, as I worked on instruments for spacecraft for the last 25 years before I retired, and this was one of the critical ways of determining from the outset whether a chip was radhard.

I have been wondering that as well. The hardware changes suggest to me that some boards have chips on them that are not radiation hardened. These could be analogue or digital chips. To radiation-harden them, one trick we used was to glue a piece of tantalum sheet to the top of the chip and another piece on the bottom of the PCB under the chip. Tantalum is one of the densest metals around and is often used for this purpose. We would do this when we had to use a commercial chip with no radiation-testing provenance, and also on the windows of EPROMs for the software.

1 Like

I think you have this backwards. The DDR5 Wikipedia page has “on-die error-correction code is not the same as true ECC memory with extra chips for correction data on the memory module.”

I said what I did because DDR5 has signal training and anticipates imperfect data-transit integrity to reach its high bandwidth numbers, but it’s up to the chips to assert that data hasn’t been altered (say by cosmic rays or rowhammer) at rest in the silicon. Only ‘Unbuffered ECC’ and ‘Registered ECC’ DDR5 DRAM have the extra chips to track checksums for at-rest data.

The Airbus A320 story is getting a lot of traction for good reason. The underlying issue is signal integrity across a path. Solar radiation penetrates the cables and causes a signal to be shifted, or to arrive with a value that was not originally sent.

The key takeaway is that the fix was entirely a software (firmware, really) fix for the communication and control systems. It would be interesting to see more detailed information on what steps they took to improve signal integrity during solar radiation events.

Do not forget, too, that all airplanes have multiple paths for critical signals in case one of them is corrupted or cut off. This is likely an issue that engineers and designers had identified before, though something important enough happened to make them deploy this update rapidly en masse. Some of this is down to the egg on Boeing’s face in the last few years, and to quelling the fear that air travel is unsafe. Air travel is a multi-trillion-dollar industry worldwide; there are whole countries that depend on the continued success of airlines for business, tourism, and even infrastructure.

There are other means of achieving signal integrity besides hardware ECC; I would be willing to bet that some of the critical subsystems on airplanes were already using ECC and this was still an issue. The balance is making sure that ensuring signal integrity/data purity does not consume all the resources the system needs to operate reliably.

I would like to see details also.
Some news stories are saying the fix is just to downgrade to a previous software version; some aircraft need new hardware.
As an aside, satellites sometimes need to be “hardened” to protect them from cosmic rays, but that is to protect them not only from bit flips but also from latching, where the hardware logic gets stuck.
I cannot see how software can fix hardware latching.
So, to me, the A320 cause being solar rays does not match up with a software-only fix.
In general, software can protect against bit flips on sensor readings sent over a network: one just adds CRC checksums or similar to each message.
The problem then becomes the CPU/RAM of the device calculating/checking the checksum. If a bit flips there, one has a problem. For that, one needs RAM ECC, plus the calculations duplicated along duplicate paths and the results compared, in case one bit flipped.
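A minimal sketch of the message-checksum idea in Python, using `zlib.crc32` appended to each frame (the framing format and the altitude payload are made up for illustration; avionics buses use their own formats):

```python
import struct
import zlib

def frame(payload: bytes) -> bytes:
    # Append a CRC32 so the receiver can detect bit flips in transit.
    return payload + struct.pack(">I", zlib.crc32(payload))

def unframe(msg: bytes) -> bytes:
    # Recompute the CRC on receipt and reject corrupted messages.
    payload, (crc,) = msg[:-4], struct.unpack(">I", msg[-4:])
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch: message corrupted in transit")
    return payload

msg = frame(b"altitude=35000")
bad = msg[:3] + bytes([msg[3] ^ 0x80]) + msg[4:]   # one bit flipped in flight
```

This catches flips on the wire, but, as noted above, it does nothing for a flip inside the RAM of the machine computing or checking the CRC; that gap is exactly what end-to-end ECC is meant to close.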
As per:

The reason some planes need the hardware changed is that the firmware/software is not field upgradable/downloadable on those units, so the whole unit needs replacing to downgrade the firmware from L104 to L103+

So the fix is definitely a software downgrade rather than a bug fix. Maybe a bug was introduced in L104 and they are now rolling it back.

Some explanation as to why ECC is useful on a laptop or desktop.

I know, I know, it’s too many Linuses in one video…

1 Like

Nah, nothing to do with signals in cables, it is all in the electronics.

See my post two before yours for my reasons for speaking with authority on this.

All satellites need hardened electronics. Anything in Earth orbit will have the highest radiation-rated chips; for planetary missions it is possible to use lower-rated chips. I would expect any electronics on aircraft, especially anything flying above 20,000 feet, to have a radiation rating, although I doubt it will be at the rating required for satellites.

Latching cannot be protected against by a software fix. The only way to fix it is to turn the unit off, wait a few seconds, then turn it on again. Latchup causes a short circuit across the power rails; this is guarded against by having appropriate current-limit circuits so that a latchup doesn’t short out the battery.

This doesn’t strike me as a bit flip on a sensor message. There is enough redundancy in sensors to mitigate against such problems.

I suspect that the software fix involves storing data in multiple places so that there is redundancy in storage, and may also involve processing control-system signals in parallel, with the result of each thread then going into a ‘voting system’ to make sure they all come up with the same result. There are various ways of doing this, some purely software, some involving hardware.
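The voting idea is classic triple modular redundancy. A toy sketch in Python (the control-law function and its numbers are entirely hypothetical; real avionics voters run on independent hardware lanes, not three calls on one CPU):

```python
from collections import Counter

def vote(results):
    """Accept the majority result across redundant lanes; flag total disagreement."""
    winner, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all lanes disagree")
    return winner

def compute_elevator_command(sensor):
    # Hypothetical control-law calculation, run once per redundant lane.
    return sensor * 0.5 + 3.0

lanes = [compute_elevator_command(10.0) for _ in range(3)]
lanes[1] += 1.0                  # simulate a bit flip corrupting one lane
assert vote(lanes) == 8.0        # the two uncorrupted lanes outvote it
```

A single corrupted lane is outvoted and the system carries on; this is why a software-level redundancy change could plausibly mask radiation-induced flips without any new hardware.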

This assumes the reports are correct that the software is being changed to a previous version. It doesn’t need to be a bug introduced, it could be a different scenario.

I suspect that the software was modified by a new team of programmers who didn’t have the background experience of the previous team, and who possibly removed some code that appeared to be redundant but was there to deal with this exact situation. I suspect that the electronics needing replacement is possibly new production that has had redundancy items removed for the same reason: the team did a hardware update without the background of previous teams and eliminated some necessary components that the software needs.