Bad Instruction running SITL

timtuxworth · May 27, 2022, 4:30pm

I just updated to the latest master and I’m getting “Bad Instruction” when running SITL with FlightAxis. I don’t get the error if I don’t connect to flight axis. Is this a known problem?

This is Plane, latest master as of 2022 May 27th on my Mac M1 (ARM), the model is the AeroScout running on RealFlight 9.5S on another machine SITL_Models/RealFlight/Tridge/Planes/Aeroscout at 5fba09e8a50c424ead78295cbf2cd06405bee8d7 · ArduPilot/SITL_Models · GitHub

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x1e390008)
    frame #0: 0x00000001001fa15c arduplane`AP_ESC_Telem_SITL::update(this=0x0000600000010010) at AP_ESC_Telem_SITL.cpp:45:23
   42  	
   43  	#if HAL_WITH_ESC_TELEM
   44  	    for (uint8_t i = 0; i < sitl->state.num_motors; i++) {
-> 45  	        update_rpm(i, sitl->state.rpm[sitl->state.vtol_motor_start+i]);
   46  	    }
   47  	#endif
   48

tridge · May 27, 2022, 9:29pm

please share the model you are using and the parameters, and any other info on how to reproduce

timtuxworth · May 27, 2022, 10:50pm

Hi @tridge , I did some digging and found the problem. the update_rpm() function was expecting a uint16_t parameter for rpm whereas sitl->state.rpm[] contains floats.

It seems the bug is that it is using an uint16_t parameter as the rpm value which is eventually stored in an RpmData structure also contains floats. (so its converting from float to uint16_t then back to float). This is likely truncating any fractional value stored in the RPM values on the way through and discarding (or “undefined behavior”?) negative values. I’m assuming fractional values must be important otherwise why store them as floats on the Backend?

I have fixed it (compiles, runs, doesn’t crash) by changing the update_rpm() functions in AP_ESC_Telem.cpp and AP_ESC_Telem and AP_ESC_Telem_Backend to pass RPM as floats (should it be AP_Float?), it also works if RPM is passed as int32_t.

I’m also thinking an alternative fix is to cast the float to a uint16_t before passing it in, but this just masks the issues mentioned above (tried this - didn’t work).

timtuxworth · May 28, 2022, 4:46am

I think the problem is happening due to the difference between the size of a uint16_t (2 bytes) and a float (4 bytes) on the M1.
Nope - after further testing, it’s because SITL is passing an RPM of -1 and the parameter is an unsigned int.

timtuxworth · May 29, 2022, 10:43pm

This problem surfaces something bigger. There is generally inconsistent use of rpm in the code base. It seem most places it is calculated and stored, an integer value is used, but the values are stored as floats and then most places that use the values are expecting floats. I think (I can’t find where) that telemetry is outputting rpm as float values although I’m pretty sure they will only ever contain whole numbers.

The fix I have made is only a bandaid on what looks like bigger problem. @peterbarker asked “Why is it different on MacOSX” - I would say the answer is - pure luck. This looks to me like a classic example of how easy it is to have C/C++ code that is fundamentally flawed, but compiles and runs apparently fine - until it’s not.

Unless I’m missing something, I think the way it’s working now, values < 0 (e.g. marker values like -1 or 2) or > 65,535 and any fractional RPM are going to be discarded.

I’m thinking there needs to be a call made - should RPM be calculated and stored as float or integer value? I don’t know enough to say whether fractional RPM is important, but I’m thinking that if no-one has complained up to now that maybe float is not required?

timtuxworth · May 31, 2022, 7:02pm

I’ve created a PR for what I’m thinking is the most unobtrusive way to fix this problem AP_ESC_Telem:Fix SITL Crash on M1 Mac caused by floating point to uin… by timtuxworth · Pull Request #20876 · ArduPilot/ardupilot · GitHub

timtuxworth · May 31, 2022, 9:26pm

@andyp1per has also created a PR for a slightly different solution. This is a good solution too.

timtuxworth · May 31, 2022, 10:01pm

As far as I can see both @andyp1per s and my PR would fix the specific issue. I can’t see that one is better than the other.

What neither of them do is address the bigger issue that RPM is managed inconsistently across the code base, sometimes as a float, sometimes as an integer or unsigned integer and the numbers get passed around with often undefined behavior as to what the result should be, and perhaps confusing results, for example RPM passed as floats in mavlink or logging will likely never contain anything but whole numbers.

overflowing an unsigned integer in a floating-to-integer conversion is undefined behavior:
Implicit conversions - cppreference.com

andyp1per · June 1, 2022, 6:45am

Note this is only one-way - integer to float is fine - which I believe is what my change does

timtuxworth · June 1, 2022, 1:44pm

Agreed on both counts.
So what’s left is - why calculate rpm as an integer and then store it as a float? I’m thinking your change would imply that it would make sense for code like the example in AP_BLHeli to be refactored to calculate RPM as a float. (not now I guess, but at some point),
I think that would make sense.

andyp1per · June 1, 2022, 2:41pm

The data comes directly from the ESC in integer form

timtuxworth · June 1, 2022, 3:38pm

Maybe it’s just the 16 bits that is the problem.
uint16_t new_rpm = ((buf[7]<<8) | buf[8]) * 200 / motor_poles;
This calculation will drop any values > 65535.
e.g. lets say the value coming from the ESC is integer values but[7]=16 and buff[8]=10 and motor poles = 8. This should calculate to 102650, but because it’s being stored in an unsigned 16 bit integer, I’m not sure what the result will be. If it was cast to a float, or calculated as a regular integer the correct value of 102,650 would be passed down to the backend.
It’s only going to be a problem for very fast motors, but they do exist, for example.

which (am I right?) seem like they would have top speeds of 80,000 rpm or so on 3S

I’m suspicious of the / motor_poles, because if it’s guaranteed to always be an even number, then the rpm can never be fractional by this calculation. Is it ever possible to have an fractional rpm value? If not then this raises the question - if RPM can never be fractional, why are we incurring the overhead of using floats to pass and store it up and down the stack?