Location One drops into Maintenance mode after extended periods with RTCM corrections

mike_d · November 18, 2020, 4:15pm

We have noticed a recent issue with our copters and I was wondering if anybody else has seen it. We have copters with an RTK GPS (like a Here2+) as well as an mRo Location One (flashed with AP_Periph 1.1.0). If we leave a copter powered on and receiving RTCM corrections for long periods (around 120 minutes), we consistently find that the Location One disappears from ArduCopter. When we inspect the Location One with a Zubax Babel, we can see that it is in Maintenance mode. If we use the Babel to reboot it, it goes back into operational mode and works in ArduCopter again.

In our troubleshooting, we have tried multiple aircraft and multiple combinations of GPS units. It turns out that we can keep the problem from happening altogether if we set the GPS_INJECT_TO param to exclude the Location One.
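
For anyone wanting to try the same workaround, here is a minimal sketch of how the value could be chosen. It assumes the usual ArduPilot meaning of GPS_INJECT_TO (0 = first GPS, 1 = second GPS, 127 = all receivers). The helper name is hypothetical, and which instance the Location One occupies depends on your own setup.

```python
# Sketch only: choose a GPS_INJECT_TO value that skips one receiver.
# Assumption: 0 = first GPS, 1 = second GPS, 127 = send to all.
INJECT_ALL = 127


def inject_to_excluding(excluded_instance: int) -> int:
    """Return the GPS_INJECT_TO value that targets only the other GPS.

    excluded_instance is the 0-based index of the receiver that should
    NOT receive RTCM corrections (hypothetical example: 1 if the
    Location One happens to be the second GPS).
    """
    if excluded_instance not in (0, 1):
        raise ValueError("this sketch only handles two GPS instances")
    return 1 - excluded_instance


if __name__ == "__main__":
    # Location One assumed to be the second GPS (index 1).
    print("GPS_INJECT_TO =", inject_to_excluding(1))
```

With two receivers this simply selects the other instance; setting the parameter back to 127 restores injection to all receivers.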

We’re using Mission Planner to survey in and send corrections, so I’m not sure whether the cause is a bit of bad data that only shows up occasionally, the accumulation of data over a long period, a bug in AP_Periph, something specific to the Location One, or even Mission Planner.

tridge · November 19, 2020, 12:48am

@mike_d thanks for reporting this! I’ll try and reproduce this locally and see if I can work out why it is happening

DroneWrangler · November 23, 2020, 11:47pm

Hi @tridge,

I have a bit more info about the problem in some notes here: https://docs.google.com/document/d/1LEJ7MNJeqaVn2XXcKeOk2Ze5EPh5-BLndPzJ1QcV9T4/edit?usp=sharing

Those notes are a bit messy but there is a summary at the end with the useful information. I want to add that we have observed this on two different aircraft. In each case we mitigated the problem by using the GPS_INJECT_TO param to prevent the corrections from being sent to that unit.

We have now observed a similar problem on a ZedF9 that is connected to the flight controller through an mRo KitCAN M10025B. The GPS position, compass data, and baro data all stopped being sent. It looks like the node crashed, but we are not yet sure. I will test more with this node tomorrow to see if it is the same issue and if we can narrow it down any further.

pkocmoud · November 26, 2020, 12:49am

@DroneWrangler Can you remove the restrictions on your notes?

tridge · November 26, 2020, 12:49am

I’ve requested access to this doc

mike_d · November 26, 2020, 1:01am

@tridge @pkocmoud fixed the access control on @DroneWrangler’s document for you

tridge · November 26, 2020, 1:11am

thanks, that is very helpful.
There are actually 2 issues we need to track down:

1. why does the node reboot into maintenance mode?
2. why doesn’t it immediately come out of maintenance mode and reset itself?

The AP_Periph design is supposed to be robust against these types of failures. If it ever does reset and it had been running for more than 30s before the reset then it should bypass the bootloader wait and go straight to operating again. I need to work out why that isn’t happening. The 30s check is to handle the case of a bad fw load, and you want to wait in the bootloader to get a fixed fw loaded.
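
The reset policy described above can be sketched roughly as follows. This is an illustration of the stated rule only, not the actual AP_Periph or bootloader code; the function and constant names are hypothetical, and only the 30s threshold comes from the post.

```python
# Illustration of the described policy, not the real AP_Periph logic.
MIN_HEALTHY_UPTIME_S = 30.0  # threshold quoted in the post


def bypass_bootloader_wait(uptime_before_reset_s: float) -> bool:
    """Return True if the node should skip the bootloader wait after a reset.

    A node that ran for more than 30s before resetting is assumed to have
    good firmware, so it should go straight back to operating. A node that
    reset sooner may have a bad firmware load, so it should stay in the
    bootloader long enough for fixed firmware to be uploaded.
    """
    return uptime_before_reset_s > MIN_HEALTHY_UPTIME_S


if __name__ == "__main__":
    for uptime in (5.0, 30.0, 7200.0):
        action = "skip wait" if bypass_bootloader_wait(uptime) else "wait in bootloader"
        print(f"uptime before reset {uptime:>7.1f}s -> {action}")
```

Under this rule, a node that had been up for around two hours before resetting would be expected to skip the wait, which is why staying in maintenance mode is a separate problem from the reset itself.
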
The good news is your debug images in that doc tell me the exact firmware version the GPS is running (it is c2dce806). That means I can set up a node here to try and reproduce the issue.
I’ve now built and loaded the exact firmware you were running in those debug logs, and I’ll see if I can reproduce. I suspect the bug is one I’ve already fixed in the UART driver, but I won’t know for sure unless I can reproduce and then show it doesn’t happen with a newer fw.
I’m running the test now. Hopefully my node will lock up in the next couple of hours.
Cheers, Tridge

mike_d · November 26, 2020, 1:25am

Thanks so much for investigating! Let me know if there’s anything else we can help you with.

tridge · November 26, 2020, 1:29am

well, if you could test this fw that would be very helpful:
https://firmware.ardupilot.org/AP_Periph/latest/f303-M10070/
there is a good chance the bug is already fixed.
I’ve set up my test to send the RTCMv3 data at 6x the normal rate in the hope I can reproduce the issue more quickly, but if you could test the latest fw from the above link in parallel that would be great.

DroneWrangler · November 26, 2020, 1:58am

The problem certainly is difficult to replicate. Thank you for looking into it. And sure, I will test the new firmware ASAP. Unfortunately I have already broken down the RTK base and test rigs for the day here. I should have results Friday or Monday.

The second problem that we observed with the zedF9 and the KitCAN module often occurred within 30 seconds of sending RTCM messages. If you have that hardware available, it may be worth attempting that setup. However, we have not yet inspected the UAVCAN messages during that failure, so it may not be the same issue.

Sending data faster seems like it might help. We were using a Here+ as the base for the first configuration and an F9P for the second configuration (which sends significantly more RTCM data).

tridge · November 26, 2020, 2:24am

I’m using an NTRIP server for RTCMv3. I’ve been running it for 90 mins so far with no sign of the issue.

tridge · November 28, 2020, 4:53am

I have found the bug and fixed it here:

I’ve put a fixed firmware here to test:
http://uav.tridgell.net/M10070/
load the AP_Periph.bin with MissionPlanner CAN UI or the UAVCAN GUI tool.
Cheers, Tridge

DroneWrangler · November 29, 2020, 3:43am

That looks like an insidious bug. Thank you for tracking it down! I will test the fix ASAP on Monday. I’ll do a bit more digging on the zedF9 problem as well to see if it is the same. Thank you again

tridge · December 1, 2020, 10:45pm

This fix has now been merged into master. You should now use this firmware:
https://firmware.ardupilot.org/AP_Periph/latest/f303-M10070/
I will be doing a new stable release of AP_Periph soon, once we’ve done some more testing. Test reports welcome!

DroneWrangler · December 1, 2020, 11:47pm

@tridge We did some more testing with the zedF9 running through the mRo KitCAN adapter. With only some RTCM corrections being sent, the unit ran fine for a while. I then used u-center to add the messages for more constellations. Less than two minutes later the unit got stuck in maintenance mode. Of course this was right after the dev call.

The CAN node was running the latest f303-M10070 build from the build server. I put the commit version below.
Branch: commit 54bae68e02d4db76406869e55f3ecc494724341c

Let me know what I can do to gather more info and to help.

RTCM Messages Sent Before Crash
1005
1077
1087
1097
1127
1230

RTCM Messages Added Right Before Crash
1074
1084
1124
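
For readers who do not have the RTCM 3 message numbers memorised, here is a small sketch that labels the two lists above. The MSM numbering (107x GPS, 108x GLONASS, 109x Galileo, 110x SBAS, 111x QZSS, 112x BeiDou, with the last digit giving the MSM type) follows the RTCM 3 standard; the function and table names are just for illustration.

```python
# Label RTCM 3 message numbers; covers only the types seen in this thread
# plus the general MSM numbering scheme.
MSM_CONSTELLATIONS = {
    107: "GPS",
    108: "GLONASS",
    109: "Galileo",
    110: "SBAS",
    111: "QZSS",
    112: "BeiDou",
}
OTHER_MESSAGES = {
    1005: "Stationary RTK reference station ARP",
    1230: "GLONASS code-phase biases",
}


def describe_rtcm(msg: int) -> str:
    """Return a short description for an RTCM 3 message number."""
    if msg in OTHER_MESSAGES:
        return OTHER_MESSAGES[msg]
    prefix, msm_type = divmod(msg, 10)
    if prefix in MSM_CONSTELLATIONS and 1 <= msm_type <= 7:
        return f"{MSM_CONSTELLATIONS[prefix]} MSM{msm_type}"
    return "not covered by this sketch"


if __name__ == "__main__":
    sent_before = [1005, 1077, 1087, 1097, 1127, 1230]
    added_before_crash = [1074, 1084, 1124]
    for label, msgs in (("before", sent_before), ("added", added_before_crash)):
        for msg in msgs:
            print(f"{label:>6}  {msg}  {describe_rtcm(msg)}")
```

By this numbering, the three added messages are the MSM4 variants for GPS, GLONASS and BeiDou, on top of the MSM7 messages that were already being sent.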

tridge · December 2, 2020, 8:08pm

thanks. Are you on discord? I’d like to get your help in reproducing this. Screen share on discord would be good. See ArduPilot if you are not familiar.
I’d also like to see the debug output from the node when this happens. Do the following:

1. install latest master on the flight controller
2. set CAN_LOGLEVEL=1 on the flight controller
3. on the KitCAN adapter, set the DEBUG parameter to 1

then inject RTCM to reproduce the issue. We should end up with some messages in the log on the flight controller (and in the messages tab in MissionPlanner) giving information about stack usage on the CAN node.

DroneWrangler · December 2, 2020, 9:21pm

Sure. I’ll send you my discord handle. I am going to get everything set up here (RTK base, master flashed). Should be ready within one hour if that works for you. If not, I’ll gladly schedule to talk at another time.

tridge · December 2, 2020, 9:22pm

sounds good, I’m around all day. Send me a msg on discord