Copter EKF altitude estimation error due to heavy uart communication load

guglie · May 7, 2019, 1:19pm

I confirm that I can see this “altitude estimation climb on arming” even in telemetry without enabling logs.

This begins before props start spinning.

James_Megariotis · May 7, 2019, 1:32pm

Yes, I have tried the stable firmware version. The problem is present in the current stable release as well.

There seems to be some delay after ekf height reset and this in turn leads to a vertical velocity error. If I add additional delay in arming sequence, the error gets much larger. This problem was not present in the 3.5 firmware branch, since I had some short delays in the arming sequence on that branch.

It definitely seems to be related to dataflash logging, since arming with the log-while-disarmed feature enabled leads to much smaller errors. Serial overhead also seems to affect the magnitude of the error.

Also if you download logs via mavlink on current stable release while running ekf2, the ekf goes bad very quickly. If running ekf3, the ekf goes into an unrecoverable state with vertical pos variance errors and compass variance issues. While running ekf3, the pixhawk must be rebooted after downloading logs as the ekf cannot recover from whatever errors are introduced during log download. This also never used to happen on the 3.5 arducopter branch.

Something strange is going on in 3.6 and beyond leading to really bizarre ekf behavior.

I may try the master branch and see if this same problem exists.

Also just fyi, it is possible to minimize these errors using certain ekf params. Decreasing ALT_M_NSE seems to reduce the error. Also if using a rangefinder, RFND_USE_HGT set to a nonzero value has a similar effect since the default RFND_M_NSE param is smaller than the default ALT_M_NSE param. Tinkering with these params can help mitigate the errors by decreasing the magnitude but does not eliminate them.

James_Megariotis · May 8, 2019, 2:18pm

Just to update, I posted an issue on the ardupilot repository related to the issues I described in this forum post.

rmackay9 · May 8, 2019, 11:33pm

Nice analysis and report.

I guess that the dataflash logging is causing some loops to run very long, which will upset the EKF. We report this in the dataflash log’s PM message.

When we start logging we dump a whole ton of information (like the parameter values and FMT messages) so this would explain why enabling logging while disarmed would reduce the problem.

I don’t know how to resolve the problem though, at the risk of passing the buck, @peterbarker is our logging maintainer.

Perhaps we could gather more information on the loop timing…

James_Megariotis · May 9, 2019, 5:21am

Yes, I inspected the PM message, and the loop time was very long as logging starts. I think I saw values of about 60,000 on log start, when the average loop time was about 2,600. I am assuming these values are in microseconds. That would mean loops are running roughly 23x longer than they should.

The thing is, that would indicate that something is blocking the loop, but I did a test where I introduced a delay using hal.scheduler->delay_microseconds(n_us) during the arming sequence. Increasing the delay increased the magnitude of the error. I was under the impression that scheduling a delay using the scheduler would not block the main loop. But I could be misinterpreting the results of that test or the behavior of that delay function.

I will try to gather more info tomorrow on loop timing and post it if I find anything useful.
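
For gathering loop-timing data like the PM values discussed above, a small monitor along these lines could be dropped into a test build. This is only a sketch: LoopTimeStats and all of its members are hypothetical names, not part of ArduPilot, and the "2x nominal" threshold for a long loop is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical loop-timing monitor (not an ArduPilot class).
// Feed it one measured loop duration per main-loop iteration.
class LoopTimeStats {
public:
    explicit LoopTimeStats(uint32_t nominal_us) : _nominal_us(nominal_us) {
        assert(nominal_us > 0);
    }

    // Record the duration of one loop iteration in microseconds.
    void record(uint32_t loop_us) {
        ++_count;
        _sum_us += loop_us;
        if (loop_us > _max_us) {
            _max_us = loop_us;
        }
        // Assumption: anything over twice the nominal period counts as "long".
        if (loop_us > 2 * _nominal_us) {
            ++_long_loops;
        }
    }

    uint32_t max_us() const { return _max_us; }
    uint32_t long_loops() const { return _long_loops; }

    // Mean loop time over all recorded samples (0 if nothing recorded).
    uint32_t average_us() const {
        return _count ? static_cast<uint32_t>(_sum_us / _count) : 0;
    }

    // How many times longer than nominal the worst loop was.
    double worst_ratio() const {
        return static_cast<double>(_max_us) / _nominal_us;
    }

private:
    uint32_t _nominal_us;
    uint32_t _count = 0;
    uint32_t _long_loops = 0;
    uint32_t _max_us = 0;
    uint64_t _sum_us = 0;
};
```

Recording a handful of normal loops plus one 60,000 us loop makes the outlier stand out in max_us() and worst_ratio() even though average_us() moves far less, which is why the maximum is the more useful figure here.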

rmackay9 · May 9, 2019, 5:23am

@James_Megariotis, ok, adding a delay in the arming function would block the main loop… so that fits with what you’re seeing.

James_Megariotis · May 9, 2019, 2:44pm

@rmackay9 That’s good to know. I was not aware that hal scheduler delays were a blocking delay. So if I use that function to delay code, it blocks updates to the ekf and main loop execution. Is there a non-blocking form of delay? I was under the impression that the scheduler delay would just schedule the current thread to continue after x microseconds, yielding to other threads in the meantime.

The only reason I am a little confused is that in my custom copter-3.5, I used that delay in the arming sequence and disarming sequence to allow me to generate a specific digital signal on an aux servo port to arm a relay that required an unusual signal to throw. I never ran into any similar vertical ekf errors when using that method on previous branches.

I have found a temporary workaround for the vertical ekf errors, where after arming the vehicle, I wait 20 seconds before taking off to allow the ekf errors to settle. I plan to investigate a little more to try and find a more graceful and less time consuming fix.

I also need to find a way to reintegrate the relay arming code into 3.6 without triggering ekf issues. If you have any ideas on a possible workaround, any suggestions or advice would be awesome. Thanks!

Edit: If you are interested, this is a plot of the PM.MaxT and DSF.Bytes. We discussed some of this yesterday. This is what I see in my log. It appears that a much larger amount of data is being written on arming and max loop time suffers as a result.

I plan to experiment with the log buffer size and mavlink-based dataflash logging to see if that will solve these issues in the short term.

rmackay9 · May 9, 2019, 11:58pm

@James_Megariotis, we don’t have a non-blocking delay function I’m afraid so instead you’ll likely need to write a function that gets called on each iteration of the main loop (or put it into the scheduler table) and then perhaps add a counter (or timer) within the function so that it can decide itself what it should do next.
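
A minimal sketch of the pattern described above, assuming the relay needs one high pulse followed by a low period. RelayPulseSequencer and its members are hypothetical names, not ArduPilot code; in a real build the timestamp would presumably come from AP_HAL::millis() and the returned level would be written to the aux output by the caller.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical non-blocking pulse sequencer (not an ArduPilot class).
// Instead of delaying inside the arming code, start() returns at once
// and update() is called on every main-loop iteration.
class RelayPulseSequencer {
public:
    enum class State { Idle, PulseHigh, PulseLow, Done };

    // Begin the sequence; never blocks.
    void start(uint32_t now_ms, uint32_t high_ms, uint32_t low_ms) {
        assert(high_ms > 0 && low_ms > 0);
        _high_ms = high_ms;
        _low_ms = low_ms;
        _state_start_ms = now_ms;
        _state = State::PulseHigh;
    }

    // Advance the state machine and return the level the output
    // should be driven to right now (true = high).
    bool update(uint32_t now_ms) {
        // Unsigned subtraction stays correct if the ms counter wraps.
        const uint32_t elapsed = now_ms - _state_start_ms;
        switch (_state) {
        case State::PulseHigh:
            if (elapsed >= _high_ms) {
                _state = State::PulseLow;
                _state_start_ms = now_ms;
            }
            break;
        case State::PulseLow:
            if (elapsed >= _low_ms) {
                _state = State::Done;
            }
            break;
        case State::Idle:
        case State::Done:
            break;
        }
        return _state == State::PulseHigh;
    }

    State state() const { return _state; }

private:
    State _state = State::Idle;
    uint32_t _state_start_ms = 0;
    uint32_t _high_ms = 0;
    uint32_t _low_ms = 0;
};
```

Each call does a few comparisons and returns, so the EKF and the rest of the main loop keep running while the pulse is being generated. The timing resolution is limited to the rate at which update() is called.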

amilcarlucas · June 21, 2019, 9:08am

Any progress on this one? We also have the issue.

James_Megariotis · July 18, 2019, 8:54pm

No, I am still investigating the issue hoping to solve it, but no luck as of yet. This may be a problem that is out of my league. I am not familiar with what code has changed relating to arming and the start of dataflash logging, or more specifically log file creation. I know for sure it doesn’t happen in 3.5 versions of arducopter and seemed to appear only in 3.6 versions.

My best solution thus far is to increase the disarm delay to 30 seconds and to wait 20 seconds after arming before sending any sort of guided or takeoff commands. Since my use case requires velocity controlled takeoffs, this gives the EKF time to settle after the introduction of the initial error. Also, EKF3 seems to exhibit a smaller altitude/z-velocity error than EKF2, by about 50%. Using a rangefinder as the primary alt source at low altitudes also significantly reduces the error.

chobitsfan · March 25, 2020, 6:22am

I have a similar issue in https://github.com/ArduPilot/ardupilot/issues/13882. I closed it because it seems to be a duplicate of https://github.com/ArduPilot/ardupilot/issues/11304

In my case, it did not happen when logging is disabled.

chobitsfan · March 25, 2020, 1:49pm

Hi @rmackay9

ArduPilot/ardupilot/blob/b3ff81648b6cff8f87f1a30d1a0ecad82d2f0097/libraries/AP_Logger/AP_Logger_Backend.cpp#L103-L105


if (!_startup_messagewriter->finished() &&
    !hal.scheduler->in_main_thread()) {
    // only the main thread may write startup messages out

May I ask why startup messages are written only in the main thread? Doesn’t that cost main/EKF loop time? Thank you very much.

peterbarker · March 25, 2020, 10:33pm

While the backends are supposed to be re-entrant, I believe this does
actually avoid some corruption issue we had.

It was rather easier to just let the main thread do the work and disallow
logging from threads than trying to cope with a thread trying to write a
log message while we’re in the middle of running a messagewriter.

We should be constrained in the amount of time we spend in here by the
size of the output buffers. Have you measured a problem?

chobitsfan · March 26, 2020, 1:45am

Thank you @peterbarker

As @James_Megariotis reported in “EKF Vertical Position and Velocity Errors on Arming Events” (ArduPilot/ardupilot issue #11304) and in this thread, starting logging on the first arm (log_disarm=0) seems to make the EKF altitude climb incorrectly for a few seconds while the copter is still on the ground. It does not happen if you disarm and then arm again. I have a similar issue in “EKF2: altitude climbing in first arm without takeoff” (ArduPilot/ardupilot issue #13882, logs attached). It seems to be caused by the large amount of log writing during the first arm (params, formats, etc.). If I set log_disarm=1, I can see the EKF altitude climb incorrectly right after the system starts, but that is not a problem because it is unlikely that the copter will be armed that soon.

It was rather easier to just let the main thread do the work and disallow
logging from threads than trying to cope with a thread trying to write a
log message while we’re in the middle of running a messagewriter.

Thank you. But why only “startup message” ?

peterbarker · March 26, 2020, 2:09am

Just the state within the MessageWriter objects, as far as I can recall.

chobitsfan · March 26, 2020, 2:35am

Just the state within the MessageWriter objects, as far as I can recall.

I am sorry, but I did not understand. Did you mean that all log writing is done in the main thread (not only startup messages)? Thank you.

peterbarker · March 26, 2020, 2:54am

No, we do allow log writes in other threads. It’s just the startup
messagewriters we don’t allow to be run from other threads.

It might be that we need to add some time constraints in around the log
message writers. Each of the while() loops in each of the ::process
methods would need a check to see if we’re out of time for writing more
messages.

I think a bit of profiling might be a good idea first; we also open the
file in the main thread - that could also be causing the issue.
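
As a rough illustration of the time-constraint idea above, a writer loop with a per-call budget could look something like this. It is a sketch only: BudgetedWriter is a hypothetical stand-in, not the AP_Logger messagewriter classes, and the injected clock is an assumption made so the example runs outside the firmware.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical stand-in for a startup message writer (not AP_Logger code).
// process() stops early once its time budget is used up, so the remaining
// messages are spread over later main-loop iterations.
class BudgetedWriter {
public:
    BudgetedWriter(uint32_t total_messages, std::function<uint32_t()> clock_us)
        : _remaining(total_messages), _clock_us(std::move(clock_us)) {
        assert(static_cast<bool>(_clock_us));
    }

    // Write messages until finished or until budget_us has elapsed.
    // Returns the number of messages written during this call.
    uint32_t process(uint32_t budget_us) {
        const uint32_t start_us = _clock_us();
        uint32_t written = 0;
        while (_remaining > 0) {
            // Out of time for this iteration: resume on the next call.
            if (_clock_us() - start_us >= budget_us) {
                break;
            }
            write_one();
            --_remaining;
            ++written;
        }
        return written;
    }

    bool finished() const { return _remaining == 0; }

private:
    // Placeholder for the real work of writing one log message.
    void write_one() {}

    uint32_t _remaining;
    std::function<uint32_t()> _clock_us;
};
```

The check runs before each write, so a single slow write can still overrun the budget; this only bounds how many writes are attempted per call. It would not help if the time is actually being lost in opening the file, which is why profiling first makes sense.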

chobitsfan · March 26, 2020, 5:47am

I think a bit of profiling might be a good idea first; we also open the
file in the main thread - that could also be causing the issue.

Thank you for pointing me in the right direction. I will try to do it

chobitsfan · April 9, 2020, 2:50am

Hi @James_Megariotis
I have a possible solution https://github.com/ArduPilot/ardupilot/pull/14030
It works well in my flight tests. Would you be able to test it? Thank you very much

James_Megariotis · July 6, 2020, 5:43pm

@chobitsfan Awesome! I just saw this. I have been very busy lately but I am very happy to see that this issue has not been forgotten. I will pull this change into my code asap and test it over the next few days. Hopefully your work will solve this issue because this seemingly insignificant issue has caused me lots of trouble.

Thanks and I will let you know how it goes.