Servers by jDrones

Copter EKF altitude estimation error due to heavy uart communication load

Sorry for the 3 posts in a row, but just wanted to confirm that the signature I posted in image form above is related to the start of dataflash logging. If I log while disarmed and run the same test, that signature still exists just as the log begins. I will attach the same arming only log gathered while log while disarmed is enabled. you can see that arming still exhibits a much smaller error in vertical velocity and position.


arming takes place at about 00:30 on the time axis. you can see that there is still a small error in vertical pos and vel but significantly smaller than when the log starts at arming. The signature seen here will also get smaller/non-existent if all telem/serial ports are disabled (except gps and usb ports). But you can see that the same 1.5 meter vertical pos error exists at the point where the log starts. It also seems to get larger as parameters are downloaded upon connecting to mission planner(between 58:00 and 59:00). Once again, this cannot be replicated in SITL. There is definitely some strangeness related to uart load and dataflash adversely affecting the vertical ekf estimates.

I have attempted to attach the log but it is too large to include in this post.

Have you tried the stable firmware without any modifications and got same result? From your log I can’t find the reasons. Your accel z and barometer seems to be fine, only the innovation tells you that EKF height is wrong. How about changing ek2_gps_type to 1, which does not use 3d gps velocity? If it’s not sensor issue, I suspect that the reason will be related to EKF fusion time problem caused by the delay I do not know.

I confirm that I can see this “altitude estimation climb on arming” even in telemetry without enabling logs.

This begins before props start spinning.

Yes I have tried stable firmware version. It is present in the current stable release as well.

There seems to be some delay after ekf height reset and this in turn leads to a vertical velocity error. If I add additional delay in arming sequence, the error gets much larger. This problem was not present in the 3.5 firmware branch, since I had some short delays in the arming sequence on that branch.

Seems to definitely be related to dataflash logging since arming using log while disarmed feature leads to much smaller errors. Serial overhead also seems to affect the magnitude of the error.

Also if you download logs via mavlink on current stable release while running ekf2, the ekf goes bad very quickly. If running ekf3, the ekf goes into an unrecoverable state with vertical pos variance errors and compass variance issues. While running ekf3, the pixhawk must be rebooted after downloading logs as the ekf cannot recover from whatever errors are introduced during log download. This also never used to happen on the 3.5 arducopter branch.

Something strange is going on in 3.6 and beyond leading to really bizarre ekf behavior.

I may try the master branch and see if this same problem exists.

Also just fyi, it is possible to minimize these errors using certain ekf params. Decreasing ALT_M_NSE seems to reduce the error. Also if using a rangefinder, RFND_USE_HGT set to a nonzero value has a similar effect since the default RFND_M_NSE param is smaller than the default ALT_M_NSE param. Tinkering with these params can help mitigate the errors by decreasing the magnitude but does not eliminate them.

Just to update, I posted an issue on the ardupilot repository related to the issues I described in this forum post.

2 Likes

Nice analysis and report.

I guess that the dataflash logging is causing some loops to run very log which will upset the EKF. We report this in the dataflash log’s PM message.

When we start logging we dump a whole ton of information (like the parameter values and FMT messages) so this would explain why enabling logging while disarmed would reduce the problem.

I don’t know how to resolve the problem though, at the risk of passing the buck, @peterbarker is our logging maintainer.

Perhaps we could gather more information on the loop timing…

Yes I inspected the PM message, and loop time was incredibly long as logging starts. I think I saw values of about 60,000 on log start, when the average loop time was about 2,600. I am assuming this is a value in microseconds. That would mean loops are running about 25x longer than they should.

The thing is, that would indicate that something is blocking the loop, but i did a test where I introduced a delay using hal.scheduler->delay_microseconds(n_us) during the arming sequence. Increasing the delay increased the magnitude of the error. I was under the impression that scheduling a delay using the scheduler would not block the main loop. But I could be misinterpreting the results of that test or the behavior of that delay function.

I will try to gather more info tomorrow on loop timing and post it if I find anything useful.

@James_Megariotis, ok, adding a delay in the arming function would block the main loop… so that fits with what you’re seeing.

@rmackay9 That’s good to know. I was not aware that hal scheduler delays were a blocking delay. So if I use that function to delay code, it blocks updates to the ekf and main loop execution. Is there a non-blocking form of delay? I was under the impression that the scheduler delay would just schedule the current thread to continue after x microseconds, yielding to other threads in the meantime.

The only reason I am a little confused is that in my custom copter-3.5, I used that delay in the arming sequence and disarming sequence to allow me to generate a specific digital signal on an aux servo port to arm a relay that required an unusual signal to throw. I never ran into any similar vertical ekf errors when using that method on previous branches.

I have found a temporary workaround for the vertical ekf errors, where after arming the vehicle, I wait 20 seconds before taking off to allow the ekf errors to settle. I plan to investigate a little more to try and find a more graceful and less time consuming fix.

I also need to find a way to reintegrate the relay arming code into 3.6 without triggering ekf issues. If you have any ideas on a possible workaround, any suggestions or advice would be awesome. Thanks!

Edit: If you are interested, this is a plot of the PM.MaxT and DSF.Bytes. We discussed some of this yesterday. This is what I see in my log. It appears that a much larger amount of data is being written on arming and max loop time suffers as a result.

I plan to experiment with log bufferzise and mavlink based dataflash logging to see if that will solve these issues in the short term.

@James_Megariotis, we don’t have a non-blocking delay function I’m afraid so instead you’ll likely need to write a function that gets called on each iteration of the main loop (or put it into the scheduler table) and then perhaps add a counter (or timer) within the function so that it can decide itself what it should do next.

Any progress on this one? We also have the issue.

1 Like

No, I am still investigating the issue hoping to solve it, but no luck as of yet. This may be a problem that is out of my league. I am not familiar with what code has changed relating to arming and the start of dataflash logging, or more specifically log file creation. I know for sure it doesn’t happen in 3.5 versions of arducopter and seemed to appear only in 3.6 versions.

My best solution thus far is to increase disarm delay to 30 seconds. And wait 20 seconds after arming to send any sort of guided or takeoff commands. Since my use case requires velocity controlled takeoffs, this gives the EKF time to settle after the introduction of the initial error. Also EKF3 seems to exhibit a smaller altitude/z-velocity error than EKF2 by about 50%. Using rangefinder for a primary alt source at low altitudes also significantly reduces the error.

I have similar issue in https://github.com/ArduPilot/ardupilot/issues/13882. I closed it because it seems a duplicate of https://github.com/ArduPilot/ardupilot/issues/11304

In my case, it did not happen when logging is disabled.

Hi @rmackay9

May I ask why only write startup msgs in main thread? Isn’t it cost main/EKF loop time? Thank you very much

While the backends are supposed to be re-entrant, I believe this does
actually avoid some corruption issue we had.

It was rather easier to just let the main thread do the work and disallow
logging from threads than trying to cope with a thread trying to write a
log message while we’re in the middle of running a messagewriter.

We should be constrained in the amount of time we spend in here by the
size of the output buffers. Have you measured a problem?

Thank you @peterbarker

As @James_Megariotis reported in https://github.com/ArduPilot/ardupilot/issues/11304 and in this thread. Start logging in first arm (log_disarm=0) seems make EKF altitude incorrectly climbing for a few seconds while copter still on the ground. It is not happened if you disarm and then arm again. I also have similar issue in https://github.com/ArduPilot/ardupilot/issues/13882 (logs attached). I found it seems caused by lots of log writing during first arm (param, format, etc…). If I set log_disarm=1, I can see EKF altitude incorrectly climbing after system started. But it is not problem it is unlikely to arm copter that soon.

It was rather easier to just let the main thread do the work and disallow
logging from threads than trying to cope with a thread trying to write a
log message while we’re in the middle of running a messagewriter.

Thank you. But why only “startup message” ?

Just the state within the MessageWriter objects, as far as I can recall.

Just the state within the MessageWriter objects, as far as I can recall.

I am sorry but I did not understand you. Did you mean all log writing are done in main thread ( not only startup message)? Thank you

No, we do allow log writes in other threads. It’s just the startup
messagewriters we don’t allow to be run from other threads.

It might be that we need to add some time constraints in around the log
message writers. Each of the while() loops in each of the ::process
methods would need a check to see if we’re out of time for writing more
messages.

I think a bit of profiling might be a good idea first; we also open the
file in the main thread - that could also be causing the issue.

1 Like

I think a bit of profiling might be a good idea first; we also open the
file in the main thread - that could also be causing the issue.

Thank you for pointing me in the right direction. I will try to do it

Servers by jDrones