That makes a lot of sense. Dataflash logging would explain the issues with the first arming event after a power cycle since that is the point at which the log file is created. Also since I cant replicate these vertical velocity and position errors that occur on arming in a SITL simulation, something hardware related seems likely.
Seems like our symptoms are very different since I don’t experience long term divergence, just large instantaneous errors in vertical position and velocity on arming. I will try to get more information in my testing today.
Just as an added note, I did test a theory in which I placed a hal.scheduler->delay_microseconds(n_ms) in the arming code copter:: init_arm_motors() just before the function returns to see it increases the vertical errors, and it does. The longer the delay, the greater the vertical velocity and position errors produced on arming.
I think there is some new delay just after ekf reset that causes drift in vertical position and velocity. This did not happen when I used a 3.5.5 version of arducopter.
Also just for info, below I have attached an image of the signature I see on the first arming event of every power cycle
2019-05-06 12-30-26.log (370.2 KB)
Just following up with a log of an arming event that results in vertical pos and vel errors. It is just a single arming with no flight in case anyone wants to investigate
Sorry for the 3 posts in a row, but just wanted to confirm that the signature I posted in image form above is related to the start of dataflash logging. If I log while disarmed and run the same test, that signature still exists just as the log begins. I will attach the same arming only log gathered while log while disarmed is enabled. you can see that arming still exhibits a much smaller error in vertical velocity and position.
arming takes place at about 00:30 on the time axis. you can see that there is still a small error in vertical pos and vel but significantly smaller than when the log starts at arming. The signature seen here will also get smaller/non-existent if all telem/serial ports are disabled (except gps and usb ports). But you can see that the same 1.5 meter vertical pos error exists at the point where the log starts. It also seems to get larger as parameters are downloaded upon connecting to mission planner(between 58:00 and 59:00). Once again, this cannot be replicated in SITL. There is definitely some strangeness related to uart load and dataflash adversely affecting the vertical ekf estimates.
I have attempted to attach the log but it is too large to include in this post.
Have you tried the stable firmware without any modifications and got same result? From your log I can’t find the reasons. Your accel z and barometer seems to be fine, only the innovation tells you that EKF height is wrong. How about changing ek2_gps_type to 1, which does not use 3d gps velocity? If it’s not sensor issue, I suspect that the reason will be related to EKF fusion time problem caused by the delay I do not know.
Yes I have tried stable firmware version. It is present in the current stable release as well.
There seems to be some delay after ekf height reset and this in turn leads to a vertical velocity error. If I add additional delay in arming sequence, the error gets much larger. This problem was not present in the 3.5 firmware branch, since I had some short delays in the arming sequence on that branch.
Seems to definitely be related to dataflash logging since arming using log while disarmed feature leads to much smaller errors. Serial overhead also seems to affect the magnitude of the error.
Also if you download logs via mavlink on current stable release while running ekf2, the ekf goes bad very quickly. If running ekf3, the ekf goes into an unrecoverable state with vertical pos variance errors and compass variance issues. While running ekf3, the pixhawk must be rebooted after downloading logs as the ekf cannot recover from whatever errors are introduced during log download. This also never used to happen on the 3.5 arducopter branch.
Something strange is going on in 3.6 and beyond leading to really bizarre ekf behavior.
I may try the master branch and see if this same problem exists.
Also just fyi, it is possible to minimize these errors using certain ekf params. Decreasing ALT_M_NSE seems to reduce the error. Also if using a rangefinder, RFND_USE_HGT set to a nonzero value has a similar effect since the default RFND_M_NSE param is smaller than the default ALT_M_NSE param. Tinkering with these params can help mitigate the errors by decreasing the magnitude but does not eliminate them.
I guess that the dataflash logging is causing some loops to run very log which will upset the EKF. We report this in the dataflash log’s PM message.
When we start logging we dump a whole ton of information (like the parameter values and FMT messages) so this would explain why enabling logging while disarmed would reduce the problem.
I don’t know how to resolve the problem though, at the risk of passing the buck, @peterbarker is our logging maintainer.
Perhaps we could gather more information on the loop timing…
Yes I inspected the PM message, and loop time was incredibly long as logging starts. I think I saw values of about 60,000 on log start, when the average loop time was about 2,600. I am assuming this is a value in microseconds. That would mean loops are running about 25x longer than they should.
The thing is, that would indicate that something is blocking the loop, but i did a test where I introduced a delay using hal.scheduler->delay_microseconds(n_us) during the arming sequence. Increasing the delay increased the magnitude of the error. I was under the impression that scheduling a delay using the scheduler would not block the main loop. But I could be misinterpreting the results of that test or the behavior of that delay function.
I will try to gather more info tomorrow on loop timing and post it if I find anything useful.
@rmackay9 That’s good to know. I was not aware that hal scheduler delays were a blocking delay. So if I use that function to delay code, it blocks updates to the ekf and main loop execution. Is there a non-blocking form of delay? I was under the impression that the scheduler delay would just schedule the current thread to continue after x microseconds, yielding to other threads in the meantime.
The only reason I am a little confused is that in my custom copter-3.5, I used that delay in the arming sequence and disarming sequence to allow me to generate a specific digital signal on an aux servo port to arm a relay that required an unusual signal to throw. I never ran into any similar vertical ekf errors when using that method on previous branches.
I have found a temporary workaround for the vertical ekf errors, where after arming the vehicle, I wait 20 seconds before taking off to allow the ekf errors to settle. I plan to investigate a little more to try and find a more graceful and less time consuming fix.
I also need to find a way to reintegrate the relay arming code into 3.6 without triggering ekf issues. If you have any ideas on a possible workaround, any suggestions or advice would be awesome. Thanks!
Edit: If you are interested, this is a plot of the PM.MaxT and DSF.Bytes. We discussed some of this yesterday. This is what I see in my log. It appears that a much larger amount of data is being written on arming and max loop time suffers as a result.
@James_Megariotis, we don’t have a non-blocking delay function I’m afraid so instead you’ll likely need to write a function that gets called on each iteration of the main loop (or put it into the scheduler table) and then perhaps add a counter (or timer) within the function so that it can decide itself what it should do next.
No, I am still investigating the issue hoping to solve it, but no luck as of yet. This may be a problem that is out of my league. I am not familiar with what code has changed relating to arming and the start of dataflash logging, or more specifically log file creation. I know for sure it doesn’t happen in 3.5 versions of arducopter and seemed to appear only in 3.6 versions.
My best solution thus far is to increase disarm delay to 30 seconds. And wait 20 seconds after arming to send any sort of guided or takeoff commands. Since my use case requires velocity controlled takeoffs, this gives the EKF time to settle after the introduction of the initial error. Also EKF3 seems to exhibit a smaller altitude/z-velocity error than EKF2 by about 50%. Using rangefinder for a primary alt source at low altitudes also significantly reduces the error.
While the backends are supposed to be re-entrant, I believe this does
actually avoid some corruption issue we had.
It was rather easier to just let the main thread do the work and disallow
logging from threads than trying to cope with a thread trying to write a
log message while we’re in the middle of running a messagewriter.
We should be constrained in the amount of time we spend in here by the
size of the output buffers. Have you measured a problem?
It was rather easier to just let the main thread do the work and disallow
logging from threads than trying to cope with a thread trying to write a
log message while we’re in the middle of running a messagewriter.