Watchdog reset in flight; ArduPilot hangs and does not complete startup

Hi,
Hardware: Matek H743-Wing, all external hardware unplugged, powered via USB.
Software: ArduPlane 4.0.6
Symptoms: The board does not complete startup. It starts as usual: red LED on, blue flashing. Then red on and green flashing with the startup melody. Just before the melody ends (on one of the last tones), the FC hangs: the green LED stays constantly on or off (randomly), and the buzzer either holds a constant tone or goes silent without playing the last tones of the startup melody. The USB serial port shows up in Windows, but Mission Planner throws an error on connection.

Problem:
A few weeks ago the symptoms appeared for the first time. I installed the Plane 4.0.7 beta via Mission Planner; no change. I then installed the Copter firmware (which reset the parameters) and the board booted again. I changed back to Plane 4.0.6, set all parameters as desired, and everything worked at that point. Since then I have flown a couple of times (around 3 flight hours and more than 10 hours of runtime on the ground setting up parameters, experimenting with MAVFTP, scripts, etc.). Today I had a moment in manual flight when the plane was uncontrollable (a watchdog reset, I presume, as the GCS showed some messages containing something like that).
After landing and some fiddling with the GCS, I power-cycled the plane (maybe a few times, I do not remember exactly) and got the same symptoms as a few weeks ago. I do not feel comfortable just resetting the FC with Copter firmware again and continuing to fly. Maybe someone can help me find the problem?

(The watchdog event happened at 90.2% of the telemetry log; in the bin logs it was right at the end of log 75 / beginning of log 76.)
https://drive.google.com/file/d/1rbLchmfjQhBBdPWLA52jtJzt86Xd_k4E/view?usp=sharing

I’m not sure, but I think I may have had the same issue as you using an H743 on a custom board. I thought it might have been an isolated problem with my hardware, but if it’s happening again then it may well be a real issue.
On my hardware I would fly/set up a few times and then the FC would lock up and wouldn’t even boot. After re-flashing the firmware it would work for a while, then the same thing would happen. Does that sound like the same issue you are having?

That sounds exactly like the problem I was having (just without the WDT reset, which might have nothing to do with this anyway). It has locked up twice now. I have been playing around with it for close to a month, with the two lockups around 2 weeks apart. Let’s see if it does that again next week :stuck_out_tongue:

I think I have the same problem with a Matek H743-WING.
Rover 4.1.0-dev.
It sometimes hangs at boot. I could only bring it back to life by flashing the HEX with Betaflight Configurator in DFU mode.

I am also having problems with the ICM20602: AccY gets an offset of -400 after a while (about 10-30 min). After a preflight reboot, AccY is normal again for a while, without any power cycle.

That is this fault:
WDOG {TimeUS : 144958330, Tsk : -3, IE : 0, IEC : 0, IEL : 0, MvMsg : 0, MvCmd : 0, SmLn : 0, FL : 122, FT : 3, FA : 136061728, FP : 59, ICSR : 4196355, LR : 135415613, TN : stor}

It means it is a fault in the storage code (the code that saves parameters and waypoints). It seems to happen immediately after a parameter set of TECS_LAND_THR, so I suspect a bug in the flash storage backend used on the MatekH743.
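The WDOG fields above are logged as plain decimals, which makes them hard to read. A minimal sketch for turning them into something easier to look up, assuming (not confirmed in this thread) that FA and LR hold the captured fault address and link register, and that ICSR is the Cortex-M Interrupt Control and State Register, whose bits [8:0] (VECTACTIVE) give the active exception number. The helper name `decode_wdog` is hypothetical:

```python
# Cortex-M exception numbers from the ARMv7-M architecture; 0 means thread mode.
EXCEPTIONS = {
    0: "Thread mode",
    1: "Reset",
    2: "NMI",
    3: "HardFault",
    4: "MemManage",
    5: "BusFault",
    6: "UsageFault",
}


def decode_wdog(fields):
    """Convert selected WDOG log fields into hex addresses and an exception name.

    Assumption: FA/LR are code or data addresses and ICSR is the raw
    Interrupt Control and State Register value.
    """
    vectactive = fields["ICSR"] & 0x1FF  # ICSR bits [8:0] = active exception number
    return {
        "fault_addr": f"0x{fields['FA']:08X}",
        "link_reg": f"0x{fields['LR']:08X}",
        "active_exception": EXCEPTIONS.get(vectactive, f"exception {vectactive}"),
        "thread": fields["TN"],
    }


# Values copied from the WDOG message above.
wdog = {"FA": 136061728, "LR": 135415613, "ICSR": 4196355, "TN": "stor"}
print(decode_wdog(wdog))
```

With these values FA and LR fall in the 0x08xxxxxx range, the STM32H7’s internal flash, so they could be fed to `arm-none-eabi-addr2line -e` together with the ELF of the exact same firmware build to get source lines. The ICSR value decodes to exception 3 (HardFault), which would be consistent with the hard-fault theory discussed later in the thread.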

I have not seen this issue and have no idea whether it’s related, but when I compile for the MatekH743 I see this warning related to SDMMC, so it might be related, or not (attachment: arducopter-matekh743-warnings).

That warning is harmless. I put it in to remind me to work out how to set up the SDMMC clock on the H7 to get a faster SD card transfer rate. I really should go back and do that.
The watchdog reported above is for the storage thread, which handles parameters/missions, not the IO thread, which deals with the SD card.

I’ve ordered a production MatekH743 for testing. I only have a pre-production board, which doesn’t have the same baro setup (so it can’t run our releases).
Meanwhile, I modified the hwdef.dat to allow building for my pre-production board, and I’ve set it up to continuously change parameters in the hope I can reproduce the issue.
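For anyone wanting to run a similar parameter-churn test on their own board from the GCS side, a minimal sketch is below. It is not the setup described above. It assumes pymavlink is installed and the board is connected over USB serial; the port name, the choice of TECS_LAND_THR, the value range, and the helper names `churn_values` and `stress_param` are all hypothetical:

```python
import itertools
import time


def churn_values(low, high, step):
    """Cycle forever through evenly spaced values from low to high."""
    n = int(round((high - low) / step)) + 1
    for i in itertools.cycle(range(n)):
        yield low + i * step


def stress_param(port, name, low, high, step, iterations, baud=115200):
    """Repeatedly set one parameter and stop if the board stops answering.

    Returns the number of writes that got a PARAM_VALUE reply.
    """
    # Imported here so churn_values can be used without pymavlink installed.
    from pymavlink import mavutil

    master = mavutil.mavlink_connection(port, baud=baud)
    try:
        master.wait_heartbeat()
        values = churn_values(low, high, step)
        for n in range(iterations):
            master.mav.param_set_send(
                master.target_system,
                master.target_component,
                name.encode("ascii"),
                next(values),
                mavutil.mavlink.MAV_PARAM_TYPE_REAL32,
            )
            # The autopilot replies to PARAM_SET with PARAM_VALUE; no reply
            # within the timeout is treated as a possible hang.
            reply = master.recv_match(type="PARAM_VALUE", blocking=True, timeout=5)
            if reply is None:
                print(f"no PARAM_VALUE after write {n}: board may have hung")
                return n
            time.sleep(0.1)
        return iterations
    finally:
        master.close()
```

A hypothetical call would be `stress_param("COM5", "TECS_LAND_THR", 0, 50, 1, 10000)`; pick a parameter and range that are harmless for your setup, and keep Mission Planner disconnected while it runs.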

  • It seems to happen immediately after a parameter set of TECS_LAND_THR

What do you mean by that? I can assure you that the GCS was not touched in flight, and definitely no parameters or mission items were changed.

Ah, sorry, I got the timestamps wrong. The set of TECS_LAND_THR was after the watchdog (at 1:44:18 in my timezone); the watchdog was at 1:41:43.

Yup, same situation today. We are at number 3…

I’ve ordered a production MatekH743 for testing. I only have a pre-production board which doesn’t have the same baro setup (so can’t run our releases).

I was wondering why the hwdef doesn’t mention the DPS310 baro (which I believe is what the board is equipped with), and why no backend code exists for this baro.

It uses the DPS280 driver; the DPS310 works the same way as the DPS280.

I’ve now received my production H743-Wing, and I’ve set it up in a loop loading missions to try to reproduce the issue.
Meanwhile, I am still interested in getting more logs showing this issue. I have one so far that points at the storage code, but more logs may help me narrow it down if I can’t reproduce it myself.

Which issue are you talking about exactly? The WDT reboot or the FC hanging on startup?

I am interested in both. There are a few things we need to determine:

  • Are these issues only on specific boards, or on all boards (i.e. why can’t I reproduce them)?
  • Are they related to the connected peripherals, or to the history of writes to flash storage?

As you seem to be able to reproduce it, can you try to strip back to the smallest set of attached cables that still reproduces the issue? Ideally just USB, nothing else connected. The idea is to make it easier for me to reproduce.
Note that the two issues may be one issue. For example, if it is getting a hard fault and it sometimes happens during startup and sometimes after startup then it would explain both issues. We don’t enable the watchdog timer until after startup completes, so a fault during startup will cause the startup symptoms you describe.
If we can’t find a way to transfer the issue to me so I can reproduce it, then I may need to ask you to send the board to either myself or @sampson to look at. In that case I’d be happy to pay for a new board for you.
If you can find a way to reproduce it, then I’d also like to know if it happens on our latest builds from ArduPilot firmware: /Plane/latest/MatekH743

@tridge when it happened on my board, the last time it just fell about 3 meters. I solved the issue by saving params to the SD card. It’s been fine since the summer, no issues. I’ll try and find the logs.
Don’t know if it’s any help, but I thought I would let you know. It’s a custom H743 board.

As the problem occurs randomly, I cannot reproduce it at will. The only thing I can say for sure is that once it hangs, it does not boot unless I reset it using another firmware (Copter instead of Plane).

I can try the following:

  • Flash the latest dev firmware (it also occurred with 4.1.0-dev from mid-November, around the 14th)
  • Enable logging while disarmed
  • Unplug everything but USB and SD card and let it run for a few days (occasionally replugging and rebooting), hoping that the issue is captured in the dataflash and telemetry logs.

Anything to add or do differently? As the issue happened about 2 weeks apart each time, it does not occur very frequently…

So, reproduced once again:

  • Latest firmware, disarmed logging enabled, no devices connected (only SD card and USB)
  • Running for about a day (log file 2.4 GB)
  • No abnormal messages on Mission Planner’s Messages tab; USB unplugged
  • USB replugged, no connection possible (Mission Planner error)
  • USB replugged again, still no connection possible. Melody and green LED hanging.
  • Connected OLED, USB replugged, OLED reads INIT in the top line forever

Logfiles:
https://168.119.233.154/nextcloud/index.php/s/3ToQbDGC2f5mbyK
https://168.119.233.154/nextcloud/index.php/s/K64PGffnPq69MED
https://168.119.233.154/nextcloud/index.php/s/aXaBeeH8stE7BDN

Similar problem with Matek H743-WING

  • changed telemetry hardware from SiK to ESP8266
  • forgot to change the baud rate for SERIAL3
  • connected USB while the rover was still powered by battery
  • changed the telemetry baud rate in Mission Planner, saved
  • did a preflight reboot, rover still powered by battery and USB
  • rover is now in a watchdog reboot loop
  • disconnected battery and USB power
  • rover hangs on boot
  • can’t flash Rover via Mission Planner; had to flash the hex+bl with Betaflight Configurator

But I’m not sure whether the board itself has a problem.
The ICM20602 AccelY gets an offset of about -400 after a while, but movement still seems to be detected. I deactivated IMU0 for now; the remaining MPU6000 is good enough for a rover.