ArduPilot:master
← tridge:pr-fix-can-deadlock
opened 11:00AM - 24 Sep 24 UTC
This fixes a set of issues found while investigating a watchdog from @lthall
PR… depends on: https://github.com/dronecan/libcanard/pull/72
The situation was:
- an H7 with a MAVLink over serial over DroneCAN setup, MissionPlanner connected to the serial port
- SERIAL1_PROTOCOL set to GPS, with GPS1_TYPE=AUTO, but no GPS plugged in
There were several interacting bugs:
- the canard layer could run out of buffer space in its allocator partway through a message send, so only some of the frames were sent onto the bus, resulting in a corrupt message, wasted bandwidth and broadcast() returning false
- MissionPlanner sends CAN_FORWARD requests on all connections, regardless of type, which on a CAN serial port means it is requesting to send its own packets over CAN, resulting in an "infinite" number of packets and complete bus saturation
- MissionPlanner issue here: https://github.com/ArduPilot/MissionPlanner/issues/3417
- once the bus was saturated, serial.update() in the DroneCAN thread would not make any progress, so it would keep trying to send the same bytes forever but no full message could ever be sent. From the user's point of view the CAN bus is dead (the UI shows no messages, as all messages are corrupt)
- the lack of a sleep in the DroneCAN thread meant that all lower-priority threads, including the UART threads, would stop running, while the main thread kept running
- the main thread then tries a begin() in AP_GPS for a new baudrate, but the UART thread is still stuck in TX from the last set of GPS probe bytes and can't make progress because the higher-priority DroneCAN thread is running non-stop
- the main thread then gets stuck waiting on the _in_tx_timer boolean lock, causing the main thread to lock up
- this triggers a watchdog
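
To illustrate the first bug in the list, here is a minimal sketch of an all-or-nothing enqueue: check that the pool has room for every frame of a transfer before queuing any of them, so a multi-frame message is never left half sent. This is not the libcanard code and not necessarily how the linked libcanard PR handles it. `TxQueue`, `frames_needed` and `enqueue_transfer` are hypothetical names, and the framing is simplified (a fixed number of payload bytes per frame, no transfer CRC or tail byte).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity TX queue, standing in for a CAN stack's
// block allocator. Each entry records the payload length of one queued frame.
struct TxQueue {
    std::size_t capacity;
    std::vector<std::size_t> frames;

    explicit TxQueue(std::size_t cap) : capacity(cap) {}

    std::size_t free_slots() const { return capacity - frames.size(); }
};

// Frames required for a payload, assuming a fixed number of usable bytes
// per frame (simplification: ignores CRC and tail-byte overhead).
std::size_t frames_needed(std::size_t payload_len, std::size_t bytes_per_frame) {
    assert(bytes_per_frame > 0);
    if (payload_len == 0) {
        return 1; // an empty transfer still occupies one frame
    }
    return (payload_len + bytes_per_frame - 1) / bytes_per_frame;
}

// Queue the whole transfer or nothing at all. Returning false leaves the
// queue untouched, so no partial (corrupt) message reaches the bus.
bool enqueue_transfer(TxQueue &q, std::size_t payload_len, std::size_t bytes_per_frame) {
    const std::size_t n = frames_needed(payload_len, bytes_per_frame);
    if (n > q.free_slots()) {
        return false;
    }
    std::size_t remaining = payload_len;
    for (std::size_t i = 0; i < n; i++) {
        const std::size_t chunk = std::min(remaining, bytes_per_frame);
        q.frames.push_back(chunk);
        remaining -= chunk;
    }
    return true;
}
```

With a 4-slot queue and 7 bytes per frame, a 20-byte transfer takes 3 slots; a following 10-byte transfer (2 frames) is rejected outright rather than queuing its first frame and dropping the second.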