UART jitter issues

Hello

I’m not doing this often, but this time I just can’t figure this issue out and would like to seek your input:

I have an ArduPilot fork with special additions to operate a STorM32 gimbal. A key element is that ArduPilot has to send a MAVLink message (AUTOPILOT_STATE_FOR_GIMBAL_DEVICE) to the gimbal at 25 Hz with as little jitter as possible. A few ms of jitter is OK, but it shouldn’t be much worse than that, and messages should not be skipped or otherwise behave weirdly.

My implementation works perfectly fine for copter, but NOT for plane! For plane I see massive jitter and dropouts, and for the heck of it I cannot figure out why.

In copter, the code uses the mount->fast_update() function, which is called in copter’s fast_loop(), and counts down a counter to decimate its 400 Hz loop rate to 25 Hz. This gives perfectly fine results, as I can directly prove by measuring the incoming data as the STorM32 sees them, using the STorM32’s own logging capability (more below), but also in other ways, like spying on the messages on the MAVLink channel, or with some dedicated dataflash logging messages which I have added to the ArduPilot code (also more below).
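For illustration, here is a minimal sketch of that decimation logic (hypothetical names, not the actual BetaPilot code; the real implementation is in BP_Mount_STorM32_MAVLink.cpp):

```cpp
// Minimal sketch of the rate decimation described above (hypothetical names).
// fast_update() runs at the vehicle loop rate (400 Hz in copter's fast_loop(),
// 50 Hz when hooked into plane's ahrs_update()); a countdown counter reduces
// that to the 25 Hz message rate.
#include <cstdint>

class MountRateDecimator {
public:
    explicit MountRateDecimator(uint16_t loop_hz) : _reload(loop_hz / 25) {}

    // called once per vehicle loop iteration
    void fast_update() {
        if (_counter > 0) {
            _counter--;
            return;
        }
        _counter = _reload - 1;   // 400/25 = 16 in copter, 50/25 = 2 in plane
        send_autopilot_state();   // queue AUTOPILOT_STATE_FOR_GIMBAL_DEVICE
    }

private:
    void send_autopilot_state() { /* mavlink_msg_..._send(chan, ...) goes here */ }

    const uint16_t _reload;
    uint16_t _counter = 0;
};
```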

In plane, the very same code is used, except that the mount->fast_update() function is hooked into plane’s ahrs_update() function (as plane doesn’t appear to have a fast_loop()). The code adapts to plane’s different loop rate of 50 Hz. I have tested, however, that the issue does not go away when changing to different loop rates (tested up to 400 Hz to match copter). I even made plane use the fast_loop() function to make it as close as possible to copter, but with the very same results.

The copter code I test on a MatekH743, and the plane code on a Pixhawk1, but I have verified that I get the very same results when I run the plane code on the MatekH743, and I also briefly tried an AUAVX2. I have no indication whatsoever that the type of flight controller matters. I should say that I do not have a plane and hence only do bench tests for plane, and am also substantially less knowledgeable about the plane code (copter I think I know reasonably well).

I have added logging messages to the ArduPilot code which store some timing values, such as the time when the “do the message” code part is called, the time differences which give info on the jitter, and the time it takes from the point the “do the message” part is entered until it has done its job, i.e. after the call of mavlink_msg_xxxx_send().

These ArduPilot logs yield perfectly regular timing, for both the copter and plane implementations. Yet, while for copter the STorM32 receives the messages at perfectly regular times, for plane it doesn’t at all.

From the fact that all I change is the code in the flight controller, I’m convinced that the issue is not with the STorM32 but something related to plane. From the fact that in both copter and plane I use exactly the same code additions, except that mount->fast_update() is hooked in differently, I am led to think that the issue is not directly in my code additions.

In fact, for plane it looks as if sending out the message on the UART does not occur when the mavlink_msg_xxx_send() function is called, but is delayed in some irregular way (it actually looks a bit like beating, as if it is driven by some incommensurate frequency) - but only in plane, not copter! I frankly admit that this doesn’t make any sense to me, given that all of this is in the library part, so it’s entirely speculation, but it would be an interpretation.

Any idea or hint towards what the origin of the issue could be would be very much appreciated!

(I’ve been working my ass off to figure this out for several weeks now, so I’m really desperate)

Many thx, Olli


Here are some representative plots to illustrate the issue.

1. STorM32 logs. It’s not important to know exactly what they show, only that each spike means that the STorM32 has received and digested said message. So, look at the regularity of the spikes. (for each, three zoom levels are shown)

1.a Copter

1.b Plane

The irregularity for plane should be obvious, especially when comparing to copter. It also seems as if there is some periodicity; the large gaps seem to reappear roughly every 9.5 secs or so.

2. DataFlash logs. The plots show the time differences between each entry of the “do the message” part. The y axis is in secs.

2.a Copter

2.b Plane

For both copter and plane the time difference between each call is very close to 40 ms, corresponding to 25 Hz.

3. DataFlash logs. The plots show the time from entry of the “do the message” part to after the call of mavlink_msg_xxx_send(). The y axis is in secs, but note that the scaling is 1e-6, i.e., the values are microsecs.

3.a Copter

3.b Plane

For both copter and plane the “execution” times are very similar, and in particular never take longer than a few hundred µs or so, so they never reach the multiple ms or more needed to account for the observed jitter in plane.

It might not be completely relevant, but can you link the specific branch you’re using?

ah, good idea, thx
I do not have the plane branch pushed, but here is the copter branch: https://github.com/olliw42/BetaPilot/tree/BetaCopter41. This is the development branch, so to speak. It is not exactly the code used for the testing, since during testing I constantly change a field or flag here and there, but it is very, very close and should accurately convey the main idea. The main code addition is in this file https://github.com/olliw42/BetaPilot/blob/BetaCopter41/libraries/AP_Mount/BP_Mount_STorM32_MAVLink.cpp, and the fast_update() function is here https://github.com/olliw42/BetaPilot/blob/BetaCopter41/libraries/AP_Mount/BP_Mount_STorM32_MAVLink.cpp#L218-L282. You can generally find all my changes by searching for //OW (or with some git diff voodoo, of course).

did some more testing, and IMHO it can’t be clearer now

for plane, I see clear jitter with a period of 1 sec and a period of ca 9.8 secs, with jitter also occurring at other times which I can’t classify that cleanly.

I believe it is also now unambiguously clear that it is plane, and not the STorM32, and that it must be something related to the sending on the UART being postponed relative to when the message is pushed into the UART buffer.

I did one experiment which I think is as “perfect” as I can think of:

  • I have some timing code which is logged in the dataflash log: in particular I have added a flag to see if the have_space check prevents a message from being pushed to the UART buffer (see the sketch after this list). From that dataflash log I can conclude that nothing troublesome happens up to this point. That is, the loop is called regularly at 25 Hz, it executes well within a few hundred µs, and the MAVLink message is pushed into the buffer, i.e., no messages are lost.
  • I sniffed the traffic on the UART from the flight controller to the gimbal controller, which unambiguously shows that the traffic already has the jitter as it comes from the flight controller, and that the STorM32 isn’t causing it. This also means that the STorM32 logs are representative of the issue.
  • I have the STorM32 log. Here too I can see, by simply counting the expected number of spikes per unit time, that no messages are lost.
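To make the first bullet concrete, here is roughly what that gate looks like (simplified sketch with made-up names; in the real code the check is done against the MAVLink channel’s available tx space before mavlink_msg_xxx_send() is called):

```cpp
// Sketch of the "have space" gate mentioned in the first bullet (made-up names).
// Key point: a successful send call only copies the packet into the UART's tx
// ring buffer - the actual transmission on the wire happens later, which is
// exactly where the observed delay must be introduced.
#include <cstdint>

struct TxStats {
    uint32_t queued = 0;
    uint32_t skipped_no_space = 0;   // the flag logged to dataflash
};

void maybe_queue_message(uint32_t txspace_bytes, uint32_t packet_len, TxStats &stats) {
    if (txspace_bytes < packet_len) {
        stats.skipped_no_space++;    // never observed in the tests above
        return;
    }
    // mavlink_msg_autopilot_state_for_gimbal_device_send(...) would go here
    stats.queued++;
}
```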

So, messages are not lost, and they are pushed into the UART buffer on time, but they are not always sent out on time - with quite big delays of above 100 ms.

The 1 sec period jitter is very simply explained by the fact that ArduPlane pushes out a bunch of MAVLink messages with no time gap every 1 sec, which causes one message to be sent delayed, by ca 20 ms. You can see that in the pictures below.
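As a rough sanity check of that ~20 ms (assuming the port runs at the usual 57600 baud telemetry rate, ~10 bits per byte on the wire): roughly 115 bytes of traffic queued ahead of the message occupy

115 bytes × 10 bit/byte / 57600 bit/s ≈ 20 ms

of line time, which is in line with the delay seen in the sniff.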

I do not have any explanation for the ca 9.8 sec period jitter and the other occurrences of jitter. They are not related to any additional traffic.

The exact code used here is in https://github.com/olliw42/BetaPilot/tree/betaplane41-apstormlinktest

Three more pics:

this pic shows the ca 9.8 sec period jitter, where quite big “gaps” open up, as the zoom into one of them shows. The big gap then leads to a bunching of the messages.

this pic shows the 1 sec period jitter. Here only one message is delayed. It also shows one of the gaps of the ca 9.8 sec period jitter.

this pic shows the sniff on the UART line from the flight controller to the STorM32. The first column is the time, the second the time difference between the AUTOPILOT_STATE_FOR_GIMBAL_DEVICE messages, which should be 0.040 secs. On the left you can see the traffic at a 1 sec period jitter. It’s clearly produced by ArduPlane’s behavior of pushing quite a few additional messages in a short sequence. On the right you can see the occurrence of one of the ca 9.8 sec period jitters: a massive delay of 150 ms and a corresponding bunching of subsequent messages.
Here it is followed by a PARAM_VALUE message. That looks like coincidence to me; in all other places in the sniff it has no effect or is “surrounded” by jitter of 0.001 secs. Also, in ArduPlane it is sent every 31 secs, so a relation to a ca 9.8 sec period isn’t obvious.

None of this happens in copter.
ArduPlane’s behavior of sending out a whole bunch of messages at the very same time is IMHO not good streaming behavior and should clearly be improved, but that’s another cup of tea. So, the question is: what is this 9.8 sec period jitter, and the other jitter?

150 us * 2^16 = 9.83 sec
no idea what this means though

things are getting clearer and clearer. I did the same experiment, but now with all SR parameters set to 0, in order to avoid the influence of the stream bunching. The jitter pattern is now extremely clean, repetitive and reproducible

it also allowed me to obtain a better estimate of the periodicity, which is 9.750(2) secs, i.e. 9.750 ± 0.002 secs

the jitters occur perfectly periodically with said periodicity of 9.75 secs, see pic1

in each period exactly 4 jitters occur; the one with the largest gap of well above 100 ms I denote as “main”, and the other three shorter ones as A, B, C, see pic2. For each main it really looks exactly like in this pic

in the sniff of the traffic I see only the messages expected, i.e. the AUTOPILOT_STATE_FOR_GIMBAL_DEVICE and RC_CHANNELS messages the code is supposed to send, plus a HEARTBEAT, TIMESYNC and PARAM_VALUE every now and then. There are no distortions due to the latter three. As said, it’s all very clean now.

I guess that’s about all I can achieve in order to characterize the issue. Hopefully this last info will give someone an idea of what its origin could be.

I’d like to try to track this down, but it is tricky to disentangle it from your setup. I loaded the betaplane41-apstormlinktest branch on a CubeBlack with default parameters, then set up SERIAL5 as MAVLink2 and MNT_TYPE=4 for a Storm32 gimbal. I don’t see any AUTOPILOT_STATE_FOR_GIMBAL_DEVICE messages.
Looking at the code, it won’t send anything until it finds a gimbal via find_by_mavtype(). I don’t have a gimbal, so it doesn’t initialise.
I tried hacking the code to fake up find_by_mavtype() but still don’t see the messages.
Even if I got it to send the message, it’s not in the XML message set in mavproxy, so I wouldn’t see it for debugging.
I could keep hacking it, but it’s getting a fairly long way from your test case so the value is unclear as far as finding bugs in ArduPilot (as opposed to something in your branch).
So instead I set up a much simpler test case in this branch:


that sends the WATER_DEPTH message at 25 Hz from Plane::ahrs_update(). The idea is that this mimics what you are doing with the gimbal state send at 25 Hz, without all the extra complications.
I loaded this on a CubeBlack, with these parameters:
http://uav.tridgell.net/tmp/mav-delay.parm
I used SERIAL5 (the cons port), and used a logic analyser to watch for the longest period with the UART in the idle state over 2 minutes:
(logic analyser capture showing idle-time statistics)
the longest idle state was 33 ms (Hmax in the capture above).
Can you see if you can reproduce the issue starting with the above test branch using WATER_DEPTH?
Note that the issue may depend on what UART you are using. If you can reproduce then make sure you mention which uart is being used.
Cheers, Tridge

first off, MANY thx for looking at this and for having already spent so much effort to track it down.
It is certainly true that it is very difficult for essentially all of you to create a testable environment; my hope was that someone more knowledgeable than me would hear about the issue and - kind of immediately - have an idea of what this is.

I will certainly try to get along with the test case you’ve created. It might take me some time, however, since I obviously can’t use the STorM32 logs anymore as a quick and nice tool to see the issue, but need to create some other tool(s).

so that I can follow your thoughts: what do you conclude from this 33 ms longest idle state?

many thx so far

BTW, the issue came up because rcgroups user shuricus tested the plane code with STorM32Link a few months back and observed weird twitching of the gimbal. He then provided STorM32 logs which showed this massive jitter. To my recollection he was the first to try the plane code.

I followed up on your comment that the issue may depend on the UART I am using, since this was quick and easy to test by just replugging the connector … and I can’t believe it … did we hit the one unlucky point?

I changed the serial connected to the STorM32 from SERIAL2 (which is what I was using all the time in this test setup) to SERIAL1, to SERIAL3, to SERIAL4, and back to SERIAL2.

I see the jitter issues ONLY when I am using SERIAL2 !!!

For SERIAL1, SERIAL3, SERIAL4 it looks perfectly fine !!! (there is only very tiny jitter every once in a while) The spike pattern in the STorM32 logs looks perfectly nice.

On the copter setups where I don’t see issues, I use SERIAL4 in the test bench setup and SERIAL5 on the flying thing.

There would now be many different settings one could test, e.g. whether setting unused serials to -1 makes a difference, or whether it matters that some serials are related to GPS, and so on … I obviously should also do tests with copter on different serials … but I feel I’d first like to hear your reaction, as otherwise it would be somewhat like wading in muddy waters.

Could you please explain why you were expecting that this might depend on the UART? (DMA?)

many thanks, this might have been a most insightful little comment


here are the screenshots of the SERIAL settings

btw, concerning AUTOPILOT_STATE_FOR_GIMBAL_DEVICE, it is one of the messages which was recently merged to ardupilot/mavlink (https://github.com/ArduPilot/mavlink/commit/9d28a9f784132b50614f08297d6b1094855067e7), so I guess you just have to update your MAVProxy to get it.

DMA contention is definitely a possibility. You can try disabling DMA on that UART using SERIALn_OPTIONS
Still, even if it is DMA we want to find the cause. Such a long delay isn’t expected. Maybe we’re losing a DMA completion interrupt. So I’d still be interested in trying to reproduce.

33 ms is a reasonable max: we’re sending at 25 Hz, so a 40 ms period, and 7 ms is reasonable for the packet length at 57600, so 33 ms as max idle time is in the right ballpark.
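Spelling that arithmetic out (assuming ~10 bits per byte on the wire and a frame in the 40-byte range for the test message):

40 bytes × 10 bit/byte / 57600 bit/s ≈ 7 ms of transmit time per message, and 40 ms period − 7 ms ≈ 33 ms maximum expected idle time between messages.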

Apart from DMA, the other likely reason for SERIAL2 having a problem is a false detection of RTSCTS. If your CTS pin is floating and you have BRD_SER2_RTSCTS=2 (i.e. auto detection) then it could be stalling due to flow control.
Set BRD_SER2_RTSCTS=0 to eliminate this possibility

many thx for the explanations; I wanted to ensure that I understand correctly what you are saying

apart from DMA, the other likely reason for SERIAL2 having a problem is a false detection of RTSCTS

tested; it produces exactly the same issue, so this is not the cause

(btw, this board has RTSCTS also on SERIAL1, which was also enabled, but SERIAL1 was fine - not that it proves anything, just wanted to mention it)

(I would have felt terribly stupid if it had been this LOL)

You can try disabling DMA on that UART using SERIALn_OPTIONS

tested, and HERE WE GO!!!

I kept BRD_SER2_RTSCTS = 2, but set SERIAL2_OPTIONS = 768, i.e. enabled Rx_NoDMA and Tx_NoDMA => NO ISSUE !!!

I then disabled Rx_NoDMA again, i.e., SERIAL2_OPTIONS = 512 (Tx_NoDMA only) => NO ISSUE !!!

I then enabled Rx_NoDMA and disabled Tx_NoDMA, i.e., SERIAL2_OPTIONS = 256 (Rx_NoDMA only) => JITTER JITTER JITTER

so, it’s the Tx DMA which is causing the issue in my setup
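For reference, here is how those option values decompose, as I read the SERIALn_OPTIONS bitmask (4.1-era bit assignments; worth double-checking against the parameter docs in your tree):

```cpp
// SERIALn_OPTIONS bits relevant here, as I read the bitmask documentation:
#include <cstdint>

constexpr uint16_t OPT_RX_NODMA = 1U << 8;   // 256: disable DMA on RX
constexpr uint16_t OPT_TX_NODMA = 1U << 9;   // 512: disable DMA on TX

// 768 = RX_NoDMA | TX_NoDMA  -> no issue (no DMA at all on SERIAL2)
// 512 = TX_NoDMA only        -> no issue (TX DMA disabled)
// 256 = RX_NoDMA only        -> jitter   (TX DMA still active)
static_assert((OPT_RX_NODMA | OPT_TX_NODMA) == 768, "768 is both NoDMA options combined");
```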

(I feel a bit stupid that I didn’t consider this myself, but on the other hand I probably would not have bothered reporting it if I had found it myself - just trying to find an excuse for having been stupid LOL)

Maybe we’re losing a DMA completion interrupt

I would think that the extremely regular timing pattern should give an important clue regarding the cause. But you are infinitely more experienced in tracking down DMA issues.

Still, even if it is DMA we want to find the cause. Such a long delay isn’t expected. Maybe we’re losing a DMA completion interrupt. So I’d still be interested in trying to reproduce.

I understand that, and also that it is most important to create a test bed which allows you guys to reproduce it. I will not bow out without having helped to create it :slight_smile:

The potentially sad thing is that it might be very board specific.

CubeBlack and Pixhawk1 should be sufficiently similar, however (if I recollect correctly they are exactly identical pin-wise, aren’t they?). Which makes me mention that I am using configure --board fmuv3 to compile…

totally unrelated but I have to drop these comments:

  1. I always found it somewhat unfortunate that the RTSCTS parameters and the SERIAL parameters - or, generally, potentially important and related parameters - live in very different places. It’s no excuse for forgetting to consider RTSCTS, but this incident makes me wonder if there could be some path to deprecate the BRD_RTSCTS params and substitute them with something like SERIALx_RTSCTS. E.g. the new SERIALx_RTSCTS could be set to -1 to indicate that the old BRD_RTSCTS should be used, or vice versa, or something along those lines.

  2. I am generally somewhat unhappy with how ArduPilot sometimes handles inputs. I strongly think that an input pin should always be configured such that issues are least likely, within what’s possible with the MCU hardware (this goes back to experiences on day one when I started using STM32 chips) (and also a Pixracer setup I once had).
    Here, e.g., the pullup should IMHO be enabled on the Rx and CTS pins.
    Your comment that it could be RTSCTS reminded me of this, and I also must admit that I was surprised to see that in SERIALx_OPTIONS the Rx pullup is NOT enabled when the serial is enabled. I will frankly say that this is not what I would ever do or consider best practice. Just one wire needs to break and storms can happen.
    I understand that for Rx the user can obviously configure this, but what I mean is by default, so the user has less chance to screw up (and for other pins it’s not possible even if the user thought of it).

just wanted to have said it to make myself feel better; please don’t hesitate to ignore this post :slight_smile:

rcgroups user shuricus, who is the original discoverer of the jitter issue, just reported that he uses SERIAL4 (= uart4) on an Omnibus F4V3 AIO (not PRO) board

so, here is more food for thought; it seems to be clear now

I did the testing with these settings for the SERIALs:

SERIAL2_PROTOCOL = 2 always, SERIAL2_OPTIONS = 0
both RTSCTS = 2
SERIALx_OPTIONS = 0 unless otherwise stated

SERIAL1_PROTOCOL = 2
SERIAL3_PROTOCOL = 5
SERIAL4_PROTOCOL = 5
SERIAL5_PROTOCOL = -1
=> ISSUE (that’s the original case)

SERIAL1_PROTOCOL = -1
SERIAL3_PROTOCOL = -1
SERIAL4_PROTOCOL = -1
SERIAL5_PROTOCOL = -1
=> no issue (that’s the case in the above which indicated a DMA issue)

SERIAL1_PROTOCOL = 2
SERIAL3_PROTOCOL = -1
SERIAL4_PROTOCOL = -1
SERIAL5_PROTOCOL = -1
=> no issue

SERIAL1_PROTOCOL = -1
SERIAL3_PROTOCOL = 5
SERIAL4_PROTOCOL = -1
SERIAL5_PROTOCOL = -1
=> ISSUE

SERIAL1_PROTOCOL = -1
SERIAL3_PROTOCOL = -1
SERIAL4_PROTOCOL = 5
SERIAL5_PROTOCOL = -1
=> no issue

SERIAL1_PROTOCOL = -1
SERIAL3_PROTOCOL = 5 with RxPullup (SERIAL3_OPTIONS = 32)
SERIAL4_PROTOCOL = -1
SERIAL5_PROTOCOL = -1
=> ISSUE

SERIAL1_PROTOCOL = -1
SERIAL3_PROTOCOL = 5 with Rx&Tx no DMA (SERIAL3_OPTIONS = 768)
SERIAL4_PROTOCOL = -1
SERIAL5_PROTOCOL = -1
=> no issue

hence, my proposal:
There is a DMA conflict between SERIAL2 and SERIAL3, and the jitter time pattern reflects what the GPS driver does in order to detect/configure a GPS.

so here comes my attempt at a usable test version: https://github.com/olliw42/BetaPilot/tree/betaplane41-apstormlinktest2-stripdown
it probably will not fully satisfy you guys, but that’s the best I could come up with for the moment (vide infra)

starting from the original, I successively stripped out pieces until I started to see differences in behavior. So, there are now three cases which can be selected via two defines:

original: sends AUTOPILOT_STATE_FOR_GIMBAL_DEVICE with the message fields suitably populated.

This produces the jitter behavior as reported so far.

option 1: sends AUTOPILOT_STATE_FOR_GIMBAL_DEVICE, but with the fields set to constants, i.e. the calculations to fill the fields are not done.

This also produces marked jitter with the same periodicity as before, but the pattern appears to look somewhat different. So, there seem to be more interactions going on than just with e.g. the GPS. That’s a totally new playing field, however, so I left it in, but I feel I shouldn’t spend much time on investigating the exact details.

option 2: sends WINCH_STATUS with fixed fields

This also produces marked jitter with the same periodicity as before, but the pattern yet again appears to look somewhat different. Again, I feel I shouldn’t spend much time on investigating the exact details.
Btw, I chose WINCH_STATUS since it happened that I didn’t have WATER_DEPTH in my py toolbox, and thought this should be fine for all of us.

This test version has the drawback that it doesn’t send until the serial port has seen at least one incoming heartbeat. That’s obviously due to the router, but I didn’t know how to get around it (one probably just needs to make the router object public, not private, and fake it, but…). So, I’m not sure that’s really usable for you guys. But I think it should be very close, and I feel that’s about the best I can possibly contribute.

I’d really like to thank you for your comments so far, which steered me towards what I in hindsight think I should have considered myself. I spent so much time convincing myself that it is not just my code and exploring what else in the code it could be, but I totally forgot about looking at these options provided through parameters. So one could even legitimately call it user error…

@olliw42 I think this PR will fix the issue for you:

yes, moving to SERIALn_RTSCTS does make sense. It was difficult when those parameters were first added due to a structural problem with how the existence of RTSCTS was known per board, but that has been fixed.

well, “issues least likely” is hard to define.

we do by default enable the pullup on both the RX and CTS pins, which is one of the reasons the auto-RTSCTS detection actually works well. I suggested it as a possible cause not because I thought it likely, but just to eliminate it.
See this code: