Frsky passthrough telemetry WFQ packet scheduler proposal

yaapu · July 2, 2019, 2:25pm

Hi,
I would like to propose a new packet scheduler for the frsky passthrough telemetry.
I will start describing the current scheduler and why proposing a new one.
I will briefly analyze the packet scheduler of the mavlinkToPassthrough project.

Since passthrough telemetry is decoded with LUA scrips it’s important to note that the screen on OpenTX radios is updated at max 20Hz, we’ll see below why this is important.

There are 2 versions of the current scheduler, the original one and the revised one.
All stable releases run the original scheduler, master as of now implements the “revised” version.
The revised one includes a patch I’ve submitted to prevent messages from stalling all telemetry except attitude.

Quick and simple intro on how frsky s.port telemetry works.

The receiver sends on the bus a polling signal every 12ms asking for data from a specific sensor ID.
The receiver will poll the IDs in sequence to find which one is present. If only one physical ID is found, the receiver will alternate the sensor polling and the search (present sensor, next ID to search, present sensor, next ID and so on).
If more sensors are found, the poll sequence returns almost to a normal search pattern.
There are a total of 28 sensors.

Ardupilot responds as sensor 0x1B.
If ardupilot is the only “sensor” on the bus (no other frsky physical sensors) than it will get a chance to send telemetry every 24ms i.e. every 2 polling (slots)
If there are other frsky sensors on the bus, let’s say 2 more sensors, than ardupilot will have a chance to send telemetry (free slot) every 48ms.
A polling sequence would be:

0x1B,S1,S2,AA,0x1B,S1,S2,BB,0x1B,S1,S2,CC

0x1B is polled every 4 slots i.e.48ms.

If ardupilot could use more sensors by responding to polling for 0x1B and let’s say 0x0D available slots would increase and bandwidth as well

please note that:

as the number of external frsky sensors increase, the frequency of free slots and bandwidth decrease while the period between free slots increases.
by responding to polls for sensors 0x1B and 0x0D bandwidth increases

The passthrough telemetry sends 9 packet types, for each packet except for GPS it uses 1 slot.

the packet types are:

0x800 GPS coordinates (1 packet for Lat and 1 packet for Lon)
0x5000 statustext messages (1 packet = 1 chunk, messages are split in chunks of 4 bytes)
0x5001 AP status
0x5002 gps status
0x5003 battery 1
0x5004 home status
0x5005 vel and yaw
0x5006 attitude and range
0x5007 parameters
0x5008 battery 2

Packet scheduler in current stable releases

This is a simplified explanation of the logic behind the current scheduler:
50% of all available bandwidth is used for attitude packets
of the remaining bandwidth:
if there are messages send them otherwise share what’s left between all other telemetry packets.
if there’s nothing to send send attitude.
Note: messages are sent 3 times each

Packet troughputs with no external frsky sensors,1 battery, no pending messages to send:

0x800 1 packet/s
0x5000, 0 packets/s
0x5001, 1.5 packets/s
0x5002, 1 packet/s
0x5003, 1 packet/s
0x5004, 2 packets/s
0x5005, 2 packets/s
0x5006, 29.5 packets/s
0x5007, 1 packet/s

Packet troughputs with no external frsky sensors,1 battery, with pending messages to send:

0x800 1 packet/s
0x5000, 20 packets/s
0x5001, 0 packets/s
0x5002, 0 packet/s
0x5003, 0 packet/s
0x5004, 0 packets/s
0x5005, 0 packets/s
0x5006, 20.5 packets/s
0x5007, 1 packet/s

Notes:

the scheduler saturates all available bandwidth
if there are pending messages no other packets but parameters get a chance to be sent (stall)
bandwidth used for attitude is very high and all packets in excess of 20Hz are wasted, so with no pending messages 25% of the bandwith is wasted.
GPS requires 2 packets and there’s no guarantee they will be sent in a tight sequence, this leads to gps coordinates being sapled at different intervals
no built in fairness, packets with high frequencies can stall packets with lower frequencies

Packet scheduler in current master

This version of the scheduler has a patch to correct the biggest issue in the stable scheduler: telemetry stall while sending messages.

This is a simplified explanation of the logic behind the scheduler in master:
50% of all available bandwidth is used for messages only if there are pending messages.
of the remaining bandwidth:
50% is used for attitude packets
share what’s left between all other telemetry packets.
if there’s nothing to send send attitude.
Note: messages are sent 3 times each

Packet troughputs with no external frsky sensors,1 battery, no pending messages to send:

0x800 1 packet/s
0x5000, 0 packets/s
0x5001, 1.5 packets/s
0x5002, 1 packet/s
0x5003, 1 packet/s
0x5004, 2 packets/s
0x5005, 2 packets/s
0x5006, 29.5 packets/s
0x5007, 1 packet/s

Packet troughputs with no external frsky sensors,1 battery, with pending messages to send:

0x800 1 packet/s
0x5000, 20.5 packets/s
0x5001, 2 packets/s
0x5002, 1 packet/s
0x5003, 1 packet/s
0x5004, 1.5 packets/s
0x5005, 1.5 packets/s
0x5006, 11 packets/s
0x5007, 1 packet/s

Notes:

the scheduler saturates all available bandwidth
pending messages no longer stall other telemetry types
bandwidth used for attitude is very high and all packets in excess of 20Hz are wasted, so with no pending messages 25% of the bandwith is still wasted.
with pending messages bandwidth for attitude is around 10Hz
GPS requires 2 packets and there’s no guarantee they will be sent in a tight sequence, this leads to gps coordinates being sapled at different intervals
no built in fairness, packets with high frequencies can stall packets with lower frequencies

The teensy round robin scheduler

Thanks to the MavlinkToPassthroug project I had access to another well tested packet scheduler for passthrough telemetry.
To be able to compare Eric’s scheduler with the revised ardupilot one I implemented it in the current master branch.
This scheduler is round robin, each packet type is always given a chance to be sent in the same order. You can have slots that are not used but when used relative packet ordering is always guaranteed.
I slightly modified it to maximize message and attitude bandwidth by letting messages donate their slots to attitude when the message queue is empty.

Packet troughputs with no external frsky sensors,1 battery, no pending messages to send:

0x800 3 packets/s
0x5000 0 packets/s
0x5001 2 packets/s
0x5002 2 packets/s
0x5003 1 packet/s
0x5004 2 packet/s
0x5005 3 packet/s
0x5006 15 packet/s
0x5007 0.5 packet/s

Packet troughputs with no external frsky sensors,1 battery, with pending messages to send:

0x800 3 packets/s
0x5000 8 packets/s
0x5001 2 packets/s
0x5002 2 packets/s
0x5003 1 packet/s
0x5004 2 packet/s
0x5005 3 packet/s
0x5006 8 packet/s
0x5007 0.5 packet/s

Notes:

does not saturate bandwidth (uses roughly 66%)
very fair scheduler
messages and attitude do not interfere with other packet rates
inherent limit on max bandwidth for the round robin nature imposes to cycle all packet types before returning to the same one (I could not force it to have more than 15Hz for attitude for instance)
much better GPS coordinates sampling
simple and predictable

A note about messages

The current scheduler sends each message 3 times, in the idea of the original author sending each message 3 times would ensure transmission in case of bad link or low rssi.
I guess that this requirement imposed to dedicate a huge bandwidth to messages and model the original scheduler the way it is as a consequence.
But if we have a bad link and are experiencing packet loss why saturate what’s left of it with messages?
Wouldn’t this only make it worst?
With this in mind I propose to relax this constraint and send messages twice as an acceptable compromise.

A new scheduler proposal

What to expect from a good scheduler?

20 packets/s for attitude but not any higher
good data rate for messages without stalling other packet types
3 packets/s for GPS and vspeed for smooth vario performance
2 packets/s for AP,Home and GPS status
message chunks sent 2 times
user selectable rates or adaptive rates

My proposed scheduler tracks how long each packet has been waiting to be sent after the scheduled time and always sends the packet that waited the most.
If two packets are equally late send the one with the lower frequency (smaller chances to be sent).
Message slots can be donated to attitude
Battery 2 slots can be donated to battery 1

I implemented in ardupilot and this is what I got

Packet troughputs with no external frsky sensors,1 battery, no pending messages to send:

0x800 (GPS) 3 packets/s
0x5000, 0 packets/s
0x5001, 2 packets/s
0x5002, 2 packet/s
0x5003, 1 packet/s
0x5004, 2 packets/s
0x5005, 3 packets/s
0x5006, 20 packets/s
0x5007, 0.5 packet/s

Packet troughputs with no external frsky sensors,1 battery, with pending messages to send:

0x800 (GPS) 3 packets/s
0x5000, 10 packets/s
0x5001, 2 packets/s
0x5002, 2 packet/s
0x5003, 1 packet/s
0x5004, 2 packets/s
0x5005, 3 packets/s
0x5006, 10 packets/s
0x5007, 0.5 packet/s

I bench and flight tested it
In order to compare the results I added a parameter to ardupilot to switch from one scheduler to the others at run time.
The numbers above have been obtained by using a lua script that measured the actual packets received by a modified version of my telemetry script.

feedback is welcome.

Samuel_Tabor · July 2, 2019, 4:50pm

This sounds like a big improvement. My link is constantly saturated with messages especially in AUTO mode.

yaapu · July 2, 2019, 5:36pm

Yes, that’s the original reason I proposed a patch for the current scheduler: during autotune all telemetry would stall and no battery info until the Autotune:Success message

yaapu · July 4, 2019, 7:24am

Triggered by the issue reported by Colin (@yak-54) I did some investigation on how the schedulers perform in low bandwidth.
I simulated a setup where 1 to 4 extra frsky sensors are connected to the s.port bus and measured the scheduler performance.

This table summarizes the results

notes

S1 is the default scheduler
S2 is my new experimental scheduler
columns marked with an M at the top refer to bandwidth while sending messages
cells in red show packets that the scheduler is dropping (packets that are not sent)

So this pretty much confirms what Colin is experiencing, with 4 extra frsky sensors the S1 scheduler drops a lot of packets, in particular it’s dropping 0x800 GPS frames.

The S2 scheduler never drops a packet and scales linearly but with such a low bandwidth it probably needs a non linear behaviour.

yak-54 · July 4, 2019, 7:55am

Hi Alex
I will test soon i just walked in the door from a full day of flying at the field ( the fun of being retired )

yaapu · July 4, 2019, 7:56am

I could hate you for this

yak-54 · July 4, 2019, 11:11am

well the gps is alive but the currret is not
3-flvs-gas-suite-passthrough-new-experimental-scheduler.zip (31.6 KB)

yaapu · July 5, 2019, 8:33am

Happy to report that the new scheduler solved Colin’s (@yak-54) issue and all ardupilot telemetry is being received by the yaapu widget on the X12s.

The setup on his plane uses 3 FLVSS + GAS Suite so the available bandwidth is limited, you get a max trasmission rate of only 14Hz

which in turn yield a probably too low attitude (0x5006) and messages (0x5000) rates

Time to tune the scheduler for low bandwidth scenarios

Eric_Stockenstrom · July 5, 2019, 9:38am

Great work Alex! I’m tracking your research and will implement in the Teensy.

Eric_Stockenstrom · July 5, 2019, 9:47am

Hi Eric,
sure, here you get all 3 implementations, my proposal is the “fair scheduler”, I’ve been flying it today on Copter 3.6.9 and I like it.
I’ll share a fork for anybody interested in trying it.

Yes please

yaapu · July 5, 2019, 9:50am

Hi Eric,
the code is here, it’s based on plane 3.9.8

yaapu · July 10, 2019, 9:07am

I changed a bit the implementation of my new scheduler into a proper WFQ scheduler, now all bandwidth is used and packets rates are based on their weight rather than their scheduled delay.
This has the nice “side effect” that by dynamically changing weights I can tune relative packet bandwidth.

It’s very handy to get rid of pending messages fast without stalling all other messages, I simply increase packet 0x5000 priortity and all other packets rates adapt proportionally to their weight to this new condition in real time.

Schedulers based on delay cannot adapt in the same way because to guarantee fairness slower packets always limit faster packets, if for instance we dinamically lower the 0x5000 scheduled delay we eventually hit a point where all other packets steal bandwidth from us.

This is the reason a switched in my scheduler to a “normalized delay” approach that led to a WFQ style algorithm.

lets check the numbers.
This tables show how the WFQ scheduler performs

With no pending messages the scheduler uses a weight of 200 for message packets reserving 7.9Hz for messages.
Since there are no pending messages this bandwidth is donated to attitude packest 0x5006 yielding a rate of 20Hz for attitude

When the message queue is filled the scheduler dinamically increases message packet priority by lowering the weight to 50, this yields a data rate for messages of 20Hz which quickly empties the message queue.

All this works even for slower links since the scheduler uses relative weights.

I added one last feature to guarantee that GPS lat and lon coordinates packets are sampled “almost” at the same time with a max distance in time equal to the frsky polling period.

Edit
The biggest drawback of a WFQ scheduler is that it always uses all available bandwidth, it’s not a shaper

yaapu · July 10, 2019, 10:51am

What I’ve done so far pretty much fixes the transport of the passthrough telemetry, i.e. I defined a fair and predictable scheduler on top of which packets with different priorities are guaranteed to be delivered.

Eric Stockenstrom (@Eric_Stockenstrom) gave me a very simple but effective suggestion to lighten the load on the telemetry link: do not send what has not changed.
This heuristic will probably leave some empty slots on the s.port bus and this alone can potentially improve the link quality radio wise.

I’ll do some testing and report back.

Edit
Skipping duplicate packets might be a bad idea.
If for any reason we miss the changed packet, for instance because of a bad link or link loss, there’s no way to know if the data we have is fresh or not, even if we could detect a link loss there’s no way to request fresh packets for in it’s current implementation the passthrough telemetry is one way.
A possible workaround is to limit the maximun silence time to guarantee a very low packet rate for each packet type.

yaapu · July 10, 2019, 4:35pm

A note about wfq and queues
Right now my scheduler never checks if a packet has changed and always sends a fresh sample even if it’s a duplicate of the previous one.
It’s just like serving queues of a fixed size of 1 element and queue size is not taken into account when searching for the next packet to send.
By checking packets for changed values one would implement a pool of queues of max 1 element served by a weighted fair scheduler, empty queues would have a lower priority than filled queues.

yaapu · July 11, 2019, 10:30am

I updated the code to reflect my findings and I guess I’m close to something I can call version 1 of the scheduler:

it tracks the rx polling period to estimate the number of external frsky sensors
it adapts to available bandwidth
bandwidth is shared based on weights rather than on predefined rate (WFQ algorithm)
messages and battery 2 packets use bandwidth only if there’s something to send (queues can have a size of 0 or 1), when there’s nothing to send bandwidth is shared proportionally among all other packet types
all other queues have a fixed size of 1 element
the number of times a message is duplicated to ensure transmission in case of link corruption varies based on polling period from 3 to 1
GPS coordinates are sampled with a max time difference equal to the polling period

The weights I’m using are summarized in the tables below

No pending messages, single battery configuration

Pending messages, single battery configuration

By lowering the number of times a message chunk is duplicated we can keep a pretty much constant rate at sending messages.

The code is not polished yet but here’s a peek at it

/*
 * Alternative scheduler
 * for FrSky SPort Passthrough (OpenTX) protocol (X-receivers)
 */
void AP_Frsky_Telem::passthrough_wfq_adaptive_scheduler(uint8_t prev_byte)
{
    uint32_t now = AP_HAL::millis();
    uint8_t max_delay_idx = TIME_SLOT_MAX;
    float max_delay = 0;
    float delay = 0;
    bool packet_ready = false;

    // build message queue for sensor_status_flags
    check_sensor_status_flags();
    // build message queue for ekf_status
    check_ekf_status();

    // search the packet with the longest delay after the scheduled time
    for (int i=0;i<TIME_SLOT_MAX;i++) {
        //normalize packet delay relative to packet period
        delay = (now - _passthrough.packet_timer[i])/static_cast<float>(_passthrough.packet_period[i]);
        // use >= so with equal delays we choose the packet with lowest priority
        // this is ensured by the packets being sorted by desc frequency 
        if (delay >= max_delay) {
            // check queue depth for selected packet types only
            switch (i) {
                case 0:
                    packet_ready = !_statustext_queue.empty();
                    break;
                case 5:
                    packet_ready = gcs().vehicle_initialised();
                    break;
                case 8:
                    packet_ready = AP::battery().num_instances() > 1;
                    break;
                default:
                   packet_ready = true;
                   break;
            }
            if (packet_ready) {
               max_delay = delay;
               max_delay_idx = i;
            }
       }
    }
    _passthrough.packet_timer[max_delay_idx] = AP_HAL::millis();
    // send packet
    switch (max_delay_idx) {
        case TIME_SLOT_MAX: // nothing to send
            break;
        case 0: // 0x5000 status text
            if (get_next_msg_chunk()) {
                send_uint32(DIY_FIRST_ID, _msg_chunk.chunk);
            }
            break;
        case 1: // 0x5006 Attitude and range
            send_uint32(DIY_FIRST_ID+6, calc_attiandrng());
            break;
        case 2: // 0x800 GPS lat
            send_uint32(GPS_LONG_LATI_FIRST_ID, calc_gps_latlng(&_passthrough.send_latitude)); // gps latitude or longitude
            // force the scheduler to select GPS lon as packet that's been waiting the most
            // this guarantees that gps coords are sampled at max _passthrough.avg_polling_period time separation
            _passthrough.packet_timer[3] = _passthrough.packet_timer[2] - 10000;
            break;
        case 3: // 0x800 GPS lon
            send_uint32(GPS_LONG_LATI_FIRST_ID, calc_gps_latlng(&_passthrough.send_latitude)); // gps latitude or longitude
            break;
        case 4: // 0x5005 Vel and Yaw
            send_uint32(DIY_FIRST_ID+5, calc_velandyaw());
            break;
        case 5: // 0x5001 AP status
            send_uint32(DIY_FIRST_ID+1, calc_ap_status());
            break;
        case 6: // 0x5002 GPS Status
            send_uint32(DIY_FIRST_ID+2, calc_gps_status());
            break;
        case 7: // 0x5004 Home
            send_uint32(DIY_FIRST_ID+4, calc_home());
            break;
        case 8: // 0x5008 Battery 2 status
            send_uint32(DIY_FIRST_ID+8, calc_batt(1));
            break;
        case 9: // 0x5003 Battery 1 status
            send_uint32(DIY_FIRST_ID+3, calc_batt(0));
            break;
        case 10: // 0x5007 parameters
            send_uint32(DIY_FIRST_ID+7, calc_param());
            break;
    }
}

Eric_Stockenstrom · July 12, 2019, 3:35pm

Just to comment on the possibility of losing packets. It’s a very real problem, brought home to me when I was testing status text messages again yesterday. Remember we were discussing whether they should be sent 3 times?

Well, in the end I resorted to sending them 3 times to get them through without errors in the text.

I think the fact that I had WiFi enabled, and the Horus was quite near, and 5 computers running nearby might have contributed

yaapu · July 12, 2019, 4:29pm

I understand, but what would an acceptable tradeoff be when bandwidth is “very” limited?
What % of the channel would you consider enough and not too much to be dedicated to text messages?

Eric_Stockenstrom · July 13, 2019, 5:29am

What worked for me is pushing them out 2 seconds, and then burdening each successive chunk a further 20 mS. This allowed immediatetraffic through, and then fed in the chunks interleaved.

But, to answer your question, I would take a stab at 15% and observe the results, especially at Pixhawk boot. Unfortunately, like in the quantum world, the results change when you observe them.

Eric_Stockenstrom · July 13, 2019, 5:33am

I bought a jtag debugger for platformio but it’s taking an age to ship.

yaapu · July 13, 2019, 8:45am

Hi Eric,
I studied your recent code changes and agree the solution you came up with better fits the particular use case of matching provider and consumer rates.
Your new scheduler is pure FIFO because the rates are inherited from the mavlink stream, you had to introduce heuristics to force messages and parameters to be processed in a tight loop.
But I’m afraid the same reasoning cannot be applied to the ardupilot scheduler where there’s no master rate, the scheduler is both the provider and the consumer.
In my testing when the scheduler samples the FC status to check for new data about 90% of the times it finds it has changed especially while flying.