Yonah Ops Ground Control (OGC) - Part 0

TianChang_Yonah · September 12, 2020, 3:12am

Yonah Ops Ground Control (OGC): Part 1

In our previous post, we provided an overview of our Ops Ground Control architecture. Today, we go through the workings of the triple-redundant link system in greater detail.

Triple-redundant link system

As mentioned previously, one of the requirements of Yonah’s OGC was to provide reliable, long-range communication that can operate in rural conditions. Could it survive without Internet? Or worse still, when cellular coverage is intermittent? Our design had to accommodate such scenarios.

We settled on the following three links:

Telegram: Handles communication over cellular data, using the Telegram API. This is the primary, low-latency link of the system when there is internet access.

SMS: Handles communication over SMS. This serves as a medium-latency backup when internet access is down.

SBD: Handles communication via Iridium Short-Burst-Data (SBD) Satellite Telemetry. This is a high-latency backup when both internet and cellular connection are down.

Our triple-redundant link system highlighted in green

Telegram

The ideal communication link would be low latency and low cost. We decided to use Telegram as our backend for this link as it was free and open source. Considering the popularity of Telegram as a platform, we could also count on its reliability. While playing around with the API, we found that Telegram could also be a lot more flexible than regular SMS: it had no character limit, could send files, and could send a high volume of messages at minimal cost (we apologize to the people we spammed while testing this link).

The full details of the Telegram link are documented in our Wiki; today we share more insights into the development journey.

The start of the journey

The first issue we had to tackle was deciding how to identify our various accounts. Telegram internally uses its own user_id or chat_id. These were not easy to find without diving into the code, so we decided to use phone numbers to identify the devices. This meant we could use the same identifying information (the phone number) for both the SMS and Telegram links.

Unfortunately, just using phone numbers was not that easy; we needed the aforementioned user_id to actually send a message to another account. The first challenge was getting the user_id. In the early days of this work, the link initialization was convoluted. There were no guarantees for the link to start up properly because it depended on too many factors. Here is a short description of how the link started up:

1. Link starts
2. Get the 10 most recent chats
3. Get details about each chat
4. If chat is a private conversation, check the phone number of the account
5. If phone number matches, get the user_id and finish initialization
6. Otherwise go to step 2

There were obviously many issues with the way this worked. If the phone number we wanted to talk to was not within the first 10 chats, the link would not start. If there were no prior conversations between the two numbers, the link would not start. If the phone number was hidden because of the other end’s privacy settings, the link would not start.
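To make these failure modes concrete, here is a minimal sketch of the first-version lookup. It does not call Telegram at all: the chat list is modelled as plain dictionaries, and the field names (`type`, `phone`, `user_id`) and the function name `find_user_id` are assumptions made for illustration, not the actual TDLib schema or our original code.

```python
def find_user_id(recent_chats, phone, limit=10):
    """Return the user_id of the private chat whose phone number matches.

    recent_chats is a hypothetical list of dicts, newest first. Returns None
    when no match is found, which is the case where the link failed to start.
    """
    for chat in recent_chats[:limit]:      # only the most recent chats are seen
        if chat.get("type") != "private":  # skip groups and channels
            continue
        if chat.get("phone") == phone:     # a hidden number is None, so no match
            return chat["user_id"]
    return None


# The same target number, under three different conditions.
target = "+6500000000"
visible = [{"type": "private", "phone": target, "user_id": 42}]
hidden = [{"type": "private", "phone": None, "user_id": 42}]
buried = [{"type": "group"}] * 10 + visible  # target is the 11th chat

print(find_user_id(visible, target))
print(find_user_id(hidden, target))
print(find_user_id(buried, target))
```

Only the first call finds the account. The hidden-number and 11th-chat cases both return None, and an account with no prior conversation would simply never appear in the list.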

How can we improve the terrible first version

The first step to improve this behaviour was to search the account’s contacts instead of the chats. Searching the contacts meant we did not need to care about the receiving end’s privacy settings (as long as the phone number is stored in the contacts). This still required the phone number to be added to the contacts in the first place, so the next step was to add numbers to the contacts if they were not already there.

These changes fixed the majority of the problems with the first Telegram link, but it still needed to query Telegram every time it initialized. There was no way around this until we started working on the Identifiers package, which gave us a permanent method of storing and accessing link-related data. Storing the user_id in the identifiers file meant the link no longer had to query Telegram every time it started, only when it needed to. We will talk more about identifiers in Part 3 of this series.

The end of the journey

In the current setup, the Identifiers node checks which phone numbers do not have an associated user_id available and requests the missing user_ids from Telegram. As a result, the link is now guaranteed to start, and it does so much faster than before.
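The idea of querying Telegram only for numbers that are not yet stored can be sketched as follows. This is an illustration under stated assumptions, not the Identifiers package itself: the JSON file layout, the function names, and the `lookup` callback (which stands in for the real Telegram contact query) are all hypothetical.

```python
import json
from pathlib import Path


def load_identifiers(path):
    """Read the phone -> user_id mapping, or start empty if no file exists."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def resolve_user_ids(phones, path, lookup):
    """Return a phone -> user_id mapping, querying only uncached numbers.

    lookup(phone) is a stand-in for the (slow) Telegram query; it is called
    once per number that is missing from the identifiers file.
    """
    ids = load_identifiers(path)
    missing = [phone for phone in phones if phone not in ids]
    for phone in missing:
        ids[phone] = lookup(phone)
    if missing:  # rewrite the file only when something new was learned
        Path(path).write_text(json.dumps(ids, indent=2))
    return ids
```

On a second start with the same numbers, `lookup` is never called, which is the behaviour that made the link start noticeably faster.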

There were two instances during this journey where we were pleasantly surprised by how fast the link worked. The first was our first test for the link where the messages were received almost instantaneously and the second was how fast the link started after implementing all the above changes. This was also the only time we removed so much code without feeling like we were removing features.

Despite all of Telegram’s benefits, there was one major disadvantage that we had to account for: Telegram requires an Internet connection to work. Additionally, although messages were sent almost instantaneously in our tests, the actual conditions in a rural environment could be expected to be much worse than our testing conditions (think of regions that still rely on 2G!). Thus, we still needed other links as a backup form of communication.

SMS

SMS is the second line of defense in our triple-redundant link system. Short Message Service is globally available, although it is losing out in popularity to Internet-based messaging services such as Telegram, WhatsApp and the like. Despite this, it is still reliable and has a low enough latency to serve as our backup in the event that the Telegram link fails. The documentation for the SMS link can be found here.

We selected the Teltonika RUT955 cellular router to manage this link, due to its in-built SMS functions (RuTOS) that could be accessed through a command line interface instead of a web UI. This was crucial in allowing us to manipulate SMS commands programmatically.

Slow, but secure

The very first version of this link used the Python ‘subprocess’ library to execute command line actions from within the code. This allowed us to securely enter the router’s OS via SSH and send messages. However, two problems surfaced with this iteration.

The router’s SMS management system appended new messages to the end of the list instead of pushing them to the top. Since the router can only store 30 messages, message numbers could change once the 31st message was received and the first message was deleted, shifting all the remaining messages by one number.

We found this to be inconvenient, as we would always need to check for the largest message number that held a message. Instead, we decided to always work on an empty message list. Since messages are time sensitive (we don’t want to receive outdated commands and still execute them), we delete messages once they are read, so that the first message will always be the newest message.

Subprocess closed each SSH connection once the line of code was executed. This made initialisation of the node slow, as we had to iteratively delete all messages in storage after the aircraft had started up in order to start on a clean slate. Each delete command required an SSH connection to be established with the router before the message could be deleted, taking roughly 1 second each. Hence, 30 seconds were spent during startup waiting for all messages to be deleted, before we were able to properly receive messages.

Fast and secure

After experiencing the pain of the 30-second startup time in the field (live testing), we made the switch to Paramiko, an SSH-based Python library which we also used in RFF development (RFF, or Return-From-Remote, will be talked about in Part 3 of our release). Paramiko allowed us to open an SSH session into the router and keep it alive until we finished executing all desired commands. This eliminated the time required to establish a new SSH session for every command. After “upgrading” the SMS link from subprocess to Paramiko, we observed a significant decrease in the amount of time taken for the node to start up.
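The difference comes down to connecting once and reusing the session. Below is a minimal sketch of that pattern, not our actual node. The Paramiko calls (`SSHClient`, `set_missing_host_key_policy`, `connect`, `exec_command`) are real; the function names, the 0-based slot numbering, and the router command `gsmctl -S -d <index>` are assumptions that should be checked against the Teltonika documentation for your firmware.

```python
def open_session(host, username, password):
    """Open one SSH session to the router (requires the paramiko package)."""
    import paramiko  # imported here so clear_inbox can be tried without it

    client = paramiko.SSHClient()
    # Accepting unknown host keys is convenient on a private link to a known
    # router, but it is an assumption; pin the host key if that matters to you.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, password=password)
    return client


def clear_inbox(client, slots=30, delete_cmd="gsmctl -S -d {index}"):
    """Delete every stored SMS over a single, already-open SSH session.

    client is anything with a Paramiko-style exec_command method. The delete
    command and the slot numbering are assumptions about the router CLI.
    """
    for index in range(slots):
        _stdin, stdout, _stderr = client.exec_command(delete_cmd.format(index=index))
        stdout.read()  # wait for this command to finish before the next one
    return slots
```

A caller would run `client = open_session(...)` once at startup, call `clear_inbox(client)`, and keep using the same `client` for reading and sending messages afterwards, closing it only on shutdown.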

That said, the main limitations of both SMS and Telegram are that they both require cellular coverage, and also require the router to be working in order to function properly. There is a need for a third method in the event that the router fails, or the plane flies into an area with low cellular coverage. This leads us to the final link, the SBD link.

SBD

Iridium Short-Burst-Data. The final line of defense, when all other forms of communication have broken down. The SBD link takes advantage of Iridium’s global satellite network to offer truly global communication even in regions without cellular coverage.

In our setup, we chose the Rockblock 9603 satellite modem, which was recommended by other projects such as SPL Global Telemetry and PX4.

With its small form factor, relatively low cost (compared to other satellite modems!) and ease of operation, it is not difficult to see why the 9603 is so popular among hobbyists. Most satellite modems in the market were designed for large scale, industrial usage - they were too bulky to be fitted on our drones, had minimum-order-quantities (MOQs) running into the hundreds, and their APIs were closed-source. Rock Seven’s solution was one of the few that catered to hobby and R&D purposes. See here for the comparison table that we built as we did our sourcing.

The Rockblock 9603 modem uses the Iridium Short-Burst-Data technology provided by Rock Seven Mobile Services. Interaction with the modem is done using AT commands.

The documentation for our SBD package can be found here. In short, we employed 2 different methods of communication:

1. Communication between an aircraft-based Rockblock and our web-hosting solution. Lower latency, and used when there is ground-based internet connection.
2. Communication between an aircraft-based Rockblock and a GCS-based Rockblock. Higher latency than Method 1, and used when there is no ground-based internet connection.

To accommodate these asynchronous communication methods, our SBD package is split into two nodes: an air node and a ground node.

Makersnake’s pyRockblock

While figuring out how to integrate the Rockblock into our system, we came across this excellent tutorial by Makersnake, alongside his pyRockblock module. His work proved to be of enormous help to our development, as it handled the nitty-gritty of interfacing with the Rockblock through AT commands, allowing us to focus on “higher-level” logic.

Nevertheless, we did have to modify certain parts of the module before integrating it, for example:

1. Converting the code to Python 3
2. Figuring out how to pack the data in binary format before sending it to the modem
3. Implementing a messaging service between two Rockblocks

In particular, the 2nd and 3rd points proved to be a real pain, as elaborated below.

Sending binary-compressed data between two Rockblocks

There were instructions on how to binary-pack the data. There were also instructions on how to send plaintext messages between two Rockblocks. But how about sending binary-compressed data between two Rockblocks? That was something that took us a long time to decipher.

There were simple instructions in Rock Seven’s documentation on how this should be done:

Binary Message Prerequisites

In the binary message that will be streamed, it is mandatory to include:

The RB prefix (“5242” in Hex)

The exact serial number of the unit (ex. if “1234” then “0004d2” in Hex)

The message body (ex. if “Hello” then “48656c6c6f” in Hex)

But what would have made our lives easier were concrete examples of how to implement this. Searching around on Github turned up very little; most projects simply circumvented the problem by sending data in plaintext. This wasn’t an option for us; we needed to compress our data to save money! (Rock Seven charges one credit for every 50 bytes transferred.)

Eventually, we deciphered this mystery (you can see our implementation on Github); when sending binary data between Rockblocks, the key was that the recipient’s serial number had to be packed into 3 bytes, in big endian format. Heads-up if you are implementing such a solution for your own projects!
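As a small illustration of that layout (a sketch, not a copy of our implementation; the function name `build_rb_message` is made up for this example), the prefix, the 3-byte big-endian serial number and the message body can be assembled like this, using the same values as the Rock Seven documentation quoted above:

```python
def build_rb_message(serial, payload):
    """Prefix a binary payload for Rockblock-to-Rockblock delivery.

    Layout: b"RB" + recipient serial number (3 bytes, big endian) + payload.
    """
    if not 0 <= serial < 2 ** 24:
        raise ValueError("serial number must fit in 3 bytes")
    return b"RB" + serial.to_bytes(3, "big") + payload


message = build_rb_message(1234, b"Hello")
print(message.hex())  # 52420004d248656c6c6f
```

The output splits into "5242" (RB), "0004d2" (serial number 1234) and "48656c6c6f" (Hello), matching the three fields in the prerequisites. In practice the payload would be the binary-packed telemetry rather than ASCII text.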

Balancing between latency and cost

Another thing we realized was that every Iridium mailbox check, regardless of whether any data was transferred or not, would cost at least one credit. This meant that we could not indiscriminately poll the Iridium service around the clock to check for messages (as was done in the SMS link), else there would be massive cost overruns. Rock Seven’s customer service even warned us against polling too frequently when we enquired about it!

Ring Alerts were a possibility, except that they were viable only if the Rockblock was in the last known location detected by the Iridium constellation. For a modem mounted on a drone zipping through the air at 100 km/h, this did not seem like a very safe bet…

The method we settled upon was designed to balance between cost and latency; the system periodically conducts Iridium mailbox checks to check for incoming messages, as well as to deliver any queued outgoing messages. We recommend setting the time interval to anywhere between 5 and 10 minutes.
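To get a feel for the numbers, the trade-off can be worked out with a few lines of arithmetic. This is a rough sketch based only on the figures mentioned in this post (one credit per 50 bytes, at least one credit per mailbox check); the function names are made up, and rounding a partial 50-byte block up to a full credit is an assumption, so check your own billing terms.

```python
import math


def message_credits(n_bytes, bytes_per_credit=50):
    """Credits for one transfer: one per 50 bytes, and never less than one.

    Rounding partial blocks up is an assumption about how billing works.
    """
    return max(1, math.ceil(n_bytes / bytes_per_credit))


def min_daily_check_credits(interval_minutes):
    """Lower bound on credits per day spent on periodic mailbox checks."""
    return (24 * 60) // interval_minutes


print(min_daily_check_credits(5))   # 288
print(min_daily_check_credits(10))  # 144
print(message_credits(51))          # 2
```

Checking every 5 minutes costs at least 288 credits a day and every 10 minutes at least 144, before any actual data is counted. Polling every few seconds, as one might for SMS, would run into the tens of thousands.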

Moving forward

There are two improvements which we need to make to our link system.

The first improvement is one of security. While all our links employed encryption, not all of this encryption was end-to-end. Telegram only encrypts data from the user to the server, and from the server to the recipient; any data passing within the server would be unencrypted and readable by Telegram itself. Similarly, SMS only provides encryption from the user to the cell tower. We effectively placed our trust in Telegram’s servers and the cell towers to safeguard our information…

Based on our threat model, is end-to-end encryption a must? If so, what is the best way to implement this; switch to more secure methods (e.g. Signal), or establish our own encrypted tunnel? This is something which we have yet to discover.

The next improvement is to design an intelligent link-switching algorithm to detect if a link has failed, so that the system would automatically fail over to the next available link. In the OGC’s present configuration, we can only manually configure (prior to initialization) which link the system would listen on during runtime, which is far from user-friendly. We are currently working on designing this; taking into account the costs and latencies involved, it seems that this task will be more complicated than expected…

Conclusion

Throughout the development of our link system, we were fortunate to receive much help from the following:

The TDLib API developers
Teltonika’s well documented wiki
Paramiko’s developers
Makersnake for his excellent tutorial and pyRockblock module
Rock Seven customer services for their prompt assistance when we needed it. Those guys were helpful and responsive!

This concludes our coverage of our triple-redundant link system. In the next part, we will shift our focus to the ground side of things - the RQT user interface.

Our list of contributors can be found here

Signing out,

@disgruntled-patzer
@rumesh986
@huachen24
@ZeGrandmaster

Yonah