A PR has been proposed to add “cyphal” support, which is a new CAN protocol touted as a successor to DroneCAN. I’ve opening this discussion to explain why the ArduPilot dev team are not keen on incorporating cyphal support.
The PR is here:
First off, I’d like to say that @ponomarevDA has put a lot of effort into this PR. The HITL demonstration is particularly nice. Normally we would be delighted to merge a new feature like this, but in the case of cyphal there are some issues.
The PR itself just adds ESC support, but @ponomarevDA has done support for GPS, mag, baro and rangefinder in the cyphal-hitl branch:
that branch gives a much better idea of what the future ArduPilot implementation would look like as such a complex CAN protocol that can only do ESCs would be fairly pointless. I can understand however that the initial PR would try to start simple.
Background
The effort to create what is now called cyphal has been going on for many years. Initially called “UAVCAN v1”, it got renamed to cyphal when DroneCAN split off from UAVCAN development.
The key person behind cyphal (and UAVCAN) is @Pavel_Kirienko. Pavel has written a summary of why he thinks cyphal is superior here:
If you haven’t read that then you should probably read it now.
Along with others in the ArduPilot dev team, I’ve been involved in the discussion of UAVCAN-v1 (what is now cyphal) for years. We’ve pointed out the fundamental issues with it for a long time. Much of that discussion is no longer available publicly, as Pavel moved the discussion from the public section of the uavcan forums to a private section, making all those discussions no longer available.
Luckily I kept some notes locally before I posted some of the critiques of the proposed message structure. I’ve put those (very rough!) notes up here:
http://uav.tridgell.net/cyphal/
that was from May 2021, and much of the criticism I gave then is still valid with the current cyphal design.
Key Objections to Cyphal
The most important objections to cyphal are:
- it is incompatible with DroneCAN, so you cannot mix DroneCAN and cyphal on the same CAN bus
- it fragments the ecosystem for CAN peripherals just when we finally have great traction in getting a really good set of peripherals available for users and vendors
- the design of the cyphal protocol is very poor, it is driven by a fundamentally flawed philosophy
Compatibility
When UAVCAN-v1 was first proposed Pavel decided to do it as a clean-sheet protocol. This is very tempting for protocol designers, as designing a new protocol from scratch gives you a lot of freedom and lets you ignore all the hard compatibility issues that are involved with evolving an existing widely used protocol.
Contrast what happened with cyphal and with MAVLink. From mavlink 0.9 to 1.0 to 2.0 we put the effort in to make it as seamless as possible for end users while still providing major benefits of new features. This has resulted in a great user experience.
During the discussion of uavcan-v1 (a discussion which is now invisible unfortunately) I pointed this out as a fundamental issue. Pavel said that he hadn’t realised that uavcan-v0 was widely used (which was a surprise, giving just how much it has dominated the industry for many years now) and he agreed to try to make it compatible.
Pavel came up with a modified encoding of uavcan-v1 that used a trick to allow uavcan-v1 and uavcan-v0 to co-exist on the same CAN bus. It did weaken the protections in the protocol a bit, but was a reasonable solution.
After that was proposed I asked that this compatibility be a central focus of the development, with real-world testing of mixed networks and devices. As far as I am concerned cyphal is dead in the water without this compatibility, proven on real vehicles, so a DroneCAN GPS can be on the same bus as a cyphal baro or ESC as well as any other combination. The network analysis tools need to support viewing mixed protocols (like mavlogdump and MAVExplorer support multiple versions of mavlink) and the whole process of changing between DroneCAN and cyphal needs to be as seamless as possible for the end user.
Unfortunately the proposed PR has not done that. It has stayed with the clean-sheet approach. That is vastly easier from a development point of view, but much much worse for end users with large investments in expensive DroneCAN peripherals.
Ecosystem Fragmentation
Given the limitations of CAN bus protocols that mixing protocols on the same bus is hard (and in many cases impossible) it is absolutely critical that we not fragment the ecosystem, or we end up losing the ecosystem scaling property that is needed to make CAN peripherals viable.
A good example is a conversation I had recently with a major ESC vendor looking to do CAN ESCs. They told me that looked and saw lots of different versions of protocols and they didn’t know what one to pick, so they decided to do their own proprietary protocol which just makes the situation worse. That type of decision gets repeated for scores of vendors.
Poor Design
The cyphal protocol is a protocol driven by buzzwords and philosophy. Real-world devices are jammed into the philosophical framework like jamming a square peg into a round hole.
All the “modern”, “data-centric-publish-subscribe” and other stuff is just window dressing around a poor design. Unfortunately discussions I’ve attempted to have with Pavel on this have led to what feels like rather patronising religious preaching, not protocol engineering. I know what publish subscribe models are. I also know when I see the principles being applied blindly without thinking about the context it is being applied in.
Let’s look at some real examples in the proposed cyphal PR to understand why the design of cyphal is an issue. I’ll use the cyphal-hitl branch as the base for discussions as a lot of the issues only become clear when you go beyond the simple case of ESCs (although the simple ESC example also has some very fundamental issues).
This image from the PR gives a good idea of how the philosophy of cyphal pans out:
It has the following parameters for the first ESC:
- CAN_D1_UC1_DYN1
- CAN_D1_UC1_EH1
- CAN_D1_UC1_FB1
- CAN_D1_UC1_POW1
In the branch it has support for maximum of 4 ESCs, so it is 16 parameters for max 4 ESCs. Each of these has to be assigned and match the corresponding parameters of each of the ESC nodes. There are tools to help do this.
There are a number of real issues with this approach. First off, it doesn’t scale. We support up to 32 actuators in ArduPilot. With the DroneCAN approach the existing SERVOn_xxxx actuator IDs are what is used on the CAN bus. So the end users only has to know the one ID. All of the different types of data associated with an ESC are all associated with that one ID.
With the cyphal approach, the reason there are so many parameters is that each little piece of information is a separate topic and a separate message on the (overloaded!) CAN bus. So the RPM comes back using reg_udral_physics_dynamics_rotation_PlanarTs. That contains:
- a 56 bit timestamp
- and a reg_udral_physics_dynamics_rotation_Planar value
then inside the reg_udral_physics_dynamics_rotation_Planar value we have a reg.udral.physics.kinematics.rotation.Planar and a uavcan.si.unit.torque.Scalar torque value.
The reg.udral.physics.kinematics.rotation.Planar value a uavcan.si.unit.angle.Scalar for angular position, a uavcan.si.unit.angular_velocity.Scalar for angular velocity and a uavcan.si.unit.angular_acceleration.Scalar for angular acceleration.
As a tree it looks like this:
reg.udral.physics.dynamics.rotation.PlanarTs.0.1 length=23
timestamp: uavcan.time.SynchronizedTimestamp.1.0 length=7
uavcan.time.SynchronizedTimestamp.1.0 length=7
microsecond: truncated uint56 length=7
value: reg.udral.physics.dynamics.rotation.Planar.0.1 length=16
reg.udral.physics.dynamics.rotation.Planar.0.1 length=16
kinematics: reg.udral.physics.kinematics.rotation.Planar.0.1 length=12
reg.udral.physics.kinematics.rotation.Planar.0.1 length=12
angular_position: uavcan.si.unit.angle.Scalar.1.0 length=4
uavcan.si.unit.angle.Scalar.1.0 length=4
radian: saturated float32 length=4
angular_velocity: uavcan.si.unit.angular_velocity.Scalar.1.0 length=4
uavcan.si.unit.angular_velocity.Scalar.1.0 length=4
radian_per_second: saturated float32 length=4
angular_acceleration: uavcan.si.unit.angular_acceleration.Scalar.1.0 length=4
uavcan.si.unit.angular_acceleration.Scalar.1.0 length=4
radian_per_second_per_second: saturated float32 length=4
torque: uavcan.si.unit.torque.Scalar.1.0 length=4
uavcan.si.unit.torque.Scalar.1.0 length=4
newton_meter: saturated float32 length=4
total length is 23 bytes. Now go look at CyphalDynamicsSubscriber::handler() in the proposed pull request. After all of that complex tree of information, all it actually gets is a 16 bit RPM. That’s it. Everything else is thrown away because it just isn’t useful information and ESCs mostly can’t provide it anyway.
That is just for one of the 4 topics for ESCs. The same over-engineering happens at every level.
Now let’s look at what happens when commands are sent to ESCs. As ESCs are the only thing supported in the proposed PR you might expect it to be pretty well developed. Unfortunately we see things like this:
for (auto esc_idx = 0; esc_idx < 4; esc_idx++) {
so, it is assuming a maximum of 4 ESCs. Only quadcopters then.
This is reinformated further down:
for (uint_fast8_t sp_idx = 0; sp_idx < 4; sp_idx++) {
_vector4_sp.value[sp_idx] = (hal.rcout->scale_esc_to_unity(srv_config[sp_idx].pulse) + 1.0) / 2.0;
}
it is encoding the demanded speed using a vector4 which looks like this:
reg.udral.service.actuator.common.sp.Vector4.0.1 length=516
value: saturated float16[4] length=2
yep, ESC commands are encoded with a vector4 (as an aside, why does the length show up as 516 bytes? It should be 8 bytes).
The design is that if commanding 8 ESCs you use a Vector8, for 16 ESCs you use Vector16, etc. Why do it like this? It just leads to code like in the PR which can only control a quadcopter.
I should say that the ESC command messages in DroneCAN right now aren’t great, but if you have the opportunity to re-design then why go to such great lengths to make it worse than what we have now?
Philosophy Driven Design
Really the fundamental problem isn’t the PR. The problem is the philosophy driven protocol design that ignores the real world properties of devices we actually use. The philosophy touts re-use and composability as an advantage, but also demonstrates why applying those principles blindly is such a bad idea. The philosophy also touts only using SI units, ignoring the fact that SI units are often not what is really needed. An example is the above with RPM. Most ESCs don’t know the RPM. They know the eRPM which is scaled by the number of motor poles which they often don’t know. The philosophy of cyphal means it transmits angular vecocity in radians/second, but eRPM doesn’t fit that mold. The PR just ignores that and assumes the angular velocity can be directly scaled to RPM ignoring motor pole count.
Having a flag that says “this is an eRPM and needs to be scaled by number of poles” would allow for ESCs to either be motor agnostic, and send rRPM, or know the motor and send RPM. That would be great, but where would this flag fit into the cyphal mold? Yet another topic?
Cyphal and GNSS Devices
Now let’s look at the GPS implementation in the cyphal-hitl branch. As the data from a sensor becomes more complex the problems with the cyphal approach become more apparent.
The AP_GPS_CYPHAL GPS driver implements 5 topics. They are:
- position with reg.udral.physics.kinematics.geodetic.PointStateVarTs
- yaw with uavcan_si_sample_angle_Scalar
- num sats with uavcan_primitive_scalar_Integer16
- status with uavcan_primitive_scalar_Integer16
- pdop with uavcan_primitive_scalar_Integer16
Let’s expand each of those:
reg.udral.physics.kinematics.geodetic.PointStateVarTs.0.1 length=67
timestamp: uavcan.time.SynchronizedTimestamp.1.0 length=7
uavcan.time.SynchronizedTimestamp.1.0 length=7
microsecond: truncated uint56 length=7
value: reg.udral.physics.kinematics.geodetic.PointStateVar.0.1 length=60
reg.udral.physics.kinematics.geodetic.PointStateVar.0.1 length=60
position: reg.udral.physics.kinematics.geodetic.PointVar.0.1 length=36
reg.udral.physics.kinematics.geodetic.PointVar.0.1 length=36
value: reg.udral.physics.kinematics.geodetic.Point.0.1 length=24
reg.udral.physics.kinematics.geodetic.Point.0.1 length=24
latitude: saturated float64 length=8
longitude: saturated float64 length=8
altitude: uavcan.si.unit.length.WideScalar.1.0 length=8
uavcan.si.unit.length.WideScalar.1.0 length=8
meter: saturated float64 length=8
covariance_urt: saturated float16[6] length=2
velocity: reg.udral.physics.kinematics.translation.Velocity3Var.0.2 length=24
reg.udral.physics.kinematics.translation.Velocity3Var.0.2 length=24
value: uavcan.si.unit.velocity.Vector3.1.0 length=12
uavcan.si.unit.velocity.Vector3.1.0 length=12
meter_per_second: saturated float32[3] length=4
covariance_urt: saturated float16[6] length=2
uavcan.si.sample.angle.Scalar.1.0 length=11
timestamp: uavcan.time.SynchronizedTimestamp.1.0 length=7
uavcan.time.SynchronizedTimestamp.1.0 length=7
microsecond: truncated uint56 length=7
radian: saturated float32 length=4
uavcan.primitive.scalar.Integer16.1.0 length=2
value: saturated int16 length=2
uavcan.primitive.scalar.Integer16.1.0 length=2
value: saturated int16 length=2
so 86 bytes total and split over 4 messages. The thing is, this is missing lots of absolutely critical information you need for a CAN GPS. For example, no accuracy information. Nothing to support moving baseline yaw. No GPS time information (time of week and MS time within week, which is critical).
The cyphal driver just fills in the missing information with arbitrary values, for example this code:
state.horizontal_accuracy = 0.1;
state.vertical_accuracy = 0.1;
state.speed_accuracy = 0.1;
these numbers really matter as they affect the EKF fusion process, you can’t just make them up.
Following the cyphal philosophy, each of these missing pieces will be yet more topics. By the time we got a useful cyphan GPS we’d end up with a dozen or more topics (possibly a lot more?), each with their own ID parameters, for each GPS, so more things for end users to mess up.
Ease of Development vs Ease of Use
Another way to look at cyphal is where it comes down on the ease of use vs ease of development spectrum.
Doing a clean-sheet protocol was an “ease of development” decision. It is vastly easier (and much more fun!) to do a protocol from scratch, but it is awful for end user deployment.
Same with not having the messages have an identifier in them to say what the message is. Pavel quite rightly points out that having an identifier in the protocol means you can get collisions and mistakes can be made. Of course that is true, but it is developer pain, whereas putting the IDs into user facing parameters is user pain. There are vastly more users than developers and the ratio of users to develops rapidly increases as the protocol becomes more widespread. Part of the job of a good developer is to take on the pain so that end users get an easier time.
I know you can develop tools to try to check the users work, but you shouldn’t need to do that and it will still be error prone, and the end user won’t have the same sophisticated knowledge needed to debug things that the developers are expected to have.
Think about dyslexic users (and yes, they are quite common). Getting CAN_D1_UC1_DYN3=2374 instead of 2347 could mean that ArduPilot could decode a GPS latitude/longitude as a ESC RPM. This sort of “network cast” is at the heart of the problems that cyphal creates.
If we merged the cyphal PR then the expectation would be that we would also merge support for mag, baro, GPS, airspeed, rangefinder etc etc. Following the cyphal philosophy we’d end up with many hundreds of topics, and as part of the philosphy is that topics don’t carry IDs that associate with actual devices, you need to have a parameter for each of these topics for each instance you could have of the device.
So if we need 5 topics for servos (likely we’d need quite a few more as servos give a lot of information) and we support 32 servos then that is 160 parameters just for servos! Once we have a full implementation that matches the current DroneCAN one then we’d end up with maybe 500 or so topic ID parameters?
I hope this makes it clear why we aren’t keen on cyphal, and wish that Pavel had listened more years ago when we made it abundantly clear what the issues were.