## Feature request
**Is your feature request related to a problem? Please describe.**
I'd like to suggest a potential small optimization for Cortex-M3 and M4, possibly M7.
**Describe the solution you'd like**
This ticket tracks the testing and tuning of different loop alignment options for ArduPilot and CMSIS. I'd like feedback on the proposed tuning.
**Describe alternatives you've considered**
An alternative would be to keep it as-is, which results in no additional loop padding (wasted space).
**Platform**
[ ] All
[ ] AntennaTracker
[X] Copter
[ ] Plane
[ ] Rover
[ ] Submarine
**Additional context**
On several M4 devices, I've run into a per-iteration penalty cycle for loops that start with a 32-bit instruction that is only 16-bit aligned. I've noticed a few places where this occurs in the CMSIS code used by ArduPilot, and wanted to suggest considering the -falign-loops=4 option, which adds a narrow NOP (one cycle of loop-entry overhead) just before each unaligned loop (even those starting with a narrow instruction). Here's an example from CMSIS:
```
08138138 <arm_mult_f32>:
8138138: ea5f 0c93 movs.w ip, r3, lsr #2
813813c: b4f0 push {r4, r5, r6, r7}
813813e: d033 beq.n 81381a8 <arm_mult_f32+0x70>
8138140: f100 0610 add.w r6, r0, #16
8138144: f101 0510 add.w r5, r1, #16
8138148: f102 0410 add.w r4, r2, #16
813814c: 4667 mov r7, ip
//loop to process 4 samples per iteration
813814e: ed15 7a04 vldr s14, [r5, #-16] //This loop starts with a 16-bit aligned 32-bit wide instruction, which requires two 32-bit instruction fetches
8138152: 3f01 subs r7, #1
8138154: ed56 7a04 vldr s15, [r6, #-16]
8138158: f105 0510 add.w r5, r5, #16
813815c: f106 0610 add.w r6, r6, #16
8138160: f104 0410 add.w r4, r4, #16
8138164: ee67 7a87 vmul.f32 s15, s15, s14
8138168: ed44 7a08 vstr s15, [r4, #-32] ; 0xffffffe0
813816c: ed15 7a07 vldr s14, [r5, #-28] ; 0xffffffe4
8138170: ed56 7a07 vldr s15, [r6, #-28] ; 0xffffffe4
8138174: ee67 7a87 vmul.f32 s15, s15, s14
8138178: ed44 7a07 vstr s15, [r4, #-28] ; 0xffffffe4
813817c: ed15 7a06 vldr s14, [r5, #-24] ; 0xffffffe8
8138180: ed56 7a06 vldr s15, [r6, #-24] ; 0xffffffe8
8138184: ee67 7a87 vmul.f32 s15, s15, s14
8138188: ed44 7a06 vstr s15, [r4, #-24] ; 0xffffffe8
813818c: ed56 7a05 vldr s15, [r6, #-20] ; 0xffffffec
8138190: ed15 7a05 vldr s14, [r5, #-20] ; 0xffffffec
8138194: ee67 7a87 vmul.f32 s15, s15, s14
8138198: ed44 7a05 vstr s15, [r4, #-20] ; 0xffffffec
813819c: d1d7 bne.n 813814e <arm_mult_f32+0x16> //Branch target is aligned on 2
813819e: ea4f 140c mov.w r4, ip, lsl #4
...rest of function handles remaining 0-3 samples, and has been omitted...
```
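The condition can be checked mechanically. Here's an illustrative Python sketch (not ArduPilot code) that flags a loop whose first instruction is 32 bits wide but only 16-bit aligned, using the Thumb-2 encoding rule that an instruction is 32 bits wide when the top five bits of its first halfword are 0b11101, 0b11110, or 0b11111:

```python
def is_32bit_thumb(hw1: int) -> bool:
    # In Thumb-2, an instruction is 32 bits wide when the top five
    # bits of its first halfword are 0b11101, 0b11110, or 0b11111;
    # all other first halfwords are complete 16-bit instructions.
    return (hw1 >> 11) in (0b11101, 0b11110, 0b11111)

def loop_entry_penalty(addr: int, hw1: int) -> bool:
    # A 32-bit instruction at addr % 4 == 2 spans two 32-bit fetch
    # words, costing an extra fetch on every loop iteration.
    return is_32bit_thumb(hw1) and addr % 4 == 2

# The vldr at the top of the arm_mult_f32 loop in the listing above:
print(loop_entry_penalty(0x813814E, 0xED15))  # → True
```

Running this over an objdump listing of libarm_math would be one way to count how many CMSIS loops are affected before deciding whether the padding is worth it.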
I suspect the M7 has a 64-bit or wider fetch interface, so this may only affect it when a 32-bit instruction straddles two instruction fetches. Cortex-M7 also has a branch predictor, so it may not suffer from this as often as the M4.
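For weighing the trade-off, here's a back-of-the-envelope model in Python. The costs are assumptions based on the behavior described above (one extra fetch cycle per iteration when misaligned, a single one-cycle narrow NOP on entry when aligned); the actual numbers depend on the part and on flash wait states:

```python
def cycles_saved(iterations: int, loop_body_cycles: int) -> int:
    # Assumed costs: a misaligned 32-bit loop start adds one fetch
    # cycle per iteration; aligning the loop adds one narrow NOP
    # executed once before loop entry.
    misaligned = iterations * (loop_body_cycles + 1)
    aligned = 1 + iterations * loop_body_cycles
    return misaligned - aligned

# e.g. a 256-sample arm_mult_f32 call (64 iterations of the 4-sample loop):
print(cycles_saved(64, 20))  # → 63
```

Under these assumptions, alignment wins whenever the loop runs more than once, saving roughly one cycle per iteration at the cost of two bytes of padding.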