APMrover2-release v3.1.2 regular segmentation fault

I’m using a Pi 3 (Frambuesca, PXFmini-RaspberryPi3-frambuesa-25-10-2016.img), PXFmini, no additional hardware, Mission Planner (responsive UDP connection), and APMrover2-release, ie. v3.1.2.

I’m posting this as a separate thread as requested by @peterbarker here (@gmorph might be interested too…?): https://discuss.ardupilot.org/t/scan-669-scan-past-end-of-eeprom/19504/5. That thread involves APMrover2-v3.2.0-dev (ie. the master yesterday).

This Segmentaion Fault is very rare in the current master, but I need to use Release v3.1.2, so would like to track down the fix.

gdb shows the Segmentation Fault:

$ sudo gdb --args ~/a-r/APMrover2/APMrover2.elf -A udp:10.0.0.2:14550 -B /dev/ttyAMA0 -l /home/erle/APM/logs -t /home/erle/APM/terrain/

GNU gdb (Raspbian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type “show copying”
and “show warranty” for details.
This GDB was configured as “arm-linux-gnueabihf”.
Type “show configuration” for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type “help”.
Type “apropos word” to search for commands related to “word”…
Reading symbols from /home/erle/a-r/APMrover2/APMrover2.elf…done.

(gdb) run

Starting program: /home/erle/a-r/APMrover2/APMrover2.elf -A udp:10.0.0.2:14550 -B /dev/ttyAMA0 -l /home/erle/APM/logs -t /home/erle/APM/terrain/
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib/arm-linux-gnueabihf/libthread_db.so.1”.
Raspberry Pi 2/3 with BCM2709!
[New Thread 0x76c9f450 (LWP 2003)]
[New Thread 0x76c5f450 (LWP 2004)]
[New Thread 0x76aff450 (LWP 2005)]
[New Thread 0x76abf450 (LWP 2006)]
[New Thread 0x76a7f450 (LWP 2007)]
[New Thread 0x75f4a450 (LWP 2008)]
MS5611 found on bus 0 address 0x00
[New Thread 0x75dff450 (LWP 2009)]
[New Thread 0x75dbf450 (LWP 2010)]
MPU: temp reset 4874 0

Not many seconds to wait (this happens very frequently):

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x76aff450 (LWP 2005)]
Linux::RCInput_RPI::_timer_tick (this=0xe24f8 )
at /home/erle/a-r/libraries/AP_HAL_Linux/RCInput_RPI.cpp:476
476 void *x = circle_buffer->get_virt_addr((ad + j)->dst);

cicle_buffer is always involved. Also seen at RCInput_RPI.cpp:504: circle_buffer->get_page.

(gdb) list

471 }
472
473 // Now we are getting address in which DMAC is writing at current moment
474 dma_cb_t *ad = (dma_cb_t *)con_blocks->get_virt_addr(dma_reg[RCIN_RPI_DMA_CONBLK_AD | RCIN_RPI_DMA_CHANNEL << 8]);
475 for (int j = 1; j >= -1; j–) {
476 void *x = circle_buffer->get_virt_addr((ad + j)->dst);
477 if (x != nullptr) {
478 counter = circle_buffer->bytes_available(curr_pointer,
479 circle_buffer->get_offset(circle_buffer->_virt_pages, (uintptr_t)x));
480 break;

(gdb) bt

#0 Linux::RCInput_RPI::_timer_tick (this=0xe24f8 )
at /home/erle/a-r/libraries/AP_HAL_Linux/RCInput_RPI.cpp:476
1 0x000b4c54 in operator() (this=0xe23fc <schedulerInstance+436>)
at /home/erle/a-r/libraries/AP_HAL/utility/functor.h:54
#2 Linux::PeriodicThread::_run (this=0xe23f8 <schedulerInstance+432>)
at /home/erle/a-r/libraries/AP_HAL_Linux/Thread.cpp:265
#3 0x000b4dcc in Linux::thread::_run_trampoline (arg=0xe23f8 <schedulerInstance+432>)
at /home/erle/a-r/libraries/AP_HAL_Linux/Thread.cpp:41
#4 0x76de3e90 in start_thread (arg=0x76aff450) at pthread_create.c:311
#5 0x76d6e128 in ?? () at …/ports/sysdeps/unix/sysv/linux/arm/nptl/…/clone.S:92
from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

In the valgrind output below, note the following:

  • APMrover2.cpp:586
  • HAL_Linux_Class.cpp:354
  • RCInput_RPI

All of the “strchr.S” entries also exist in the valgrind of apm-rover-erlebrain which does not give Segmentation Faults.

valgrind --leak-check=full --show-reachable=yes ./a-r/APMrover2/APMrover2.elf

==2816 == Memcheck, a memory error detector
==2816== Copyright (C) 2002-2011, and GNU GPL’d, by Julian Seward et al.
==2816== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==2816== Command: ./a-r/APMrover2/APMrover2.elf
==2816==
==2816== Conditional jump or move depends on uninitialised value(s)
==2816== at 0x40194F0: index (strchr.S:99)
==2816== by 0x40080A3: expand_dynamic_string_token (dl-load.c:425)
==2816== by 0x4008237: fillin_rpath (dl-load.c:495)
==2816== by 0x4008A07: _dl_init_paths (dl-load.c:872)
==2816== by 0x4002C77: dl_main (rtld.c:1348)
==2816== by 0x4016B8B: _dl_sysdep_start (dl-sysdep.c:249)
==2816== by 0x4004CD3: _dl_start_final (rtld.c:331)
==2816== by 0x4004F63: _dl_start (rtld.c:559)
==2816== by 0x4000D6F: ??? (in /lib/arm-linux-gnueabihf/ld-2.19.so)
==2816==
==2816== Conditional jump or move depends on uninitialised value(s)
==2816== at 0x40194F4: index (strchr.S:101)
==2816== by 0x40080A3: expand_dynamic_string_token (dl-load.c:425)
==2816== by 0x4008237: fillin_rpath (dl-load.c:495)
==2816== by 0x4008A07: _dl_init_paths (dl-load.c:872)
==2816== by 0x4002C77: dl_main (rtld.c:1348)
==2816== by 0x4016B8B: _dl_sysdep_start (dl-sysdep.c:249)
==2816== by 0x4004CD3: _dl_start_final (rtld.c:331)
==2816== by 0x4004F63: _dl_start (rtld.c:559)
==2816== by 0x4000D6F: ??? (in /lib/arm-linux-gnueabihf/ld-2.19.so)
==2816==
==2816== Conditional jump or move depends on uninitialised value(s)
==2816== at 0x4019830: strlen (strlen.S:76)
==2816== by 0x40057EB: local_strdup (dl-load.c:161)
==2816== by 0x4008237: fillin_rpath (dl-load.c:495)
==2816== by 0x4008A07: _dl_init_paths (dl-load.c:872)
==2816== by 0x4002C77: dl_main (rtld.c:1348)
==2816== by 0x4016B8B: _dl_sysdep_start (dl-sysdep.c:249)
==2816== by 0x4004CD3: _dl_start_final (rtld.c:331)
==2816== by 0x4004F63: _dl_start (rtld.c:559)
==2816== by 0x4000D6F: ??? (in /lib/arm-linux-gnueabihf/ld-2.19.so)
==2816==
==2816== Conditional jump or move depends on uninitialised value(s)
==2816== at 0x4019834: strlen (strlen.S:78)
==2816== by 0x40057EB: local_strdup (dl-load.c:161)
==2816== by 0x4008237: fillin_rpath (dl-load.c:495)
==2816== by 0x4008A07: _dl_init_paths (dl-load.c:872)
==2816== by 0x4002C77: dl_main (rtld.c:1348)
==2816== by 0x4016B8B: _dl_sysdep_start (dl-sysdep.c:249)
==2816== by 0x4004CD3: _dl_start_final (rtld.c:331)
==2816== by 0x4004F63: _dl_start (rtld.c:559)
==2816== by 0x4000D6F: ??? (in /lib/arm-linux-gnueabihf/ld-2.19.so)
==2816==
–2816-- WARNING: unhandled syscall: 357
–2816-- You may be able to write your own handler.
–2816-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
–2816-- Nevertheless we consider this a bug. Please report
–2816-- it at Valgrind: Bug Reports.
Failed to create epoll: Function not implemented
Scheduler: failed to set scheduling parameters: Operation not permitted
==2816== Invalid write of size 4
==2816== at 0xB0CE0: Linux::RCInput_RPI::teardown() (RCInput_RPI.cpp:248)
==2816== by 0xB63C3: AP_HAL::panic(char const*, …) (system.cpp:33)
==2816== by 0xB33D3: Linux::Scheduler::init() (Scheduler.cpp:93)
==2816== by 0xB480F: HAL_Linux::run(int, char* const*, AP_HAL::HAL::Callbacks*) const (HAL_Linux_Class.cpp:354)
==2816== by 0x128E7: main (APMrover2.cpp:586)
==2816== Address 0x0 is not stack’d, malloc’d or (recently) free’d
==2816==
==2816==
==2816== Process terminating with default action of signal 11 (SIGSEGV)
==2816== Access not within mapped region at address 0x0
==2816== at 0xB0CE0: Linux::RCInput_RPI::teardown() (RCInput_RPI.cpp:248)
==2816== by 0xB63C3: AP_HAL::panic(char const*, …) (system.cpp:33)
==2816== by 0xB33D3: Linux::Scheduler::init() (Scheduler.cpp:93)
==2816== by 0xB480F: HAL_Linux::run(int, char* const*, AP_HAL::HAL::Callbacks*) const (HAL_Linux_Class.cpp:354)
==2816== by 0x128E7: main (APMrover2.cpp:586)
==2816== If you believe this happened as a result of a stack
==2816== overflow in your program’s main thread (unlikely but
==2816== possible), you can try to increase the size of the
==2816== main thread stack using the --main-stacksize= flag.
==2816== The main thread stack size used in this run was 8388608.
Raspberry Pi 2/3 with BCM2709!
==2816==
==2816== HEAP SUMMARY:
==2816== in use at exit: 2,033 bytes in 25 blocks
==2816== total heap usage: 27 allocs, 2 frees, 2,489 bytes allocated
==2816==
==2816== 0 bytes in 12 blocks are still reachable in loss record 1 of 2
==2816== at 0x4833970: malloc (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==2816==
==2816== 2,033 bytes in 13 blocks are still reachable in loss record 2 of 2
==2816== at 0x4835770: calloc (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==2816==
==2816== LEAK SUMMARY:
==2816== definitely lost: 0 bytes in 0 blocks
==2816== indirectly lost: 0 bytes in 0 blocks
==2816== possibly lost: 0 bytes in 0 blocks
==2816== still reachable: 2,033 bytes in 25 blocks
==2816== suppressed: 0 bytes in 0 blocks
==2816==
==2816== For counts of detected and suppressed errors, rerun with: -v
==2816== Use --track-origins=yes to see where uninitialised values come from
==2816== ERROR SUMMARY: 5 errors from 5 contexts (suppressed: 0 from 0)

I notice RCInput_RPI.cpp was last changed on 22 May 2017 (the only change since the v3.1.2 release on 15 March 2017), but the change does not seem to be related to circle_buffer:
https://github.com/ArduPilot/ardupilot/blob/master/libraries/AP_HAL_Linux/RCInput_RPI.cpp

AP_HAL_Linux: RCInput_RPI: ignore unwanted signals
We are setting a termination handler for some signals which are of not
interest. Just ignore them. Ignoring SIGWINCH allows for example to
run on a screen and change the window size later without killing
ardupilot.

I’ve found this:
https://discuss.ardupilot.org/t/copter-3-5-0-rc7-unexpected-crash-on-erlebrain/18465/7

Copter-3.5.0-rc7 unexpected crash on ErleBrain
June 15

That suggests that the issue has been fixed, but I can’t find it. I’d like to do a small surgical fix to be copy of v3.1.2.

Also another valgrind a month ago by @LanderU:

Thanks in anticipation, Geoff :wink:

Hi @GeoUAV,

did you try it with the new “frambuesa” OS?

I will try it on my PXFmini.

Regards,

@LanderU

Hi @LanderU,

Yes I’m using PXFmini-RaspberryPi3-frambuesa-25-10-2016.img (I’ve edited my original posts).

Regards, Geoff

Hi @GeoUAV,

I get the same segfault, but if I apply these changes:

In this branch: https://github.com/ArduPilot/ardupilot/tree/APMrover2-release

I’m using this OS: http://forum.erlerobotics.com/t/update-frambuesa-os-for-erle-brain-3/3057

And for now, no crashes.

I have already the deb package for that version with the changes if you want.

Regards,

@LanderU

1 Like

Thanks @LanderU :grinning:

I just applied the change to my copy of Scheduler.cpp in APMrover2-release (v3.1.2) just to increase the stack to 1024, and it’s been working for almost 10 minutes without Seg Faults, so I’d say that was a fix, thanks again :wink:

- t->thread->set_stack_size(256 * 1024);
+ t->thread->set_stack_size(1024 * 1024);

Regards,
Geoff

[SOLVED]
void *x = circle_buffer->get_virt_addr((ad + j)->dst);
issue is at (ad + j)->dst
the J +1 does not always guarantees a valid control block .
I made a fix in RCInput_RPI.cpp
and RPIZero worked for hours with no crash.