GSoC 2026 — AI-Assisted Log Diagnosis: Proposal + A Few Questions

Hey everyone,

I’m Mohamed, a final-year engineering student from Egypt. I’m applying for the GSoC 2026 AI-Assisted Log Diagnosis & Root-Cause Detection project and wanted to share my approach and get some feedback before I finalize the proposal.

My relevant background:
I’ve been working on an autonomous robot for my graduation project: ROS 2, EKF localization, AMCL, LiDAR-based navigation, the works. A big chunk of that work is a pipeline that logs sensor events (navigation failures, EKF state changes, localization confidence drops) and then diagnoses what went wrong using an LLM-backed query system. It’s basically the same problem as flight-log diagnosis, just with a robot in a hospital instead of a vehicle in the field. So when I saw this project, it felt like a natural continuation of what I’m already doing.

Proposed approach in short:
• Parse .bin logs using pymavlink and extract features from the VIBE, GPS, MAG, POWR, RCOU, and EKF (NKF*/XKF*) messages
• Train a multi-class LightGBM classifier on labeled failure patterns (vibration, GPS glitch, compass interference, ESC failure, param misconfiguration, etc.)
• Build a retrieval layer over labeled Discuss forum threads + GitHub issues so the tool can suggest specific fixes with links to the original source
• Output a structured JSON report: root cause, confidence score, log evidence with timestamps, fix suggestions
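
To make the pipeline above a bit more concrete, here is a minimal sketch of the first and last steps: reading messages from a .bin log and emitting the JSON report. The names `read_messages`, `summarize`, `Evidence`, and `DiagnosisReport`, as well as the report field names, are placeholders I made up for illustration; only the pymavlink calls (`mavutil.mavlink_connection`, `recv_match`, `to_dict`) are existing API. The classifier and retrieval layer are left out.

```python
import json
from dataclasses import dataclass, field, asdict
from statistics import mean


def read_messages(path, types=("VIBE", "GPS")):
    """Yield dataflash messages from a .bin log as plain dicts.

    pymavlink is imported lazily so the rest of this sketch runs without it.
    """
    from pymavlink import mavutil  # third-party dependency

    log = mavutil.mavlink_connection(path)
    while True:
        msg = log.recv_match(type=list(types))
        if msg is None:  # end of log reached
            break
        # to_dict() includes 'mavpackettype' plus the message's own fields
        yield msg.to_dict()


def summarize(rows, msg_type, field_name):
    """Return mean/max features for one field of one message type."""
    values = [
        r[field_name]
        for r in rows
        if r.get("mavpackettype") == msg_type and field_name in r
    ]
    if not values:
        return {}
    prefix = f"{msg_type}.{field_name}"
    return {f"{prefix}.mean": mean(values), f"{prefix}.max": max(values)}


@dataclass
class Evidence:
    time_us: int   # TimeUS of the log line backing the diagnosis
    message: str   # message type, e.g. "VIBE"
    detail: str    # human-readable description


@dataclass
class DiagnosisReport:
    root_cause: str
    confidence: float
    evidence: list = field(default_factory=list)
    fix_suggestions: list = field(default_factory=list)

    def to_json(self):
        return json.dumps(asdict(self), indent=2)


# Usage with hand-written rows standing in for read_messages("flight.bin")
rows = [
    {"mavpackettype": "VIBE", "TimeUS": 1000, "VibeX": 10.0},
    {"mavpackettype": "VIBE", "TimeUS": 2000, "VibeX": 40.0},
]
features = summarize(rows, "VIBE", "VibeX")
report = DiagnosisReport(
    root_cause="vibration",
    confidence=0.8,  # made-up value; would come from the classifier
    evidence=[Evidence(2000, "VIBE", "VibeX peaked at 40.0")],
)
print(report.to_json())
```

Keeping the reader separate from the feature and report code means the later stages can be tested on plain dicts, without needing real flight logs on hand.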

My main concern / question for the mentors:
The biggest open question for me is the labeled dataset. My plan is to:

  1. Manually label ~300 forum posts where a developer confirmed a root cause
  2. Auto-label logs using ArduPilot’s published threshold values (VibeX > 30, HDOP > 2.5, etc.)
  3. Use SITL to inject synthetic failures for rare categories
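
Step 2 could look roughly like the sketch below. It uses the two thresholds quoted above (VibeX > 30, HDOP > 2.5); the label names, the `auto_label` helper, and the `min_fraction` rule (only label a log when at least 5% of samples exceed the limit, so a single spike does not decide the label) are my own assumptions, not anything ArduPilot publishes. Rows are dicts in the shape pymavlink's `to_dict()` returns, and `HDop` is the field name used in the GPS log message.

```python
# label -> (message type, field, threshold); values taken from the post above
THRESHOLDS = {
    "vibration": ("VIBE", "VibeX", 30.0),
    "gps_glitch": ("GPS", "HDop", 2.5),
}


def auto_label(rows, thresholds=THRESHOLDS, min_fraction=0.05):
    """Return {label: fraction of samples over threshold} for weak labeling.

    min_fraction is a hypothetical noise guard, not a published value.
    """
    labels = {}
    for label, (msg_type, field_name, limit) in thresholds.items():
        values = [
            r[field_name]
            for r in rows
            if r.get("mavpackettype") == msg_type and field_name in r
        ]
        if not values:
            continue  # this log has no such messages; leave it unlabeled
        fraction = sum(v > limit for v in values) / len(values)
        if fraction >= min_fraction:
            labels[label] = round(fraction, 3)
    return labels


# Usage with hand-written rows standing in for a parsed log
rows = [
    {"mavpackettype": "VIBE", "VibeX": 5.0},
    {"mavpackettype": "VIBE", "VibeX": 35.0},
    {"mavpackettype": "VIBE", "VibeX": 40.0},
    {"mavpackettype": "VIBE", "VibeX": 8.0},
    {"mavpackettype": "GPS", "HDop": 0.9},
    {"mavpackettype": "GPS", "HDop": 1.1},
]
labels = auto_label(rows)
print(labels)
```

Labels produced this way are weak labels, so I would treat the manually labeled forum posts from step 1 as the ground truth for evaluation.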

Is this approach realistic, or is there an existing labeled log dataset I’m not aware of? And are SITL-generated synthetic logs considered acceptable training data, or do mentors prefer real-world logs only? Also happy to hear if there’s a failure category that the community finds hardest to diagnose — I’d prioritize that early.

Thank you in advance

Mohamed

Hi Mohamed, in case you are not aware, proposals are due in about 1.5 hours from the time I write this message. I don’t think you’ll have the time needed to turn something around given that. So if you’re interested in GSoC with ArduPilot next year, reach out a bit sooner!

We still invite you to submit what you have to be considered for this year.