Systems Integration Engineer - SW Focused Issue Triage & RCA

Agility Robotics
Fremont, CA
Hybrid

About The Position

Agility’s commercially deployed humanoids operate alongside teams in warehouses, manufacturing facilities, and distribution centers, tackling physically demanding and repetitive tasks while enabling workers to focus on higher-value work. With industry-leading safety standards and years of proven deployment data, we're pioneering a new era of automation that enhances human potential.

Role Overview

We are seeking a Systems Integration Engineer specialized in Software Issue Triage and Root Cause Analysis (RCA). Your main function is to conduct remote triage, utilizing log parsing, telemetry data, and video analysis, to identify failures with software root causes and ensure they are accurately dispositioned to the appropriate SW development teams. You will conduct deep-dive root cause analysis on novel failures occurring at the hardware-software interface, while simultaneously architecting the diagnostic scripts and tools required to streamline these investigations. In this role you will move beyond basic data review to navigate ambiguous failure modes, develop automated diagnostic scripts, and create the technical documentation that drives software reliability across the fleet.

Requirements

  • 4+ years of experience in Systems Integration, Software-Hardware interface, or R&D with a focus on software on complex mechatronic or autonomous systems.
  • Proven experience using monitoring and observability platforms (e.g., Datadog, Splunk, or New Relic) to track system health and identify performance anomalies across a fleet.
  • Experience interacting with cloud-based storage and databases (e.g., AWS S3, SQL, or NoSQL) to retrieve and manage large-scale telemetry and video datasets.
  • Proven track record of navigating highly ambiguous software-hardware intersections to find definitive root causes.
  • Experience creating technical documentation or bug reports intended for software engineering audiences.
  • Mastery of log parsing via CLI and proficiency in using Python or similar scripting languages for data visualization and failure trend analysis.
  • Familiarity with database environments, specifically regarding data retrieval and log management.
  • Experience correlating video and/or HW symptoms with system telemetry to identify physical manifestations of software bugs.
  • Strong understanding of software stacks in robotics, including communication protocols (e.g., EtherCAT, CAN) and how they manifest in system logs.
  • Ability to tackle ambiguous, unprecedented problems and create reusable, scalable solutions.
  • Capacity to operate independently on initiatives and proactively anticipate the needs for effective and efficient triage and RCA.
  • Exceptional ability to synthesize complex telemetry and video data into clear, actionable insights for software engineering stakeholders.
  • Bachelor’s or Master’s degree in Computer Science, Robotics, Electrical Engineering, or a related field.

Nice To Haves

  • Experience with HW/SW integration and design on hardware-in-the-loop (HiL) setups.
  • Experience with characterizing or troubleshooting HW/SW interactions such as cameras, encoders, IMUs, or other sensors.

Responsibilities

  • Serve as a lead voice in the triage process, providing the expertise required to classify complex failures specifically as software, firmware, or system-level regressions.
  • Effectively disposition identified issues to the software organization, providing clean tickets (logs, video clips, and analysis) that allow developers to act quickly.
  • Manage and prioritize escalated SW-related investigations, making informed trade-offs to ensure that critical safety or performance risks are addressed first.
  • Lead end-to-end investigations into novel failures using deep-dive log review, telemetry analysis, and video diagnostics to pinpoint bugs at the software/hardware interface or unexpected system behaviors.
  • Develop and execute scripts or other data visualization tools to parse massive log sets and identify intermittent failure trends.
  • Leverage structured methodologies such as 5-Whys or Fishbone to move from a surface-level symptom to a definitive root cause.
  • Author and maintain "Gold Standard" RCA reports and troubleshooting guides that improve the technical autonomy of the broader triage team.
  • Promote a culture of rigorous documentation and data-driven problem-solving.
  • Create reusable diagnostic frameworks that automate the identification of known software issues, increasing the efficiency of the entire R&D loop.

Benefits

  • 401(k) Plan: Includes a 6% company match.
  • Equity: Company stock options.
  • Insurance Coverage: 100% company-paid medical, dental, vision, and short/long-term disability insurance for employees.
  • Benefit Start Date: Eligible for benefits on your first day of employment.
  • Well-Being Support: Employee Assistance Program (EAP).
  • Exempt Employees: Flexible, unlimited PTO and 10 company holidays, including a winter shutdown.
  • Non-Exempt Employees: Each year, 10 vacation days, paid sick leave, and 10 company holidays, including a winter shutdown.
  • On-Site Perks: Catered lunches four times a week and a variety of healthy snacks and refreshments at our Salem and Pittsburgh locations.
  • Parental Leave: Generous paid parental leave programs.
  • Work Environment: A culture that supports flexible work arrangements.
  • Growth Opportunities: Professional development and tuition reimbursement programs.
  • Relocation Assistance: Provided for eligible roles.
  • Annual Discretionary Bonus: Provided for eligible roles.