Operations Platform Engineer

Meta • Redmond, WA

About The Position

Meta is seeking a highly motivated and experienced Operations Platform Engineer to own and evolve the robotics test and data infrastructure that underpins motor validation, experimentation, and downstream analysis.

We are building next-generation robotic systems that operate in the real world and generate large volumes of high-fidelity data. As these systems scale, we need robust software infrastructure to ensure that motor control, testing, telemetry, and data pipelines are repeatable, observable, and usable across teams.

This is a hands-on builder role at the intersection of robotics hardware, system software, and data platforms. You’ll help research and engineering teams scale beyond one-off solutions by turning early prototypes into reliable, extensible systems that increase test throughput, improve data quality, and accelerate iteration.

Requirements

7+ years of experience in software engineering, systems engineering, robotics engineering, or related fields
3+ years of experience working close to hardware, including motors, sensors, actuators, embedded systems, and/or embedded Linux environments
Proven ability to design and build test frameworks or infrastructure for physical systems (labs, manufacturing tests, reliability rigs, end-of-line, or similar)
Experience building data ingestion pipelines for high-frequency and/or real-time telemetry (including time sync, buffering, backpressure, and schema evolution)
Systems engineering fundamentals: APIs, data schemas, failure modes, reliability, operational discipline, and maintainable interfaces
Ability to operate effectively in ambiguous, fast-moving environments with evolving requirements
Proven communication and collaboration skills across hardware, software, and research disciplines

Nice To Haves

Experience in industrial robotics, automation, embedded Linux, or real-time systems
Experience with robotics data formats, replay systems, simulation pipelines, or log-based debugging at scale
Familiarity with observability tooling and practices (metrics, logging, tracing, dashboards, alerting)
Experience supporting ML or research teams through infrastructure (data capture, labeling support, dataset generation, evaluation pipelines) rather than model development
Prior experience bringing prototype systems into scaled, multi-team usage, including documentation, onboarding, and operational support
Demonstrated ability to integrate AI tools to optimize or redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies

Responsibilities

Design and build motor and actuator test infrastructure, including control loops, data capture, and validation tooling
Develop and standardize repeatable test stations that scale across hardware variants, labs, and teams
Define and implement telemetry schemas and data contracts for robotic systems (commands, feedback, environment, failures), ensuring consistency across programs
Build time-synchronized data pipelines to support debugging, replay, offline analysis, and training workflows
Establish observability standards for robotic systems, including metrics, logging, diagnostics, anomaly detection, and dashboards
Partner closely with robotics hardware, firmware, research, safety, and operations teams to ensure systems are reliable, safe, and extensible
Identify and eliminate bottlenecks in data quality, test throughput, and system reliability as usage scales to more teams and more robots
Drive architecture decisions that balance rapid experimentation with long-term maintainability, operational robustness, and scalability
Support fleet and lab validation workflows by enabling consistent test execution across platforms (e.g., Lithium, Ber, Boron, Carbon, Aloha, Mimmic, Trossen)
Contribute to system-level failure understanding by enabling instrumentation and workflows that accelerate failure triage and root cause analysis