Production Systems Engineer, Automation

Meta, Menlo Park, CA
$144,000 - $204,000

About The Position

Meta is seeking a Production Systems Engineer, Automation to join our Production Systems Engineering organization, where you will help drive the reliability, efficiency, and scalability of Meta's large-scale hardware infrastructure through test automation. You will design and build the systems tooling, test automation, and frameworks that keep Meta's global production fleet (spanning compute, storage, networking, and custom silicon) operating at peak performance. Working at the intersection of hardware and software, you will partner with data center operations, hardware engineering, platform teams, and ODM/vendor partners to drive systemic improvements across the full infrastructure stack.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 3+ years of experience in production systems engineering or infrastructure software engineering, including development in C, C++, or Python for Linux-based environments
  • 3+ years of experience with large-scale hardware infrastructure systems, including fleet automation, hardware lifecycle management, or data center operations software
  • 3+ years of experience in designing and operating distributed systems software at scale, including monitoring, alerting, and automated remediation pipelines
  • 3+ years of experience in communicating system designs and technical decisions through written documentation and cross-functional stakeholder engagement
  • Demonstrated troubleshooting skills across hardware products and automation software

Nice To Haves

  • Master's degree in Computer Science, Computer Engineering, or similar field
  • 6+ years of experience across a variety of infrastructure components, such as network and compute, in a data center or large-scale production environment
  • 3+ years of experience in building or operating CI/CD pipelines and test automation frameworks for infrastructure software
  • Familiarity with custom silicon or accelerator platform integration, including firmware and platform management interfaces
  • Expertise guiding cross-functional teams or ODM/vendor partners through the setup, integration, and execution of automation and validation frameworks at scale

Responsibilities

  • Design, build, and scale test orchestration and validation tooling, CI/CD pipelines, and automation frameworks that qualify AI hardware platforms at cluster scale, spanning provisioning, monitoring, and lifecycle management of compute, storage, and networking infrastructure
  • Develop tooling for hardware lifecycle management, fleet health observability, and automated remediation of production system failures across Meta's data center fleets
  • Identify and resolve systemic reliability and performance issues by analyzing hardware telemetry, failure patterns, and system-level diagnostics at scale
  • Collaborate with hardware engineering teams to define software interfaces, firmware integration requirements, and bring-up workflows for new server and accelerator platforms
  • Lead cross-functional efforts to evaluate, qualify, and integrate new hardware technologies into the production environment, including validation and qualification workflows
  • Develop scalable infrastructure automation that reduces operational toil and accelerates hardware deployment and remediation across the global fleet
  • Mentor other engineers on systems software design, debugging methodologies, and production infrastructure best practices
  • Communicate technical designs and architectural decisions through written documentation and cross-functional stakeholder alignment

Benefits

  • bonus
  • equity
  • benefits