Staff DevOps Engineer

Atomic Machines · Santa Clara, CA

About The Position

As a Staff DevOps Engineer, you will design and own the infrastructure that delivers and operates software on autonomous manufacturing systems. This role spans cloud services, CI/CD, embedded Linux platforms, fleet management, and production observability—ensuring that code moves safely from commit to factory floor and behaves predictably once deployed. You'll define deployment architecture, release and rollback strategies, and reliability standards for distributed systems that directly control industrial equipment. That includes building tooling for reproducible builds, staged rollouts, remote diagnostics, and rapid failure isolation across nodes running in live production environments. Infrastructure decisions here affect machine behavior, safety boundaries, and production throughput. We're looking for someone who thinks in terms of system integrity and operability—not just pipelines—and who is motivated by making complex, real-world systems reliable at scale.

Requirements

  • 7+ years of industry experience building and operating complex, distributed systems, with demonstrated ownership of architecture and reliability outcomes.
  • Strong systems design experience, including designing software that runs across cloud services, edge nodes, and hardware-constrained environments.
  • Deep experience building and operating CI/CD systems and infrastructure-as-code platforms (e.g., Terraform, CloudFormation, Kubernetes), with a track record of making deployments safer and more predictable over time.
  • Practical experience with containerization and orchestration (Docker, Kubernetes, ECS/EKS or similar), and the operational realities that come with running them in production.
  • Comfort working across the stack — from Linux internals and networking to build systems (C/C++/CMake, Python) and application-level behavior — with the ability to debug issues that cross abstraction boundaries.
  • A strong observability mindset, including fluency with metrics, logging, tracing, and tooling such as Prometheus, Grafana, and OpenTelemetry.
  • Experience operating in hybrid environments that blend cloud infrastructure with on-prem or edge systems.
  • Solid computer systems fundamentals: operating systems, networking, concurrency, security principles, and an understanding of how software interacts with hardware and constrained environments.
  • First-principles thinking with sound instincts for strategic tradeoffs (e.g., latency, consistency, resource limits, failure domains) and designing solutions grounded in those constraints.
  • Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field (or equivalent experience).

Responsibilities

  • Design and operate the infrastructure that runs our manufacturing platform, using infrastructure-as-code to make environments reproducible, auditable, and safe to change.
  • Build and evolve CI/CD systems that move software from commit to fleet deployment, including artifact management, staged rollouts, version coordination across services and edge nodes, and fast, reliable rollback paths.
  • Enable hardware-in-the-loop and other device-connected testing environments, working directly with hardware and firmware teams to ensure software can be validated against real equipment before it reaches production.
  • Define and implement secure-by-default infrastructure patterns, including access controls, secrets management, image hardening, and regular reviews of deployment and runtime configurations.
  • Partner with application teams to turn ambiguous requirements and failure modes into concrete deployment architectures, reliability safeguards, and operable production systems.
  • Establish observability standards across services and edge systems—metrics, logs, traces, and alerting—so issues are detectable, diagnosable, and learnable rather than mysterious.
  • Lead by example: set technical direction through hands-on architecture work, thoughtful tradeoff decisions, and mentorship that raises the bar for how we design, ship, and operate systems.
  • Design and maintain internal platform tooling that enables engineers to deploy, test, and operate their systems independently—codifying best practices into paved roads that improve velocity, safety, and consistency across the organization.