Staff Site Reliability Engineer

Figure, Sunnyvale, CA
$175,000 - $250,000

About The Position

Figure is an AI robotics company developing autonomous general-purpose humanoid robots. The goal of the company is to ship humanoid robots with human-level intelligence. Its robots are engineered to perform a variety of tasks in the home and commercial markets. Figure is headquartered in San Jose, CA. We are looking for a Site Reliability Engineer to own our internal systems infrastructure. This role is responsible for setting up and managing cloud and on-prem infrastructure to deliver highly available, reliable, and automated systems.

Requirements

  • Strong experience with Linux/Unix systems administration
  • Proficiency in programming/scripting
  • Extensive experience with cloud platforms (Azure, AWS, GCP) and on-prem hardware architectures
  • Experience designing, deploying, and operating high-availability, fault-tolerant, and distributed systems
  • Mastery of infrastructure as code (Terraform, CloudFormation, Ansible…)
  • Familiarity with monitoring, logging, and alerting tools (Prometheus, Grafana, Datadog…)
  • Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancers, firewalls)
  • Experience defining Service Level Objectives (SLOs), developing runbooks and incident response plans, facilitating post-mortems, and managing systems assets
  • Ability to work in cross-functional teams with developers, infra, and product teams
  • Excellent verbal and written communication skills

Responsibilities

  • Be the go-to person for mission-critical infrastructure enabling critical operations such as Source Configuration Management, CI/CD systems, software distribution, supplier portals, manufacturing, and more.
  • Migrate SaaS to self-hosted solutions to enhance security and reliability.
  • Implement monitoring and alerting systems, and define incident response plans and runbooks.
  • Reduce human workload by automating deployment and scaling.
  • Establish strong relationships with stakeholders to identify infrastructure needs and establish Service Level Objectives.
  • Use a data-driven approach to demonstrate service robustness and track optimization work.
  • Partner with the security team to ensure that security remediations and updates are applied in a timely manner.