Senior/Staff Software Engineer (SRE)

Chainguard
$144,000 - $200,000

About The Position

As a Site Reliability Engineer, you will help us design, automate, and scale secure‑by‑default cloud infrastructure so uptime stays exciting and on‑call stays uneventful. We are seeking an experienced SRE to join our team to develop and maintain cloud-based infrastructure. You will be responsible for designing, building, and scaling robust infrastructure, including observability, metrics, and alerting. You will also ensure our work is sustainable by promoting best practices around deployment, incident management, and disaster recovery.

Requirements

  • Comfortable working and thriving within a Linux ecosystem
  • Experience supporting high availability distributed production systems
  • Experience with database administration and support
  • Experience treating infrastructure as code using tools like Terraform, Ansible, Chef, Puppet, or SaltStack
  • Familiarity working in a public cloud platform (GCP, AWS, Azure)
  • Software development skills in at least one of the following languages: Python, Go, JavaScript, or Ruby
  • B.S. or M.S. in Computer Science or a related field, or equivalent related work experience
  • Strong English language skills and the ability to work both independently and as an effective part of a globally distributed team
  • Ability to learn about the supply chain security space

Nice To Haves

  • Experience scaling services in a performant and cost-effective manner
  • Experience implementing incident management and disaster recovery playbooks
  • Knowledge of microservices architecture and containerization (Docker/OCI, Kubernetes)
  • Familiarity across multiple public cloud platforms (GCP, AWS, Azure)
  • Experience operating a multi-tenant-capable software-defined network (SDN)
  • Linux systems troubleshooting and debugging skills
  • Solid understanding of data structures, algorithms, API design, and software design patterns
  • Interest in open source software projects and communities

Responsibilities

  • Practice continuous improvement by iterating on how services are deployed, configured, monitored, and maintained on our platform
  • Lead incident response, diagnosis, and follow-up on system outages and alerts
  • Help develop an operational focus and act as a thought leader for the rest of engineering
  • Maintain and optimize infrastructure for performance, scalability, and cost
  • Analyze system metrics and identify opportunities for improvement in reliability and efficiency

Benefits

  • Flexible & Remote-First Culture: Work remotely with team meetup opportunities, bi-annual destination summits, and a monthly stipend for coworking spaces, phone and internet costs
  • Our Approach to Equity: Receive stock options upon hire and promotion. Plus, you can participate in secondary offerings and have 10 years to exercise your options
  • 100% Covered Health Insurance: We cover 100% of your health, vision and dental insurance premiums for you and your dependents
  • ∞ Flexible Time Off: Take the time you need to recharge and reset
  • 18 Weeks Paid Parental Leave: 18 weeks for birthing parents and 12 weeks for non-birthing parents