Staff Software Engineer, Site Reliability (SRE)

Character.AI, San Francisco, CA

About The Position

As one of the founding members of our Site Reliability Engineering function at Character, you’ll support an infrastructure of thousands of nodes and terabytes of data that serves millions of daily active users. You’ll be responsible for ensuring our product's reliability, scalability, and performance as we aggressively grow our user base toward a goal of 3 billion users. You’ll work closely with our development team to design and implement processes and systems that ensure the stability and availability of our service.

Requirements

  • 5+ years of experience in a development-focused DevOps/SRE role within a technology organization operating at significant scale
  • Deep experience with and proven success in developing software tools and automation wherever needed using Python and Golang
  • Expertise with SQL, Linux, CI/CD, Kubernetes, and Terraform to support a site/application within a large multi-node infrastructure and a growing user base
  • Experience working with multiple cloud computing platforms, such as GCP, is required
  • Demonstrated ability to troubleshoot technical issues successfully and reliably across a range of platforms and systems
  • Experience with incident management and event postmortems

Nice To Haves

  • Familiarity with GPU clusters and/or HPC environments
  • Experience with monitoring and logging tools such as Prometheus and Grafana
  • Hands-on experience scaling a consumer product from early days into hypergrowth

Responsibilities

  • Maintain production services and keep them operational.
  • Develop tools, instrumentation, and automation to monitor and optimize the performance and reliability of our service.
  • Develop, implement and maintain automation tools and processes to prevent and mitigate service disruptions.
  • Collaborate with development teams to design and implement scalable, reliable systems and CI/CD processes for deployment.
  • Establish and support SLAs and SLOs for our site.
  • Provide system monitoring and incident alerts.
  • Participate in on-call rotations to provide support for critical incidents and outages.
  • Develop plans for site reliability and disaster recovery.