About The Position

We are seeking a Director of Site Reliability Architect (SRA) to design and implement scalable, reliable, and efficient systems that support the organization's software applications and services. As a key technical leader, you will work closely with development, operations, and product teams to ensure that systems are designed with reliability, performance, and scalability in mind. You will also play a crucial role in establishing best practices for site reliability engineering (SRE) and fostering a culture of operational excellence. This is a full-time, hybrid role based in Lehi, UT.

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  • 8+ years of experience in software engineering, systems engineering, or site reliability engineering.
  • Strong understanding of cloud computing platforms (e.g., AWS, Azure, Google Cloud) and container and orchestration technologies (e.g., Docker, Kubernetes).
  • Experience with configuration management and automation tools (e.g., Terraform, Ansible, Puppet).
  • Proficiency in programming and scripting languages (e.g., Python, Go, Bash) for automation and tool development.
  • Extensive knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack) and practices.
  • Solid understanding of networking concepts, distributed systems, and microservices architecture.
  • Excellent problem-solving skills and the ability to work effectively under pressure.

Responsibilities

  • Design and implement robust, scalable, and high-availability systems that meet business and technical requirements.
  • Collaborate with software engineering teams to integrate reliability into the software development lifecycle, ensuring that applications are built with operational excellence in mind.
  • Develop and maintain service level objectives (SLOs), service level agreements (SLAs), and service level indicators (SLIs) to measure system performance and reliability.
  • Lead incident response efforts, including post-mortem analysis and root cause investigations, to improve system reliability and prevent future incidents.
  • Automate operational processes to improve efficiency and reduce manual intervention, leveraging tools and technologies such as Infrastructure as Code (IaC).
  • Monitor system performance and reliability using appropriate metrics and monitoring tools, proactively identifying and addressing potential issues.
  • Advocate for and implement best practices in site reliability engineering, including capacity planning, disaster recovery, and incident management.
  • Train and mentor engineering and operations teams on SRE principles and practices, fostering a culture of continuous improvement.

Benefits

  • Unlimited PTO
  • Paid Holidays
  • Onsite Fitness Center
  • Company Paid Life Insurance
  • Casual Dress Code
  • Competitive Pay
  • Health, Vision, and Dental Insurance
  • 401(k) match: Pattern matches 100% of the first 3% of eligible compensation deferred and 50% of the next 2%, for a maximum match of 4% of eligible compensation.