Manager, Engineering Ops (Devops)

Talent Systems
$170,000 - $180,000 | Hybrid

About The Position

We are seeking an experienced Head of Engineering, Engineering Operations to lead our engineering operations, which includes areas such as DevOps, Site Reliability Engineering (SRE), CI/CD, and release management for our cloud-based systems and applications. This role is pivotal in ensuring the reliability, security, scalability, and availability of our systems while driving innovation in automation, CI/CD pipelines, and operational efficiency. You will be responsible for crisis management, improving system performance, managing cost, and fostering a culture of operational excellence.

Requirements

  • 10+ years of experience in software engineering, with 5+ years in leadership roles.
  • Proven track record of improving system reliability, availability, and performance for cloud-based applications.
  • Extensive experience with CI/CD pipelines and automation tools.
  • Demonstrated expertise in crisis management and incident response in high-pressure environments.
  • Deep knowledge of cloud platforms (such as AWS) and container and orchestration tools (Docker, Kubernetes).
  • Strong proficiency in monitoring and observability tools like Grafana.
  • Excellent problem-solving and decision-making skills under pressure.
  • Exceptional communication and collaboration skills, with the ability to influence stakeholders across engineering and business teams.
  • Proven ability to lead and grow high-performing teams in a fast-paced environment.
  • A strong focus on fostering a culture of accountability, learning, and operational excellence.
  • Ability to influence partner engineering teams such as platform and product engineering.

Responsibilities

  • Lead and mentor teams in DevOps, SRE, and Engineering Operations, fostering a culture of collaboration, ownership, and innovation.
  • Develop and execute the strategic roadmap for engineering operations, aligning with business goals and product requirements.
  • Advocate for and implement industry best practices in system reliability, DevOps, and automation.
  • Drive initiatives to improve the reliability, availability, and performance of cloud-based applications and infrastructure.
  • Establish performance measurements for various system health metrics.
  • Ensure robust incident management and crisis response processes to minimize downtime and customer impact.
  • Oversee the design, implementation, and optimization of CI/CD pipelines to enable seamless and automated deployment processes.
  • Leverage automation tools and practices to reduce manual interventions and improve operational efficiency.
  • Collaborate with product and engineering teams to enable rapid and reliable feature delivery.
  • Implement and maintain advanced monitoring, logging, and alerting systems to gain deep insights into system health and performance.
  • Use observability tools (e.g., Grafana) to proactively identify and resolve issues before they impact customers.
  • Lead crisis management efforts during high-severity incidents, ensuring quick resolution and effective communication with stakeholders.
  • Conduct root cause analyses and drive post-mortem reviews to identify and address operational gaps.
  • Build, grow, and retain a high-performing engineering operations team with expertise in DevOps and SRE practices across multiple geographic locations.
  • Foster close collaboration with development, data, and product teams to align engineering operations with overall business objectives.
  • Promote a blameless post-mortem culture to encourage continuous learning and improvement.
  • Optimize cloud infrastructure costs while maintaining system reliability and scalability.
  • Implement robust security practices in operations to ensure compliance with industry standards and regulations.

Benefits

  • Bonus
  • Benefits