About The Position

Peraton is seeking a Site Reliability Engineer (SRE), Manager: a highly experienced professional responsible for ensuring the availability, reliability, and performance of complex systems in a multi-vendor environment. This role combines deep technical expertise in infrastructure, automation, and system architecture with leadership and collaboration skills to drive reliability frameworks, proactive monitoring, and incident response across diverse platforms and teams.

The Site Reliability Engineer, Manager operates with significant autonomy, architecting solutions that enhance system observability, scalability, and fault tolerance. They lead reliability initiatives, mentor engineering teams, and collaborate with multiple vendors and internal stakeholders to align reliability strategies with business objectives and customer needs. This role is ideal for a highly skilled engineer who excels in technical leadership, complex system architecture, and multi-stakeholder environments. Principal Site Reliability Engineers are key to building resilient systems that scale efficiently while minimizing downtime and risk.

This opportunity will support the modernization of a large-scale, multi-tenant cloud ecosystem, providing critical enterprise-wide support for more than 40 million users in a complex stakeholder environment. This position requires senior-level leadership skills combined with modern cloud and industry-leading technical capabilities, including product development, strict security compliance, the latest cloud technologies, reliable application delivery with SaaS and Artificial Intelligence integrations, and rapid continuous delivery.

Requirements

  • Extensive experience (10+ years) in site reliability engineering or related roles, preferably in multi-vendor and complex environments.
  • Deep knowledge of cloud-native infrastructure, container orchestration (e.g., Kubernetes), and automation tools such as Terraform, Ansible, or Chef.
  • Proficiency in observability technologies, such as Prometheus, Grafana, OpenTelemetry, log aggregation systems, etc.
  • Strong programming and scripting skills for automation and tooling (Python, Go, or similar).
  • Expertise in defining and implementing SLIs, SLOs, and error budgets.
  • Excellent communication skills for collaboration with diverse teams and external vendors.
  • Proven ability to lead large-scale reliability initiatives and mentor engineering teams.
  • Strategic thinker with a focus on aligning reliability engineering with business priorities and customer experience.
  • U.S. citizenship required.
  • Ability to obtain agency clearance (public trust).

Nice To Haves

  • Top Secret clearance preferred

Responsibilities

  • Design, implement, and oversee reliability frameworks, including SLOs, error budgets, and automated incident response systems.
  • Develop and maintain CI/CD pipelines to ensure seamless deployment and procedural efficiency.
  • Lead the creation and enhancement of observability platforms using metrics, logging, and tracing tools.
  • Utilize modern technologies like OpenTelemetry, AI/ML for anomaly detection, and streaming data platforms to proactively detect and resolve issues.
  • Coordinate with external vendors and internal teams to integrate and manage diverse systems and tools.
  • Ensure consistent reliability standards and practices are maintained across different technology stacks and service providers.
  • Drive incident response strategy by leading root cause analysis, post-mortem reviews, and continuous improvement efforts.
  • Identify potential risks and implement mitigation strategies to prevent service disruptions.
  • Mentor site reliability and engineering teams, fostering a culture of reliability, automation, and continuous learning.
  • Advocate for best practices in system design and reliability engineering.
  • Work closely with product development, DevOps, and security teams to integrate reliability into the software development lifecycle.
  • Influence platform strategy and roadmap based on reliability insights.
  • Collaborate with senior stakeholders and vendors on long-term reliability goals.
  • Prepare executive-level presentations that translate technical challenges into business impact.
  • Lead and refine agile workflows to enhance team productivity and reliability outcomes.
  • Champion DevOps methodologies to align development and cloud services efforts.
  • Support and work across multiple enterprise-wide efforts within Peraton.

Benefits

  • Overtime eligibility
  • Shift differential eligibility
  • Discretionary bonus