About The Position

Peraton is seeking a Site Reliability Engineer (SRE), Supervisor: an experienced technical professional responsible for ensuring the availability, performance, and scalability of complex software systems and infrastructure. This role blends software engineering and systems administration expertise to design, automate, and maintain resilient production environments that support business-critical applications. The Site Reliability Engineer, Supervisor works closely with development, infrastructure, and product teams to build reliability frameworks, improve observability, and drive system health improvements. The role leads efforts to automate manual processes, manage incident response, and implement service level objectives (SLOs) to maintain a seamless user experience.

This position is ideal for a technically skilled and collaborative professional eager to lead reliability initiatives in complex environments, ensuring architectural and technical excellence and high service availability. It supports the modernization of a large-scale, multi-tenant cloud ecosystem that provides critical enterprise-wide support for more than 40 million users in a complex stakeholder environment. The position requires senior-level leadership skills combined with modern cloud and industry-leading technical capabilities, including product development, strict security compliance, the latest cloud technologies, reliable application delivery with SaaS and artificial intelligence integrations, and rapid continuous delivery.

Requirements

  • 6 years of experience; may include lead experience.
  • Strong software engineering background with proficiency in languages such as Python, Go, or similar.
  • Deep understanding of distributed systems, cloud infrastructure (AWS, Azure, GCP), container orchestration (Kubernetes), and monitoring tools (Prometheus, Grafana, OpenTelemetry).
  • Experience defining and implementing SLOs, SLIs, and error budgets to measure and maintain service reliability.
  • Excellent problem-solving skills with a proactive approach to incident prevention and resolution.
  • Strong communication skills to effectively collaborate with diverse teams and present reliability insights.
  • 5+ years of experience in site reliability engineering, systems engineering, or related roles with a proven track record of delivering scalable, reliable systems.
  • U.S. Citizenship required
  • Ability to obtain agency clearance (public trust)

Nice To Haves

  • Top Secret clearance preferred

Responsibilities

  • Ensure high availability and responsiveness of services by designing and implementing monitoring, alerting, and automated remediation tools.
  • Analyze system metrics and logs to identify areas for improvement and optimize system performance.
  • Develop and maintain scripts, configuration management, and infrastructure-as-code to automate deployment, scaling, and management of infrastructure.
  • Lead efforts to reduce toil through automation and reliability engineering best practices.
  • Participate in on-call rotations to respond to incidents promptly.
  • Conduct thorough root cause analysis and collaborate with engineering teams to implement preventive measures.
  • Partner with software developers, product managers, and infrastructure teams to embed reliability into the software development lifecycle.
  • Provide guidance on system architecture, capacity planning, and disaster recovery strategies.
  • Mentor junior SREs and engineers on reliability engineering principles, tools, and technical excellence.
  • Lead by example in coding standards, system design, and incident response.
  • Articulate technical issues and reliability impacts to non-technical stakeholders.
  • Drive alignment on priorities and continuous improvements across teams.
  • Lead reliability-related projects and initiatives, managing timelines, resources, and stakeholder communication to deliver impactful results.
  • Promote agile practices to enhance team efficiency.
  • Advocate for continuous learning and process refinement in system reliability.

Benefits

  • Eligible for overtime
  • Eligible for shift differential
  • Eligible for a discretionary bonus in addition to base pay