SRE Lead

Leidos

1d•Hybrid

About The Position

Leidos was awarded the U.S. Air Force Cloud One Architecture and Common Shared Services contract and currently has an opening for an SRE Lead. This is an exciting opportunity to use your experience to modernize a leading, global-scale multi-cloud environment in support of a critical mission: USAF system resiliency, security, and cost effectiveness.

Location: These positions will be hybrid remote. Candidates will be required to work onsite as needed. Preferred candidates will be located near Hanscom AFB (Boston, MA) or work in Huntsville, AL.

This individual will be responsible for developing scalable cloud-native solutions and ensuring best practices across architecture, development, deployment, and security. This is a hands-on technical role that requires rolling up your sleeves to architect, code, debug, and mentor. Primary responsibilities could include those listed in the Responsibilities section below.

Requirements

Bachelor's degree and twelve (12) or more years of experience, or Master's degree and ten (10) or more years of experience. Additional experience may be accepted in lieu of a degree.
Secret clearance required
US citizenship required
Certifications: CompTIA Security+ or equivalent (IAT-2)
Familiarity with DevSecOps principles and practices.
Familiarity with Agile methodologies such as Scrum and/or Kanban.
Experience creating Jira and/or Azure DevOps workflows, projects, and custom configurations.
Solid experience integrating and maintaining third-party CI/CD tools such as Jenkins and GitLab.
Experience with automated provisioning and configuration tools such as Terraform, CloudFormation, Ansible, or similar technologies.
Working knowledge of the Risk Management Framework (RMF) and DISA STIGs.
Strong knowledge of security principles, including threat modeling, vulnerability assessments, and encryption techniques.
Ability to lead teams through strategic initiatives such as reliability maturity assessments, process automation, and tooling selection.
Solid understanding of SRE principles, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgeting.
Experience with commercial cloud infrastructure deployment environments such as AWS and Azure.
Strong knowledge of automation tools, CI/CD pipelines, and Infrastructure as Code (IaC).

Nice To Haves

Experience with USAF Cloud One or Platform One
Experience with Zero Trust Architecture
Cloud certifications in AWS, Azure, Google, or Oracle clouds
Example industry professional certifications include: Certified Kubernetes Application Developer (CKAD), Kubernetes and Cloud Native Associate (KCNA), AWS Certified DevOps Engineer, Certified AWS SysAdmin, AWS Certified Advanced Networking, AWS Certified Security, Azure Developer Associate, Azure Solutions Architect

Responsibilities

Manage and mentor the SRE team (pods) and FTEs, providing guidance, setting performance expectations, and fostering professional development.
Work collaboratively with SRE Resource Managers to staff and maintain engineering resources for your SRE vertical teams' reliability and scalability goals.
Manage the SRE team’s resources, including tool selection and infrastructure investments, to meet reliability and scalability needs.
Meet regularly with your team members, participate in performance reviews and interviews, and contribute to development planning.
Oversee the reliability, availability, and performance of critical systems by leading the SRE teams within the data center vertical in implementing monitoring, incident response, and performance optimization strategies.
Ensure the team adheres to best practices for system reliability, automation, and operational efficiency.
Drive continuous improvement initiatives by analyzing performance metrics (e.g., SLOs, MTTR, MTBF) and identifying areas for enhancement.
Collaborate with operations, quality, cybersecurity and other SRE engineering teams to define and enforce Service Level Objectives (SLOs) and manage error budgets.
Act as a liaison between the SRE team and other departments to prioritize reliability and operational needs in the product development process.
Collaborate with senior leadership to define the SRE strategy, set long-term reliability goals, and ensure alignment with business objectives.
Lead efforts to reduce operational toil through automation. Work with the team to build or enhance automation tools that manage infrastructure, monitor systems, and respond to incidents.
Oversee the development and adoption of Infrastructure as Code (IaC) tools, CI/CD pipelines, and other automation processes.
Ensure that SRE practices align with organizational security policies and compliance requirements.
Collaborate with security teams to integrate reliability-focused security practices into the design and operation of systems.
Ensure systems meet or exceed agreed-upon service levels by proactively addressing potential issues and working with stakeholders to align on reliability expectations.
Work within an SRE team, collaborating with other Developers, Security, and Operations, to continuously deliver products and increase the value stream for the organization and customers.
Embrace and champion Agile development processes and the adoption of modern Site Reliability Engineering workflows and practices, while providing technical guidance on best practices to team members and coworkers.
Stay up to date on the latest Site Reliability Engineering practices and technologies.
Strive to provide internal and external customers with excellent, world-class service.
Resolve most conflicts between timeline, budget, and scope independently, but use judgment to raise complex or consequential issues to senior management.

Benefits

Pay and benefits are fundamental to any career decision. That's why we craft compensation packages that reflect the importance of the work we do for our customers.
Employment benefits include competitive compensation, Health and Wellness programs, Income Protection, Paid Leave and Retirement.
