Cloud Operations Lead and SRE Manager

Empower AI Inc.•Camp Springs, MD

Onsite

About The Position

Empower AI is AI for government. Empower AI gives federal agency leaders the tools to elevate the potential of their workforce with a direct path for meaningful transformation. Headquartered in Reston, Va., Empower AI leverages three decades of experience solving complex challenges in Health, Defense, and Civilian missions. Our proven Empower AI Platform® provides a practical, sustainable path for clients to achieve transformation that is true to who they are, what they do, how they work, with the resources they have. The result is a government workforce that is exponentially more creative and productive. For more information, visit www.Empower.ai. Empower AI is proud to be recognized as a 2024 Military Friendly Employer by Viqtory, the publisher of G.I. Jobs. This designation reflects the company’s commitment to hiring and supporting active-duty and veteran employees.

The Cloud Operations Lead / SRE Manager (Cloud/SRE Mgr) provides enterprise-level operational management of cloud operations and Site Reliability Engineering (SRE) leadership for the Department of Homeland Security (DHS), U.S. Citizenship and Immigration Services (USCIS) information technology (IT) infrastructure. USCIS has over 27,000 Government employees and contractors working at over 250 offices worldwide. The USCIS Enterprise Infrastructure Division (EID) of the Office of Information Technology (OIT) provides IT infrastructure engineering, design, testing, implementation, and operational support services for all USCIS enterprise components, including networks, server rooms, data storage, telecommunications, video conferencing services, and infrastructure security.

The Cloud/SRE Mgr directly supports EID to coordinate, direct, manage, and oversee the design, development, integration, standards, operation, and maintenance of cloud operations and SRE for the enterprise IT infrastructure that supports USCIS operations. The Cloud/SRE Mgr shall oversee the Cloud Operation Team (est. 5 technicians) responsible for executing cloud operations of the USCIS IT infrastructure. This position is responsible for the reliability, availability, and operational excellence of USCIS cloud platforms. This role combines hands-on technical leadership with people management. The Cloud/SRE Mgr will also apply SRE principles to ensure highly available, secure, and compliant production systems. The ideal candidate brings a strong background in cloud infrastructure, automation, and DevOps, paired with proven experience leading operational teams, managing incidents, and driving reliability at scale in regulated environments.

Requirements

Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
10+ years of experience in cloud engineering, SRE, or cloud operations roles, with deep hands-on expertise in AWS.
Proven experience operating production, mission-critical systems at scale.
Strong background in automation and IaC (Terraform and/or CloudFormation).
Experience with CI/CD pipelines (e.g., AWS CodePipeline, GitLab, Jenkins).
Deep knowledge of cloud security, IAM, encryption, monitoring, and compliance frameworks.
Experience designing for high availability, disaster recovery, and fault tolerance.
Excellent communication skills with the ability to influence technical and non-technical stakeholders.

Nice To Haves

Prior experience managing SRE or Cloud Operations teams.
Experience supporting regulated or government environments (FedRAMP, SOC 2, ISO 27001).
Familiarity with SRE practices such as error budgets, capacity planning, and toil reduction.
Multi-cloud awareness (Azure, GCP).
Strong scripting or programming skills (e.g., Python, Bash).

Responsibilities

Own the reliability, availability, and performance of production cloud platforms and services.
Define, monitor, and improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for critical systems.
Lead incident response, including coordination during outages, root cause analysis, and blameless postmortems.
Establish and manage on-call rotations, escalation paths, and operational readiness standards.
Drive continuous reduction of operational toil through automation and process improvement.
Design and operate secure, scalable, and highly available AWS infrastructure, including multi-AZ and multi-region architectures.
Ensure platforms are resilient, fault-tolerant, and aligned with disaster recovery and business continuity requirements.
Partner with application teams to ensure production readiness and reliability by design.
Implement and enforce cloud security best practices, including IAM, encryption, logging, and audit controls.
Ensure compliance with government and regulatory frameworks such as FedRAMP.
Collaborate closely with security and compliance stakeholders to meet accreditation and audit requirements.
Lead development of infrastructure-as-code (IaC) using Terraform and/or AWS CloudFormation.
Build and maintain CI/CD pipelines supporting reliable, repeatable deployments.
Design and operate monitoring, alerting, logging, and observability solutions to ensure actionable insights and reduce alert fatigue.
Lead, mentor, and develop a team of Cloud / SRE engineers.
Support hiring, onboarding, performance feedback, and career growth.
Set technical direction, operational priorities, and reliability goals for the team.
Foster a culture of ownership, learning, and continuous improvement.
Partner with development, security, compliance, and business stakeholders to align reliability goals with delivery timelines.
Communicate reliability risks, incident outcomes, and improvement plans to senior leadership.
Produce and maintain clear operational documentation, runbooks, and architectural standards.