Site Reliability Engineer - Remote

ICF•Reston, VA

12d•$108,476 - $184,409•Remote

About The Position

ICF is a mission-driven company filled with people who care deeply about improving the lives of others and making the world a better place. Our core values include Embracing Difference; we seek candidates who are passionate about building a culture that encourages, embraces, and hires dimensions of difference. Our Health Engineering Solutions (HES) team works side by side with customers to articulate a vision for success, and then make it happen. We know success doesn't happen by accident. It takes the right team of people, working together on the right solutions for the customer.

We are looking for a seasoned SRE to establish a culture of improvement in observability and reliability. You will work closely with software engineering teams to ensure that applications, databases, pipelines, and APIs run reliably. You will be expected to create, set, and exceed service level objectives as key indicators of application health. You will be working on a mission-critical software program whose goal is to support the ecosystem of the Centers for Medicare & Medicaid Services (CMS).

Our core work hours are 10am - 4pm Eastern Time, with the option to start earlier or work later depending on your time zone.

Requirements

5+ years of experience in a software development environment and a Bachelor’s degree; OR 3+ years of experience in a software development environment and a Master’s degree
5+ years supporting a high‑availability production environment (cloud or on‑prem)
3+ years working in an SRE role in a large-scale cloud environment, implementing high availability and scalability
3+ years of experience focused on SRE, DevOps, or Platform Engineering
Must be able to obtain and maintain a public trust clearance
Candidate must reside in the US, be authorized to work in the US, and work must be performed in the US
Must have lived in the US 3 full years out of the last 5 years
Cloud platform experience with AWS
Observability: CloudWatch, New Relic or similar
Infrastructure: Kubernetes, Docker
IaC: Terraform
CI/CD: Git, Jenkins or GitHub Actions
Database: SQL relational database
Docker: Thorough understanding of Docker and Docker Compose, including best practices, caching, and volume mounts
Highly effective analytical, problem-solving, and decision-making capabilities.
Strong written and verbal communication skills
Ability to clearly articulate and communicate complex technical ideas to non-SRE colleagues.
Ability to understand project requirements and be innovative in finding solutions in highly regulated government environments.
Flexibility and the ability to accept a change in priorities as necessary.
Demonstrated time management skills.
Strong organizational skills with attention to detail.

Nice To Haves

Previous work in a regulated healthcare or federal agency environment
Full stack web development experience
Expertise in deployment techniques that minimize downtime, such as blue-green, canary, A/B testing, and zero-downtime deployments
Understanding of security groups and access controls
Experience with Atlassian tooling such as Jira and Confluence

Responsibilities

Define and maintain SLIs, SLOs, and SLAs for the Internet-based Quality Improvement and Evaluation System (iQIES) application.
Tune performance by modeling load scenarios, forecasting capacity, and optimizing scaling strategies
Design and optimize the observability stack using New Relic, CloudWatch, and Jenkins CI/CD pipelines
Participate in root cause analysis for operational issues and improve the incident response process
Participate in creating, monitoring, and optimizing actionable alerts to respond to issues in a timely manner
Develop tools and scripts
Develop and maintain Jenkins CI/CD pipelines, using declarative Jenkinsfiles and foundational Groovy for pipeline logic and enhancements
Deploy services to Fargate, EKS, Lambda, Airflow, and databases
Manage security groups and access controls, with a thorough understanding of fundamentals such as security groups, IAM, and RDS management
Apply patch management and hardening practices
Coordinate with DevOps and Technical Leads to stay aligned with the overall strategy
Actively participate in releases and product launches, with the expectation of being online during release windows