Site Reliability Engineer - Remote

ICF, Reston, VA
Posted 2d ago | $108,476 - $184,409 | Remote

About The Position

ICF is a mission-driven company filled with people who care deeply about improving the lives of others and making the world a better place. Our core values include Embracing Difference; we seek candidates who are passionate about building a culture that encourages, embraces, and hires dimensions of difference.

Our Health Engineering Solutions (HES) team works side by side with customers to articulate a vision for success, and then make it happen. We know success doesn't happen by accident. It takes the right team of people, working together on the right solutions for the customer.

We are looking for a seasoned SRE to establish a culture of improvement in observability and reliability. You will work closely with software engineering teams to ensure that applications, databases, pipelines, and APIs run reliably. You will be expected to create, set, and exceed service level objectives as key indicators of application health. You will be working on a mission-critical software program whose goal is to support the ecosystem of the Centers for Medicare & Medicaid Services (CMS). Our core work hours are 10am - 4pm Eastern Time, with the option to start earlier or work later depending on your time zone.

Requirements

  • 5+ years of experience in a software development environment and a Bachelor’s degree; OR 3+ years of experience in a software development environment and a Master’s degree
  • 5+ years supporting a high‑availability production environment (cloud or on‑prem)
  • 3+ years of working in an SRE role in a large-scale cloud environment, implementing high availability and scalability
  • 3+ years of experience focused on SRE, DevOps, or Platform Engineering
  • Must be able to obtain and maintain a public trust clearance
  • Candidate must reside in the US, be authorized to work in the US, and work must be performed in the US
  • Must have lived in the US 3 full years out of the last 5 years
  • Cloud platform experience with AWS
  • Observability: CloudWatch, New Relic or similar
  • Infrastructure: Kubernetes, Docker
  • IaC: Terraform
  • CI/CD: Git, Jenkins or GitHub Actions
  • Database: SQL relational database
  • Docker: Thorough understanding of Docker and Docker Compose, including best practices, caching, volume mounts, etc.
  • Highly effective analytical, problem-solving, and decision-making capabilities.
  • Strong written and verbal communication skills
  • Ability to clearly articulate and communicate complex technical ideas to non-SRE colleagues.
  • Ability to understand project requirements and be innovative in finding solutions in highly regulated government environments.
  • Flexibility and the ability to accept a change in priorities as necessary.
  • Demonstrated time management skills.
  • Strong organizational skills with attention to detail.

Nice To Haves

  • Previous work in a regulated healthcare or federal agency environment
  • Full stack web development experience
  • Expertise in deployment techniques that minimize downtime, such as blue-green, canary, and A/B testing approaches, and zero-downtime deployments
  • Understanding of security groups and access controls
  • Experience with Atlassian tooling such as Jira and Confluence

Responsibilities

  • Define and maintain SLIs, SLOs, and SLAs for the Internet-based Quality Improvement and Evaluation System (iQIES) application.
  • Tune performance by modeling load scenarios, forecasting capacity, and optimizing scaling strategies
  • Design and optimize the observability stack through New Relic, CloudWatch, and Jenkins CI/CD pipelines
  • Participate in root cause analysis for operational issues and improve the incident response process
  • Participate in creating, monitoring, and optimizing actionable alerts to respond to issues in a timely manner
  • Develop tools and scripts
  • Develop and maintain Jenkins CI/CD pipelines, using declarative Jenkinsfiles and foundational Groovy for pipeline logic and enhancements
  • Deploy services to Fargate, EKS, Lambda, Airflow, and databases
  • Manage security groups and access controls. Thoroughly understand fundamentals such as security groups, IAM, and RDS management
  • Apply patch management and hardening practices
  • Align with DevOps and Technical Leads to ensure consistency with the overall strategy
  • Actively participate in releases and product launches, with the expectation of being online during release windows