Site Reliability Engineer

Karsun Solutions
Herndon, VA
109d
$150,000 - $170,000

About The Position

Join Karsun Solutions to grow your career with the company transforming possible for the US Government. At Karsun, collaboration drives our community. We're committed to building an environment where team members from diverse backgrounds can innovate, learn, and grow with us. Here at Karsun, the only limit to your potential is the limit of your curiosity. As a Site Reliability Engineer, you will help build out and run production environments, automate operations, and maintain and support infrastructure. You will also establish and drive service level objectives (SLOs) and metrics to meet the reliability expectations of multiple applications.

Requirements

  • Bachelor's degree in computer science, engineering, or a related field and 8-10 years of relevant experience
  • 5+ years of experience supporting operations and maintenance for cloud-native applications in production that are fault-tolerant, self-healing, scalable, and highly available
  • Deep understanding of cloud computing platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Kubernetes)
  • Experience with monitoring, logging, and observability tools such as Datadog, AWS CloudWatch, ELK, Prometheus, and Splunk
  • Knowledge of infrastructure as code tools (e.g., Terraform, Ansible, ArgoCD) and CI/CD pipelines
  • Experience deploying enterprise software on AWS services such as EKS, RDS, EC2, Elastic Load Balancers, Lambda, DynamoDB, and API Gateway, including multi-region deployments
  • Strong problem-solving and analytical skills, with keen attention to detail
  • Ability to obtain and maintain a Public Trust clearance

Nice To Haves

  • Understanding of modern architectures (e.g., microservices, EDA) and caution against overcomplexity and overengineering
  • Experience with monitoring and metrics platforms (e.g., New Relic, Prometheus, InfluxDB, Grafana, Splunk)
  • Experience designing and operating distributed systems and cloud infrastructure at scale
  • Located in the Eastern, Central, or Mountain time zone
  • Experience supporting US federal government contracts

Responsibilities

  • Deploy and manage applications on Kubernetes container platforms such as AWS EKS or OpenShift
  • Monitor systems and applications, proactively identifying and resolving performance bottlenecks and availability issues
  • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
  • Implement and support integrated CI/CD pipelines for on-premises and/or cloud assets using tools such as Jenkins, GitHub/Bitbucket, Nexus/Artifactory
  • Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents
  • Implement, deploy, and maintain infrastructure as code (IaC) for provisioning using AWS CloudFormation or Terraform
  • Maintain, monitor, and improve application configurations using tools such as Ansible, Packer, Puppet, or Chef
  • Design, build, and maintain automated monitoring, metrics, and notification services that support fault-tolerant, highly available systems, using tools such as AWS CloudWatch, EFK, and Prometheus