Lead Site Reliability Engineer (SRE)

Capital One, McLean, VA

About The Position

Do you love building and pioneering in the technology space? Do you enjoy solving complex technical problems in a fast-paced, collaborative, inclusive, and iterative delivery environment? At Capital One, you'll be part of a big group of makers, breakers, doers, and disruptors who love to solve real problems and meet real customer needs.

As a Site Reliability Engineer (SRE), you'll tap into your passion for proactively finding and fixing inefficiencies to solve our reliability and performance issues. You'll focus on availability, latency, performance, efficiency, change and problem management, monitoring, emergency response, and capacity planning of our services. Your projects will deliver enhanced infrastructure, development, and deployment automation at Capital One.

A Lead SRE is competent in multiple contexts, such as programming languages, security, automation, testing, infrastructure, and performance, and is the go-to person for many people inside and outside of their team.

Requirements

  • High School Diploma, GED, or equivalent certification
  • At least 6 years of experience using build and deployment tools (Jenkins, GitHub, or Artifactory)
  • At least 4 years of experience with AWS
  • At least 2 years of team leadership experience

Nice To Haves

  • 5+ years of experience with AWS
  • 2+ years of experience in Agile practices
  • Experience with SRE design for reliability and resiliency, targeting five-nines (99.999%) availability
  • Experience working in a cloud environment (OCP and AWS EMR)
  • Experience with application monitoring tools, observability, and performance assessments
  • Experience with DevOps (CI/CD pipelines with Jenkins or similar; Git/GitHub)
  • Experience developing automation solutions in Python (or other similar languages)
  • Comfortable with production environments, firewalls, and networking
  • Experience with networking such as routing, load balancers, and VPC
  • Experience with Docker and Kubernetes
  • Experience deploying, observing, alerting on, logging, and monitoring systems (Splunk, Datadog, New Relic) with a mindset towards predictive analysis
  • Working knowledge of the Incident Management process

Responsibilities

  • Guide site reliability automation to help eliminate manual toil and create a self-healing capability.
  • Foster a culture of excellence and continuous learning within the chapter.
  • Establish and track appropriate OKRs to ensure outcomes are met.
  • Create solutions addressing high-impact technology and business priorities.
  • Proactively identify and mitigate issues based on intuition and experience in multiple domains.

Benefits

  • Capital One offers a comprehensive, competitive, and inclusive set of health, financial and other benefits that support your total well-being.
  • Learn more at the Capital One Careers website.
  • Eligibility varies based on full- or part-time status, exempt or non-exempt status, and management level.
  • This role is also eligible to earn performance-based incentive compensation, which may include cash bonus(es) and/or long-term incentives (LTI).