Lead Site Reliability Engineer (SRE)

Capital One • McLean, VA

About The Position

Do you love building and pioneering in the technology space? Do you enjoy solving complex technical problems in a fast-paced, collaborative, inclusive, and iterative delivery environment? At Capital One, you'll be part of a big group of makers, breakers, doers and disruptors who love to solve real problems and meet real customer needs.

As a Site Reliability Engineer (SRE), you'll tap into your passion for proactively finding and fixing inefficiencies to solve our reliability and performance issues. You'll focus on availability, latency, performance, efficiency, change and problem management, monitoring, emergency response, and capacity planning of our services. Your projects will deliver enhanced infrastructure, development, and deployment automation at Capital One.

In this role, you are competent in multiple contexts, such as programming languages, security, automation, testing, infrastructure, and performance, and you are the go-to person for many people inside and outside of your team.

Requirements

High School Diploma, GED, or equivalent certification
At least 6 years of experience using build and deployment tools (Jenkins, GitHub, or Artifactory)
At least 4 years of experience with AWS
At least 2 years of team leadership experience

Nice To Haves

5+ years of experience with AWS
2+ years of experience in Agile practices
Experience with SRE design to address reliability and resiliency, targeting five-nines (99.999%) availability
Experience working in a cloud environment (OCP and AWS EMR)
Experience with application monitoring tools, observability, and performance assessments.
Experience with DevOps (CI/CD pipelines with Jenkins or similar; Git/GitHub)
Experience developing automation solutions in Python (or other similar languages)
Comfortable with production environments, firewalls, and networking
Experience with networking such as routing, load balancers, and VPC
Experience with Docker and Kubernetes.
Experience deploying, observing, alerting, logging, and monitoring systems (Splunk, Datadog, New Relic) with a mindset towards predictive analysis.
Working knowledge of the Incident Management process.

Responsibilities

Guide site reliability automation to help eliminate manual toil and create a self-healing capability.
Foster a culture of excellence and continuous learning within the chapter.
Establish and track appropriate OKRs to ensure outcomes are met.
Create solutions addressing high-impact technology and business priorities.
Proactively identify and mitigate issues based on intuition and experience in multiple domains.

Benefits

Capital One offers a comprehensive, competitive, and inclusive set of health, financial and other benefits that support your total well-being.
Learn more at the Capital One Careers website.
Eligibility varies based on full or part-time status, exempt or non-exempt status, and management level.
This role is also eligible to earn performance-based incentive compensation, which may include cash bonus(es) and/or long-term incentives (LTI).
