Director, Site Reliability Engineering - Digital Assets

Fidelity•Jersey City, NJ

1d•$126,000 - $255,000•Hybrid

About The Position

The Role

As a Director within the TechOps SRE team, you'll work closely with our engineering partners to help enable and drive initiatives from design to implementation. Our highly available multi-region Kubernetes (AWS EKS) environments are best-in-class and central to our enterprise-grade infrastructure strategy. These growing environments currently support numerous critical workloads. In this role, you'll have the opportunity to further develop and refine your skills, collaborate across numerous Fidelity teams, and continue to grow in a fun, collaborative, and rapidly changing environment. This is an extraordinary opportunity to have a direct impact on the emerging strategies of our infrastructure and deployments while helping enable the expansion of our business. In addition to bringing technical leadership and influence, this role includes supervisory responsibilities for a small group of professionals specializing in site reliability.

The Team

Fidelity Digital Assets®, a Fidelity Investments company, is developing a full-service enterprise-grade platform for storing, trading, and servicing digital assets, such as Bitcoin and Ethereum. Fidelity Digital Assets® adopts an entrepreneurial culture and startup approach while serving as one of the most innovative business units within Fidelity Investments. Our global, diverse team of hundreds of forward-thinking professionals leads with agility and creativity to build solutions that bridge the gap between traditional institutional investors and their exposure to digital assets. The firm's tenure and experience across multiple business lines present our employees with unprecedented access to knowledge, technology, and resources that help our team reshape the future of finance.

Within Fidelity Digital Assets®, the Technical Operations team plays a key role in our initiative of moving to the cloud. The team uses AWS services to secure our network and scale our applications to ensure their uptime. Team members are hands-on engineers specializing in system reliability who promote a DevOps approach, with a focus on infrastructure-as-code, security, and automation.

Requirements

8+ years of hands-on AWS production experience crafting and managing highly available, secure, scalable systems (microservices preferred).
Skilled in crafting and deploying resilient AWS infrastructure across multiple regions and availability zones.
Production experience running Kubernetes workloads and clusters (EKS preferred), encompassing the creation and upkeep of Helm charts and reusable templates/libraries.
Hands-on CI/CD experience with Jenkins, involving writing and managing declarative pipelines and shared libraries.
Proficiency with Linux/Unix systems and shell scripting.
Programming proficiency (Python preferred).
Proficient with Git-based workflows.
Experience building and maintaining observability (logging, monitoring, alerting) using tools such as Datadog and Splunk.
Experience working within Agile delivery models (Kanban/Scrum).

Nice To Haves

Terraform experience (preferred).
CDN experience (e.g., Akamai).
Kafka experience (Apache/Confluent).

Responsibilities

Provide technical and people leadership for a group of Site Reliability Engineers (SREs) / Cloud Engineers; hire, mentor, and develop a high-performing team.
Set clear goals and expectations; lead performance, career growth, and team health while fostering a culture of ownership and continuous improvement.
Own execution for reliability and cloud delivery initiatives—plan and prioritize work, remove blockers, and ensure predictable outcomes.
Partner cross-functionally with Product and Engineering leaders to drive delivery of complex programs and platform improvements.
Establish, operate, and continuously improve on-call and incident management practices (escalation paths, runbooks, postmortems, and corrective action tracking).
Define, implement, and own service reliability metrics and practices, including SLIs/SLOs and error budgets, in partnership with engineering teams.
Lead blameless post-incident reviews and drive remediation to reduce repeat incidents and operational toil.
Champion automation and Infrastructure as Code (IaC) standards to improve delivery speed, reduce risk, and increase system resilience.

Benefits

Comprehensive health care coverage and emotional well-being support
Market-leading retirement
Generous paid time off and parental leave
Charitable giving employee match program
Educational assistance including student loan repayment, tuition reimbursement, and learning resources to develop your career
