Director, Site Reliability Engineering

Fidelity Investments•Westlake, TX

55d•Hybrid

About The Position

The Fidelity Enterprise Infrastructure (EI) Production Support team is seeking a Director to help scale our growing public cloud presence. Fidelity's Site Reliability Engineers work with our cloud platform teams to deliver reliable runtimes for Fidelity's business-critical workloads. The team is responsible for cross-cutting cloud management capabilities, and its members are the experts on the state of Fidelity's cloud platforms at any moment. Team members come from diverse technical backgrounds, and the responsibilities offer a variety of challenges spanning both software and systems. Ideal candidates will have a background in either software engineering or systems engineering with a desire to learn the other, or previous experience as an SRE.

The Director for SRE will provide engineering and systems operational support for Business Unit-aligned functions, including Application Support, Cloud Enablement, Helpdesk, Environment Management, Mid-tier & Web Operations, and Platform Engineering. By demonstrating and promoting Fidelity and agile leadership behaviors, you will evolve and sustain an innovative agile culture. Our ever-evolving technology stack ensures a phenomenal learning culture in the team. We are always exploring new technologies and new ways to continually provide value to our customers. This team has a direct and positive impact on Fidelity's customers.

Requirements

Ability to automate with various scripting languages (Python, Shell scripting, etc.)
Experience managing systems using infrastructure-as-code tools (IAM, ARM, Terraform, Chef)
Solid understanding of Cloud Computing and DevOps concepts including CI/CD pipelines
Hands-on Kubernetes skills and knowledge
Hands-on experience with cloud services on AWS and Azure
Experience building resiliency with chaos engineering practices
Hands-on experience with one or more observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.)
Experience with instrumentation, with systems skills in building and operating monitoring, logging, and alerting services for distributed systems at scale.
Proven experience in maintaining the scalability and resiliency of complex environments.
Proven experience in implementing advanced observability practices and techniques at scale.
Demonstrated ability to use modern monitoring tools (Datadog, Prometheus, Splunk)
Ability to triage, execute root cause analysis, and be decisive under pressure.
Experience managing and interpreting large datasets using query languages and visualization tools.
Strong communication skills with an ability to reach both technical and non-technical audiences.
Ability to learn new software, methods, and practices and bring them to our developers.
Ability to work with a variety of individuals and groups, both in person and virtually, in a constructive and collaborative manner and build and maintain effective relationships.
Bridges the gap between lofty architecture ideas and development of feasible solutions.
Facilitates discussions among component owners to improve end-to-end understanding of transaction paths.
Provides consulting to architects and developers on common patterns and tactical, reusable solutions.
Influences adoption of stability principles by presenting facts and data.
Drives operational readiness discussions and reviews of new solutions and products.
Develops frameworks for self-assessment of applications on various stability and dependability pillars.
Participates, even unsolicited, in discussions and decisions that impact customer experience.
Selectively preserves and shares the collective memory and successes of the past.
Mindset of continuous learning and experimentation. Instinctive urge to improve the current state by finding problems and recommending feasible solutions.

Responsibilities

Provide engineering and systems operational support for Business Unit-aligned functions, including Application Support, Cloud Enablement, Helpdesk, Environment Management, Mid-tier & Web Operations, and Platform Engineering.
Evolve and sustain an innovative agile culture.
