Principal Site Reliability Engineer

Fidelity Investments • Durham, NC

Hybrid

About The Position

Position Description: Combines operational excellence with development experience to deliver services at high scale and high availability, with resilience. Builds reliability into the ecosystem by applying best practices in Resiliency Engineering, Automation, Observability, and Chaos Testing. Streamlines and accelerates the software delivery cycle by using DevOps practices and toolchains. Integrates Site Reliability Engineering (SRE) practices (Observability and Chaos) with DevOps processes and delivery pipelines to stop bad code from reaching production. Ensures business-critical enterprise systems are continuously available to internal and external customers. Implements technical standardization and process refinements within the engineering organization and for Site Reliability Engineers. Collaborates with production support teams to define and implement processes for the identification, collection, and analysis of incident data. Brings together technical, procedural, and financial data to reduce toil and increase efficiency.

Requirements

Bachelor’s degree in Computer Science, Engineering, Information Technology, Information Systems, or a closely related field (or foreign education equivalent) and five (5) years of experience as a Principal Site Reliability Engineer (or closely related occupation) implementing resilient container and cloud-based applications and infrastructure solutions, using DevOps or SRE practices, in a financial services environment.
Or, alternatively, Master’s degree in Computer Science, Engineering, Information Technology, Information Systems, or a closely related field (or foreign education equivalent) and three (3) years of experience as a Principal Site Reliability Engineer (or closely related occupation) implementing resilient container and cloud-based applications and infrastructure solutions, using DevOps or SRE practices, in a financial services environment.
Demonstrated Expertise (“DE”) improving application resiliency by implementing chaos engineering to build a system's capability to withstand turbulent conditions in production, using Chaos Mesh, Chaosd, Azure Chaos Studio, AWS FIS, or Gremlin; and driving automation to implement scalable approaches for the planning, design, execution, and reporting of chaos testing using Jenkins pipelines, standard frameworks, data visualization, and dashboards.
DE implementing advanced observability practices and techniques in production and pre-production environments at scale, using Datadog, Splunk, or Catchpoint; tracking the error budget, proactively identifying issues, and minimizing Mean Time to Repair (MTTR); and balancing customer expectations by implementing Service-Level Indicators (SLIs) and Service-Level Objectives (SLOs) using logs, traces, monitors, and synthetic tests.
DE migrating and maintaining cloud applications and creating cloud solutions using Amazon Web Services (AWS) or Azure cloud services; implementing infrastructure as code for cloud; onboarding new AWS or Azure services with required reviews and security controls in non-production and production environments; and researching the evolving cloud ecosystem to adopt machine-learning-based tools (AWS DevOps Guru) to boost AIOps abilities.
DE implementing CI/CD pipelines in both production and non-production environments using Application Lifecycle Management (ALM) tools (JIRA, GitHub, Jenkins, SonarQube, Artifactory, or uDeploy) to enable faster code delivery, enhanced software quality, reliability, and security; and developing products, and core and common capabilities for the organization to reduce toil and drive standardization, using containerization and orchestration technologies (Docker or Kubernetes), Infrastructure as Code (IaC) tools, scripting languages (Python or Groovy), and engineering best practices.

Responsibilities

Develops Chaos Testing capabilities using multiple Chaos Tools (AWS Fault Injection Service (FIS), Chaos Mesh, and Chaosd) and Chaos Toolkit.
Develops and enhances organization’s internal Chaos Framework to streamline Chaos Executions and reporting.
Provides specialized technical expertise in the adoption of Chaos Engineering by application teams.
Chaos-tests and observes business-critical applications to understand their weaknesses and increase application resiliency.
Activates Observability for critical applications with recommended Service Level Indicators and Service Level Objectives for Latency, Availability, Error Rate, etc.
Utilizes modern monitoring tools (Datadog, Splunk, Catchpoint, etc.) to reduce the mean time to detect issues and improve response times.
Creates CI/CD pipelines with security and quality checks using the Application Lifecycle Management (ALM) toolchain.
Helps integrate Chaos and Observability into CI/CD pipelines.
Automates repetitive activities using scripting languages (Python, Groovy, etc.).
Implements and supports solutions based on cloud platforms (AWS or Azure) and container orchestration (Kubernetes).
Onboards and evaluates new cloud services that help enhance the resiliency of the cloud ecosystem.
Serves as a liaison for vendor engagement.
Participates in incident management, problem management, and incident postmortems.
Takes part in peer code reviews, providing qualitative feedback.
Builds processes and capabilities to adapt and respond to risks and disruptions while maintaining business operations and data recovery with minimal interruption.
Coaches peer SREs and application teams on SRE and DevOps.
Applies Agile methodologies to the team’s project work, using incremental and iterative steps.