Principal Site Reliability Engineer

Fidelity Investments
Durham, NC
Hybrid

About The Position

Position Description: Creates and maintains performance test cases, suites, and frameworks using performance tools: LoadRunner, CloudTest, JMeter, Locust, and k6. Implements performance tools via Continuous Integration/Continuous Deployment (CI/CD) pipelines using Jenkins and GitHub. Provides support for migration and tuning of Cloud-based applications in Amazon Web Services (AWS), Elastic Container Service (ECS), and Elastic Kubernetes Service (EKS). Builds monitoring frameworks or dashboards to improve visibility and observability using Splunk and Datadog. Possesses a thorough understanding of microservices architecture on Docker Swarm and Kubernetes infrastructure in order to provide tuning recommendations and triage incidents, helping improve the performance, stability, and availability of customer-facing applications. Builds and improves standard methodologies for performance, load, stress, and chaos testing, along with analytics and reporting based on business requirements. Works with industry-standard performance testing tools, methodologies, and technologies for large-scale end-to-end systems. Solves complex performance and stability issues to ensure industry-leading platforms are high-performing and scalable to meet business needs.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, Information Technology, Information Technology Management, Information Systems, Business Administration, or a closely related field (or foreign education equivalent) and five (5) years of experience as a Principal Site Reliability Engineer (or closely related occupation) executing software performance engineering of online financial systems in a DevOps environment.
  • Or, alternatively, Master’s degree in Computer Science, Engineering, Information Technology, Information Technology Management, Information Systems, Business Administration, or a closely related field (or foreign education equivalent) and three (3) years of experience as a Principal Site Reliability Engineer (or closely related occupation) executing software performance engineering of online financial systems in a DevOps environment.
  • Demonstrated Expertise (“DE”) orchestrating software performance benchmarking on online financial Web and mobile applications across scrum teams, deployed on AWS and Azure, using CloudTest, JMeter, Locust, and k6 with Jenkins and GitHub in a Cloud Kubernetes environment.
  • DE creating and monitoring dashboards to improve visibility and observability using Splunk and Datadog for performance benchmarking; analyzing and monitoring hardware system performance metrics (CPU, memory, network usage, pod counts, container restarts, pod crashes, heap, and garbage collection (GC)) using Apache, Java, Angular, and Node.js software platforms deployed on Unix and Windows environments.
  • DE identifying performance gaps and triaging incidents in capacity and infrastructure configurations, providing corrective recommendations to development teams, using observability tools (Splunk and Datadog) and performance tools (CloudTest, JMeter, Locust, and k6).
  • DE conducting resiliency, chaos, and failure testing on financial software applications to identify gaps in application and infrastructure configurations, using Chaos Mesh, Gremlin, AWS Fault Injection Service, Splunk, and Datadog for analysis; and virtualizing backends using WireMock.

Responsibilities

  • Establishes and enhances application performance testing frameworks and infrastructure to provide actionable metrics and reports that identify bottlenecks and issues, using established observability tooling.
  • Reviews environment topology, software architecture, and Kubernetes platform metrics, using established observability tooling, to ensure adequate capacity headroom and availability of systems.
  • Applies Agile principles for continuous improvement of processes, efficiency, and quality of all deliverables.
  • Coordinates at all levels of the organization on cross-business site reliability engineering (SRE) initiatives, tracking and reporting status to all stakeholders.
  • Prioritizes workloads in fast-paced, dynamic environments and meets deadlines.
  • Champions and encourages collaboration and adoption of new tools and processes within the team.
  • Performs independent and complex technical analysis for multiple projects supporting initiatives across business units.
  • Contributes to the team’s documented knowledge base and expertise in the SRE environment.
  • Coaches junior members of the team.