Sr Site Reliability Engineer

PayPal • San Jose, CA

23d

About The Position

Take ownership of system performance monitoring, identify inefficiencies, and lead initiatives to improve the overall availability and reliability of digital platforms and applications. Lead and manage the response to complex, high-priority incidents, ensuring prompt resolution and a thorough root cause analysis to prevent future occurrences. Design and implement advanced automation frameworks to improve operational efficiency, streamline processes, and reduce human error. Lead reliability-focused initiatives, ensuring systems are highly available, resilient, and scalable, and promote best practices across engineering teams. Enhance the monitoring infrastructure by identifying key metrics, optimizing alerting, and improving system observability to ensure the reliability of large-scale systems. Forecast resource requirements and lead capacity planning activities to ensure systems can scale effectively to meet growing user demand. Ensure robust disaster recovery strategies are in place and conduct regular testing to ensure systems can recover quickly from failures. Partner with engineering and product teams to identify opportunities for improving system architecture, focusing on scalability, reliability, and fault tolerance. Provide mentorship and technical guidance to junior site reliability engineers, fostering skill development and knowledge sharing. Drive continuous improvement across operational workflows, identifying areas for optimization, cost reduction, and performance enhancement. 

Requirements

3+ years of relevant experience and a Bachelor's degree, or any equivalent combination of education and experience.
Significant hands-on experience with at least one major cloud provider (AWS or GCP required; multi-cloud experience preferred)
Strong proficiency with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi, or equivalent) including ability to read, review, and troubleshoot IaC configurations during incidents
Significant hands-on experience with Kubernetes and CNCF ecosystem tools, including troubleshooting K8s deployments, manifests, and cluster issues
Ability to quickly read and review code across multiple languages (Python, Go, Bash) and configuration formats (YAML, HCL, JSON), essential for effective incident troubleshooting
Proven experience managing critical incidents in Infrastructure-as-Code driven environments, including troubleshooting IaC state issues, GitOps failures, and cloud-native deployment problems
Professional-level certification in at least one major cloud platform (AWS Solutions Architect Professional, Google Cloud Professional Cloud Architect, or equivalent)
Experience with monitoring and observability tools (Splunk, Datadog, Prometheus, Grafana)
Exceptional communication skills with ability to articulate complex technical issues to both technical and non-technical stakeholders
Executive presence and ability to effectively communicate with senior leadership during high-pressure incidents and post-mortems
Strong analytical and problem-solving abilities with a systematic approach to troubleshooting
Ability to remain calm under pressure and make critical decisions during incidents
Excellent collaboration skills with experience working across global, cross-functional teams
Strong documentation skills and attention to detail
Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience).

Nice To Haves

Experience with multiple cloud providers (AWS + GCP + Azure)
Broader toolset expertise across multiple IaC tools, CI/CD platforms, or GitOps solutions
Experience with payment processing systems or fintech platforms
Certifications in cloud platforms (AWS Solutions Architect, Google Cloud Professional, Azure Administrator, etc.)
ITIL Foundation or higher certification
Experience with infrastructure as code and GitOps practices
Background in security operations or compliance (PCI-DSS, SOC 2, etc.)
Experience mentoring or leading technical teams

Responsibilities

Take ownership of system performance monitoring, identify inefficiencies, and lead initiatives to improve the overall availability and reliability of digital platforms and applications.
Lead and manage the response to complex, high-priority incidents, ensuring prompt resolution and a thorough root cause analysis to prevent future occurrences.
Design and implement advanced automation frameworks to improve operational efficiency, streamline processes, and reduce human error.
Lead reliability-focused initiatives, ensuring systems are highly available, resilient, and scalable, and promote best practices across engineering teams.
Enhance the monitoring infrastructure by identifying key metrics, optimizing alerting, and improving system observability to ensure the reliability of large-scale systems.
Forecast resource requirements and lead capacity planning activities to ensure systems can scale effectively to meet growing user demand.
Ensure robust disaster recovery strategies are in place and conduct regular testing to ensure systems can recover quickly from failures.
Partner with engineering and product teams to identify opportunities for improving system architecture, focusing on scalability, reliability, and fault tolerance.
Provide mentorship and technical guidance to junior site reliability engineers, fostering skill development and knowledge sharing.
Drive continuous improvement across operational workflows, identifying areas for optimization, cost reduction, and performance enhancement.
Proactively identify and address vulnerabilities in cloud (AWS, GCP, Azure) and on-premises infrastructure
Review Infrastructure as Code changes for reliability risks as part of change approval process
Identify architectural anti-patterns in Kubernetes deployments and cloud migrations
Participate in situation room activities for new product rollouts
Implement automated monitoring solutions to detect single points of failure
Manage multiple concurrent incidents during peak periods with efficiency and precision
Provide training and guidance to engineering teams on change management best practices
Review and approve changes to production systems, ensuring comprehensive risk assessment
Automate change validation and rollback procedures to minimize service disruptions
Streamline change management processes to reduce manual errors and bottlenecks
Maintain change audit documentation and compliance requirements
Leverage deep expertise in cloud platforms (AWS, GCP, Azure) to drive incident resolution
Support Braintree and Venmo cloud infrastructure operations
Guide teams toward solutions by providing architectural direction during incidents
Stay current with emerging cloud technologies and best practices
Mentor team members on cloud technologies and incident management techniques
Implement automation, dashboards, and tooling to enhance the team's incident response capabilities
Build runbooks and playbooks for cloud-native incident scenarios
Develop internal tools and scripts to improve TDO operational efficiency
Drive projects that advance the Command Center's operational capabilities