Sr Site Reliability Engineer

PayPal
San Jose, CA
Hybrid

About The Position

The Company: PayPal has been revolutionizing commerce globally for more than 25 years. Creating innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, PayPal empowers consumers and businesses in approximately 200 markets to join and thrive in the global economy.

We operate a global, two-sided network at scale that connects hundreds of millions of merchants and consumers. We help merchants and consumers connect, transact, and complete payments, whether they are online or in person. PayPal is more than a connection to third-party payment networks. We provide proprietary payment solutions accepted by merchants that enable the completion of payments on our platform on behalf of our customers.

We offer our customers the flexibility to use their accounts to purchase and receive payments for goods and services, as well as the ability to transfer and withdraw funds. We enable consumers to exchange funds more safely with merchants using a variety of funding sources, which may include a bank account, a PayPal or Venmo account balance, PayPal and Venmo branded credit products, a credit card, a debit card, certain cryptocurrencies, or other stored value products such as gift cards, and eligible credit card rewards. Our PayPal, Venmo, and Xoom products also make it safer and simpler for friends and family to transfer funds to each other.

We offer merchants an end-to-end payments solution that provides authorization and settlement capabilities, as well as instant access to funds and payouts. We also help merchants connect with their customers, process exchanges and returns, and manage risk. We enable consumers to engage in cross-border shopping and merchants to extend their global reach while reducing the complexity and friction involved in enabling cross-border trade.

Our beliefs are the foundation for how we conduct business every day.
We live each day guided by our core values of Inclusion, Innovation, Collaboration, and Wellness. Together, our values ensure that we work together as one global team with our customers at the center of everything we do – and they push us to ensure we take care of ourselves, each other, and our communities.

Job Summary: This is an incident command role, not a hands-on DevOps engineering position. You won't be migrating workloads or writing Terraform modules for product teams. Instead, you'll direct application and infrastructure teams during incidents: making work assignments, prioritizing troubleshooting paths, and authorizing critical actions like rollbacks and regional failovers. You need the technical depth to rapidly read Infrastructure as Code, Kubernetes manifests, and CI/CD configurations to make informed decisions under pressure. You'll also regularly interface with executive leadership during critical incidents and post-mortems, and drive implementation of tooling that advances the Command Center's capabilities.

Requirements

  • 3+ years of relevant experience and a Bachelor’s degree, or any equivalent combination of education and experience.
  • Significant hands-on experience with at least one major cloud provider (AWS or GCP required; multi-cloud experience preferred)
  • Strong proficiency with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi, or equivalent) including ability to read, review, and troubleshoot IaC configurations during incidents
  • Significant hands-on experience with Kubernetes and CNCF ecosystem tools, including troubleshooting K8s deployments, manifests, and cluster issues
  • Ability to quickly read and review code across multiple languages (Python, Go, Bash) and configuration formats (YAML, HCL, JSON), essential for effective incident troubleshooting
  • Proven experience managing critical incidents in Infrastructure-as-Code driven environments, including troubleshooting IaC state issues, GitOps failures, and cloud-native deployment problems
  • Experience with monitoring and observability tools (Splunk, Datadog, Prometheus, Grafana)
  • Strong knowledge of networking, load balancing, CDN technologies, and DNS management
  • Proficiency in scripting for operational automation (Python, Bash, PowerShell)
  • 5+ years of experience in site reliability engineering, infrastructure operations, or similar technical operations roles
  • Strong expertise in cloud platforms (AWS, GCP, and/or Azure)
  • Proficiency in infrastructure automation tools (Terraform, Ansible, CloudFormation, etc.)
  • Deep understanding of distributed systems, microservices architecture, and containerization (Docker, Kubernetes)
  • Exceptional communication skills with ability to articulate complex technical issues to both technical and non-technical stakeholders
  • Executive presence and ability to effectively communicate with senior leadership during high-pressure incidents and post-mortems
  • Strong analytical and problem-solving abilities with a systematic approach to troubleshooting
  • Ability to remain calm under pressure and make critical decisions during incidents
  • Excellent collaboration skills with experience working across global, cross-functional teams
  • Strong documentation skills and attention to detail

Nice To Haves

  • Professional-level certification in at least one major cloud platform (AWS Solutions Architect Professional, Google Cloud Professional Cloud Architect, or equivalent)
  • Experience with payment processing systems or fintech platforms
  • Certifications in cloud platforms (AWS Solutions Architect, Google Cloud Professional, Azure Administrator, etc.)
  • ITIL Foundation or higher certification
  • Experience with infrastructure as code and GitOps practices
  • Background in security operations or compliance (PCI-DSS, SOC 2, etc.)
  • Experience mentoring or leading technical teams
  • Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience)
  • Experience with multiple cloud providers (AWS + GCP + Azure)
  • Broader toolset expertise across multiple IaC tools, CI/CD platforms, or GitOps solutions

Responsibilities

  • Take ownership of system performance monitoring, identify inefficiencies, and lead initiatives to improve the overall availability and reliability of digital platforms and applications.
  • Lead and manage the response to complex, high-priority incidents, ensuring prompt resolution and a thorough root cause analysis to prevent future occurrences.
  • Design and implement advanced automation frameworks to improve operational efficiency, streamline processes, and reduce human error.
  • Lead reliability-focused initiatives, ensuring systems are highly available, resilient, and scalable, and promote best practices across engineering teams.
  • Enhance the monitoring infrastructure by identifying key metrics, optimizing alerting, and improving system observability to ensure the reliability of large-scale systems.
  • Forecast resource requirements and lead capacity planning activities to ensure systems can scale effectively to meet growing user demand.
  • Ensure robust disaster recovery strategies are in place and conduct regular testing to ensure systems can recover quickly from failures.
  • Partner with engineering and product teams to identify opportunities for improving system architecture, focusing on scalability, reliability, and fault tolerance.
  • Provide mentorship and technical guidance to junior site reliability engineers, fostering skill development and knowledge sharing.
  • Drive continuous improvement across operational workflows, identifying areas for optimization, cost reduction, and performance enhancement.
  • Proactively identify and address vulnerabilities in cloud (AWS, GCP, Azure) and on-premises infrastructure
  • Review Infrastructure as Code changes for reliability risks as part of change approval process
  • Identify architectural anti-patterns in Kubernetes deployments and cloud migrations
  • Conduct regular disaster recovery drills and readiness tests before major events (Thanksgiving, Cyber 5, peak shopping seasons)
  • Participate in situation room activities for new product rollouts
  • Drive site resilience projects to enhance system reliability and uptime
  • Implement automated monitoring solutions to detect single points of failure
  • Lead new datacenter and CDN certification initiatives
  • Act as incident commander with final decision authority -- directing engineering teams, authorizing rollbacks, and commanding regional failovers
  • Direct application and infrastructure teams during incidents by making work assignments and prioritizing troubleshooting paths
  • Rapidly assess incidents by reading Infrastructure as Code (Terraform, CloudFormation), Kubernetes manifests, and CI/CD configurations
  • Give final authorization for critical actions including production rollbacks, regional failovers, and emergency changes
  • Interface with executive leadership during critical incidents and post-mortems to provide technical guidance and impact assessments
  • Identify when incidents stem from teams deviating from established cloud-native patterns
  • Command cross-functional teams during high-severity incidents affecting PayPal core and brand platforms (Venmo, Xoom, Zettle, Braintree)
  • Lead blameless postmortem sessions and contribute to Root Cause Analysis (RCA) processes
  • Drive continuous improvement initiatives based on incident learnings
  • Serve as the primary technical escalation point during critical incidents
  • Accelerate incident response times through standardized playbooks and automated workflows
  • Coordinate cross-functional teams during high-severity incidents affecting PayPal core and brand platforms (Venmo, Xoom, Zettle, Braintree)
  • Manage multiple concurrent incidents during peak periods with efficiency and precision
  • Serve as final approver for emergency changes and provide expert guidance on all production changes
  • Act as advisor and technical authority during change approval processes, identifying potential reliability risks
  • Provide training and guidance to engineering teams on change management best practices
  • Maintain change audit documentation and compliance requirements
  • Review and approve changes to production systems, ensuring comprehensive risk assessment
  • Automate change validation and rollback procedures to minimize service disruptions
  • Streamline change management processes to reduce manual errors and bottlenecks
  • Leverage deep expertise in cloud platforms (AWS, GCP, Azure) to drive incident resolution
  • Support Braintree and Venmo cloud infrastructure operations
  • Guide teams toward solutions by providing architectural direction during incidents
  • Stay current with emerging cloud technologies and best practices
  • Mentor team members on cloud technologies and incident management techniques
  • Implement automation, dashboards, and tooling to enhance the team's incident response capabilities
  • Build runbooks and playbooks for cloud-native incident scenarios
  • Develop internal tools and scripts to improve TDO operational efficiency
  • Drive projects that advance the Command Center's operational capabilities

Benefits

  • Medical, dental, vision, life and disability insurance, parental and family leave, 401(k) savings plan, and paid time off