Staff Site Reliability Engineer (Mobile)

PayPal•San Jose, CA

7d•Hybrid

About The Position

The Company

PayPal has been revolutionizing commerce globally for more than 25 years. Creating innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, PayPal empowers consumers and businesses in approximately 200 markets to join and thrive in the global economy.

We operate a global, two-sided network at scale that connects hundreds of millions of merchants and consumers. We help merchants and consumers connect, transact, and complete payments, whether they are online or in person. PayPal is more than a connection to third-party payment networks. We provide proprietary payment solutions accepted by merchants that enable the completion of payments on our platform on behalf of our customers. We offer our customers the flexibility to use their accounts to purchase and receive payments for goods and services, as well as the ability to transfer and withdraw funds.

We enable consumers to exchange funds more safely with merchants using a variety of funding sources, which may include a bank account, a PayPal or Venmo account balance, PayPal and Venmo branded credit products, a credit card, a debit card, certain cryptocurrencies, or other stored value products such as gift cards, and eligible credit card rewards. Our PayPal, Venmo, and Xoom products also make it safer and simpler for friends and family to transfer funds to each other. We offer merchants an end-to-end payments solution that provides authorization and settlement capabilities, as well as instant access to funds and payouts. We also help merchants connect with their customers, process exchanges and returns, and manage risk. We enable consumers to engage in cross-border shopping and merchants to extend their global reach while reducing the complexity and friction involved in enabling cross-border trade.

Our beliefs are the foundation for how we conduct business every day. We live each day guided by our core values of Inclusion, Innovation, Collaboration, and Wellness. Together, our values ensure that we work together as one global team with our customers at the center of everything we do, and they push us to ensure we take care of ourselves, each other, and our communities.

Job Summary

At PayPal, Staff Site Reliability Engineers (SREs) ensure the reliability, scalability, and performance of our global systems. We’re launching a new Mobile SRE team to bridge mobile clients and backend systems, delivering seamless end-to-end customer experiences. This role focuses on mobile reliability, performance, and observability. As the Staff Mobile SRE, you’ll define strategy, set reliability standards, and align practices across mobile and backend SRE teams. You’ll drive improvements in app stability, reduce crashes, and elevate performance through automation, monitoring, and mentorship, fostering a culture of operational excellence across the organization.

Requirements

5+ years of relevant experience and a Bachelor’s degree, or an equivalent combination of education and experience.
Expertise defining and implementing SLIs/SLOs for distributed and client-server systems.
Hands-on experience with Datadog or similar platforms for monitoring, alerting, and dashboards.
Proven ability to lead on-call rotations, incident response, and postmortems.
Strong programming skills in Python, Go, or similar, with working knowledge of Swift or Kotlin for client instrumentation.
Experience building automation and internal tools to improve reliability and efficiency.
Skilled in integrating CI/CD systems (Harness, Jenkins, Fastlane) for mobile deployments.
Strong communication and leadership skills with a proven ability to mentor and influence across teams.

Nice To Haves

Strong knowledge of iOS and/or Android performance and reliability challenges.
Experience with Bazel, Gradle, or similar build systems.
Familiarity with backend reliability and distributed systems concepts.
Proven success introducing on-call or observability practices within engineering teams.
Experience with large-scale, customer-facing mobile or fintech systems.

Responsibilities

Manage and deliver large-scale reliability improvement projects, ensuring systems are performant, available, and resilient.
Drive the identification of performance bottlenecks and lead initiatives to optimize and scale critical systems and services.
Architect and implement scalable infrastructure solutions to support growing user demands while maintaining system reliability.
Lead the design and enhancement of monitoring frameworks, ensuring systems are highly observable, and support the response to production incidents.
Take ownership of improving system resilience by designing fault-tolerant architectures and implementing disaster recovery strategies.
Lead capacity planning initiatives to ensure system resources are proactively managed, preventing downtime or performance degradation under high load.
Work closely with development, operations, and other technical teams to ensure seamless system integration and align on best practices for reliability.
Act as a technical mentor within the organization, guiding teams through complex reliability challenges and promoting a culture of excellence.
Help define and execute long-term reliability engineering strategies and standards to ensure the scalability and performance of core services.
Develop and enforce best practices for operational excellence, including automation, incident management, and system monitoring, across engineering teams.
Define mobile-specific SLIs/SLOs (e.g., crash-free sessions, ANRs, startup times, network success rates) and establish observability and alerting best practices in Datadog.
Ensure consistency in how mobile reliability is measured and tracked across iOS and Android teams.
Lead development of reliability tools and automation—covering regression detection, performance benchmarking, and release health dashboards.
Integrate crash/ANR triage systems with Datadog, Crashlytics, and CI/CD and build tooling (Harness, Gradle, Bazel).
Act as liaison with backend/web SRE teams to ensure unified visibility and incident response.
Partner with Product, QA, and Release Engineering to meet operational readiness standards and influence architecture for reliability from design to delivery.
Lead the rollout of on-call practices, incident response, and blameless postmortems.
Mentor senior SREs across regions and drive adoption of reliability ownership among mobile engineering teams.
Collaborate with infrastructure and developer productivity teams to integrate mobile builds into reliable CI/CD pipelines.
Establish long-term roadmaps that align mobile reliability with PayPal’s global SRE strategy.
Partner with engineering teams to ensure robust monitoring, alerting, and dashboards for critical mobile services.
Create and maintain runbooks and playbooks to standardize operational practices and empower teams to self-manage reliability.
Lead post-incident reviews, identify areas for improvement, and help implement proactive reliability measures.
Collaborate with the Datadog observability team to enhance signal quality and alerting efficiency.