Staff Site Reliability Engineer (Mobile)

PayPal • San Jose, CA

About The Position

Manage and deliver large-scale reliability improvement projects, ensuring systems are performant, available, and resilient. Drive the identification of performance bottlenecks and lead initiatives to optimize and scale critical systems and services. Architect and implement scalable infrastructure solutions to support growing user demands while maintaining system reliability. Lead the design and enhancement of monitoring frameworks, ensuring systems are highly observable, and support the response to production incidents. Take ownership of improving system resilience by designing fault-tolerant architectures and implementing disaster recovery strategies. Lead capacity planning initiatives to ensure system resources are proactively managed, preventing downtime or performance degradation under high load.

Work closely with development, operations, and other technical teams to ensure seamless system integration and align on best practices for reliability. Act as a technical mentor within the organization, guiding teams through complex reliability challenges and promoting a culture of excellence. Help define and execute long-term reliability engineering strategies and standards to ensure the scalability and performance of core services. Develop and enforce best practices for operational excellence, including automation, incident management, and system monitoring, across engineering teams.

Standards & Governance: Define mobile-specific SLIs/SLOs (e.g., crash-free sessions, ANRs, startup times, network success rates) and establish observability and alerting best practices in Datadog. Ensure consistency in how mobile reliability is measured and tracked across iOS and Android teams.

Tooling & Automation: Lead development of reliability tools and automation, covering regression detection, performance benchmarking, and release health dashboards. Integrate crash/ANR triage systems with Datadog, Crashlytics, and CI/CD pipelines (Harness, Gradle, Bazel).

Cross-Team Leadership: Act as liaison with backend/web SRE teams to ensure unified visibility and incident response. Partner with Product, QA, and Release Engineering to meet operational readiness standards and influence architecture for reliability from design to delivery.

Cultural Enablement & Mentorship: Lead the rollout of on-call practices, incident response, and blameless postmortems. Mentor senior SREs across regions and drive adoption of reliability ownership among mobile engineering teams.

Strategic Enablement: Collaborate with infrastructure and developer productivity teams to integrate mobile builds into reliable CI/CD pipelines.

Requirements

5+ years of relevant experience and a Bachelor's degree, or an equivalent combination of education and experience
Expertise in defining and implementing SLIs/SLOs for distributed and client-server systems
Hands-on experience with Datadog or similar platforms for monitoring, alerting, and dashboards
Proven ability to lead on-call rotations, incident response, and postmortems
Strong programming skills in Python, Go, or similar, with working knowledge of Swift or Kotlin for client instrumentation
Experience building automation and internal tools to improve reliability and efficiency
Skilled in integrating CI/CD systems (Harness, Jenkins, Fastlane) for mobile deployments
Strong communication and leadership skills with a proven ability to mentor and influence across teams
Strong knowledge of iOS and/or Android performance and reliability challenges

Nice To Haves

Experience with Bazel, Gradle, or similar build systems
Familiarity with backend reliability and distributed systems concepts
Proven success introducing on-call or observability practices within engineering teams
Experience with large-scale, customer-facing mobile or fintech systems

Responsibilities

Manage and deliver large-scale reliability improvement projects
Drive the identification of performance bottlenecks and lead initiatives to optimize and scale critical systems and services
Architect and implement scalable infrastructure solutions to support growing user demands while maintaining system reliability
Lead the design and enhancement of monitoring frameworks
Take ownership of improving system resilience by designing fault-tolerant architectures and implementing disaster recovery strategies
Lead capacity planning initiatives to ensure system resources are proactively managed
Work closely with development, operations, and other technical teams to ensure seamless system integration and align on best practices for reliability
Act as a technical mentor within the organization
Help define and execute long-term reliability engineering strategies and standards
Develop and enforce best practices for operational excellence
Define mobile-specific SLIs/SLOs
Establish observability and alerting best practices in Datadog
Ensure consistency in how mobile reliability is measured and tracked across iOS and Android teams
Lead development of reliability tools and automation
Integrate crash/ANR triage systems with Datadog, Crashlytics, and CI/CD pipelines (Harness, Gradle, Bazel)
Act as liaison with backend/web SRE teams to ensure unified visibility and incident response
Partner with Product, QA, and Release Engineering to meet operational readiness standards and influence architecture for reliability from design to delivery
Lead the rollout of on-call practices, incident response, and blameless postmortems
Mentor senior SREs across regions and drive adoption of reliability ownership among mobile engineering teams
Collaborate with infrastructure and developer productivity teams to integrate mobile builds into reliable CI/CD pipelines
Partner with engineering teams to ensure robust monitoring, alerting, and dashboards for critical mobile services
Create and maintain runbooks and playbooks to standardize operational practices and empower teams to self-manage reliability
Lead post-incident reviews, identify areas for improvement, and help implement proactive reliability measures
Collaborate with the Datadog observability team to enhance signal quality and alerting efficiency