Staff Site Reliability Engineer

Visa•Austin, TX

Hybrid

About The Position

The Staff Platform Engineer is an individual contributor within the SRE / Platform organization, responsible for operating, maintaining, and improving cloud-native platforms that support critical workloads. The role focuses on platform reliability, operational excellence, and automation, ensuring systems are stable, scalable, and well run in production. The Staff Platform Engineer works primarily on Azure-based platforms and contributes to AWS environments as current initiatives require. The role is execution-focused, with strong involvement in day-to-day platform operations and continuous improvement efforts. This is a hybrid position; the expected number of days in the office will be confirmed by your hiring manager.

Requirements

5 or more years of relevant work experience with a Bachelor's degree, or at least 2 years of work experience with an advanced degree (e.g., Master's, MBA, JD, MD), or 0 years of work experience with a PhD
Strong hands-on experience with:
    Public cloud platforms (Azure preferred, plus AWS)
    Kubernetes at scale (AKS, EKS, or equivalent)
    Infrastructure as Code (e.g., Terraform)
    Containerized, cloud-native microservices architectures
Background in Platform Engineering, SRE, or DevOps roles supporting production systems and day-to-day platform operations
Strong understanding of:
    Observability tooling and Golden Signals concepts
    Incident management concepts and on-call operations
    Platform reliability, availability, and operational best practices
    Networking, ingress, and service discovery in cloud and Kubernetes environments
Strong collaboration and communication skills

Nice To Haves

6 or more years of work experience with a Bachelor's degree, or 4 or more years of relevant experience with an advanced degree (e.g., Master's, MBA, JD, MD), or up to 3 years of relevant experience with a PhD

Responsibilities

Operate and support core platform components, including:
    Cloud infrastructure primitives
    Kubernetes clusters and supporting services
    Networking, ingress, and service discovery
Ensure platforms meet reliability and availability expectations through proactive monitoring and maintenance.
Identify operational issues and contribute to improvements that reduce instability and recurring incidents.
Participate in on‑call rotations, acting as a responder for platform‑related incidents.
Troubleshoot production issues, perform root cause analysis, and contribute to post-incident reviews.
Maintain and improve operational runbooks, alerts, and dashboards.
Implement and maintain Infrastructure-as-Code for platform resources and environments.
Contribute to automation initiatives that reduce manual work and operational toil.
Support standardized deployment, upgrade, and rollback processes.
Assist in simplifying day‑2 operations and improving platform operability.
Contribute to efforts that reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
Follow established platform standards and best practices, providing feedback for improvement.
Work closely with other platform engineers, SREs, and application teams.
Support platform adoption by helping application teams troubleshoot and operate their workloads.
Escalate complex issues to senior engineers when needed, while learning from hands-on experience.