Staff Site Reliability Engineer, Azure/AWS

Visa • Austin, TX

Hybrid

About The Position

Visa is a world leader in payments technology, facilitating transactions between consumers, merchants, financial institutions and government entities across more than 200 countries and territories. We are dedicated to uplifting everyone, everywhere by being the best way to pay and be paid. At Visa, you’ll have the opportunity to create impact at scale: tackling meaningful challenges, growing your skills and seeing your contributions change lives around the world. Join Visa and do work that matters to you, to your community, and to the world. Progress starts with you.

The Staff Platform Engineer is an individual contributor within the SRE / Platform organization, responsible for operating, maintaining, and improving cloud‑native platforms that support critical workloads. This role focuses on platform reliability, operational excellence, and automation, ensuring systems are stable, scalable, and well‑run in production. The Staff Platform Engineer works primarily on Azure‑based platforms, while actively contributing to AWS environments as required by current initiatives. This role is execution‑focused, with strong involvement in day‑to‑day platform operations and continuous improvement efforts.

Requirements

This role primarily supports cloud platform operations on Microsoft Azure, with active involvement in AWS‑based environments based on current organizational needs.
The successful candidate is expected to work hands‑on with production systems and participate in operational activities while Azure initiatives continue to expand.
This is an individual contributor role with no people management responsibility.
The position includes participation in on‑call rotations and close collaboration with platform and application teams across different time zones.
U.S. Applicants Only

Responsibilities

Operate and support core platform components, including cloud infrastructure primitives; Kubernetes clusters and supporting services; and networking, ingress, and service discovery.
Ensure platforms meet reliability and availability expectations through proactive monitoring and maintenance.
Identify operational issues and contribute to improvements that reduce instability and recurring incidents.
Participate in on‑call rotations, acting as a responder for platform‑related incidents.
Troubleshoot production issues, perform root cause analysis, and contribute to post‑incident reviews.
Maintain and improve operational runbooks, alerts, and dashboards.
Implement and maintain Infrastructure‑as‑Code for platform resources and environments.
Contribute to automation initiatives that reduce manual work and operational toil.
Support standardized deployment, upgrade, and rollback processes.
Assist in simplifying day‑2 operations and improving platform operability.
Contribute to efforts that reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
Follow established platform standards and best practices, providing feedback for improvement.
Work closely with other platform engineers, SREs, and application teams.
Support platform adoption by helping application teams troubleshoot and operate their workloads.
Escalate complex issues to senior engineers when needed, while learning from hands‑on experience.