Associate Site Reliability Engineer

Visa • Austin, TX

Hybrid

About The Position

The Associate Site Reliability Engineer (SRE I) is a key contributor within the Product Reliability Engineering (PRE) organization, supporting Risk and Identity Products. As part of the PRE team, you will be responsible for availability, performance, efficiency, change management, monitoring, and emergency response for production systems. This role emphasizes operational excellence, reliability engineering fundamentals, and automation-first thinking. You will participate in architectural and operational reviews, identify design and reliability gaps, detect and remediate issues, perform root cause analysis, and contribute practical, production-ready technical solutions that improve system availability and resilience. You will also have opportunities to help design and enhance cutting-edge Agentic AI solutions to reduce toil, improve incident response, and strengthen resilience across enterprise systems.

Requirements

Bachelor's degree, or 3+ years of relevant work experience.
Hands-on experience with Linux/Unix administration and basic network troubleshooting via CLI.
Experience supporting and debugging distributed enterprise applications (e.g., Java/Tomcat).
Familiarity with logging, monitoring, and observability tools at scale.
Understanding of HTTP/S, SSL/TLS, DNS, and core web/network security fundamentals.
Foundational experience with at least one scripting or programming language commonly used in SRE workflows (Python, Bash, Go, or Ruby) to support automation and debugging.
Bachelor’s degree in Computer Science, IT, Information Systems, or Engineering.
Two or more years of experience with one or more programming languages (e.g., shell scripting, Java, JavaScript).
Understanding of ITIL processes (Incident, Problem, Change).
Exposure to CI/CD pipelines, automation frameworks, and containerized or cloud-based environments.
Familiarity with AI-assisted tooling or interest in applying automation and intelligent tooling to operational challenges.
Strong analytical and diagnostic skills, including structured root cause analysis.
Exposure to GenAI frameworks and libraries—such as LangChain, CrewAI, or OpenAI-compatible APIs—to develop and integrate AI capabilities.

Responsibilities

Provide 24x7x365 production support across multiple systems and technologies, including participation in a rotational on-call schedule.
Respond to incidents, perform triage and mitigation, and follow documented escalation and recovery procedures.
Apply configuration updates, break-fix changes, and proactive maintenance activities to maintain system availability and performance.
Perform root cause analysis and contribute to post-incident follow-ups and corrective actions.
Provide direct support during production deployments and release activities to minimize disruption, reduce risk, and enable safe rollbacks.
Validate system health through pre- and post-change checks, monitoring for deviations from established baselines.
Follow ITIL-aligned incident, problem, and change management processes.
Maintain and enhance monitoring, dashboards, alerts, and runbooks to support uptime and rapid detection of issues.
Tune alerts to reduce noise and ensure they are actionable and aligned with service impact.
Monitor system health using the four golden signals: latency, traffic, errors, and saturation.
Design, develop, and maintain automation and tooling to reduce operational toil and improve efficiency.
Contribute code using sound engineering practices, including version control, testing, documentation, and CI/CD workflows.
Use scripting or programming languages (e.g., Python, Bash) to support debugging, automation, and operational tooling.
Implement small, well-scoped automation components (e.g., Python-based utilities) that integrate with enterprise systems.
Collaborate closely with development, platform, DevSecOps, and support teams for issue triage, resolution, and operational readiness.
Participate in functional and technical meetings throughout the SDLC to ensure systems are operable, observable, and supportable.
Partner with DevSecOps to ensure new applications meet security, high-availability, and operational handoff standards.
Work hands-on with Linux/Unix systems, distributed applications, and containerized environments.
Remediate cybersecurity findings and ensure systems meet audit, compliance, and regulatory requirements.
Execute L2 support activities using a follow-the-sun operational model.