About The Position

Every time someone taps, swipes, or clicks to pay, Visa infrastructure makes it happen in milliseconds, across 200+ countries. As a Senior SRE on the Product Reliability Engineering (PRE) team, you’ll own the reliability of critical production systems, drive automation that eliminates toil at scale, and help shape how we integrate AI into our engineering practices. This isn’t a monitoring-and-tickets role. You’ll write production code, design resilient architectures, build agentic AI tools, and lead incident response for systems that process billions of transactions. You’ll have real ownership and influence on technical direction.

Requirements

  • 2+ years of relevant work experience and a Bachelor’s degree, OR 5+ years of relevant work experience.
  • 3-5 years in SRE, DevOps, or Platform Engineering with a BS in CS/SE or equivalent experience.
  • Proficient in Python; working knowledge of Go, Java, or Bash.
  • Hands-on with IaC (Terraform, Ansible, or similar) and CI/CD pipelines.
  • Strong distributed systems understanding: failure modes, resilience patterns, capacity planning.
  • Proven incident management experience — on-call rotations, incident command, postmortem facilitation.
  • Experience with observability platforms (Prometheus, Grafana, Splunk, ELK, or Datadog). Linux/Unix fluency.
  • Genuine curiosity about GenAI and agentic systems — hands-on experience is a plus, willingness to learn is a must.

Nice To Haves

  • Cloud platforms (AWS, GCP, Azure) and container orchestration (Kubernetes).
  • Database reliability exposure: performance tuning, replication, backup/recovery, or schema change management.
  • Hands-on with AI/ML tools: LangChain, prompt engineering, or model fine-tuning.
  • SLO frameworks, error budgets, chaos engineering, or fault injection testing.
  • A GitHub profile or side project that shows us how you think and build.

Responsibilities

  • Own production reliability end-to-end: SLO definition, error budget tracking, proactive risk identification, and incident command during high-severity events.
  • Lead root cause analysis and postmortems that drive lasting systemic improvements. Mentor I4 engineers in on-call and diagnostic best practices.
  • Build production-grade automation in Python (Go/Bash where appropriate) for deployment pipelines, infrastructure provisioning, and operational workflows. Develop and maintain IaC with Terraform or Ansible.
  • Enhance CI/CD pipelines and observability: design monitoring, alerting, and dashboards that give teams real-time clarity on globally distributed systems.
  • Build GenAI-powered tools that augment incident triage, automate runbook execution, or surface predictive reliability insights. Integrate LLMs into operational workflows.
  • Bring curiosity and creative thinking to identify novel AI/ML applications the team hasn’t tried yet. We want builders who see possibilities, not just problems.

Benefits

  • Medical
  • Dental
  • Vision
  • 401(k)
  • FSA/HSA
  • Life Insurance
  • Paid Time Off
  • Wellness Program