Principal SRE | San Francisco Bay Area

Virtasant | Austin, TX
50d | Hybrid

About The Position

We are looking for a Principal-Level Site Reliability Engineer (Operations) to provide hands-on, day-to-day operational support for one of our largest global clients. This is not a leadership or people-management position; it is a senior individual contributor SRE role focused on incident response, system diagnostics, dashboard monitoring, operational maintenance, and platform reliability. You will be directly responsible for keeping critical systems healthy, resolving incidents, improving operational workflows, and working with engineering teams to maintain high reliability across large-scale distributed systems. If you are a senior SRE who prefers solving problems hands-on in the system over managing teams or driving strategy, this is the right role.

Requirements

  • 5–10+ years in SRE, Production Operations, or Infrastructure Engineering roles.
  • Strong hands-on experience troubleshooting distributed systems in production.
  • Proficiency in Linux fundamentals, including process management, networking, storage, and diagnostics.
  • Solid understanding of cloud-native architectures, containers, and modern infrastructure tooling.
  • Experience with:
      • Monitoring and observability tools (e.g., Prometheus, Grafana, ELK, Datadog)
      • Incident management workflows
      • Root-cause analysis / postmortems
      • CI/CD operational processes
  • Strong Linux debugging and performance troubleshooting skills.
  • Familiarity with Kubernetes, containers, or cloud-native runtime environments.
  • Ability to write or modify scripts (Python, Bash, or similar) for operational automation.
  • Hands-on experience with logs, metrics, traces, and alert lifecycle management.
  • Calm, structured decision-making under pressure.
  • Excellent communication — clear, concise, and reliable.
  • Strong attention to detail and consistency in documentation.
  • A proactive, ownership-driven mindset for reliability and operations.

Responsibilities

  • Monitor dashboards, alerts, and system health in real time.
  • Respond to incidents quickly and decisively, driving issues to resolution.
  • Perform root-cause analysis and contribute to post-incident reviews.
  • Troubleshoot complex system and infrastructure issues across distributed environments.
  • Maintain and improve runbooks, playbooks, and operational documentation.
  • Support and enhance the observability tooling used for metrics, logs, and alerting.
  • Work cross-functionally with engineering teams to escalate system-level issues when required.
  • Run routine operational checks to ensure platform stability.
  • Tune alerts, update dashboards, and ensure monitoring accuracy.
  • Identify recurring operational issues and recommend improvements.
  • Implement small automation and scripting solutions to improve operational workflows.
  • Keep services running smoothly through proactive maintenance.
  • Partner with Engineering, SRE, and Product teams to ensure transparent communication during incidents.
  • Provide clear, concise updates and documentation for operational work.
  • Participate in shift patterns or rotational incident coverage depending on client needs.

Benefits

  • Build and lead a new SRE-focused customer success function from day one.
  • Work at the intersection of reliability engineering, customer engagement, and cloud transformation.
  • Partner with global enterprises on cutting-edge cloud and DevOps programs.
  • Join a global, remote-first consultancy with 4,000+ experts across 130 countries.
  • Thrive in a culture that values autonomy, agility, and innovation.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Education Level: No Education Listed
  • Number of Employees: 51-100 employees
