Principal Software Engineer, Site Reliability

Jobgether
16h | $195,300 - $270,400 | Remote

About The Position

This role offers the opportunity to shape and lead site reliability practices across a large-scale, AI-driven platform, ensuring systems are resilient, observable, and self-healing. You will collaborate with cross-functional teams including Product Engineering, Machine Learning, DevOps, and Development Productivity to influence both technical and operational strategies. The position emphasizes thought leadership, mentoring, and driving adoption of reliability best practices across the organization. You will design and implement frameworks for distributed tracing, real user monitoring, performance metrics, and automation to minimize downtime. This role requires hands-on technical contributions while aligning initiatives with business goals, ultimately improving engineering velocity, operational efficiency, and user experience. Operating in a remote-first environment, you will have the opportunity to lead enterprise-wide reliability initiatives while fostering a culture of operational excellence.

Requirements

  • 10+ years of combined experience in Software Engineering and Site Reliability Engineering.
  • Proven SRE thought leader with experience driving adoption of reliability best practices.
  • Strong mentoring and communication skills to influence engineers across disciplines.
  • Proficiency in Python, Go, and JavaScript/TypeScript.
  • Experience with Infrastructure as Code (Terraform, CDK, CloudFormation, etc.).
  • Hands-on experience building internal tooling in agile environments.
  • Expertise in observability, distributed tracing, real user monitoring (RUM), Largest Contentful Paint (LCP), and performance monitoring tools (e.g., Datadog, Prometheus).
  • Experience with on-call and incident management, including large-scale or ML-related incidents.
  • Strong background in automation and building self-healing systems.
  • Familiarity with LLM/GenAI tools to improve SRE efficiency and processes.
  • Program management skills to propose solutions, influence leadership, improve processes, and drive cross-functional projects.

Nice To Haves

  • Experience with service mesh, full-stack development, building or extending observability platforms, Development Productivity or Quality Platforms, and high-scale SaaS/microservices cloud environments.

Responsibilities

  • Define, advocate for, and promote Site Reliability Engineering (SRE) principles across engineering teams.
  • Partner with leadership to develop long-term strategies for reliability, resiliency, and observability.
  • Lead the implementation of distributed tracing, real user monitoring (RUM), and key performance metrics to enhance system visibility and user experience.
  • Build and scale self-healing systems to reduce manual intervention and minimize downtime.
  • Drive improvements in enterprise-wide incident response processes, including for Machine Learning systems.
  • Collaborate with Development Productivity and Quality teams to maintain engineering velocity without sacrificing reliability.
  • Influence technical and operational roadmaps through data-driven insights and hands-on contributions.
  • Own and execute cross-functional initiatives from concept through delivery, applying program management skills to align stakeholders and achieve results.

Benefits

  • Competitive base salary: $195,300 – $270,400 USD annually, plus target bonuses and equity compensation.
  • Remote-first work environment with flexibility to work from anywhere in the U.S.
  • Comprehensive medical, dental, and vision coverage with significant employer contributions.
  • Generous 401(k) plan with company matching.
  • Paid time off, sick leave, and company holidays.
  • Paid family and parental leave.
  • Family support programs including fertility, parenthood, and caregiving benefits.
  • Employee Assistance Program (EAP) for mental health and life resources.
  • Financial wellness resources and annual wellness allowance.
  • Annual productivity allowance for tools and resources to optimize work performance.
  • Opportunities to participate in team events, onsites, and employee resource groups (ERGs).