About The Position

We're seeking a Senior Site Reliability Engineer – Product Reliability to help scale, operate, and improve the reliability of our AI-powered communication platform. This role sits at the intersection of software engineering, infrastructure, operations, and product support. You'll be responsible for ensuring the stability, scalability, and performance of systems powering thousands of real-time interactions across distributed, event-driven architectures. You'll also serve as the first layer of technical investigation for production issues and product-related failures, partnering closely with engineering teams to identify root causes, improve observability, and drive long-term reliability improvements. This is a highly technical, hands-on role for someone who enjoys debugging complex systems, improving operational excellence, and building reliable infrastructure at scale.

Requirements

  • 5+ years of experience in Site Reliability Engineering, Production Engineering, Backend Engineering, or related roles
  • Strong hands-on experience with Node.js and TypeScript in production environments
  • Proven experience operating and troubleshooting distributed systems and microservices architectures
  • Experience managing production workloads on AWS, including ECS, Lambda, SQS, and API Gateway
  • Hands-on experience with Kafka, AWS SQS, or other messaging/event-streaming systems
  • Strong understanding of observability, monitoring, alerting, and incident response best practices
  • Experience debugging complex production issues across application, infrastructure, and networking layers
  • Deep understanding of system reliability concepts including concurrency, async workflows, resiliency, fault tolerance, and eventual consistency
  • Experience with MongoDB and Redis in high-scale production environments
  • Ability to analyze logs, traces, metrics, and system behavior to identify root causes efficiently
  • Strong communication skills and ability to collaborate across engineering, product, and support teams
  • Experience mentoring engineers and contributing to operational excellence initiatives

Nice To Haves

  • Kubernetes and container orchestration in production
  • Broader AWS infrastructure experience (networking, infrastructure-as-code, observability, cost optimization)
  • Experience with relational databases such as PostgreSQL
  • Experience developing load tests, resilience tests, and chaos engineering exercises
  • Prior customer support experience or direct work with customers to understand business impact

Responsibilities

  • Serve as the first line of technical investigation for production incidents, product failures, and performance issues
  • Analyze logs, traces, metrics, and system behavior to identify root causes efficiently and implement solutions
  • Partner closely with backend engineering and DevOps teams to diagnose issues impacting stability, latency, and reliability
  • Design and implement observability improvements, including monitoring, alerting, and structured logging across distributed systems
  • Establish and improve incident response processes, including escalation procedures, post-mortem analysis, and prevention of recurring incidents
  • Participate in architectural design of backend services, event-driven systems, and asynchronous messaging pipelines to ensure reliability and support disaster recovery
  • Optimize performance and resilience of systems operating under high load, powering thousands of real-time interactions
  • Develop and maintain operational documentation, runbooks, and dashboards to support production operations
  • Collaborate with product and customer support teams to understand business impact and inform prioritization
  • Mentor junior engineers on reliability best practices and resilient design principles

Benefits

  • Competitive compensation
  • Opportunities for advancement
  • Work remotely with flexible hours