Senior Site Reliability Engineer

HiveWatch, El Segundo, CA
$183,000 - $235,000

About The Position

HiveWatch is seeking a Senior Site Reliability Engineer to join our Platform Team. You'll architect and maintain the mission-critical edge infrastructure that connects our SaaS platform to customer systems, ensuring exceptional performance, reliability, and observability across our distributed environment while providing technical leadership to our growing engineering team. This role reports directly to our VP of Engineering.

Requirements

  • 7+ years of software engineering experience with strong coding skills in production environments
  • 5+ years of SRE, DevOps, or production operations experience
  • Expertise with cloud platforms (AWS preferred) and containerized applications (Docker, Kubernetes)
  • Experience with Infrastructure as Code (Terraform, CloudFormation, or similar)
  • Proficiency in at least one object-oriented programming language in our tech stack (Java, Kotlin, Python)
  • Hands-on experience with relational databases and SQL performance optimization
  • Experience with monitoring and observability tools (Prometheus, Grafana, DataDog, or equivalent)
  • Strong debugging skills across distributed systems and microservices architectures
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience

Nice To Haves

  • Experience with our tech stack: Kotlin, Rust, TypeScript, Python
  • Expertise in AWS architecture and services
  • Experience in physical security, IoT, or edge computing environments
  • Expertise with advanced AWS services (Kinesis, Lambda, EKS, RDS)
  • Experience with Terraform and Terragrunt specifically
  • Background in high-availability, multi-tenant SaaS environments
  • Experience establishing SRE practices and culture from the ground up
  • Track record of leading incident response and post-mortem processes
  • Experience mentoring and developing junior engineers
  • Knowledge of security best practices and compliance requirements
  • Experience with edge computing and distributed system architectures
  • Previous experience in a startup or high-growth environment (50-200 employees)

Responsibilities

  • Own the reliability of mission-critical systems including production monitoring, alerting, and capacity planning
  • Debug and resolve complex production issues across the full stack, from infrastructure to application code
  • Participate in a regular on-call rotation to provide 24/7 coverage for critical systems
  • Perform root cause analysis requiring deep code-level investigation and implement preventive measures
  • Build automation and tooling to reduce operational toil and improve system reliability
  • Maintain CI/CD pipelines and observability infrastructure, and drive database performance optimization
  • Increase the resiliency, scalability, and maintainability of production environments
  • Establish on-call procedures and disaster recovery processes
  • Provide technical leadership and mentorship to foster engineering excellence and reliability culture

Benefits

  • Comprehensive health coverage: medical, dental, vision, and life insurance
  • Cutting-edge work in an emerging field with huge growth potential
  • Competitive compensation packages designed to reward top talent
  • A modern, newly renovated HQ right on Main Street in El Segundo, CA
  • 401(k) with a 4% company match to help you invest in your future (match launches in 2026)
  • Flexible paid time off so you can recharge when you need it
  • Additional benefits include ClassPass credits and a discount on pet insurance
  • A family-friendly, compassionate culture that values balance and belonging