Senior Site Reliability Engineer

HiveWatch • El Segundo, CA

20d • $183,000 - $235,000

About The Position

HiveWatch is seeking a Senior Site Reliability Engineer to join our Platform Team, where you'll architect and maintain the mission-critical edge infrastructure that connects our SaaS platform to customer systems. You'll ensure exceptional performance, reliability, and observability across our distributed environment while providing technical leadership to our growing engineering team. This role reports directly to our VP of Engineering.

Requirements

7+ years of software engineering experience with strong coding skills in production environments
5+ years of SRE, DevOps, or production operations experience
Expertise with cloud platforms (AWS preferred) and containerized applications (Docker, Kubernetes)
Experience with Infrastructure as Code (Terraform, CloudFormation, or similar)
Proficiency in at least one object-oriented programming language in our tech stack (Java, Kotlin, Python)
Hands-on experience with relational databases and SQL performance optimization
Experience with monitoring and observability tools (Prometheus, Grafana, DataDog, or equivalent)
Strong debugging skills across distributed systems and microservices architectures
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience

Nice To Haves

Experience with our tech stack: Kotlin, Rust, TypeScript, Python
Expertise in AWS architecture and advanced AWS services (Kinesis, Lambda, EKS, RDS)
Experience in physical security, IoT, or edge computing environments
Experience with Terraform and Terragrunt specifically
Background in high-availability, multi-tenant SaaS environments
Experience establishing SRE practices and culture from the ground up
Track record of leading incident response and post-mortem processes
Experience mentoring and developing junior engineers
Knowledge of security best practices and compliance requirements
Experience with edge computing and distributed system architectures
Previous experience in a startup or high-growth environment (50-200 employees)

Responsibilities

Own the reliability of mission-critical systems including production monitoring, alerting, and capacity planning
Debug and resolve complex production issues across the full stack, from infrastructure to application code
Participate in a regular on-call rotation to provide 24/7 coverage for critical systems
Perform root cause analysis requiring deep code-level investigation and implement preventive measures
Build automation and tooling to reduce operational toil and improve system reliability
Maintain CI/CD pipelines and observability infrastructure, and optimize database performance
Increase the resiliency, scalability, and maintainability of production environments
Establish on-call procedures and disaster recovery processes
Provide technical leadership and mentorship to foster engineering excellence and reliability culture

Benefits

Comprehensive health coverage: medical, dental, vision, and life insurance
Cutting-edge work in an emerging field with huge growth potential
Competitive compensation packages designed to reward top talent
A modern, newly renovated HQ right on Main Street in El Segundo, CA
401(k) with a 4% company match to help you invest in your future (match launches in 2026)
Flexible paid time off so you can recharge when you need it
Additional benefits include ClassPass credits and a discount on pet insurance
A family-friendly, compassionate culture that values balance and belonging
