About The Position

We're looking for a Senior Site Reliability Engineer who approaches operational problems as engineering challenges. You won't just monitor dashboards and respond to pages: you'll help define and drive service level objectives, identify reliability risks, and work alongside engineering teams to ensure reliability and performance are first-class concerns from design through to production. Your mission is not only to keep the platform running but also to make it more reliable by default, through better practices, smarter automation, and a culture where every engineer thinks about failure modes.

Requirements

  • Fluent English, ideally at a native level.
  • Education: Bachelor's or Master's degree in Computer Science or Engineering, or equivalent practical experience.
  • Demonstrated experience applying SRE principles: SLOs/SLIs, error budgets, toil reduction, and capacity planning.
  • Experience building or significantly evolving observability and monitoring solutions (we use Prometheus, Grafana, and ELK, but we care more about your approach than your tool familiarity).
  • Experience with AWS.
  • Linux systems administration background (RHEL/CentOS).
  • Hands-on experience operating services on container orchestration platforms (Kubernetes preferred).
  • A track record of improving the reliability of production systems at scale — through better automation, observability, and process, not just firefighting.
  • Strong communication skills and the ability to influence engineering culture across teams.
  • An analytical, systems-thinking mindset — you instinctively ask "why did this fail?" and "how do we make sure it can't?"

Nice To Haves

  • Infrastructure-as-code and configuration management experience (Terraform, Ansible).
  • Strong scripting and automation skills (Bash, Python, or Go) — you're comfortable writing the glue that keeps systems healthy and eliminates repetitive work.
  • Networking fundamentals (TCP/IP, DNS, load balancing).
  • Database experience — relational (PostgreSQL, MySQL) or NoSQL (Redis).
  • Telephony domain knowledge (SIP, VoIP).
  • Familiarity with chaos engineering tools and practices.

Responsibilities

  • Act as a first responder during incidents; lead root cause analysis and blameless post-mortems.
  • Turn incident learnings into systemic improvements — better tooling, better runbooks, better architecture.
  • Provide input and guidance to squads on troubleshooting documentation and operational runbooks, ensuring they are practical and effective for production support.
  • Define, implement, and iterate on SLIs, SLOs, and error budgets to drive data-informed reliability decisions.
  • Identify and measure operational toil; build software and automation to systematically reduce it.
  • Conduct capacity planning and performance analysis to stay ahead of scaling challenges.
  • Design and evolve observability platforms (metrics, logs, traces, dashboards) that give engineering teams genuine insight into system behaviour — not just noise.
  • Continuously improve alert quality: reduce false positives, increase signal, and ensure every alert is actionable.
  • Partner with development teams to embed reliability thinking into the software delivery lifecycle — from design reviews to deployment strategies.
  • Champion practices like chaos engineering, progressive rollouts, and failure injection testing.
  • Mentor engineers across teams on reliability principles and operational best practices.
  • Join on-call rotations and continuously improve the on-call experience for yourself and others.

Benefits

  • Fixed compensation
  • Long-term employment with vacation counted in working days
  • Professional development (courses, training, etc.)
  • The chance to work on successful, cutting-edge technology products that are making a global impact in the service industry
  • Skilled colleagues who are fun to work with
  • Apple gear
© 2024 Teal Labs, Inc