Senior Site Reliability Engineer

Omilia

About The Position

We're looking for a Senior Site Reliability Engineer who approaches operational problems as engineering challenges. You won't just monitor dashboards and respond to pages — you'll help define and drive service level objectives, identify reliability risks, and work alongside engineering teams to ensure reliability and performance are first-class concerns from design through to production. Your mission is not only to keep the platform running but also to make the platform more reliable by default — through better practices, smarter automation, and a culture where every engineer thinks about failure modes.

Requirements

Fluent English, ideally at a native level.
Education: Bachelor's or Master's in Computer Science, Engineering, or equivalent practical experience.
Demonstrated experience applying SRE principles: SLOs/SLIs, error budgets, toil reduction, and capacity planning.
Experience building or significantly evolving observability and monitoring solutions (we use Prometheus, Grafana, and ELK, but we care more about your approach than your tool familiarity).
Experience with AWS.
Linux systems administration background (RHEL/CentOS).
Hands-on experience operating services on container orchestration platforms (Kubernetes preferred).
A track record of improving the reliability of production systems at scale — through better automation, observability, and process, not just firefighting.
Strong communication skills and the ability to influence engineering culture across teams.
An analytical, systems-thinking mindset — you instinctively ask "why did this fail?" and "how do we make sure it can't?"

Nice To Haves

Infrastructure-as-code and configuration management experience (Terraform, Ansible).
Strong scripting and automation skills (Bash, Python, or Go) — you're comfortable writing the glue that keeps systems healthy and eliminates repetitive work.
Networking fundamentals (TCP/IP, DNS, load balancing).
Database experience — relational (PostgreSQL, MySQL) or NoSQL (Redis).
Telephony domain knowledge (SIP, VoIP).
Familiarity with chaos engineering tools and practices.

Responsibilities

Act as a first responder during incidents; lead root cause analysis and blameless post-mortems.
Turn incident learnings into systemic improvements — better tooling, better runbooks, better architecture.
Provide input and guidance to squads on troubleshooting documentation and operational runbooks, ensuring they are practical and effective for production support.
Define, implement, and iterate on SLIs, SLOs, and error budgets to drive data-informed reliability decisions.
Identify and measure operational toil; build software and automation to systematically reduce it.
Conduct capacity planning and performance analysis to stay ahead of scaling challenges.
Design and evolve observability platforms (metrics, logs, traces, dashboards) that give engineering teams genuine insight into system behaviour — not just noise.
Continuously improve alert quality: reduce false positives, increase signal, and ensure every alert is actionable.
Partner with development teams to embed reliability thinking into the software delivery lifecycle — from design reviews to deployment strategies.
Champion practices like chaos engineering, progressive rollouts, and failure injection testing.
Mentor engineers across teams on reliability principles and operational best practices.
Join on-call rotations and continuously improve the on-call experience for yourself and others.

Benefits

Fixed compensation
Long-term employment with vacation counted in working days
Professional development and growth (courses, training, etc.)
Working on successful, cutting-edge technology products that are making a global impact in the service industry
Proficient and fun-to-work-with colleagues
Apple gear
