Site Reliability Engineer

Cinemark • Plano, TX

About The Position

This position contributes to the design, implementation, and continuous improvement of Cinemark’s reliability engineering practices across cloud-native and distributed systems. The ideal candidate has a strong technical foundation in software engineering and infrastructure automation, with hands-on experience building and maintaining scalable, fault-tolerant systems. A deep understanding of cloud platforms (e.g., Azure, AWS), container orchestration (e.g., Kubernetes), and infrastructure-as-code tools (e.g., Terraform, Ansible) is essential. Familiarity with observability stacks, incident response workflows, and performance tuning is expected. The person in this role will actively participate in Agile Scrum ceremonies and collaborate with development teams to embed reliability into the software delivery lifecycle. The ideal candidate is comfortable working in a fast-paced environment, contributing to architectural decisions, and driving initiatives around service-level objectives (SLOs), error budgets, and operational excellence.

Requirements

Strong understanding of distributed systems and cloud-native architectures.
Proficiency in one or more programming languages (e.g., Python, Go, Java, C#).
Experience with CI/CD pipelines (e.g., Azure DevOps) and infrastructure as code (e.g., Terraform, Ansible).
Familiarity with observability tools (e.g., Prometheus, Grafana, ELK, Datadog).
Ability to perform root cause analysis and drive incident resolution.
Experience with container orchestration (e.g., Kubernetes).
Knowledge of SLIs, SLOs, and error budgets.
Background in performance tuning, capacity planning, and high-availability systems.
Comfort working in cross-functional teams and communicating technical concepts clearly.

Responsibilities

Review system health dashboards and overnight alerts to identify anomalies or trends.
Participate in daily stand-ups with development and infrastructure teams to align on priorities.
Investigate and resolve production incidents, performing root cause analysis and documenting findings.
Collaborate with developers to improve CI/CD pipelines and reduce deployment risks.
Write automation scripts to eliminate manual operational tasks and improve system reliability.
Define and refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets.
Conduct reliability reviews for upcoming releases or infrastructure changes.
Lead or contribute to post-incident reviews and share learnings across teams.
Update internal documentation and plan reliability-focused improvements for upcoming sprints.