Cloud Reliability Engineer

Versant • Englewood Cliffs, NJ

3d • $135,000 - $165,000 • Hybrid

About The Position

The Cloud Reliability Engineer is responsible for ensuring the availability, performance, scalability, and operational excellence of VERSANT’s cloud platforms and services. This role works closely with cloud engineering, application development, networking, security, and operations teams to build and maintain highly reliable systems across a large multi-account AWS environment. The engineer will leverage automation, observability, and reliability engineering practices to improve platform resilience, reduce operational risk, and enhance the customer experience.

As a leading media company, VERSANT operates digital products, streaming platforms, content delivery systems, and media workflows that demand high levels of uptime and performance. The Cloud Reliability Engineer will help ensure these services remain resilient, scalable, and operationally mature.

The ideal candidate has strong experience with AWS, monitoring and observability platforms, incident management, automation, infrastructure as code, and operational best practices. Experience with AWS Organizations, Control Tower, Identity Center, Terraform, and modern cloud operations tooling is highly desirable.

Requirements

Bachelor’s degree in Computer Science, Engineering, Information Systems, or equivalent practical experience.
3–7 years of experience in Site Reliability Engineering, Cloud Engineering, DevOps, Infrastructure Engineering, or related roles.
Strong hands-on experience with AWS cloud services and enterprise-scale AWS environments.
Experience with monitoring and observability platforms.
Experience with incident management and root cause analysis.
Experience with operational troubleshooting and performance tuning.
Experience with AWS Organizations, Control Tower, and Identity Center.
Experience with Infrastructure as Code tools: Terraform and CloudFormation.
Experience with CI/CD platforms and deployment automation.
Experience with scripting and automation using Python, PowerShell, Bash, or similar languages.
Strong understanding of AWS networking, resiliency, and cloud architecture concepts.
Experience with logging, metrics, tracing, and alerting technologies.
Strong troubleshooting, communication, and collaboration skills.

Nice To Haves

Experience with AWS Organizations, Control Tower, Identity Center, Terraform, and modern cloud operations tooling is highly desirable.

Responsibilities

Design, implement, and maintain reliability practices for cloud infrastructure and platform services.
Define and monitor service-level objectives (SLOs), service-level indicators (SLIs), and operational metrics.
Identify reliability risks and implement solutions that improve availability, scalability, and resilience.
Drive continuous improvement initiatives focused on operational excellence and system stability.
Design and maintain monitoring, logging, alerting, and observability solutions across AWS environments.
Develop dashboards and reporting that provide visibility into platform health and performance.
Analyze system behavior, identify bottlenecks, and implement performance improvements.
Establish proactive monitoring practices that detect issues before they impact customers.
Participate in incident response, troubleshooting, and root cause analysis activities.
Lead post-incident reviews and identify corrective actions to prevent recurrence.
Improve operational processes, runbooks, and recovery procedures.
Support disaster recovery and business continuity initiatives.
Support the reliability and operational health of large-scale AWS environments built on AWS Organizations, Control Tower, and Identity Center.
Partner with cloud engineering teams to improve platform architecture, resiliency, and operational consistency.
Assist in maintaining secure, scalable, and highly available cloud services.
Develop automation that reduces operational toil and improves system reliability.
Support infrastructure-as-code solutions using Terraform, CloudFormation, and related technologies.
Automate operational workflows, monitoring, remediation, and recovery activities.
Contribute to CI/CD pipelines and deployment automation initiatives.
Support the reliability of streaming platforms, content delivery systems, media workflows, APIs, and customer-facing applications.
Collaborate with engineering teams to improve application reliability and operational readiness.
Assist in capacity planning and scaling efforts for high-traffic events and media workloads.
Partner with cloud, networking, security, and application teams to identify and address operational risks.
Promote reliability engineering best practices throughout the organization.
Contribute to documentation, standards, and operational procedures.
Evaluate emerging technologies and recommend improvements to platform reliability and observability.