Senior, Cloud Ops

Forterro • Ashburn, VA

About The Position

We are looking for a Site Reliability Engineer to join our international Cloud Operations team. In this role, you will help design, deploy, and operate our customer-facing SaaS platforms, ensuring high availability, strong performance, and a high level of automation across multiple product lines. Working closely with Platform Engineering and Development teams, you will contribute to building scalable architectures, improving reliability practices, and evolving our global cloud environment on AWS. This role requires solid experience in cloud engineering, automation, observability, and modern SRE methodologies.

Requirements

3–7+ years in SRE, DevOps, CloudOps, or cloud engineering roles.
Strong background working with AWS services and SaaS architectures.
Experience managing reliability metrics and applying SRE principles in production environments.
Proficiency with AWS (networking, compute, storage, IAM, multi-account environments).
Strong understanding of containers and Kubernetes (EKS preferred).
Experience with Terraform, Git, CI/CD, ArgoCD, and Infrastructure-as-Code practices.
Scripting skills (Python, Bash/PowerShell), fluency with YAML, and experience with tools like Crossplane or Ansible.
Solid experience with observability stacks (Grafana, Prometheus, Loki, Datadog, OpenTelemetry).
Good knowledge of system design, troubleshooting, and performance analysis.
Clear communicator with strong organizational skills.
Ability to simplify complex problems and propose pragmatic solutions.
Comfortable working in cross-functional, international teams.
Familiarity with Agile methodologies.

Responsibilities

You design, deploy, and operate our SaaS platforms on AWS.
You work with Kubernetes, Terraform, Crossplane, and GitOps practices to automate infrastructure, streamline deployments, and improve platform scalability.
You develop and maintain ArgoCD pipelines and reusable automation assets, and contribute to the definition of best practices for cloud delivery.
You establish and manage monitoring and observability across our environments using tools such as Prometheus, Grafana, Loki, OpenTelemetry, and Datadog.
You define and track SLIs/SLOs, manage error budgets, and actively work to improve reliability, resilience, and performance through testing and continuous optimization.
You investigate and resolve system, application, and network issues, ensuring timely recovery and minimal impact.
You participate in the on-call rotation, lead post-incident reviews, and contribute to improving operational processes and troubleshooting practices across teams.
You ensure that platforms adhere to security and compliance standards, support architectural decisions for new workloads, and contribute to AWS cost management efforts using FinOps principles.
You work closely with development, platform, and customer-facing teams, supporting deployments and contributing to cross-team initiatives.
You help maintain documentation, mentor CloudOps engineers, and evaluate technologies that can strengthen our cloud platforms.