Senior DevOps Engineer

Forrester Research • Cambridge, MA


About The Position

The Senior DevOps Systems Engineer will play a pivotal role in designing, securing, scaling, and operating modern cloud-native platforms with a strong emphasis on agentic AI systems, Kubernetes, Karpenter, Ray Serve, Terraform-based infrastructure-as-code, and Amazon Web Services (AWS)-centric architectures. This role demands a hands-on technical leader who thrives in a highly collaborative environment, takes ownership of complex problems, and drives them to resolution with minimal oversight. You will partner across engineering, SRE, security, product, data, and AI-focused teams to ensure system resiliency, observability, and strong security practices at every layer. The role also calls for a strong troubleshooting instinct, the ability to navigate a broad observability stack, and an obsession with identifying root causes rather than symptoms.

Requirements

Eight or more years of relevant work experience in software development or systems engineering.
Deep experience with AWS (EC2, EKS, IAM, VPC, networking, load balancers, S3, Lambda, RDS, MSK, Secrets Manager, etc.).
Experience supporting AI/ML or agentic AI systems, especially in production environments.
Extensive experience with continuous integration/continuous delivery (CI/CD) tools such as Argo CD and Jenkins.
Experience working collaboratively with application development teams across the organization to resolve problems.
Strong Kubernetes proficiency: cluster operations, Karpenter, Helm, networking, and cluster security.
Expertise with Terraform: developing modules from scratch and maintaining existing ones.
Strong troubleshooting capabilities across distributed systems, with the ability to interpret logs, metrics, and traces to rapidly identify root cause.
Familiarity with observability stacks (e.g., Prometheus/Grafana, CloudWatch, OpenTelemetry, Dynatrace).
Solid understanding of security best practices (network segmentation, IAM least privilege, secrets management, pipeline integrity, and patching).
Excellent written and oral communication skills, including the ability to produce and interpret technical documentation.
Demonstrated ability to independently lead initiatives, drive tasks to completion, and manage priorities in a fast-paced environment.
Ability to participate in an on-call rotation.
Willingness to provide mission-critical production support during outages outside business hours when necessary.

Nice To Haves

Master’s degree in a technology-related field, engineering, or computer science (a plus).
Professional IT certifications, such as CKA, CKS, and AWS certifications (a plus).

Responsibilities

Design, build, maintain, and automate infrastructure supporting various platforms and technologies across the organization.
Implement and enforce security best practices across cloud, network, and application layers; security must be foundational, not an afterthought.
Ensure maximum availability and reliability of our mission-critical platforms, in compliance with our SLAs.
Drive root cause analysis using logs, traces, metrics, and dashboards across multiple observability platforms.
Troubleshoot complex production issues across the stack (infrastructure, network, and application), ensuring minimal downtime and rapid recovery.
Collaborate closely with engineering, SRE, QA, security, data/AI, and product teams.
Participate in routine disaster recovery/business continuity (DRBC) exercises.
Participate in an on-call rotation, improving incident response, runbooks, and documentation.
Lead initiatives with minimal oversight, clearly communicating progress, risks, and outcomes to technical and nontechnical stakeholders.
