Senior DevOps Engineer

Forrester Research
Cambridge, MA

About The Position

The Senior DevOps Systems Engineer will play a pivotal role in designing, securing, scaling, and operating modern cloud-native platforms with a strong emphasis on agentic AI systems, Kubernetes, Karpenter, Ray Serve, Terraform-based infrastructure-as-code, and Amazon Web Services (AWS)-centric architectures. This role demands a hands-on technical leader who thrives in a highly collaborative environment, takes ownership of complex problems, and drives them to resolution with minimal oversight. You will partner across engineering, SRE, security, product, data, and AI-focused teams to ensure system resiliency, observability, and strong security practices at every layer. This role expects a strong troubleshooting instinct, the ability to navigate a broad observability stack, and an obsession with identifying root cause rather than symptoms.

Requirements

  • Eight or more years of relevant work experience in software development or systems engineering.
  • Deep experience with AWS (EC2, EKS, IAM, VPC, networking, load balancers, S3, Lambda, RDS, MSK, Secrets Manager, etc.).
  • Experience in supporting AI/ML or agentic AI systems, especially in production environments.
  • Extensive experience with continuous integration/continuous delivery (CI/CD) tools such as Argo CD and Jenkins.
  • Experience working collaboratively with application development teams across the organization to resolve problems.
  • Strong Kubernetes proficiency: cluster operations, Karpenter, Helm, networking, and cluster security.
  • Expertise with Terraform: maintaining existing modules and developing new ones from scratch.
  • Strong troubleshooting capabilities across distributed systems, with the ability to interpret logs, metrics, and traces to rapidly identify root cause.
  • Familiarity with observability stacks (e.g., Prometheus/Grafana, CloudWatch, OpenTelemetry, Dynatrace).
  • Solid understanding of security best practices (network segmentation, IAM least privilege, secrets management, pipeline integrity, and patching).
  • Excellent written and oral communication skills, including the ability to produce and interpret technical documentation.
  • Demonstrated ability to independently lead initiatives, drive tasks to completion, and manage priorities in a fast-paced environment.
  • Ability to participate in an on-call rotation.
  • Willingness to provide mission-critical production support during outages outside business hours, if necessary.

Nice To Haves

  • Master’s degree in a technology-related field, engineering, or computer science.
  • Professional IT certifications, such as CKA, CKS, or AWS certifications.

Responsibilities

  • Design, build, maintain, and automate infrastructure supporting various platforms and technologies across the organization.
  • Implement and enforce security best practices across cloud, network, and application layers; security must be foundational, not an afterthought.
  • Ensure maximum availability and reliability of our mission-critical platforms, complying with our SLAs.
  • Drive root cause analysis using logs, traces, metrics, and dashboards across multiple observability platforms.
  • Troubleshoot complex production issues across the stack (infrastructure, network, and application), ensuring minimal downtime and rapid recovery.
  • Collaborate closely with engineering, SRE, QA, security, data/AI, and product teams.
  • Participate in routine disaster recovery/business continuity (DRBC) exercises.
  • Participate in an on-call rotation, improving incident response, runbooks, and documentation.
  • Lead initiatives with minimal oversight, clearly communicating progress, risks, and outcomes to technical and nontechnical stakeholders.