Sr. DevOps Platform Engineer

Berkley • Wilmington, DE

About The Position

As a Senior DevOps Platform Engineer, you will play a critical role in ensuring the reliability, scalability, security, and performance of Berkley’s software systems. You will collaborate closely with product engineering, infrastructure, and architecture teams to build, mature, and operate an enterprise DevOps platform that enables teams to deliver software safely, efficiently, and at scale. This role blends DevOps platform engineering and SRE practices, with a focus on CI/CD, observability, automation, and reliability across both cloud and on-premises environments. You will also maintain a strong understanding of the entire technology stack (networking, storage, OS, virtualization, databases, development frameworks, and applications) in order to design, observe, troubleshoot, and automate systems across the Berkley environment.

Requirements

5+ years of experience in DevOps and Site Reliability Engineering, with hands-on ownership of infrastructure, CI/CD platforms, and software delivery in enterprise environments.
Strong software engineering and automation skills, including proficiency in Python, Go, Bash, or JavaScript, and experience building production-grade automation.
Proven expertise in enterprise CI/CD, GitOps, and containerized platforms, including Kubernetes, Helm, and cloud-native delivery patterns.
Deep experience with reliability and observability, including monitoring, alerting, logging, and tracing platforms (e.g., Dynatrace, Datadog, ELK), and defining SLIs, SLOs, and reliability metrics.
Strong understanding of cloud, on-prem, and hybrid architectures, including high availability, disaster recovery, capacity planning, and scalability.
Hands-on experience with infrastructure as code and configuration management (e.g., Terraform, Ansible, GitHub Actions) to reduce operational toil and enable self-service.
Solid knowledge of security and networking fundamentals, including applying industry-standard security frameworks in enterprise environments.
Demonstrated ability to lead technical initiatives, influence system design decisions, mentor engineers, and collaborate effectively across product, engineering, infrastructure, and security teams.
Bachelor’s degree with an emphasis in a related field, or equivalent experience.

Responsibilities

Design, build, and mature enterprise CI/CD pipelines and shared DevOps platform services, enabling secure, reliable, and scalable software delivery for multiple teams.
Define, implement, and track reliability and observability OKRs, including SLIs and SLOs, to guide reliability engineering, deployment practices, and operational decision-making.
Implement and evolve monitoring, alerting, and observability solutions, including AIOps capabilities, to proactively assess system health, detect anomalies, enable self-healing, and support rapid incident response.
Drive automation initiatives to eliminate operational toil, streamline platform and pipeline workflows, reduce manual intervention, and improve efficiency for product engineering and SRE teams.
Identify and address performance, scalability, and reliability bottlenecks across applications, infrastructure, and delivery pipelines to improve system efficiency and user experience.
Partner with incident management and operations teams to respond to, resolve, and prevent system outages or degradation, minimizing downtime and customer impact.
Collaborate actively with development, operations, and platform teams to embed resiliency, observability, security, and reliability requirements into system design, CI/CD pipelines, and runtime environments.
Lead cross-functional coordination with product, development, infrastructure, and architecture teams to perform capacity planning, anticipate growth, and ensure systems scale reliably with business demand.
Continuously improve platform resilience by identifying and closing gaps in architecture, tooling, processes, and operational practices.
Modernize and strengthen disaster recovery capabilities for both on-premises and cloud-based Berkley solutions, ensuring recoverability, resilience, and compliance with enterprise standards.